diff --git a/AI/Day5/README.md b/AI/Day5/README.md new file mode 100644 index 0000000..112c7c0 --- /dev/null +++ b/AI/Day5/README.md @@ -0,0 +1,32 @@ +# ~ PoC AI Pool 2024 ~ + +- ## Day 5: NLP and GNNs + + - ### Module 1: Natural Language Processing + + - **Notebook:** [`nlp.ipynb`](./nlp.ipynb) + + - ### Module 3: GNNs + - [**Write-up**](./gnn/README.md) + +--- + +**The finish line is near !** + +On today's menu, we'll explore the fields of natural language processing and graph neural networks. + +> Here's a list of resources that we believe can be useful to follow along (and that we've ourselves used to learn these topics before being able to write the subjects): + +## Module 1 + +- [Introduction to Natural Language Processing - Data Science Dojo](https://youtube.com/watch?v=s5zuplW8ua8) + +## Module 2 + +## Module 3 + +- [Maxime Labonne](https://mlabonne.github.io/blog/) + - [Hands-On GNNs Using Python](https://mlabonne.github.io/blog/book.html) + - [GNN articles](https://mlabonne.github.io/blog/posts/2022_02_20_Graph_Convolution_Network.html) +- [distil.pub](https://distill.pub/) + - [A Gentle Introduction to Graph Neural Networks](https://distill.pub/2021/gnn-intro/) diff --git a/AI/Day5/gnn/README.md b/AI/Day5/gnn/README.md new file mode 100644 index 0000000..20aaff2 --- /dev/null +++ b/AI/Day5/gnn/README.md @@ -0,0 +1,170 @@ +# ~ PoC AI Pool 2024 ~ + +- ## Day 5: GNNs, NLP, and more + +--- + +One week to learn Deep Learning is absolutely not enough, and while we tried to go as deep as possible and take our time for each subject, the truth is we could dedicate the entire week to just Convolutions or just Reinforcement Learning and we still wouldn't even scratch the surface ! + +The takeaway is, you should definitely take the time to review each subject by yourself after the Pool, if you're passionate about machine learning ! + +This field is highly theoretical, so rushing it isn't going to suffice. Our purpose for this Pool is simply to give you a taste of what AI is about and point to which direction you should go if you decide to go **deeper**. + +> There wasn't much space left to do a whole module on **Graph Neural Networks** but they are still a very interesting (and trending) subject in Deep Learning that we'd like you to know about, so here's a little introduction to **GNNs** to get you started. Feel free to use this as a starting point for your _Rush_ if you want to ! + +--- + +- ### Module 3: Graph Neural Networks + +--- + +#### A. Properties of graph datasets + +![graph](graph.svg) + +> Graphs are composed of nodes and the edges between them. + +![graph](graph2.svg) + +> The edges can be directed or undirected. + +![graph](graph3.svg) + +> There can be loops (a node having an edge to itself) or isolated nodes (a node having 0 edges). + +--- + +One way to represent the information in graph datasets is by using **adjacency matrices**, or $A$. + +![graph](graph4.svg) + +The adjacency matrix $A$ for the above graph, for instance, is : + +
+ +| | A | B | C | +| ----- | --- | --- | --- | +| **A** | 0 | 1 | 1 | +| **B** | 0 | 0 | 0 | +| **C** | 1 | 0 | 1 | + +
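If it helps to see this in code, here's a tiny `numpy` sketch of that same adjacency matrix (just an illustration, with the nodes ordered A, B, C):

```python
import numpy as np

# One row and one column per node, in the order A, B, C.
# A[i][j] == 1 means "there is an edge going from node i to node j".
A = np.array([
    [0, 1, 1],  # A -> B and A -> C
    [0, 0, 0],  # B has no edges
    [1, 0, 1],  # C -> A, plus C's loop onto itself
])
```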
+ +We might want to know how many neighbours each node has.\ +We can use the degree matrix, $D$, which looks like this : + +
+ +| | A | B | C | +| ----- | --- | --- | --- | +| **A** | _2_ | 0 | 0 | +| **B** | 0 | _0_ | 0 | +| **C** | 0 | 0 | _2_ | + +
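Continuing the little `numpy` sketch from above, one way to obtain the degree matrix is to sum each row of the adjacency matrix:

```python
# A node's degree is the number of edges it has, i.e. the sum of its row in A.
degrees = A.sum(axis=1)  # array([2, 0, 2])
D = np.diag(degrees)     # the degree matrix shown in the table above
```

Note that an isolated node like `B` gives a 0 on the diagonal, so $D$ couldn't be inverted as-is, which is one more reason the GCN layer further down works with $\tilde{D}$, the degree matrix obtained after adding self-loops.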
+ +--- + +But nodes can also have features, which is what we might want to use to perform classification tasks or solve other problems with graph datasets. + +You can find a lot of cool visualisations (some are even interactive) [here](https://distill.pub/2021/gnn-intro/) ! This article (actually, [the entire blog](https://distill.pub/)) is in fact a very good, detailed introduction to GNNs, so feel free to read it in its entirety if you have the time ! + +--- + +#### B. Difference between GNNs and regular neural networks + +If we only cared about the nodes' _features_, neural networks on graph datasets would be the same as any regular MLP (multi-layer perceptron, or a regular NN with Linear transformations: $wx+b$). + +What's different between graph and tabular datasets is the information of which nodes are linked to which other nodes, and in what way. + +You can draw a parallel to computer vision : you can achieve okay results on MNIST with a simple MLP, but more advanced problems require more specific solutions, which is why we came up with Convolutional Neural Networks, which take into account patterns, zones and which pixels are close to each other. All this information leads to far better results. + +In the same way, GNNs will take into account the edges between the nodes in order to produce more interesting results. + +So if the formula for a linear layer without biases is +$$ y = xw $$ +you can write the formula for a very basic graph linear layer as +$$ y = Axw $$ +with $A$ being the **adjacency matrix** containing the links between all the nodes in the graph. + +Simply by modifying the MLP formula and introducing topological information through an adjacency matrix, we can achieve better results and have ourselves a very basic graph neural network. + +In [Maxime Labonne's](https://mlabonne.github.io/blog/) fantastic [book on Graph Neural Networks](https://mlabonne.github.io/blog/book.html) (which we recommend, as it is one of the main sources that helped us write this short introduction to GNNs, alongside his [blog](https://mlabonne.github.io/blog/posts/2022_02_20_Graph_Convolution_Network.html), which contains various Jupyter Notebooks on GNNs that we absolutely encourage you to go through), the author draws a comparison between an MLP and this basic implementation of a GNN by training both models on graph datasets. + +The results speak for themselves: `53.47%` accuracy using an MLP compared to `74.98%` for the basic GNN. + +#### C. Refining the architecture : Graph Convolutional Network (GCN) + +We kept referring to the graph linear layer formula as "_basic_". + +The reason is that it assumes all nodes have the same number of edges. If one node is a neighbour to every node in the graph, while the other nodes only have that one core node as their neighbour, its **embedding** (the value obtained for each individual node) would have greater values than the other embeddings. + +So by combining our adjacency matrix $A$, which tells us which nodes are connected, with our degree matrix $D$, which tells us how many connections each node has, we obtain the following formula for a **GCN** (Graph Convolutional Network) layer : + +$$ \tilde{D}^{-\frac{1}{2}}\tilde{A}^T\tilde{D}^{-\frac{1}{2}}XW^T $$ + +> This might just be the most satanic-looking formula you've seen all week (and if you inspect the _LaTeX_ notation for this formula, it's even worse), but I assure you, it's actually really simple : +> +> - Our base formula is $AXW$, right ?
An MLP combined with our adjacency matrix to take into account topological information. +> - We want to also take into account the differences in neighbour count between our nodes, so we add the inverse of our degree matrix $D$, $D^{-1}$, to normalize our features. +> - This gives us $D^{-1}AXW$. +> - But adding $D^{-1}$ at the beginning of our series of matrix multiplications would only normalize each row. So to normalize all features we can use $D^{-\frac{1}{2}}AD^{-\frac{1}{2}}XW$. +> - We've simplified the formulas by removing the $^T$ transpose notation, but because of matrix multiplication rules, we need to take the transposes of these matrices, giving us $D^{-\frac{1}{2}}A^TD^{-\frac{1}{2}}XW^T$. +> - Now all that's left between us and GCNs is the weird little ~ above our adjacency and degree matrices. +> - So far, by multiplying our data $X$ with our adjacency matrix $A$, we are only looking at our nodes' neighbours, not at the actual nodes themselves. +> - So to add the information of the particular nodes to our architecture, we call $\tilde{A}$ the adjacency matrix with the added notion of loops, meaning nodes have edges pointing to themselves, just like node `C` does by default in our earlier representations. +> - This gives us the $\tilde{D}^{-\frac{1}{2}}\tilde{A}^T\tilde{D}^{-\frac{1}{2}}XW^T$ formula for GCN layers. + +For each individual node $i$, writing $\mathcal{N}_i$ for its set of neighbours (which includes $i$ itself thanks to the self-loops), the formula for its embedding looks like this : + +$$ embedding_i = \sum_{j \in \mathcal{N}_i}{\frac{1}{\sqrt{deg(i)}\sqrt{deg(j)}}}x_jW^T $$ + +Using the same dataset as earlier to compare the results of basic GNNs with GCNs, the author achieves `80.17%` accuracy compared to the previous `74.98%`. The **standard deviation**, which represents how much the results vary from training to training, is also greatly reduced: from more or less `1.50%` to more or less `0.61%` !
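If you'd like to see the formula as code, here's a minimal, dense PyTorch sketch of such a layer (just an illustration of the maths above, not the book's implementation; for an undirected graph $\tilde{A}$ is symmetric, so the transpose can be dropped):

```python
import torch
import torch.nn as nn

class VanillaGCNLayer(nn.Module):
    """A single GCN layer computed with dense matrices (fine for small graphs)."""

    def __init__(self, in_features, out_features):
        super().__init__()
        # nn.Linear computes x @ W^T, which is the XW^T part of the formula
        self.linear = nn.Linear(in_features, out_features, bias=False)

    def forward(self, x, adjacency):
        # A~ : add self-loops so each node also looks at its own features
        a_tilde = adjacency + torch.eye(adjacency.size(0))
        # D~^(-1/2) : inverse square root of the degree matrix of A~
        d_inv_sqrt = torch.diag(a_tilde.sum(dim=1).pow(-0.5))
        # D~^(-1/2) A~ D~^(-1/2) X, followed by the linear transformation
        return self.linear(d_inv_sqrt @ a_tilde @ d_inv_sqrt @ x)
```

Real-world implementations (PyTorch Geometric's `GCNConv`, for instance) use sparse edge lists rather than dense matrices, but the idea is exactly the same.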
+ +![output](output.png) + +
+ +--- + +#### D. Applications of GNNs + +There are many applications of GNNs, such as graph-, node- or edge-level classifications. For example, using a dataset such as [Zachary's Karate Club](https://en.wikipedia.org/wiki/Zachary%27s_karate_club) : + +![alt text](image.png) + +This is the MNIST of Graph Neural Networks. A simple, efficient dataset which provides a fantastic introduction to the possibilities of GNNs.\ +The dataset represents a social network, each node being a member of a karate club which is separated in two factions. + +> You could attempt to classify whether a node belongs to one faction or the other using a GCN in PyTorch !\ +> Here's a link to [an article which guides you through this task](https://mlabonne.github.io/blog/posts/2022_02_20_Graph_Convolution_Network.html) but you should also give it a try by yourself first by looking at the official PyTorch Geometric [getting started article](https://pytorch-geometric.readthedocs.io/en/latest/get_started/introduction.html) ! + +You could also tackle link prediction, which means predicting whether two nodes should be connected or not. + +#### E. LightGCN and recommender systems + +**LightGCN** is an alternate version of GCN which is used for building recommender systems. +Here's a link to an [implementation of this model](https://github.com/PacktPublishing/Hands-On-Graph-Neural-Networks-Using-Python/blob/main/Chapter17/chapter17.ipynb). + +> Try to understand it by looking at other resources you can find on the subject and try to adapt the code in this notebook by using it on another graph dataset, such as [the Movies Dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset), for example.\ +> This task could be really fun if you liked data science and `pandas` : since most of the Deep Learning aspect is already done, you can focus on adapting the model to the new data, and to do that you will need to use some of your **data science** skills to make sure you understand the data and _manipulate_ it to suit your needs ! + +--- + +
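Before you head off, here's a rough sketch of what the Karate Club node-classification task mentioned above could look like with PyTorch Geometric; it's only one possible starting point (note that PyG's built-in `KarateClub` dataset labels the members by community, so we simply read the number of classes from the dataset):

```python
import torch
from torch_geometric.datasets import KarateClub
from torch_geometric.nn import GCNConv

dataset = KarateClub()
data = dataset[0]  # a single graph: node features, edge_index, labels, train_mask

class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(dataset.num_features, 16)
        self.conv2 = GCNConv(16, dataset.num_classes)

    def forward(self, x, edge_index):
        h = self.conv1(x, edge_index).relu()
        return self.conv2(h, edge_index)

model = GCN()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(100):
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    # only the handful of labelled nodes contribute to the loss
    loss = criterion(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

pred = model(data.x, data.edge_index).argmax(dim=1)
print(f"Accuracy: {(pred == data.y).float().mean():.2%}")
```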
+ +![that'sallfolks](giphy.gif) + +
+ +I hope this little write-up was worth reading, and maybe it gave you some ideas for your _Rush_ ! + +Thank you for getting to the end of this AI Pool, it means a lot ! And I can't wait to see what projects you're able to build this weekend ! + +> **Remember**: no one appreciates an unfinished mess !\ +> You've only been studying AI for a week now, so don't go crazy trying to create GPT 5 : stay humble, build a small, cool project to show off and most importantly, one that you'll be proud of in the future !\ +> Or if that's really your thing, go crazy and prove me that one week is all you need to build the project of your dreams ! + +#### Good luck ! diff --git a/AI/Day5/gnn/giphy.gif b/AI/Day5/gnn/giphy.gif new file mode 100644 index 0000000..02fbfc7 Binary files /dev/null and b/AI/Day5/gnn/giphy.gif differ diff --git a/AI/Day5/gnn/graph.svg b/AI/Day5/gnn/graph.svg new file mode 100644 index 0000000..6ee8ec4 --- /dev/null +++ b/AI/Day5/gnn/graph.svg @@ -0,0 +1,159 @@ + + + + + + + + + + + + + B + + + + + + + + + + + + + + + + + + + + + + + + C + + + + + + + + + + + + + + + + + + + + + + + + A + + + + + + + + + + + Graph + + + + + + + Node + + + + + + + Edge + \ No newline at end of file diff --git a/AI/Day5/gnn/graph2.svg b/AI/Day5/gnn/graph2.svg new file mode 100644 index 0000000..52245c9 --- /dev/null +++ b/AI/Day5/gnn/graph2.svg @@ -0,0 +1,106 @@ + + + + + + + + + + + + + A + + + + + + B + + + + + + + + + + + + C + + + + + + D + + + + + + + + + + + + + + \ No newline at end of file diff --git a/AI/Day5/gnn/graph3.svg b/AI/Day5/gnn/graph3.svg new file mode 100644 index 0000000..56f2c17 --- /dev/null +++ b/AI/Day5/gnn/graph3.svg @@ -0,0 +1,153 @@ + + + + + + + + + + + + + B + + + + + + + + + + + + + + + + + + + + + + + + C + + + + + + A + + + + + + + + + + + Isolated Node + + + + + + + Loop + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/AI/Day5/gnn/graph4.svg b/AI/Day5/gnn/graph4.svg new file mode 100644 index 0000000..376bfb1 --- /dev/null +++ b/AI/Day5/gnn/graph4.svg @@ -0,0 +1,127 @@ + + + + + + + + + + + + + B + + + + + + + + + + + + + + + + + + + + + + + + C + + + + + + A + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/AI/Day5/gnn/image.png b/AI/Day5/gnn/image.png new file mode 100644 index 0000000..159acb7 Binary files /dev/null and b/AI/Day5/gnn/image.png differ diff --git a/AI/Day5/gnn/output.png b/AI/Day5/gnn/output.png new file mode 100644 index 0000000..a890fc2 Binary files /dev/null and b/AI/Day5/gnn/output.png differ diff --git a/AI/Day5/nlp.ipynb b/AI/Day5/nlp.ipynb new file mode 100644 index 0000000..b73408b --- /dev/null +++ b/AI/Day5/nlp.ipynb @@ -0,0 +1,516 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# ~ PoC AI Pool 2024 ~\n", + "- ## Day 5: NLP\n", + " - ### Module 1: Emotion Recognition with NLP\n", + "-----\n", + "Welcome to the final day of your PoC AI Pool !\n", + "\n", + "In this module, we'll see a different way of using PyTorch to to build a Natural Language Processing neural network which is capable of detecting the language of a given sentence." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. 
Data Cleaning" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "import seaborn as sns\n", + "import sklearn\n", + "import torch\n", + "import torch.nn as nn" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's import the language dataset from the `datasets` package πŸ“¦ :\n", + "\n", + ">Datasets is a library for easily accessing and sharing datasets for Audio πŸ”‰, Computer Vision πŸ‘οΈ , and Natural Language Processing (NLP) πŸ“– tasks.\n", + "\n", + "We will be using the [papluca/language-identification](https://huggingface.co/datasets/papluca/language-identification)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from datasets import load_dataset\n", + "\n", + "dataset = load_dataset(\"papluca/language-identification\")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The below code will transform your dataset into a pandas Dataframe which we will use for the rest of this module." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def filter_dataset(data, languages):\n", + " return data.filter(lambda x: languages.__contains__(x['labels']))\n", + "\n", + "def process_dataset(data):\n", + " return data.map(lambda x: {'data': (x['labels'], x['text'])})['data']\n", + "\n", + "languages = {\n", + " 'fr': 'french',\n", + " 'en': 'english',\n", + " 'es': 'spanish',\n", + " 'de': 'german'\n", + "}\n", + "\n", + "filtered_data = filter_dataset(dataset['train'], list(languages.keys()))\n", + "processed_data = process_dataset(filtered_data)\n", + "\n", + "df = pd.DataFrame(processed_data, columns=[\"languages\", \"text\"])\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Your output should look like this:\n", + "\n", + "![](images/expected_output_lang.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 1. Cleaning the data 🧹\n", + "\n", + "\n", + "\n", + "First off, you need to clean the data using natural language processing techniques.\n", + "\n", + "However you achieve this, your cleaned data should be available inside a pandas dataframe." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As long as you've cleaned it correctly, it doesn't matter what your result is.\n", + "\n", + "As an example, the sentence \"May The Force be with you.\" might become \"may force\" when cleaned.\\\n", + "If your result looks like that, it means you've implemented the cleaning process correctly. 
πŸ‘" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import nltk\n", + "from nltk.corpus import stopwords\n", + "nltk.download(\"stopwords\")\n", + "nltk.download(\"popular\")\n", + "\n", + "import re" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "languages = [languages[language] for language in languages.keys()]\n", + "stop_words = stopwords.words(languages)\n", + "\n", + "def clean(sentence):\n", + " \"\"\"\n", + " You should clean the data inside this function by using\n", + " different nlp techniques.\n", + " \"\"\"\n", + "\n", + " clean_data = sentence\n", + "\n", + " # Enter your code here\n", + "\n", + "\n", + "\n", + "\n", + " #\n", + "\n", + " return clean_data\n", + "\n", + "df[\"clean\"] = df[\"text\"].apply(clean)\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### 2. Count Vectorizer πŸ’»\n", + "\n", + "\n", + "Now, in order to prepare the data for usage inside a neural network, you need to vectorize each word in the vocabulary and replace all usages inside your data with the corresponding tensors.\n", + "\n", + "- Step 1: Build a vocabulary containing each word in the dataset (each word must only appear once)\n", + "- Step 2: Vectorize each sentence in the dataset πŸ”‘ -> πŸ”’ by replacing it with an array containing the number of occurences of each word in the vocabulary inside the sentence.\n", + "- Step 3: Vectorize your labels (for example, you can replace french πŸ‡«πŸ‡· with index 0, spanish πŸ‡ͺπŸ‡Έ with index 1, etc... )\n", + "\n", + "If you implement all of these steps correctly, you will have a vectorized dataset which will be processable inside a neural network ! \n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You might first want to create a vocabulary comprised of all the words in your cleaned data.\n", + "\n", + ">Build a vocabulary containing each word in the dataset (each word must only appear once)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def build_vocab(sentences):\n", + " \"\"\"\n", + " This method should return a vocabulary of all unique words in our dataframe\n", + " \"\"\"\n", + " ### Enter your code here\n", + "\n", + "\n", + " \n", + "\n", + " ###\n", + "\n", + " return None" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If the `build_vocab()` function is implemented properly, you should be able to run the code below πŸ‘‡ and see how many words were removed thanks to cleaning." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "vocab_vanilla = build_vocab(df[\"text\"].apply(nltk.word_tokenize))\n", + "vocab = build_vocab(df[\"clean\"])\n", + "\n", + "print(f\"Number of words in unprocessed data: {len(vocab_vanilla)}\")\n", + "print(f\"Number of words in processed data: {len(vocab)}\")\n", + "\n", + "vocab" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, for the fun part: implement the Count Vectorizer\n", + "\n", + ">Vectorize each sentence in the dataset πŸ”‘ -> πŸ”’ by replacing it with an array containing the number of occurences of each word in the vocabulary inside the sentence." 
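, + "\n", + "If you get stuck, here's a rough sketch of one possible approach (it relies on the `word2idx` dictionary built in the next cell, which maps every word of the vocabulary to its index):\n", + "\n", + "```python\n", + "# just a hint, not the only valid solution\n", + "vector = [0] * len(vocab)\n", + "for word in sentence:  # one cleaned sentence (tokenize it first if yours is a string)\n", + "    if word in word2idx:\n", + "        vector[word2idx[word]] += 1\n", + "```"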
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "word2idx = {}\n", + "\n", + "for index, word in enumerate(vocab):\n", + " word2idx[word] = index\n", + "\n", + "def vectorize(sentences):\n", + " vectorized = []\n", + "\n", + " ### Enter your code here\n", + "\n", + "\n", + "\n", + "\n", + "\n", + " ###\n", + "\n", + " return vectorized\n", + "\n", + "df[\"vectorized\"] = vectorize(df[\"clean\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now for the label vectorization:\n", + "\n", + ">Vectorize your labels (for example, you can replace french πŸ‡«πŸ‡· with index 0, spanish πŸ‡ͺπŸ‡Έ with index 1, etc... )" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Label Vectorizer\n", + "\n", + "languages_dict = {\n", + " \"fr\": 0,\n", + " \"en\": 1,\n", + " \"es\": 2,\n", + " \"de\": 3,\n", + "}\n", + "\n", + "labels = []\n", + "\n", + "# Enter your code here\n", + "\n", + "#\n", + "\n", + "labels" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Neural Network 🧠\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In order to process the data with PyTorch, let's convert it into tensors:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x = torch.FloatTensor(df[\"vectorized\"])\n", + "y = torch.LongTensor(labels)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, you need to create your neural network and train a model on our data.\n", + "\n", + "- Step 1: Build a network in [PyTorch](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) (your model can be simple as long as it does the job)\n", + "- Step 2: Split your data into train and test subsets (you can use [sklearn's method](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for this)\n", + "- Step 3: Train a model on your data until you reach a good accuracy (above 90%)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "### Neural Network\n", + "\n", + "class Network(nn.Module):\n", + " def __init__(self):\n", + " super(Network, self).__init__()\n", + "\n", + " def forward(self, x):\n", + " pass\n", + "\n", + "###\n", + "\n", + "model = Network()\n", + "\n", + "criterion = None\n", + "optimizer = None\n", + "\n", + "from torch.utils.data import Dataset, DataLoader\n", + "from sklearn.model_selection import train_test_split\n", + "\n", + "class MyData(Dataset):\n", + " \"\"\"\n", + " This class will be useful when working with batches\n", + " \"\"\"\n", + "\n", + " def __init__(self, x, y):\n", + " self.data = x\n", + " self.target = y\n", + "\n", + " def __getitem__(self, index):\n", + " x = self.data[index]\n", + " y = self.target[index]\n", + "\n", + " return x, y\n", + "\n", + " def __len__(self):\n", + " return len(self.data)\n", + "\n", + "### Training and Testing\n", + "\n", + "def training_loop(x, y):\n", + " x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)\n", + "\n", + " train_dataset = MyData(x_train, y_train)\n", + " test_dataset = MyData(x_test, y_test)\n", + "\n", + " train_dataset = DataLoader(train_dataset, batch_size=32)\n", + " test_dataset = DataLoader(test_dataset, batch_size=32)\n", + "\n", + " # Enter your code here\n", + "\n", + "\n", + "\n", + 
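"    # One possible outline (just a sketch: it assumes that `model`, `criterion`\n", + "    # and `optimizer` have actually been instantiated above instead of left as None):\n", + "    #\n", + "    # for epoch in range(10):\n", + "    #     for batch_x, batch_y in train_dataset:\n", + "    #         optimizer.zero_grad()\n", + "    #         loss = criterion(model(batch_x), batch_y)\n", + "    #         loss.backward()\n", + "    #         optimizer.step()\n", + "    #\n", + "    # Then switch to evaluation (model.eval() and torch.no_grad()) and count the\n", + "    # correct predictions on both loaders to fill in the accuracies below.\n",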
"\n", + "\n", + "\n", + " #\n", + "\n", + " train_accuracy = None\n", + " test_accuracy = None\n", + "\n", + " return train_accuracy, test_accuracy\n", + "\n", + "###\n", + "\n", + "# Store the predictions for all of our data as well as the % of training and testing accuracy inside `predictions`, `train_accuracy` and `test_accuracy`\n", + "train_accuracy, test_accuracy = training_loop(x, y)\n", + "\n", + "print(f\"Train accuracy: {train_accuracy}\")\n", + "print(f\"Test accuracy: {test_accuracy}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If all went well, your accuracy should be close to 100%. πŸ’―\n", + "\n", + "Now, let's see how well the model guesses a language:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "### Prediction\n", + "\n", + "idx2lang = {\n", + " 0: \"fr\",\n", + " 1: \"en\",\n", + " 2: \"es\",\n", + " 3: \"de\",\n", + "}\n", + "\n", + "def predict(x):\n", + " predictions = []\n", + "\n", + " return predictions\n", + "\n", + "predictions = predict(x)\n", + "\n", + "df[\"predictions\"] = predictions\n", + "\n", + "df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sns.countplot(x='value', hue=\"variable\", data=df[['languages', 'predictions']].melt())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Awesome ! πŸ˜„\n", + "\n", + "You've successfully created a language detection AI using Natural Language Processing and neural networks." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def predict_sentence(sentence):\n", + " return predict(vectorize([clean(sentence)]))\n", + "\n", + "predict_sentence(\"J'ai rΓ©ussi Γ  implΓ©menter une intelligence artificielle !\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.6" + }, + "orig_nbformat": 4, + "vscode": { + "interpreter": { + "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1" + } + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}