This project demonstrates the use of FastText for creating and exploring word embeddings, with a special focus on Indian food recipes. We'll use pre-trained models for English and Hindi, and then train a custom model on Indian food recipe data.
- Installation
- Downloading Pre-trained Models
- Using Pre-trained FastText Models
- Custom Training on Indian Food Recipes
- Conclusion
- Further Resources
First, install the required libraries:
pip install fasttext pandas
FastText provides pre-trained word vectors for 157 languages. To download the models used in this project:
- Visit the FastText website
- Scroll down to the "Pre-trained word vectors" section
- Download the following files:
- For English:
cc.en.300.bin.gz
- For Hindi:
cc.hi.300.bin.gz
After downloading, extract the .bin files from the .gz archives:
gunzip cc.en.300.bin.gz
gunzip cc.hi.300.bin.gz
Move the extracted .bin files to your project directory or a designated models folder.
Load the pre-trained English model and explore word embeddings:
import fasttext
# Load the pre-trained English model
model_en = fasttext.load_model('path/to/cc.en.300.bin')
# Get nearest neighbors for 'good'
print(model_en.get_nearest_neighbors('good'))
# Check the shape of the word vector
print(model_en.get_word_vector("good").shape)
# Get analogies
print(model_en.get_analogies("berlin", "germany", "france"))
Output:
[(0.7517593502998352, 'bad'),
(0.7426098585128784, 'great'),
(0.7299689054489136, 'decent'),
...]
(300,)
[(0.7303731441497803, 'paris'),
(0.6408537030220032, 'france.'),
(0.6393311023712158, 'avignon'),
...]
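The scores in these outputs are cosine similarities between word vectors. Here is a minimal sketch of that computation with NumPy; small dummy vectors stand in for the real 300-dimensional embeddings, which would come from model_en.get_word_vector:

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine similarity: dot product divided by the product of the norms.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Dummy 3-d vectors standing in for real 300-d embeddings.
u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])     # same direction as u
w = np.array([-1.0, -2.0, -3.0])  # opposite direction

print(cosine_similarity(u, v))  # ≈ 1.0
print(cosine_similarity(u, w))  # ≈ -1.0
```

A score near 1.0 means the two words occur in very similar contexts, which is why 'good' sits close to both 'great' and 'bad': all three appear in the same evaluative positions.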
Load the pre-trained Hindi model and explore word embeddings:
# Load the pre-trained Hindi model
model_hi = fasttext.load_model('path/to/cc.hi.300.bin')
# Get nearest neighbors for "अच्छा" (good)
print(model_hi.get_nearest_neighbors("अच्छा"))
Output:
[(0.6697985529899597, 'बुरा'),
(0.6132625341415405, 'अच्छे'),
(0.608695387840271, 'अच्चा'),
...]
Note that बुरा means "bad", अच्छे is an inflected form of अच्छा ("good"), and अच्चा is a common misspelling of अच्छा; FastText's character n-grams place even misspellings near the correct form.
- Load the dataset:
import pandas as pd
import re
df = pd.read_csv("Cleaned_Indian_Food_Dataset.csv")
print(df.shape)
print(df.head(3))
- Define a preprocessing function:
def preprocess(text):
    # Replace anything that is not a word character, whitespace,
    # or an apostrophe with a space.
    text = re.sub(r'[^\w\s\']', ' ', text)
    # Collapse runs of spaces and newlines into a single space.
    text = re.sub(r'[ \n]+', ' ', text)
    return text.strip().lower()
- Apply the preprocessing function and export the result to a text file:
df["TranslatedInstructions"] = df["TranslatedInstructions"].map(preprocess)
df.to_csv("food_recipes.txt", columns=["TranslatedInstructions"], header=False, index=False)
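To sanity-check the preprocessing, here is the same preprocess function applied to a sample instruction (repeated here with its import so the snippet runs on its own):

```python
import re

def preprocess(text):
    # Replace anything that is not a word character, whitespace,
    # or an apostrophe with a space, then collapse whitespace.
    text = re.sub(r'[^\w\s\']', ' ', text)
    text = re.sub(r'[ \n]+', ' ', text)
    return text.strip().lower()

print(preprocess("Add 2 cups of rice,\nthen stir!"))
# add 2 cups of rice then stir
```

Punctuation is stripped, whitespace is normalized, and the text is lowercased, while digits and apostrophes survive.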
Train a custom FastText model on the preprocessed data:
import fasttext
model = fasttext.train_unsupervised("food_recipes.txt")
Explore the custom-trained model:
# Get nearest neighbors for 'paneer'
print(model.get_nearest_neighbors("paneer"))
# Get nearest neighbors for 'halwa'
print(model.get_nearest_neighbors("halwa"))
Output:
[(0.6676578521728516, 'tikka'),
(0.6331593990325928, 'bhurji'),
(0.6316412687301636, 'tikkas'),
...]
[(0.7327786087989807, 'khoya'),
(0.7155830264091492, 'sheera'),
(0.6999987363815308, 'rabri'),
...]
This project demonstrates the versatility of FastText for working with word embeddings. We've shown how to use pre-trained models for English and Hindi, as well as how to train a custom model on domain-specific data. These techniques can be applied to various natural language processing tasks, particularly those involving specialized vocabularies or multilingual contexts.
The custom-trained model now provides word embeddings specifically tailored to Indian food recipes, allowing for more accurate and relevant word associations within this domain. This can be particularly useful for applications such as recipe recommendation systems, ingredient substitution suggestions, or culinary trend analysis.
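As a sketch of the ingredient-substitution idea, one could wrap the nearest-neighbor query in a small helper. suggest_substitutes is our own hypothetical name, not a fastText API, and a stub stands in for the trained model so the snippet runs on its own:

```python
def suggest_substitutes(model, ingredient, k=10, threshold=0.6):
    # Hypothetical helper (not a fastText API): keep only neighbors
    # whose similarity score clears the threshold.
    return [word
            for score, word in model.get_nearest_neighbors(ingredient, k=k)
            if score >= threshold]

# Stub standing in for the trained model; the pairs echo the
# 'paneer' neighbors shown earlier.
class StubModel:
    def get_nearest_neighbors(self, word, k=10):
        return [(0.667, "tikka"), (0.633, "bhurji"), (0.501, "masala")]

print(suggest_substitutes(StubModel(), "paneer"))  # ['tikka', 'bhurji']
```

With the real custom model in place of the stub, the threshold filters out weakly related words so only plausible substitutions remain.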
- FastText Official Website
- FastText GitHub Repository
- Tutorial on Word Embeddings
- Research Paper: Enriching Word Vectors with Subword Information
Happy embedding! 🚀🍽️