Word2Vec and FastText Word Embeddings with Gensim in Python


🚀 Business Objective

In Natural Language Processing (NLP), extracting context from text is a hard problem. Word embeddings address it by representing words as dense vectors that capture semantic similarity. This project builds domain-specific medical word embeddings using Word2Vec and FastText in Python.

📊 Data Description

The project uses a Covid-19 clinical trials dataset obtained from Dimensions COVID-19 Publications, Datasets, and Clinical Trials. The dataset comprises 10666 rows and 21 columns; the work focuses on the 'Title' and 'Abstract' columns.

🎯 Aim

The primary objective is to train Skip-gram (Word2Vec) and FastText word-embedding models, then build a search engine on top of them with a Streamlit UI.

🛠️ Tech Stack

  • Language: Python
  • Libraries: Pandas, NumPy, Matplotlib, Plotly, Gensim, Streamlit, NLTK
  • Environment: Jupyter Notebook

🔍 Approach

  1. Import Essential Libraries.
  2. Read the Dataset.
  3. Pre-process the Data:
    • Remove URLs
    • Convert text to lowercase
    • Remove numerical values
    • Remove punctuation
    • Tokenization
    • Remove stop words
    • Lemmatization
    • Remove '\n' character from columns
  4. Conduct Exploratory Data Analysis (EDA):
    • Word cloud visualization
  5. Train the 'Skip-gram' Model.
  6. Train the 'FastText' Model.
  7. Model Embeddings and Assess Similarity.
  8. Generate PCA Plots for Skip-gram and FastText Models.
  9. Convert Abstract and Title to Vectors using the Skip-gram and FastText Models.
  10. Utilize the Cosine Similarity Function.
  11. Pre-process the Input Query.
  12. Define a Function to Return Top 'n' Similar Results.
  13. Evaluate Results.
  14. Deploy the Streamlit Application.
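
The pre-processing and search steps above (3 and 9 to 12) could be sketched roughly as follows. Several things here are assumptions for illustration: the small inline `STOP_WORDS` set stands in for NLTK's stop-word list, lemmatization is omitted, documents are embedded by averaging token vectors (one common choice, not necessarily the one used in the project), and `toy_vectors` is a hand-made lookup used in place of a trained model's word vectors.

```python
import re
import string

import numpy as np

# Tiny stand-in stop-word list (assumption; the project lists NLTK in its stack).
STOP_WORDS = {"a", "an", "the", "of", "in", "for", "and", "to", "with", "on"}
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Clean and tokenize text: the listed steps, minus lemmatization."""
    text = text.replace("\n", " ")                      # remove '\n'
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = text.lower()                                 # lowercase
    text = re.sub(r"\d+", " ", text)                    # remove numbers
    text = text.translate(PUNCT_TABLE)                  # remove punctuation
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def embed(tokens, vectors, dim):
    """Average the vectors of known tokens; zero vector if none are known."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either has zero norm."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def top_n(query, docs, vectors, dim, n=2):
    """Return the n (score, document) pairs most similar to the query."""
    q_vec = embed(preprocess(query), vectors, dim)
    scored = [(cosine_similarity(q_vec, embed(preprocess(d), vectors, dim)), d)
              for d in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:n]


# Hand-made 3-d vectors, purely illustrative.
toy_vectors = {
    "vaccine": np.array([1.0, 0.0, 0.0]),
    "immunization": np.array([0.9, 0.1, 0.0]),
    "trial": np.array([0.0, 1.0, 0.0]),
    "ventilator": np.array([0.0, 0.0, 1.0]),
}
docs = ["Vaccine trial in 2020", "Ventilator supply", "Immunization of adults"]

results = top_n("vaccine", docs, toy_vectors, dim=3, n=2)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

With a trained Gensim model, `model.wv` could likely be passed in place of `toy_vectors`, since it supports the same `in` and `[]` access used by `embed`.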

📝 Project Takeaways

  1. Understanding the business problem.
  2. Grasping the architecture to build the Streamlit application.
  3. Mastery of Word2Vec and FastText models.
  4. Importing datasets and necessary libraries.
  5. Data Pre-processing.
  6. Basic Exploratory Data Analysis (EDA).
  7. Training Skip-gram model with varying parameters.
  8. Training FastText model with varying parameters.
  9. Understanding and implementing embedding models.
  10. Plotting PCA plots.
  11. Obtaining vectors for each attribute.
  12. Executing the Cosine similarity function.
  13. Input query pre-processing.
  14. Result evaluation.
  15. Building a function to return top 'n' similar results for a given query.
  16. Understanding the Streamlit application code.
  17. Deployment of the Streamlit application.
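
For the PCA plotting takeaway, the projection itself can be sketched with NumPy alone. This is a minimal version under stated assumptions: the `pca_2d` helper is hypothetical, PCA is computed via SVD rather than a library routine, and random vectors stand in for real word embeddings. The resulting 2-D coordinates are what a scatter plot in Matplotlib or Plotly would display.

```python
import numpy as np


def pca_2d(matrix):
    """Project the rows of matrix onto their first two principal components."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Random stand-in for six 50-dimensional word vectors (assumption).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 50))

coords = pca_2d(embeddings)
print(coords.shape)
```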

🔗 Get Connected

For more insightful projects and collaboration, connect with me on: