Sentiment Analysis with PySpark and Transformers

This project implements a scalable sentiment analysis framework built on PySpark, Spark NLP, and Hugging Face Transformers. It processes large volumes of text in a distributed environment and uses GPU acceleration, quantized models, and mixed precision to speed up inference. It produces sentiment scores at both the sentence and document levels.

Features

Distributed text processing using PySpark for handling large datasets.
Sentence-level sentiment analysis using Spark NLP for granular results.
Supports state-of-the-art transformer models (e.g., RoBERTa) from Hugging Face for sentiment analysis.
Optimized for GPU acceleration, including quantization and mixed precision (bf16/float16) for faster inference.
Dynamic batching and efficient memory management to improve inference speed.
Comprehensive logging and experiment tracking using MLflow.

Architecture Overview

Data Loading: Load data (e.g., product reviews) from Parquet or CSV files into PySpark DataFrames.
Sentence Parsing: Use Spark NLP’s SentenceDetector to split text into sentences for detailed sentiment analysis.
Transformer Model Setup: Load pre-trained or fine-tuned transformer models and tokenizers from Hugging Face, optimized with quantization and mixed precision.
Batching Strategy: Apply dynamic or sequential batching based on the dataset and resource availability.
Sentiment Inference: Process text through the transformer model to generate sentiment scores.
Logging and Monitoring: Track all processes, metrics, and parameters using MLflow for experiment tracking.
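
The README does not spell out how model outputs become sentence scores or how sentence scores roll up to a document score. The sketch below shows one plausible approach in plain Python. The three-class label order, the `P(positive) - P(negative)` score, and the length-weighted mean are all assumptions for illustration, not the repository's confirmed method.

```python
import math

# Assumed label order for a three-class sentiment head; check the model's
# config.id2label before relying on it.
LABELS = ("negative", "neutral", "positive")


def softmax(logits):
    """Convert raw logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sentence_score(logits):
    """Collapse class probabilities into one score in [-1, 1]: P(pos) - P(neg)."""
    probs = dict(zip(LABELS, softmax(logits)))
    return probs["positive"] - probs["negative"]


def document_score(sentences, scores):
    """Length-weighted mean of sentence scores (hypothetical roll-up rule).

    Assumes at least one sentence; weights are word counts, with a floor of 1.
    """
    weights = [max(len(s.split()), 1) for s in sentences]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


sentences = ["Great battery life.", "The screen scratches far too easily though."]
logits = [[-2.0, 0.0, 3.0], [2.5, 0.0, -1.5]]  # made-up model outputs
scores = [sentence_score(row) for row in logits]
print(scores, document_score(sentences, scores))
```

Weighting by sentence length keeps a short aside from outweighing a long, detailed complaint; a plain mean or a max-magnitude rule would be equally easy to swap in.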

Quick Start

Prerequisites

PySpark: Ensure PySpark is installed and set up correctly for distributed processing.
Hugging Face Transformers: Install the Hugging Face transformers library for NLP models.
Spark NLP: Required for sentence parsing and NLP tasks.
MLflow: For tracking experiments and logging metrics.
CUDA: If running on GPU, ensure CUDA is properly configured.

Installation

Clone the repository:

git clone https://github.com/anmolg1997/Spark_cum_GPU_sentiment_analyzer.git
cd Spark_cum_GPU_sentiment_analyzer

Install the required Python packages:
```
pip install -r requirements.txt
```

Configuration

Set MODEL_DIRECTORY in the script to the location of your pre-trained transformer model.
Set the MLflow experiment name to the experiment path where you want runs to be logged.
Customize batch size, max sequence length, and other hyperparameters based on your dataset.
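
One way to keep these settings together is a small validated config object. This is a sketch only: `AnalyzerConfig`, its field names, and the default values are hypothetical; the README names only `MODEL_DIRECTORY`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalyzerConfig:
    """Hypothetical settings bundle; only MODEL_DIRECTORY appears in the README."""

    model_directory: str  # corresponds to MODEL_DIRECTORY
    mlflow_experiment: str = "/Shared/sentiment-analysis"  # placeholder name
    batch_size: int = 32
    max_seq_length: int = 256

    def __post_init__(self):
        if self.batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        if not 1 <= self.max_seq_length <= 512:
            # 512 tokens is the usual input limit for RoBERTa-style encoders.
            raise ValueError("max_seq_length must be between 1 and 512")


config = AnalyzerConfig(model_directory="/path/to/your/model", batch_size=64)
print(config)
```

Validating at construction time surfaces a bad batch size or sequence length before any Spark job or model load starts.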

Running the Script

Initialize a Spark session and load your data:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Sentiment Analysis").getOrCreate()

# Load data into a DataFrame (from Parquet/CSV)
df = spark.read.parquet("/path/to/your/data")

Create an instance of the SentimentAnalyzer class and run the sentiment analysis:

from sentiment_analyzer import SentimentAnalyzer

sentiment_analyzer = SentimentAnalyzer(spark)
result_df = sentiment_analyzer.trigger_SentimentInference(df, text_column="text", sentParse=True)

# Show results
result_df.show(truncate=False)

Once the analysis is complete, the results are saved and logged in MLflow.

Example Usage

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

from sentiment_analyzer import SentimentAnalyzer

if __name__ == "__main__":
    spark = SparkSession.builder.appName("Sentiment Analysis").getOrCreate()
    test_sentiment_df = spark.read.parquet("/mnt/prod/inputs/data_sources/reviews.parquet")
    # Combine the review title and body into the single text column to analyze
    test_sentiment_df = test_sentiment_df.withColumn("text", F.concat_ws(" . ", "ReviewTitle", "ReviewBody"))

    sentiment_analyzer = SentimentAnalyzer(spark)
    result_df = sentiment_analyzer.trigger_SentimentInference(test_sentiment_df, text_column="text", sentParse=True)
    result_df.show(truncate=False)

Performance Optimizations

GPU Acceleration: Automatically detects an available GPU and uses it for inference.
Quantization: Uses 4-bit quantization to reduce the memory footprint of transformer models and, depending on the hardware, speed up inference.
Mixed Precision: Applies mixed precision (bf16/float16) for faster computation, usually with little loss of accuracy.
Dynamic Batching: Adjusts batch size based on dataset size and system resources for optimized processing.
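
The GPU detection, quantization, and mixed precision items above could be combined roughly as follows. This is a sketch under assumptions: the function names are hypothetical, it relies on the `transformers`, `torch`, and `bitsandbytes` packages, and the repository's actual loading code may differ.

```python
def choose_compute_dtype(cuda_available, bf16_supported):
    """Pick a dtype name: bf16 where the GPU supports it, else fp16; fp32 on CPU."""
    if not cuda_available:
        return "float32"
    return "bfloat16" if bf16_supported else "float16"


def load_quantized_classifier(model_directory):
    """Load a tokenizer and a sequence classifier, 4-bit quantized when a GPU exists.

    Heavy libraries are imported lazily so choose_compute_dtype stays usable
    on machines without them.
    """
    import torch
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        BitsAndBytesConfig,
    )

    cuda = torch.cuda.is_available()
    dtype_name = choose_compute_dtype(cuda, cuda and torch.cuda.is_bf16_supported())
    dtype = getattr(torch, dtype_name)

    tokenizer = AutoTokenizer.from_pretrained(model_directory)
    if cuda:
        quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=dtype)
        model = AutoModelForSequenceClassification.from_pretrained(
            model_directory, quantization_config=quant, device_map="auto"
        )
    else:
        # This sketch only quantizes on CUDA; on CPU it loads full precision.
        model = AutoModelForSequenceClassification.from_pretrained(model_directory)
    return tokenizer, model.eval()


print(choose_compute_dtype(cuda_available=True, bf16_supported=False))
```

Preferring bf16 over float16 where available trades a little mantissa precision for a wider exponent range, which tends to avoid overflow in attention scores.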

Logging and Experiment Tracking

This framework integrates MLflow for experiment tracking. Parameters, metrics, and logs (including errors) are automatically recorded for each run.
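
As an illustration of what one tracked run might record, the sketch below reduces sentence scores to a few summary metrics and logs them with standard MLflow calls. The helper names and the metric names are hypothetical; the README does not list which metrics the framework logs.

```python
def summarize_scores(scores):
    """Reduce per-sentence scores to a few run-level metrics."""
    n = len(scores)
    if n == 0:
        return {"n_sentences": 0}
    return {
        "n_sentences": n,
        "mean_score": sum(scores) / n,
        "share_positive": sum(s > 0 for s in scores) / n,
    }


def log_run(experiment_name, params, scores):
    """Record parameters and summary metrics for one inference run in MLflow."""
    import mlflow  # imported lazily so summarize_scores works without MLflow

    mlflow.set_experiment(experiment_name)
    with mlflow.start_run():
        mlflow.log_params(params)
        mlflow.log_metrics(summarize_scores(scores))


print(summarize_scores([0.5, -0.5, 1.0]))
```

A call such as `log_run("sentiment-analysis", {"batch_size": 32}, scores)` would create one run under that experiment; the experiment name here is a placeholder.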

Customization

You can change the transformer model by updating the model path in the SentimentAnalyzer class.
Modify the batching strategy (dynamic, sequential, or no batching) by adjusting the enable_batching parameter.
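
To make the three strategies concrete, here is a minimal plain-Python sketch. The mode strings passed to `enable_batching` are illustrative, since the README names the parameter but not its accepted values. "Dynamic" is interpreted here as length-bucketed batching, which is an assumption and may differ from the repository's implementation.

```python
def make_batches(texts, batch_size=32, enable_batching="dynamic"):
    """Return lists of indices into `texts`, one list per batch.

    Mode names are hypothetical: "dynamic", "sequential", or "none".
    """
    order = list(range(len(texts)))
    if enable_batching == "none":
        return [[i] for i in order]
    if enable_batching == "dynamic":
        # Group texts of similar length so each batch needs little padding.
        order.sort(key=lambda i: len(texts[i]))
    elif enable_batching != "sequential":
        raise ValueError(f"unknown batching mode: {enable_batching!r}")
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


texts = ["ok", "a much longer review about the product", "fine", "not great, not terrible"]
for mode in ("dynamic", "sequential", "none"):
    print(mode, make_batches(texts, batch_size=2, enable_batching=mode))
```

Returning indices rather than the texts themselves makes it straightforward to restore the original row order after inference.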

Future Work

Support for multiple transformer models for ensemble sentiment analysis.
Extending the framework for multi-class sentiment classification.
Integration with real-time data pipelines for live sentiment analysis.

Contact

For further information or inquiries, feel free to contact Anmol Jaiswal at [email protected]
