This project is an end-to-end text summarization web app powered by a custom fine-tuned Hugging Face model 🤗. The app is deployed on AWS (ECR/EC2) and lets users input text, from which it generates concise, coherent summaries. The model is optimized on domain-specific data for improved accuracy 🚀.
Follow these steps to install and run the project:
- Clone the repository:

  git clone https://github.com/Rahul-404/End-to-end-Text-Summarizer.git
  cd End-to-end-Text-Summarizer
- Create and activate a virtual environment (recommended):

  python -m venv venv
  source venv/bin/activate   # On Windows: venv\Scripts\activate
- Install dependencies:

  pip install -r requirements.txt
- Set up environment variables: if you deploy with AWS services such as ECR/EC2, make sure your AWS credentials are configured. Set any other required environment variables as well.
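For example, AWS credentials can be supplied through the standard AWS environment variables (the values below are placeholders — substitute your own):

```shell
# Placeholder credentials — replace with your own values.
export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
export AWS_DEFAULT_REGION="us-east-1"
```

Alternatively, `aws configure` stores the same values in `~/.aws/credentials`.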
The project is organized as follows:
End-to-end-Text-Summarizer/
│
├── src/
│   └── textSummarizer/
│       ├── __init__.py                        # Initialization of the project package
│       ├── components/
│       │   ├── __init__.py                    # Component initialization
│       │   ├── data_ingestion.py              # Data ingestion logic (if needed)
│       │   ├── data_transformation.py         # Data preprocessing (text cleaning, etc.)
│       │   ├── data_validation.py
│       │   ├── data_evaluation.py
│       │   └── model_trainer.py               # Fine-tuned text summarizer model
│       ├── config/
│       │   ├── __init__.py
│       │   └── configuration.py
│       ├── constants/
│       │   └── __init__.py
│       ├── entity/
│       │   └── __init__.py
│       ├── logging/
│       │   └── __init__.py
│       ├── pipeline/
│       │   ├── __init__.py                    # Pipeline initialization
│       │   ├── summarization_pipeline.py      # Text summarization pipeline logic
│       │   ├── prediction.py
│       │   ├── stage_01_data_ingestion.py
│       │   ├── stage_02_data_validation.py
│       │   ├── stage_03_data_transformation.py
│       │   ├── stage_04_model_trainer.py
│       │   └── stage_05_model_evaluation.py
│       └── utils/
│           ├── __init__.py
│           └── common.py
├── app.py                                     # Main script for running the app
├── Dockerfile                                 # Docker configuration to containerize the app
├── requirements.txt                           # Python dependencies
├── setup.py                                   # Setup script for packaging
├── artifacts/                                 # Directory to store trained models and outputs
└── README.md                                  # Project documentation
- src/textSummarizer/: The main source code directory where all the core project files are located.
- components/: Contains the logic for components such as data ingestion, transformation, and model training.
- pipeline/: Contains scripts that define the text summarization pipeline, handling the text input and output flow.
- exception.py: Custom exceptions for error handling.
- logger.py: Logging utilities to keep track of the application's execution and errors.
- utils.py: Utility functions used throughout the project, such as metrics calculation or loading pre-trained models.
- app.py: Main entry point to start the application, interact with the summarizer, and handle user input/output.
- Dockerfile: Configuration file for containerizing the application using Docker.
- requirements.txt: The list of dependencies needed to run the project.
- setup.py: Setup script for packaging and installing the project.
- artifacts/: Directory for storing models, data, and outputs.
Once the dependencies are installed, you can start the application by running the following command:
python app.py
This will start a local server (usually at http://localhost:8080), allowing users to interact with the text summarization model via a simple web interface.
- Fine-Tuned Summarization Model: A custom fine-tuned Hugging Face model optimized on domain-specific data to generate accurate summaries.
- Text Input: Users can input long-form text or documents for summarization.
- Concise Summaries: The app generates concise, accurate summaries of the provided text.
- AWS Deployment: The application is deployed on AWS EC2 using Docker, with the model hosted on AWS ECR for scalable production use.
The text summarizer uses a Hugging Face pre-trained model fine-tuned on domain-specific data. Fine-tuning is performed to ensure that the model provides better and more relevant summaries for specific types of content (e.g., news articles, scientific papers, etc.).
- Data Preprocessing: Text is cleaned, tokenized, and prepared for input into the Hugging Face model.
- Model Fine-Tuning: The model is fine-tuned on domain-specific data to improve summarization quality.
- Inference: The fine-tuned model generates the summary for the given input text.
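The three stages can be sketched end to end as below. The cleaning helper and the extractive `summarize` stub are illustrative stand-ins only — in the actual app, the fine-tuned Hugging Face model performs the inference step:

```python
import re

def clean_text(text: str) -> str:
    """Normalize whitespace — a stand-in for the real preprocessing step."""
    return re.sub(r"\s+", " ", text).strip()

def summarize(text: str, max_sentences: int = 2) -> str:
    """Toy extractive baseline: keep the first few sentences.
    The real app calls the fine-tuned Hugging Face model here instead."""
    sentences = re.split(r"(?<=[.!?])\s+", clean_text(text))
    return " ".join(sentences[:max_sentences])

article = """Transformers have reshaped NLP.   They power modern
summarizers.  Fine-tuning adapts them to a domain."""
print(summarize(article))
# → Transformers have reshaped NLP. They power modern summarizers.
```

The same clean → model → summary flow applies when the stub is swapped for the real model.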
- Run the app: after starting it with python app.py, open your browser and go to the app's URL (usually http://localhost:8080).
- Input text: in the provided text box, enter the text or document you want to summarize.
- Generate summary: click the "Generate Summary" button, and the app will display the concise summary generated by the model.
We welcome contributions! If you'd like to contribute to this project, please follow these steps:
- Fork the repository.
- Clone your fork to your local machine.
- Create a new branch for your feature or bugfix.
- Make your changes and test them locally.
- Push your changes to your fork.
- Open a pull request with a clear description of your changes.
This project is licensed under the MIT License. See the LICENSE file for more information.
- Hugging Face: For providing pre-trained models and fine-tuning tools for text summarization.
- AWS: For hosting the application on EC2 and managing the containerized model with ECR.
- Libraries Used:
- Transformers for model loading and fine-tuning.
- Flask or Streamlit for web app creation.
- Docker for containerizing the app.
- Pandas and NumPy for data preprocessing and manipulation.
When extending the project, update the files in this order:

- Update config.yaml
- Update params.yaml
- Update the entity
- Update the configuration manager in src/config
- Update the components
- Update the pipeline
- Update main.py
- Update app.py
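As an illustration of the entity/configuration-manager pattern these steps refer to: an entity is a typed, read-only view of one config section, and the configuration manager builds entities from the raw config. The names, fields, and URL below are hypothetical, and a plain dict stands in for the values the real project reads from config.yaml and params.yaml:

```python
from dataclasses import dataclass
from pathlib import Path

# Stand-in for the contents of config.yaml — the real project parses YAML.
CONFIG = {
    "data_ingestion": {
        "root_dir": "artifacts/data_ingestion",
        "source_url": "https://example.com/dataset.zip",  # placeholder URL
    }
}

@dataclass(frozen=True)
class DataIngestionConfig:
    """Entity: a typed, immutable view of one config section."""
    root_dir: Path
    source_url: str

class ConfigurationManager:
    """Turns raw config sections into typed entity objects."""
    def __init__(self, config: dict = CONFIG):
        self.config = config

    def get_data_ingestion_config(self) -> DataIngestionConfig:
        section = self.config["data_ingestion"]
        return DataIngestionConfig(
            root_dir=Path(section["root_dir"]),
            source_url=section["source_url"],
        )

cfg = ConfigurationManager().get_data_ingestion_config()
print(cfg.root_dir)  # → artifacts/data_ingestion
```

Keeping entities frozen means downstream components (ingestion, trainer, pipeline stages) receive validated, typed settings rather than raw dict lookups.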