Image Caption Generation with Audio Outputs (Issue #869) #872

Open. Wants to merge 9 commits into base `main`.
91 changes: 91 additions & 0 deletions Deep_Learning/Image Caption Generation with Audio Output/README.md
@@ -0,0 +1,91 @@
# Image Caption Generator with TTS

This project is a web application that lets users upload an image and generate a caption for it with a pre-trained model. The generated caption can also be converted to speech with Google Text-to-Speech (gTTS) and played or downloaded directly from the webpage.

## Features
- Upload an image file and generate a caption using the `Salesforce/blip-image-captioning-base` model.
- Converts the generated caption into audio using Google Text-to-Speech (gTTS).
- Displays the uploaded image along with the generated caption and an audio player to listen to the caption.
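The caption-then-speech flow above can be sketched as a small helper. The captioner and TTS writer are passed in as callables (stand-ins below, so the sketch runs without downloading the BLIP model); the real app wires in the `transformers` pipeline and `gTTS`:

```python
def caption_and_voice(image_path, captioner, save_speech):
    """Generate a caption for the image, then persist it as speech.

    captioner: callable mapping an image path to a caption string
    save_speech: callable taking (text, mp3_path) and writing the audio
    """
    caption = captioner(image_path)
    # Same base name as the image, with an .mp3 extension
    audio_path = image_path.rsplit('.', 1)[0] + '.mp3'
    save_speech(caption, audio_path)
    return caption, audio_path

# Stand-ins for demonstration only; not the real model or gTTS
spoken = {}
cap, audio = caption_and_voice(
    'dog.jpg',
    captioner=lambda p: 'a dog playing in the park',
    save_speech=lambda text, path: spoken.update({path: text}),
)
print(cap, audio)  # a dog playing in the park dog.mp3
```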


## Project Structure

```
project/
├── app.py # Main Flask app
├── static/ # Static files (uploads and audio)
│ ├── uploads/ # Folder for uploaded images
│ └── audio/ # Folder for audio files generated by gTTS
├── templates/
│ └── index.html # HTML file for rendering the webpage
├── requirements.txt # Python dependencies
└── README.md # Project documentation
```
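The `uploads/` and `audio/` folders do not need to exist ahead of time; `app.py` creates them on startup. A minimal sketch of that setup step (using a temporary directory here so it runs standalone):

```python
import os
import tempfile

# Stand-in for the project root; app.py uses its own working directory
root = tempfile.mkdtemp()

# Create the static subfolders the app writes to, as app.py does on startup
for sub in ('static/uploads', 'static/audio'):
    os.makedirs(os.path.join(root, sub), exist_ok=True)

print(os.path.isdir(os.path.join(root, 'static', 'audio')))  # True
```

`exist_ok=True` makes the call idempotent, so restarting the app never fails on folders left over from a previous run.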

## Installation and Setup

1. **Clone the repository:**

```bash
git clone https://github.com/payal83/image-caption-generator.git
cd image-caption-generator
```

2. **Create a virtual environment:**

```bash
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```

3. **Install dependencies:**

```bash
pip install -r requirements.txt
```

4. **Run the Flask application:**

```bash
python app.py
```

5. **Open your browser and navigate to:**

```
http://127.0.0.1:5000/
```

## Dependencies

This project relies on the following libraries:

- **Flask**: Web framework used to create the application.
- **Pillow**: For image processing.
- **transformers**: Hugging Face transformers library for loading the image captioning model.
- **gTTS**: Google Text-to-Speech library for converting text into audio.
- **Werkzeug**: Used for securing file uploads.

To install the dependencies, use:

```bash
pip install -r requirements.txt
```
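Werkzeug's `secure_filename` is what keeps uploaded filenames from escaping the uploads folder. Roughly, it drops directory components and unusual characters before the name is joined onto a path. The sketch below is an illustrative stdlib approximation of that idea, not Werkzeug's actual implementation (whose exact outputs differ):

```python
import re

def sanitize(filename):
    """Rough stand-in for werkzeug.utils.secure_filename:
    keep only the final path component and safe characters."""
    # Drop any directory prefix, whichever separator was used
    filename = filename.replace('\\', '/').rsplit('/', 1)[-1]
    # Replace whitespace with underscores
    filename = re.sub(r'\s+', '_', filename)
    # Drop anything outside a conservative allow-list
    return re.sub(r'[^A-Za-z0-9_.-]', '', filename).strip('._')

print(sanitize('../../etc/passwd'))  # passwd
print(sanitize('my photo.jpg'))      # my_photo.jpg
```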

## Usage

1. **Upload an Image**:
Upload any image file (e.g., `.jpg`, `.png`) through the web interface.

2. **Generate Caption**:
Once uploaded, the model will generate a caption based on the content of the image.

3. **Play Caption as Audio**:
The caption will also be converted to speech using Google Text-to-Speech (gTTS). An audio player will appear, allowing you to listen to the caption.


## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

56 changes: 56 additions & 0 deletions Deep_Learning/Image Caption Generation with Audio Output/app.py
@@ -0,0 +1,56 @@
from flask import Flask, render_template, request, url_for
from werkzeug.utils import secure_filename
import os
from PIL import Image
from transformers import pipeline
from gtts import gTTS

app = Flask(__name__)

# Configure upload folder
app.config['UPLOAD_FOLDER'] = 'static/uploads'
app.config['AUDIO_FOLDER'] = 'static/audio'
app.config['MAX_CONTENT_LENGTH'] = 16 * 1024 * 1024 # Limit to 16 MB

# Create uploads and audio directories if they don't exist
os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True)
os.makedirs(app.config['AUDIO_FOLDER'], exist_ok=True)

# Initialize the image-to-text pipeline
image_to_text = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

@app.route('/', methods=['GET', 'POST'])
def index():
    caption = ''
    image_url = ''
    audio_url = ''

    if request.method == 'POST' and 'photo' in request.files:
        # Process the uploaded photo
        photo = request.files['photo']
        # secure_filename strips path components; fall back if it returns ''
        filename = secure_filename(photo.filename) or 'upload.jpg'
        filepath = os.path.join(app.config['UPLOAD_FOLDER'], filename)
        photo.save(filepath)

        # Convert the image to RGB so the model gets a consistent input
        image = Image.open(filepath).convert('RGB')

        # Generate caption
        captions = image_to_text(image)
        caption = captions[0]['generated_text']

        # Set image URL for display
        image_url = url_for('static', filename=f'uploads/{filename}')

        # Convert caption to audio using gTTS
        if caption:
            tts = gTTS(text=caption, lang='en')
            audio_filename = f"{filename.rsplit('.', 1)[0]}.mp3"  # Same name, .mp3 extension
            audio_filepath = os.path.join(app.config['AUDIO_FOLDER'], audio_filename)
            tts.save(audio_filepath)
            audio_url = url_for('static', filename=f'audio/{audio_filename}')

    return render_template('index.html', caption=caption, image_url=image_url, audio_url=audio_url)

if __name__ == '__main__':
    app.run(debug=True)
@@ -0,0 +1,7 @@
Flask==2.3.2
Pillow==10.0.0
transformers==4.31.0
torch==2.0.1
gTTS==2.3.2
Werkzeug==2.3.6
gunicorn
@@ -0,0 +1 @@

@@ -0,0 +1,122 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Image Caption Generator</title>
<style>
* {
box-sizing: border-box;
margin: 0;
padding: 0;
}

body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
background-color: #f4f4f4;
color: #333;
display: flex;
justify-content: center;
align-items: center;
min-height: 100vh;
padding: 20px;
}

.container {
background-color: white;
box-shadow: 0px 4px 10px rgba(0, 0, 0, 0.1);
border-radius: 10px;
padding: 30px;
width: 100%;
max-width: 600px;
}

h1 {
font-size: 2.5em;
color: #333;
text-align: center;
margin-bottom: 20px;
}

form {
display: flex;
flex-direction: column;
gap: 15px;
}

input[type="file"] {
border: 1px solid #ddd;
padding: 10px;
border-radius: 5px;
font-size: 1em;
cursor: pointer;
}

button {
background-color: #4CAF50;
color: white;
padding: 15px 20px;
border: none;
border-radius: 5px;
font-size: 1.2em;
cursor: pointer;
transition: background-color 0.3s ease;
}

button:hover {
background-color: #45a049;
}

h2 {
font-size: 1.8em;
margin-top: 30px;
color: #333;
}

img {
max-width: 100%;
border-radius: 10px;
margin-top: 15px;
}

.caption {
font-size: 1.2em;
margin-top: 10px;
padding: 15px;
background-color: #f9f9f9;
border-left: 4px solid #4CAF50;
border-radius: 5px;
}

audio {
margin-top: 20px;
width: 100%;
}
</style>
</head>
<body>
<div class="container">
<h1>Image Caption Generator</h1>
<form action="/" method="post" enctype="multipart/form-data">
<label for="photo">Upload an image:</label>
<input type="file" id="photo" name="photo" accept="image/*" required>
<button type="submit">Generate Caption</button>
</form>

{% if caption %}
<h2>Generated Caption:</h2>
<div class="caption">{{ caption }}</div>
{% if image_url %}
<img src="{{ image_url }}" alt="Uploaded Image">
{% endif %}
{% if audio_url %}
<h3>Audio:</h3>
<audio controls>
<source src="{{ audio_url }}" type="audio/mpeg">
Your browser does not support the audio element.
</audio>
{% endif %}
{% endif %}
</div>
</body>
</html>