A FastAPI server that compares the semantic similarity between two strings using a sentence-transformers model.
- Semantic similarity comparison between text strings
- Uses sentence-transformers model (all-MiniLM-L6-v2 by default)
- Automatically utilizes GPU if available
- Simple REST API with JSON request/response
- Model caching to avoid repeated downloads on server restart
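Under the hood, the comparison boils down to a cosine similarity between sentence embeddings. Here is a minimal sketch of that idea using the sentence-transformers API (illustrative only, not the server's exact code):

```python
# Minimal sketch of the underlying computation; the server's code differs.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # this README's default model

def similarity(sentence1: str, sentence2: str) -> float:
    # Encode both sentences into dense vectors, then compare with cosine similarity.
    embeddings = model.encode([sentence1, sentence2], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

print(similarity("How do I bake a cake?", "What is the process for making a cake?"))
```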
- Clone this repository:
git clone https://github.com/yourusername/similarity-scorer.git
cd similarity-scorer
- Install the required packages:
pip install -r requirements.txt
- Run the development server:
uvicorn main:app --reload
- Or start the server with the startup script:
./start.sh
You can customize the server by setting environment variables:
MODEL_NAME=all-mpnet-base-v2 WORKERS=4 ./start.sh
The server will run at: http://127.0.0.1:16000
Interactive API documentation is available at: http://127.0.0.1:16000/docs
Build the Docker image:
docker build -t similarity-scorer .
Run the container:
docker run -p 16000:16000 similarity-scorer
Alternatively, start the service with Docker Compose:
docker-compose up
To run in detached mode:
docker-compose up -d
To stop the service:
docker-compose down
To start the server in production mode:
./start.sh
For memory-constrained environments:
./start_optimized.sh
To stop any running server instances:
./terminate.sh
If regular termination doesn't work, use the force termination script (may require sudo):
sudo ./force_terminate.sh
These scripts will:
- Identify all running processes related to the similarity scorer
- Attempt to terminate them gracefully (regular script) or forcefully (force script)
- Report the status after the termination attempt (illustrated below)
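As a rough illustration of that graceful-then-forceful pattern, here is a hypothetical Python sketch using psutil; the actual shell scripts may work quite differently:

```python
# Hypothetical sketch of the terminate/force-terminate logic; the real
# scripts are shell scripts and may match processes differently.
import psutil

def terminate_scorer(force: bool = False, timeout: float = 5.0) -> None:
    # Find processes whose command line mentions the similarity scorer.
    targets = [
        p for p in psutil.process_iter(["pid", "cmdline"])
        if any("similarity" in part for part in (p.info["cmdline"] or []))
    ]
    for proc in targets:
        proc.terminate()  # SIGTERM: ask for a graceful shutdown
    gone, alive = psutil.wait_procs(targets, timeout=timeout)
    if force:
        for proc in alive:
            proc.kill()  # SIGKILL: forceful termination (may require root)
    print(f"Terminated {len(gone)} process(es); {len(alive)} still running.")
```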
The comparison endpoint takes two sentences and returns their semantic similarity.
Request Body:
{
  "sentence1": "How do I bake a cake?",
  "sentence2": "What is the process for making a cake?"
}
Response:
{
  "sentence1": "How do I bake a cake?",
  "sentence2": "What is the process for making a cake?",
  "semantic_similarity": 0.87
}
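For example, you can call the endpoint from Python with requests. The /compare path below is an assumption for illustration; check the interactive docs at /docs for the actual route:

```python
# Example client call. The "/compare" path is an assumption -- consult
# http://127.0.0.1:16000/docs for the route the server actually exposes.
import requests

resp = requests.post(
    "http://127.0.0.1:16000/compare",
    json={
        "sentence1": "How do I bake a cake?",
        "sentence2": "What is the process for making a cake?",
    },
)
resp.raise_for_status()
print(resp.json()["semantic_similarity"])  # e.g. 0.87
```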
You can change the model in two ways:
- By setting the MODEL_NAME environment variable:
# When running with Python
MODEL_NAME=all-mpnet-base-v2 uvicorn main:app --reload
# When running with Docker
docker run -p 16000:16000 -e MODEL_NAME=all-mpnet-base-v2 similarity-scorer
# Or update the environment variable in docker-compose.yml
# and then run docker-compose up
- By directly editing the default in main.py:
model_name = os.environ.get("MODEL_NAME", "all-mpnet-base-v2")
Note that larger models such as "all-mpnet-base-v2" require more computational resources but produce more accurate similarity scores.
The application is configured to cache downloaded models in the models/ directory. This means:
- The model will only be downloaded once, even if you restart the server multiple times
- Subsequent server startups will be much faster
- When using Docker, the model cache is stored in a named volume for persistence
This is especially important for larger models like "all-mpnet-base-v2" which can be several hundred MB in size.
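One plausible way to wire up this caching is the cache_folder argument of SentenceTransformer; the actual main.py may differ in detail:

```python
# Sketch: load the model with a local cache directory so restarts reuse
# already-downloaded weights. main.py may structure this differently.
import os
from sentence_transformers import SentenceTransformer

model_name = os.environ.get("MODEL_NAME", "all-MiniLM-L6-v2")
model = SentenceTransformer(model_name, cache_folder="models/")
```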
To optimize memory usage, the application uses a dedicated model service that runs in a separate process:
- Only one copy of the model is loaded in memory regardless of how many Gunicorn workers are running
- All worker processes communicate with the model service via IPC (inter-process communication)
- This significantly reduces memory usage when running with multiple workers
This architecture is particularly beneficial for large models that would otherwise consume several GB of RAM if loaded separately in each worker process.
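A bare-bones sketch of this pattern, using Python's multiprocessing connections for the IPC, might look like the following (illustrative only; the real service and its protocol are more involved):

```python
# Illustrative single-process model service that workers query over a
# multiprocessing connection; the actual implementation may differ.
from multiprocessing.connection import Listener
from sentence_transformers import SentenceTransformer, util

def serve(address=("127.0.0.1", 6000), authkey=b"scorer"):
    model = SentenceTransformer("all-MiniLM-L6-v2")  # loaded exactly once
    with Listener(address, authkey=authkey) as listener:
        while True:
            with listener.accept() as conn:
                sentence1, sentence2 = conn.recv()
                emb = model.encode([sentence1, sentence2], convert_to_tensor=True)
                conn.send(util.cos_sim(emb[0], emb[1]).item())

# A worker would connect instead of loading the model itself:
#   from multiprocessing.connection import Client
#   with Client(("127.0.0.1", 6000), authkey=b"scorer") as conn:
#       conn.send(("sentence one", "sentence two"))
#       score = conn.recv()
```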
This application is designed for memory efficiency, especially in environments with limited resources:
- Single Model Instance: The model is loaded only once in a dedicated process and shared across all workers
- Conservative Worker Count: The Gunicorn configuration uses fewer workers than typical to reduce memory usage
- Memory Monitoring: The application logs memory usage at various points to track resource utilization
- Configurable Worker Count: You can set the WORKERS environment variable to further limit workers
- Worker Lifecycle Management: Workers are restarted after handling a set number of requests to prevent memory leaks (see the configuration sketch below)
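These settings map naturally onto a Gunicorn configuration file. A sketch of what such a gunicorn.conf.py might contain (values are illustrative, not necessarily the repository's):

```python
# Sketch of a gunicorn.conf.py expressing the memory-oriented settings
# described above; the repository's actual values may differ.
import os

bind = "127.0.0.1:16000"
workers = int(os.environ.get("WORKERS", 2))      # conservative worker count
worker_class = "uvicorn.workers.UvicornWorker"   # async workers for FastAPI
max_requests = 1000                              # recycle workers to curb leaks
max_requests_jitter = 50                         # stagger worker restarts
```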
For environments with very limited memory, use the optimized startup script:
./start_optimized.sh
This script sets environment variables to optimize memory usage and provides memory usage tracking.
The application automatically detects and uses available GPU resources:
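Device selection along these lines is usually just a few lines of PyTorch; a sketch, not necessarily the app's exact logic:

```python
# Sketch of automatic device selection; the application's actual logic
# may differ in detail.
import torch

if torch.cuda.is_available():
    device = "cuda"  # NVIDIA GPUs via CUDA
elif torch.backends.mps.is_available():
    device = "mps"   # Apple Metal Performance Shaders
else:
    device = "cpu"
```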
On macOS, the application uses:
- Apple's Metal Performance Shaders (MPS) backend on Apple Silicon (M1/M2/M3) Macs
- AMD GPUs on Intel Macs with compatible graphics cards
To check if your Mac is using GPU acceleration:
- Start the server
- Visit http://127.0.0.1:16000/system-info to see device information
- Look for "mps_available": true and "current_device": "mps" in the response (or query it programmatically, as shown below)
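The same check can be done programmatically (assuming the field names shown above):

```python
# Query the system-info endpoint and report the device in use.
import requests

info = requests.get("http://127.0.0.1:16000/system-info").json()
print(info.get("mps_available"), info.get("current_device"))
```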
On systems with NVIDIA GPUs, CUDA will be used automatically.
If running in Docker, make sure to include the GPU runtime configuration as specified in the docker-compose.yml file.