A resume parsing and scoring system built with Streamlit and LLM technology. The system extracts structured information from resumes and produces a comprehensive score based on multiple components.
- **Main Information**: Uses an LLM to extract structured data, including:
  - Personal Information (name, contact details, etc.)
  - Academic Performance
  - Technical Skills
  - Projects and Internships
- **Extra-Curricular Activities**: Uses pattern-based extraction for:
  - Leadership Roles
  - Awards and Achievements
  - Certifications
  - Activities and Involvement
  - Language Proficiency
The system employs a 100-point scoring mechanism divided into four components:
**Academic Performance (20%)**
- Formula: `(CGPA × 0.75) + ((10 - Std Dev) × 0.25)`
- Considers both CGPA and consistency across semesters
- Standard deviation of SGPAs is used to measure consistency
- Score normalized to a maximum of 20%
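A minimal sketch of this component in Python (assumptions: CGPA is taken as the mean of the semester SGPAs, and the raw 10-point result is scaled linearly to the 20-point cap; the project's actual implementation may differ):

```python
import statistics
from typing import List

def academic_score(sgpas: List[float]) -> float:
    """(CGPA * 0.75) + ((10 - std dev of SGPAs) * 0.25), scaled to 20."""
    cgpa = statistics.mean(sgpas)                  # assumption: CGPA = mean of SGPAs
    std_dev = statistics.pstdev(sgpas)             # consistency across semesters
    raw = (cgpa * 0.75) + ((10 - std_dev) * 0.25)  # raw score out of 10
    return raw / 10 * 20                           # normalize to the 20% cap

print(round(academic_score([8.2, 8.6, 8.4, 8.8]), 2))  # 17.64
```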
**Technical Skills (35%)**
Points are allocated based on the number of skills in each category:
- 0 skills: 0 points
- 1-3 skills: 1 point
- 4+ skills: 2 points
Categories:
- Programming Languages
- Frameworks
- Databases
- Other Technologies
- Knowledge Areas
Maximum raw points: 10 (2 points × 5 categories). The final score is normalized to 35%.
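A sketch of this rule, assuming the parsed skills arrive as a mapping from the five categories above to lists of skill names:

```python
from typing import Dict, List

def skills_score(skills: Dict[str, List[str]]) -> float:
    """0/1/2 points per category based on skill count, scaled to the 35% cap."""
    def category_points(count: int) -> int:
        if count == 0:
            return 0
        return 1 if count <= 3 else 2

    raw = sum(category_points(len(items)) for items in skills.values())
    return raw / 10 * 35  # max raw = 2 points x 5 categories = 10
```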
**Projects and Internships (30%)**
Points per project:
- Base Project: 5 points
- Internship Bonus: +5 points (if company is not "personal", "na", or "n/a")
- Technical Relevance: 0-10 points based on:
  - Programming Languages used
  - Frameworks utilized
  - Databases implemented
  - Knowledge areas applied
Score normalized to a maximum of 30%
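A sketch of the per-project rules (the `company` and `relevance` field names and the `max_raw` normalization base are assumptions; the README states only the point rules and the 30% cap):

```python
from typing import List

def project_score(projects: List[dict], max_raw: float = 20.0) -> float:
    """Base + internship bonus + technical relevance per project, capped at 30."""
    raw = 0.0
    for p in projects:
        raw += 5  # base points for each project
        if str(p.get("company", "na")).lower() not in {"personal", "na", "n/a"}:
            raw += 5  # internship bonus
        raw += min(float(p.get("relevance", 0)), 10)  # technical relevance, 0-10
    return min(raw / max_raw * 30, 30)  # normalize to the 30% cap
```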
**Extra-Curricular Activities (15%)**
Points are allocated based on the total number of activities across:
- Leadership Roles
- Awards
- Certifications
- Activities
Scoring scale:
- 0 activities: 0 points
- 1-3 activities: 1 point
- 4+ activities: 2 points
Score normalized to a maximum of 15%
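A sketch of this component, assuming the extractor returns the four categories as lists. Because the four caps sum to 20 + 35 + 30 + 15 = 100, the final score is simply the sum of the normalized components:

```python
from typing import Dict, List

def extracurricular_score(entries: Dict[str, List[str]]) -> float:
    """0/1/2 raw points by total activity count, scaled to the 15% cap."""
    total = sum(len(items) for items in entries.values())
    points = 0 if total == 0 else (1 if total <= 3 else 2)
    return points / 2 * 15  # assumption: 2 raw points map to the full 15%

# Final score out of 100:
# academic_score(...) + skills_score(...) + project_score(...) + extracurricular_score(...)
```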
- Structured CSV files for each component
- Raw LLM response in JSON format
- Comprehensive metadata with scores
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Set up environment variables in `.env`:
  ```
  DEEPSEEK_API_KEY=your_api_key
  DEEPSEEK_URL=your_api_url
  ```
- Run the application:
  ```bash
  streamlit run app.py
  ```
The system generates the following files under `data/parsed_data/{reg_no}/`:
- `{reg_no}_metadata.csv`: Personal information and overall scores
- `{reg_no}_academic.csv`: Semester-wise academic performance
- `{reg_no}_skills.csv`: Technical skills breakdown
- `{reg_no}_projects.csv`: Project and internship details
- `{reg_no}_extracurricular.csv`: Extra-curricular activities

Raw LLM responses are stored in `data/raw_responses/{reg_no}/llm_response.json`.
- **Interactive Dashboard**
  - Upload interface for PDF resumes (sketched below)
  - Real-time parsing and scoring
  - Visual score breakdown
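The upload flow can be illustrated with a minimal Streamlit sketch (the widget label is illustrative and `parse_resume` is a hypothetical placeholder for the project's pipeline):

```python
import streamlit as st

uploaded = st.file_uploader("Upload a resume PDF", type="pdf")
if uploaded is not None:
    pdf_bytes = uploaded.read()  # raw bytes of the uploaded PDF
    # result = parse_resume(pdf_bytes)  # hypothetical parsing entry point
    st.success(f"Received {uploaded.name} ({len(pdf_bytes)} bytes)")
```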
- **Detailed Visualizations**
  - Academic performance trends
  - Score distribution charts
  - Skills and project matrices
- **Data Export**
  - Download parsed data in CSV format
  - Access raw LLM responses
  - Export structured metadata
- **LLM Integration**
  - Uses the DeepSeek API for main resume parsing
  - Structured output using Pydantic models (see the sketch below)
  - Token usage tracking
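For example, validating the LLM output against a Pydantic v2 schema might look like this (model and field names are illustrative assumptions based on the extracted fields described above, not the project's actual schema):

```python
from typing import Dict, List
from pydantic import BaseModel

class SemesterRecord(BaseModel):
    semester: int
    sgpa: float

class ParsedResume(BaseModel):
    name: str
    reg_no: str
    email: str
    academics: List[SemesterRecord]
    skills: Dict[str, List[str]]

# Validate the raw LLM response (a JSON string) against the schema:
# resume = ParsedResume.model_validate_json(llm_response)
```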
- **Pattern-Based Extraction**
  - Custom extractor for extra-curricular activities
  - Regex-based section identification (sketched below)
  - Smart text cleaning and formatting
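Regex-based section identification could work along these lines (the header patterns are illustrative guesses, not the extractor's actual regexes):

```python
import re
from typing import Dict

SECTION_PATTERN = re.compile(
    r"^(LEADERSHIP|AWARDS|CERTIFICATIONS|ACTIVITIES|LANGUAGES)\b.*$",
    re.IGNORECASE | re.MULTILINE,
)

def split_sections(text: str) -> Dict[str, str]:
    """Return {section name: section body} for each matched header line."""
    matches = list(SECTION_PATTERN.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        start = match.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1).lower()] = text[start:end].strip()
    return sections
```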
- **Scoring System**
  - Component-wise score calculation
  - Normalization for each category
  - Weighted aggregation for the final score
- Python 3.8+
- DeepSeek API credentials
- PDF files in NITK resume format
- Clone the repository:
  ```bash
  git clone https://github.com/yourusername/resume-parser.git
  cd resume-parser
  ```
- Create and activate a virtual environment (recommended):
  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```
- Install required packages:
  ```bash
  pip install -r requirements.txt
  ```
- Set up environment variables: create a `.env` file in the project root with:
  ```
  DEEPSEEK_API_KEY=your_api_key_here
  DEEPSEEK_URL=https://api.deepseek.com
  ```
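These variables can then be read at runtime, for instance with python-dotenv (a sketch of one common approach, not necessarily the project's exact loading code):

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env in the project root
api_key = os.environ["DEEPSEEK_API_KEY"]
base_url = os.getenv("DEEPSEEK_URL", "https://api.deepseek.com")
```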
Use this option when you have a combined PDF containing multiple resumes and want to:
- Split them into individual PDFs
- Extract information from each resume
- Save the data in CSV format
- Place your combined PDF in the data folder:
  ```bash
  mkdir -p data
  cp your_combined_resumes.pdf data/resumes_compiled.pdf
  ```
- Run the main script:
  ```bash
  python -m resume_parser.main
  ```
The script will:
- Split the combined PDF into individual resumes in `data/output/pdfs/` (see the sketch below)
- Create parsed data in CSV format in `data/output/parsed_data/`
- Place each student's data in a separate folder named by their registration number
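The splitting step could be sketched with pypdf as follows (assuming each resume is exactly one page; the project's actual boundary detection may differ):

```python
from pathlib import Path
from pypdf import PdfReader, PdfWriter

out_dir = Path("data/output/pdfs")
out_dir.mkdir(parents=True, exist_ok=True)

reader = PdfReader("data/resumes_compiled.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)  # assumption: one page per resume
    with open(out_dir / f"resume_{i + 1}.pdf", "wb") as fh:
        writer.write(fh)
```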
Use this option to analyze individual resumes with a visual interface.
- Run the Streamlit app:
  ```bash
  streamlit run app.py
  ```
- Open your browser and go to http://localhost:8501
- Use the interface to:
  - Upload individual resume PDFs
  - View parsed information in a structured format
  - See academic performance visualizations
  - Get skill summaries
  - View project details
- Sign up for a DeepSeek account at https://deepseek.com
- Create an API key:
  - Go to your account settings
  - Navigate to the API section
  - Generate a new API key
- Add the API key to your `.env` file:
  ```
  DEEPSEEK_API_KEY=your_api_key_here
  DEEPSEEK_URL=https://api.deepseek.com
  ```
Run the test suite to verify everything is working:
```bash
pytest tests/ -v --cov=resume_parser
```