This repository was archived by the owner on Feb 11, 2025. It is now read-only.

Final Finishing Touches before Final Presentation.
[NEW] AutoQPGen :: /Assets/QuestionBank_FORMAT/*.pdf :: Added two new question banks for demo purposes, both pre-tested for correct data extraction.
[NEW] AutoQPGen :: /Assets/QuestionBank_FORMAT/*.docx :: Added source files (DOCX) for the question banks that are newly added for demo purposes.
[NEW] AutoQPGen :: /Assets/QuestionBank_FORMAT/Template/QBank-TEMPLATE.pdf :: Added a demo template showcasing the newly accepted question bank format, which improves readability for both humans and AutoQPGen.
[NEW] AutoQPGen :: /Assets/QuestionBank_FORMAT/Template/QBank-TEMPLATE.docx :: Added source file (DOCX) for the demo template question bank.
[NEW] AutoQPGen :: /models/QScanEngine.py :: Introduced the PDFPlumber PDF extraction library, which extracts readable text from PDFs without redundant whitespace or other unwanted characters and focuses on "STRUCTURED READING" of PDF content, closer to how a human reads the page.

[FIX] AutoQPGen :: /app.py :: Fixed a fatal error that caused the web app to crash while purging redundant documents.
[FIX] AutoQPGen :: /Assets/TestQBanks/ :: Removed this folder as a result of moving to the new format; the older question bank format is no longer supported.
[FIX] AutoQPGen :: /models/QScanEngine.py :: Replaced the FITZ data extraction library with the new PDFPlumber library.
[FIX] AutoQPGen :: /models/QScanEngine.py :: Redesigned the questionSetter() function to sort the processed questions before display and group them by module in ascending order.
[FIX] AutoQPGen :: /models/QScanEngine.py :: Rerouted extract_text_from_pdf() to read data through PDFPlumber, in line with the newly accepted format.
[FIX] AutoQPGen :: /models/QScanEngine.py :: Reconfigured SpaCy model to better figure out Subject data from question bank.
[FIX] AutoQPGen :: /README.md :: Removed the fitz/PyMuPDF dependency issue section, since the library is no longer used.
[FIX] AutoQPGen :: /README.md :: Added a (PDF Structured Extraction) entry to the Technologies used section.
[FIX] AutoQPGen :: /README.md :: Fixed typos in setup instructions.
[FIX] AutoQPGen :: /requirements.txt :: Added libraries {PDFPlumber, GoogleGenAI} and removed libraries {PDFkit, fitz, PyPDF2, secrets}.
azuregray committed Feb 4, 2025
1 parent da1bd34 commit 98492b6
Showing 19 changed files with 17 additions and 30 deletions.
Binary file added Assets/QuestionBank_FORMAT/BCT_QB.docx
Binary file not shown.
Binary file added Assets/QuestionBank_FORMAT/BCT_QB.pdf
Binary file not shown.
Binary file added Assets/QuestionBank_FORMAT/CN_QB_Mod1+2.docx
Binary file not shown.
Binary file added Assets/QuestionBank_FORMAT/CN_QB_Mod1+2.pdf
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file removed Assets/TestQBanks/StandardQBANK_01.pdf
Binary file not shown.
Binary file removed Assets/TestQBanks/StandardQBANK_02.pdf
Binary file not shown.
Binary file removed Assets/TestQBanks/StandardQBANK_03.pdf
Binary file not shown.
15 changes: 3 additions & 12 deletions README.md
@@ -25,16 +25,16 @@
**`( GITHUB )`** **`( PIP )`** **`( MARKDOWN )`**
**`( SHELL SCRIPTING )`** **`( POWERSHELL )`** **`( PYTORCH )`**
**`( TENSORFLOW )`** **`( PDF RENDERING )`** **`( NLP )`**
-**`( SPACY )`**
+**`( SPACY )`** **`(PDF Structured Extraction)`**

---
## **`DEPLOYMENT & USAGE`**
-> 1️⃣ **Step 01**: Please visit the Original Repository [**`AutoQPGen`**](https://github.com/azuregray/AutoQPGen) and find the Green `CODE` button and click on "Downlaoad ZIP".
+> 1️⃣ **Step 01**: Please visit the Original Repository [**`AutoQPGen`**](https://github.com/azuregray/AutoQPGen) and find the Green `CODE` button and click on "Download ZIP".
> or just [**`Click here to download`**](https://github.com/azuregray/AutoQPGen/archive/refs/heads/main.zip).
> 2️⃣ **Step 02**: Extract the downloaded `AutoQPGen-main.zip` file into its folder and open the same.
-> 3️⃣ **Step 03**: Once you are in the repo folder, Install the requirements.
+> 3️⃣ **Step 03**: Once you are in the folder, Install the requirements.
> To take the help of `requirements.txt`, just run this in Terminal:
> `Interpreter: PowerShell`
```
@@ -59,15 +59,6 @@ python ./app.py
> 7️⃣ **Step 07**: After running the command in `Step 06`, please do a `Ctrl + Click` on the localhost URL where the service is being hosted, which is generated in the same terminal windows running `app.py`.
> For example: `https://127.0.0.1:5000`
---
-### **`DEPENDENCY ISSUE`**
-> We have noticed that there is a general bug in PyMuPDF library which can give the following error: `Attribute Error : fitz has no attribute open()`. If you faced the same, please do not panic!
-> Since fitz is just a wrapper for PyMuPDF library, this error can be fixed easily by force-reinstalling PyMuPDF with the following command in your Terminal:
-> `Interpreter: PowerShell`
-```
-python -m pip install --force-reinstall pymupdf
-```
-> Refer [PyMuPDF/issues](https://github.com/pymupdf/PyMuPDF/issues/660) for more information on the same.
----
### **`QUICK TIPS`**
> **`01`** - To quickly setup the entire project to get it ready to RUN FRESH, feel free to invoke readyApp() from kickstarter.py:
4 changes: 2 additions & 2 deletions app.py
@@ -90,9 +90,9 @@ def deletePaperRemains(paperId):
     generatedDocxFolder = './static/GeneratedDocx'
     docxFileFound = search_for_file(generatedDocxFolder, paperId, '.docx')
     pdfFileFound = search_for_file(generatedPapersFolder, paperId, '.pdf')
-    if docxFileFound is not None:
+    if docxFileFound is not None and os.path.exists(docxFileFound):
         os.remove(docxFileFound)
-    if pdfFileFound is not None:
+    if pdfFileFound is not None and os.path.exists(pdfFileFound):
         os.remove(pdfFileFound)
     eventLogger(f'DOCX and PDF Remains of PaperID {paperId} were cleared.')
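
The guard in this hunk checks that the path exists before calling `os.remove`. A minimal sketch of an alternative, assuming the goal is simply "delete the file if it is there": attempt the deletion and treat a missing file as a non-event, which also covers the case where the file disappears between the check and the removal. The helper name `remove_if_present` is hypothetical and not part of `app.py`.

```python
from pathlib import Path
from typing import Optional


def remove_if_present(file_path: Optional[str]) -> bool:
    """Delete file_path if it exists; return True only when a file was removed."""
    if file_path is None:
        # Mirrors the "is not None" guard: the search found nothing to delete
        return False
    try:
        Path(file_path).unlink()
    except FileNotFoundError:
        # The file was already gone, so there is nothing left to purge
        return False
    return True
```

Under this assumption, `deletePaperRemains` could call the helper once for the DOCX path and once for the PDF path, and log only the files that were actually removed.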

16 changes: 7 additions & 9 deletions models/QScanEngine.py
@@ -1,4 +1,4 @@
-import fitz
+import pdfplumber
 import random
 import re
 import spacy
@@ -18,7 +18,7 @@ def clean_question_text(question_text):
     return finalText

 def questionSetter(listOfDictionaries, howMany):
-    uniqueModNums = list(set(item['modnum'] for item in listOfDictionaries)) # Finding unique module numbers using FUNDAMENTAL PROPERTY OF SETS
+    uniqueModNums = sorted(list(set(item['modnum'] for item in listOfDictionaries))) # Finding unique module numbers using FUNDAMENTAL PROPERTY OF SETS

     sorted_list = sorted(listOfDictionaries, key=lambda x: x['modnum']) # Sorting structured quesitions data based on ModuleNumber values

@@ -31,11 +31,9 @@ def questionSetter(listOfDictionaries, howMany):
     return firstModGroup[:howMany // 2] + secondModGroup[:howMany // 2] # Return ListOfDictionaries required n questions by joining first n/2 items from each list
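
The middle of `questionSetter()` is not visible in this diff, so the following is only a sketch of the grouping idea under stated assumptions: questions are dictionaries with a `modnum` key (from the source), the `question` key and the function name `pick_questions` are hypothetical, and exactly the two lowest-numbered modules contribute `howMany // 2` questions each. It also illustrates why the commit wraps the set in `sorted()`: iteration order over a bare set is not guaranteed.

```python
from collections import defaultdict


def pick_questions(question_dicts, how_many):
    """Take how_many // 2 questions from each of the two lowest-numbered modules."""
    grouped = defaultdict(list)
    for item in question_dicts:
        grouped[item["modnum"]].append(item)

    # Sorting the module numbers makes "first" and "second" module deterministic
    module_numbers = sorted(grouped)
    if len(module_numbers) < 2:
        raise ValueError("expected questions from at least two modules")

    half = how_many // 2
    first_group = grouped[module_numbers[0]]
    second_group = grouped[module_numbers[1]]
    return first_group[:half] + second_group[:half]
```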

 def extract_text_from_pdf(pdf_path):
-    with fitz.open(pdf_path) as pdf:
-        text = ""
-        for page in pdf:
-            text += page.get_text()
-    return text
+    with pdfplumber.open(pdf_path) as pdf:
+        text = "\n".join([page.extract_text() for page in pdf.pages])
+    return text
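
One caveat with joining the result of `page.extract_text()` directly: depending on the pdfplumber version, a page without a text layer (for example a scanned image) may yield `None` rather than an empty string, and `"\n".join` raises `TypeError` on `None`. A defensive sketch, assuming empty pages should simply contribute an empty line; the helper name `join_page_text` is hypothetical and not part of `QScanEngine.py`.

```python
def join_page_text(pages):
    """Join per-page text with newlines, treating pages with no extractable text as empty."""
    return "\n".join((page.extract_text() or "") for page in pages)


def extract_text_from_pdf(pdf_path):
    # Imported lazily so join_page_text can be exercised without pdfplumber installed
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return join_page_text(pdf.pages)
```

Keeping the join in its own function makes it possible to test the behaviour with stand-in page objects, without needing a real PDF.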

# Function to extract information from text using spaCy NER model
def extract_info_with_ner(text):
@@ -46,8 +44,8 @@ def extract_info_with_ner(text):
         "facultyName": None,
         "qBankContent": []
     }
-    subjectName = re.search(r"Subject Name:\s*(.*)", text)
-    subjectCode = re.search(r"Subject Code:\s*(\S+)", text)
+    subjectName = re.search(r"Subject Name\s*:\s*(.*)", text)
+    subjectCode = re.search(r"Subject Code\s*:\s*(\S+)", text)
     semester = re.search(r"SEM\s*:\s*(\S+)", text)
     facultyName = re.search(r"Faculty\s*:\s*(.*)", text)
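
The change from `Subject Name:` to `Subject Name\s*:` lets the pattern tolerate whitespace before the colon, which the old pattern rejected. A small sketch using the four patterns from this hunk; the `parse_header` helper, the `HEADER_PATTERNS` table, and the sample header text are made up for illustration, and real question banks may lay the header out differently.

```python
import re

# Patterns copied from the hunk above; the dictionary keys follow the names used there
HEADER_PATTERNS = {
    "subjectName": r"Subject Name\s*:\s*(.*)",
    "subjectCode": r"Subject Code\s*:\s*(\S+)",
    "semester": r"SEM\s*:\s*(\S+)",
    "facultyName": r"Faculty\s*:\s*(.*)",
}


def parse_header(text):
    """Return the first match for each header field, or None when a field is absent."""
    fields = {}
    for name, pattern in HEADER_PATTERNS.items():
        match = re.search(pattern, text)
        fields[name] = match.group(1).strip() if match else None
    return fields


# Hypothetical sample: note the space before the colon on the first and third lines
sample = "Subject Name : Computer Networks\nSubject Code: CS101\nSEM : 5\n"
print(parse_header(sample))
```

Returning `None` for a missing field keeps the caller from calling `.group()` on a failed match.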

12 changes: 5 additions & 7 deletions requirements.txt
@@ -1,13 +1,11 @@
-pdfkit
-fitz
-PyPDF2
+pdfplumber
 flask
 pathlib
-tools
-frontend
-secrets
 spacy
 docxtpl
 docx2pdf
-pywin32
+google-generativeai
 tk
+pywin32
+tools
+frontend
