Final Finishing Touches before Final Presentation.

[NEW] AutoQPGen :: /Assets/QuestionBank_FORMAT/*.pdf :: Added two new question banks for demo purposes, pre-tested to work correctly with data extraction.
[NEW] AutoQPGen :: /Assets/QuestionBank_FORMAT/*.docx :: Added the source files (DOCX) for the newly added demo question banks.
[NEW] AutoQPGen :: /Assets/QuestionBank_FORMAT/Template/QBank-TEMPLATE.pdf :: Added a new demo template showcasing the newly accepted question bank format, which improves readability for both humans and AutoQPGen.
[NEW] AutoQPGen :: /Assets/QuestionBank_FORMAT/Template/QBank-TEMPLATE.docx :: Added the source file (DOCX) for the demo template question bank.
[NEW] AutoQPGen :: /models/QScanEngine.py :: Introduced the PDF extraction library PDFPlumber, which extracts readable text from PDFs without redundant whitespace or unwanted characters and focuses on "STRUCTURED READING" of PDF content, much as a human would read it.
[FIX] AutoQPGen :: /app.py :: Fixed a fatal error that caused the web app to crash while purging redundant documents.
[FIX] AutoQPGen :: /Assets/TestQBanks/ :: Removed this folder as a result of the move to the new format; the older question bank format is no longer supported.
[FIX] AutoQPGen :: /models/QScanEngine.py :: Replaced the FITZ data extraction library with the new PDFPlumber library.
[FIX] AutoQPGen :: /models/QScanEngine.py :: Redesigned the questionSetter() function to properly sort the processed questions before display and to group questions in ascending order of modules.
[FIX] AutoQPGen :: /models/QScanEngine.py :: Rerouted the function extract_text_from_pdf() to get data using PDFPlumber, reflecting acceptance of the new format.
[FIX] AutoQPGen :: /models/QScanEngine.py :: Reconfigured the SpaCy model to better identify Subject data from the question bank.
[FIX] AutoQPGen :: /README.md :: Removed the fitz/PyMuPDF dependency issue section, as the library is no longer used.
[FIX] AutoQPGen :: /README.md :: Added a (PDF Structured Extraction) entry to the Technologies used section.
[FIX] AutoQPGen :: /README.md :: Fixed typos in setup instructions.
[FIX] AutoQPGen :: /requirements.txt :: Added libraries {PDFPlumber, GoogleGenAI} and removed libraries {PDFkit, fitz, PyPDF2, secrets}.
azuregray · Feb 4, 2025 · 98492b6
1 parent da1bd34
commit 98492b6
Showing 19 changed files with 17 additions and 30 deletions.
diff --git a/Assets/QuestionBank_FORMAT/BCT_QB.docx b/Assets/QuestionBank_FORMAT/BCT_QB.docx
diff --git a/Assets/QuestionBank_FORMAT/BCT_QB.pdf b/Assets/QuestionBank_FORMAT/BCT_QB.pdf
diff --git a/Assets/QuestionBank_FORMAT/CN_QB_Mod1+2.docx b/Assets/QuestionBank_FORMAT/CN_QB_Mod1+2.docx
diff --git a/Assets/QuestionBank_FORMAT/CN_QB_Mod1+2.pdf b/Assets/QuestionBank_FORMAT/CN_QB_Mod1+2.pdf
diff --git a/Assets/QuestionBank_FORMAT/Template/QBank-TEMPLATE.docx b/Assets/QuestionBank_FORMAT/Template/QBank-TEMPLATE.docx
diff --git a/Assets/QuestionBank_FORMAT/Template/QBank-TEMPLATE.pdf b/Assets/QuestionBank_FORMAT/Template/QBank-TEMPLATE.pdf
diff --git a/Assets/TestQBanks/01-OriginalFormat/StandardQBANK_01.pdf b/Assets/TestQBanks/01-OriginalFormat/StandardQBANK_01.pdf
diff --git a/Assets/TestQBanks/01-OriginalFormat/StandardQBANK_02.pdf b/Assets/TestQBanks/01-OriginalFormat/StandardQBANK_02.pdf
diff --git a/Assets/TestQBanks/01-OriginalFormat/StandardQBANK_03.pdf b/Assets/TestQBanks/01-OriginalFormat/StandardQBANK_03.pdf
diff --git a/Assets/TestQBanks/02-NewFormat/TestQB_withModules.docx b/Assets/TestQBanks/02-NewFormat/TestQB_withModules.docx
diff --git a/Assets/TestQBanks/02-NewFormat/TestQB_withModules.pdf b/Assets/TestQBanks/02-NewFormat/TestQB_withModules.pdf
diff --git a/Assets/TestQBanks/02-NewFormat/TestQB_withModules_ALT.pdf b/Assets/TestQBanks/02-NewFormat/TestQB_withModules_ALT.pdf
diff --git a/Assets/TestQBanks/StandardQBANK_01.pdf b/Assets/TestQBanks/StandardQBANK_01.pdf
diff --git a/Assets/TestQBanks/StandardQBANK_02.pdf b/Assets/TestQBanks/StandardQBANK_02.pdf
diff --git a/Assets/TestQBanks/StandardQBANK_03.pdf b/Assets/TestQBanks/StandardQBANK_03.pdf
diff --git a/README.md b/README.md
@@ -25,16 +25,16 @@
 **`( GITHUB )`** **`( PIP )`** **`( MARKDOWN )`**  
 **`( SHELL SCRIPTING )`** **`( POWERSHELL )`** **`( PYTORCH )`**  
 **`( TENSORFLOW )`** **`( PDF RENDERING )`** **`( NLP )`**  
-**`( SPACY )`**
+**`( SPACY )`** **`(PDF Structured Extraction)`**
 
 ---
 ## **`DEPLOYMENT & USAGE`**
-> 1️⃣ **Step 01**: Please visit the Original Repository [**`AutoQPGen`**](https://github.com/azuregray/AutoQPGen) and find the Green `CODE` button and click on "Downlaoad ZIP".  
+> 1️⃣ **Step 01**: Please visit the Original Repository [**`AutoQPGen`**](https://github.com/azuregray/AutoQPGen) and find the Green `CODE` button and click on "Download ZIP".  
 > or just [**`Click here to download`**](https://github.com/azuregray/AutoQPGen/archive/refs/heads/main.zip).
 
 > 2️⃣ **Step 02**: Extract the downloaded `AutoQPGen-main.zip` file into its folder and open the same.
 
-> 3️⃣ **Step 03**: Once you are in the repo folder, Install the requirements.  
+> 3️⃣ **Step 03**: Once you are in the folder, Install the requirements.  
 > To take the help of `requirements.txt`, just run this in Terminal:  
 > `Interpreter: PowerShell`
 ```
@@ -59,15 +59,6 @@ python ./app.py
 > 7️⃣ **Step 07**: After running the command in `Step 06`, please do a `Ctrl + Click` on the localhost URL where the service is being hosted, which is generated in the same terminal windows running `app.py`.  
 > For example: `https://127.0.0.1:5000`
 
----
-### **`DEPENDENCY ISSUE`**
-> We have noticed that there is a general bug in PyMuPDF library which can give the following error: `Attribute Error : fitz has no attribute open()`. If you faced the same, please do not panic!  
-> Since fitz is just a wrapper for PyMuPDF library, this error can be fixed easily by force-reinstalling PyMuPDF with the following command in your Terminal:  
-> `Interpreter: PowerShell`
-```
-python -m pip install --force-reinstall pymupdf
-```
-> Refer [PyMuPDF/issues](https://github.com/pymupdf/PyMuPDF/issues/660) for more information on the same.
 ---
 ### **`QUICK TIPS`**
 > **`01`** - To quickly setup the entire project to get it ready to RUN FRESH, feel free to invoke readyApp() from kickstarter.py:  

diff --git a/app.py b/app.py
@@ -90,9 +90,9 @@ def deletePaperRemains(paperId):
     generatedDocxFolder = './static/GeneratedDocx'
     docxFileFound = search_for_file(generatedDocxFolder, paperId, '.docx')
     pdfFileFound = search_for_file(generatedPapersFolder, paperId, '.pdf')
-    if docxFileFound is not None:
+    if docxFileFound is not None and os.path.exists(docxFileFound):
         os.remove(docxFileFound)
-    if pdfFileFound is not None:
+    if pdfFileFound is not None and os.path.exists(pdfFileFound):
         os.remove(pdfFileFound)
     eventLogger(f'DOCX and PDF Remains of PaperID {paperId} were cleared.')
 

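The guard added to `deletePaperRemains` checks `os.path.exists` before calling `os.remove`. A minimal sketch of an alternative follows, assuming `search_for_file` returns either a path string or `None` (as the `is not None` checks suggest). The helper name `remove_if_present` is hypothetical and not part of the repository. Catching `FileNotFoundError` also covers the case where the file disappears between the existence check and the removal.

```python
from pathlib import Path
from typing import Optional


def remove_if_present(path: Optional[str]) -> bool:
    """Delete `path` if it names an existing file.

    Hypothetical helper: folds the `is not None` and `os.path.exists`
    checks from the diff into one call. Returns True only when a file
    was actually removed.
    """
    if path is None:
        return False
    try:
        Path(path).unlink()
    except FileNotFoundError:
        # The file was already gone; nothing left to purge.
        return False
    return True
```

With such a helper, the two guarded `os.remove` calls could each become a single `remove_if_present(...)` call.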
diff --git a/models/QScanEngine.py b/models/QScanEngine.py
@@ -1,4 +1,4 @@
-import fitz
+import pdfplumber
 import random
 import re
 import spacy
@@ -18,7 +18,7 @@ def clean_question_text(question_text):
     return finalText
 
 def questionSetter(listOfDictionaries, howMany):
-    uniqueModNums = list(set(item['modnum'] for item in listOfDictionaries)) # Finding unique module numbers using FUNDAMENTAL PROPERTY OF SETS
+    uniqueModNums = sorted(list(set(item['modnum'] for item in listOfDictionaries))) # Finding unique module numbers using FUNDAMENTAL PROPERTY OF SETS
 
     sorted_list = sorted(listOfDictionaries, key=lambda x: x['modnum']) # Sorting structured quesitions data based on ModuleNumber values
 
@@ -31,11 +31,9 @@ def questionSetter(listOfDictionaries, howMany):
     return firstModGroup[:howMany // 2] + secondModGroup[:howMany // 2] # Return ListOfDictionaries required n questions by joining first n/2 items from each list
 
 def extract_text_from_pdf(pdf_path):
-    with fitz.open(pdf_path) as pdf:
-        text = ""
-        for page in pdf:
-            text += page.get_text()
-    return text
+    with pdfplumber.open(pdf_path) as pdf:
+        text = "\n".join([page.extract_text() for page in pdf.pages])
+        return text
 
 # Function to extract information from text using spaCy NER model
 def extract_info_with_ner(text):
@@ -46,8 +44,8 @@ def extract_info_with_ner(text):
         "facultyName": None,
         "qBankContent": []
     }
-    subjectName = re.search(r"Subject Name:\s*(.*)", text)
-    subjectCode = re.search(r"Subject Code:\s*(\S+)", text)
+    subjectName = re.search(r"Subject Name\s*:\s*(.*)", text)
+    subjectCode = re.search(r"Subject Code\s*:\s*(\S+)", text)
     semester = re.search(r"SEM\s*:\s*(\S+)", text)
     facultyName = re.search(r"Faculty\s*:\s*(.*)", text)
 

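The two changes to `models/QScanEngine.py` above can be sketched together: joining per-page text from pdfplumber, and matching header fields with optional whitespace before the colon. This is a sketch under assumptions, not the repository's code. `join_page_text`, `parse_header`, and `HEADER_PATTERNS` are hypothetical names, and the dictionary keys other than `facultyName` are borrowed from the variable names in the diff. Depending on the pdfplumber version, `page.extract_text()` may return `None` for a page with no text, so the sketch skips empty results before joining.

```python
import re
from typing import Dict, Iterable, Optional


def join_page_text(pages: Iterable) -> str:
    """Join page texts with newlines, skipping pages that yield no text.

    `pages` is any iterable of objects exposing `extract_text()`,
    such as `pdf.pages` from `pdfplumber.open(...)`.
    """
    texts = (page.extract_text() for page in pages)
    return "\n".join(t for t in texts if t)


def extract_text_from_pdf(pdf_path: str) -> str:
    """Read all text from a PDF using pdfplumber."""
    import pdfplumber  # imported lazily so the helpers work without it

    with pdfplumber.open(pdf_path) as pdf:
        return join_page_text(pdf.pages)


# Patterns copied from the diff; `\s*` before the colon accepts both
# "Subject Name:" and "Subject Name :".
HEADER_PATTERNS = {
    "subjectName": r"Subject Name\s*:\s*(.*)",
    "subjectCode": r"Subject Code\s*:\s*(\S+)",
    "semester": r"SEM\s*:\s*(\S+)",
    "facultyName": r"Faculty\s*:\s*(.*)",
}


def parse_header(text: str) -> Dict[str, Optional[str]]:
    """Return each header field's captured value, or None when absent."""
    result: Dict[str, Optional[str]] = {}
    for key, pattern in HEADER_PATTERNS.items():
        match = re.search(pattern, text)
        result[key] = match.group(1).strip() if match else None
    return result
```

Keeping the joining logic separate from the pdfplumber call means it can be exercised with simple stand-in page objects, without a real PDF on disk.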
diff --git a/requirements.txt b/requirements.txt
@@ -1,13 +1,11 @@
-pdfkit
-fitz
-PyPDF2
+pdfplumber
 flask
 pathlib
-tools
-frontend
-secrets
 spacy
 docxtpl
 docx2pdf
+pywin32
+google-generativeai
 tk
-pywin32
+tools
+frontend