Knowing in which page a string is found #788
-
Hi all, as a school we are given a collated/merged document containing students' reports. Each report is two or three pages long (sometimes four), so without automation the only way to split them correctly is to go through the document and note the page where each report starts. We have already extracted the full text, and thank God students' names are not split across multiple lines, so we can search for them. Each student's name is located at the beginning of his/her own report, so by knowing which page each name is on we can build the list of starting pages and use it to split the big document into individual reports. Of course we have already asked not to be sent such a long document, but it seems the format is not under the sender's control. Is there a general way to find the page on which a string occurs in a PDF document? If so, which functions should we use? TIA
Replies: 2 comments 3 replies
-
Hi, there are several ways to do that. Let's work out the best one for your task. Let me first confirm that I understood your problem:
So far ok? If that is the case, you do not actually need to extract the text.

    import fitz  # PyMuPDF

    # Note: newer PyMuPDF versions name these methods page.search_for / insert_pdf.
    names_list = ["name1", "name2", ...]  # list of students' names
    doc = fitz.open("large-report.pdf")
    for i, name in enumerate(names_list):
        for page in doc:
            if page.searchFor(name) != []:  # name is on that page!
                names_list[i] = (page.number, name)  # replace entry augmented by page number
                break  # we are not interested in other pages now

    names_list.sort()  # sort by page number
    names_count = len(names_list)
    for i in range(names_count):
        pdfout = fitz.open()
        f = names_list[i][0]  # start page number
        name = names_list[i][1]  # student name
        if i < names_count - 1:  # not the last item
            t = names_list[i+1][0] - 1  # last page for this student
        else:
            t = len(doc) - 1  # last student: report runs to the last page of the document
        pdfout.insertPDF(doc, from_page=f, to_page=t)
        pdfout.save(name + ".pdf")  # store sub doc under student's name
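The page-range arithmetic in the second loop is the part that is easiest to get wrong, so here is a minimal sketch that isolates it in plain Python, with no PDF library involved. The function name `page_ranges` and its parameters are hypothetical, made up for this illustration. It assumes every name was found and that no two reports start on the same page.

```python
def page_ranges(found, page_count):
    """Turn (start_page, name) pairs into (name, first_page, last_page) triples.

    `found` holds one 0-based start page per student and `page_count` is the
    number of pages in the merged document (both names are hypothetical).
    """
    ordered = sorted(found)  # order the reports by start page
    ranges = []
    for i, (start, name) in enumerate(ordered):
        if i + 1 < len(ordered):
            last = ordered[i + 1][0] - 1  # page just before the next report
        else:
            last = page_count - 1  # final report runs to the end of the document
        ranges.append((name, start, last))
    return ranges


if __name__ == "__main__":
    # Three reports in a 9-page document, starting on pages 0, 3 and 5.
    for name, first, last in page_ranges([(3, "Bob"), (0, "Alice"), (5, "Carol")], 9):
        print(name, first, last)
```

Each triple could then be passed as `from_page` / `to_page` when copying pages into the per-student output file. A name that was never found would need to be filtered out before calling this helper.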
-
Ok, thanks for your clarifications and assertions. Here is a script that works this way: