forked from petermr/semanticClimate
PDF2TXT
*Work of 03.06.2022
*What did I do?
Cloned pdf2txt.py in the Git Bash terminal.
Downloaded the IPCC report in PDF format from the web.
Ran the following commands in the terminal:
mkdir Chapter01
cp Chapter01.pdf Chapter01/fulltext.pdf
cd Chapter01
pdf2txt.py -o fulltext.txt fulltext.pdf
ls
fulltext.pdf fulltext.txt
If fulltext.txt is not listed after running the ls command, install pdfminer.six and execute the following commands in the Command Prompt:
cd pdfminer.six
python tools/pdf2txt.py "Copy path.pdf" -o "Copy path.txt"
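The terminal steps above can also be sketched in Python. This is only a rough equivalent, not the method used in these notes: the function names prepare_chapter and convert_to_text are made up for illustration, and the conversion step assumes pdfminer.six is installed and uses its extract_text helper rather than the pdf2txt.py script, so the output may be similar but not necessarily identical.

```python
import shutil
from pathlib import Path


def prepare_chapter(pdf_path, chapter_dir):
    """Create chapter_dir and copy the PDF into it as fulltext.pdf."""
    chapter_dir = Path(chapter_dir)
    chapter_dir.mkdir(parents=True, exist_ok=True)  # like: mkdir Chapter01
    target = chapter_dir / "fulltext.pdf"
    shutil.copyfile(pdf_path, target)  # like: cp Chapter01.pdf Chapter01/fulltext.pdf
    return target


def convert_to_text(pdf_file):
    """Extract the text of pdf_file and write it next to it as a .txt file."""
    # Imported here so the folder step still works before pdfminer.six is installed.
    from pdfminer.high_level import extract_text

    pdf_file = Path(pdf_file)
    txt_file = pdf_file.with_suffix(".txt")  # fulltext.pdf -> fulltext.txt
    txt_file.write_text(extract_text(str(pdf_file)), encoding="utf-8")
    return txt_file
```

For example, convert_to_text(prepare_chapter("Chapter01.pdf", "Chapter01")) should leave fulltext.pdf and fulltext.txt in the Chapter01 folder, the same two files the ls command lists above.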
*What are the results?
A folder named Chapter04 was created in the path specified in the terminal. Two files named Chapter04 were saved in it: one in .pdf format and one in .txt format.
*What do they mean?
This means pdf2txt.py works in a Windows 11 environment with 8 GB RAM, and the code used here successfully converted the PDF file to full text.
Both copies were saved in the specified folder.
*Are there any errors or problems?
The PDF has numbered lines. In the txt file, the line numbers are converted first and the text after them, so the output becomes jumbled. The same thing happens for the tables and diagrams in the PDF.
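One possible clean-up for the line-number problem, sketched in Python. This was not tried in the work above; it assumes each margin line number ends up on its own line in fulltext.txt, and the name strip_line_numbers is made up for illustration. It would also drop any genuine line that contains only a number (for example a lone table cell), and it does nothing for the jumbled tables and diagrams.

```python
import re

# A line that holds nothing but an integer, as a margin line number would.
LINE_NUMBER = re.compile(r"^\s*\d+\s*$")


def strip_line_numbers(text):
    """Drop lines that consist only of digits; keep all other lines in order."""
    kept = [line for line in text.splitlines() if not LINE_NUMBER.match(line)]
    return "\n".join(kept)
```

It could be applied to the contents of fulltext.txt after conversion, with the cleaned text written to a separate file so the raw output is kept for comparison.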