# Introduction {.unnumbered}
Welcome to the course "Turning PDFs into Research Data".
BERD Academy is part of BERD\@NFDI, which is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project number 460037581.
Unless otherwise indicated, the contents of this course are licensed under CC BY-NC 4.0.
## Topics
- Methods for extracting text and files from websites using tools such as **Selenium** and how to avoid common pitfalls.
- Methods for extracting text from images, such as scans of written documents.
- Technologies that can help automate data extraction from harvested text, and a critical review of common data quality issues.
## Format
This is an online course.
- **Week 1**: Watch pre-recorded video lectures covering the relevant theory and demonstrations of the example exercises. The topics are web scraping and OCR (\~45 min). Interactive Online Session (\~60 min).
- **Week 2**: Apply last week's lessons to the example coding exercise or your own project (\~30 min). Interactive Online Session (\~60 min).
- **Week 3**: Watch pre-recorded video lectures covering the relevant theory and demonstrations of the example exercises. The topics are NLP and common data extraction issues (\~30 min). Interactive Online Session (\~60 min).
- **Week 4**: Apply last week's lessons to the example coding exercise or your own project (\~30 min). Interactive Online Session (\~60 min).
## Weekly Meetings
The course includes four live online meetings, in which you will discuss the week's contents with the instructor and fellow participants:
- Meeting 1: **Aug 27**, 2024, 4:30pm -- 5:30pm CEST
- Meeting 2: **Sep 03**, 2024, 4:30pm -- 5:30pm CEST
- Meeting 3: **Sep 10**, 2024, 4:30pm -- 5:30pm CEST
- Meeting 4: **Sep 17**, 2024, 4:30pm -- 5:30pm CEST
## Prerequisites
- Basic programming knowledge (R, Python, ...)
- Note that the course will be in Python, but if you only know R, this is still ok! The code examples are simple and will run entirely on [Google Colab](https://colab.research.google.com/), meaning you will not have to install anything. This course is a good opportunity to try Python for the first time, and you can also try the self-paced [BERD introduction to Python course](https://www.berd-nfdi.de/berd-academy/data-science-with-python/).
- Willingness to learn new technical skills
- A [Google Account](https://support.google.com/accounts/answer/27441?hl=en)
## About the Instructor
[John 'Jack' Collins](https://www.uni-mannheim.de/gess/programs/cdss/our-students/2022/john-jack-collins/) is a PhD student in Sociology at the Graduate School of Economic and Social Sciences. He holds a Bachelor's degree in Sociology with Honours from the Australian National University and a Master's degree in Data Science from James Cook University. His Master's project concerned predictive modelling of student attrition from sub-tertiary courses in Australia. During his Master's studies, he also assisted with research projects on social attitudes and voting behaviour in Australia. Before starting his PhD, Jack was a Senior IT Consultant specialising in data engineering, analytics and software development. He is interested in applying data science and IT to sociological research, particularly machine learning, analytics, and web applications.
## What to prepare
- If you want to code in Python, you will need a Google Account so you can use [Google Colab](https://colab.research.google.com/). You may also need to use your Google Account to open an account with [Llama AI](https://console.llama-api.com/). We will do this together during the course if necessary, so there is no need to prepare it beforehand.
## Course Materials
- [An example single PDF](https://github.com/BERD-NFDI/turning-pdfs-into-research-data.io/blob/e52796ed0800105ded224ec691ddaf0c415f128d/docs/Gemeinde_Kirchheim_Sportsfield.pdf)
- For those students who don't have a particular project of their own, here's a PDF you could practise with: <https://arxiv.org/abs/2205.14135>
- [A zipped file of many example PDFs](https://github.com/BERD-NFDI/turning-pdfs-into-research-data.io/blob/e52796ed0800105ded224ec691ddaf0c415f128d/docs/pdfs.zip)
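
If you would like to load these materials directly in a Colab notebook, the following is a minimal sketch rather than the course's official method. It assumes that a GitHub `/blob/` link can be turned into a direct file link by swapping in `/raw/`, and the function names (`blob_to_raw_url`, `download_bytes`, `list_pdfs_in_zip`) are hypothetical helpers introduced only for illustration.

```python
import io
import urllib.request
import zipfile


def blob_to_raw_url(blob_url: str) -> str:
    """Turn a GitHub 'blob' page URL into a URL that serves the file itself.

    Assumption: GitHub serves the raw file when '/blob/' is replaced by '/raw/'.
    """
    # Replace only the first "/blob/" so later path segments are left untouched.
    return blob_url.replace("/blob/", "/raw/", 1)


def download_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Download a URL and return its content as bytes (needs internet access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def list_pdfs_in_zip(zip_bytes: bytes) -> list[str]:
    """Return the sorted names of all PDF files inside a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return sorted(
            name for name in archive.namelist() if name.lower().endswith(".pdf")
        )
```

In a notebook cell, calling `list_pdfs_in_zip(download_bytes(blob_to_raw_url(url)))` with the zip link above as `url` should list the PDF files in the archive, provided the link is still available.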
## Readings and external resources
- (optional content) If you want to understand Large Language Models (LLMs) on a more technical level, this video gives an excellent visual overview of the mathematics involved.
3Blue1Brown (Director). (2024, April 1). But what is a GPT? Visual intro to transformers \| Chapter 5, Deep Learning. <https://www.youtube.com/watch?v=wjZofJX0v4M>
- (optional content) If you want a more advanced discussion of web scraping, this is a dedicated textbook (not necessary for this course).
Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining \| Wiley. (n.d.). Wiley.Com. Retrieved 3 June 2024, from <https://www.wiley.com/en-us/Automated+Data+Collection+with+R%3A+A+Practical+Guide+to+Web+Scraping+and+Text+Mining-p-9781118834817>
- The applied examples from this course are based on work by a team of researchers from this project.
DSSGxMunich/land-sealing-dataset-and-analysis: This repo offers the code developed during DSSGx Munich 2023 for producing and analysing a comprehensive dataset about land parcels from the state of NRW in Germany. (n.d.). Retrieved 3 June 2024, from <https://github.com/DSSGxMunich/land-sealing-dataset-and-analysis>
- Using the AI model 'BERT' to classify open-ended text, with some pre-training of BERT involved.
Gweon, H., & Schonlau, M. (2024). Automated classification for open-ended questions with BERT. Journal of Survey Statistics and Methodology, 12(2), 493--504. <https://doi.org/10.1093/jssam/smad015>
- Comparing LLMs to dedicated deep learning models for named entity extraction.
How effective are large language models in named entity extraction compared to traditional machine learning algorithms? \| 10 Answers from Research papers. (n.d.). SciSpace - Question. Retrieved 3 June 2024, from <https://typeset.io/questions/how-effective-are-large-language-models-in-named-entity-p29rfup468>
- This researcher conducted the web scraping project which supplies many examples in this course.
ldmnch. (2024). Ldmnch/bavaria-building-plans-digitalization \[Jupyter Notebook\]. <https://github.com/ldmnch/bavaria-building-plans-digitalization> (Original work published 2023)
- Comparison of several OCR tools.
OCR in 2024: Benchmarking Text Extraction/Capture Accuracy. (n.d.). AIMultiple: High Tech Use Cases & Tools to Grow Your Business. Retrieved 3 June 2024, from <https://research.aimultiple.com/ocr-accuracy/>
- Free web article providing detailed code examples of web scraping in Python.
Real Python. (n.d.). A Practical Introduction to Web Scraping in Python. Retrieved 3 June 2024, from <https://realpython.com/python-web-scraping-practical-introduction/>
- Website designed to help developers practice web scraping.
Test Sites \| Web Scraper. (n.d.). Retrieved 3 June 2024, from <https://webscraper.io/test-sites>
- Comparison of several industry solutions for extracting data from open text.
Top AI Apps & Tools for Document data extraction. (n.d.). Deepgram. Retrieved 3 June 2024, from <https://deepgram.com/ai-apps>
- Introduction to 'Nougat', an OCR model designed especially for parsing tables and equations in academic papers.
<https://facebookresearch.github.io/nougat/>