Copy this and run it in your own notebook.
Changelog:
[0.2] 23-09-2024:
- Prioritize the <main> element when extracting content
- Add a summarizer function
- Extract only H1, H2, and H3 for precision
[0.1] 21-09-2024:
- Initial build
- Use a custom request header to mimic a real browser and avoid being blocked
- Crawl URLs, get the content structure, then export it to a .txt file
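
The heading extraction described in 0.2 (prefer <main>, keep only H1 to H3) can be sketched roughly as below. This is a minimal illustration using only the standard library's html.parser, which may not be what the notebook itself uses; the names HeadingExtractor, extract_headings, and to_outline are hypothetical and chosen for this example only.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3"}


class HeadingExtractor(HTMLParser):
    """Collect (tag, text) pairs for h1-h3 and note which ones sit inside <main>."""

    def __init__(self):
        super().__init__()
        self.main_depth = 0    # > 0 while the parser is inside a <main> element
        self.current = None    # heading tag that is currently open, if any
        self.buffer = []       # text fragments of the open heading
        self.in_main = []      # headings found inside <main>
        self.everywhere = []   # all headings, used as a fallback

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.main_depth += 1
        elif tag in HEADING_TAGS and self.current is None:
            self.current = tag
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "main" and self.main_depth > 0:
            self.main_depth -= 1
        elif tag == self.current:
            # Join fragments and collapse runs of whitespace.
            text = " ".join("".join(self.buffer).split())
            self.current = None
            if text:
                item = (tag, text)
                self.everywhere.append(item)
                if self.main_depth > 0:
                    self.in_main.append(item)

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)


def extract_headings(html):
    """Return headings from <main> if there are any, otherwise from the whole page."""
    parser = HeadingExtractor()
    parser.feed(html)
    parser.close()
    return parser.in_main or parser.everywhere


def to_outline(headings):
    """Render headings as an indented plain-text outline."""
    lines = []
    for tag, text in headings:
        indent = "  " * (int(tag[1]) - 1)
        lines.append(f"{indent}{tag.upper()}: {text}")
    return "\n".join(lines)
```

In a crawler, the html argument would be the body of the fetched page, and the string returned by to_outline could be written to the .txt export.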