GrabLinks


Synopsis

grablinks.py is a simple and streamlined Python 3 script to extract and filter links from a remote HTML resource.
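At its core the script fetches a page and walks all `<a>` elements, collecting their `href` attributes. The idea can be sketched with just the Python standard library (the real script uses requests and beautifulsoup4; `LinkExtractor` below is an illustrative name, not part of grablinks.py):

```python
# Minimal sketch of link extraction with only the standard library.
# grablinks.py itself uses requests + beautifulsoup4 instead.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> element."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

html = '<p><a href="https://example.com/a">A</a> <a href="/b">B</a></p>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['https://example.com/a', '/b']
```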

Requirements

An installation of Python 3 (any version above 3.5 should do fine). Additionally, the third-party Python modules requests and beautifulsoup4 are required. Both modules can be installed with Python's package manager pip, e.g.:

pip install --user requests
pip install --user beautifulsoup4

Usage

usage: grablinks.py [-h] [-V] [--insecure] [-f FORMATSTR] [--fix-links]
                    [-c CLASS] [-s SEARCH] [-x REGEX]
                    URL

Extracts, and optionally filters, all links (`<a href=""/>`) from a remote
HTML document.

positional arguments:
  URL                   a fully qualified URL to the source HTML document

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show version number and exit
  --insecure            disable verification of SSL/TLS certificates (e.g. to
                        allow self-signed certificates)
  -f FORMATSTR, --format FORMATSTR
                        a format string to wrap in the output: %url% is
                        replaced by found URL entries; %text% is replaced with
                        the text content of the link; other supported
                        placeholders for generated values: %id%, %guid%, and
                        %hash%
  --fix-links           try to convert relative and fragment URLs to
                        absolute URLs (after filtering)

filter options:
  -c CLASS, --class CLASS
                        only extract URLs from href attributes of <a>nchor
                        elements with the specified class attribute content.
                        Multiple classes, separated by spaces, are evaluated
                        with a logical OR, so any <a>nchor that has at least
                        one of the classes will match.
  -s SEARCH, --search SEARCH
                        only output entries from the extracted result set, if
                        the search string occurs in the URL
  -x REGEX, --regex REGEX
                        only output entries from the extracted result set, if
                        the URL matches the regular expression

Report bugs, request features, or provide suggestions via
https://github.com/the-real-tokai/grablinks/issues
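The relative-to-absolute conversion that `--fix-links` performs can be pictured with `urllib.parse.urljoin`, the standard-library way to resolve relative and fragment URLs against a base URL. This is a sketch of the general technique, not necessarily the script's exact implementation:

```python
# Resolving relative and fragment URLs against the document's base URL,
# as --fix-links conceptually does (base URL is a hypothetical example).
from urllib.parse import urljoin

base = "https://www.example.com/articles/index.html"
for href in ["page2.html", "/downloads/file.zip", "#section"]:
    print(urljoin(base, href))
# https://www.example.com/articles/page2.html
# https://www.example.com/downloads/file.zip
# https://www.example.com/articles/index.html#section
```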

Usage Examples

# extract wikipedia links from 'www.example.com':
$ grablinks.py 'https://www.example.com/' --search 'wikipedia'
https://ja.wikipedia.org/wiki/仲間由紀恵
https://ja.wikipedia.org/wiki/黒木華
https://ja.wikipedia.org/wiki/清野菜名
…
# extract download links from 'www.example.com', create a shell script
# on-the-fly and pass it along to sh to fetch things with wget:
$ grablinks.py 'https://www.example.com/' --search 'download.example.org' --format 'wget "%url%"' | sh
# Note: Do not do that at home. It is dangerous! 😱
# extract/handle links like
# <a href="https://example.com/crypticnumber">properfilename.dat</a>
$ grablinks.py 'https://www.example.com/' --format 'wget '\''%url%'\'' -O '\''%text%'\' > fetchfiles.sh
$ sh fetchfiles.sh
# Note: %text% is not sanitized by grablinks.py for safe shell usage. Verify
#       the generated script before executing it automatically
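One way to harden such generated scripts is to quote every interpolated value with Python's `shlex.quote` before the line reaches a shell. This is a post-processing suggestion, not a built-in grablinks.py feature; the link text below is a hypothetical hostile example:

```python
# Quoting untrusted values for shell use with the standard library.
import shlex

text = 'innocent.dat"; rm -rf ~; echo "'     # hostile <a> text content
url = "https://download.example.org/file1"   # hypothetical URL
cmd = f"wget {shlex.quote(url)} -O {shlex.quote(text)}"
print(cmd)  # the hostile text is now a single, harmless argument
```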

History

1.6  2-Dec-2023   Added '--insecure' argument to disable SSL/TLS certificate verification.
                  Added support for the '%text%' placeholder in format strings (<a>text</a>).
1.5  24-Nov-2022  Added a (fixed) timeout to the remote request.
1.4  30-May-2022  Improved handling of passing multiple classes to '--class'.
1.3  6-Feb-2021   Fix: handling of common edge cases when '--fix-links' is used.
1.2  16-Aug-2020  Fix: in some cases links from "<a>" tags without a 'class' attribute were not part of the result.
1.1  7-Jun-2020   Initial public source code release.