A simple, lightweight, yet powerful recursive XPath web scraper that uses pure JSON for both input & output, no coding necessary.

codefortallahassee/microwebscraper

Micro Web Scraper

A lightweight, recursive HTML web scraper that takes a simple JSON file of keys and XPath expressions and returns the results as JSON.

Usage: scraper [OPTIONS]

Options:
  -c, --config FILENAME       name of JSON config file containing request &
                              xpath info
  -i, --indent INTEGER RANGE  indent size for output
  -t, --tidy                  tidy HTML (normalizes space & indent)
  -u, --url TEXT              URL of HTML page
  -v, --verbose               display the results for each scraper step
  -x, --xpath TEXT            XPATH expression
  -p, --page FILENAME         name of file containing HTML content
  --raw                       bypass parser, output raw HTML
  --help                      Show this message and exit.

In the JSON config file, keys that start with '_' are arguments/parameters for the Python requests module. "_url" is a required key.
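As a rough illustration of the underscore convention, the sketch below separates request settings from XPath entries in a parsed config. The function name `split_config`, the `_timeout` key, and the choice to strip the leading underscore are assumptions made for this example; the README does not say how the tool forwards these keys internally.

```python
def split_config(config):
    """Separate request settings from XPath entries in a parsed config dict."""
    if "_url" not in config:
        raise ValueError('config must contain the required "_url" key')
    # Keys with a leading underscore describe the HTTP request.
    # Dropping the underscore to form keyword names is an assumption.
    request_args = {k[1:]: v for k, v in config.items() if k.startswith("_")}
    # Every other key maps an output name to an XPath spec.
    xpaths = {k: v for k, v in config.items() if not k.startswith("_")}
    return request_args, xpaths


config = {
    "_url": "https://example.com",
    "_timeout": 10,  # hypothetical extra request parameter
    "title": "//title/text()",
}
request_args, xpaths = split_config(config)
print(request_args)  # {'url': 'https://example.com', 'timeout': 10}
print(xpaths)        # {'title': '//title/text()'}
```

Under these assumptions, the first dict could be handed to something like `requests.request("GET", **request_args)`, while the second drives the XPath extraction.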

In its simplest form, a config entry maps a key to an XPath expression:

{
    "key1": "xpath expression"
}

would return

{
    "key1": value
}

If the XPath expression returns a list, we can run additional XPath expressions on each item in the returned list.

{
    "key-a": ["xpath expression that returns multiple items", {
        "nested-key-1": "another xpath expression",
        "nested-key-2": "another xpath expression"
    }],
    "key-b": "xpath expression"
}

would return

{
    "key-a": [
        {
            "nested-key-1": value-1-1,
            "nested-key-2": value-1-2
        },
        {
            "nested-key-1": value-2-1,
            "nested-key-2": value-2-2
        }
    ],
    "key-b": value
}
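The nested evaluation described above can be sketched in a few lines of Python. This is not the tool's implementation: it assumes well-formed markup and uses the limited XPath subset of the standard library's `xml.etree.ElementTree`, and the names `scrape`, `page`, and `config` are hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def scrape(node, spec):
    """Recursively evaluate a spec against an element.

    str  -> text of the first element matching the path
    list -> [path, nested_spec]: apply nested_spec to every match
    dict -> evaluate each value, keeping the keys
    """
    if isinstance(spec, str):
        return node.findtext(spec)
    if isinstance(spec, list):
        path, nested = spec
        return [scrape(item, nested) for item in node.findall(path)]
    return {key: scrape(node, value) for key, value in spec.items()}


# A small, well-formed stand-in for a fetched page.
page = ET.fromstring(
    "<html><body><h1>Books</h1>"
    "<ul>"
    "<li><b>Dune</b><i>1965</i></li>"
    "<li><b>Emma</b><i>1815</i></li>"
    "</ul></body></html>"
)
config = {
    "books": [".//li", {"title": "b", "year": "i"}],
    "heading": ".//h1",
}
result = scrape(page, config)
print(json.dumps(result, indent=2))
```

Each `li` element matched by the outer path becomes the context node for the nested paths, which is why `"b"` and `"i"` are written relative to the list item rather than to the whole page.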

It's recursive, so you can continue nesting as deeply as needed. Check out the working demos in the "examples" folder.
