Typing and testing improvements #46

Merged
merged 29 commits into from
Aug 13, 2024
861e7e3
Added user agent argument (#43)
freddyheppell Aug 7, 2024
aef2f45
types
freddyheppell Aug 8, 2024
bd160ef
ruff
freddyheppell Aug 8, 2024
c411f3e
docs improvements
freddyheppell Aug 8, 2024
e169b5e
update docs
freddyheppell Aug 8, 2024
5c0fcf0
add py.typed
freddyheppell Aug 8, 2024
1acf74a
add type check workflow
freddyheppell Aug 8, 2024
a185ab3
fix paramspec in py3.9
freddyheppell Aug 8, 2024
a53b9d9
add tests for picker select helpers
freddyheppell Aug 8, 2024
0610aa7
more tests for download/parse/extract
freddyheppell Aug 8, 2024
ead3a1e
fix flaky test
freddyheppell Aug 8, 2024
277b5a3
ruff
freddyheppell Aug 9, 2024
db55c70
wpapi crawler tests
freddyheppell Aug 9, 2024
cea25be
remove wpapi caching
freddyheppell Aug 9, 2024
ac8d7ad
test get_obj_list and remove search
freddyheppell Aug 9, 2024
fa95874
more tests
freddyheppell Aug 9, 2024
e4de46a
add e2e tests
freddyheppell Aug 12, 2024
1e13177
add extra info to e2e test docs
freddyheppell Aug 12, 2024
438922f
Merge branch 'release/1.0.4' into typing
freddyheppell Aug 12, 2024
5e49e8e
remove unused UA argument on downloader
freddyheppell Aug 12, 2024
76e4329
ruff
freddyheppell Aug 12, 2024
7825c1d
fix ua null behaviour
freddyheppell Aug 12, 2024
8f10459
ruff
freddyheppell Aug 12, 2024
55cd131
Improve download error handling on first request
freddyheppell Aug 12, 2024
11b4ee7
Revert type change to set_current_lang and instead concat inside extr…
freddyheppell Aug 12, 2024
9b48a24
Update changelog
freddyheppell Aug 12, 2024
87df809
fix incorrect docstring
freddyheppell Aug 12, 2024
ccae085
remove incorrect raises of not v2 error
freddyheppell Aug 12, 2024
730b5d8
requestsession tests
freddyheppell Aug 13, 2024
2 changes: 2 additions & 0 deletions docs/api/downloader.md
@@ -15,3 +15,5 @@
members: false

::: wpextract.download.AuthorizationType

::: wpextract.download.requestsession.DEFAULT_UA
8 changes: 8 additions & 0 deletions docs/usage/download.md
@@ -59,6 +59,9 @@ $ wpextract download TARGET OUT_JSON
`--max-redirects MAX_REDIRECTS`
: Maximum number of redirects before giving up (default: 20)

`--user-agent USER_AGENT`
: User agent to use for requests. Default is a recent version of Chrome on Linux (see [`requestsession.DEFAULT_UA`][wpextract.download.requestsession.DEFAULT_UA])

**logging**

`--log FILE`, `-l FILE`
@@ -109,6 +112,11 @@ We would also suggest enabling the following options, with consideration for how
- `--wait` to space out requests
- `--random-wait` to vary the time between requests to avoid patterns

You may also wish to consider:

- The reputation of the IP used to make requests. IPs in ranges belonging to common VPS providers, e.g. DigitalOcean or AWS, may be more likely to be rate limited.
- `--user-agent` to set a custom user agent. The default is a recent version of Chrome on Linux, but this may become outdated. If using authentication, this may need to match the user agent of the browser used to log in.

### Error Handling

If an HTTP error occurs, the command will retry the request up to `--max-retries` times, with the backoff set by `--backoff-factor`. If the maximum number of retries is reached, the command will output the error, stop collecting the given data type, and start collecting the following data type. This is because it's presumed that if a given page is non-functional, the following one will be too.
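
The retry-then-skip behaviour described above can be sketched roughly as follows. This is a standard-library illustration only, not the project's implementation: `PageFetchError`, `fetch_with_retries`, `collect_all` and the `fetch_page` callback are hypothetical names, and the exponential backoff shape (`backoff_factor * 2**attempt`) is an assumption.

```python
import time
from typing import Callable, Dict, List, Sequence


class PageFetchError(Exception):
    """Hypothetical error raised when a single page request fails."""


def fetch_with_retries(
    fetch: Callable[[], List[dict]],
    max_retries: int,
    backoff_factor: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Call fetch(), retrying up to max_retries times after the first failure."""
    attempt = 0
    while True:
        try:
            return fetch()
        except PageFetchError:
            if attempt >= max_retries:
                raise
            # Assumed backoff shape: the wait doubles on every retry.
            sleep(backoff_factor * (2 ** attempt))
            attempt += 1


def collect_all(
    data_types: Sequence[str],
    fetch_page: Callable[[str, int], List[dict]],
    max_retries: int = 10,
    backoff_factor: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, List[dict]]:
    """Collect every data type page by page, skipping ahead when one fails."""
    results: Dict[str, List[dict]] = {}
    for data_type in data_types:
        items: List[dict] = []
        page = 1
        while True:
            try:
                batch = fetch_with_retries(
                    lambda: fetch_page(data_type, page),
                    max_retries,
                    backoff_factor,
                    sleep,
                )
            except PageFetchError as exc:
                # Retries exhausted: report, keep what we have, move on.
                print(f"Giving up on {data_type} at page {page}: {exc}")
                break
            if not batch:
                break  # an empty page means this data type is finished
            items.extend(batch)
            page += 1
        results[data_type] = items
    return results
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; in real use the default `time.sleep` applies.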
7 changes: 7 additions & 0 deletions src/wpextract/cli/_download.py
@@ -99,6 +99,11 @@ def validate_wait(ctx: Context, param: Parameter, value: Any) -> Any:
    help="Maximum number of redirects before giving up",
    show_default=True,
)
@optgroup.option(
    "--user-agent",
    type=str,
    help="User-Agent string to use for requests. Set to a recent version of Chrome on Linux by default.",
)
@logging_options
def download(
    target: str,
@@ -115,6 +120,7 @@ def download(
    max_retries: int,
    backoff_factor: float,
    max_redirects: int,
    user_agent: Optional[str],
    log: Optional[Path],
    verbose: bool,
) -> None:
@@ -152,6 +158,7 @@ def download(
        max_retries=max_retries,
        backoff_factor=backoff_factor,
        max_redirects=max_redirects,
        user_agent=user_agent,
    )

    with setup_tqdm_redirect(log is None):
7 changes: 5 additions & 2 deletions src/wpextract/download/requestsession.py
@@ -13,7 +13,7 @@
from requests.models import Response
from requests.sessions import _Data as RequestDataType

-DEFAULT_UA = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"
+DEFAULT_UA = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36"


class ConnectionCouldNotResolve(Exception):
@@ -208,6 +208,7 @@ def __init__(
        max_retries: int = 10,
        backoff_factor: float = 0.1,
        max_redirects: int = 20,
        user_agent: str = DEFAULT_UA,
    ):
        """Create a new request session.

@@ -221,6 +222,7 @@ def __init__(
            max_retries: the maximum number of retries before failing
            backoff_factor: Factor to wait between successive retries
            max_redirects: maximum number of redirects to follow
            user_agent: User agent to use for requests. Set to [`DEFAULT_UA`][wpextract.download.requestsession.DEFAULT_UA] by default.
        """
        self.s = requests.Session()
        if proxy is not None:
@@ -237,6 +239,7 @@ def __init__(
        self.timeout = timeout
        self._mount_retry(backoff_factor, max_redirects, max_retries)
        self.waiter = RequestWait(wait, random_wait)
        self.user_agent = user_agent

    def _mount_retry(
        self, backoff_factor: float, max_redirects: int, max_retries: int
@@ -302,7 +305,7 @@ def do_request(
        Returns:
            the Response object
        """
-        headers = {"User-Agent": DEFAULT_UA}
+        headers = {"User-Agent": self.user_agent}
        response = None
        try:
            if method == "post":
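
One interaction in these hunks is worth spelling out: the CLI types `user_agent` as `Optional[str]` and forwards it unconditionally, while `RequestSession.__init__` declares `user_agent: str = DEFAULT_UA`. In Python an explicitly passed `None` overrides a parameter default, so running without `--user-agent` would store `None` rather than `DEFAULT_UA`. The sketch below shows one possible fallback. `SessionSketch` and `build_headers` are hypothetical stand-ins, not part of wpextract, and this is not necessarily how the later "fix ua null behaviour" commit resolved it.

```python
from typing import Dict, Optional

# Copied from the diff above; the real constant lives in requestsession.py.
DEFAULT_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36"
)


class SessionSketch:
    """Hypothetical stand-in for RequestSession, reduced to UA handling."""

    def __init__(self, user_agent: Optional[str] = None) -> None:
        # Treat None as "not provided" so callers may forward an optional
        # CLI value without checking it first.
        self.user_agent = user_agent if user_agent is not None else DEFAULT_UA

    def build_headers(self) -> Dict[str, str]:
        """Return the headers that would be attached to each request."""
        return {"User-Agent": self.user_agent}
```

With this shape, `SessionSketch(user_agent=None)` and `SessionSketch()` behave identically, so the CLI layer needs no special case.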
2 changes: 2 additions & 0 deletions src/wpextract/downloader.py
@@ -25,6 +25,7 @@ def __init__(
        data_types: list[str],
        session: Optional[RequestSession] = None,
        json_prefix: Optional[str] = None,
        user_agent: Optional[str] = None,
    ) -> None:
        """Initializes the WPDownloader object.

@@ -34,6 +35,7 @@ def __init__(
            data_types: set of data types to download
            session : request session. Will be created from default constructor if not provided.
            json_prefix: prefix to prepend to JSON file names
            user_agent: User agent to use for requests. See [`RequestSession`][wpextract.download.requestsession.RequestSession].
        """
        self.target = target
        self.out_path = out_path
18 changes: 18 additions & 0 deletions tests/cli/test_download.py
@@ -14,6 +14,14 @@ def mock_cls_invoke(mocker, runner, datadir, args=None):
    return dl_mock, result


def mock_cls_invoke_req_sess(mocker, runner, datadir, args=None):
    rq_mock = mocker.patch("wpextract.download.RequestSession")

    dl_mock, result = mock_cls_invoke(mocker, runner, datadir, args)

    return rq_mock, dl_mock, result


def test_default_args(mocker, runner, datadir):
    dl_mock, result = mock_cls_invoke(mocker, runner, datadir)
    assert result.exit_code == 0
@@ -40,3 +48,13 @@ def test_wait_random_validation(mocker, runner, datadir):
        mocker, runner, datadir, ["--random-wait", "--wait", "1"]
    )
    assert result.exit_code == 0


def test_custom_ua(mocker, runner, datadir):
    req_mock, dl_mock, result = mock_cls_invoke_req_sess(
        mocker, runner, datadir, ["--user-agent", "test"]
    )
    assert result.exit_code == 0

    req_mock.assert_called_once()
    assert req_mock.call_args.kwargs["user_agent"] == "test"
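
The assertion pattern in `test_custom_ua` (patch the session class, run the command, inspect the keyword arguments it received) can be reproduced in isolation with the standard library's `unittest.mock`. This is a sketch under assumptions: the `RequestSession` class here is a stand-in, and `deps`, `run_download` and `forwarded_user_agent` are hypothetical names used only to make the example self-contained. The real test uses pytest-mock's `mocker.patch` against `wpextract.download.RequestSession`.

```python
from types import SimpleNamespace
from unittest import mock


class RequestSession:
    """Stand-in for the real session class; it only records its user agent."""

    def __init__(self, user_agent=None):
        self.user_agent = user_agent


# Hypothetical module-like namespace the command looks its dependencies up on.
deps = SimpleNamespace(RequestSession=RequestSession)


def run_download(user_agent):
    """Hypothetical command body: builds a session from the CLI argument."""
    return deps.RequestSession(user_agent=user_agent)


def forwarded_user_agent(value):
    """Patch the session class, run the command, return the kwarg it received."""
    with mock.patch.object(deps, "RequestSession") as req_mock:
        run_download(value)
    req_mock.assert_called_once()
    # call_args.kwargs holds the keyword arguments of the most recent call.
    return req_mock.call_args.kwargs["user_agent"]
```

Checking `call_args.kwargs` only works if the code under test passes `user_agent` by keyword, which the `download` command in this diff does.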