Headless crawler #310

Merged: 40 commits, Sep 20, 2022
Commits
2b8683a
wip: headless crawler
devl00p Jul 21, 2022
e3073c2
missing dependency
devl00p Jul 22, 2022
a603c23
fix most tests
devl00p Jul 23, 2022
b9ac5eb
fix more tests
devl00p Jul 23, 2022
2f8ca4d
remove our asyncmock as we removed python3.7 support and 3.8 has buil…
devl00p Jul 23, 2022
1ba7112
fix annoying warning in the mod_xxe test (... not awaited)
devl00p Jul 23, 2022
25c48d1
improving stop of the headless crawler
devl00p Jul 23, 2022
ddc7fcd
manage several headless modes, cli option
devl00p Jul 24, 2022
49182f0
Loading cookies inside intercepting explorer.
devl00p Jul 25, 2022
c8d9564
fixing beginner level error in CrawlerConfiguration (:sweating:) + style
devl00p Jul 26, 2022
6130465
ignore intercepted CONNECT requests + increase delay before reading t…
devl00p Jul 28, 2022
2f82d73
add --wait option for headless mode + force Request objects in "start…
devl00p Jul 29, 2022
174d779
fix test and style
devl00p Jul 29, 2022
8cb0ee2
fix style (again)
devl00p Jul 29, 2022
f583709
extract more urls
devl00p Aug 4, 2022
703762b
integrate exclusions for headless (had to do it in both crawler and m…
devl00p Aug 6, 2022
55d16b3
fix on urls with fragments
devl00p Aug 8, 2022
8fd2404
need two separate exclusion lists in intercepting_explorer.py
devl00p Aug 8, 2022
305c73f
Use link_depth in headless crawler. Real values won't appear in outpu…
devl00p Aug 8, 2022
2001dce
brings some limits into the intercepting explorer
devl00p Aug 10, 2022
520ae44
lock pyasn1 version
devl00p Aug 11, 2022
c587aaf
prevent downloading files from the headless browser by checking the m…
devl00p Aug 23, 2022
202a4bc
fix setup.py
devl00p Aug 23, 2022
4c91696
style
devl00p Aug 23, 2022
38c2d04
prevent out of scope redirections
devl00p Aug 24, 2022
a2e20a4
click on some buttons
devl00p Aug 25, 2022
88b168f
Use a headless browser to detect technologies a website is using + re…
devl00p Aug 30, 2022
37065d7
fix some tests
devl00p Aug 31, 2022
9808582
Fix one wappalizer related test, figured out the "find implied softwa…
devl00p Aug 31, 2022
e29af3a
fix more wappalyzer tests
devl00p Sep 1, 2022
a065c45
figured out reason behind last test failures (i18n)
devl00p Sep 1, 2022
07ff825
refactoring
devl00p Sep 2, 2022
d14bcf2
pin aiohttp version / use headless browser in mod_wapp only if --head…
devl00p Sep 5, 2022
d0b5760
fix tests for mod_wapp
devl00p Sep 5, 2022
9a4b36a
check geckodriver presence before activating headless mode
devl00p Sep 9, 2022
01d9360
headless mode: allows Firefox to connect to 127.0.0.1 using the proxy
devl00p Sep 12, 2022
a2161d9
headless mode: remove the buttons that are added by firefox when read…
devl00p Sep 13, 2022
4f317eb
catch asyncio.TimeoutError due to arsenic issue
devl00p Sep 14, 2022
772ef37
remove content-disposition header when set (intercepting mode)
devl00p Sep 15, 2022
dfb8df7
put back use of HTTP redirection urls + raise usage error is auth-typ…
devl00p Sep 18, 2022
style
devl00p committed Aug 23, 2022
commit 4c916961ee68d26cafb0ae160b559116d23fb29c
8 changes: 3 additions & 5 deletions wapitiCore/net/intercepting_explorer.py
@@ -215,13 +215,11 @@ async def launch_headless_explorer(
     # The headless browser will be configured to use the MITM proxy
     # The intercepting will be in charge of generating Request objects.
     # This is the only way as a headless browser can't provide us response headers.
-    proxy = f"127.0.0.1:{proxy_port}"
     proxy_settings = {
         "proxyType": "manual",
-        "httpProxy": proxy,
-        "sslProxy": proxy
+        "httpProxy": f"127.0.0.1:{proxy_port}",
+        "sslProxy": f"127.0.0.1:{proxy_port}"
     }
-    service = services.Geckodriver()
     browser = browsers.Firefox(
         proxy=proxy_settings,
         acceptInsecureCerts=True,
@@ -241,7 +239,7 @@ async def launch_headless_explorer(
     excluded_requests = list(excluded_requests)
 
     try:
-        async with get_session(service, browser) as headless_client:
+        async with get_session(services.Geckodriver(), browser) as headless_client:
             while to_explore and not stop_event.is_set():
                 request = to_explore.popleft()
                 excluded_requests.append(request)
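
For readers who want to try the two ideas in this hunk in isolation, here is a minimal standalone sketch. It is not code from this commit: `build_proxy_settings` and `drain_queue` are hypothetical helper names, and the sketch uses only the standard library rather than the real browser session. It assumes the proxy listens on 127.0.0.1, as in the diff, and it mirrors the `while to_explore and not stop_event.is_set()` loop using a `deque` and an `asyncio.Event`.

```python
import asyncio
from collections import deque


def build_proxy_settings(proxy_port: int) -> dict:
    """Hypothetical helper: proxy capabilities pointing at a local intercepting proxy."""
    address = f"127.0.0.1:{proxy_port}"
    return {
        "proxyType": "manual",
        "httpProxy": address,
        "sslProxy": address,
    }


async def drain_queue(to_explore: deque, stop_event: asyncio.Event) -> list:
    """Hypothetical helper: pop items until the queue is empty or a stop is requested."""
    visited = []
    while to_explore and not stop_event.is_set():
        item = to_explore.popleft()
        visited.append(item)
        # Yield to the event loop so another task gets a chance to set stop_event.
        await asyncio.sleep(0)
    return visited
```

Building the address once and reusing it for both keys gives the same dictionary as the inlined f-strings in the commit; which form reads better is a matter of style.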