Additional features. Meta, cb_kwargs; close_page handling; dont_recaptcha meta-flag (#18)

* added comment

* Added new action (recaptcha_solver), wrote down the descriptions and checked grammar

* Example of using the recaptcha_solver action

* Deleted one example web-site

* Improved RecaptchaSolver and example of usage

* Implemented RecaptchaSolver, changed returning response to PuppeteerJsonResponse. Added comments and description to classes, checked grammar.

* Fully implemented auto recaptcha solver middleware. Added example of usage and fixed screenshot action.

* RECAPTCHA_* settings prefix, usual filename of saved html, 'w'->'wb', changed names of parse-functions in recaptcha.py, changed RecaptchaSolver documentation, additional settings to middleware, provided False instead of close_page attribute

* Extended 1 example into 2. Checked grammar + beautified examples

* Replace method for response.py

* Fixed replace method, added meta to PuppeteerRequest,
added generativo request to PuppeteerResponse.
Simplified _gen_response method

* Backward compatibility for ServiceMiddleware

* Documentation for changes

* Fixed meta

* Added bytes for body

* Added bytes for body

* Added bytes for body

* No encodings

* Trying to provide meta!

* Transferring and accumulating meta

* CustomJsAction context_id and page_id fix

* Some fixes with meta and special key to meta

* Changes to page_idS

* Independent requests for recaptcha_solver

* Meta documentation

* Proper replace methods and attributes

* Added attribute to HtmlResponse, handled close_page problem

* deleted prints

* deleted comments

* Click in submit_selectors,
added attributes to responses and simplified recaptcha_solving

* deleted prints

* Fixed replace for JsonResponse

* Beautifying crawling and additional parameter to recaptcha_solver action

* Beautifying in retrying

* Code beautifying and skipping some actions as they do not produce Recaptcha

* Grammar and additional comments

* Some grammar, new ActionRequest class, TextResponse inheritance, new representation of request and responses

* Changed PuppeteerJsonResponse's init for better handling its data, thus changed response generation in ServiceMiddleware. Deleted exceptional replace method of PuppeteerJsonResponse.

* Setup changes

* Headers fix

* Setup fix

* Version updating, dependencies setting and deleting `|` for Python 3.6 support

* Meta added and comments updated

* changed _form_response signature

* Updated comment to CustomJsAction

* Changed page_id getting

* Fixed additional request in response_data

* Fixed replacing request
MatthewZMSU authored Aug 22, 2023
1 parent 3977243 commit 29a832c
Showing 8 changed files with 268 additions and 104 deletions.
8 changes: 8 additions & 0 deletions README.md
@@ -99,6 +99,9 @@ via `include_headers` attribute in request or globally with `PUPPETEER_INCLUDE_H
Available values are True (all headers), False (no headers) or list of header names.
By default, only cookies are sent.

+You may also want to send meta with your request. By default, meta is not sent
+in order to preserve backward compatibility. You can change this behaviour by setting `PUPPETEER_INCLUDE_META` to True.
+
## Automatic recaptcha solving

Enable PuppeteerRecaptchaDownloaderMiddleware to automatically solve recaptcha during scraping. We do not recommend
@@ -113,6 +116,8 @@ DOWNLOADER_MIDDLEWARES = {
Note that the number of RecaptchaMiddleware has to be lower than ServiceMiddleware's.
You must provide some settings to use the middleware:
```Python
+PUPPETEER_INCLUDE_META = True # Essential to send meta
+
RECAPTCHA_ACTIVATION = True # Enables the middleware
RECAPTCHA_SOLVING = False # Automatic recaptcha solving
RECAPTCHA_SUBMIT_SELECTORS = { # Selectors for "submit recaptcha" button
@@ -122,6 +127,9 @@ RECAPTCHA_SUBMIT_SELECTORS = { # Selectors for "submit recaptcha" button
If you set RECAPTCHA_SOLVING to False, the middleware will only try to find captchas
and will notify you about the number of captchas found on the page.

+If you don't want the middleware to process a specific request, you may provide a special meta key: `'dont_recaptcha': True`.
+In this case RecaptchaMiddleware will just skip the request.
+
## TODO

- [x] skeleton that could handle goto, click, scroll, and actions
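To illustrate the `dont_recaptcha` meta key described in the README change above, here is a minimal, self-contained sketch of how a middleware could decide to skip a request. The helper `should_skip_recaptcha` and the plain-dict `meta` are hypothetical stand-ins for illustration; the actual check inside PuppeteerRecaptchaDownloaderMiddleware is not shown in this diff and may differ.

```python
from typing import Any, Mapping


def should_skip_recaptcha(meta: Mapping[str, Any]) -> bool:
    """Hypothetical helper: True when the request opts out of recaptcha handling."""
    # Assumption: only a truthy 'dont_recaptcha' value disables the middleware.
    return bool(meta.get('dont_recaptcha', False))


# A request carrying the opt-out flag is skipped; a request without it is processed.
print(should_skip_recaptcha({'dont_recaptcha': True}))
print(should_skip_recaptcha({'page_id': 'abc'}))
```

In a spider, the flag would presumably be passed through the request's `meta` dictionary, with `PUPPETEER_INCLUDE_META` enabled as the README notes.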
2 changes: 2 additions & 0 deletions examples/spiders/auto_recaptcha.py
@@ -18,6 +18,8 @@ class AutoRecaptchaSpider(scrapy.Spider):
'scrapypuppeteer.middleware.PuppeteerRecaptchaDownloaderMiddleware': 1041,
'scrapypuppeteer.middleware.PuppeteerServiceDownloaderMiddleware': 1042
},
+'PUPPETEER_INCLUDE_META': True,
+
'RECAPTCHA_ACTIVATION': True,
'RECAPTCHA_SOLVING': True,
'RECAPTCHA_SUBMIT_SELECTORS': {
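The README requires the recaptcha middleware's number to be lower than the service middleware's, and the example spider uses 1041 and 1042. The following sketch checks that ordering on a plain dictionary. The function `recaptcha_order_is_valid` is a hypothetical helper written for this illustration and is not part of scrapypuppeteer.

```python
from typing import Mapping

RECAPTCHA_MW = 'scrapypuppeteer.middleware.PuppeteerRecaptchaDownloaderMiddleware'
SERVICE_MW = 'scrapypuppeteer.middleware.PuppeteerServiceDownloaderMiddleware'


def recaptcha_order_is_valid(middlewares: Mapping[str, int]) -> bool:
    """Hypothetical check: both middlewares are present and recaptcha has the lower number."""
    if RECAPTCHA_MW not in middlewares or SERVICE_MW not in middlewares:
        return False
    return middlewares[RECAPTCHA_MW] < middlewares[SERVICE_MW]


# Values copied from the example spider's custom settings.
custom_middlewares = {RECAPTCHA_MW: 1041, SERVICE_MW: 1042}
print(recaptcha_order_is_valid(custom_middlewares))
```

A check like this could run once at spider start-up so that a misordered configuration fails early rather than silently skipping recaptcha handling.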
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1 +1 @@
-scrapy
+scrapy>=2.6
16 changes: 13 additions & 3 deletions scrapypuppeteer/actions.py
@@ -238,6 +238,7 @@ class RecaptchaSolver(PuppeteerServiceAction):
solve_recaptcha - bool = True: enables automatic solving of recaptcha on the page if found.
If false is provided recaptcha will still be detected on the page but not solved.
You can get info about found recaptchas via return value.
+close_on_empty - bool = False: whether to close page or not if there was no captcha on the page.
Response for this action is PuppeteerJsonResponse. You can get the return values
via self.data['recaptcha_data'].
@@ -246,12 +247,17 @@ class RecaptchaSolver(PuppeteerServiceAction):
     """
     endpoint = 'recaptcha_solver'
 
-    def __init__(self, solve_recaptcha: bool = True, **kwargs):
+    def __init__(self,
+                 solve_recaptcha: bool = True,
+                 close_on_empty: bool = False,
+                 **kwargs):
         self.solve_recaptcha = solve_recaptcha
+        self.close_on_empty = close_on_empty
 
     def payload(self):
         return {
-            'solve_recaptcha': self.solve_recaptcha
+            'solve_recaptcha': self.solve_recaptcha,
+            'close_on_empty': self.close_on_empty
         }


@@ -261,7 +267,11 @@ class CustomJsAction(PuppeteerServiceAction):
:param str js_action: JavaScript function.
-Expected signature: ``async function action(page, request)``
+Expected signature: ``async function action(page, request)``.
+JavaScript function should not return object with attributes
+of ``scrapypuppeteer.PuppeteerJsonResponse``.
+Otherwise, undefined behaviour is possible.
Response for this action contains result of the function.
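The `RecaptchaSolver` change in `scrapypuppeteer/actions.py` adds a `close_on_empty` flag to the constructor and to the payload. The sketch below mirrors that logic in a standalone class so it can run without scrapypuppeteer installed. The class name `RecaptchaSolverSketch` is hypothetical, and it deliberately does not inherit from `PuppeteerServiceAction`, so it only approximates the real action.

```python
class RecaptchaSolverSketch:
    """Standalone stand-in mirroring the constructor and payload shown in the diff."""
    endpoint = 'recaptcha_solver'

    def __init__(self,
                 solve_recaptcha: bool = True,
                 close_on_empty: bool = False,
                 **kwargs):
        # Extra keyword arguments are accepted but ignored, as in the diff.
        self.solve_recaptcha = solve_recaptcha
        self.close_on_empty = close_on_empty

    def payload(self) -> dict:
        # Both flags are sent to the service under their attribute names.
        return {
            'solve_recaptcha': self.solve_recaptcha,
            'close_on_empty': self.close_on_empty,
        }


# Defaults: solve captchas, keep the page open when none is found.
default_action = RecaptchaSolverSketch()
# Detect only, and close the page if no captcha is present.
detect_only = RecaptchaSolverSketch(solve_recaptcha=False, close_on_empty=True)
print(default_action.payload())
print(detect_only.payload())
```

Keeping `close_on_empty` defaulted to False means existing callers that pass only `solve_recaptcha` see the same page-closing behaviour as before the change.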