- Realistic HTTP Requests:
    - Mimics browser headers for undetected scraping, adapting to the requested file type
    - Tracks dynamic headers such as `Referer` and `Host`
    - Masks the TLS fingerprint of HTTP requests using the `curl_cffi` package
- Faster and Easier Parsing:
    - Automatically extracts metadata (title, description, author, etc.) from HTML-based responses
    - Methods to extract all webpage and image URLs
    - Seamlessly converts responses into Lxml and BeautifulSoup objects
```
$ pip install stealth_requests
```
Stealth-Requests mimics the API of the `requests` package, allowing you to use it in nearly the same way. You can send one-off requests like this:
```python
import stealth_requests as requests

resp = requests.get('https://link-here.com')
```
Or you can use a `StealthSession` object, which will keep track of certain headers for you between requests, such as the `Referer` header:
```python
from stealth_requests import StealthSession

with StealthSession() as session:
    resp = session.get('https://link-here.com')
```
When sending a request, or when creating a `StealthSession`, you can specify the type of browser that you want the request to mimic: either `chrome`, which is the default, or `safari`. To change which browser to mimic, set the `impersonate` argument, either in `requests.get` or when initializing a `StealthSession`, to `safari` or `chrome`.
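For example, to send a one-off request that mimics Safari instead of the default Chrome:

```python
import stealth_requests as requests

# Mimic Safari instead of the default Chrome
resp = requests.get('https://link-here.com', impersonate='safari')
```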
This package supports asyncio in the same way as the `requests` package:
```python
from stealth_requests import AsyncStealthSession

async with AsyncStealthSession(impersonate='safari') as session:
    resp = await session.get('https://link-here.com')
```
Or, for a one-off request, you can do this:
```python
import stealth_requests as requests

resp = await requests.get('https://link-here.com', impersonate='safari')
```
The response returned from this package is a `StealthResponse`, which has all of the same methods and attributes as a standard `requests` response object, with a few added features. One of these extra features is automatic parsing of header metadata from HTML-based responses. The metadata can be accessed from the `meta` property, which gives you access to the following metadata:
- `title: str | None`
- `author: str | None`
- `description: str | None`
- `thumbnail: str | None`
- `canonical: str | None`
- `twitter_handle: str | None`
- `keywords: tuple[str] | None`
- `robots: tuple[str] | None`
Here's an example of how to get the title of a page:
```python
import stealth_requests as requests

resp = requests.get('https://link-here.com')
print(resp.meta.title)
```
To make parsing HTML faster, I've also added two popular parsing packages to Stealth-Requests: Lxml and BeautifulSoup4. To use these add-ons, you need to install the `parsers` extra:
```
$ pip install stealth_requests[parsers]
```
To easily get an Lxml tree, you can use the `resp.tree()` method, and to get a BeautifulSoup object, use the `resp.soup()` method.
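Here's a minimal sketch of both parsers (the `h1` lookup is just illustrative):

```python
import stealth_requests as requests

resp = requests.get('https://link-here.com')

tree = resp.tree()  # Lxml tree of the HTML response
soup = resp.soup()  # BeautifulSoup object of the same HTML

# From here, use each parser's normal API
first_heading = soup.find('h1')
```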
For simple parsing, I've also added the following convenience methods, from the Lxml package, right into the `StealthResponse` object:

- `text_content()`: Get all of the text content in a response
- `xpath()`: Go right to using XPath expressions instead of getting your own Lxml tree
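A short sketch of both, assuming `xpath()` accepts the same expression strings as Lxml's own `xpath` method:

```python
import stealth_requests as requests

resp = requests.get('https://link-here.com')

# All of the text on the page as one string
page_text = resp.text_content()

# Run an XPath expression directly against the response
paragraph_text = resp.xpath('//p/text()')
```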
If you would like to get all of the webpage URLs (`a` tags) from an HTML-based response, you can use the `links` property. If you'd like to get all image URLs (`img` tags), you can use the `images` property of the response object:
```python
import stealth_requests as requests

resp = requests.get('https://link-here.com')

for image_url in resp.images:
    ...
```
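The `links` property works the same way:

```python
import stealth_requests as requests

resp = requests.get('https://link-here.com')

for link_url in resp.links:
    print(link_url)
```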
In some cases, it's easier to work with a webpage in Markdown format rather than HTML. After making a GET request that returns HTML, you can use the `resp.markdown()` method to convert the response into a Markdown string, providing a simplified and readable version of the page content!
`markdown()` has two optional parameters:

- `content_xpath`: An XPath expression, in the form of a string, which can be used to narrow down what text is converted to Markdown. This can be useful if you don't want the header and footer of a webpage to be turned into Markdown.
- `ignore_links`: A boolean value that tells Html2Text whether it should include any links in the Markdown output.
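Here's a sketch using both parameters (the `//main` expression is just an example; adjust it to the page you're scraping):

```python
import stealth_requests as requests

resp = requests.get('https://link-here.com')

# Convert only the page's main content to Markdown, with links left out
md = resp.markdown(content_xpath='//main', ignore_links=True)
print(md)
```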