Create a scheme.json file with the parsing scheme in the repository root directory.
+
+
+
Run make up in this directory.
+
+
+
The output will be saved to output/results.json.
+
+
+
Docker Make Targets
+
make build # Build Docker image
+
+make up # Start containers using Docker Compose
+
+make down # Stop and remove containers using Docker Compose
+
+make restart # Restart containers using Docker Compose
+
+make logs # View logs of the containers
+
+make shell # Open a shell in the running container
+
+make clean # Remove all stopped containers, unused networks, and dangling images
+
\ No newline at end of file
diff --git a/features/extractors/index.html b/features/extractors/index.html
index 9280cff..e6008ae 100644
--- a/features/extractors/index.html
+++ b/features/extractors/index.html
@@ -16,7 +16,7 @@
-
+
@@ -479,6 +479,27 @@
+
+
+
+
+
+
+
Thanks for considering contributing to Parsera! This project is at an early stage of development, so any help will be highly appreciated. You can start by looking through existing issues, or by asking on Discord which contributions would be most helpful.
The best way to ask a question, report a bug, or submit a feature request is to open an Issue. This is much better than asking by email or on Discord, since the conversation becomes publicly available and easy to navigate.
"},{"location":"contributing/#pull-requests","title":"Pull requests","text":""},{"location":"contributing/#installation-and-setup","title":"Installation and setup","text":"
Fork the repository on GitHub and clone your fork locally.
Next, install dependencies using poetry:
# Clone your fork and cd into the repo directory\ngit clone git@github.com:<your username>/parsera.git\ncd parsera\n\n# If you don't have poetry, install it first:\n# https://python-poetry.org/docs/\n# Then:\npoetry install\n# If you are using VS Code, print the virtual environment path to select its interpreter:\npoetry env info --path\n# To activate the virtual environment, run:\npoetry shell\n
Now you have a virtual environment with Parsera and all necessary dependencies installed."},{"location":"contributing/#code-style","title":"Code style","text":"
The project uses black and isort for formatting. Set them up in your IDE or run this before committing:
make format\n
"},{"location":"contributing/#commit-and-push-changes","title":"Commit and push changes","text":"
Commit your changes and push them to your fork, then create a pull request to the Parsera repository.
Thanks a lot for helping improve Parsera!
"},{"location":"getting-started/","title":"Welcome to Parsera","text":"
Parsera is a lightweight Python library for scraping websites with LLMs. You can clone it and run it locally, or use an API, which scales better and provides extra features such as a built-in proxy.
By default, proxy_country is random. It's recommended to set the proxy_country parameter to a specific country in the request, since a page may not be available from all locations. Here you can find a full list of proxy countries available.
You can also explore the Swagger documentation of the API at https://api.parsera.org/docs#/.
You can use the proxy_country parameter to set a proxy country. The default is random, and it's recommended to change it, since your page may not be available from all locations.
To scrape the page from the United States, set proxy_country to UnitedStates:
"},{"location":"api/proxy/#list-of-proxy-countries","title":"List of proxy countries","text":"
Send a GET request to https://api.parsera.org/v1/proxy-countries to get the list of countries programmatically.
Here is the list of countries available:
Random Country - random
Afghanistan - Afghanistan
Albania - Albania
Algeria - Algeria
Argentina - Argentina
Armenia - Armenia
Aruba - Aruba
Australia - Australia
Austria - Austria
Azerbaijan - Azerbaijan
Bahamas - Bahamas
Bahrain - Bahrain
Bangladesh - Bangladesh
Belarus - Belarus
Belgium - Belgium
Bosnia and Herzegovina - BosniaandHerzegovina
Brazil - Brazil
British Virgin Islands - BritishVirginIslands
Brunei - Brunei
Bulgaria - Bulgaria
Cambodia - Cambodia
Cameroon - Cameroon
Canada - Canada
Chile - Chile
China - China
Colombia - Colombia
Costa Rica - CostaRica
Croatia - Croatia
Cuba - Cuba
Cyprus - Cyprus
Czechia - Chechia
Denmark - Denmark
Dominican Republic - DominicanRepublic
Ecuador - Ecuador
Egypt - Egypt
El Salvador - ElSalvador
Estonia - Estonia
Ethiopia - Ethiopia
Finland - Finland
France - France
Georgia - Georgia
Germany - Germany
Ghana - Ghana
Greece - Greece
Guatemala - Guatemala
Guyana - Guyana
Hashemite Kingdom of Jordan - HashemiteKingdomofJordan
Hong Kong - HongKong
Hungary - Hungary
India - India
Indonesia - Indonesia
Iran - Iran
Iraq - Iraq
Ireland - Ireland
Israel - Israel
Italy - Italy
Jamaica - Jamaica
Japan - Japan
Kazakhstan - Kazakhstan
Kenya - Kenya
Kosovo - Kosovo
Kuwait - Kuwait
Latvia - Latvia
Liechtenstein - Liechtenstein
Luxembourg - Luxembourg
Macedonia - Macedonia
Madagascar - Madagascar
Malaysia - Malaysia
Mauritius - Mauritius
Mexico - Mexico
Mongolia - Mongolia
Montenegro - Montenegro
Morocco - Morocco
Mozambique - Mozambique
Myanmar - Myanmar
Nepal - Nepal
Netherlands - Netherlands
New Zealand - NewZealand
Nigeria - Nigeria
Norway - Norway
Oman - Oman
Pakistan - Pakistan
Palestine - Palestine
Panama - Panama
Papua New Guinea - PapuaNewGuinea
Paraguay - Paraguay
Peru - Peru
Philippines - Philippines
Poland - Poland
Portugal - Portugal
Puerto Rico - PuertoRico
Qatar - Qatar
Republic of Lithuania - RepublicOfLithuania
Republic of Moldova - RepublicOfMoldova
Romania - Romania
Russia - Russia
Saudi Arabia - SaudiArabia
Senegal - Senegal
Serbia - Serbia
Seychelles - Seychelles
Singapore - Singapore
Slovakia - Slovakia
Slovenia - Slovenia
Somalia - Somalia
South Africa - SouthAfrica
South Korea - SouthKorea
Spain - Spain
Sri Lanka - SriLanka
Sudan - Sudan
Suriname - Suriname
Sweden - Sweden
Switzerland - Switzerland
Syria - Syria
Taiwan - Taiwan
Tajikistan - Tajikistan
Thailand - Thailand
Trinidad and Tobago - TrinidadandTobago
Tunisia - Tunisia
Turkey - Turkey
Uganda - Uganda
Ukraine - Ukraine
United Arab Emirates - UnitedArabEmirates
United Kingdom - UnitedKingdom
United States - UnitedStates
Uzbekistan - Uzbekistan
Venezuela - Venezuela
Vietnam - Vietnam
Zambia - Zambia
"},{"location":"features/custom-models/","title":"Custom models","text":""},{"location":"features/custom-models/#run-with-custom-model","title":"Run with custom model","text":"
You can instantiate Parsera with any chat model supported by LangChain. For example, to run a model from Azure:
With the ParseraScript class you can execute custom Playwright scripts during scraping. There are 2 types of script you can run:
initial_script, which is executed during the first run of ParseraScript. It is useful when you need to log in to access the data.
playwright_script, which runs on every run call and lets you perform custom actions before data is extracted. It is useful when data is hidden behind a button.
"},{"location":"features/custom-playwright/#example-log-in-and-load-data","title":"Example: log in and load data","text":"
You can log in to parsera.org and get the number of credits with the following code:
from playwright.async_api import Page\nfrom parsera import ParseraScript\n\n# Define the script to execute during the session creation\nasync def initial_script(page: Page) -> Page:\n await page.goto(\"https://parsera.org/auth/sign-in\")\n await page.wait_for_load_state(\"networkidle\")\n await page.get_by_label(\"Email\").fill(EMAIL)\n await page.get_by_label(\"Password\").fill(PASSWORD)\n await page.get_by_role(\"button\", name=\"Sign In\", exact=True).click()\n await page.wait_for_selector(\"text=Playground\")\n return page\n\n# This script is executed after the url is opened\nasync def repeating_script(page: Page) -> Page:\n await page.wait_for_timeout(1000) # Wait one second for page to load\n return page\n\nparsera = ParseraScript(model=model, initial_script=initial_script)\nresult = await parsera.arun(\n url=\"https://parsera.org/app\",\n elements={\n \"credits\": \"number of credits\",\n },\n playwright_script=repeating_script,\n)\n
The page is fetched via the ParseraScript.loader, which contains the playwright instance.
from parsera import ParseraScript\n\nparsera = ParseraScript(model=model)\n\n## You can manually initialize a playwright session and modify it:\nawait parsera.new_session()\nawait parsera.loader.load_content(url=url)\n\n## After the page is loaded you can access playwright elements, like Page:\nawait parsera.loader.page.get_by_role('button').click()\n\n## Next you can run the extraction process\nresult = await parsera.arun(\n url=extraction_url,\n elements=elements_dict,\n)\n
Where proxy_settings contains your proxy credentials.
"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Welcome to Parsera","text":"
Parsera is a lightweight Python library for scraping websites with LLMs.
There are 2 ways of using Parsera:
Install the library and run it locally. This is great for smaller-scale extraction and experiments.
Use an API that provides a more scalable way of extracting data out of the box. It also offers some extra features, such as a built-in proxy.
Thanks for considering contributing to Parsera! This project is at an early stage of development, so any help will be highly appreciated. You can start by looking through existing issues, or by asking on Discord which contributions would be most helpful.
The best way to ask a question, report a bug, or submit a feature request is to open an Issue. This is much better than asking by email or on Discord, since the conversation becomes publicly available and easy to navigate.
"},{"location":"contributing/#pull-requests","title":"Pull requests","text":""},{"location":"contributing/#installation-and-setup","title":"Installation and setup","text":"
Fork the repository on GitHub and clone your fork locally.
Next, install dependencies using poetry:
# Clone your fork and cd into the repo directory\ngit clone git@github.com:<your username>/parsera.git\ncd parsera\n\n# If you don't have poetry, install it first:\n# https://python-poetry.org/docs/\n# Then:\npoetry install\n# If you are using VS Code, print the virtual environment path to select its interpreter:\npoetry env info --path\n# To activate the virtual environment, run:\npoetry shell\n
Now you have a virtual environment with Parsera and all necessary dependencies installed."},{"location":"contributing/#code-style","title":"Code style","text":"
The project uses black and isort for formatting. Set them up in your IDE or run this before committing:
make format\n
"},{"location":"contributing/#commit-and-push-changes","title":"Commit and push changes","text":"
Commit your changes and push them to your fork, then create a pull request to the Parsera repository.
Thanks a lot for helping improve Parsera!
"},{"location":"getting-started/","title":"Welcome to Parsera","text":"
Parsera is a lightweight Python library for scraping websites with LLMs. You can clone it and run it locally, or use an API, which scales better and provides extra features such as a built-in proxy.
By default, proxy_country is random. It's recommended to set the proxy_country parameter to a specific country in the request, since a page may not be available from all locations. Here you can find a full list of proxy countries available.
You can also explore the Swagger documentation of the API at https://api.parsera.org/docs#/.
You can use the proxy_country parameter to set a proxy country. The default is random, and it's recommended to change it, since your page may not be available from all locations.
To scrape the page from the United States, set proxy_country to UnitedStates:
"},{"location":"api/proxy/#list-of-proxy-countries","title":"List of proxy countries","text":"
Send a GET request to https://api.parsera.org/v1/proxy-countries to get the list of countries programmatically.
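A minimal Python sketch of that GET request, using only the standard library. The URL is the one given above. The helper names `build_request` and `fetch_proxy_countries` are made up for illustration, and the sketch assumes the endpoint needs no authentication header and returns JSON, neither of which is confirmed here.

```python
import json
import urllib.request

# URL taken from the text above; the helper names below are illustrative only.
PROXY_COUNTRIES_URL = 'https://api.parsera.org/v1/proxy-countries'


def build_request(url=PROXY_COUNTRIES_URL):
    '''Build a plain GET request that asks for a JSON response.'''
    return urllib.request.Request(
        url, headers={'Accept': 'application/json'}, method='GET'
    )


def fetch_proxy_countries(url=PROXY_COUNTRIES_URL, timeout=10.0):
    '''Send the request and decode the response body as JSON.

    Assumption: the endpoint answers without an API key. The shape of the
    decoded value is not documented here, so it is returned unchanged.
    '''
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))
```

Calling `fetch_proxy_countries()` performs the network request. Inspect the returned value before relying on a particular structure.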
Here is the list of countries available:
Random Country - random
Afghanistan - Afghanistan
Albania - Albania
Algeria - Algeria
Argentina - Argentina
Armenia - Armenia
Aruba - Aruba
Australia - Australia
Austria - Austria
Azerbaijan - Azerbaijan
Bahamas - Bahamas
Bahrain - Bahrain
Bangladesh - Bangladesh
Belarus - Belarus
Belgium - Belgium
Bosnia and Herzegovina - BosniaandHerzegovina
Brazil - Brazil
British Virgin Islands - BritishVirginIslands
Brunei - Brunei
Bulgaria - Bulgaria
Cambodia - Cambodia
Cameroon - Cameroon
Canada - Canada
Chile - Chile
China - China
Colombia - Colombia
Costa Rica - CostaRica
Croatia - Croatia
Cuba - Cuba
Cyprus - Cyprus
Czechia - Chechia
Denmark - Denmark
Dominican Republic - DominicanRepublic
Ecuador - Ecuador
Egypt - Egypt
El Salvador - ElSalvador
Estonia - Estonia
Ethiopia - Ethiopia
Finland - Finland
France - France
Georgia - Georgia
Germany - Germany
Ghana - Ghana
Greece - Greece
Guatemala - Guatemala
Guyana - Guyana
Hashemite Kingdom of Jordan - HashemiteKingdomofJordan
Hong Kong - HongKong
Hungary - Hungary
India - India
Indonesia - Indonesia
Iran - Iran
Iraq - Iraq
Ireland - Ireland
Israel - Israel
Italy - Italy
Jamaica - Jamaica
Japan - Japan
Kazakhstan - Kazakhstan
Kenya - Kenya
Kosovo - Kosovo
Kuwait - Kuwait
Latvia - Latvia
Liechtenstein - Liechtenstein
Luxembourg - Luxembourg
Macedonia - Macedonia
Madagascar - Madagascar
Malaysia - Malaysia
Mauritius - Mauritius
Mexico - Mexico
Mongolia - Mongolia
Montenegro - Montenegro
Morocco - Morocco
Mozambique - Mozambique
Myanmar - Myanmar
Nepal - Nepal
Netherlands - Netherlands
New Zealand - NewZealand
Nigeria - Nigeria
Norway - Norway
Oman - Oman
Pakistan - Pakistan
Palestine - Palestine
Panama - Panama
Papua New Guinea - PapuaNewGuinea
Paraguay - Paraguay
Peru - Peru
Philippines - Philippines
Poland - Poland
Portugal - Portugal
Puerto Rico - PuertoRico
Qatar - Qatar
Republic of Lithuania - RepublicOfLithuania
Republic of Moldova - RepublicOfMoldova
Romania - Romania
Russia - Russia
Saudi Arabia - SaudiArabia
Senegal - Senegal
Serbia - Serbia
Seychelles - Seychelles
Singapore - Singapore
Slovakia - Slovakia
Slovenia - Slovenia
Somalia - Somalia
South Africa - SouthAfrica
South Korea - SouthKorea
Spain - Spain
Sri Lanka - SriLanka
Sudan - Sudan
Suriname - Suriname
Sweden - Sweden
Switzerland - Switzerland
Syria - Syria
Taiwan - Taiwan
Tajikistan - Tajikistan
Thailand - Thailand
Trinidad and Tobago - TrinidadandTobago
Tunisia - Tunisia
Turkey - Turkey
Uganda - Uganda
Ukraine - Ukraine
United Arab Emirates - UnitedArabEmirates
United Kingdom - UnitedKingdom
United States - UnitedStates
Uzbekistan - Uzbekistan
Venezuela - Venezuela
Vietnam - Vietnam
Zambia - Zambia
"},{"location":"features/custom-models/","title":"Custom models","text":""},{"location":"features/custom-models/#run-with-custom-model","title":"Run with custom model","text":"
You can instantiate Parsera with any chat model supported by LangChain. For example, to run a model from Azure:
With the ParseraScript class you can execute custom Playwright scripts during scraping. There are 2 types of script you can run:
initial_script, which is executed during the first run of ParseraScript. It is useful when you need to log in to access the data.
playwright_script, which runs on every run call and lets you perform custom actions before data is extracted. It is useful when data is hidden behind a button.
"},{"location":"features/custom-playwright/#example-log-in-and-load-data","title":"Example: log in and load data","text":"
You can log in to parsera.org and get the number of credits with the following code:
from playwright.async_api import Page\nfrom parsera import ParseraScript\n\n# Define the script to execute during the session creation\nasync def initial_script(page: Page) -> Page:\n await page.goto(\"https://parsera.org/auth/sign-in\")\n await page.wait_for_load_state(\"networkidle\")\n await page.get_by_label(\"Email\").fill(EMAIL)\n await page.get_by_label(\"Password\").fill(PASSWORD)\n await page.get_by_role(\"button\", name=\"Sign In\", exact=True).click()\n await page.wait_for_selector(\"text=Playground\")\n return page\n\n# This script is executed after the url is opened\nasync def repeating_script(page: Page) -> Page:\n await page.wait_for_timeout(1000) # Wait one second for page to load\n return page\n\nparsera = ParseraScript(model=model, initial_script=initial_script)\nresult = await parsera.arun(\n url=\"https://parsera.org/app\",\n elements={\n \"credits\": \"number of credits\",\n },\n playwright_script=repeating_script,\n)\n
The page is fetched via the ParseraScript.loader, which contains the playwright instance.
from parsera import ParseraScript\n\nparsera = ParseraScript(model=model)\n\n## You can manually initialize a playwright session and modify it:\nawait parsera.new_session()\nawait parsera.loader.load_content(url=url)\n\n## After the page is loaded you can access playwright elements, like Page:\nawait parsera.loader.page.get_by_role('button').click()\n\n## Next you can run the extraction process\nresult = await parsera.arun(\n url=extraction_url,\n elements=elements_dict,\n)\n
"},{"location":"features/docker/","title":"Docker","text":""},{"location":"features/docker/#running-in-docker","title":"Running in Docker","text":"
You can get access to the CLI or development environment using Docker.
Create a scheme.json file with the parsing scheme in the repository root directory.
Run make up in this directory.
The output will be saved to output/results.json.
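The steps above could be scripted roughly as follows. This is only a sketch: the helpers `write_scheme` and `read_results` are hypothetical, and the scheme is assumed to be a mapping of field names to descriptions (the same shape as the `elements` argument used elsewhere in these docs), which is not confirmed for scheme.json. Only the file names scheme.json and output/results.json come from the text.

```python
import json
from pathlib import Path


def write_scheme(repo_root, elements):
    '''Write scheme.json into the repository root and return its path.

    Assumption: the scheme is a name -> description mapping.
    '''
    path = Path(repo_root) / 'scheme.json'
    path.write_text(json.dumps(elements, indent=2), encoding='utf-8')
    return path


def read_results(repo_root):
    '''Load output/results.json after `make up` has produced it.'''
    path = Path(repo_root) / 'output' / 'results.json'
    return json.loads(path.read_text(encoding='utf-8'))
```

A typical flow would call `write_scheme('.', {...})`, run `make up` from the shell, and then call `read_results('.')`.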
"},{"location":"features/docker/#docker-make-targets","title":"Docker Make Targets","text":"
make build # Build Docker image\n\nmake up # Start containers using Docker Compose\n\nmake down # Stop and remove containers using Docker Compose\n\nmake restart # Restart containers using Docker Compose\n\nmake logs # View logs of the containers\n\nmake shell # Open a shell in the running container\n\nmake clean # Remove all stopped containers, unused networks, and dangling images\n