[components] Scrapeless - add new actions #17086

Open · wants to merge 5 commits into base: master
16 changes: 16 additions & 0 deletions components/scrapeless/README.md
@@ -0,0 +1,16 @@
# Overview

Scrapeless is a platform for powerful, compliant web data extraction. With tools like the Universal Scraping API, it makes it easy to access and gather data from complex sites, so you can focus on insights while Scrapeless handles the technical hurdles.

# Example Use Cases

1. **Scraping API**: Endpoints for fresh, structured data from 100+ popular sites.
2. **Universal Scraping API**: Access any website at scale and say goodbye to blocks.
3. **Crawler**: Extract data from single pages or traverse entire domains.

# Getting Started

## Generating an API Key

1. If you are not yet a Scrapeless member, sign up for a free account at [Scrapeless](https://app.scrapeless.com/passport/register).
2. Once registered, go to the API Key Management page in the app settings to generate an API key.
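With an API key in hand, a request can be assembled and sent with `fetch`. This is only a sketch: the `x-api-token` header name and the request shape below are illustrative assumptions, not taken from the official API reference.

```javascript
// Hypothetical request builder for a Scrapeless endpoint.
// The "x-api-token" header name is an assumption for illustration.
function buildScrapelessRequest(apiKey, body) {
  return {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "x-api-token": apiKey, // assumed header name
    },
    body: JSON.stringify(body),
  };
}

// Usage: pass the built options to fetch against the endpoint you need.
const req = buildScrapelessRequest("YOUR_API_KEY", { url: "https://example.com" });
```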
89 changes: 89 additions & 0 deletions components/scrapeless/actions/crawler/crawler.mjs
@@ -0,0 +1,89 @@
import scrapeless from "../../scrapeless.app.mjs";

export default {
key: "scrapeless-crawler",
name: "Crawler",
description: "Crawl any website at scale and say goodbye to blocks. [See the documentation](https://apidocs.scrapeless.com/api-17509010).",
version: "0.0.1",
type: "action",
props: {
scrapeless,
apiServer: {
type: "string",
label: "API Server",
description: "The API server to use",
default: "crawl",
options: [
{
label: "Crawl",
value: "crawl",
},
{
label: "Scrape",
value: "scrape",
},
],
reloadProps: true,
},
},
async run({ $ }) {
const {
scrapeless, apiServer, ...inputProps
} = this;

const browserOptions = {
"proxy_country": "ANY",
"session_name": "Crawl",
"session_recording": true,
"session_ttl": 900,
};

let response;

if (apiServer === "crawl") {
response =
await scrapeless._scrapelessClient().scrapingCrawl.crawl.crawlUrl(inputProps.url, {
limit: inputProps.limitCrawlPages,
browserOptions,
});
}

if (apiServer === "scrape") {
response =
await scrapeless._scrapelessClient().scrapingCrawl.scrape.scrapeUrl(inputProps.url, {
browserOptions,
});
}

if (response?.status === "completed" && response?.data) {
$.export("$summary", `Successfully retrieved crawling results for ${inputProps.url}`);
return response.data;
} else {
throw new Error(response?.error || "Failed to retrieve crawling results");
}
},
async additionalProps() {
const { apiServer } = this;

const props = {};

if (apiServer === "crawl" || apiServer === "scrape") {
props.url = {
type: "string",
label: "URL to Crawl",
description: "The URL to crawl. To crawl in batches, refer to the SDK documentation",
};
}

if (apiServer === "crawl") {
props.limitCrawlPages = {
type: "integer",
label: "Number Of Subpages",
default: 5,
description: "Maximum number of subpages to crawl",
};
}

return props;
},
};
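The `run()` method above dispatches on `apiServer` between `scrapingCrawl.crawl.crawlUrl` and `scrapingCrawl.scrape.scrapeUrl`, then unwraps a `completed` response. That control flow can be exercised without the real SDK by stubbing a client with the same method shape — a sketch of the action's logic, not the actual Scrapeless client:

```javascript
// Stub with the same method shape the action expects from the SDK.
function makeStubClient(calls) {
  return {
    scrapingCrawl: {
      crawl: {
        crawlUrl: async (url, opts) => {
          calls.push(["crawlUrl", url, opts.limit]);
          return { status: "completed", data: [{ url }] };
        },
      },
      scrape: {
        scrapeUrl: async (url) => {
          calls.push(["scrapeUrl", url]);
          return { status: "completed", data: { url } };
        },
      },
    },
  };
}

// Mirrors run(): dispatch on apiServer, then unwrap the response.
async function runCrawler(client, apiServer, inputProps) {
  let response;
  if (apiServer === "crawl") {
    response = await client.scrapingCrawl.crawl.crawlUrl(inputProps.url, {
      limit: inputProps.limitCrawlPages,
    });
  } else if (apiServer === "scrape") {
    response = await client.scrapingCrawl.scrape.scrapeUrl(inputProps.url, {});
  }
  if (response?.status === "completed" && response?.data) {
    return response.data;
  }
  throw new Error(response?.error || "Failed to retrieve crawling results");
}
```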
@@ -4,7 +4,7 @@ export default {
key: "scrapeless-get-scrape-result",
name: "Get Scrape Result",
description: "Retrieve the result of a completed scraping job. [See the documentation](https://apidocs.scrapeless.com/api-11949853)",
version: "0.0.1",
version: "0.0.2",
type: "action",
props: {
scrapeless,
124 changes: 124 additions & 0 deletions components/scrapeless/actions/scraping-api/scraping-api.mjs
@@ -0,0 +1,124 @@
import scrapeless from "../../scrapeless.app.mjs";
import { log } from "../../common/utils.mjs";
export default {
key: "scrapeless-scraping-api",
name: "Scraping API",
description: "Endpoints for fresh, structured data from 100+ popular sites. [See the documentation](https://apidocs.scrapeless.com/api-12919045).",
version: "0.0.1",
type: "action",
props: {
scrapeless,
apiServer: {
type: "string",
label: "API Server",
default: "googleSearch",
description: "The API server to use",
options: [
{
label: "Google Search",
value: "googleSearch",
},
],
reloadProps: true,
},
},
async run({ $ }) {
const {
scrapeless, apiServer, ...inputProps
} = this;

const MAX_RETRIES = 3;
const DELAY = 1000 * 10; // 10 seconds
const { run } = $.context;

let submitData;
let job;

// pre check if the job is already in the context
if (run?.context?.job) {
job = run.context.job;
}

if (apiServer === "googleSearch") {
submitData = {
actor: "scraper.google.search",
input: {
q: inputProps.q,
hl: inputProps.hl,
gl: inputProps.gl,
},
};
}

if (!submitData) {
throw new Error("Unsupported API server: no actor configured");
}
// 1. Create a new scraping job
if (!job) {
job = await scrapeless._scrapelessClient().deepserp.createTask({
actor: submitData.actor,
input: submitData.input,
});

if (job.status === 200) {
$.export("$summary", "Successfully retrieved scraping results");
return job.data;
}

log("task in progress");
}

// 2. Wait for the job to complete
if (run.runs === 1) {
$.flow.rerun(DELAY, {
job,
}, MAX_RETRIES);
} else if (run.runs > MAX_RETRIES) {
throw new Error("Max retries reached");
} else if (job && job?.data?.taskId) {
const result = await scrapeless._scrapelessClient().deepserp.getTaskResult(job.data.taskId);
if (result.status === 200) {
$.export("$summary", "Successfully retrieved scraping results");
return result.data;
} else {
$.flow.rerun(DELAY, {
job,
}, MAX_RETRIES);
}
} else {
throw new Error("No job found");
}

},
async additionalProps() {
const { apiServer } = this;

const props = {};

if (apiServer === "googleSearch") {
props.q = {
type: "string",
label: "Search Query",
description: "Parameter defines the query you want to search. You can use anything that you would use in a regular Google search. e.g. inurl:, site:, intitle:. We also support advanced search query parameters such as as_dt and as_eq.",
default: "coffee",
};

props.hl = {
type: "string",
label: "Language",
description: "Parameter defines the language to use for the Google search. It's a two-letter language code. (e.g., en for English, es for Spanish, or fr for French).",
default: "en",
};

props.gl = {
type: "string",
label: "Country",
description: "Parameter defines the country to use for the Google search. It's a two-letter country code. (e.g., us for the United States, uk for United Kingdom, or fr for France).",
default: "us",
};
}

return props;
},
};
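The action above submits a task, then polls `getTaskResult` via `$.flow.rerun` until the result is ready or `MAX_RETRIES` is exhausted. Outside of Pipedream, the same create-then-poll flow can be sketched as a plain loop (stubbed `getTaskResult`; status 200 is taken to mean "result ready", as in the action):

```javascript
// The create-then-poll flow as a plain loop. In the action, the
// "not ready yet" branch is $.flow.rerun(DELAY, { job }, MAX_RETRIES);
// here it is simply another loop iteration.
async function pollTaskResult(getTaskResult, taskId, maxRetries) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const result = await getTaskResult(taskId);
    if (result.status === 200) {
      return result.data;
    }
  }
  throw new Error("Max retries reached");
}
```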
@@ -7,7 +7,7 @@ export default {
key: "scrapeless-submit-scrape-job",
name: "Submit Scrape Job",
description: "Submit a new web scraping job with specified target URL and extraction rules. [See the documentation](https://apidocs.scrapeless.com/api-11949852)",
version: "0.0.1",
version: "0.0.2",
type: "action",
props: {
scrapeless,
@@ -0,0 +1,86 @@
import scrapeless from "../../scrapeless.app.mjs";
import { countryOptions } from "../../common/constants.mjs";

export default {
key: "scrapeless-universal-scraping-api",
name: "Universal Scraping API",
description: "Access any website at scale and say goodbye to blocks. [See the documentation](https://apidocs.scrapeless.com/api-11949854).",
version: "0.0.1",
type: "action",
props: {
scrapeless,
apiServer: {
type: "string",
label: "API Server",
default: "webUnlocker",
description: "The API server to use",
options: [
{
label: "Web Unlocker",
value: "webUnlocker",
},
],
reloadProps: true,
},
},
async run({ $ }) {
const {
scrapeless,
apiServer, ...inputProps
} = this;

if (apiServer === "webUnlocker") {
const response = await scrapeless._scrapelessClient().universal.scrape({
actor: "unlocker.webunlocker",
input: {
url: inputProps.url,
headless: inputProps.headless,
js_render: inputProps.jsRender,
},
proxy: {
country: inputProps.country,
},
});

$.export("$summary", "Successfully retrieved scraping results for Web Unlocker");
return response;
}
},
async additionalProps() {
const { apiServer } = this;

const props = {};

if (apiServer === "webUnlocker") {
props.url = {
type: "string",
label: "Target URL",
description: "Parameter defines the URL you want to scrape.",
};

props.jsRender = {
type: "boolean",
label: "JS Render",
description: "Whether to render JavaScript on the target page",
default: true,
};

props.headless = {
type: "boolean",
label: "Headless",
description: "Whether to use a headless browser",
default: true,
};

props.country = {
type: "string",
label: "Country",
description: "Two-letter country code for the proxy to use",
default: "ANY",
options: countryOptions.map((country) => ({
label: country.label,
value: country.value,
})),
};
}

return props;
},
};
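Note the renaming the Web Unlocker action performs: camelCase props (`jsRender`) become snake_case fields (`js_render`) in the payload passed to `universal.scrape`. A sketch of that mapping, with the payload shape read off the action above rather than from the official API reference:

```javascript
// Maps the action's camelCase props onto the snake_case payload
// passed to universal.scrape (shape inferred from the action above).
function buildUnlockerPayload(inputProps) {
  return {
    actor: "unlocker.webunlocker",
    input: {
      url: inputProps.url,
      headless: inputProps.headless,
      js_render: inputProps.jsRender, // camelCase -> snake_case rename
    },
    proxy: {
      country: inputProps.country,
    },
  };
}
```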