Scraping Service is a REST API for scraping dynamic websites using Node.js, Puppeteer and Cheerio. It works in serverless environments such as Vercel.
Made by the team at Weld (www.weldyourownapp.com), the #codefree web/app creation tool:
Start Scraping Service in development mode:
API=dom yarn dev
# you can replace `dom` with: dom-simple (just fetch, no Chromium), image, meta, page
or in production mode:
yarn start
Server will default to http://localhost:3036
API
: dom-simple/dom/image/meta/page – for testing only. See /app/controllers/api folderMAX_BROWSER_THREADS
: default 3 Puppeteer browsersRENDER_TIMEOUT
: default 20000 millisecsPORT
: server portNODE_ENV
: Node.js environment
yarn test
Do a HTTP GET:
http://localhost:3036/api/dom?url=https://news.ycombinator.com&selector=.title+a
or simple with just Fetch:
http://localhost:3036/api/dom-simple?url=https://news.ycombinator.com&selector=.title+a
Results:
{
"time": 792,
"results": [
{
"selector": ".title a",
"count": 61,
"items": [
"Ask a Female Engineer: Thoughts on the Google Memo",
(more items...)
]
}
]
}
Parameters:
url
(required)selector
is a JQuery style selector, defaults tobody
. You can use multiple selectors separated by comma, which leads to more items in theresults
array. Use$
instead of#
for element ID selectors.time
e.g.time=2000
adds extra loading time before accessing DOM. Usetime=networkidle0
to wait until network requests are idle.deep
set totrue
to get recursive object trees, not just first-level text contents.complete
set totrue
to get complete HTML tags, not just text contents.useIndex
set totrue
to use element index instead of class/id.
http://localhost:3036/api/page?url=https://www.weldyourownapp.com
Results:
{
"url": "http://www.tomsoderlund.com",
"length": 13560,
"content": "<html>...</html>"
}
Parameters:
url
(required)time
e.g.&time=2000
adds extra loading time before accessing page content. Default is 100.bodyOnly=true
skips the head of the page
http://localhost:3036/api/meta?url=https://www.weldyourownapp.com
Results:
{
"url":"https://www.weldyourownapp.com",
"general":{
"appleTouchIcons":[
{
"href":"/images/apple-touch-icon.png"
}
],
"icons":[
{
"href":"/images/apple-touch-icon.png"
}
],
"canonical":"http://www.weldyourownapp.com/",
"description":"Create visual, animated, interactive content on your existing web/e-commerce platform.",
"title":"Weld - The Visual CMS"
},
"openGraph":{
"site_name":"Weld - The Visual CMS",
"title":"Weld - The Visual CMS",
"description":"Create visual, animated, interactive content on your existing web/e-commerce platform.",
"locale":"en_US",
"url":"http://www.weldyourownapp.com/",
"image":{
"url":"https://s3-eu-west-1.amazonaws.com/weld-design-kit/weld-logo-square.png"
}
},
"twitter":{
"title":"Weld - The Visual CMS",
"description":"Create visual, animated, interactive content on your existing web/e-commerce platform.",
"card":"summary",
"url":"http://www.weldyourownapp.com/",
"site":"@Weld_io",
"creator":"@Weld_io",
"image":"https://s3-eu-west-1.amazonaws.com/weld-design-kit/weld-logo-square.png"
}
}
http://localhost:3036/api/image?url=https://www.weldyourownapp.com
url
(required)format
:jpeg
(default) orpng
width
: default 800height
: default 450dpr
: deviceScaleFactor, default is 1.0. Note you can use this as a zoom factor; the browser canvas has the same size, but the output image has different size.time
: milliseconds ornetworkidle0
Built on Node.js, Express, Puppeteer, Cheerio, html-metadata.
See vercel.json
– set up as serverless API controllers.
Stack: Heroku-18
Buildpacks:
heroku create MYAPPNAME heroku config:set NODE_ENV=production
heroku buildpacks:add --index 1 https://buildpack-registry.s3.amazonaws.com/buildpacks/jontewks/puppeteer.tgz