Releases: spider-rs/spider
v2.13.5
v2.12.12
Fix smart mode re-rendering and performance
- fix smart mode re-rendering inline JS detection
- improve smart mode parsing performance
- fix smart mode HTML encoding
- add pinned HTML pre-parsing
- add Chrome status code check before performing full actions
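Smart mode crawls over plain HTTP first and only falls back to Chrome rendering when a page needs JavaScript. A minimal usage sketch, assuming the smart feature flag is enabled and the crawl_smart entry point:
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://example.com");
    // HTTP first; re-render with Chrome only when JS is required.
    website.crawl_smart().await;
    println!("pages found: {}", website.get_links().len());
}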
Full Changelog: v2.11.20...v2.12.12
v2.11.20
What's Changed
Major performance improvement: pending crawl tasks are now processed concurrently. You can now get all Next.js SSG pages on the initial crawl for websites that do not expose links and use dynamic event listeners for routing.
- fix loop blocking tasks
- improve crawl performance by processing tasks concurrently
- fix page absolute link joining
- add wait_for_dom to target element updates (Chrome)
- add alert polyfill blocking prevention
- add missing Chrome navigation request timeout for the HTTP future
- add ignore assets when crawling over HTTP
- add with_block_assets builder config for non-HTML server responses (see the sketch after this list)
- perf(chrome): add skipping of other resources
- feat(page): add Next.js SSG build path handling
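A minimal sketch of the new with_block_assets flag, assuming it takes a bool like the other builder options:
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://example.com")
        // Skip non-HTML assets (images, styles, scripts) on plain HTTP crawls.
        .with_block_assets(true)
        .build()
        .unwrap();
    website.crawl().await;
}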
Full Changelog: v2.11.0...v2.11.20
v2.10.27
What's Changed
- fix protocol handling for valid links to crawl
- fix subdomain and TLD matching
- add retry on empty server responses
- add initial request status code storing
- fix auto-encoding detection for HTML
- fix openai and fs feature compilation
- add layui to UI JS frameworks and smart mode jQuery handling
- chore(transforms): add optional ignore tags
- chore(budget): fix whitelist/blacklist budgeting (see the sketch after this list)
- chore(smart): fix whitelist/blacklist establishment
- chore(openai): add json_schema option to GPT configs
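A sketch of budgets combined with a whitelist; the with_budget builder follows the project README, while with_whitelist_url and the re-exported hashbrown map are assumptions here:
use spider::hashbrown::HashMap;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://example.com")
        // "*" caps the whole crawl; path keys cap individual sections.
        .with_budget(Some(HashMap::from([("*", 200), ("/docs", 10)])))
        // Only crawl URLs matching the whitelist (assumed builder name).
        .with_whitelist_url(Some(vec!["/docs".into()]))
        .build()
        .unwrap();
    website.crawl().await;
}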
Full Changelog: v2.10.6...v2.10.27
v2.10.6
What's Changed
- add html lang auto-encoding handling to improve detection
- add exclude_selector and root_selector transformation output formats (see the sketch after this list)
- add bin file handling to prevent SOF transformations
- chore(chrome): fix window navigator stealth handling
- chore: fix subdomains and tld handling
- chore(chrome): add automation all routes handling
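A hypothetical sketch of the new selector options; the field names on TransformConfig are assumptions drawn from the notes above, so verify them against the spider_transformations docs:
use spider_utils::spider_transformations::transformation::content::{
    transform_content, ReturnFormat, TransformConfig,
};

fn to_markdown(page: &spider::page::Page) -> String {
    let mut conf = TransformConfig::default();
    conf.return_format = ReturnFormat::Markdown;
    // Hypothetical fields mirroring the release notes; check the crate docs:
    // conf.root_selector = Some("main".into());
    // conf.exclude_selector = Some("nav, footer".into());
    transform_content(page, &conf, &None, &None)
}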
Full Changelog: v2.9.15...v2.10.6
v2.9.15
What's Changed
- add XPath data extraction support to spider_utils
- add XML return format for spider_transformations
- chore(transformations): add root selector across formats #219
Example extracting data via XPath:
use spider_utils::{
    build_selectors, css_query_select_map_streamed, QueryCSSMap, QueryCSSSelectSet,
};

async fn extract() {
    // Map a name to a selector set; XPath expressions are supported alongside CSS.
    let map = QueryCSSMap::from([(
        "list",
        QueryCSSSelectSet::from(["//*[@class='list']"]),
    )]);
    let data = css_query_select_map_streamed(
        r#"<html><body><ul class="list"><li>Test</li></ul></body></html>"#,
        &build_selectors(map),
    )
    .await;
    assert!(!data.is_empty(), "XPath extraction failed");
}
Full Changelog: v2.8.28...v2.9.15
v2.8.29
What's Changed
Fix request interception for remote connections. The intercept builder now uses spider::features::chrome_common::RequestInterceptConfiguration, which adds more control (a usage sketch follows the list below).
- chrome performance improvement reducing duplicate events
- chore(chrome): add set extra headers
- chore(smart): add HTTP fallback for Chrome smart mode requests
- chore(chrome): add spoofed plugins
- chore(real-browser): add mouse movement for WAF
- chore(chrome): patch logs in stealth mode
- chore(page): fix URL join with empty slash
- chore(chrome): fix returning page response headers and cookies
- chore(page): add empty page validation
- chore(config): add serializable crawl configuration
- chore(retry): add 502 not-found retry check
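A configuration sketch for the new intercept type, assuming RequestInterceptConfiguration::new(enabled) and the chrome_intercept feature flag:
use spider::features::chrome_common::RequestInterceptConfiguration;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://example.com")
        // Enable request interception; the config struct exposes the finer controls.
        .with_chrome_intercept(RequestInterceptConfiguration::new(true))
        .build()
        .unwrap();
    website.crawl().await;
}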
Full Changelog: v2.7.1...v2.8.29
v2.7.1
What's Changed
- add chrome remote connection proxy ability
- add context handling and disposing for Chrome
- chore(chrome): fix concurrent pages opening remote WS connections
- chore(chrome): add cookie setting for the browser
- chore(chrome): fix connecting to Chrome when using a load balancer
- feat(website): add retry and rate-limiting handling (see the sketch below)
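A sketch of the retry handling, assuming a with_retry builder that takes the maximum number of attempts:
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://example.com")
        // Retry failed requests up to 3 times (assumed builder name).
        .with_retry(3)
        .build()
        .unwrap();
    website.crawl().await;
}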
Full Changelog: v2.6.15...v2.7.1
v2.6.15
- fix parsing links for top-level redirected domains
- add website.with_preserve_host_header (see the sketch below)
- default TLS to reqwest_native_tls_native_roots
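A short usage sketch for the new host header option, assuming it takes a bool:
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://example.com")
        // Keep the configured Host header on outbound requests (assumed signature).
        .with_preserve_host_header(true)
        .build()
        .unwrap();
    website.crawl().await;
}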
Full Changelog: v2.5.2...v2.6.15
HTML Transformations
What's Changed
We open-sourced our transformation utilities for Spider Cloud, which provide high-performance output to markdown, text, and other formats.
You can install spider_transformations on its own, or use the transformations feature flag when installing spider_utils.
use spider::tokio;
use spider::website::Website;
use spider_utils::spider_transformations::transformation::content::{
    transform_content, ReturnFormat, TransformConfig,
};
use tokio::io::AsyncWriteExt;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2: tokio::sync::broadcast::Receiver<spider::page::Page> =
        website.subscribe(0).unwrap();
    let mut stdout = tokio::io::stdout();
    let mut conf = TransformConfig::default();
    conf.return_format = ReturnFormat::Markdown;

    // Stream each crawled page and print it as markdown.
    let join_handle = tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            let markup = transform_content(&res, &conf, &None, &None);
            let _ = stdout
                .write_all(format!("- {}\n {}\n", res.get_url(), markup).as_bytes())
                .await;
        }
        stdout
    });

    let start = std::time::Instant::now();
    website.crawl().await;
    // Dropping the subscription ends the streaming task above.
    website.unsubscribe();
    let duration = start.elapsed();

    let mut stdout = join_handle.await.unwrap();
    let _ = stdout
        .write_all(
            format!(
                "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
                duration,
                website.get_links().len()
            )
            .as_bytes(),
        )
        .await;
}
Full Changelog: v2.5.2...v2.6.2