Releases: spider-rs/spider

v2.13.5

07 Nov 21:24

What's Changed

  • perf reduce cpu usage for streaming rewriter

Full Changelog: v2.12.12...v2.13.5

v2.12.12

05 Nov 03:09

Fix smart mode re-rendering and performance

  • fix smart mode re-rendering inline js detection
  • perf improve smart mode parsing
  • fix encoding smart mode html
  • add pin html pre-parsing
  • add chrome status code check for performing full actions

Full Changelog: v2.11.20...v2.12.12

v2.11.20

31 Oct 20:28

What's Changed

Major crawl performance improvement: pending tasks are now processed concurrently. You can now get all Next.js SSG pages on the initial crawl for websites that do not expose links and rely on dynamic event listeners for routing.

  • fix loop blocking tasks
  • improve crawl performance processing tasks concurrent
  • fix page absolute link joining
  • add wait_for_dom to target element updates chrome
  • add alert polyfill blocking prevention
  • add missing chrome navigate request timeout for http future
  • add ignore assets when crawling http
  • add with_block_assets builder config for Server response non html
  • perf(chrome): add skip other resources
  • feat(page): add nextjs build ssg path handling
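
As a rough illustration of the "ignore assets when crawling http" entry above, the sketch below filters out URLs that point at static assets before they are queued. The function name `is_asset_url` and the extension list are hypothetical and chosen for this example; this is not spider's implementation.

```rust
/// Hypothetical helper: returns true when the URL path ends in a common
/// static-asset extension. The extension list is an assumption made for
/// illustration, not the list spider uses.
fn is_asset_url(url: &str) -> bool {
    const ASSET_EXTENSIONS: &[&str] = &[
        "png", "jpg", "jpeg", "gif", "svg", "ico", "css", "woff", "woff2", "mp4",
    ];

    // Drop the query string and fragment first so they cannot be mistaken for a path.
    let without_query = url
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or(url);
    // Drop the scheme, if any, so the next '/' marks the start of the path.
    let without_scheme = without_query
        .split_once("://")
        .map_or(without_query, |(_, rest)| rest);
    // A bare host such as "example.com" has no path and is never an asset.
    let path = match without_scheme.split_once('/') {
        Some((_, path)) => path,
        None => return false,
    };
    // Compare the extension of the last path segment, ignoring ASCII case.
    let last_segment = path.rsplit('/').next().unwrap_or("");
    match last_segment.rsplit_once('.') {
        Some((_, ext)) => ASSET_EXTENSIONS
            .iter()
            .any(|known| known.eq_ignore_ascii_case(ext)),
        None => false,
    }
}
```

A crawler loop could call `is_asset_url(link)` on each discovered link and skip the request when it returns true. Checking only the extension is cheap but approximate, since the server's Content-Type is the real authority.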

Full Changelog: v2.11.0...v2.11.20

v2.10.27

23 Oct 13:43

What's Changed

  • fix protocol handling valid links to crawl
  • fix subdomains and tld handling matching
  • add empty server response retry
  • add initial request status code storing
  • fix auto-encoding detection for html
  • fix openai compile and fs compile
  • add layui ui js frameworks and smartmode handling jquery
  • chore(transforms): add optional ignore tags
  • chore(budget): fix whitelist/blacklist budgeting
  • chore(smart): fix whitelist/blacklist establish
  • chore(openai): add json_schema option gpt configs

Full Changelog: v2.10.6...v2.10.27

v2.10.6

22 Oct 10:47

What's Changed

  1. add html lang auto encoding handling to improve detection
  2. add exclude_selector and root_selector transformations output formats
  3. add bin file handling to prevent SOF transformations
  4. chore(chrome): fix window navigator stealth handling
  5. chore: fix subdomains and tld handling
  6. chore(chrome): add automation all routes handling

Full Changelog: v2.9.15...v2.10.6

v2.9.15

09 Oct 13:50

What's Changed

  • add XPath data extraction support spider_utils
  • add XML return format for spider_transformations
  • chore(transformations): add root selector across formats #219
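    Example: extracting data via XPath.
    // Assumes these items are exported from the spider_utils crate root.
    use spider_utils::{
        build_selectors, css_query_select_map_streamed, QueryCSSMap, QueryCSSSelectSet,
    };

    // Map a result key ("list") to the XPath expressions that feed it.
    let map = QueryCSSMap::from([(
        "list",
        QueryCSSSelectSet::from(["//*[@class='list']"]),
    )]);
    // Run the selectors over the HTML; this must be awaited inside an async context.
    let data = css_query_select_map_streamed(
        r#"<html><body><ul class="list"><li>Test</li></ul></body></html>"#,
        &build_selectors(map),
    )
    .await;

    assert!(!data.is_empty(), "XPath extraction failed");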

Full Changelog: v2.8.28...v2.9.15

v2.8.29

05 Oct 14:55

What's Changed

Fixes request interception for remote connections. The intercept builder now takes spider::features::chrome_common::RequestInterceptConfiguration, which gives more control over interception.

  • chrome performance improvement reducing dup events
  • chore(chrome): add set extra headers
  • chore(smart): add http fallback chrome smart mode request
  • chore(chrome): add spoofed plugins
  • chore(real-browser): add mouse movement waf
  • chore(chrome): patch logs stealth mode
  • chore(page): fix url join empty slash
  • chore(chrome): fix return page response headers and cookies
  • chore(page): add empty page validation
  • chore(config): add serializable crawl configuration
  • chore(retry): add check 502 notfound retry

Full Changelog: v2.7.1...v2.8.29

v2.7.1

30 Sep 23:14

What's Changed

  • add chrome remote connection proxy ability.
  • add context handling and disposing chrome.
  • chore(chrome): fix concurrent pages opening remote ws connections
  • chore(chrome): add cookie setting browser
  • chore(chrome): fix connecting to chrome when using a LB
  • feat(website): add retry and rate limiting handling
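
The retry and rate limiting entry above can be pictured with a small retry loop that uses exponential backoff. This is a minimal sketch using only the standard library: `is_retryable`, `fetch_with_retry`, and the set of retryable status codes are assumptions made for this example, not spider's API or policy.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical policy: status codes treated as transient and worth retrying.
fn is_retryable(status: u16) -> bool {
    matches!(status, 429 | 502 | 503 | 504)
}

/// Calls `fetch` until it returns a non-retryable status or attempts run out.
/// Returns the last status seen and the number of attempts made.
fn fetch_with_retry<F>(mut fetch: F, max_attempts: u32, base_delay: Duration) -> (u16, u32)
where
    F: FnMut() -> u16,
{
    // Always make at least one attempt.
    let max_attempts = max_attempts.max(1);
    let mut attempt = 1;
    loop {
        let status = fetch();
        if !is_retryable(status) || attempt >= max_attempts {
            return (status, attempt);
        }
        // Exponential backoff: base_delay, then 2x, 4x, and so on.
        let factor = 2u32.saturating_pow(attempt - 1);
        sleep(base_delay.saturating_mul(factor));
        attempt += 1;
    }
}
```

Here `fetch` stands in for whatever performs the request and reports a status code. An async crawler would use a non-blocking timer instead of `std::thread::sleep`, and would typically honor a `Retry-After` header on 429 responses when one is present.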

Full Changelog: v2.6.15...v2.7.1

v2.6.15

22 Sep 00:27
  • fix parsing links for top level redirected domains
  • add website.with_preserve_host_header
  • default tls reqwest_native_tls_native_roots

Full Changelog: v2.5.2...v2.6.15

HTML Transformations

21 Sep 12:29

What's Changed

We open sourced the transformation utils from Spider Cloud, which provide high-performance output to markdown, text, and other formats.

You can install spider_transformations on its own or enable the transformations feature flag when installing spider_utils.

use spider::tokio;
use spider::website::Website;
use spider_utils::spider_transformations::transformation::content::{
    transform_content, ReturnFormat, TransformConfig,
};
use tokio::io::AsyncWriteExt;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2: tokio::sync::broadcast::Receiver<spider::page::Page> =
        website.subscribe(0).unwrap();
    let mut stdout = tokio::io::stdout();

    let mut conf = TransformConfig::default();
    conf.return_format = ReturnFormat::Markdown;

    let join_handle = tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            let markup = transform_content(&res, &conf, &None, &None);

            let _ = stdout
                .write_all(format!("- {}\n {}\n", res.get_url(), markup).as_bytes())
                .await;
        }
        stdout
    });

    let start = std::time::Instant::now();
    website.crawl().await;
    website.unsubscribe();
    let duration = start.elapsed();
    let mut stdout = join_handle.await.unwrap();

    let _ = stdout
        .write_all(
            format!(
                "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
                duration,
                website.get_links().len()
            )
            .as_bytes(),
        )
        .await;
}

Full Changelog: v2.5.2...v2.6.2