This repository has been archived by the owner on Mar 25, 2024. It is now read-only.

Releases: KWARC/llamapun

Upgrades and statement extraction for arXMLiv 08.2019

29 Sep 21:02

arXMLiv 08.2019 release

19 Sep 00:45

Tagged release used to extract the token model and embeddings for the arXMLiv 08.2019 corpus.

Paragraph dataset extraction, first public release

03 Jun 19:16

A derivative dataset from arXMLiv 08.2018, intended for "statement classification" of paragraphs, has been generated with what is now the 0.3.2 release of llamapun.

For details see #34

Parallel primitives

16 Apr 19:28

It is now possible to use llamapun while fully utilizing the available CPU cores (the degree of parallelism is configurable, as in rayon).

Most of the examples have been refactored to use the parallel primitives, and can see a 20x speedup on high-end chips with 16+ cores. On such hardware, a pass over arXMLiv 08.2018 now takes between 2 and 3 hours for a lightweight task (frequency reports, token models, etc.).

The library also uses the parallel-friendly RoNode libxml struct, which allows for additional gains when iterating over the DOM.

Example from corpus_mathml_stats:

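use std::collections::HashMap;
use rayon::prelude::*;
use llamapun::parallel_data::Corpus;
// ... (dfs_record and open_ended are defined elsewhere in the example)
let corpus = Corpus::new(corpus_path);
// Walk the corpus in parallel; the closure returns one frequency catalog per document.
let catalog = corpus.catalog_with_parallel_walk(|document| {
  document
    .get_math_nodes()
    .into_par_iter()
    // Map: record the frequencies found in a single math node.
    .map(|math| {
      let mut catalog = HashMap::new();
      dfs_record(math, &open_ended, &mut catalog);
      catalog
    })
    // Reduce: merge the per-node catalogs by summing their counts.
    .reduce(HashMap::new, |mut map1, map2| {
      for (k, v) in map2 {
        let entry = map1.entry(k).or_insert(0);
        *entry += v;
      }
      map1
    })
});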
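
The map and reduce steps in the example above can also be tried in isolation, without llamapun or rayon. The following is a minimal sketch using only the standard library (scoped threads, Rust 1.63+). The names count_chunk, merge and parallel_count are hypothetical and chosen for illustration; they are not part of llamapun, and plain string tokens stand in for math nodes.

```rust
use std::collections::HashMap;
use std::thread;

/// Map step (hypothetical helper): count token frequencies in one chunk.
fn count_chunk(chunk: &[&str]) -> HashMap<String, u64> {
    let mut catalog = HashMap::new();
    for &token in chunk {
        *catalog.entry(token.to_string()).or_insert(0) += 1;
    }
    catalog
}

/// Reduce step (hypothetical helper): merge two catalogs by summing counts.
fn merge(mut map1: HashMap<String, u64>, map2: HashMap<String, u64>) -> HashMap<String, u64> {
    for (k, v) in map2 {
        *map1.entry(k).or_insert(0) += v;
    }
    map1
}

/// Count tokens across all chunks, spawning one scoped thread per chunk.
/// This stands in for rayon's work-stealing pool, which the library itself uses.
fn parallel_count(chunks: &[Vec<&str>]) -> HashMap<String, u64> {
    thread::scope(|s| {
        // Spawn every worker first, so that they run concurrently.
        let handles: Vec<_> = chunks
            .iter()
            .map(|chunk| s.spawn(move || count_chunk(chunk)))
            .collect();
        // Then join the workers and fold their catalogs into one.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .fold(HashMap::new(), merge)
    })
}
```

Calling parallel_count on the chunks ["mi", "mo", "mi"] and ["mn", "mi"] yields a catalog in which "mi" maps to 3. Because merging by summation is associative and commutative, the result does not depend on the order in which the workers finish, which is what makes this kind of reduction safe to parallelize.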

For details, consult #29

AMS labeled dataset, 08.2018

27 Sep 05:47

Eliminated memory leaks related to libxml use. This release has been used to generate the AMS paragraph dataset induced by the arXMLiv 08.2018 HTML5 corpus.

arXMLiv 08.2018 release

24 Sep 17:52

Changes for generating the arXMLiv 08.2018 token models:

  • Update dependencies
  • Improve corpus_token_model generation to include math lexemes
  • Improve paragraph iterator to skip over paragraphs containing ltx_ERROR markup
  • Improve sentence tokenization to treat words with any capital letters as potential sentence breakers
  • Word lexemes now properly attach 's possessives
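
As an illustration of the capital-letter rule in the list above, here is a minimal sketch of such a check. The function name may_break_sentence is hypothetical and this is not llamapun's actual tokenizer code; it only shows what "any capital letters" could mean for a single word.

```rust
/// Hypothetical helper, not part of llamapun: returns true when the word
/// contains at least one uppercase character anywhere, so that a tokenizer
/// could treat it as a potential sentence breaker.
fn may_break_sentence(word: &str) -> bool {
    word.chars().any(char::is_uppercase)
}
```

Under this reading, both "The" and "arXiv" qualify as candidates, while "lemma" does not. A real tokenizer would combine such a check with further evidence, for instance preceding punctuation, before committing to a sentence break.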

Public arXMLiv dataset release

22 Jan 21:55
9cb93d3

This release is tagged to mark the library version used for generating the Corpus Token Model for the arXMLiv 08.2017 dataset, to be released 02.2018.

A major upgrade is the merge of @jfschaefer's pattern-matching component, as described in #8.

The release also includes a refresh of the dependencies for 2018, and minor bug fixes. llamapun still requires a nightly release of Rust to build and run.