owemdjee

Data Science & Image Processing amalgam library in C/C++.

This place is a gathering spot & integration workplace for the C & C++ libraries we choose to use. Think "Façade Pattern" and you're getting warm. 😉 The heavy data lifting will be done in the referenced libraries, while this lib will provide some glue and common ground for them to work in/with.

Reason for this repo

git submodules hasn't been the most, ah, "user-friendly" method to track and manage a set of libraries that you wish to track at source level.

A few problems have been repeatedly observed over our lifetime with git:

  • when it so happens that the importance of & interest in a submoduled library is waning and you want to migrate to another, you can of course invoke git to ditch the old sow and bring in the shiny new one, but that stuff gets quite finicky when you are pedalling back & forth through your commit tree while, e.g., bug-hunting or doing maintenance work on a release branch which isn't up to snuff with the fashion kids yet.

    Yup, that's been much less of a problem since about 2018, but old scars need more than a pat on the arm to heal, if you get my drift.

  • folks haven't always been the happy campers they were supposed to be when they're facing a set of submodules and want to feel safe and sure in their "knowledge" that each library X is at commit Y, when the top of the module tree is itself at commit Z, for we are busy producing a production release, perhaps? That's a wee bit stressful and there have been enough "flukes" with git to make that a not-so-ironclad-as-we-would-like position.

    Over time, I've created several bash shell scripts to help with that buzzin' feelin' of absolute certainty. Useful perhaps, but the cuteness of those wears off pretty darn quickly when many nodes in the submodule tree start cluttering their git repo with those.

And?

This repo is made to ensure we have a single point of reference for all the data munching stuff, at least.

We don't need to git submodule add all those data processing libs into our applications this way, as this is a single submodule to bother that project with. The scripts and other material in here will provide the means to ensure your build and test tools can quickly and easily verify that everything in here is at the commit spot it's supposed to be at.

And when we want to add another lib about data/image processing, we do that in here, so the application-level git repo sees a very stable singular submodule all the time: this repo/lib, not the stuff that will change over time as external libs gain and lose momentum. (We're talking multiyear timespans here!)

Critique?

It's not the most brilliant solution to our problems, as this, of course, becomes a single point of failure that way, but past experience with similar "solutions" has shown that, while it's maybe not always fun, at least we keep track of the management crap in one place, and that has been worth it every time.

And why not do away with git submodule entirely and use packages instead? Because this stuff is important enough that other, quite painful experience has shown us that (binary & source) packages are a wonder and a hassle too: I'd rather have my code tracked and tagged at source level all the way, because that has reduced several bug situations from man-weeks to man-hours: like Gentoo, compile it all, with one compiler only. It doesn't matter whether the bug is in your own code or elsewhere: there are enough moments where one is helped enormously by the ability to step through and possibly tweak a bit of code here or there temporarily to aid the debugging process, so I, at least, prefer full source code.

And that's what this repo is here to provide: the source code gathered and ready for use on our machines.

Why is this repo a solution? And does it scale?

The worst bit first: it scales like rotten eggs. The problem there is two-fold: first, there are (relatively) few people who want to track progress at the bleeding edge, so tooling is consequently limited in power and availability compared to conservative approaches (counted the number of package managers lately?).

Meanwhile, I'm in a spot where I want to ride the bleeding edge, at least most of the time, and I happen to like it that way: my world is much more R&D than product maintenance, so having a means to track, relatively easily, the latest developments in subjects and material of interest is a boon to me. Sure, I'll moan and rant about it once in a while, but if I really wanted to get rid of the need to be flexible and adapt to changes, sometimes often, I'd have gone with the conservative stability of package managers and LTS releases already. I've done that for other parts of my environment, but I don't intend to do so for the part which is largely covered by this repo: source libraries which I intend to use, or am using already, in research tools I'm developing for others and myself.

For that purpose, this repo is a solution, though -- granted -- a sub-optimal one in that it doesn't scale very well. I don't think there's any automated process available to make this significantly faster and more scalable anyway: the fact that I'm riding the bleeding edge and wish to be able to backpedal at will when the latest change of direction or state of affairs of a component is off the rails (from my perspective at least), requires me to be flexible and adaptable to the full gamut of change. There are alternative approaches, also within the git world, but they haven't shown real appeal vs. old skool git submodules -- which is cranky at times and a pain in the neck when you want to ditch something but still need it in another dev branch, moan moan moan, but anyway... -- so here we are.

Side note: submodules which have been picked up for experimentation and inspection but were later deleted from this A list are struck through in the overview below. The rationale: we can still see why we struck an item off the list, and we can never mistakenly re-introduce it after a long time, having forgotten that we once had a look, without at least running into the struck-through entry and re-evaluating the reason first.


Intent

Inter-process communications (IPC)

Lowest possible run-time cost, a.k.a. "run-time overhead": the aim is to have IPC which does not noticeably impact UX (User Experience of the application: responsiveness / UI) on reasonably powered machines. (Users are not expected to have the latest or fastest hardware.)

As large images (PDF page renders), at least, will be transferred, we need a binary-capable protocol.

Programming Languages used: intent and purposes

We expect to use these languages in processes which require this type of IPC:

  • C / C++ (backend No.1)

    • PDF renderer (mupdf)
    • metadata & annotations extractor (mupdf et al)
    • very probably also the database interface (SQLite)
    • [page] image processing (leptonica, openCV, ImageMagick?, whatever turns out to be useful and reasonable to integrate, particularly between the PDF page renderer and the OCR engine, to help us provide a user-tunable PDF text+metadata extractor)
    • OCR (tesseract)
    • "A.I."-assisted tooling to help process and clean PDFs: cover pages, abstract/summary extraction for meta-research, etc. (think ngrams, xdelta, SVM, tensors, author identification, document categorization, document similarity / [near-]duplicate / revision detection, tagging, ...)
    • document identifier key generator, a.k.a. content hasher, for creating a unique key for each document, which can be used as database record index, etc.
      • old: Qiqqa SHA1B
      • new: BLAKE3+Base36
  • C# ("business logic" / "middleware": the glue logic)

  • Java (SOLR / Lucene: our choice for the "full text search database" ~ backend No.2)

  • JavaScript (UI, mostly. Think electron, web browser, Chromely, WebView2, that sort of thing)

Here we intend to use the regular SOLR APIs, which do not require specialized binary IPC.

We will probably choose a web-centric UI approach where images are compressed and cached in the backend, while being provided as <picture> or <img> tag references (URLs) in the HTML generated by the backend. However, we keep our options open ATM, as further testing is expected to hit a few obstacles there (smart caching is required, as we will be processing lots of documents in "background bulk processes" alongside the browsing and other more direct user activity), so a websocket or similar push technology may be employed: there we may benefit from dedicated IPC for large binary and text data transfers.

Scripting the System: Languages Considered for Scripting by Users

Python has been considered. Given its loud presence in the AI communities, we still may integrate it one day. However, personally I'm not a big fan of the language and don't use it unless it's prudent to do so, e.g. when extending or tweaking previous works produced by others. Also, it turns out it's not exactly easy to integrate (CPython) and I don't see a need for it beyond this one project / product: Qiqqa.

I've looked at Lua as a scripting language suitable for users (it is used quite a lot in the gaming industries and elsewhere); initial trials to get something going did not uncover major obstacles, but the question "how do I debug Lua scripts?" does not produce any viable project / product that goes beyond the old skool printf-style debugging method. Not a prime candidate therefore, as we expect that users will pick this up, when they like it, and grow their user scripts to unanticipated size and complexity: I've seen this happen multiple times in my career. Lua does not provide a scalable growth path, from my perspective, due to the lack of a decent, customizable debugger.

The third candidate is JavaScript. While Artifex/mupdf comes with mujs, which is a simple engine, it suffers from two drawbacks: it's ES5-only and it does not provide a debugger mechanism beyond old skool print. Nice for nerds, but this is user-facing and thus not a viable option.

The other JavaScript engines considered are of varying size, performance and complexity. Some of them offer ways to integrate them with the [F12] Chrome browser Developer Tools debugger, which would be very nice to have available. The road traveled there, along the various JavaScript engines, is this:

  • cel-cpp 📁 🌐 -- C++ implementations of the Common Expression Language. For background on the Common Expression Language, see the cel-spec repo.

  • cel-spec 📁 🌐 -- Common Expression Language specification: the Common Expression Language (CEL) implements common semantics for expression evaluation, enabling different applications to more easily interoperate. Key applications are (1) security policy: organizations have complex infrastructure and need common tooling to reason about the system as a whole, and (2) protocols: expressions are a useful data type and require interoperability across programming languages and platforms.

  • chibi-scheme 📁 🌐 -- Chibi-Scheme is a very small library intended for use as an extension and scripting language in C programs. In addition to support for lightweight VM-based threads, each VM itself runs in an isolated heap allowing multiple VMs to run simultaneously in different OS threads.

  • cppdap 📁 🌐 -- a C++11 library ("SDK") implementation of the Debug Adapter Protocol, providing an API for implementing a DAP client or server. cppdap provides C++ type-safe structures for the full DAP specification, and provides a simple way to add custom protocol messages.

  • cpython 📁 🌐 -- Python version 3. Note: Building a complete Python installation requires the use of various additional third-party libraries, depending on your build platform and configure options. Not all standard library modules are buildable or useable on all platforms.

  • duktape 📁 🌐 -- Duktape is an embeddable Javascript engine, with a focus on portability and compact footprint. Duktape is ECMAScript E5/E5.1 compliant, with some semantics updated from ES2015+, with partial support for ECMAScript 2015 (E6) and ECMAScript 2016 (E7), ES2015 TypedArray, Node.js Buffer bindings and comes with a built-in debugger.

  • exprtk 📁 🌐 -- C++ Mathematical Expression Toolkit Library is a simple to use, easy to integrate and extremely efficient run-time mathematical expression parsing and evaluation engine. The parsing engine supports numerous forms of functional and logic processing semantics and is easily extensible.

  • guile 📁 🌐 -- Guile is Project GNU's extension language library. Guile is an implementation of the Scheme programming language, packaged as a library that can be linked into applications to give them their own extension language. Guile supports other languages as well, giving users of Guile-based applications a choice of languages.

  • harbour-core 📁 🌐 -- Harbour is the free software implementation of a multi-platform, multi-threading, object-oriented, scriptable programming language, backward compatible with Clipper/xBase. Harbour consists of a compiler and runtime libraries with multiple UI and database backends, its own make system and a large collection of libraries and interfaces to many popular APIs.

  • itcl 📁 🌐 -- Itcl is an object oriented extension for Tcl.

  • jimtcl 📁 🌐 -- the Jim Interpreter is a small-footprint implementation of the Tcl programming language written from scratch. Currently Jim Tcl is very feature complete with an extensive test suite (see the tests directory). There are some Tcl commands and features which are not implemented (and likely never will be), including traces and Tk. However, Jim Tcl offers a number of both Tcl8.5 and Tcl8.6 features ({*}, dict, lassign, tailcall and optional UTF-8 support) and some unique features. These unique features include [lambda] with garbage collection, a general GC/references system, arrays as syntax sugar for [dict]tionaries, object-based I/O and more. Other common features of the Tcl programming language are present, like the "everything is a string" behaviour, implemented internally as dual ported objects to ensure that the execution time does not reflect the semantic of the language :)

  • miniscript 📁 🌐 -- the MiniScript scripting language.

  • mujs 📁 🌐 -- a lightweight ES5 Javascript interpreter designed for embedding in other software to extend them with scripting capabilities.

  • newlisp 📁 🌐 -- newLISP is a LISP-like scripting language for doing things you typically do with scripting languages: programming for the internet, system administration, text processing, gluing other programs together, etc. newLISP is a scripting LISP for people who are fascinated by LISP's beauty and power of expression, but who need it stripped down to easy-to-learn essentials. newLISP is LISP reborn as a scripting language: pragmatic and casual, simple to learn without requiring you to know advanced computer science concepts. Like any good scripting language, newLISP is quick to get into and gets the job done without fuss. newLISP has a very fast startup time, is small on resources like disk space and memory and has a deep, practical API with functions for networking, statistics, machine learning, regular expressions, multiprocessing and distributed computing built right into it, not added as a second thought in external modules.

  • owl 📁 🌐 -- Owl Lisp is a functional dialect of the Scheme programming language. It is mainly based on the applicative subset of the R7RS standard.

  • picoc 📁 🌐 -- PicoC is a very small C interpreter for scripting. It was originally written as a script language for a UAV's on-board flight system. It's also very suitable for other robotic, embedded and non-embedded applications. The core C source code is around 3500 lines of code. It's not intended to be a complete implementation of ISO C but it has all the essentials.

  • QuickJS 📁 🌐 -- a small and embeddable Javascript engine. It supports the ES2020 specification including modules, asynchronous generators, proxies and BigInt. It optionally supports mathematical extensions such as big decimal floating point numbers (BigDecimal), big binary floating point numbers (BigFloat) and operator overloading.

    • libbf 📁 🌐 -- a small library to handle arbitrary precision binary or decimal floating point numbers
    • QuickJS-C++-Wrapper 📁 🌐 -- quickjscpp is a header-only wrapper around the quickjs JavaScript engine, which allows easy integration into C++11 code. This wrapper also automatically tracks the lifetime of values and objects, is exception-safe, and automates clean-up.
    • QuickJS-C++-Wrapper2 📁 🌐 -- QuickJSPP is a QuickJS wrapper for C++. It allows you to easily embed a Javascript engine into your program.
    • txiki 📁 🌐 -- uses QuickJS as its kernel
  • sbcl 📁 🌐 -- SBCL is an implementation of ANSI Common Lisp, featuring a high-performance native compiler, native threads on several platforms, a socket interface, a source-level debugger, a statistical profiler, and much more.

  • ScriptX 📁 🌐 -- Tencent's ScriptX is a script engine abstraction layer. A variety of script engines are encapsulated on the bottom and a unified API is exposed on the top, so that the upper-layer caller can completely isolate the underlying engine implementation (back-end).

    ScriptX not only isolates several JavaScript engines (e.g. V8 and QuickJS), but can even isolate different scripting languages, so that the upper layer can seamlessly switch between scripting engine and scripting language without changing the code.

  • tcl 📁 🌐 -- the latest Tcl source distribution. Tcl provides a powerful platform for creating integration applications that tie together diverse applications, protocols, devices, and frameworks.

  • tclclockmod 📁 🌐 -- TclClockMod is the fastest, most powerful Tcl clock engine written in C: a faster C module replacing the standard "clock" ensemble of Tcl.

  • txiki 📁 🌐 -- a small and powerful JavaScript runtime. It's built on the shoulders of giants: it uses [QuickJS] as its JavaScript engine, [libuv] as the platform layer, [wasm3] as the WebAssembly engine and [curl] as the HTTP / WebSocket client.

  • Facebook's Hermes, Samsung's Escargot and XS/moddable were also considered, which led me to a webpage where various embeddable JS engines are compared size- and performance-wise.

  • Google's V8 was looked at too; as available in NodeJS, it is deemed too complex for integration: when we go there, we could spend the same amount of effort on CPython integration -- though there, again, is the ever-present "how to debug this visually?!" question...

  • JerryScript: ES2017/2020 (good!), and there are noises on the Net about Chrome Developer Tools support for this one. Small, designed for embedded devices. I like that.

  • mujs: ES5, no visual debugger. Out.

  • QuickJS: ES2020, DevTools or VS Code debugging seems to be available. Also comes with an interesting runtime: txiki, which we still need to take a good look at.

UPDATE 2021/June: JerryScript, duktape, XS/moddable, escargot: these have been dropped as we picked QuickJS. After some initial hassle with that codebase, we picked a different branch to test, which was cleaner and compiled out of the box (CMake > MSVC), which is always a good omen for a codebase when you have cross-platform portability in mind.


Libraries we're looking at for this intent:

IPC: flatbuffer et al for protocol design

  • arrow 📁 🌐 -- Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. The reference Arrow libraries contain many distinct software components:

    • Columnar vector and table-like containers (similar to data frames) supporting flat or nested types

    • Conversions to and from other in-memory data structures

    • Integration tests for verifying binary compatibility between the implementations (e.g. sending data from Java to C++)

    • IO interfaces to local and remote filesystems

    • Readers and writers for various widely-used file formats (such as Parquet, CSV)

    • Reference-counted off-heap buffer memory management, for zero-copy memory sharing and handling memory-mapped files

    • Self-describing binary wire formats (streaming and batch/file-like) for remote procedure calls (RPC) and interprocess communication (IPC)

  • avro 📁 🌐 -- Apache Avro™ is a data serialization system.

  • bebop 📁 🌐 -- an extremely simple, fast, efficient, cross-platform serialization format. Bebop is a schema-based binary serialization technology, similar to Protocol Buffers or MessagePack. In particular, Bebop tries to be a good fit for client–server or distributed web apps that need something faster, more concise, and more type-safe than JSON or MessagePack, while also avoiding some of the complexity of Protocol Buffers, FlatBuffers and the like.

  • bitsery 📁 🌐 -- header only C++ binary serialization library, designed around the networking requirements for real-time data delivery, especially for games. All cross-platform requirements are enforced at compile time, so serialized data do not store any meta-data information and is as small as possible.

  • capnproto 📁 🌐 -- Cap'n Proto is an insanely fast data interchange format and capability-based RPC system. Think JSON, except binary. Or think Protocol Buffers, except faster.

  • cereal 📁 🌐 -- C++11 serialization library

  • flatbuffers 📁 🌐 -- a cross platform serialization library architected for maximum memory efficiency. It allows you to directly access serialized data without parsing/unpacking it first, while still having great forwards/backwards compatibility.

  • GoldFish-CBOR 📁 🌐 -- a fast JSON and CBOR streaming library, without using memory. GoldFish can parse and generate very large JSON or CBOR documents. It has some similarities to a SAX parser, but doesn't use an event driven API, instead the user of the GoldFish interface is in control. GoldFish intends to be the easiest and one of the fastest JSON and CBOR streaming parser and serializer to use.

  • ion-c 📁 🌐 -- a C implementation of the Ion data notation. Amazon Ion is a richly-typed, self-describing, hierarchical data serialization format offering interchangeable binary and text representations. The text format (a superset of JSON) is easy to read and author, supporting rapid prototyping. The binary representation is efficient to store, transmit, and skip-scan parse. The rich type system provides unambiguous semantics for long-term preservation of data which can survive multiple generations of software evolution.

  • libbson 📁 🌐 -- a library providing useful routines related to building, parsing, and iterating BSON documents.

  • libnop 📁 🌐 -- libnop (C++ Native Object Protocols) is a header-only library for serializing and deserializing C++ data types without external code generators or runtime support libraries. The only mandatory requirement is a compiler that supports the C++14 standard.

  • libsmile 📁 🌐 -- C implementation of the Smile binary format (https://github.com/FasterXML/smile-format-specification).

    • discouraged; reason: for binary format record serialization we will be using bebop or reflect-cpp exclusively. All other communications will be JSON/JSON5/XML based.
  • mosquitto 📁 🌐 -- Eclipse Mosquitto is an open source implementation of a server for version 5.0, 3.1.1, and 3.1 of the MQTT protocol. It also includes a C and C++ client library, and the mosquitto_pub and mosquitto_sub utilities for publishing and subscribing.

  • msgpack-c 📁 🌐 -- MessagePack (a.k.a. msgpack) for C/C++ is an efficient binary serialization format, which lets you exchange data among multiple languages like JSON, except that it's faster and smaller. Small integers are encoded into a single byte and short strings require only one extra byte in addition to the strings themselves.

  • msgpack-cpp 📁 🌐 -- msgpack for C++: MessagePack is an efficient binary serialization format, which lets you exchange data among multiple languages like JSON, except that it's faster and smaller. Small integers are encoded into a single byte and short strings require only one extra byte in addition to the strings themselves.

  • protobuf 📁 🌐 -- Protocol Buffers - Google's data interchange format that is a language-neutral, platform-neutral, extensible mechanism for serializing structured data.

    • ☹ discouraged 🤧; reason: relatively slow run-time and (in my opinion) rather ugly & convoluted approach at build time. Has too much of a Java/CorporateProgramming smell, which has not lessened over the years, unfortunately.
  • reflect 📁 🌐 -- a C++20 Static Reflection library with optimized run-time execution and binary size, fast compilation times and platform agnostic, minimal API. The library only provides basic reflection primitives and is not a full-fledged, heavy, implementation for https://wg21.link/P2996 which is a language proposal with many more features and capabilities.

  • reflect-cpp 📁 🌐 -- a C++-20 library for fast serialization, deserialization and validation using reflection, similar to pydantic in Python, serde in Rust, encoding in Go or aeson in Haskell. As the aforementioned libraries are among the most widely used in the respective languages, reflect-cpp fills an important gap in C++ development. It reduces boilerplate code and increases code safety.

  • serde-cpp 📁 🌐 -- serialization framework for C++17, inspired by the Rust serde project.

  • serdepp 📁 🌐 -- a C++17 low cost serialize deserialize adaptor library like the Rust serde project.

  • swig 📁 🌐 -- SWIG (Simplified Wrapper and Interface Generator) is a software development tool (code generator) that connects programs written in C and C++ with a variety of high-level programming languages. It is used for building scripting language interfaces to C and C++ programs. SWIG simplifies development by largely automating the task of scripting language integration, allowing developers and users to focus on more important problems.

    SWIG ๐ŸŒ was not considered initially; more suitable for RPC than what we have in mind, which is purely data messages enchange. MAY be of use for transitional applications which are mixed-(programming-)language based, e.g. where we want to mix C/C++ and C# in a single Test Application.

  • thrift 📁 🌐 -- Apache Thrift is a lightweight, language-independent software stack for point-to-point RPC implementation. Thrift provides clean abstractions and implementations for data transport, data serialization, and application level processing. The code generation system takes a simple definition language as input and generates code across programming languages that uses the abstracted stack to build interoperable RPC clients and servers.

  • velocypack 📁 🌐 -- a fast and compact format for serialization and storage. These days, JSON (JavaScript Object Notation, see ECMA-404) is used in many cases where data has to be exchanged. Lots of protocols between different services use it, and databases store JSON (document stores naturally, but others increasingly as well). It is popular because it is simple, human-readable, and yet surprisingly versatile, despite its limitations. At the same time there is a plethora of alternatives, ranging from XML through Universal Binary JSON, MongoDB's BSON, MessagePack, BJSON (binary JSON) and Apache Thrift to Google's protocol buffers and ArangoDB's shaped JSON. When looking into this, we were surprised to find that none of these formats manages to combine compactness, platform independence, fast access to sub-objects and rapid conversion from and to JSON.

  • zpp_bits 📁 🌐 -- a modern, fast, C++20 binary serialization and RPC library, with just one header file. See also the benchmark.

  • ZeroMQ a.k.a. ØMQ:

  • FastBinaryEncoding 🌐

    • removed; reason: for binary format record serialization we will be using bebop exclusively. All other communications will be JSON/JSON5/XML based.
  • flatbuffers 🌐

    • removed; reason: see protobuf: same smell rising. Faster at run time, but still a bit hairy to my tastes while bebop et al are on to something potentially nice.
  • flatcc 🌐

    • removed; reason: see flatbuffers. When we don't dig flatbuffers, then flatcc is automatically pretty useless to us. Let's rephrase that professionally: "flatcc has moved out of scope for our project."

IPC: websockets, etc.: all communication means

  • blazingmq ๐Ÿ“ ๐ŸŒ -- BlazingMQ is a modern, High-Performance Message Queue, which focuses on efficiency, reliability, and a rich feature set for modern-day workflows. At its core, BlazingMQ provides durable, fault-tolerant, highly performant, and highly available queues, along with features like various message routing strategies (e.g., work queues, priority, fan-out, broadcast, etc.), compression, strong consistency, poison pill detection, etc. Message queues generally provide a loosely-coupled, asynchronous communication channel ("queue") between application services (producers and consumers) that send messages to one another. You can think about it like a mailbox for communication between application programs, where 'producer' drops a message in a mailbox and 'consumer' picks it up at its own leisure. Messages placed into the queue are stored until the recipient retrieves and processes them. In other words, producer and consumer applications can temporally and spatially isolate themselves from each other by using a message queue to facilitate communication.

  • boringssl ๐Ÿ“ ๐ŸŒ -- BoringSSL is a fork of OpenSSL that is designed to meet Google's needs.

  • cpp-httplib ๐Ÿ“ ๐ŸŒ -- an extremely easy to setup C++11 cross platform HTTP/HTTPS library.

    NOTE: This library uses 'blocking' socket I/O. If you are looking for a library with 'non-blocking' socket I/O, this is not the one that you want.

  • cpp-ipc ๐Ÿ“ ๐ŸŒ -- a high-performance inter-process communication using shared memory on Linux/Windows.

  • cpp-netlib ๐Ÿ“ ๐ŸŒ -- modern C++ network programming library: cpp-netlib is a collection of network-related routines/implementations geared towards providing a robust cross-platform networking library.

  • cpp_rest_sdk ๐Ÿ“ ๐ŸŒ -- the C++ REST SDK is a Microsoft project for cloud-based client-server communication in native code using a modern asynchronous C++ API design. This project aims to help C++ developers connect to and interact with services.

  • crow ๐Ÿ“ ๐ŸŒ -- IPC / server framework. Crow is a very fast and easy to use C++ micro web framework (inspired by Python Flask).

    The interface looks nicer than oatpp's...

  • ecal ๐Ÿ“ ๐ŸŒ -- the enhanced Communication Abstraction Layer (eCAL) is a middleware that enables scalable, high performance interprocess communication on a single computer node or between different nodes in a computer network. eCAL uses a publish-subscribe pattern to automatically connect different nodes in the network. eCAL automatically chooses the best available data transport mechanism for each link:

    • Shared memory for local communication (incredibly fast!)
    • UDP for network communication
  • iceoryx ๐Ÿ“ ๐ŸŒ -- true zero-copy inter-process-communication. iceoryx is an inter-process-communication (IPC) middleware for various operating systems (currently we support Linux, macOS, QNX, FreeBSD and Windows 10). It has its origins in the automotive industry, where large amounts of data have to be transferred between different processes when it comes to driver assistance or automated driving systems. However, the efficient communication mechanisms can also be applied to a wider range of use cases, e.g. in the field of robotics or game development.

  • libetpan ๐Ÿ“ ๐ŸŒ -- this mail library provides a portable, efficient framework for different kinds of mail access: IMAP, SMTP, POP and NNTP.

  • libwebsocketpp ๐Ÿ“ ๐ŸŒ -- WebSocket++ is a header only C++ library that implements RFC6455 The WebSocket Protocol.

  • libwebsockets ๐Ÿ“ ๐ŸŒ -- a simple-to-use C library providing client and server for HTTP/1, HTTP/2, WebSockets, MQTT and other protocols. It supports a lot of lightweight ancillary implementations for things like JSON, CBOR, JOSE, COSE. It's very gregarious when it comes to event loop sharing, supporting libuv, libevent, libev, sdevent, glib and uloop, as well as custom event libs.

  • MPMCQueue ๐Ÿ“ ๐ŸŒ -- a bounded multi-producer multi-consumer concurrent queue written in C++11.

  • MultipartEncoder ๐Ÿ“ ๐ŸŒ -- a C++ implementation of encoding multipart/form-data. You may find that the asynchronous HTTP client, i.e. cpprestsdk, does not support posting a multipart/form-data request. This MultipartEncoder is a workaround to generate the body content in multipart/form-data format, so that you can then use a C++ HTTP client, which is not limited to cpprestsdk, to post a multipart/form-data request by setting the encoded body content.

  • nanomsg-nng ๐Ÿ“ ๐ŸŒ -- a rewrite of the Scalability Protocols library known as [libnanomsg](https://github.com/nanomsg/nanomsg), which adds significant new capabilities, while retaining compatibility with the original. NNG is a lightweight, broker-less library, offering a simple API to solve common recurring messaging problems, such as publish/subscribe, RPC-style request/reply, or service discovery.

  • nghttp3 ๐Ÿ“ ๐ŸŒ -- an implementation of [RFC 9114](https://datatracker.ietf.org/doc/html/rfc9114) HTTP/3 mapping over QUIC and [RFC 9204](https://datatracker.ietf.org/doc/html/rfc9204) QPACK in C.

  • ngtcp2 ๐Ÿ“ ๐ŸŒ -- the ngtcp2 project is an effort to implement the [RFC 9000](https://datatracker.ietf.org/doc/html/rfc9000) QUIC protocol.

  • OpenSSL ๐Ÿ“ ๐ŸŒ -- OpenSSL is a robust, commercial-grade, full-featured Open Source Toolkit for the Transport Layer Security (TLS) protocol formerly known as the Secure Sockets Layer (SSL) protocol. The protocol implementation is based on a full-strength general purpose cryptographic library, which can also be used stand-alone.

  • readerwriterqueue ๐Ÿ“ ๐ŸŒ -- a single-producer, single-consumer lock-free queue for C++.

  • restc-cpp ๐Ÿ“ ๐ŸŒ -- a modern C++ REST Client library. The magic that takes the pain out of accessing JSON APIs from C++. The design goal of this project is to make external REST APIs simple and safe to use in C++ projects, but still fast and memory efficient.

  • restclient-cpp ๐Ÿ“ ๐ŸŒ -- a simple REST client for C++, which wraps libcurl for HTTP requests.

  • shadesmar ๐Ÿ“ ๐ŸŒ -- an IPC library that uses the system's shared memory to pass messages. Supports publish-subscribe and RPC.

  • sharedhashfile ๐Ÿ“ ๐ŸŒ -- share hash tables with stable key hints stored in memory mapped files between arbitrary processes.

  • shmdata ๐Ÿ“ ๐ŸŒ -- shares streams of framed data between processes (1 writer to many readers) via shared memory. It supports any kind of data stream: it has been used with multichannel audio, video frames, 3D models, OSC messages, and various other types of data. Shmdata is very fast and allows processes to access data streams without the need for extra copies.

  • SPSCQueue ๐Ÿ“ ๐ŸŒ -- a single producer single consumer wait-free and lock-free fixed size queue written in C++11.

  • tcp_pubsub ๐Ÿ“ ๐ŸŒ -- a minimal publish-subscribe library that transports data via TCP. tcp_pubsub does not define a message format but only transports binary blobs. It does however define a protocol around that, which is kept as lightweight as possible.

  • tcpshm ๐Ÿ“ ๐ŸŒ -- a connection-oriented persistent message queue framework based on TCP or SHM IPC for Linux. TCPSHM provides a reliable and efficient solution based on a sequence number and acknowledge mechanism: every outgoing message is persisted in a send queue until the sender receives an acknowledgement that it has been consumed by the receiver, so that disconnects/crashes are tolerated and the recovery process is fully automatic.

  • telegram-bot-api ๐Ÿ“ ๐ŸŒ -- the Telegram Bot API provides an HTTP API for creating Telegram Bots.

  • telegram-td ๐Ÿ“ ๐ŸŒ -- TDLib (Telegram Database library) is a cross-platform library for building Telegram clients. It can be easily used from almost any programming language.

  • ucx ๐Ÿ“ ๐ŸŒ -- Unified Communication X (UCX) is an optimized, production-proven communication framework for modern, high-bandwidth and low-latency networks. UCX exposes a set of abstract communication primitives that utilize the best of available hardware resources and offloads. These include RDMA (InfiniBand and RoCE), TCP, GPUs, shared memory, and network atomic operations.

  • userver ๐Ÿ“ ๐ŸŒ -- an open source asynchronous framework with a rich set of abstractions for fast and comfortable creation of C++ microservices, services and utilities. The framework solves the problem of efficient I/O interactions transparently for the developers. Operations that would typically suspend the thread of execution do not suspend it. Instead of that, the thread processes other requests and tasks and returns to the handling of the operation only when it is guaranteed to execute immediately. As a result you get straightforward source code and avoid CPU-consuming context switches from OS, efficiently utilizing the CPU with a small amount of execution threads.

  • uvw ๐Ÿ“ ๐ŸŒ -- libuv wrapper in modern C++. uvw started as a header-only, event based, tiny and easy to use wrapper for libuv written in modern C++. Now it's finally available also as a compilable static library. The basic idea is to wrap the C-ish interface of libuv behind a graceful C++ API.

  • websocket-sharp ๐Ÿ“ ๐ŸŒ -- a C# implementation of the WebSocket protocol client and server.

  • WinHttpPAL ๐Ÿ“ ๐ŸŒ -- implements a WinHttp API Platform Abstraction Layer for POSIX systems using libcurl.

  • ice ๐ŸŒ -- Comprehensive RPC Framework: helps you network your software with minimal effort.

    • removed; reason: has a strong focus on the remote, i.e. R in RPC (thus a focus on things such as encryption, authentication, firewalling, etc.), which we don't want or need: all services are supposed to run on a single machine and comms go through localhost only. When folks find they need to distribute the workload across multiple machines, we'll be entering a new era in Qiqqa usage, and that will be soon enough to (re-)investigate the usefulness of this package.

Also, we are currently more interested in fast data serialization than RPC per se, as we aim for a solution that's more akin to a REST API interface style.

  • oatpp ๐ŸŒ -- IPC / server framework

    • removed; reason: see crow. We have picked crow as the preferred way forward, so any similar/competing product is out of scope unless crow throws a tantrum on our test bench after all; the chances of that are very slim.

IPC: ZeroMQ a.k.a. ØMQ

IPC: memory mapping

  • arrow ๐Ÿ“ ๐ŸŒ -- Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. The reference Arrow libraries contain many distinct software components:

    • Columnar vector and table-like containers (similar to data frames) supporting flat or nested types

    • Conversions to and from other in-memory data structures

    • Integration tests for verifying binary compatibility between the implementations (e.g. sending data from Java to C++)

    • IO interfaces to local and remote filesystems

    • Readers and writers for various widely-used file formats (such as Parquet, CSV)

    • Reference-counted off-heap buffer memory management, for zero-copy memory sharing and handling memory-mapped files

    • Self-describing binary wire formats (streaming and batch/file-like) for remote procedure calls (RPC) and interprocess communication (IPC)

  • fmem ๐Ÿ“ ๐ŸŒ -- a cross-platform library for opening memory-backed libc streams (a la UNIX fmemopen()).

  • fmemopen_windows ๐Ÿ“ ๐ŸŒ -- provides a FILE* handle based on a memory backend for fread, fwrite, etc., just like fmemopen on Linux, but now on MS Windows.

  • libmio ๐Ÿ“ ๐ŸŒ -- An easy to use header-only cross-platform C++11 memory mapping library. mio has been created with the goal to be easily includable (i.e. no dependencies) in any C++ project that needs memory mapped file IO without the need to pull in Boost.

  • libvrb ๐Ÿ“ ๐ŸŒ -- implements a ring buffer, also known as a character FIFO or circular buffer, with a special property that any data present in the buffer, as well as any empty space, are always seen as a single contiguous extent by the calling program. This is implemented with virtual memory mapping by creating a mirror image of the buffer contents at the memory location in the virtual address space immediately after the main buffer location. This allows the mirror image to always be seen without doing any copying of data.

  • portable-memory-mapping ๐Ÿ“ ๐ŸŒ -- portable Memory Mapping C++ Class (Windows/Linux)

  • shadesmar ๐Ÿ“ ๐ŸŒ -- an IPC library that uses the system's shared memory to pass messages. Supports publish-subscribe and RPC.

  • sharedhashfile ๐Ÿ“ ๐ŸŒ -- share hash tables with stable key hints stored in memory mapped files between arbitrary processes.

  • shmdata ๐Ÿ“ ๐ŸŒ -- shares streams of framed data between processes (1 writer to many readers) via shared memory. It supports any kind of data stream: it has been used with multichannel audio, video frames, 3D models, OSC messages, and various other types of data. Shmdata is very fast and allows processes to access data streams without the need for extra copies.

  • tcpshm ๐Ÿ“ ๐ŸŒ -- a connection-oriented persistent message queue framework based on TCP or SHM IPC for Linux. TCPSHM provides a reliable and efficient solution based on a sequence number and acknowledge mechanism: every outgoing message is persisted in a send queue until the sender receives an acknowledgement that it has been consumed by the receiver, so that disconnects/crashes are tolerated and the recovery process is fully automatic.

IPC: JSON for protocol design

  • cJSON ๐Ÿ“ ๐ŸŒ -- ultra-lightweight JSON parser in ANSI C.

  • glaze ๐Ÿ“ ๐ŸŒ -- one of the fastest JSON libraries in the world. Glaze reads and writes from object memory, simplifying interfaces and offering incredible performance. Glaze also supports BEVE (binary efficient versatile encoding), CSV (comma separated value) and binary data through the same API for maximum performance.

  • GoldFish-CBOR ๐Ÿ“ ๐ŸŒ -- a fast JSON and CBOR streaming library, without using memory. GoldFish can parse and generate very large JSON or CBOR documents. It has some similarities to a SAX parser, but doesn't use an event driven API, instead the user of the GoldFish interface is in control. GoldFish intends to be the easiest and one of the fastest JSON and CBOR streaming parser and serializer to use.

  • json ๐Ÿ“ ๐ŸŒ -- N. Lohmann's JSON for Modern C++.

  • jsoncons ๐Ÿ“ ๐ŸŒ -- a C++, header-only library for constructing JSON and JSON-like data formats such as CBOR. Compared to other JSON libraries, jsoncons has been designed to handle very large JSON texts. At its heart are SAX-style parsers and serializers. It supports reading an entire JSON text in memory in a variant-like structure. But it also supports efficient access to the underlying data using StAX-style pull parsing and push serializing. It supports incremental parsing into a user's preferred form, using information about user types provided by specializations of json_type_traits.

  • jsoncpp ๐Ÿ“ ๐ŸŒ -- JsonCpp is a C++ library that allows manipulating JSON values, including serialization and deserialization to and from strings. It can also preserve existing comment in unserialization/serialization steps, making it a convenient format to store user input files.

  • json-jansson ๐Ÿ“ ๐ŸŒ -- Jansson is a C library for encoding, decoding and manipulating JSON data.

  • rapidJSON ๐Ÿ“ ๐ŸŒ -- Tencent's fast JSON parser/generator for C++ with both SAX & DOM style APIs.

  • simdjson ๐Ÿ“ ๐ŸŒ -- simdjson: Parsing gigabytes of JSON per second. For NDJSON files, we can exceed 3 GB/s with our [multithreaded parsing functions](https://github.com/simdjson/simdjson/blob/master/doc/parse_many.md).

  • tao-json ๐Ÿ“ ๐ŸŒ -- taoJSON is a C++ header-only JSON library that provides a generic Value Class, uses Type Traits to interoperate with C++ types, uses an Events Interface to convert from and to JSON, JAXN, CBOR, MsgPack and UBJSON, and much more...

  • yyjson ๐Ÿ“ ๐ŸŒ -- allegedly the fastest JSON library in C.

  • libsmile ๐ŸŒ -- "Smile" format, i.e. a compact binary JSON format

    • discouraged; reason: for binary format record serialization we will be using bebop or reflect-cpp exclusively. All other communications will be JSON/JSON5/XML based. I think we'd better standardize on using one or more of these:

      • custom binary exchange formats for those interchanges that demand highest performance and MAY carry large transfer loads.
      • JSON
      • TOML
      • XML
      • YAML

IPC: CBOR for protocol design

  • glaze ๐Ÿ“ ๐ŸŒ -- one of the fastest JSON libraries in the world. Glaze reads and writes from object memory, simplifying interfaces and offering incredible performance. Glaze also supports BEVE (binary efficient versatile encoding), CSV (comma separated value) and binary data through the same API for maximum performance.

  • GoldFish-CBOR ๐Ÿ“ ๐ŸŒ -- a fast JSON and CBOR streaming library, without using memory. GoldFish can parse and generate very large JSON or CBOR documents. It has some similarities to a SAX parser, but doesn't use an event driven API, instead the user of the GoldFish interface is in control. GoldFish intends to be the easiest and one of the fastest JSON and CBOR streaming parser and serializer to use.

  • jsoncons ๐Ÿ“ ๐ŸŒ -- a C++, header-only library for constructing JSON and JSON-like data formats such as CBOR. Compared to other JSON libraries, jsoncons has been designed to handle very large JSON texts. At its heart are SAX-style parsers and serializers. It supports reading an entire JSON text in memory in a variant-like structure. But it also supports efficient access to the underlying data using StAX-style pull parsing and push serializing. It supports incremental parsing into a user's preferred form, using information about user types provided by specializations of json_type_traits.

  • libcbor ๐Ÿ“ ๐ŸŒ -- a C library for parsing and generating CBOR, the general-purpose schema-less binary data format.

  • QCBOR ๐Ÿ“ ๐ŸŒ -- a powerful, commercial-quality CBOR encoder/decoder that implements these RFCs:

    • RFC7049 The previous CBOR standard. Replaced by RFC 8949.
    • RFC8742 CBOR Sequences
    • RFC8943 CBOR Dates
    • RFC8949 The CBOR Standard. (Everything except sorting of encoded maps)
  • tinycbor ๐Ÿ“ ๐ŸŒ -- Concise Binary Object Representation (CBOR) library for serializing data to disk or message channel.

IPC: YAML, TOML, etc. for protocol design

Not considered; reason: when we want the IPC protocol to be "human readable" in any form/approximation, we've decided to stick with JSON or XML (if we cannot help it -- I particularly dislike the verbosity and tag redundancy (open+close) in XML and consider it a lousy design choice for any purpose).

The more human readable formats (YAML, TOML, ...) are intended for human-to-machine communications, e.g. for feeding configurations into applications, and SHOULD NOT be used for IPC anywhere. (Though I must say I'm on the fence where it comes to using YAML as an alternative IPC format where it replaces JSON; other contenders there are the JSON5/JSON6 formats.)

Content Hashing (cryptographic strength, i.e. "guaranteed" collision-free)

The bit about "guaranteed" collision-free is to be read as: hash algorithms in this section must come with strong statistical guarantees that any chance at a hash collision is negligible, even for extremely large collections. In practice this means: use cryptographic hash algorithms with a strength of 128 bits or more. (Qiqqa used a b0rked version of SHA1 thus far, which is considered too weak as we already have sample PDFs which cause a hash collision for the official SHA1 algo (and thus also collide in our b0rked SHA1 variant): while those can still be argued to be fringe cases, I don't want to be bothered with this at all and thus choose to err on the side of 'better than SHA1B' here.) Meanwhile, any library in here may contain weaker cryptographic hashes alongside: we won't be using those for content hashing.

  • BLAKE3 ๐Ÿ“ ๐ŸŒ -- cryptographic hash

  • boringssl ๐Ÿ“ ๐ŸŒ -- BoringSSL is a fork of OpenSSL that is designed to meet Google's needs.

  • botan ๐Ÿ“ ๐ŸŒ -- Botan (Japanese for peony flower) is a C++ cryptography library whose goal is to be the best option for cryptography in C++ by offering the tools necessary to implement a range of practical systems, such as the TLS protocol, X.509 certificates, modern AEAD ciphers, PKCS#11 and TPM hardware support, password hashing, and post-quantum crypto schemes.

  • cryptopp ๐Ÿ“ ๐ŸŒ -- crypto library

  • md5-optimisation ๐Ÿ“ ๐ŸŒ -- MD5 Optimisation Tricks: Beating OpenSSL's Hand-tuned Assembly. Putting aside the security concerns with using MD5 as a cryptographic hash, there have been few developments on the performance front for many years, possibly due to maturity of implementations and existing techniques considered to be optimal. Several new tricks are employed which I've not seen used elsewhere, ultimately beating OpenSSL's hand-optimized MD5 implementation by roughly 5% in the general case, and 23% for processors with AVX512 support.

  • OpenSSL ๐Ÿ“ ๐ŸŒ -- its crypto library part, more specifically.

  • prvhash ๐Ÿ“ ๐ŸŒ -- PRVHASH is a hash function that generates a uniform pseudo-random number sequence derived from the message. PRVHASH is conceptually similar (in the sense of using a pseudo-random number sequence as a hash) to keccak and RadioGatun schemes, but is a completely different implementation of such concept. PRVHASH is both a "randomness extractor" and an "extendable-output function" (XOF).

  • SipHash ๐Ÿ“ ๐ŸŒ -- SipHash is a family of pseudorandom functions (PRFs) optimized for speed on short messages. This is the reference C code of SipHash: portable, simple, optimized for clarity and debugging. SipHash was designed in 2012 by Jean-Philippe Aumasson and Daniel J. Bernstein as a defense against hash-flooding DoS attacks.

    It is:

    • simpler and faster on short messages than previous cryptographic algorithms, such as MACs based on universal hashing,
    • competitive in performance with insecure non-cryptographic algorithms, such as fhhash,
    • cryptographically secure, with no sign of weakness despite multiple cryptanalysis projects by leading cryptographers, and
    • battle-tested, with successful integration in OSs (Linux kernel, OpenBSD, FreeBSD, FreeRTOS), languages (Perl, Python, Ruby, etc.), libraries (OpenSSL libcrypto, Sodium, etc.) and applications (Wireguard, Redis, etc.).

    As a secure pseudorandom function (a.k.a. keyed hash function), SipHash can also be used as a secure message authentication code (MAC). But SipHash is not a hash in the sense of general-purpose key-less hash function such as BLAKE3 or SHA-3. SipHash should therefore always be used with a secret key in order to be secure.

  • tink ๐Ÿ“ ๐ŸŒ -- A multi-language, cross-platform library that provides cryptographic APIs that are secure, easy to use correctly, and hard(er) to misuse.

  • tink-cc ๐Ÿ“ ๐ŸŒ -- Tink C++: Using crypto in your application shouldn't feel like juggling chainsaws in the dark. Tink is a crypto library written by a group of cryptographers and security engineers at Google. It was born out of our extensive experience working with Google's product teams, fixing weaknesses in implementations, and providing simple APIs that can be used safely without needing a crypto background. Tink provides secure APIs that are easy to use correctly and hard(er) to misuse. It reduces common crypto pitfalls with user-centered design, careful implementation and code reviews, and extensive testing. At Google, Tink is one of the standard crypto libraries, and has been deployed in hundreds of products and systems.

Hash-like Filters & Fast Hashing for Hash Tables et al (64 bits and less, mostly)

These hashes are for other purposes, e.g. fast lookup in dictionaries, fast approximate hit testing and set reduction through fast filtering (think bloom filter). These may be machine specific (and some of them are): they are never supposed to be used for encoding in storage or other means which cross machine boundaries: if you want to use them for a database index, that is fine as long as you don't expect that database index to be readable by any other machine than the one which produced and uses these hash numbers.

As you can see from the list below, I went on a shopping spree, having fun with all the latest, including some possibly insane stuff that's only really useful for particular edge cases -- which we hope to avoid ourselves, for a while at least. Anyway, I'd say we've got the motherlode here. Simple fun for those days when your brain-flag is at half-mast. Enjoy.

  • adaptiveqf ๐Ÿ“ ๐ŸŒ -- Adaptive Quotient Filter (AQF) supports approximate membership testing and counting the occurrences of items in a data set. Like other AMQs, the AQF has a chance for false positives during queries. However, the AQF has the ability to adapt to false positives after they have occurred so they are not repeated. At the same time, the AQF maintains the benefits of a quotient filter, as it is small and fast, has good locality of reference, scales out of RAM to SSD, and supports deletions, counting, resizing, merging, and highly concurrent access.

  • adaptive-radix-tree ๐Ÿ“ ๐ŸŒ -- implements the Adaptive Radix Tree (ART), as proposed by Leis et al. ART, which is a trie-based data structure, achieves its performance, and space efficiency, by compressing the tree both vertically, i.e., if a node has no siblings it is merged with its parent, and horizontally, i.e., uses an array which grows as the number of children increases. Vertical compression reduces the tree height and horizontal compression decreases a node's size.

  • BBHash ๐Ÿ“ ๐ŸŒ -- Bloom-filter based minimal perfect hash function library.

    • left-for-dead; reason: has some GCC + Linux specific coding constructs; code isn't clean, which doesn't make my porting effort 'trustworthy'. Overall, if this is the alternative, we'll stick with gperf.
  • BCF-cuckoo-index ๐Ÿ“ ๐ŸŒ -- Better Choice Cuckoo Filter (BCF) is an efficient approximate set representation data structure. Different from the standard Cuckoo Filter (CF), BCF leverages the principle of the power of two choices to select the better candidate bucket during insertion. BCF reduces the average number of relocations of the state-of-the-art CF by 35%.

    • left-for-dead; reason: has some GCC + Linux specific coding constructs: intrinsics + Linux-only API calls, which increase the cost of porting.
  • bitrush-index ๐Ÿ“ ๐ŸŒ -- provides a serializable bitmap index able to index millions of values/sec on a single thread. By default this library uses [ozbcbitmap] but if you want you can also use another compressed/uncompressed bitmap. Only equality-queries (A = X) are supported.

  • bloom ๐Ÿ“ ๐ŸŒ -- C++ Bloom Filter Library, which offers optimal parameter selection based on expected false positive rate, union, intersection and difference operations between bloom filters and compression of in-use table (increase of false positive probability vs space).

  • cmph-hasher ๐Ÿ“ ๐ŸŒ -- C Minimal Perfect Hashing Library for both small and (very) large hash sets.

  • cqf ๐Ÿ“ ๐ŸŒ -- A General-Purpose Counting Filter: Counting Quotient Filter (CQF) supports approximate membership testing and counting the occurrences of items in a data set. This general-purpose AMQ is small and fast, has good locality of reference, scales out of RAM to SSD, and supports deletions, counting (even on skewed data sets), resizing, merging, and highly concurrent access.

  • crc32c ๐Ÿ“ ๐ŸŒ -- a few CRC32C implementations under an umbrella that dispatches to a suitable implementation based on the host computer's hardware capabilities. CRC32C is specified as the CRC that uses the iSCSI polynomial in RFC 3720. The polynomial was introduced by G. Castagnoli, S. Braeuer and M. Herrmann. CRC32C is used in software such as Btrfs, ext4, Ceph and leveldb.

  • CRoaring ๐Ÿ“ ๐ŸŒ -- portable Roaring bitmaps in C (and C++). Bitsets, also called bitmaps, are commonly used as fast data structures. Unfortunately, they can use too much memory. To compensate, we often use compressed bitmaps. Roaring bitmaps are compressed bitmaps which tend to outperform conventional compressed bitmaps such as WAH, EWAH or Concise. They are used by several major systems such as Apache Lucene and derivative systems such as Solr and Elasticsearch. The CRoaring library is used in several systems such as Apache Doris.

  • Cuckoo_Filter ๐Ÿ“ ๐ŸŒ -- a key-value filter using cuckoo hashing, substituting for bloom filter.

  • cuckoo-index ๐Ÿ“ ๐ŸŒ -- Cuckoo Index (CI) is a lightweight secondary index structure that represents the many-to-many relationship between keys and partitions of columns in a highly space-efficient way. CI associates variable-sized fingerprints in a Cuckoo filter with compressed bitmaps indicating qualifying partitions. The problem of finding all partitions that possibly contain a given lookup key is traditionally solved by maintaining one filter (e.g., a Bloom filter) per partition that indexes all unique key values contained in this partition. To identify all partitions containing a key, we would need to probe all per-partition filters (which could be many). Depending on the storage medium, a false positive there can be very expensive. Furthermore, secondary columns typically contain many duplicates (also across partitions). Cuckoo Index (CI) addresses these drawbacks of per-partition filters. (It must know all keys at build time, though.)

  • dablooms ๐Ÿ“ ๐ŸŒ -- a Scalable, Counting, Bloom Filter demonstrating a novel Bloom filter implementation that can scale, and provide not only the addition of new members, but reliable removal of existing members.

  • DCF-cuckoo-index ๐Ÿ“ ๐ŸŒ -- the Dynamic Cuckoo Filter (DCF) is an efficient approximate membership test data structure. Different from the classic Bloom filter and its variants, DCF is especially designed for highly dynamic datasets and supports extending and reducing its capacity. The DCF design is the first to achieve both reliable item deletion and flexibly extending/reducing for approximate set representation and membership testing. DCF outperforms the state-of-the-art DBF designs in both speed and memory consumption.

  • dense_hash_map ๐Ÿ“ ๐ŸŒ -- jg::dense_hash_map: a simple replacement for std::unordered_map with better performance, but losing stable addressing as a trade-off.

  • EASTL ๐Ÿ“ ๐ŸŒ -- EASTL (Electronic Arts Standard Template Library) is a C++ template library of containers, algorithms, and iterators useful for runtime and tool development across multiple platforms. It is a fairly extensive and robust implementation of such a library and has an emphasis on high performance above all other considerations.

  • emhash ๐Ÿ“ ๐ŸŒ -- fast and memory efficient open addressing C++ flat hash table/map.

  • emphf-hash ๐Ÿ“ ๐ŸŒ -- an efficient external-memory algorithm for the construction of minimal perfect hash functions for large-scale key sets, focusing on speed and low memory usage (2.61 N bits plus a small constant factor).

  • EWAHBoolArray ๐Ÿ“ ๐ŸŒ -- a C++ compressed bitset data structure (also called bitset or bit vector). It supports several word sizes by a template parameter (16-bit, 32-bit, 64-bit). You should expect the 64-bit word-size to provide better performance, but higher memory usage, while a 32-bit word-size might compress a bit better, at the expense of some performance.

  • eytzinger ๐Ÿ“ ๐ŸŒ -- fixed_eytzinger_map is a free implementation of Eytzinger's layout, in the form of an STL-like generic associative container, broadly compatible with well-established access patterns. An Eytzinger map, or BFS (breadth-first search) map, places elements in a lookup order, which leads to better memory locality. In practice, such a container can outperform searching in sorted arrays, like boost::flat_map, due to fewer cache misses made in the lookup process. In comparison with RB-based trees, like std::map, lookup in an Eytzinger map can be multiple times faster. Some comparison graphs are given here.

  • farmhash ๐Ÿ“ ๐ŸŒ -- FarmHash, a family of hash functions. FarmHash provides hash functions for strings and other data. The functions mix the input bits thoroughly but are not suitable for cryptography.

  • fastfilter_cpp ๐Ÿ“ ๐ŸŒ -- Fast Filter: Fast approximate membership filter implementations (C++, research library)

  • fasthashing ๐Ÿ“ ๐ŸŒ -- a few very fast (almost) strongly universal hash functions over 32-bit strings, as described by the paper: Owen Kaser and Daniel Lemire, Strongly universal string hashing is fast, Computer Journal (2014) 57 (11): 1624-1638. http://arxiv.org/abs/1202.4961

  • fifo_map ๐Ÿ“ ๐ŸŒ -- a FIFO-ordered associative container for C++. It has the same interface as std::map, so it can be used as a drop-in replacement.

  • flat_hash_map ๐Ÿ“ ๐ŸŒ -- a very fast hashtable.

  • flat.hpp ๐Ÿ“ ๐ŸŒ -- a library of flat vector-like based associative containers.

  • fph-table ๐Ÿ“ ๐ŸŒ -- the Flash Perfect Hash (FPH) library is a modern C++17 implementation of a dynamic perfect hash table (no collisions for the hash), which makes the hash map/set extremely fast for lookup operations. We provide four container classes fph::DynamicFphSet, fph::DynamicFphMap, fph::MetaFphSet and fph::MetaFphMap. The APIs of these four classes are almost the same as those of std::unordered_set and std::unordered_map.

  • fsst ๐Ÿ“ ๐ŸŒ -- Fast Static Symbol Table (FSST): fast text compression that allows random access. See also the PVLDB paper https://github.com/cwida/fsst/raw/master/fsstcompression.pdf. FSST is a compression scheme focused on string/text data: it can compress strings from distributions with many different values (i.e. where dictionary compression will not work well). It allows random-access to compressed data: it is not block-based, so individual strings can be decompressed without touching the surrounding data in a compressed block. When compared to e.g. LZ4 (which is block-based), FSST further achieves similar decompression speed and compression speed, and better compression ratio. FSST encodes strings using a symbol table -- but it works on pieces of the string, as it maps "symbols" (1-8 byte sequences) onto "codes" (single-bytes). FSST can also represent a byte as an exception (255 followed by the original byte). Hence, compression transforms a sequence of bytes into a (supposedly shorter) sequence of codes or escaped bytes. These shorter byte-sequences could be seen as strings again and fit in whatever your program is that manipulates strings. An optional 0-terminated mode (like, C-strings) is also supported.

  • gperf-hash ๐Ÿ“ ๐ŸŒ -- This is GNU gperf, a program that generates C/C++ perfect hash functions for sets of key words.

  • gtl ๐Ÿ“ ๐ŸŒ -- Greg's Template Library of useful classes, including a set of excellent hash map implementations, as well as a btree alternative to std::map and std::set. These are drop-in replacements for the standard C++ classes and provide the same API, but are significantly faster and use less memory. We also have a fast bit_vector implementation, which is an alternative to std::vector<bool> or std::bitset, providing both dynamic resizing and a good assortment of bit manipulation primitives, as well as a novel bit_view class allowing to operate on subsets of the bit_vector. We have lru_cache and memoize classes, both with very fast multi-thread versions relying of the mutex sharding of the parallel hashmap classes. We also offer an intrusive_ptr class, which uses less memory than std::shared_ptr, and is simpler to construct.

  • HashMap 📁 🌐 -- a hash table mostly compatible with the C++11 std::unordered_map interface, but with much higher performance for many workloads. This hash table uses open addressing with linear probing and backshift deletion. Open addressing and linear probing minimizes memory allocations and achieves high cache efficiency. Backshift deletion keeps performance high for delete heavy workloads by not clobbering the hash table with tombstones.
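    The backshift-deletion technique itself is compact enough to sketch. The `IntSet` below is a hypothetical fixed-capacity illustration, not the HashMap library's API: on erase, displaced followers are pulled back toward their home slot instead of leaving a tombstone.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <optional>
    #include <vector>

    // Minimal open addressing with linear probing and backshift deletion.
    class IntSet {
        static constexpr std::size_t N = 16;  // fixed capacity, power of two
        std::vector<std::optional<int>> slots_;
        std::size_t home(int k) const { return static_cast<std::size_t>(k) & (N - 1); }
    public:
        IntSet() : slots_(N) {}
        void insert(int k) {
            std::size_t i = home(k);
            while (slots_[i]) {
                if (*slots_[i] == k) return;
                i = (i + 1) & (N - 1);        // linear probe
            }
            slots_[i] = k;
        }
        bool contains(int k) const {
            for (std::size_t i = home(k); slots_[i]; i = (i + 1) & (N - 1))
                if (*slots_[i] == k) return true;
            return false;
        }
        void erase(int k) {
            std::size_t hole = home(k);
            while (slots_[hole] && *slots_[hole] != k) hole = (hole + 1) & (N - 1);
            if (!slots_[hole]) return;        // not present
            std::size_t j = hole;
            for (;;) {
                j = (j + 1) & (N - 1);
                if (!slots_[j]) break;        // end of probe chain
                // An element at j may fill the hole only if the hole lies on
                // its probe path; the cyclic distance check handles wraparound.
                std::size_t dist = (j - home(*slots_[j])) & (N - 1);
                if (dist >= ((j - hole) & (N - 1))) {
                    slots_[hole] = slots_[j]; // shift back, no tombstone
                    hole = j;
                }
            }
            slots_[hole].reset();
        }
    };

    int main() {
        IntSet s;
        s.insert(1); s.insert(17); s.insert(33);  // all hash to home slot 1
        s.erase(1);                               // followers shift back
        assert(!s.contains(1));
        assert(s.contains(17) && s.contains(33)); // chain stays intact
        return 0;
    }
    ```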

  • highwayhash ๐Ÿ“ ๐ŸŒ -- Fast strong hash functions: SipHash/HighwayHash

  • hopscotch-map 📁 🌐 -- a C++ implementation of a fast hash map and hash set using hopscotch hashing and open-addressing to resolve collisions. It is a cache-friendly data structure offering better performance than std::unordered_map in most cases and is very similar to google::dense_hash_map while using less memory and providing more functionality.

  • iceberghashtable ๐Ÿ“ ๐ŸŒ -- IcebergDB: High Performance Hash Tables Through Stability and Low Associativity is a fast, concurrent, and resizeable hash table implementation. It supports insertions, deletions and queries for 64-bit keys and values.

  • LDCF-hash ๐Ÿ“ ๐ŸŒ -- The Logarithmic Dynamic Cuckoo Filter (LDCF) is an efficient approximate membership test data structure for dynamic big data sets. LDCF uses a novel multi-level tree structure and reduces the worst insertion and membership testing time from O(N) to O(1), while simultaneously reducing the memory cost of DCF as the cardinality of the set increases.

  • libart 📁 🌐 -- provides the Adaptive Radix Tree or ART. The ART operates similarly to a traditional radix tree but avoids the wasted space of internal nodes by changing the node size. It makes use of 4 node sizes (4, 16, 48, 256), and can guarantee that the overhead is no more than 52 bytes per key, though in practice it is much lower.

  • libbloom 📁 🌐 -- a high-performance C server, exposing bloom filters and operations over them. The rate of false positives can be tuned to meet application demands, but reducing the error rate rapidly increases the amount of memory required for the representation. Example: Bloom filters enable you to represent 1 million items with a false positive rate of 0.1% in 2.4MB of RAM.
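    The memory/error-rate trade-off quoted above follows the textbook Bloom filter sizing formulas (these are the standard formulas, not libbloom's API): bits needed m = -n ln(p) / (ln 2)^2 and optimal hash count k = (m/n) ln 2. For 1 million items at 0.1% that gives about 14.4 bits per item, roughly 1.8 MB as the theoretical minimum; the 2.4 MB quoted presumably includes implementation overhead.

    ```cpp
    #include <cassert>
    #include <cmath>
    #include <cstdio>

    // Classic Bloom filter sizing: given n items and target false-positive
    // rate p, compute the required bit count and optimal hash count.
    double bits_needed(double n, double p) {
        return -n * std::log(p) / (std::log(2.0) * std::log(2.0));
    }
    double hashes_needed(double n, double p) {
        return bits_needed(n, p) / n * std::log(2.0);
    }

    int main() {
        double bits_per_item = bits_needed(1e6, 0.001) / 1e6;
        double k = hashes_needed(1e6, 0.001);
        std::printf("%.1f bits/item, k = %.1f\n", bits_per_item, k);
        assert(bits_per_item > 14.0 && bits_per_item < 15.0);  // ~14.4
        assert(k > 9.0 && k < 11.0);                           // ~10 hashes
        return 0;
    }
    ```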

  • libbloomfilters ๐Ÿ“ ๐ŸŒ -- libbf is a C++11 library which implements various Bloom filters, including:

    • A^2
    • Basic
    • Bitwise
    • Counting
    • Spectral MI
    • Spectral RM
    • Stable
  • libCRCpp ๐Ÿ“ ๐ŸŒ -- easy to use and fast C++ CRC library.

  • libCSD 📁 🌐 -- a C++ library providing some different techniques for managing string dictionaries in compressed space. These approaches are inspired by the paper: "Compressed String Dictionaries", Nieves R. Brisaboa, Rodrigo Cรกnovas, Francisco Claude, Miguel A. Martรญnez-Prieto, and Gonzalo Navarro, 10th Symposium on Experimental Algorithms (SEA'2011), p.136-147, 2011.

  • libcuckoo ๐Ÿ“ ๐ŸŒ -- provides a high-performance, compact hash table that allows multiple concurrent reader and writer threads.

  • lshbox ๐Ÿ“ ๐ŸŒ -- a C++ Toolbox of Locality-Sensitive Hashing for Large Scale Image Retrieval. Locality-Sensitive Hashing (LSH) is an efficient method for large scale image retrieval, and it achieves great performance in approximate nearest neighborhood searching.

    LSHBOX is a simple but robust C++ toolbox that provides several LSH algorithms; in addition, it can be integrated with the Python and MATLAB languages. The following LSH algorithms have been implemented in LSHBOX:

    • LSH Based on Random Bits Sampling
    • Random Hyperplane Hashing
    • LSH Based on Thresholding
    • LSH Based on p-Stable Distributions
    • Spectral Hashing (SH)
    • Iterative Quantization (ITQ)
    • Double-Bit Quantization Hashing (DBQ)
    • K-means Based Double-Bit Quantization Hashing (KDBQ)
  • map_benchmark ๐Ÿ“ ๐ŸŒ -- comprehensive benchmarks of C++ maps.

  • morton_filter ๐Ÿ“ ๐ŸŒ -- a Morton filter -- a new approximate set membership data structure. A Morton filter is a modified cuckoo filter that is optimized for bandwidth-constrained systems. Morton filters use additional computation in order to reduce their off-chip memory traffic. Like a cuckoo filter, a Morton filter supports insertions, deletions, and lookup operations. It additionally adds high-throughput self-resizing, a feature of quotient filters, which allows a Morton filter to increase its capacity solely by leveraging its internal representation. This capability is in contrast to existing vanilla cuckoo filter implementations, which are static and thus require using a backing data structure that contains the full set of items to resize the filter. Morton filters can also be configured to use less memory than a cuckoo filter for the same error rate while simultaneously delivering insertion, deletion, and lookup throughputs that are, respectively, up to 15.5x, 1.3x, and 2.5x higher than a cuckoo filter. Morton filters in contrast to vanilla cuckoo filters do not require a power of two number of buckets but rather only a number that is a multiple of two. They also use fewer bits per item than a Bloom filter when the target false positive rate is less than around 1% to 3%.

  • mutable_rank_select ๐Ÿ“ ๐ŸŒ -- Rank/Select Queries over Mutable Bitmaps. Given a mutable bitmap B[0..u) where n bits are set, the rank/select problem asks for a data structure built from B that supports rank(i) (the number of bits set in B[0..i], for 0 โ‰ค i < u), select(i) (the position of the i-th bit set, for 0 โ‰ค i < n), flip(i) (toggles B[i], for 0 โ‰ค i < u) and access(i) (return B[i], for 0 โ‰ค i < u). The input bitmap is partitioned into blocks and a tree index is built over them. The tree index implemented in the library is an optimized b-ary Segment-Tree with SIMD AVX2/AVX-512 instructions. You can test a block size of 256 or 512 bits, and various rank/select algorithms for the blocks such as broadword techniques, CPU intrinsics, and SIMD instructions.
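    A naive baseline for the rank problem makes the library's design easier to see. The sketch below (all names hypothetical) splits the bitmap into 64-bit words and keeps a per-word prefix count, so rank(i) costs one lookup plus one masked popcount; the O(u) rebuild after each update is exactly the cost the library's b-ary Segment-Tree index avoids.

    ```cpp
    #include <bitset>
    #include <cassert>
    #include <cstdint>
    #include <vector>

    // Naive block-based rank over a mutable bitmap B[0..u).
    struct RankBitmap {
        std::vector<uint64_t> words;
        std::vector<uint64_t> prefix;  // set bits strictly before each word

        explicit RankBitmap(std::size_t u)
            : words((u + 63) / 64, 0), prefix((u + 63) / 64, 0) {}

        void set(std::size_t i) {
            words[i / 64] |= uint64_t(1) << (i % 64);
            // Naive O(u) rebuild of the prefix counts after every update.
            for (std::size_t w = 1; w < words.size(); ++w)
                prefix[w] = prefix[w - 1] + std::bitset<64>(words[w - 1]).count();
        }

        // rank(i): number of set bits in B[0..i)
        uint64_t rank(std::size_t i) const {
            uint64_t mask = (i % 64) ? ((uint64_t(1) << (i % 64)) - 1) : 0;
            return prefix[i / 64] + std::bitset<64>(words[i / 64] & mask).count();
        }
    };

    int main() {
        RankBitmap b(128);
        b.set(3);
        b.set(70);
        assert(b.rank(4) == 1);   // only bit 3 lies below position 4
        assert(b.rank(64) == 1);  // crossing a word boundary
        assert(b.rank(71) == 2);
        return 0;
    }
    ```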

  • nedtries ๐Ÿ“ ๐ŸŒ -- an in-place bitwise binary Fredkin trie algorithm which allows for near constant time insertions, deletions, finds, closest fit finds and iteration. On modern hardware it is approximately 50-100% faster than red-black binary trees, it handily beats even the venerable O(1) hash table for less than 3000 objects and it is barely slower than the hash table for 10000 objects. Past 10000 objects you probably ought to use a hash table though, and if you need nearest fit rather than close fit then red-black trees are still optimal.

  • OZBCBitmap 📁 🌐 -- OZBC provides an efficient compressed bitmap to create bitmap indexes on high-cardinality columns. Bitmap indexes have traditionally been considered to work well for low-cardinality columns, which have a modest number of distinct values. The simplest and most common method of bitmap indexing on attribute A with cardinality K associates a bitmap with every attribute value V; the Vth bitmap then represents the predicate A=V. This approach ensures an efficient solution for performing searches, but on high-cardinality attributes the size of the bitmap index increases dramatically. OZBC is a run-length-encoded hybrid compressed bitmap designed exclusively to create bitmap indexes on attributes of cardinality L where L>=16, and it provides bitwise logical operations in running time proportional to the compressed bitmap size.

  • parallel-hashmap ๐Ÿ“ ๐ŸŒ -- a set of hash map implementations, as well as a btree alternative to std::map and std::set

  • phf-hash ๐Ÿ“ ๐ŸŒ -- a simple implementation of the CHD perfect hash algorithm. CHD can generate perfect hash functions for very large key sets -- on the order of millions of keys -- in a very short time.

  • poplar-trie ๐Ÿ“ ๐ŸŒ -- a C++17 library of a memory-efficient associative array whose keys are strings. The data structure is based on a dynamic path-decomposed trie (DynPDT) described in the paper, Shunsuke Kanda, Dominik Kรถppl, Yasuo Tabei, Kazuhiro Morita, and Masao Fuketa: Dynamic Path-decomposed Tries, ACM Journal of Experimental Algorithmics (JEA), 25(1): 1โ€“28, 2020. Poplar-trie is a memory-efficient updatable associative array implementation which maps key strings to values of any type like std::map<std::string,anytype>. DynPDT is composed of two structures: dynamic trie and node label map (NLM) structures.

  • PruningRadixTrie 📁 🌐 -- a Radix trie for prefix search & auto-complete that is up to 1000x (three orders of magnitude) faster than a conventional radix trie. A Pruning Radix trie is a novel Radix trie algorithm that allows pruning of the trie and early termination of the lookup. In many cases, we are not interested in a complete set of all children for a given prefix, but only in the top-k most relevant terms. Especially for short prefixes, this results in a massive reduction of lookup time for the top-10 results; on the other hand, a complete result set of millions of suggestions wouldn't be helpful at all for autocompletion. The lookup acceleration is achieved by storing in each node the maximum rank of all its children. By comparing this maximum child rank with the lowest rank of the results retrieved so far, we can heavily prune the trie and terminate the lookup early for non-promising branches with low child ranks.

  • prvhash ๐Ÿ“ ๐ŸŒ -- PRVHASH is a hash function that generates a uniform pseudo-random number sequence derived from the message. PRVHASH is conceptually similar (in the sense of using a pseudo-random number sequence as a hash) to keccak and RadioGatun schemes, but is a completely different implementation of such concept. PRVHASH is both a "randomness extractor" and an "extendable-output function" (XOF).

  • QALSH 📁 🌐 -- QALSH: Query-Aware Locality-Sensitive Hashing, is a package for the problem of Nearest Neighbor Search (NNS) over high-dimensional Euclidean spaces. Given a set of data points and a query, the problem of NNS aims to find the nearest data point to the query. It is a very fundamental problem and has wide applications in many data mining and machine learning tasks. This package provides the external memory implementations (disk-based) of QALSH and QALSH+ for c-Approximate Nearest Neighbor Search (c-ANNS) under lp norm, where 0 < p โฉฝ 2. The internal memory version can be found here.

  • QALSH_Mem ๐Ÿ“ ๐ŸŒ -- Memory Version of QALSH: QALSH_Mem is a package for the problem of Nearest Neighbor Search (NNS). Given a set of data points and a query, the problem of NNS aims to find the nearest data point to the query. It has wide applications in many data mining and machine learning tasks. This package provides the internal memory implementations of two LSH schemes QALSH and QALSH+ for c-Approximate Nearest Neighbor Search (c-ANNS) under lp norm, where 0 < p โฉฝ 2. The external version of QALSH and QALSH+ can be found here.

  • radix_tree 📁 🌐 -- an STL-like radix tree container in C++.

  • rapidhash 📁 🌐 -- very fast, high quality and platform-independent; the fastest recommended hash function in SMHasher, and the fastest passing hash in SMHasher3. rapidhash is wyhash's official successor, with improved speed, quality and compatibility.

  • rax ๐Ÿ“ ๐ŸŒ -- an ANSI C radix tree implementation initially written to be used in a specific place of Redis in order to solve a performance problem, but immediately converted into a stand alone project to make it reusable for Redis itself, outside the initial intended application, and for other projects as well. The primary goal was to find a suitable balance between performances and memory usage, while providing a fully featured implementation of radix trees that can cope with many different requirements.

  • RectangleBinPack 📁 🌐 -- the source code used in "A Thousand Ways to Pack the Bin - A Practical Approach to Two-Dimensional Rectangle Bin Packing."

  • RoaringBitmap ๐Ÿ“ ๐ŸŒ -- Roaring bitmaps are compressed bitmaps which tend to outperform conventional compressed bitmaps such as WAH, EWAH or Concise. In some instances, roaring bitmaps can be hundreds of times faster and they often offer significantly better compression. They can even be faster than uncompressed bitmaps.

  • robin-hood-hashing ๐Ÿ“ ๐ŸŒ -- robin_hood unordered map & set.

  • robin-map ๐Ÿ“ ๐ŸŒ -- a C++ implementation of a fast hash map and hash set using open-addressing and linear robin hood hashing with backward shift deletion to resolve collisions.

  • rollinghashcpp ๐Ÿ“ ๐ŸŒ -- randomized rolling hash functions in C++. This is a set of C++ classes implementing various recursive n-gram hashing techniques, also called rolling hashing (http://en.wikipedia.org/wiki/Rolling_hash), including Randomized Karp-Rabin (sometimes called Rabin-Karp), Hashing by Cyclic Polynomials (also known as Buzhash) and Hashing by Irreducible Polynomials.
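    The core rolling-hash trick is small enough to show directly. This is a minimal polynomial hash in the Karp-Rabin style (illustrative, not the randomized classes from rollinghashcpp): once the window is full, sliding it by one character updates the hash in O(1) instead of rehashing the whole n-gram.

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <string>

    // Polynomial rolling hash over a fixed-size window of n characters.
    // h("c0 c1 ... c_{n-1}") = c0*B^(n-1) + c1*B^(n-2) + ... + c_{n-1},
    // computed modulo 2^64 via unsigned wraparound.
    struct Roll {
        static const uint64_t B = 257;  // base
        uint64_t h = 0, pow = 1;        // pow = B^(n-1)
        std::size_t n;
        explicit Roll(std::size_t win) : n(win) {
            for (std::size_t i = 1; i < n; ++i) pow *= B;
        }
        void init(const std::string& s) {  // hash of the first n characters
            h = 0;
            for (std::size_t i = 0; i < n; ++i)
                h = h * B + static_cast<unsigned char>(s[i]);
        }
        void slide(unsigned char out, unsigned char in) {
            // Remove the outgoing character's contribution, shift, add the new one.
            h = (h - out * pow) * B + in;
        }
    };

    int main() {
        std::string s = "abcde";
        Roll r(3);
        r.init(s);            // hash of "abc"
        r.slide(s[0], s[3]);  // O(1) update: now the hash of "bcd"
        Roll fresh(3);
        fresh.init("bcd");
        assert(r.h == fresh.h);  // sliding matches hashing from scratch
        return 0;
    }
    ```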

  • semimap ๐Ÿ“ ๐ŸŒ -- semi::static_map and semi::map: associative map containers with compile-time lookup! Normally, associative containers require some runtime overhead when looking up their values from a key. However, when the key is known at compile-time (for example, when the key is a literal) then this run-time lookup could technically be avoided. This is exactly what the goal of semi::static_map and semi::map is.
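    The compile-time-lookup idea can be sketched with plain constexpr (a hypothetical illustration; semi::static_map's actual mechanism and API differ): when the key is a literal, the whole search is resolved by the compiler and no runtime lookup remains.

    ```cpp
    #include <cassert>
    #include <string_view>

    // A tiny compile-time key/value table with a constexpr linear search.
    struct Entry { std::string_view key; int value; };
    constexpr Entry kTable[] = {
        {"red", 0xFF0000}, {"green", 0x00FF00}, {"blue", 0x0000FF}};

    constexpr int lookup(std::string_view key) {
        for (const Entry& e : kTable)
            if (e.key == key) return e.value;
        return -1;  // not found
    }

    int main() {
        // With a literal key the lookup happens entirely at compile time:
        static_assert(lookup("green") == 0x00FF00);
        // The same function also works with keys known only at run time.
        assert(lookup("blue") == 0x0000FF);
        assert(lookup("cyan") == -1);
        return 0;
    }
    ```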

  • SipHash ๐Ÿ“ ๐ŸŒ -- SipHash is a family of pseudorandom functions (PRFs) optimized for speed on short messages. This is the reference C code of SipHash: portable, simple, optimized for clarity and debugging. SipHash was designed in 2012 by Jean-Philippe Aumasson and Daniel J. Bernstein as a defense against hash-flooding DoS attacks.

    It is:

    • simpler and faster on short messages than previous cryptographic algorithms, such as MACs based on universal hashing;
    • competitive in performance with insecure non-cryptographic algorithms, such as fhhash;
    • cryptographically secure, with no sign of weakness despite multiple cryptanalysis projects by leading cryptographers;
    • battle-tested, with successful integration in OSs (Linux kernel, OpenBSD, FreeBSD, FreeRTOS), languages (Perl, Python, Ruby, etc.), libraries (OpenSSL libcrypto, Sodium, etc.) and applications (Wireguard, Redis, etc.).

    As a secure pseudorandom function (a.k.a. keyed hash function), SipHash can also be used as a secure message authentication code (MAC). But SipHash is not a hash in the sense of general-purpose key-less hash function such as BLAKE3 or SHA-3. SipHash should therefore always be used with a secret key in order to be secure.

  • slot_map ๐Ÿ“ ๐ŸŒ -- a Slot Map is a high-performance associative container with persistent unique keys to access stored values. Upon insertion, a key is returned that can be used to later access or remove the values. Insertion, removal, and access are all guaranteed to take O(1) time (best, worst, and average case). Great for storing collections of objects that need stable, safe references but have no clear ownership. The difference between a std::unordered_map and a dod::slot_map is that the slot map generates and returns the key when inserting a value. A key is always unique and will only refer to the value that was inserted.
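    The generation-counter mechanism that makes slot map keys safe can be sketched briefly (a hypothetical minimal version; dod::slot_map offers a richer API): erasing bumps the slot's generation, so a stale key is detected instead of silently aliasing whatever value reused the slot.

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <vector>

    // A key is an index plus the generation the slot had at insertion time.
    struct Key { uint32_t index, gen; };

    template <typename T>
    class SlotMap {
        struct Slot { T value{}; uint32_t gen = 0; bool live = false; };
        std::vector<Slot> slots_;
        std::vector<uint32_t> free_;  // indices available for reuse
    public:
        Key insert(const T& v) {
            uint32_t i;
            if (!free_.empty()) { i = free_.back(); free_.pop_back(); }
            else { i = static_cast<uint32_t>(slots_.size()); slots_.emplace_back(); }
            slots_[i].value = v;
            slots_[i].live = true;
            return {i, slots_[i].gen};
        }
        T* get(Key k) {
            if (k.index >= slots_.size()) return nullptr;
            Slot& s = slots_[k.index];
            // Generation mismatch means the key refers to an erased value.
            return (s.live && s.gen == k.gen) ? &s.value : nullptr;
        }
        void erase(Key k) {
            if (get(k)) {
                slots_[k.index].live = false;
                ++slots_[k.index].gen;       // invalidate outstanding keys
                free_.push_back(k.index);
            }
        }
    };

    int main() {
        SlotMap<int> sm;
        Key a = sm.insert(10);
        sm.erase(a);
        Key b = sm.insert(20);          // reuses the slot, new generation
        assert(sm.get(a) == nullptr);   // stale key is safely rejected
        assert(*sm.get(b) == 20);
        return 0;
    }
    ```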

  • smhasher ๐Ÿ“ ๐ŸŒ -- benchmark and collection of fast hash functions for symbol tables or hash tables.

  • sparsehash 📁 🌐 -- Google's extremely memory-efficient hash map implementations (sparse_hash_map, dense_hash_map and their set counterparts).

  • sparse-map 📁 🌐 -- a C++ implementation of a memory efficient hash map and hash set. It uses open-addressing with sparse quadratic probing. The goal of the library is to be the most memory efficient possible, even at low load factor, while keeping reasonable performance.

  • sparsepp ๐Ÿ“ ๐ŸŒ -- a fast, memory efficient hash map for C++. Sparsepp is derived from Google's excellent sparsehash implementation.

  • spookyhash 📁 🌐 -- a very fast non-cryptographic hash function, designed by Bob Jenkins. It produces well-distributed 128-bit hash values for byte arrays of any length. It can produce 64-bit and 32-bit hash values too, at the same speed.

  • StronglyUniversalStringHashing ๐Ÿ“ ๐ŸŒ -- very fast universal hash families on strings.

  • tensorstore ๐Ÿ“ ๐ŸŒ -- TensorStore is an open-source C++ and Python software library designed for storage and manipulation of large multi-dimensional arrays.

  • unordered_dense 📁 🌐 -- ankerl::unordered_dense::{map, set} is a fast & densely stored hashmap and hashset based on robin-hood backward shift deletion for C++17 and later. The classes ankerl::unordered_dense::map and ankerl::unordered_dense::set are (almost) drop-in replacements of std::unordered_map and std::unordered_set. While they don't have as strong iterator / reference stability guarantees, they are typically much faster. Additionally, there are ankerl::unordered_dense::segmented_map and ankerl::unordered_dense::segmented_set with lower peak memory usage and stable iterators/references on insert.

  • wyhash ๐Ÿ“ ๐ŸŒ -- No hash function is perfect, but some are useful. wyhash and wyrand are the ideal 64-bit hash function and PRNG respectively: solid, portable, fastest (especially for short keys), salted (using a dynamic secret to avoid intended attack).

  • xor-and-binary-fuse-filter 📁 🌐 -- XOR and Binary Fuse Filter library: Bloom filters are used to quickly check whether an element is part of a set. Xor filters and binary fuse filters are faster and more concise alternatives to Bloom filters. They are also smaller than cuckoo filters. They are used in production systems.

  • xsg ๐Ÿ“ ๐ŸŒ -- XOR BST implementations are related to the XOR linked list, a doubly linked list variant, from where we borrow the idea about how links between nodes are to be implemented. Modest resource requirements and simplicity make XOR scapegoat trees stand out of the BST crowd. All iterators (except end() iterators), but not references and pointers, are invalidated, after inserting or erasing from this XOR scapegoat tree implementation. You can dereference invalidated iterators, if they were not erased, but you cannot iterate with them. end() iterators are constant and always valid, but dereferencing them results in undefined behavior.

  • xxHash ๐Ÿ“ ๐ŸŒ -- fast (non-cryptographic) hash algorithm

  • circlehash ๐Ÿ“ ๐ŸŒ -- a family of non-cryptographic hash functions that pass every test in SMHasher.

    • removed; reason: written in Go; port to C/C++ is easy but just too much effort for too little gain; when we're looking for fast non-cryptographic hashes like this, we don't appreciate it to include 128-bit / 64-bit multiplications as those are generally slower than shift, add, xor. While this will surely be a nice hash, it doesn't fit our purposes.

Intermediate Data Storage / Caching / Hierarchical Data Stores (binary hOCR; document text revisions; ...)

  • CacheLib ๐Ÿ“ ๐ŸŒ -- provides an in-process high performance caching mechanism, thread-safe API to build high throughput, low overhead caching services, with built-in ability to leverage DRAM and SSD caching transparently.
  • cachelot 📁 🌐 -- an LRU cache that works at the speed of light. The library works with fixed pre-allocated memory: you tell it the memory size and the LRU cache is ready. Small metadata, up to 98% memory utilization.
  • caches ๐Ÿ“ ๐ŸŒ -- implements a simple thread-safe cache with several page replacement policies: LRU (Least Recently Used), FIFO (First-In/First-Out), LFU (Least Frequently Used)
  • c-blosc2 ๐Ÿ“ ๐ŸŒ -- a high performance compressor optimized for binary data (i.e. floating point numbers, integers and booleans), designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() OS call.
  • localmemcache ๐Ÿ“ ๐ŸŒ -- a key-value database and library that provides an interface similar to memcached but for accessing local data instead of remote data. It's based on mmap()'ed shared memory for maximum speed. It supports persistence, also making it a fast alternative to GDBM and Berkeley DB.
  • lru_cache 📁 🌐 -- LRU cache is a fast, header-only, generic C++17 LRU cache library, with customizable backend.
  • lrucache11 📁 🌐 -- a header-only C++11 LRU Cache template class that allows you to define the key, value and optionally the map type. It uses a doubly linked list and a std::unordered_map-style container to provide fast insert, delete and update. No dependencies other than the C++ standard library.
  • pelikan ๐Ÿ“ ๐ŸŒ -- Pelikan is Twitter's unified cache backend.
  • stlcache 📁 🌐 -- STL::Cache is an in-memory cache for C++ applications: a simple wrapper over a standard map that implements several cache algorithms, allowing you to limit the storage size and automatically remove unused items. It is intended for keeping any key/value data, especially when the data is too big to simply put into a map and keep whole. With STL::Cache you can feed an enormous (practically unlimited) amount of data into it, yet it will store only a small part: frequently re-used data stays near your code, while less popular data does not occupy expensive memory. STL::Cache uses configurable policies to decide whether data is worth keeping in the cache or should be thrown away. It ships with 8 policies and you are free to implement your own.
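    The LRU policy common to the cache libraries above can be sketched in a page (a hypothetical minimal version, not any of these libraries' APIs): a list keeps keys in recency order and a hash map points into it, so lookup, insert and eviction are all O(1) on average.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <list>
    #include <string>
    #include <unordered_map>
    #include <utility>

    template <typename K, typename V>
    class LruCache {
        std::size_t cap_;
        std::list<std::pair<K, V>> order_;  // front = most recently used
        std::unordered_map<K, typename std::list<std::pair<K, V>>::iterator> pos_;
    public:
        explicit LruCache(std::size_t cap) : cap_(cap) {}
        void put(const K& k, const V& v) {
            auto it = pos_.find(k);
            if (it != pos_.end()) order_.erase(it->second);
            order_.emplace_front(k, v);
            pos_[k] = order_.begin();
            if (order_.size() > cap_) {     // evict the least recently used
                pos_.erase(order_.back().first);
                order_.pop_back();
            }
        }
        const V* get(const K& k) {
            auto it = pos_.find(k);
            if (it == pos_.end()) return nullptr;
            // splice moves the node to the front without invalidating it.
            order_.splice(order_.begin(), order_, it->second);
            return &it->second->second;
        }
    };

    int main() {
        LruCache<std::string, int> c(2);
        c.put("a", 1);
        c.put("b", 2);
        c.get("a");        // "a" becomes most recent
        c.put("c", 3);     // capacity exceeded: evicts "b", the LRU entry
        assert(c.get("b") == nullptr);
        assert(*c.get("a") == 1);
        return 0;
    }
    ```

    The same skeleton generalizes to the FIFO and LFU policies mentioned above by changing only where entries are (re)positioned in the list.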

RAM-/disk-based large queues and stores: B+tree, LSM-tree, ...

  • arrow ๐Ÿ“ ๐ŸŒ -- Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. The reference Arrow libraries contain many distinct software components:

    • Columnar vector and table-like containers (similar to data frames) supporting flat or nested types

    • Conversions to and from other in-memory data structures

    • Integration tests for verifying binary compatibility between the implementations (e.g. sending data from Java to C++)

    • IO interfaces to local and remote filesystems

    • Readers and writers for various widely-used file formats (such as Parquet, CSV)

    • Reference-counted off-heap buffer memory management, for zero-copy memory sharing and handling memory-mapped files

    • Self-describing binary wire formats (streaming and batch/file-like) for remote procedure calls (RPC) and interprocess communication (IPC)

  • cpp-btree ๐Ÿ“ ๐ŸŒ -- in-memory B+-tree: an alternative for the priority queue as we expect the queue to grow huge, given past experience with Qiqqa.

  • ejdb ๐Ÿ“ ๐ŸŒ -- an embeddable JSON database engine published under MIT license, offering a single file database, online backups support, a simple but powerful query language (JQL), based on the TokyoCabinet-inspired KV store iowow.

  • FASTER ๐Ÿ“ ๐ŸŒ -- helps manage large application state easily, resiliently, and with high performance by offering (1) FASTER Log, which is a high-performance concurrent persistent recoverable log, iterator, and random reader library, and (2) FASTER KV as a concurrent key-value store + cache that is designed for point lookups and heavy updates. FASTER supports data larger than memory, by leveraging fast external storage (local or cloud). It also supports consistent recovery using a fast non-blocking checkpointing technique that lets applications trade-off performance for commit latency. Both FASTER KV and FASTER Log offer orders-of-magnitude higher performance than comparable solutions, on standard workloads.

  • iowow 📁 🌐 -- a C11 file storage utility library and persistent key/value storage engine, supporting multiple key-value databases within a single file, online database backups and Write Ahead Logging (WAL) support. Good performance compared to its main competitors: lmdb, leveldb, kyoto cabinet.

  • libmdbx 📁 🌐 -- one of the fastest embeddable key-value ACID databases, without WAL. libmdbx surpasses the legendary LMDB in terms of reliability, features and performance.

  • libpmemobj-cpp ๐Ÿ“ ๐ŸŒ -- a C++ binding for libpmemobj (a library which is a part of PMDK collection).

  • libshmcache 📁 🌐 -- a local shared memory cache for multiple processes. It is a high performance library because its read mechanism is lockless. libshmcache is 100+ times faster than a remote interface such as redis.

  • Lightning.NET ๐Ÿ“ ๐ŸŒ -- .NET library for OpenLDAP's LMDB key-value store

  • ligra-graph ๐Ÿ“ ๐ŸŒ -- LIGRA: a Lightweight Graph Processing Framework for Shared Memory; works on both uncompressed and compressed graphs and hypergraphs.

  • lmdb ๐Ÿ“ ๐ŸŒ -- OpenLDAP LMDB is an outrageously fast key/value store with semantics that make it highly interesting for many applications. Of specific note, besides speed, is the full support for transactions and good read/write concurrency. LMDB is also famed for its robustness when used correctly.

  • lmdb-safe 📁 🌐 -- a safe, modern & performant C++ wrapper of LMDB. LMDB is an outrageously fast key/value store with semantics that make it highly interesting for many applications. Of specific note, besides speed, is the full support for transactions and good read/write concurrency. LMDB is also famed for its robustness when used correctly. The design of LMDB is elegant and simple, which aids both the performance and stability. The downside of this elegant design is a nontrivial set of rules that need to be followed to not break things. In other words, LMDB delivers great things, but only if you use it exactly right. This is by conscious design. The lmdb-safe library aims to deliver the full LMDB performance while programmatically making sure the LMDB semantics are adhered to, with very limited overhead.

  • lmdb.spreads.net ๐Ÿ“ ๐ŸŒ -- Low-level zero-overhead and the fastest LMDB .NET wrapper with some additional native methods useful for Spreads.

  • lmdb-store ๐Ÿ“ ๐ŸŒ -- an ultra-fast NodeJS interface to LMDB; probably the fastest and most efficient NodeJS key-value/database interface that exists for full storage and retrieval of structured JS data (objects, arrays, etc.) in a true persisted, scalable, ACID compliant database. It provides a simple interface for interacting with LMDB.

  • lmdbxx ๐Ÿ“ ๐ŸŒ -- lmdb++: a comprehensive C++11 wrapper for the LMDB embedded database library, offering both an error-checked procedural interface and an object-oriented resource interface with RAII semantics.

  • palmtree 📁 🌐 -- a concurrent lock-free B+Tree.

  • parallel-hashmap ๐Ÿ“ ๐ŸŒ -- a set of hash map implementations, as well as a btree alternative to std::map and std::set

  • pmdk ๐Ÿ“ ๐ŸŒ -- the Persistent Memory Development Kit (PMDK) is a collection of libraries and tools for System Administrators and Application Developers to simplify managing and accessing persistent memory devices.

  • pmdk-tests ๐Ÿ“ ๐ŸŒ -- tests for Persistent Memory Development Kit

  • pmemkv ๐Ÿ“ ๐ŸŒ -- pmemkv is a local/embedded key-value datastore optimized for persistent memory. Rather than being tied to a single language or backing implementation, pmemkv provides different options for language bindings and storage engines.

  • pmemkv-bench ๐Ÿ“ ๐ŸŒ -- benchmark for libpmemkv and its underlying libraries, based on leveldb's db_bench. The pmemkv_bench utility provides some standard read, write & remove benchmarks. It's based on the db_bench utility included with LevelDB and RocksDB, although the list of supported parameters is slightly different.

  • riegeli ๐Ÿ“ ๐ŸŒ -- Riegeli/records is a file format for storing a sequence of string records, typically serialized protocol buffers. It supports dense compression, fast decoding, seeking, detection and optional skipping of data corruption, filtering of proto message fields for even faster decoding, and parallel encoding.

  • tlx-btree ๐Ÿ“ ๐ŸŒ -- in-memory B+-tree: an alternative for the priority queue as we expect the queue to grow huge, given past experience with Qiqqa.

  • vmem 📁 🌐 -- libvmem and libvmmalloc are a couple of libraries for using persistent memory for malloc-like volatile uses. They have historically been a part of PMDK despite being solely for volatile uses. You may want to consider using memkind instead in code that benefits from extra features like NUMA awareness.

  • vmemcache ๐Ÿ“ ๐ŸŒ -- libvmemcache is an embeddable and lightweight in-memory buffered LRU caching solution. It's designed to fully take advantage of large capacity memory, such as Persistent Memory with DAX, through memory mapping in an efficient and scalable way.

HDF5 file format

  • h5cpp 📁 🌐 -- easy to use HDF5 C++ templates for serial and parallel HDF5. Hierarchical Data Format HDF5 is prevalent in high performance scientific computing, sits directly on top of sequential or parallel file systems, providing block and stream operations on standardized or custom binary/text objects. Scientific computing platforms come with the necessary libraries to read/write HDF5 datasets. H5CPP simplifies interactions with popular linear algebra libraries, provides compiler assisted seamless object persistence, Standard Template Library support and comes equipped with a novel error handling architecture.

    • in-purgatory; reason: see the HDF5 entry below. But it advertises itself as an interface between OpenCV, Eigen, etc. at the same time...
  • HDF5 ๐ŸŒ

    • removed; reason: HDF5 is a nice concept but considered overkill right now; where we need disk stores, we'll be using SQLite or LMDB-like key-value stores instead. Such stores are not meant to be interchangeable with other software in their raw shape and we'll provide public access APIs instead, where applicable.
  • HighFive-HDF5 ๐ŸŒ

    • removed; reason: see the HDF5 entry above.

Data Storage / Caching / IPC: loss-less data compression

  • 7zip ๐Ÿ“ ๐ŸŒ -- 7-Zip: 7-zip.org

  • 7-Zip-zstd ๐Ÿ“ ๐ŸŒ -- 7-Zip ZS with support of additional Codecs: Zstandard, Brotli, LZ4, LZ5, Lizard, Fast LZMA2

  • brotli ๐Ÿ“ ๐ŸŒ -- compression

  • bxzstr ๐Ÿ“ ๐ŸŒ -- a header-only library for using standard c++ iostreams to access streams compressed with ZLib, libBZ2, libLZMA, or libZstd (.gz, .bz2, .xz, and .zst files). For decompression, the format is automatically detected. For compression, the only parameter exposed is the compression algorithm.

  • bzip2 ๐Ÿ“ ๐ŸŒ -- bzip2 with minor modifications to original sources.

  • bzip3 📁 🌐 -- a better, faster and stronger spiritual successor to BZip2. Features higher compression ratios and better performance thanks to an order-0 context mixing entropy coder, a fast Burrows-Wheeler transform making use of suffix arrays, and an RLE with Lempel-Ziv+Prediction pass based on LZ77-style string matching and PPM-style context modeling. Like its ancestor, BZip3 excels at compressing text or code.

  • c-blosc2 ๐Ÿ“ ๐ŸŒ -- a high performance compressor optimized for binary data (i.e. floating point numbers, integers and booleans), designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() OS call.

  • density ๐Ÿ“ ๐ŸŒ -- a superfast compression library. It is focused on high-speed compression, at the best ratio possible. All three of DENSITY's algorithms are currently at the Pareto frontier of compression speed vs. ratio (cf. here for an independent benchmark).

  • densityxx ๐Ÿ“ ๐ŸŒ -- the C++ version of density, a super fast compression library.

  • easylzma ๐Ÿ“ ๐ŸŒ -- a C library and command line tools for LZMA compression and decompression. It uses Igor Pavlov's reference implementation and SDK, written in C.

  • fast-lzma2 ๐Ÿ“ ๐ŸŒ -- the Fast LZMA2 Library is a lossless high-ratio data compression library based on Igor Pavlov's LZMA2 codec from 7-zip. Binaries of 7-Zip forks which use the algorithm are available in the 7-Zip-FL2 project, the 7-Zip-zstd project, and the active fork of p7zip. The library is also embedded in a fork of XZ Utils, named FXZ Utils.

  • fast_pfor ๐Ÿ“ ๐ŸŒ -- a research library with integer compression schemes. It is broadly applicable to the compression of arrays of 32-bit integers where most integers are small. The library seeks to exploit SIMD instructions (SSE) whenever possible.

  • fsst ๐Ÿ“ ๐ŸŒ -- Fast Static Symbol Table (FSST): fast text compression that allows random access. See also the PVLDB paper https://github.com/cwida/fsst/raw/master/fsstcompression.pdf. FSST is a compression scheme focused on string/text data: it can compress strings from distributions with many different values (i.e. where dictionary compression will not work well). It allows random access to compressed data: it is not block-based, so individual strings can be decompressed without touching the surrounding data in a compressed block. When compared to e.g. LZ4 (which is block-based), FSST achieves similar decompression speed and compression speed, and a better compression ratio. FSST encodes strings using a symbol table -- but it works on pieces of the string, as it maps "symbols" (1-8 byte sequences) onto "codes" (single bytes). FSST can also represent a byte as an exception (255 followed by the original byte). Hence, compression transforms a sequence of bytes into a (supposedly shorter) sequence of codes or escaped bytes. These shorter byte sequences can be treated as strings again and fit into whatever part of your program manipulates strings. An optional 0-terminated mode (like C-strings) is also supported.
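
The escape mechanism described above is easy to see in toy form. The sketch below uses a hypothetical hard-coded mini-table of 2-byte symbols (real FSST builds its table adaptively and supports 1-8 byte symbols): frequent pairs become single-byte codes, everything else is escaped as 255 plus the raw byte.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy FSST-style coder. Hypothetical fixed table mapping frequent
// 2-byte symbols onto single-byte codes; code 255 is the escape marker.
static const std::map<std::string, uint8_t> kSymbols = {
    {"th", 0}, {"he", 1}, {"in", 2}, {"er", 3}};

std::vector<uint8_t> encode(const std::string& s) {
  std::vector<uint8_t> out;
  size_t i = 0;
  while (i < s.size()) {
    auto it = (i + 1 < s.size()) ? kSymbols.find(s.substr(i, 2))
                                 : kSymbols.end();
    if (it != kSymbols.end()) {
      out.push_back(it->second);              // symbol hit: one code byte
      i += 2;
    } else {
      out.push_back(255);                     // escape marker ...
      out.push_back(static_cast<uint8_t>(s[i]));  // ... plus the raw byte
      ++i;
    }
  }
  return out;
}

std::string decode(const std::vector<uint8_t>& in) {
  std::string table[256];                     // invert the symbol table
  for (const auto& kv : kSymbols) table[kv.second] = kv.first;
  std::string out;
  for (size_t i = 0; i < in.size(); ++i) {
    if (in[i] == 255) out.push_back(static_cast<char>(in[++i]));
    else out += table[in[i]];
  }
  return out;
}
```

Note how decoding any code byte never needs context from neighbouring codes, which is what makes random access into the compressed stream possible.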

  • libbsc ๐Ÿ“ ๐ŸŒ -- a library for lossless, block-sorting data compression. bsc is a high performance file compressor based on lossless, block-sorting data compression algorithms.

  • libCSD ๐Ÿ“ ๐ŸŒ -- a C++ library providing several different techniques for managing string dictionaries in compressed space. These approaches are inspired by the paper: "Compressed String Dictionaries", Nieves R. Brisaboa, Rodrigo Cรกnovas, Francisco Claude, Miguel A. Martรญnez-Prieto, and Gonzalo Navarro, 10th Symposium on Experimental Algorithms (SEA'2011), p.136-147, 2011.

  • libdeflate ๐Ÿ“ ๐ŸŒ -- heavily optimized library for DEFLATE/zlib/gzip compression and decompression.

  • libsais ๐Ÿ“ ๐ŸŒ -- a library for fast linear time suffix array, longest common prefix array and Burrows-Wheeler transform construction based on induced sorting algorithm described in the following papers:

    • Ge Nong, Sen Zhang, Wai Hong Chan Two Efficient Algorithms for Linear Suffix Array Construction, 2009
    • Juha Karkkainen, Giovanni Manzini, Simon J. Puglisi Permuted Longest-Common-Prefix Array, 2009
    • Nataliya Timoshevskaya, Wu-chun Feng SAIS-OPT: On the characterization and optimization of the SA-IS algorithm for suffix array construction, 2014
    • Jing Yi Xie, Ge Nong, Bin Lao, Wentao Xu Scalable Suffix Sorting on a Multicore Machine, 2020

    libsais is inspired by libdivsufsort, the sais libraries by Yuta Mori, and msufsort by Michael Maniscalco.

  • libzip ๐Ÿ“ ๐ŸŒ -- a C library for reading, creating, and modifying zip and zip64 archives.

  • libzopfli ๐Ÿ“ ๐ŸŒ -- Zopfli Compression Algorithm is a compression library programmed in C to perform very good, but slow, deflate or zlib compression.

  • lizard ๐Ÿ“ ๐ŸŒ -- efficient compression with very fast decompression. Lizard (formerly LZ5) is a lossless compression algorithm which contains 4 compression methods:

    • fastLZ4 : compression levels -10...-19 are designed to give better decompression speed than [LZ4] i.e. over 2000 MB/s
    • fastLZ4 + Huffman : compression levels -30...-39 add Huffman coding to fastLZ4
    • LIZv1 : compression levels -20...-29 are designed to give better ratio than [LZ4] keeping 75% decompression speed
    • LIZv1 + Huffman : compression levels -40...-49 give the best ratio (comparable to [zlib] and low levels of [zstd]/[brotli]) at decompression speed of 1000 MB/s
  • lz4 ๐Ÿ“ ๐ŸŒ -- LZ4 is a lossless compression algorithm, providing compression speeds > 500 MB/s per core, scalable with multi-core CPUs. It features an extremely fast decoder, with speeds in multiple GB/s per core, typically reaching RAM speed limits on multi-core systems.

  • lzbench ๐Ÿ“ ๐ŸŒ -- an in-memory benchmark of open-source LZ77/LZSS/LZMA compressors. It joins all compressors into a single exe.

  • lzham_codec ๐Ÿ“ ๐ŸŒ -- LZHAM is a lossless data compression codec, with a compression ratio similar to LZMA but with 1.5x-8x faster decompression speed.

  • lzma ๐Ÿ“ ๐ŸŒ -- LZMA Utils is an attempt to provide LZMA compression to POSIX-like systems. The idea is to have a gzip-like command line tool and a zlib-like library, which would make it easy to adapt the new compression technology to existing applications.

  • p7zip ๐Ÿ“ ๐ŸŒ -- p7zip-zstd = 7zip with extensions, including major modern codecs such as Brotli, Fast LZMA2, LZ4, LZ5, Lizard and Zstd.

  • shoco ๐Ÿ“ ๐ŸŒ -- a fast compressor for short strings

  • snappy ๐Ÿ“ ๐ŸŒ -- an up-to-date fork of google/snappy, a fast compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression.

  • squash ๐Ÿ“ ๐ŸŒ -- an abstraction library which provides a single API to access many compression libraries, allowing applications a great deal of flexibility when choosing a compression algorithm, or allowing a choice between several of them.

  • Turbo-Range-Coder ๐Ÿ“ ๐ŸŒ -- TurboRC: Turbo Range Coder + rANS Asymmetric Numeral Systems is a very fast (branchless) Range Coder / Arithmetic Coder.

  • xz ๐Ÿ“ ๐ŸŒ -- XZ Utils provide a general-purpose data-compression library plus command-line tools. The native file format is the .xz format, but the legacy .lzma format is also supported. The .xz format supports multiple compression algorithms, which are called "filters" in the context of XZ Utils. The primary filter is currently LZMA2. With typical files, XZ Utils create about 30% smaller files than gzip.

  • zfp-compressed-arrays ๐Ÿ“ ๐ŸŒ -- zfp is a compressed format for representing multidimensional floating-point and integer arrays. zfp provides compressed-array classes that support high throughput read and write random access to individual array elements. zfp also supports serial and parallel (OpenMP and CUDA) compression of whole arrays, e.g., for applications that read and write large data sets to and from disk.

  • zstd ๐Ÿ“ ๐ŸŒ -- Zstandard, a.k.a. zstd, is a fast lossless compression algorithm, targeting real-time compression scenarios at zlib-level and better compression ratios.

  • lzo ๐ŸŒ

    • removed; reason: gone as part of the first round of compression libraries' cleanup: we intend to support lz4 for fast work, plus zstd and maybe brotli for higher compression ratios, while we won't bother with anything else: the rest can be dealt with through Apache Tika or other thirdparty pipelines when we need to read (or write) them. See also: 7zip-Zstd, which is what I use for accessing almost all compressed material anywhere.
  • lzsse ๐ŸŒ

    • removed; reason: see lzo above. LZ4 either overtakes this one or is on par (anno 2022 AD) and I don't see a lot happening here, so the coolness factor is slowly fading...
  • pithy ๐ŸŒ

    • removed; reason: see lzo above. LZ4 either overtakes this one or is on par (anno 2022 AD) and I don't see a lot happening here, so the coolness factor is slowly fading...
  • xz-utils ๐ŸŒ

    • removed; reason: see lzo above. When we want this, we can go through Apache Tika or other thirdparty pipelines.

See also lzbench.

File / Directory Tree Synchronization (local and remote)

  • cdc-file-transfer ๐Ÿ“ ๐ŸŒ -- CDC File Transfer contains tools for syncing and streaming files from Windows to Windows or Linux. The tools are based on Content Defined Chunking (CDC), in particular FastCDC, to split up files into chunks.
  • CryptSync ๐Ÿ“ ๐ŸŒ -- a small utility that synchronizes two folders while encrypting the contents in one folder. That means one of the two folders has all files unencrypted (the files you work with) and the other folder has all the files encrypted. This is best used together with cloud storage tools like OneDrive, DropBox or Google Drive.
  • csync2 ๐Ÿ“ ๐ŸŒ -- a cluster synchronization tool. It can be used to keep files on multiple hosts in a cluster in sync. Csync2 can handle complex setups with much more than just 2 hosts, handle file deletions and can detect conflicts.
  • filecopyex3 ๐Ÿ“ ๐ŸŒ -- a FAR plugin designed to bring to life all kinds of perverted fantasies on the topic of file copying, each of which will speed up the process by 5% ๐Ÿ˜„. At the moment, it has implemented the main features that are sometimes quite lacking in standard copiers.
  • FreeFileSync ๐Ÿ“ ๐ŸŒ -- a folder comparison and synchronization application that creates and manages backup copies of all your important files. Instead of copying every file every time, FreeFileSync determines the differences between a source and a target folder and transfers only the minimum amount of data needed. FreeFileSync is available for Windows, macOS, and Linux.
  • lib_nas_lockfile ๐Ÿ“ ๐ŸŒ -- lockfile management on NAS and other disparate network filesystem storage. To be combined with SQLite to create a proper Qiqqa Sync operation.
  • librsync ๐Ÿ“ ๐ŸŒ -- a library for calculating and applying network deltas. librsync encapsulates the core algorithms of the rsync protocol, which help with efficient calculation of the differences between two files. The rsync algorithm is different from most differencing algorithms because it does not require the presence of the two files to calculate the delta. Instead, it requires a set of checksums of each block of one file, which together form a signature for that file. Blocks at any position in the other file which have the same checksum are likely to be identical, and whatever remains is the difference. This algorithm transfers the differences between two files without needing both files on the same system.
  • rclone ๐Ÿ“ ๐ŸŒ -- Rclone ("rsync for cloud storage") is a command-line program to sync files and directories to and from different cloud storage providers. See the full list of all storage providers and their features.
  • rsync ๐Ÿ“ ๐ŸŒ -- Rsync is a fast and extraordinarily versatile file copying tool for both remote and local files. Rsync uses a delta-transfer algorithm which provides a very fast method for bringing remote files into sync.
  • vcopy ๐Ÿ“ ๐ŸŒ -- tool to safely copy files across various (local) hardware under circumstances where there may be another file writer active at the same time and/or the (USB?) connection is sometimes flakey or system I/O drivers buggered.
  • zsync2 ๐Ÿ“ ๐ŸŒ -- the advanced file download/sync tool zsync. zsync is a well-known tool for downloading and updating local files from HTTP servers, using the rsync algorithm for diffing binary files. It thus becomes possible to synchronize modifications by fetching only the changed blocks, using HTTP Range: requests. The system is based on meta files called .zsync files, which contain hash sums for every block of data. The meta file is generated from, and stored along with, the actual file it refers to. Due to how the system works, nothing but a "dumb" HTTP server is required to make use of zsync2. This makes it easy to integrate zsync2 into existing systems.

OCR: hOCR output format, other output formats? (dedicated binary?)

  • archive-hocr-tools ๐Ÿ“ ๐ŸŒ -- a python package to ease hOCR parsing in a streaming manner.

  • hocr-fileformat ๐Ÿ“ ๐ŸŒ -- tools to validate and transform between OCR file formats (hOCR, ALTO, PAGE, FineReader)

  • hocr-spec ๐Ÿ“ ๐ŸŒ -- the hOCR Embedded OCR Workflow and Output Format specification originally written by Thomas Breuel.

  • hocr-tools ๐Ÿ“ ๐ŸŒ -- a Public Specification and tools for the hOCR Format.

    hOCR is a format for representing OCR output, including layout information, character confidences, bounding boxes, and style information. It embeds this information invisibly in standard HTML. By building on standard HTML, it automatically inherits well-defined support for most scripts, languages, and common layout options. Furthermore, unlike previous OCR formats, the recognized text and OCR-related information co-exist in the same file and survive editing and manipulation. hOCR markup is independent of the presentation.

Pattern Recognition

"A.I." for cover pages, image/page segmentation, including abstract & summary demarcation, "figure" and "table" detection & extraction from documents, ...

BLAS, LAPACK, ...

  • amd-fftw ๐Ÿ“ ๐ŸŒ -- AOCL-FFTW is an AMD-optimized version of the FFTW implementation targeted at AMD EPYC CPUs. It is developed on top of FFTW (version fftw-3.3.10). AOCL-FFTW achieves high performance as a result of its various optimizations involving improved SIMD kernel functions, improved copy functions (cpy2d and cpy2d_pair used in rank-0 transform and buffering plan), improved 256-bit kernel selection by the planner and an optional in-place transpose for large problem sizes. AOCL-FFTW improves the performance of in-place MPI FFTs by employing a faster in-place MPI transpose function.

  • armadillo ๐Ÿ“ ๐ŸŒ -- C++ library for linear algebra & scientific computing

  • autodiff ๐Ÿ“ ๐ŸŒ -- a C++17 library that uses modern and advanced programming techniques to enable automatic computation of derivatives in an efficient, easy, and intuitive way.

  • BaseMatrixOps ๐Ÿ“ ๐ŸŒ -- wrappers to C++ linear algebra libraries. No guarantees made about APIs or functionality.

  • blis ๐Ÿ“ ๐ŸŒ -- BLIS is an award-winning portable software framework for instantiating high-performance BLAS-like dense linear algebra libraries. The framework was designed to isolate essential kernels of computation that, when optimized, immediately enable optimized implementations of most of its commonly used and computationally intensive operations. BLIS is written in ISO C99 and available under a new/modified/3-clause BSD license. While BLIS exports a new BLAS-like API, it also includes a BLAS compatibility layer which gives application developers access to BLIS implementations via traditional BLAS routine calls. An object-based API unique to BLIS is also available.

  • clBLAS ๐Ÿ“ ๐ŸŒ -- the OpenCLโ„ข BLAS portion of OpenCL's clMath. The complete set of BLAS level 1, 2 & 3 routines is implemented. In addition to GPU devices, the library also supports running on CPU devices to facilitate debugging and multicore programming. The primary goal of clBLAS is to make it easier for developers to utilize the inherent performance and power efficiency benefits of heterogeneous computing. clBLAS interfaces do not hide nor wrap OpenCL interfaces, but rather leaves OpenCL state management to the control of the user to allow for maximum performance and flexibility. The clBLAS library does generate and enqueue optimized OpenCL kernels, relieving the user from the task of writing, optimizing and maintaining kernel code themselves.

  • CLBlast ๐Ÿ“ ๐ŸŒ -- the tuned OpenCL BLAS library. CLBlast is a modern, lightweight, performant and tunable OpenCL BLAS library written in C++11. It is designed to leverage the full performance potential of a wide variety of OpenCL devices from different vendors, including desktop and laptop GPUs, embedded GPUs, and other accelerators. CLBlast implements BLAS routines: basic linear algebra subprograms operating on vectors and matrices.

  • CLBlast-database ๐Ÿ“ ๐ŸŒ -- the full database of tuning results for the CLBlast OpenCL BLAS library. Tuning results are obtained using CLBlast and the CLTune auto-tuner.

  • CLTune ๐Ÿ“ ๐ŸŒ -- automatic OpenCL kernel tuning for CLBlast: CLTune is a C++ library which can be used to automatically tune your OpenCL and CUDA kernels. The only thing you'll need to provide is a tuneable kernel and a list of allowed parameters and values.

  • Cmathtuts ๐Ÿ“ ๐ŸŒ -- a collection of linear algebra math tutorials in C for BLAS, LAPACK and other fundamental APIs. These include samples for BLAS, LAPACK, CLAPACK, LAPACKE, ATLAS, OpenBLAS ...

  • efftw ๐Ÿ“ ๐ŸŒ -- Eigen-FFTW is a modern C++20 wrapper library around FFTW for Eigen.

  • ensmallen ๐Ÿ“ ๐ŸŒ -- a high-quality C++ library for non-linear numerical optimization. ensmallen provides many types of optimizers that can be used for virtually any numerical optimization task. This includes gradient descent techniques, gradient-free optimizers, and constrained optimization. Examples include L-BFGS, SGD, CMAES and Simulated Annealing.

  • fastapprox ๐Ÿ“ ๐ŸŒ -- approximate and vectorized versions of common mathematical functions (e.g. exponential, logarithm, and power, lgamma and digamma, cosh, sinh, tanh, cos, sin, tan, sigmoid and erf, Lambert W)
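
To give a taste of the speed-for-accuracy trades behind such libraries, here is a crude log2 built from the IEEE-754 bit pattern: reinterpreting the float bits as an integer makes the exponent field behave like a scaled-and-offset log2. This is a sketch in the same spirit as fastapprox, not its exact implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Crude log2 approximation: the integer view of a positive IEEE-754
// float is (exponent+127)*2^23 + mantissa_bits, so dividing by 2^23 and
// subtracting 127 approximates log2(x). Exact at powers of two; the
// worst-case error of this simple form is roughly 0.09.
float fasterlog2(float x) {
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof bits);  // safe type pun
  return bits * (1.0f / (1u << 23)) - 127.0f;
}
```

The division is replaced by a single multiply; no transcendental function is ever called, which is the whole point for throughput-bound inner loops.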

  • fastrange ๐Ÿ“ ๐ŸŒ -- a fast alternative to the modulo reduction. It has accelerated some operations in Google's TensorFlow by 10% to 20%. Further reading: http://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/ See also: Daniel Lemire, Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation, January 2019, Article No. 3, https://doi.org/10.1145/3230636
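
The trick fastrange implements is small enough to show inline: map a 32-bit value into [0, N) by taking the high half of a 64-bit product instead of computing an integer modulo, which compiles to a slow division.

```cpp
#include <cassert>
#include <cstdint>

// Multiply-shift reduction: (x * n) / 2^32 lands in [0, n) for any
// 32-bit x, distributing inputs across the range much like x % n but
// with one multiply instead of a division.
uint32_t fastrange32(uint32_t x, uint32_t n) {
  return static_cast<uint32_t>((static_cast<uint64_t>(x) * n) >> 32);
}
```

Note the mapping is order-preserving rather than cyclic, so it suits hash-bucket selection and random-range generation, not cases where the exact remainder matters.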

  • ffts ๐Ÿ“ ๐ŸŒ -- FFTS -- The Fastest Fourier Transform in the South.

  • gcem ๐Ÿ“ ๐ŸŒ -- GCE-Math (Generalized Constant Expression Math) is a templated C++ library enabling compile-time computation of mathematical functions.

  • GraphBLAS ๐Ÿ“ ๐ŸŒ -- SuiteSparse:GraphBLAS is a complete implementation of the GraphBLAS standard, which defines a set of sparse matrix operations on an extended algebra of semirings using an almost unlimited variety of operators and types. When applied to sparse adjacency matrices, these algebraic operations are equivalent to computations on graphs. GraphBLAS provides a powerful and expressive framework for creating graph algorithms based on the elegant mathematics of sparse matrix operations on a semiring.

  • h5cpp ๐Ÿ“ ๐ŸŒ -- easy to use HDF5 C++ templates for Serial and Parallel HDF5. Hierarchical Data Format HDF5 is prevalent in high performance scientific computing, sits directly on top of sequential or parallel file systems, providing block and stream operations on standardized or custom binary/text objects. Scientific computing platforms come with the necessary libraries to read and write HDF5 datasets. H5CPP simplifies interactions with popular linear algebra libraries, provides compiler assisted seamless object persistence, Standard Template Library support and comes equipped with a novel error handling architecture.

  • Imath ๐Ÿ“ ๐ŸŒ -- a basic, light-weight, and efficient C++ representation of 2D and 3D vectors and matrices and other simple but useful mathematical objects, functions, and data types common in computer graphics applications, including the โ€œhalfโ€ 16-bit floating-point type.

  • itpp ๐Ÿ“ ๐ŸŒ -- IT++ is a C++ library of mathematical, signal processing and communication classes and functions. Its main use is in simulation of communication systems and for performing research in the area of communications. The kernel of the library consists of generic vector and matrix classes, and a set of accompanying routines. Such a kernel makes IT++ similar to MATLAB or GNU Octave. The IT++ library originates from the former department of Information Theory at the Chalmers University of Technology, Gothenburg, Sweden.

  • kalman-cpp ๐Ÿ“ ๐ŸŒ -- Kalman filter and extended Kalman filter implementation in C++. Implements Kalman, Extended Kalman, Second-order extended Kalman and Unscented Kalman filters.
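
The predict/update cycle that all these filter variants share is easiest to see in the scalar case. Below is a minimal 1-D Kalman filter for a constant-value model; this is an illustrative sketch only, not kalman-cpp's actual matrix-based API.

```cpp
#include <cassert>
#include <cmath>

// Scalar Kalman filter: track a single slowly-varying value from noisy
// measurements. q models process noise, r models measurement noise.
struct Kalman1D {
  double x = 0.0;   // state estimate
  double p = 1.0;   // estimate variance (uncertainty)
  double q, r;
  Kalman1D(double process_noise, double measurement_noise)
      : q(process_noise), r(measurement_noise) {}
  double update(double z) {
    p += q;                   // predict: uncertainty grows
    double k = p / (p + r);   // Kalman gain: trust in the measurement
    x += k * (z - x);         // correct estimate toward measurement z
    p *= (1.0 - k);           // update: uncertainty shrinks
    return x;
  }
};
```

The extended/unscented variants in the listed libraries replace the scalar arithmetic with matrix algebra and linearize (or sample) a nonlinear model, but the two-phase structure is the same.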

  • kissfft ๐Ÿ“ ๐ŸŒ -- KISS FFT - a mixed-radix Fast Fourier Transform based up on the principle, "Keep It Simple, Stupid."

  • lapack ๐Ÿ“ ๐ŸŒ -- CBLAS + LAPACK optimized linear algebra libs

  • libalg ๐Ÿ“ ๐ŸŒ -- the mathematical ALGLIB library for C++.

  • libbf ๐Ÿ“ ๐ŸŒ -- a small library to handle arbitrary precision binary or decimal floating point numbers

  • libcnl ๐Ÿ“ ๐ŸŒ -- The Compositional Numeric Library (CNL) is a C++ library of fixed-precision numeric classes which enhance integers to deliver safer, simpler, cheaper arithmetic types. CNL is particularly well-suited to: (1) compute on energy-constrained environments where FPUs are absent or costly; (2) compute on energy-intensive environments where arithmetic is the bottleneck such as simulations, machine learning applications and DSPs; and (3) domains such as finance where precision is essential.

  • libeigen ๐Ÿ“ ๐ŸŒ -- a C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms.

  • math-atlas ๐Ÿ“ ๐ŸŒ -- The ATLAS (Automatically Tuned Linear Algebra Software) project is an ongoing research effort focusing on applying empirical techniques in order to provide portable performance, delivering an efficient BLAS implementation, as well as a few routines from LAPACK.

  • mipp ๐Ÿ“ ๐ŸŒ -- MyIntrinsics++ (MIPP): a portable wrapper for vector intrinsic functions (SIMD) written in C++11. It works for SSE, AVX, AVX-512 and ARM NEON (32-bit and 64-bit) instructions. The MIPP wrapper supports single/double precision floating-point numbers and also signed integer arithmetic (64-bit, 32-bit, 16-bit and 8-bit). With the MIPP wrapper you no longer need to write architecture-specific intrinsic code. Just use the provided functions and the wrapper will automatically generate the right intrinsic calls for your specific architecture.

  • mlpack ๐Ÿ“ ๐ŸŒ -- an intuitive, fast, and flexible C++ machine learning library, meant to be a machine learning analog to LAPACK, aiming to implement a wide array of machine learning methods and functions as a "swiss army knife" for machine learning researchers.

  • nfft ๐Ÿ“ ๐ŸŒ -- Nonequispaced FFT (NFFT) is a software library, written in C, for computing non-equispaced fast Fourier transforms and related variations.

  • OpenBLAS ๐Ÿ“ ๐ŸŒ -- an optimized BLAS (Basic Linear Algebra Subprograms) library based on GotoBLAS2 1.13 BSD version.

  • OpenCL-CTS ๐Ÿ“ ๐ŸŒ -- the OpenCL Conformance Test Suite (CTS) for all versions of the Khronos OpenCL standard.

  • OpenCL-Headers ๐Ÿ“ ๐ŸŒ -- C language headers for the OpenCL API.

  • OpenCL-SDK ๐Ÿ“ ๐ŸŒ -- the Khronos OpenCL SDK. It brings together all the components needed to develop OpenCL applications.

  • optim ๐Ÿ“ ๐ŸŒ -- OptimLib is a lightweight C++ library of numerical optimization methods for nonlinear functions. Features a C++11/14/17 library of local and global optimization algorithms, as well as root finding techniques, derivative-free optimization using advanced, parallelized metaheuristic methods and constrained optimization routines to handle simple box constraints, as well as systems of nonlinear constraints.

  • QuantLib ๐Ÿ“ ๐ŸŒ -- a free/open-source library for quantitative finance, providing a comprehensive software framework for modeling, trading, and risk management in real life.

  • sdsl-lite ๐Ÿ“ ๐ŸŒ -- The Succinct Data Structure Library (SDSL) is a powerful and flexible C++11 library implementing succinct data structures. In total, the library contains the highlights of 40 research publications. Succinct data structures can represent an object (such as a bitvector or a tree) in space close to the information-theoretic lower bound of the object while supporting operations of the original object efficiently. The theoretical time complexity of an operation performed on the classical data structure and the equivalent succinct data structure are (most of the time) identical.

  • stan-math ๐Ÿ“ ๐ŸŒ -- the Stan Math Library is a C++, reverse-mode automatic differentiation library designed to be usable, extensive and extensible, efficient, scalable, stable, portable, and redistributable in order to facilitate the construction and utilization of algorithms that utilize derivatives.

  • stats ๐Ÿ“ ๐ŸŒ -- StatsLib is a templated C++ library of statistical distribution functions, featuring unique compile-time computing capabilities and seamless integration with several popular linear algebra libraries. Features a header-only library of probability density functions, cumulative distribution functions, quantile functions, and random sampling methods. Functions are written in C++11 constexpr format, enabling the library to operate as both a compile-time and run-time computation engine. Functions to compute the cdf, pdf, and quantile, as well as random sampling methods, are available for the following distributions: Bernoulli, Beta, Binomial, Cauchy, Chi-squared, Exponential, F, Gamma, Inverse-Gamma, Inverse-Gaussian, Laplace, Logistic, Log-Normal, Normal (Gaussian), Poisson, Rademacher, Student's t, Uniform and Weibull. In addition, pdf and random sampling functions are available for several multivariate distributions: inverse-Wishart, Multivariate Normal and Wishart.

  • SuiteSparse ๐Ÿ“ ๐ŸŒ -- a set of sparse-matrix-related packages written or co-authored by Tim Davis, available at https://github.com/DrTimothyAldenDavis/SuiteSparse . Packages:

    • AMD - approximate minimum degree ordering. This is the built-in AMD function in MATLAB.
    • BTF - permutation to block triangular form
    • CAMD - constrained approximate minimum degree ordering
    • CCOLAMD - constrained column approximate minimum degree ordering
    • CHOLMOD - sparse Cholesky factorization. Requires AMD, COLAMD, CCOLAMD, the BLAS, and LAPACK. Optionally uses METIS. This is chol and x=A\b in MATLAB.
    • COLAMD - column approximate minimum degree ordering. This is the built-in COLAMD function in MATLAB.
    • CSparse - a concise sparse matrix package, developed for my book, "Direct Methods for Sparse Linear Systems", published by SIAM. Intended primarily for teaching. For production, use CXSparse instead.
    • CXSparse - CSparse Extended. Includes support for complex matrices and both int or long integers. Use this instead of CSparse for production use; it creates a libcsparse.so (or dylib on the Mac) with the same name as CSparse. It is a superset of CSparse.
    • GraphBLAS - graph algorithms in the language of linear algebra. https://graphblas.org
    • KLU - sparse LU factorization, primarily for circuit simulation. Requires AMD, COLAMD, and BTF. Optionally uses CHOLMOD, CAMD, CCOLAMD, and METIS.
    • LAGraph - a graph algorithms library based on GraphBLAS. See also https://github.com/GraphBLAS/LAGraph
    • LDL - a very concise LDL' factorization package
  • tinyexpr ๐Ÿ“ ๐ŸŒ -- a very small recursive descent parser and evaluation engine for math expressions.
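
The recursive descent technique tinyexpr is built on can be sketched in a few dozen lines: one mutually recursive function per precedence level. This toy evaluator (not tinyexpr's API; it assumes well-formed input) handles +, -, *, / and parentheses.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <string>

// Minimal recursive descent evaluator. Grammar:
//   expr   := term   (('+'|'-') term)*
//   term   := factor (('*'|'/') factor)*
//   factor := number | '-' factor | '(' expr ')'
struct Parser {
  const char* p;
  explicit Parser(const char* s) : p(s) {}
  void skip() { while (std::isspace(static_cast<unsigned char>(*p))) ++p; }
  double factor() {
    skip();
    if (*p == '(') { ++p; double v = expr(); skip(); ++p; return v; }  // eat ')'
    if (*p == '-') { ++p; return -factor(); }
    char* end;
    double v = std::strtod(p, &end);
    p = end;
    return v;
  }
  double term() {
    double v = factor();
    for (skip(); *p == '*' || *p == '/'; skip()) {
      char op = *p++;
      double r = factor();
      v = (op == '*') ? v * r : v / r;
    }
    return v;
  }
  double expr() {
    double v = term();
    for (skip(); *p == '+' || *p == '-'; skip()) {
      char op = *p++;
      double r = term();
      v = (op == '+') ? v + r : v - r;
    }
    return v;
  }
};
double eval(const std::string& s) { return Parser(s.c_str()).expr(); }
```

Operator precedence falls out of the call structure: term() binds tighter than expr() simply because expr() calls it.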

  • universal-numbers ๐Ÿ“ ๐ŸŒ -- a header-only C++ template library for universal number arithmetic. The goal of the Universal Numbers Library is to offer applications alternatives to IEEE floating-point that are more efficient and mathematically robust. The Universal library is a ready-to-use header-only library that provides plug-in replacement for native types, and provides a low-friction environment to start exploring alternatives to IEEE floating-point in your own algorithms.

  • xsimd ๐Ÿ“ ๐ŸŒ -- SIMD (Single Instruction, Multiple Data) instructions differ between microprocessor vendors and compilers. xsimd provides a unified means for using these features for library authors. It enables manipulation of batches of numbers with the same arithmetic operators as for single values. It also provides accelerated implementation of common mathematical functions operating on batches.

delta features & other feature extraction (see Qiqqa research notes)

  • diffutils ๐Ÿ“ ๐ŸŒ -- the GNU diff, diff3, sdiff, and cmp utilities. Their features are a superset of the Unix features and they are significantly faster.

  • dtl-diff-template-library ๐Ÿ“ ๐ŸŒ -- dtl is the diff template library written in C++.

  • google-diff-match-patch ๐Ÿ“ ๐ŸŒ -- Diff, Match and Patch offers robust algorithms to perform the operations required for synchronizing plain text.

    1. Diff:
      • Compare two blocks of plain text and efficiently return a list of differences.
    2. Match:
      • Given a search string, find its best fuzzy match in a block of plain text. Weighted for both accuracy and location.
    3. Patch:
      • Apply a list of patches onto plain text. Use best-effort to apply patch even when the underlying text doesn't match.

    Originally built in 2006 to power Google Docs.
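
For illustration, a character-level Diff can be derived from a classic longest-common-subsequence dynamic program, as in the toy below. Note that diff-match-patch itself uses Myers' O(ND) algorithm plus semantic cleanup, not this quadratic DP.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct Edit { char op; char ch; };  // op: '=' keep, '-' delete, '+' insert

// Build the LCS length table, then walk it to emit edit operations.
std::vector<Edit> diff(const std::string& a, const std::string& b) {
  size_t n = a.size(), m = b.size();
  std::vector<std::vector<int>> lcs(n + 1, std::vector<int>(m + 1, 0));
  for (size_t i = n; i-- > 0;)
    for (size_t j = m; j-- > 0;)
      lcs[i][j] = (a[i] == b[j]) ? lcs[i + 1][j + 1] + 1
                                 : std::max(lcs[i + 1][j], lcs[i][j + 1]);
  std::vector<Edit> out;
  size_t i = 0, j = 0;
  while (i < n && j < m) {
    if (a[i] == b[j]) { out.push_back({'=', a[i]}); ++i; ++j; }
    else if (lcs[i + 1][j] >= lcs[i][j + 1]) out.push_back({'-', a[i++]});
    else out.push_back({'+', b[j++]});
  }
  while (i < n) out.push_back({'-', a[i++]});
  while (j < m) out.push_back({'+', b[j++]});
  return out;
}
```

Replaying the '=' and '-' operations reconstructs the first input; '=' and '+' reconstruct the second, which is exactly the invariant a Patch operation relies on.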

  • HDiffPatch ๐Ÿ“ ๐ŸŒ -- a library and command-line tools for diff & patch between binary files or directories (folders); cross-platform; runs fast; creates small deltas/differentials; supports large files and limits memory requirements when diffing & patching.

  • libdist ๐Ÿ“ ๐ŸŒ -- string distance related functions (Damerau-Levenshtein, Jaro-Winkler, longest common substring & subsequence) implemented as SQLite run-time loadable extension, with UTF-8 support.
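
The Levenshtein core behind such distance functions is a compact dynamic program. Below is a sketch of the plain edit distance (insert/delete/substitute only, byte-wise; libdist's version also handles transpositions for Damerau-Levenshtein and works on UTF-8 code points).

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Two-row Levenshtein DP: cur[j] is the edit distance between the
// first i characters of a and the first j characters of b.
size_t levenshtein(const std::string& a, const std::string& b) {
  std::vector<size_t> prev(b.size() + 1), cur(b.size() + 1);
  for (size_t j = 0; j <= b.size(); ++j) prev[j] = j;  // delete-all row
  for (size_t i = 1; i <= a.size(); ++i) {
    cur[0] = i;
    for (size_t j = 1; j <= b.size(); ++j) {
      size_t sub = prev[j - 1] + (a[i - 1] != b[j - 1]);
      cur[j] = std::min({prev[j] + 1,      // delete from a
                         cur[j - 1] + 1,   // insert into a
                         sub});            // substitute (or match)
    }
    std::swap(prev, cur);
  }
  return prev[b.size()];
}
```

Keeping only two rows gives O(min cost) memory, which matters when such a function runs per-row inside an SQLite query.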

  • libharry ๐Ÿ“ ๐ŸŒ -- Harry - A Tool for Measuring String Similarity. The tool supports several common distance and kernel functions for strings as well as some exotic similarity measures. The focus of Harry lies on implicit similarity measures, that is, comparison functions that do not give rise to an explicit vector space. Examples of such similarity measures are the Levenshtein distance, the Jaro-Winkler distance or the spectrum kernel.

  • open-vcdiff ๐Ÿ“ ๐ŸŒ -- an encoder and decoder for the VCDIFF format, as described in RFC 3284: The VCDIFF Generic Differencing and Compression Data Format.

  • rollinghashcpp ๐Ÿ“ ๐ŸŒ -- randomized rolling hash functions in C++. This is a set of C++ classes implementing various recursive n-gram hashing techniques, also called rolling hashing (http://en.wikipedia.org/wiki/Rolling_hash), including Randomized Karp-Rabin (sometimes called Rabin-Karp), Hashing by Cyclic Polynomials (also known as Buzhash) and Hashing by Irreducible Polynomials.
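
The O(1) sliding-window update that makes these hashes "rolling" looks like this in a Karp-Rabin style sketch (fixed base and modulus here for clarity; rollinghashcpp randomizes the hash functions):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Polynomial hash of an n-byte window: h = sum of s[i] * B^(n-1-i) mod M.
const uint64_t B = 257, M = 1000000007ULL;

uint64_t hashWindow(const std::string& s, size_t pos, size_t len) {
  uint64_t h = 0;
  for (size_t i = 0; i < len; ++i)
    h = (h * B + static_cast<uint8_t>(s[pos + i])) % M;
  return h;
}

// Slide the window one byte right in O(1): subtract the outgoing byte's
// contribution (out * B^(len-1)), then shift in the new byte.
uint64_t roll(uint64_t h, uint8_t out, uint8_t in, uint64_t bPowLen1) {
  h = (h + M - (out * bPowLen1) % M) % M;
  return (h * B + in) % M;
}
```

This constant-cost update is what lets n-gram hashing, content-defined chunking and Rabin-Karp search scan every window position of a long input without rehashing from scratch.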

  • ssdeep ๐Ÿ“ ๐ŸŒ -- fuzzy hashing library, can be used to assist with identifying almost identical files using context triggered piecewise hashing.

  • xdelta ๐Ÿ“ ๐ŸŒ -- a C library and command-line tool for delta compression using VCDIFF/RFC 3284 streams.

  • yara-pattern-matcher ๐Ÿ“ ๐ŸŒ -- for automated and user-specified pattern recognition in custom document & metadata cleaning / processing tasks

fuzzy matching

  • FM-fast-match ๐Ÿ“ ๐ŸŒ -- FAsT-Match: a port of the Fast Affine Template Matching algorithm (Simon Korman, Daniel Reichman, Gilad Tsur, Shai Avidan, CVPR 2013, Portland)

  • fuzzy-match ๐Ÿ“ ๐ŸŒ -- FuzzyMatch-cli is a command-line utility for compiling FuzzyMatch indexes and using them to look up fuzzy matches. Okapi BM25 prefiltering is available on branch bm25.

  • libdist ๐Ÿ“ ๐ŸŒ -- string distance related functions (Damerau-Levenshtein, Jaro-Winkler, longest common substring & subsequence) implemented as SQLite run-time loadable extension, with UTF-8 support.

  • lshbox ๐Ÿ“ ๐ŸŒ -- a C++ Toolbox of Locality-Sensitive Hashing for Large Scale Image Retrieval. Locality-Sensitive Hashing (LSH) is an efficient method for large scale image retrieval, and it achieves great performance in approximate nearest neighborhood searching.

    LSHBOX is a simple but robust C++ toolbox that provides several LSH algorithms; in addition, it can be integrated with Python and MATLAB. The following LSH algorithms have been implemented in LSHBOX:

    • LSH Based on Random Bits Sampling
    • Random Hyperplane Hashing
    • LSH Based on Thresholding
    • LSH Based on p-Stable Distributions
    • Spectral Hashing (SH)
    • Iterative Quantization (ITQ)
    • Double-Bit Quantization Hashing (DBQ)
    • K-means Based Double-Bit Quantization Hashing (KDBQ)
  • pdiff ๐Ÿ“ ๐ŸŒ -- perceptualdiff (pdiff): a program that compares two images using a perceptually based image metric.

  • rollinghashcpp ๐Ÿ“ ๐ŸŒ -- randomized rolling hash functions in C++. This is a set of C++ classes implementing various recursive n-gram hashing techniques, also called rolling hashing (http://en.wikipedia.org/wiki/Rolling_hash), including Randomized Karp-Rabin (sometimes called Rabin-Karp), Hashing by Cyclic Polynomials (also known as Buzhash) and Hashing by Irreducible Polynomials.

  • sdhash ๐Ÿ“ ๐ŸŒ -- a tool which allows two arbitrary blobs of data to be compared for similarity based on common strings of binary data. It is designed to provide quick results during triage and initial investigation phases.

  • ssdeep ๐Ÿ“ ๐ŸŒ -- fuzzy hashing library, can be used to assist with identifying almost identical files using context triggered piecewise hashing.

  • SSIM 📁 🌐 -- the structural similarity index measure (SSIM) is a popular method to predict perceived image quality. Published in April 2004, with over 46,000 Google Scholar citations, it has been re-implemented hundreds, perhaps thousands, of times, and is widely used as a measurement of image quality for image processing algorithms (even in places where it does not make sense, leading to even worse outcomes!). Unfortunately, if you try to reproduce results in papers, or simply grab a few SSIM implementations and compare results, you will soon find that it is (nearly?) impossible to find two implementations that agree, and even harder to find one that agrees with the original from the author. Chris Lomont ran into this issue many times, so he finally decided to write it up once and for all (and provide clear code that matches the original results, hoping to help reverse the mess that is current SSIM). Most of the problems stem from the original implementation being in MATLAB, which not everyone can use; running the same code in open source Octave, which claims to be MATLAB compatible, even returns wrong results! This large and inconsistent variation among SSIM implementations makes it hard to trust or compare published numbers between papers. The original paper doesn't define how to handle color images, nor does it specify what color space the grayscale values represent (linear? gamma compressed?), adding to the inconsistencies in results. The lack of color handling causes some visibly distorted images to be rated as visually perfect by SSIM as published. The write-up demonstrates so many issues when using SSIM with color images that it states "we advise not to use SSIM with color images". All of this is a shame, since the underlying concept works well for the given compute complexity.
A good first step to cleaning up this mess is getting widely used implementations to match the author's results for their published test values, and this requires clearly specifying the algorithm at the computational level, which the authors did not. Chris Lomont explains some of these choices and, most importantly, provides original, MIT licensed, single-file C++ header and single-file C# implementations; each reproduces the original author code better than any other version he has found.

  • ssimulacra2 ๐Ÿ“ ๐ŸŒ -- Structural SIMilarity Unveiling Local And Compression Related Artifacts metric developed by Jon Sneyers. SSIMULACRA 2 is based on the concept of the multi-scale structural similarity index measure (MS-SSIM), computed in a perceptually relevant color space, adding two other (asymmetric) error maps, and aggregating using two different norms.

  • VQMT ๐Ÿ“ ๐ŸŒ -- VQMT (Video Quality Measurement Tool) provides fast implementations of the following objective metrics:

    • MS-SSIM: Multi-Scale Structural Similarity,
    • PSNR: Peak Signal-to-Noise Ratio,
    • PSNR-HVS: Peak Signal-to-Noise Ratio taking into account Contrast Sensitivity Function (CSF),
    • PSNR-HVS-M: Peak Signal-to-Noise Ratio taking into account Contrast Sensitivity Function (CSF) and between-coefficient contrast masking of DCT basis functions.
    • SSIM: Structural Similarity,
    • VIFp: Visual Information Fidelity, pixel domain version

    The above metrics are implemented in C++ with the help of OpenCV and are based on the original Matlab implementations provided by their developers.

  • xor-and-binary-fuse-filter 📁 🌐 -- XOR and Binary Fuse Filter library: Bloom filters are used to quickly check whether an element is part of a set. Xor filters and binary fuse filters are a faster and more concise alternative to Bloom filters. They are also smaller than cuckoo filters, and they are used in production systems.
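
    For context on what these filters improve upon, here is a minimal Bloom filter sketch (hypothetical code, not this library's API): k hash probes set bits in an m-bit array, so lookups may yield false positives but never false negatives.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: k probe bits in an m-bit array.
    Lookups may yield false positives, but never false negatives."""

    def __init__(self, m_bits: int = 1 << 16, k: int = 4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _probes(self, item: bytes):
        # Derive k positions from one SHA-256 digest via double hashing.
        d = hashlib.sha256(item).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item: bytes) -> None:
        for p in self._probes(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._probes(item))

bf = BloomFilter()
bf.add(b"hello")
assert b"hello" in bf  # an added item is always found
```

    Xor and binary fuse filters give the same no-false-negatives guarantee with fewer bits per key, but are built once from the full key set and are immutable afterwards.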

decision trees

  • catboost ๐Ÿ“ ๐ŸŒ -- a fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks. Supports computation on CPU and GPU.
  • decision-tree ๐Ÿ“ ๐ŸŒ -- a decision tree classifier. Decision trees are a simple machine learning algorithm that use a series of features of an observation to create a prediction of a target outcome class.
  • random-forest ๐Ÿ“ ๐ŸŒ -- a Fast C++ Implementation of Random Forests as described in: Leo Breiman. Random Forests. Machine Learning 45(1):5-32, 2001.
  • Sherwood ๐Ÿ“ ๐ŸŒ -- Sherwood: a library for decision forest inference, which was written by Duncan Robertson to accompany the book "A. Criminisi and J. Shotton. Decision Forests: for Computer Vision and Medical Image Analysis. Springer, 2013." The Sherwood library comprises a general purpose, object-oriented software framework for applying decision forests to a wide range of inference problems.
  • treelite ๐Ÿ“ ๐ŸŒ -- Treelite is a universal model exchange and serialization format for decision tree forests. Treelite aims to be a small library that enables other C++ applications to exchange and store decision trees on the disk as well as the network.
  • yggdrasil-decision-forests ๐Ÿ“ ๐ŸŒ -- Yggdrasil Decision Forests (YDF) is a production-grade collection of algorithms for the training, serving, and interpretation of decision forest models. YDF is open-source and is available in C++, command-line interface (CLI), TensorFlow (under the name TensorFlow Decision Forests; TF-DF), JavaScript (inference only), and Go (inference only).

GMM/HMM/kM

Fit patterns, e.g. match & transform a point cloud or image onto a template --> help matching pages against banner templates, etc. as part of the OCR/recognition task.

  • GMM-HMM-kMeans ๐Ÿ“ ๐ŸŒ -- HMM based on KMeans and GMM
  • GMMreg ๐Ÿ“ ๐ŸŒ -- implementations of the robust point set registration framework described in the paper "Robust Point Set Registration Using Gaussian Mixture Models", Bing Jian and Baba C. Vemuri, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33(8), pp. 1633-1645. An earlier conference version of this work, "A Robust Algorithm for Point Set Registration Using Mixture of Gaussians, Bing Jian and Baba C. Vemuri.", appeared in the proceedings of ICCV'05.
  • hmm-scalable 📁 🌐 -- a tool for fitting Hidden Markov Models at scale. In particular, it targets a specific kind of HMM used in education, called the Bayesian Knowledge Tracing (BKT) model.
  • hmm-stoch ๐Ÿ“ ๐ŸŒ -- StochHMM - A Flexible hidden Markov model application and C++ library that implements HMM from simple text files. It implements traditional HMM algorithms in addition to providing additional flexibility. The additional flexibility is achieved by allowing researchers to integrate additional data sources and application code into the HMM framework.
  • liblinear ๐Ÿ“ ๐ŸŒ -- a simple package for solving large-scale regularized linear classification, regression and outlier detection.

graph analysis, graph databases

  • arangodb ๐Ÿ“ ๐ŸŒ -- a scalable open-source multi-model database natively supporting graph, document and search. All supported data models & access patterns can be combined in queries allowing for maximal flexibility.

  • g2o ๐Ÿ“ ๐ŸŒ -- General Graph Optimization (G2O) is a C++ framework for optimizing graph-based nonlinear error functions. g2o has been designed to be easily extensible to a wide range of problems and a new problem typically can be specified in a few lines of code. The current implementation provides solutions to several variants of SLAM and BA.

  • GraphBLAS ๐Ÿ“ ๐ŸŒ -- SuiteSparse:GraphBLAS is a complete implementation of the GraphBLAS standard, which defines a set of sparse matrix operations on an extended algebra of semirings using an almost unlimited variety of operators and types. When applied to sparse adjacency matrices, these algebraic operations are equivalent to computations on graphs. GraphBLAS provides a powerful and expressive framework for creating graph algorithms based on the elegant mathematics of sparse matrix operations on a semiring.
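
    The core idea — graph traversal expressed as matrix operations over a semiring — can be sketched in a few lines (plain Python with dense boolean lists, purely illustrative; GraphBLAS of course uses sparse matrices and user-selectable semirings):

```python
# Illustrative-only sketch of the GraphBLAS idea: one BFS level is a
# vector-matrix product over the boolean (OR, AND) semiring.
def bfs_levels(adj, source):
    """adj[i][j] is True iff there is an edge i -> j; returns each node's depth."""
    n = len(adj)
    level = [-1] * n
    frontier = [i == source for i in range(n)]
    depth = 0
    while any(frontier):
        for i, f in enumerate(frontier):
            if f:
                level[i] = depth
        # next frontier = frontier x A over (OR, AND), masked to unvisited nodes
        frontier = [
            level[j] == -1 and any(frontier[i] and adj[i][j] for i in range(n))
            for j in range(n)
        ]
        depth += 1
    return level

# 0 -> 1 -> 2 and 0 -> 3
A = [[False, True,  False, True ],
     [False, False, True,  False],
     [False, False, False, False],
     [False, False, False, False]]
assert bfs_levels(A, 0) == [0, 1, 2, 1]
```

    Swapping the semiring (e.g. (min, +) instead of (OR, AND)) turns the same skeleton into shortest-path computation, which is the expressiveness the entry above refers to.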

  • graph-coloring ๐Ÿ“ ๐ŸŒ -- a C++ Graph Coloring Package. This project has two primary uses:

    • As an executable for finding the chromatic number for an input graph (in edge list or edge matrix format)
    • As a library for finding the particular coloring of an input graph (represented as a map<string,vector<string>> edge list)
  • graphit ๐Ÿ“ ๐ŸŒ -- a High-Performance Domain Specific Language for Graph Analytics.

  • kahypar ๐Ÿ“ ๐ŸŒ -- KaHyPar (Karlsruhe Hypergraph Partitioning) is a multilevel hypergraph partitioning framework providing direct k-way and recursive bisection based partitioning algorithms that compute solutions of very high quality.

  • libgrape-lite ๐Ÿ“ ๐ŸŒ -- a C++ library from Alibaba for parallel graph processing (GRAPE). It differs from prior systems in its ability to parallelize sequential graph algorithms as a whole by following the PIE programming model from GRAPE. Sequential algorithms can be easily "plugged into" libgrape-lite with only minor changes and get parallelized to handle large graphs efficiently. libgrape-lite is designed to be highly efficient and flexible, to cope with the scale, variety and complexity of real-life graph applications.

  • midas ๐Ÿ“ ๐ŸŒ -- C++ implementation of:

  • ogdf ๐Ÿ“ ๐ŸŒ -- OGDF stands both for Open Graph Drawing Framework (the original name) and Open Graph algorithms and Data structures Framework. OGDF is a self-contained C++ library for graph algorithms, in particular for (but not restricted to) automatic graph drawing. It offers sophisticated algorithms and data structures to use within your own applications or scientific projects.

  • snap ๐Ÿ“ ๐ŸŒ -- Stanford Network Analysis Platform (SNAP) is a general purpose, high performance system for analysis and manipulation of large networks. SNAP scales to massive graphs with hundreds of millions of nodes and billions of edges.

NN, ...

  • aho_corasick 📁 🌐 -- a header-only implementation of the Aho-Corasick pattern search algorithm invented by Alfred V. Aho and Margaret J. Corasick. It is a very efficient dictionary matching algorithm that can locate all search patterns in an input text simultaneously in O(n + m), with space complexity O(m) (where n is the length of the input text, and m is the combined length of the search patterns).
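
    A compact sketch of the automaton (illustrative Python, not this library's C++ interface): build a trie of the patterns, add BFS-computed failure links, then scan the text once.

```python
# Compact Aho-Corasick sketch: trie + failure links, single pass over text.
from collections import deque

def build(patterns):
    goto, fail, out = [{}], [0], [[]]   # node 0 is the trie root
    for pat in patterns:                # 1) insert patterns into a trie
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({}); fail.append(0); out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pat)
    q = deque(goto[0].values())         # 2) BFS to compute failure links
    while q:
        node = q.popleft()
        for ch, nxt in goto[node].items():
            q.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] += out[fail[nxt]]  # inherit matches ending here
    return goto, fail, out

def search(text, patterns):
    goto, fail, out = build(patterns)
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]           # fall back on a mismatch
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return hits

assert search("ushers", ["he", "she", "his", "hers"]) == \
    [(1, "she"), (2, "he"), (2, "hers")]
```

    The failure links are what make the scan linear: a mismatch never rewinds the text, it only falls back to the longest matching suffix state.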

  • A-MNS_TemplateMatching ๐Ÿ“ ๐ŸŒ -- the official code for the PatternRecognition2020 paper: Fast and robust template matching with majority neighbour similarity and annulus projection transformation.

  • arrayfire ๐Ÿ“ ๐ŸŒ -- a general-purpose tensor library that simplifies the process of software development for the parallel architectures found in CPUs, GPUs, and other hardware acceleration devices. The library serves users in every technical computing market.

  • Awesome-Document-Image-Rectification ๐Ÿ“ ๐ŸŒ -- a comprehensive list of awesome document image rectification methods based on deep learning.

  • BehaviorTree.CPP 📁 🌐 -- this C++17 library provides a framework to create BehaviorTrees. It was designed to be flexible, easy to use, reactive and fast. Even though its main use-case is robotics, you can use this library to build AI for games, or to replace Finite State Machines. BehaviorTree.CPP features asynchronous Actions, reactive behaviors, concurrent execution of multiple Actions (orthogonality), and XML-based DSL scripts which can be loaded at run-time, i.e. the morphology of the Trees is not hard-coded.

  • bhtsne--Barnes-Hut-t-SNE ๐Ÿ“ ๐ŸŒ -- Barnes-Hut t-SNE

  • blis ๐Ÿ“ ๐ŸŒ -- BLIS is an award-winning portable software framework for instantiating high-performance BLAS-like dense linear algebra libraries. The framework was designed to isolate essential kernels of computation that, when optimized, immediately enable optimized implementations of most of its commonly used and computationally intensive operations. BLIS is written in ISO C99 and available under a new/modified/3-clause BSD license. While BLIS exports a new BLAS-like API, it also includes a BLAS compatibility layer which gives application developers access to BLIS implementations via traditional BLAS routine calls. An object-based API unique to BLIS is also available.

  • bolt ๐Ÿ“ ๐ŸŒ -- a deep learning library with high performance and heterogeneous flexibility.

  • brown-cluster ๐Ÿ“ ๐ŸŒ -- the Brown hierarchical word clustering algorithm. Runs in $O(N C^2)$, where $N$ is the number of word types and $C$ is the number of clusters. Algorithm by Brown, et al.: Class-Based n-gram Models of Natural Language, http://acl.ldc.upenn.edu/J/J92/J92-4003.pdf

  • caffe ๐Ÿ“ ๐ŸŒ -- a fast deep learning framework made with expression and modularity in mind, developed by Berkeley AI Research (BAIR)/The Berkeley Vision and Learning Center (BVLC).

    • ho-hum; reason: uses google protobuffers, CUDA SDK for the GPU access (at least that's how it looks from the header files reported missing by my compiler). Needs more effort before this can be used in the monolithic production builds.
  • catboost ๐Ÿ“ ๐ŸŒ -- a fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks. Supports computation on CPU and GPU.

  • CNTK ๐Ÿ“ ๐ŸŒ -- the Microsoft Cognitive Toolkit (https://cntk.ai) is a unified deep learning toolkit that describes neural networks as a series of computational steps via a directed graph. In this directed graph, leaf nodes represent input values or network parameters, while other nodes represent matrix operations upon their inputs. CNTK allows users to easily realize and combine popular model types such as feed-forward DNNs, convolutional nets (CNNs), and recurrent networks (RNNs/LSTMs). It implements stochastic gradient descent (SGD, error backpropagation) learning with automatic differentiation and parallelization across multiple GPUs and servers. CNTK has been available under an open-source license since April 2015. It is our hope that the community will take advantage of CNTK to share ideas more quickly through the exchange of open source working code.

  • cppflow ๐Ÿ“ ๐ŸŒ -- run TensorFlow models in c++ without Bazel, without TensorFlow installation and without compiling Tensorflow.

  • CRFpp ๐Ÿ“ ๐ŸŒ -- CRF++ is a simple, customizable, and open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data. CRF++ is designed for generic purpose and will be applied to a variety of NLP tasks, such as Named Entity Recognition, Information Extraction and Text Chunking.

  • crfsuite ๐Ÿ“ ๐ŸŒ -- an implementation of Conditional Random Fields (CRFs) for labeling sequential data.

  • CRFsuite-extended ๐Ÿ“ ๐ŸŒ -- a fork of Naoaki Okazaki's implementation of conditional random fields (CRFs).

  • CurvatureFilter 📁 🌐 -- Curvature filters are efficient solvers for variational models. Traditional solvers, such as gradient descent or the Euler-Lagrange equation, start from the total energy and use a diffusion scheme to carry out the minimization. When the initial condition is the original image, the data-fitting energy always increases while the regularization energy always decreases during the optimization. Thus, the regularization energy must be the dominant part, since the total energy has to decrease. Therefore, curvature filters focus on minimizing the regularization term, whose minimizers are already known. For example, if the regularization is Gaussian curvature, developable surfaces minimize this energy; hence, in a curvature filter, developable surfaces are used to approximate the data. As long as the decrease in the regularization part is larger than the increase in the data-fitting energy, the total energy is reduced.

  • darknet ๐Ÿ“ ๐ŸŒ -- Darknet is an open source neural network framework written in C and CUDA. It is fast, easy to install, and supports CPU and GPU computation.

  • DBoW2 📁 🌐 -- a C++ library for indexing and converting images into a bag-of-words representation. It implements a hierarchical tree for approximating nearest neighbours in the image feature space and creating a visual vocabulary. DBoW2 also implements an image database with inverted and direct files to index images, enabling quick queries and feature comparisons.

  • DBow3 📁 🌐 -- DBoW3 is an improved version of the DBow2 library, an open source C++ library for indexing and converting images into a bag-of-words representation. It implements a hierarchical tree for approximating nearest neighbours in the image feature space and creating a visual vocabulary. DBoW3 also implements an image database with inverted and direct files to index images, enabling quick queries and feature comparisons.

  • deepdetect 📁 🌐 -- DeepDetect (https://www.deepdetect.com/) is a machine learning API and server written in C++11. It makes state-of-the-art machine learning easy to work with and integrate into existing applications. It supports both training and inference, with automatic conversion to embedded platforms via TensorRT (NVidia GPU) and NCNN (ARM CPU). It implements support for supervised and unsupervised deep learning of images, text, time series and other data, with a focus on simplicity and ease of use, testing and integration into existing applications. It supports classification, object detection, segmentation, regression, autoencoders, ... and it relies on external machine learning libraries through a very generic and flexible API.

  • DGM-CRF ๐Ÿ“ ๐ŸŒ -- DGM (Direct Graphical Models) is a cross-platform C++ library implementing various tasks in probabilistic graphical models with pairwise and complete (dense) dependencies. The library aims to be used for the Markov and Conditional Random Fields (MRF / CRF), Markov Chains, Bayesian Networks, etc.

  • DiskANN ๐Ÿ“ ๐ŸŒ -- DiskANN is a suite of scalable, accurate and cost-effective approximate nearest neighbor search algorithms for large-scale vector search that support real-time changes and simple filters.

  • dkm 📁 🌐 -- a generic C++11 k-means clustering implementation. The algorithm is based on Lloyd's algorithm and uses the k-means++ initialization method.
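
    A hedged sketch of the two ingredients named above — Lloyd's iteration with k-means++ seeding (illustrative Python; dkm's C++11 template API differs):

```python
import random

# Sketch of Lloyd's algorithm with k-means++ seeding; names and
# structure are illustrative, not dkm's API.
def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    # k-means++ seeding: later centers are drawn with probability
    # proportional to squared distance from the nearest existing center.
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        d2 = [min(sum((p - c) ** 2 for p, c in zip(pt, ctr)) for ctr in centers)
              for pt in points]
        centers.append(list(rng.choices(points, weights=d2)[0]))
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        labels = [min(range(k),
                      key=lambda j: sum((p - c) ** 2
                                        for p, c in zip(pt, centers[j])))
                  for pt in points]
        # update step: each center moves to the mean of its members
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(pts, 2)
assert labels[0] == labels[1] == labels[2] != labels[3] == labels[4] == labels[5]
```

    The k-means++ seeding is what the entry's "initialization method" refers to: it spreads the initial centers out, which sharply reduces the chance of a bad local optimum.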

  • dlib ๐Ÿ“ ๐ŸŒ -- machine learning algorithms

  • DP_means 📁 🌐 -- Dirichlet Process K-means is a Bayesian non-parametric extension of the K-means algorithm, based on a small variance asymptotics (SVA) approximation of the Dirichlet Process Mixture Model. See B. Kulis and M. Jordan, "Revisiting k-means: New Algorithms via Bayesian Nonparametrics".

  • dynet ๐Ÿ“ ๐ŸŒ -- The Dynamic Neural Network Toolkit. DyNet is a neural network library developed by Carnegie Mellon University and many others. It is written in C++ (with bindings in Python) and is designed to be efficient when run on either CPU or GPU, and to work well with networks that have dynamic structures that change for every training instance. For example, these kinds of networks are particularly important in natural language processing tasks, and DyNet has been used to build state-of-the-art systems for syntactic parsing, machine translation, morphological inflection, and many other application areas.

  • falconn 📁 🌐 -- FALCONN (FAst Lookups of Cosine and Other Nearest Neighbors) is a library with algorithms for the nearest neighbor search problem. The algorithms in FALCONN are based on Locality-Sensitive Hashing (LSH), which is a popular class of methods for nearest neighbor search in high-dimensional spaces. The goal of FALCONN is to provide very efficient and well-tested implementations of LSH-based data structures. Currently, FALCONN supports two LSH families for the cosine similarity: hyperplane LSH and cross polytope LSH. Both hash families are implemented with multi-probe LSH in order to minimize memory usage. Moreover, FALCONN is optimized for both dense and sparse data. Despite being designed for the cosine similarity, FALCONN can often be used for nearest neighbor search under the Euclidean distance or for maximum inner product search.
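
    The hyperplane LSH family mentioned above can be sketched in a few lines (a toy Python illustration; FALCONN's real implementation adds multi-probe lookups and heavy optimization):

```python
import random

# Toy sketch of hyperplane LSH for cosine similarity.
def hyperplane_hash(v, planes):
    """Bucket key = sign pattern of dot products with random hyperplanes."""
    key = 0
    for p in planes:
        dot = sum(a * b for a, b in zip(v, p))
        key = (key << 1) | (dot >= 0)
    return key

rng = random.Random(42)
dim, nbits = 8, 16
planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(nbits)]

v = [rng.gauss(0, 1) for _ in range(dim)]
h = hyperplane_hash(v, planes)
# The key depends only on direction, so it is invariant to positive scaling:
assert h == hyperplane_hash([3 * x for x in v], planes)
# Vectors at a small angular distance usually share most sign bits and so
# land in the same or a nearby bucket -- the "locality-sensitive" property.
```

    Each random hyperplane flips its bit with probability proportional to the angle between two vectors, which is why the bucket key approximates cosine similarity.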

  • fast-kmeans ๐Ÿ“ ๐ŸŒ -- this Fast K-means Clustering Toolkit is a testbed for comparing variants of Lloyd's k-means clustering algorithm. It includes implementations of several algorithms that accelerate the algorithm by avoiding unnecessary distance calculations.

  • fbow 📁 🌐 -- FBOW (Fast Bag of Words) is an extremely optimized version of the DBow2/DBow3 libraries. The library is highly optimized to speed up the Bag of Words creation using AVX, SSE and MMX instructions. When loading a vocabulary, fbow is ~80x faster than DBOW2 (see the tests directory and try it). When transforming an image into a bag of words on machines with AVX instructions, it is ~6.4x faster.

  • ffht 📁 🌐 -- FFHT (Fast Fast Hadamard Transform) is a library that provides a heavily optimized C99 implementation of the Fast Hadamard Transform. FFHT also provides a thin Python wrapper that allows performing the Fast Hadamard Transform on one-dimensional NumPy arrays. The Hadamard Transform is a linear orthogonal map defined on real vectors whose length is a power of two; for the precise definition, see the Wikipedia entry. The Hadamard Transform has recently been used a lot in various machine learning and numerical algorithms. FFHT uses AVX to speed up the computation.
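
    The transform itself is only a few lines; a minimal in-place sketch of what FFHT computes (minus the AVX optimization) for power-of-two lengths:

```python
# Minimal in-place Fast Hadamard Transform sketch; len(a) must be a
# power of two. Runs in O(n log n) via butterfly passes.
def fht(a):
    n, h = len(a), 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a

assert fht([1, 0, 0, 0]) == [1, 1, 1, 1]
# Applying the transform twice scales the input by n (H * H = n * I):
assert fht(fht([1, 2, 3, 4])) == [4, 8, 12, 16]
```
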

  • FFME 📁 🌐 -- key point detection (OpenCV). This method is a SIFT-like one, but specifically designed for egomotion computation. The key idea is that it skips some of the steps SIFT performs, so that it runs faster, at the cost of being less robust against scaling. The good news is that in egomotion estimation scaling is not as critical as in registration applications, where SIFT should be preferred.

  • flann ๐Ÿ“ ๐ŸŒ -- FLANN (Fast Library for Approximate Nearest Neighbors) is a library for performing fast approximate nearest neighbor searches in high dimensional spaces. It contains a collection of algorithms we found to work best for nearest neighbor search and a system for automatically choosing the best algorithm and optimum parameters depending on the dataset.

  • flashlight ๐Ÿ“ ๐ŸŒ -- a fast, flexible machine learning library written entirely in C++ from the Facebook AI Research and the creators of Torch, TensorFlow, Eigen and Deep Speech, with an emphasis on efficiency and scale.

  • flinng ๐Ÿ“ ๐ŸŒ -- Filters to Identify Near-Neighbor Groups (FLINNG) is a near neighbor search algorithm outlined in the paper Practical Near Neighbor Search via Group Testing.

  • gtn ๐Ÿ“ ๐ŸŒ -- GTN (Automatic Differentiation with WFSTs) is a framework for automatic differentiation with weighted finite-state transducers. The goal of GTN is to make adding and experimenting with structure in learning algorithms much simpler. This structure is encoded as weighted automata, either acceptors (WFSAs) or transducers (WFSTs). With gtn you can dynamically construct complex graphs from operations on simpler graphs. Automatic differentiation gives gradients with respect to any input or intermediate graph with a single call to gtn.backward.

  • ikd-Tree 📁 🌐 -- an incremental k-d tree designed for robotic applications. The ikd-Tree incrementally updates a k-d tree with newly arriving points only, leading to much lower computation time than existing static k-d trees. Besides point-wise operations, the ikd-Tree supports several features such as box-wise operations and down-sampling that are practically useful in robotic applications.

  • InferenceHelper 📁 🌐 -- a wrapper for deep learning frameworks, especially for inference. This class provides a common interface to various deep learning frameworks, so that you can use the same application code with each of them.

  • InversePerspectiveMapping ๐Ÿ“ ๐ŸŒ -- C++ class for the computation of plane-to-plane homographies, aka bird's-eye view or IPM, particularly relevant in the field of Advanced Driver Assistance Systems.

  • jpeg2dct ๐Ÿ“ ๐ŸŒ -- Faster Neural Networks Straight from JPEG: jpeg2dct subroutines -- this module is useful for reproducing results presented in the paper Faster Neural Networks Straight from JPEG (ICLR workshop 2018).

  • kann 📁 🌐 -- KANN is a standalone and lightweight library in C for constructing and training small to medium artificial neural networks such as multi-layer perceptrons (MLP), convolutional neural networks (CNN) and recurrent neural networks (RNN), including LSTM and GRU. It implements graph-based reverse-mode automatic differentiation and allows building topologically complex neural networks with recurrence, shared weights and multiple inputs/outputs/costs. In comparison to mainstream deep learning frameworks such as TensorFlow, KANN is not as scalable, but it is close in flexibility, has a much smaller code base and only depends on the standard C library. In comparison to other lightweight frameworks such as tiny-dnn, KANN is still smaller, faster and much more versatile, supporting RNN, VAE and non-standard neural networks that may fail in these lightweight frameworks. KANN can be useful when you want to experiment with small to medium neural networks in C/C++, to deploy not-so-large models without worrying about dependency hell, or to learn the internals of deep learning libraries.

  • K-Medoids-Clustering 📁 🌐 -- K-medoids is a clustering algorithm related to K-means. In contrast to the K-means algorithm, K-medoids chooses actual data points as centers of the clusters. There are eight combinations of initialization, assignment and update algorithms to achieve the best results on a given dataset. The CLARA algorithm approach is also implemented.

  • lapack ๐Ÿ“ ๐ŸŒ -- CBLAS + LAPACK optimized linear algebra libs

  • libahocorasick ๐Ÿ“ ๐ŸŒ -- a fast and memory efficient library for exact or approximate multi-pattern string search meaning that you can find multiple key strings occurrences at once in some input text. The strings "index" can be built ahead of time and saved (as a pickle) to disk to reload and reuse later. The library provides an ahocorasick Python module that you can use as a plain dict-like Trie or convert a Trie to an automaton for efficient Aho-Corasick search.

  • libcluster ๐Ÿ“ ๐ŸŒ -- implements various algorithms with variational Bayes learning procedures and efficient cluster splitting heuristics, including the Variational Dirichlet Process (VDP), the Bayesian Gaussian Mixture Model, the Grouped Mixtures Clustering (GMC) model and more clustering algorithms based on diagonal Gaussian, and Exponential distributions.

  • libdivsufsort ๐Ÿ“ ๐ŸŒ -- a software library that implements a lightweight suffix array construction algorithm.
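
    For orientation, the data structure it builds can be defined in one line of naive Python (illustrative only; sorting full suffixes like this costs O(n² log n), whereas libdivsufsort is reported to build the same array in O(n log n) time with little extra memory):

```python
# Naive suffix array construction, for illustrating what the data
# structure is -- NOT how libdivsufsort builds it.
def suffix_array(s: str):
    """Indices of all suffixes of s, in lexicographic order of the suffixes."""
    return sorted(range(len(s)), key=lambda i: s[i:])

# "banana" suffixes in sorted order: a, ana, anana, banana, na, nana
assert suffix_array("banana") == [5, 3, 1, 0, 4, 2]
```

    Once built, the sorted order lets you locate any substring with binary search, which is why suffix arrays underpin full-text indexing and data-compression tools.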

  • libfann 📁 🌐 -- FANN: Fast Artificial Neural Network Library, a free open source neural network library, which implements multilayer artificial neural networks in C with support for both fully connected and sparsely connected networks. Cross-platform execution in both fixed and floating point is supported. It includes a framework for easy handling of training data sets. It is easy to use, versatile, well documented, and fast.

  • libirwls 📁 🌐 -- LIBIRWLS is an integrated library that incorporates a parallel implementation of the Iterative Re-Weighted Least Squares (IRWLS) procedure, an alternative to quadratic programming (QP), for training of Support Vector Machines (SVMs). Although there are several methods for SVM training, the number of parallel libraries is very small. In particular, this library contains solutions to solve either full or budgeted SVMs making use of shared memory parallelization techniques: (1) a parallel SVM training procedure based on the IRWLS algorithm, (2) a parallel budgeted SVMs solver based on the IRWLS algorithm.

  • libkdtree ๐Ÿ“ ๐ŸŒ -- libkdtree++ is a C++ template container implementation of k-dimensional space sorting, using a kd-tree.
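
    A hypothetical sketch of what a kd-tree does (libkdtree++ is a C++ template container; this Python just illustrates the alternating-axis splits and the branch-and-bound nearest-neighbour search):

```python
# kd-tree sketch: build by splitting on alternating axes; search with
# branch-and-bound pruning.
def build_kdtree(points, depth=0):
    if not points:
        return None
    axis = depth % len(points[0])          # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                 # median point becomes the node
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, target, best=None):
    if node is None:
        return best
    d2 = lambda p: sum((a - b) ** 2 for a, b in zip(p, target))
    if best is None or d2(node["point"]) < d2(best):
        best = node["point"]
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, target, best)
    if diff ** 2 < d2(best):   # the far side may still hold a closer point
        best = nearest(far, target, best)
    return best

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(pts)
assert nearest(tree, (9, 2)) == (8, 1)
```

    The pruning test compares the squared distance to the splitting plane against the best distance so far, which is what makes the average query logarithmic rather than linear.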

  • libmlpp ๐Ÿ“ ๐ŸŒ -- ML++ :: The intent with this machine-learning library is for it to act as a crossroad between low-level developers and machine learning engineers.

  • libsvm ๐Ÿ“ ๐ŸŒ -- a simple, easy-to-use, and efficient software for SVM classification and regression. It solves C-SVM classification, nu-SVM classification, one-class-SVM, epsilon-SVM regression, and nu-SVM regression. It also provides an automatic model selection tool for C-SVM classification.

  • LightGBM ๐Ÿ“ ๐ŸŒ -- LightGBM (Light Gradient Boosting Machine) is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient with the following advantages:

    • Better accuracy.
    • Capable of handling large-scale data.
    • Faster training speed and higher efficiency.
    • Lower memory usage.
    • Support of parallel, distributed, and GPU learning.
  • LMW-tree 📁 🌐 -- LMW-tree: learning m-way tree is a generic template library written in C++ that implements several algorithms that use the m-way nearest neighbor tree structure to store their data. See the related PhD thesis for more details on m-way nn trees. The algorithms are primarily focused on computationally efficient clustering. Clustering is an unsupervised machine learning process that finds interesting patterns in data. It places similar items into clusters and dissimilar items into different clusters. The data structures and algorithms can also be used for nearest neighbor search, supervised learning and other machine learning applications. The package includes EM-tree, K-tree, k-means, TSVQ, repeated k-means, clustering, random projections, random indexing, hashing, and bit signatures. See the related PhD thesis for more details on these algorithms and representations.

  • mace ๐Ÿ“ ๐ŸŒ -- Mobile AI Compute Engine (or MACE for short) is a deep learning inference framework optimized for mobile heterogeneous computing on Android, iOS, Linux and Windows devices. The design focuses on the following

  • mapreduce ๐Ÿ“ ๐ŸŒ -- the MapReduce-MPI (MR-MPI) library. MapReduce is the operation popularized by Google for computing on large distributed data sets. See the Wikipedia entry on MapReduce for an overview of what a MapReduce is. The MR-MPI library is a simple, portable implementation of MapReduce that runs on any serial desktop machine or large parallel machine using MPI message passing.

  • marian ๐Ÿ“ ๐ŸŒ -- an efficient Neural Machine Translation framework written in pure C++ with minimal dependencies.

  • MegEngine ๐Ÿ“ ๐ŸŒ -- MegEngine is a fast, scalable, and user friendly deep learning framework with 3 key features: (1) Unified framework for both training and inference, (2) The lowest hardware requirements and (3) Inference efficiently on all platforms.

  • midas ๐Ÿ“ ๐ŸŒ -- C++ implementation of the MIDAS (Microcluster-Based Detector of Anomalies in Edge Streams) algorithms for detecting anomalies in dynamic graphs.

  • minhash_clustering ๐Ÿ“ ๐ŸŒ -- this program is for clustering protein conserved regions using Min-Wise Independent Hashing. The code uses the MR-MPI library for MapReduce in C/C++ and consists of two major parts.

  • MITIE-nlp ๐Ÿ“ ๐ŸŒ -- provides state-of-the-art information extraction tools. Includes tools for performing named entity extraction and binary relation detection as well as tools for training custom extractors and relation detectors. MITIE is built on top of dlib, a high-performance machine-learning library. MITIE makes use of several state-of-the-art techniques, including distributional word embeddings and Structural Support Vector Machines.

  • MNN ๐Ÿ“ ๐ŸŒ -- a highly efficient and lightweight deep learning framework. It supports inference and training of deep learning models, and has industry-leading performance for inference and training on-device. At present, MNN has been integrated in more than 30 apps of Alibaba Inc., such as Taobao, Tmall, Youku, Dingtalk and Xianyu, covering more than 70 usage scenarios such as live broadcast, short video capture, search recommendation, product search by image, interactive marketing, equity distribution and security risk control. In addition, MNN is also used on embedded devices, such as IoT. Inside Alibaba, MNN works as the basic module of the compute container in the Walle System, the first end-to-end, general-purpose, and large-scale production system for device-cloud collaborative machine learning, which has been published in the top system conference OSDI'22.

  • mrpt ๐Ÿ“ ๐ŸŒ -- MRPT is a lightweight and easy-to-use library for approximate nearest neighbor search with random projection. The index building has an integrated hyperparameter tuning algorithm, so the only hyperparameter required to construct the index is the target recall level! According to our experiments MRPT is one of the fastest libraries for approximate nearest neighbor search.

    In the offline phase of the algorithm MRPT indexes the data with a collection of random projection trees. In the online phase the index structure allows us to answer queries in superior time. A detailed description of the algorithm with the time and space complexities, and the aforementioned comparisons can be found in our article that was published in IEEE International Conference on Big Data 2016.

    The algorithm for automatic hyperparameter tuning is described in detail in our new article that will be presented in Pacific-Asia Conference on Knowledge Discovery and Data Mining 2019 (arxiv preprint).

  • Multicore-TSNE ๐Ÿ“ ๐ŸŒ -- Multicore t-SNE is a multicore modification of Barnes-Hut t-SNE by L. Van der Maaten with Python CFFI-based wrappers. This code also works faster than sklearn.TSNE on 1 core (as of version 0.18).

  • multiverso ๐Ÿ“ ๐ŸŒ -- a parameter server based framework for training machine learning models on big data across large numbers of machines. It is currently a standard C++ library and provides a series of friendly programming interfaces. With it, machine learning researchers and practitioners do not need to worry about system routine issues such as distributed model storage and operation, inter-process and inter-thread communication, multi-threading management, and so on. Instead, they are able to focus on the core machine learning logic: data, model, and training.

  • mxnet ๐Ÿ“ ๐ŸŒ -- Apache MXNet is a deep learning framework designed for both efficiency and flexibility. It allows you to mix symbolic and imperative programming to maximize efficiency and productivity.

  • nanoflann_dbscan ๐Ÿ“ ๐ŸŒ -- a fast C++ implementation of the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm.

  • ncnn ๐Ÿ“ ๐ŸŒ -- high-performance neural network inference computing framework optimized for mobile platforms (i.e. small footprint)

  • NiuTrans.NMT ๐Ÿ“ ๐ŸŒ -- a lightweight and efficient Transformer-based neural machine translation system. Its main features are:

    • Few dependencies. It is implemented with pure C++, and all dependencies are optional.
    • Flexible running modes. The system can run with various systems and devices (Linux vs. Windows, CPUs vs. GPUs, and FP32 vs. FP16, etc.).
    • Framework agnostic. It supports various models trained with other tools, e.g., fairseq models.
    • High efficiency. It is heavily optimized for fast decoding, see our WMT paper for more details.
  • oneDNN ๐Ÿ“ ๐ŸŒ -- oneAPI Deep Neural Network Library (oneDNN) is an open-source cross-platform performance library of basic building blocks for deep learning applications. oneDNN is intended for deep learning applications and framework developers interested in improving application performance on CPUs and GPUs.

  • onnxruntime ๐Ÿ“ ๐ŸŒ -- a cross-platform inference and training machine-learning accelerator. ONNX Runtime inference can enable faster customer experiences and lower costs, supporting models from deep learning frameworks such as PyTorch and TensorFlow/Keras as well as classical machine learning libraries such as scikit-learn, LightGBM, XGBoost, etc. ONNX Runtime is compatible with different hardware, drivers, and operating systems, and provides optimal performance by leveraging hardware accelerators where applicable alongside graph optimizations and transforms.

  • onnxruntime-extensions ๐Ÿ“ ๐ŸŒ -- a library that extends the capability of ONNX models and inference with ONNX Runtime, via ONNX Runtime Custom Operator ABIs. It includes a set of ONNX Runtime Custom Operators to support the common pre- and post-processing operators for vision, text, and NLP models. The basic workflow is to first enhance an ONNX model and then run model inference with ONNX Runtime and the ONNXRuntime-Extensions package.

  • onnxruntime-genai ๐Ÿ“ ๐ŸŒ -- generative AI: run Llama, Phi, Gemma, Mistral with ONNX Runtime. This API gives you an easy, flexible and performant way of running LLMs on any device. It implements the generative AI loop for ONNX models, including pre- and post-processing, inference with ONNX Runtime, logits processing, search and sampling, and KV cache management.

  • OpenBLAS ๐Ÿ“ ๐ŸŒ -- an optimized BLAS (Basic Linear Algebra Subprograms) library based on GotoBLAS2 1.13 BSD version.

  • OpenCL-CTS ๐Ÿ“ ๐ŸŒ -- the OpenCL Conformance Test Suite (CTS) for all versions of the Khronos OpenCL standard.

  • OpenCL-Headers ๐Ÿ“ ๐ŸŒ -- C language headers for the OpenCL API.

  • OpenCL-SDK ๐Ÿ“ ๐ŸŒ -- the Khronos OpenCL SDK. It brings together all the components needed to develop OpenCL applications.

  • OpenFST ๐Ÿ“ ๐ŸŒ -- a library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs). Weighted finite-state transducers are automata where each transition has an input label, an output label, and a weight. The more familiar finite-state acceptor is represented as a transducer with each transition's input and output label equal. Finite-state acceptors are used to represent sets of strings (specifically, regular or rational sets); finite-state transducers are used to represent binary relations between pairs of strings (specifically, rational transductions). The weights can be used to represent the cost of taking a particular transition. FSTs have key applications in speech recognition and synthesis, machine translation, optical character recognition, pattern matching, string processing, machine learning, information extraction and retrieval among others. Often a weighted transducer is used to represent a probabilistic model (e.g., an n-gram model, pronunciation model). FSTs can be optimized by determinization and minimization, models can be applied to hypothesis sets (also represented as automata) or cascaded by finite-state composition, and the best results can be selected by shortest-path algorithms.

  • OpenFST-utils ๐Ÿ“ ๐ŸŒ -- a set of useful programs for manipulating Finite State Transducer with the OpenFst library.

  • OpenNN ๐Ÿ“ ๐ŸŒ -- a software library written in C++ for advanced analytics. It implements neural networks, the most successful machine learning method. The main advantage of OpenNN is its high performance: the library stands out in terms of execution speed and memory allocation. It is constantly optimized and parallelized in order to maximize its efficiency.

  • openvino ๐Ÿ“ ๐ŸŒ -- OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference, including several components: namely [Model Optimizer], [OpenVINO™ Runtime], [Post-Training Optimization Tool], as well as CPU, GPU, GNA, multi-device and heterogeneous plugins to accelerate deep learning inference on Intel® CPUs and Intel® Processor Graphics. It supports pre-trained models from [Open Model Zoo], along with 100+ open-source and public models in popular formats such as TensorFlow, ONNX, PaddlePaddle, MXNet, Caffe and Kaldi.

  • OTB ๐Ÿ“ ๐ŸŒ -- Orfeo ToolBox (OTB) is an open-source project for state-of-the-art remote sensing. Built on the shoulders of the open-source geospatial community, it can process high resolution optical, multispectral and radar images at the terabyte scale. A wide variety of applications are available: from ortho-rectification or pansharpening, all the way to classification, SAR processing, and much more!

  • PaddleClas ๐Ÿ“ ๐ŸŒ -- an image classification and image recognition toolset for industry and academia, helping users train better computer vision models and apply them in real scenarios, based on PaddlePaddle.

  • PaddleDetection ๐Ÿ“ ๐ŸŒ -- a Highly Efficient Development Toolkit for Object Detection based on PaddlePaddle.

  • Paddle-Lite ๐Ÿ“ ๐ŸŒ -- an updated version of Paddle-Mobile, an open-source deep learning framework designed to make it easy to perform inference on mobile, embedded, and IoT devices. It is compatible with PaddlePaddle and pre-trained models from other sources.

  • PaddleNLP ๐Ÿ“ ๐ŸŒ -- an NLP library that is both easy to use and powerful. It aggregates high-quality pretrained models in the industry and provides a plug-and-play development experience, covering a model library for various NLP scenarios. With practical examples from industry practices, PaddleNLP can meet the needs of developers who require flexible customization.

  • PaddleOCR ๐Ÿ“ ๐ŸŒ -- PaddleOCR aims to create multilingual, awesome, leading, and practical OCR tools that help users train better models and apply them into practice.

  • PaddlePaddle ๐Ÿ“ ๐ŸŒ -- the first independent R&D deep learning platform in China. It is an industrial platform with advanced technologies and rich features that cover core deep learning frameworks, basic model libraries, end-to-end development kits, tools & components as well as service platforms. PaddlePaddle originated from industrial practices with dedication and commitments to industrialization. It has been widely adopted by a wide range of sectors including manufacturing, agriculture, and enterprise service, while serving more than 4.7 million developers and 180,000 companies and generating 560,000 models. With such advantages, PaddlePaddle has helped an increasing number of partners commercialize AI.

  • pagerank ๐Ÿ“ ๐ŸŒ -- a pagerank implementation in C++ able to handle very big graphs.

  • pecos ๐Ÿ“ ๐ŸŒ -- PECOS (Predictions for Enormous and Correlated Output Spaces) is a versatile and modular machine learning (ML) framework for fast learning and inference on problems with large output spaces, such as extreme multi-label ranking (XMR) and large-scale retrieval. PECOS' design is intentionally agnostic to the specific nature of the inputs and outputs as it is envisioned to be a general-purpose framework for multiple distinct applications. Given an input, PECOS identifies a small set (10-100) of relevant outputs from amongst an extremely large (~100MM) candidate set and ranks these outputs in terms of relevance.

  • PGM-index ๐Ÿ“ ๐ŸŒ -- the Piecewise Geometric Model index (PGM-index) is a data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes while providing the same worst-case query time guarantees.

  • pico_tree ๐Ÿ“ ๐ŸŒ -- a C++ header only library for fast nearest neighbor searches and range searches using a KdTree.

  • puffinn ๐Ÿ“ ๐ŸŒ -- PUFFINN - Parameterless and Universal Fast FInding of Nearest Neighbors - is an easily configurable library for finding the approximate nearest neighbors of arbitrary points. It also supports the identification of the closest pairs in the dataset. The only necessary parameters are the allowed space usage and the recall. Each near neighbor is guaranteed to be found with the probability given by the recall, regardless of the difficulty of the query. Under the hood PUFFINN uses Locality Sensitive Hashing with an adaptive query mechanism. This means that the algorithm works for any similarity measure where a Locality Sensitive Hash family exists. Currently Cosine similarity is supported using SimHash or cross-polytope LSH and Jaccard similarity is supported using MinHash.

  • pyclustering ๐Ÿ“ ๐ŸŒ -- a Python, C++ data mining library (clustering algorithm, oscillatory networks, neural networks). The library provides Python and C++ implementations (C++ pyclustering library) of each algorithm or model.

  • pyglass ๐Ÿ“ ๐ŸŒ -- a library for fast inference of graph index for approximate similarity search.

    • It is highly performant.
    • No third-party dependencies: it does not rely on OpenBLAS / MKL or any other computing framework.
    • Sophisticated memory management and data structure design, with a very low memory footprint.
    • Supports multiple graph algorithms, like HNSW and NSG.
    • Supports multiple hardware platforms, like X86 and ARM. Support for GPU is on the way.
  • pytorch ๐Ÿ“ ๐ŸŒ -- PyTorch library in C++

  • pytorch_cluster ๐Ÿ“ ๐ŸŒ -- a small extension library of highly optimized graph cluster algorithms for use in PyTorch.

  • pytorch_cpp_demo ๐Ÿ“ ๐ŸŒ -- Deep Learning sample programs of PyTorch written in C++.

  • spherical-k-means ๐Ÿ“ ๐ŸŒ -- the spherical K-means algorithm in Matlab and C++. The C++ version emphasizes a multithreaded implementation and features three ways of running the algorithm. It can be executed with a single thread (same as the Matlab implementation), or using OpenMP or Galois (http://iss.ices.utexas.edu/?p=projects/galois). The purpose of this code is to optimize and compare the different parallel paradigms to maximize the efficiency of the algorithm.

  • ssdeep ๐Ÿ“ ๐ŸŒ -- fuzzy hashing library, can be used to assist with identifying almost identical files using context triggered piecewise hashing.

  • SSIM ๐Ÿ“ ๐ŸŒ -- the structural similarity index measure (SSIM) is a popular method to predict perceived image quality. Published in April 2004, with over 46,000 Google Scholar citations, it has been re-implemented hundreds, perhaps thousands, of times, and is widely used as a measurement of image quality for image processing algorithms (even in places where it does not make sense, leading to even worse outcomes!). Unfortunately, if you try to reproduce results in papers, or simply grab a few SSIM implementations and compare results, you will soon find that it is (nearly?) impossible to find two implementations that agree, and even harder to find one that agrees with the original from the author. Chris Lomont ran into this issue many times, so he finally decided to write it up once and for all (and provide clear code that matches the original results, hoping to help reverse the mess that is current SSIM). Most of the problems come from the original implementation being in MATLAB, which not everyone can use. Running the same code in open source Octave, which claims to be MATLAB compatible, even returns wrong results! This large and inconsistent variation among SSIM implementations makes it hard to trust or compare published numbers between papers. The original paper doesn't define how to handle color images and doesn't specify what color space the grayscale values represent (linear? gamma compressed?), adding to the inconsistent results. The lack of color handling causes some clearly distorted images to be rated as visually perfect by SSIM as published. The write-up demonstrates so many issues when using SSIM with color images that it states "we advise not to use SSIM with color images". All of this is a shame, since the underlying concept works well for the given compute complexity.

    A good first step to cleaning up this mess is trying to get widely used implementations to match the author results for their published test values, and this requires clearly specifying the algorithm at the computational level, which the authors did not. Chris Lomont explains some of these choices, and most importantly, provides original, MIT licensed, single file C++ header and single file C# implementations; each reproduces the original author code better than any other version he has found.

  • ssimulacra2 ๐Ÿ“ ๐ŸŒ -- Structural SIMilarity Unveiling Local And Compression Related Artifacts metric developed by Jon Sneyers. SSIMULACRA 2 is based on the concept of the multi-scale structural similarity index measure (MS-SSIM), computed in a perceptually relevant color space, adding two other (asymmetric) error maps, and aggregating using two different norms.

  • stan ๐Ÿ“ ๐ŸŒ -- Stan is a C++ package providing (1) full Bayesian inference using the No-U-Turn sampler (NUTS), a variant of Hamiltonian Monte Carlo (HMC), (2) approximate Bayesian inference using automatic differentiation variational inference (ADVI), and (3) penalized maximum likelihood estimation (MLE) using L-BFGS optimization. It is built on top of the Stan Math library.

  • stan-math ๐Ÿ“ ๐ŸŒ -- the Stan Math Library is a C++, reverse-mode automatic differentiation library designed to be usable, extensive and extensible, efficient, scalable, stable, portable, and redistributable in order to facilitate the construction and utilization of algorithms that utilize derivatives.

  • StarSpace ๐Ÿ“ ๐ŸŒ -- a general-purpose neural model for efficient learning of entity embeddings for solving a wide variety of problems.

  • tapkee ๐Ÿ“ ๐ŸŒ -- a C++ template library for dimensionality reduction with some bias on spectral methods. Tapkee originated from code developed during GSoC 2011 as part of the Shogun machine learning toolbox. The project aims to provide an efficient and flexible standalone library for dimensionality reduction which can be easily integrated into existing codebases. Tapkee leverages the capabilities of the effective Eigen3 linear algebra library and optionally makes use of the ARPACK eigensolver. The library uses CoverTree and VP-tree data structures to compute nearest neighbors. To achieve greater flexibility, it provides a callback interface which decouples dimension reduction algorithms from the data representation and storage schemes.

  • tensorflow ๐Ÿ“ ๐ŸŒ -- an end-to-end open source platform for machine learning.

  • tensorflow-docs ๐Ÿ“ ๐ŸŒ -- TensorFlow documentation

  • tensorflow-io ๐Ÿ“ ๐ŸŒ -- TensorFlow I/O is a collection of file systems and file formats that are not available in TensorFlow's built-in support. A full list of supported file systems and file formats by TensorFlow I/O can be found here.

  • tensorflow-text ๐Ÿ“ ๐ŸŒ -- TensorFlow Text provides a collection of text related classes and ops ready to use with TensorFlow 2.0. The library can perform the preprocessing regularly required by text-based models, and includes other features useful for sequence modeling not provided by core TensorFlow.

  • tensorstore ๐Ÿ“ ๐ŸŒ -- TensorStore is an open-source C++ and Python software library designed for storage and manipulation of large multi-dimensional arrays.

  • thunderSVM ๐Ÿ“ ๐ŸŒ -- ThunderSVM exploits GPUs and multi-core CPUs to achieve high efficiency, supporting all functionalities of LibSVM such as one-class SVMs, SVC, SVR and probabilistic SVMs.

  • tinn ๐Ÿ“ ๐ŸŒ -- Tinn (Tiny Neural Network) is a 200 line dependency free neural network library written in C99.

  • tiny-dnn ๐Ÿ“ ๐ŸŒ -- a C++14 implementation of deep learning. It is suitable for deep learning on limited computational resource, embedded systems and IoT devices.

  • TNN ๐Ÿ“ ๐ŸŒ -- a high-performance, lightweight neural network inference framework open sourced by Tencent Youtu Lab. It also has many outstanding advantages such as cross-platform support, high performance, model compression, and code tailoring. The TNN framework further strengthens the support and performance optimization of mobile devices on the basis of the original Rapidnet and ncnn frameworks; at the same time, it draws on the high performance and good scalability of the industry's mainstream open-source frameworks and extends support to X86 and NV GPUs. On mobile phones, TNN has been used by many applications such as mobile QQ, Weishi, and Pitu. As a basic acceleration framework for Tencent Cloud AI, TNN has provided acceleration support for the deployment of many services. Contributions towards the further improvement of the TNN inference framework are welcome.

  • vxl ๐Ÿ“ ๐ŸŒ -- VXL (the Vision-something-Libraries) is a collection of C++ libraries designed for computer vision research and implementation. It was created from TargetJr and the IUE with the aim of making a light, fast and consistent system.

  • waifu2x-ncnn-vulkan ๐Ÿ“ ๐ŸŒ -- waifu2x ncnn Vulkan: an ncnn project implementation of the waifu2x converter. Runs fast on Intel / AMD / Nvidia / Apple-Silicon with Vulkan API.

  • warp-ctc ๐Ÿ“ ๐ŸŒ -- A fast parallel implementation of CTC, on both CPU and GPU. Connectionist Temporal Classification (CTC) is a loss function useful for performing supervised learning on sequence data, without needing an alignment between input data and labels. For example, CTC can be used to train end-to-end systems for speech recognition.

  • xnnpack ๐Ÿ“ ๐ŸŒ -- a highly optimized library of floating-point neural network inference operators for ARM, WebAssembly, and x86 platforms. XNNPACK is not intended for direct use by deep learning practitioners and researchers; instead it provides low-level performance primitives for accelerating high-level machine learning frameworks, such as TensorFlow Lite, TensorFlow.js, PyTorch, and MediaPipe.

  • xtensor ๐Ÿ“ ๐ŸŒ -- C++ tensors with broadcasting and lazy computing. xtensor is a C++ library meant for numerical analysis with multi-dimensional array expressions.

  • xtensor-blas ๐Ÿ“ ๐ŸŒ -- an extension to the xtensor library, offering bindings to BLAS and LAPACK libraries through cxxblas and cxxlapack.

  • xtensor-io ๐Ÿ“ ๐ŸŒ -- a xtensor plugin to read and write images, audio files, NumPy (compressed) NPZ and HDF5 files.

  • xtl ๐Ÿ“ ๐ŸŒ -- xtensor core library

  • yara-pattern-matcher ๐Ÿ“ ๐ŸŒ -- for automated and user-specified pattern recognition in custom document & metadata cleaning / processing tasks

  • ZQCNN ๐Ÿ“ ๐ŸŒ -- ZQCNN is an inference framework that can run under Windows, Linux and ARM Linux. It also includes some demos related to face detection and recognition.

similarity search

  • aho_corasick ๐Ÿ“ ๐ŸŒ -- a header only implementation of the Aho-Corasick pattern search algorithm invented by Alfred V. Aho and Margaret J. Corasick. It is a very efficient dictionary matching algorithm that can locate all search patterns in an input text simultaneously in O(n + m), with space complexity O(m) (where n is the length of the input text, and m is the combined length of the search patterns).

  • annoy ๐Ÿ“ ๐ŸŒ -- ANNOY (Approximate Nearest Neighbors Oh Yeah) is a C++ library to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmap-ped into memory so that many processes may share the same data. ANNOY is almost as fast as the fastest libraries, but what really sets it apart is its ability to use static files as indexes, enabling you to share an index across processes. ANNOY also decouples creating indexes from loading them, so you can pass around indexes as files and map them into memory quickly. ANNOY tries to minimize its memory footprint: the indexes are quite small. This is useful when you want to find nearest neighbors using multiple CPUs. Spotify uses ANNOY for music recommendations.

  • brown-cluster ๐Ÿ“ ๐ŸŒ -- the Brown hierarchical word clustering algorithm. Runs in $O(N C^2)$, where $N$ is the number of word types and $C$ is the number of clusters. Algorithm by Brown, et al.: Class-Based n-gram Models of Natural Language, http://acl.ldc.upenn.edu/J/J92/J92-4003.pdf

  • cppsimhash ๐Ÿ“ ๐ŸŒ -- C++ simhash implementation for documents and an additional (prototype) simhash index for text documents. Simhash is a hashing technique that belongs to the LSH (Locality-Sensitive Hashing) algorithmic family. It was initially developed by Moses S. Charikar in 2002 and is described in detail in his paper.

  • CTCWordBeamSearch ๐Ÿ“ ๐ŸŒ -- Connectionist Temporal Classification (CTC) decoder with dictionary and Language Model (LM).

  • DiskANN ๐Ÿ“ ๐ŸŒ -- DiskANN is a suite of scalable, accurate and cost-effective approximate nearest neighbor search algorithms for large-scale vector search that support real-time changes and simple filters.

  • DP_means ๐Ÿ“ ๐ŸŒ -- Dirichlet Process K-means is a Bayesian non-parametric extension of the K-means algorithm, based on a small variance asymptotics (SVA) approximation of the Dirichlet Process Mixture Model. See B. Kulis and M. Jordan, "Revisiting k-means: New Algorithms via Bayesian Nonparametrics".

  • faiss ๐Ÿ“ ๐ŸŒ -- a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU. It is developed primarily at Facebook AI Research.

  • falconn ๐Ÿ“ ๐ŸŒ -- FALCONN (FAst Lookups of Cosine and Other Nearest Neighbors) is a library with algorithms for the nearest neighbor search problem. The algorithms in FALCONN are based on Locality-Sensitive Hashing (LSH), which is a popular class of methods for nearest neighbor search in high-dimensional spaces. The goal of FALCONN is to provide very efficient and well-tested implementations of LSH-based data structures. Currently, FALCONN supports two LSH families for the cosine similarity: hyperplane LSH and cross polytope LSH. Both hash families are implemented with multi-probe LSH in order to minimize memory usage. Moreover, FALCONN is optimized for both dense and sparse data. Despite being designed for the cosine similarity, FALCONN can often be used for nearest neighbor search under the Euclidean distance or a maximum inner product search.

  • flann ๐Ÿ“ ๐ŸŒ -- FLANN (Fast Library for Approximate Nearest Neighbors) is a library for performing fast approximate nearest neighbor searches in high dimensional spaces. It contains a collection of algorithms we found to work best for nearest neighbor search and a system for automatically choosing the best algorithm and optimum parameters depending on the dataset.

  • flinng ๐Ÿ“ ๐ŸŒ -- Filters to Identify Near-Neighbor Groups (FLINNG) is a near neighbor search algorithm outlined in the paper Practical Near Neighbor Search via Group Testing.

  • FM-fast-match ๐Ÿ“ ๐ŸŒ -- FAsT-Match: a port of the Fast Affine Template Matching algorithm (Simon Korman, Daniel Reichman, Gilad Tsur, Shai Avidan, CVPR 2013, Portland)

  • fuzzy-match ๐Ÿ“ ๐ŸŒ -- FuzzyMatch-cli is a command-line utility for compiling FuzzyMatch indexes and using them to look up fuzzy matches. Okapi BM25 prefiltering is available on branch bm25.

  • hnswlib ๐Ÿ“ ๐ŸŒ -- fast approximate nearest neighbor search. Header-only C++ HNSW implementation with python bindings.

  • ikd-Tree ๐Ÿ“ ๐ŸŒ -- an incremental k-d tree designed for robotic applications. The ikd-Tree incrementally updates a k-d tree with new coming points only, leading to much lower computation time than existing static k-d trees. Besides point-wise operations, the ikd-Tree supports several features such as box-wise operations and down-sampling that are practically useful in robotic applications.

  • imagehash ๐Ÿ“ ๐ŸŒ -- an image hashing library written in Python. ImageHash supports Average hashing, Perceptual hashing, Difference hashing, Wavelet hashing, HSV color hashing (colorhash) and Crop-resistant hashing. The image hash algorithms (average, perceptual, difference, wavelet) analyse the image structure on luminance (without color information). The color hash algorithm analyses the color distribution and black & gray fractions (without position information).

  • ivf-hnsw ๐Ÿ“ ๐ŸŒ -- Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors. This is the code for the current state-of-the-art billion-scale nearest neighbor search system presented in the paper: Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors (Dmitry Baranchuk, Artem Babenko, Yury Malkov).

  • kgraph ๐Ÿ“ ๐ŸŒ -- a library for k-nearest neighbor (k-NN) graph construction and online k-NN search using a k-NN Graph as index. KGraph implements heuristic algorithms that are extremely generic and fast. KGraph works on abstract objects. The only assumption it makes is that a similarity score can be computed on any pair of objects, with a user-provided function.

  • K-Medoids-Clustering ๐Ÿ“ ๐ŸŒ -- K-medoids is a clustering algorithm related to K-means. In contrast to the K-means algorithm, K-medoids chooses actual datapoints as the centers of the clusters. There are eight combinations of initialization, assignment and update algorithms to achieve the best results on a given dataset. A CLARA-style algorithmic approach is also implemented.

  • libahocorasick ๐Ÿ“ ๐ŸŒ -- a fast and memory efficient library for exact or approximate multi-pattern string search meaning that you can find multiple key strings occurrences at once in some input text. The strings "index" can be built ahead of time and saved (as a pickle) to disk to reload and reuse later. The library provides an ahocorasick Python module that you can use as a plain dict-like Trie or convert a Trie to an automaton for efficient Aho-Corasick search.

  • libharry ๐Ÿ“ ๐ŸŒ -- Harry - A Tool for Measuring String Similarity. The tool supports several common distance and kernel functions for strings as well as some exotic similarity measures. The focus of Harry lies on implicit similarity measures, that is, comparison functions that do not give rise to an explicit vector space. Examples of such similarity measures are the Levenshtein distance, the Jaro-Winkler distance or the spectrum kernel.

  • libkdtree ๐Ÿ“ ๐ŸŒ -- libkdtree++ is a C++ template container implementation of k-dimensional space sorting, using a kd-tree.

  • libngt-ann ๐Ÿ“ ๐ŸŒ -- Yahoo's Neighborhood Graph and Tree for Indexing High-dimensional Data. NGT provides commands and a library for performing high-speed approximate nearest neighbor searches against a large volume of data (several million to several 10 million items of data) in high dimensional vector data space (several ten to several thousand dimensions).

  • libsptag ๐Ÿ“ ๐ŸŒ -- a library for fast approximate nearest neighbor search. SPTAG (Space Partition Tree And Graph) is a library for large scale vector approximate nearest neighbor search scenario released by Microsoft Research (MSR) and Microsoft Bing.

  • LMW-tree ๐Ÿ“ ๐ŸŒ -- LMW-tree: learning m-way tree is a generic template library written in C++ that implements several algorithms that use the m-way nearest neighbor tree structure to store their data. See the related PhD thesis for more details on m-way nn trees. The algorithms are primarily focused on computationally efficient clustering. Clustering is an unsupervised machine learning process that finds interesting patterns in data. It places similar items into clusters and dissimilar items into different clusters. The data structures and algorithms can also be used for nearest neighbor search, supervised learning and other machine learning applications. The package includes EM-tree, K-tree, k-means, TSVQ, repeated k-means, clustering, random projections, random indexing, hashing, bit signatures. See the related PhD thesis for more details on these algorithms and representations.

  • lshbox ๐Ÿ“ ๐ŸŒ -- a C++ Toolbox of Locality-Sensitive Hashing for Large Scale Image Retrieval. Locality-Sensitive Hashing (LSH) is an efficient method for large scale image retrieval, and it achieves great performance in approximate nearest neighborhood searching.

    LSHBOX is a simple but robust C++ toolbox that provides several LSH algorithms; in addition, it can be integrated with the Python and MATLAB languages. The following LSH algorithms have been implemented in LSHBOX:

    • LSH Based on Random Bits Sampling
    • Random Hyperplane Hashing
    • LSH Based on Thresholding
    • LSH Based on p-Stable Distributions
    • Spectral Hashing (SH)
    • Iterative Quantization (ITQ)
    • Double-Bit Quantization Hashing (DBQ)
    • K-means Based Double-Bit Quantization Hashing (KDBQ)
  • mrpt ๐Ÿ“ ๐ŸŒ -- MRPT is a lightweight and easy-to-use library for approximate nearest neighbor search with random projection. The index building has an integrated hyperparameter tuning algorithm, so the only hyperparameter required to construct the index is the target recall level! According to our experiments MRPT is one of the fastest libraries for approximate nearest neighbor search.

    In the offline phase of the algorithm MRPT indexes the data with a collection of random projection trees. In the online phase the index structure allows us to answer queries in superior time. A detailed description of the algorithm with the time and space complexities, and the aforementioned comparisons can be found in our article that was published in IEEE International Conference on Big Data 2016.

    The algorithm for automatic hyperparameter tuning is described in detail in our new article that will be presented in Pacific-Asia Conference on Knowledge Discovery and Data Mining 2019 (arxiv preprint).

  • n2-kNN ๐Ÿ“ ๐ŸŒ -- N2: Lightweight approximate Nearest Neighbor algorithm library. N2 stands for the two N's in 'approximate Nearest Neighbor algorithm'. Before N2, there have been other great approximate nearest neighbor libraries such as Annoy and NMSLIB. However, each of them has different strengths and weaknesses regarding usability, performance, etc. N2 has been developed aiming to bring together the strengths of existing aKNN libraries and supplement their weaknesses.

  • nanoflann ๐Ÿ“ ๐ŸŒ -- a C++11 header-only library for building KD-Trees of datasets with different topologies: R^2, R^3 (point clouds), SO(2) and SO(3) (2D and 3D rotation groups). No support for approximate NN is provided. This library is a fork of the flann library by Marius Muja and David G. Lowe, and born as a child project of MRPT.

  • nanoflann_dbscan ๐Ÿ“ ๐ŸŒ -- a fast C++ implementation of the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm.

  • nmslib ๐Ÿ“ ๐ŸŒ -- Non-Metric Space Library (NMSLIB) is an efficient cross-platform similarity search library and a toolkit for evaluation of similarity search methods. The core-library does not have any third-party dependencies. It has been gaining popularity recently. In particular, it has become a part of Amazon Elasticsearch Service. The goal of the project is to create an effective and comprehensive toolkit for searching in generic and non-metric spaces. Even though the library contains a variety of metric-space access methods, our main focus is on generic and approximate search methods, in particular, on methods for non-metric spaces. NMSLIB is possibly the first library with a principled support for non-metric space searching.

  • online-hnsw ๐Ÿ“ ๐ŸŒ -- Online HNSW: an implementation of the HNSW index for approximate nearest neighbors search for C++14, that supports incremental insertion and removal of elements.

  • pagerank ๐Ÿ“ ๐ŸŒ -- a pagerank implementation in C++ able to handle very big graphs.

  • pHash ๐Ÿ“ ๐ŸŒ -- the open source perceptual hash library. Potential applications include copyright protection, similarity search for media files, or even digital forensics. For example, YouTube could maintain a database of hashes that have been submitted by the major movie producers of movies to which they hold the copyright. If a user then uploads the same video to YouTube, the hash will be almost identical, and it can be flagged as a possible copyright violation. The audio hash could be used to automatically tag MP3 files with proper ID3 information, while the text hash could be used for plagiarism detection.

  • phash-gpl ๐Ÿ“ ๐ŸŒ -- pHashโ„ข Perceptual Hashing Library is a collection of perceptual hashing algorithms for image, audio, video and text media.

  • pico_tree ๐Ÿ“ ๐ŸŒ -- a C++ header only library for fast nearest neighbor searches and range searches using a KdTree.

  • probminhash ๐Ÿ“ ๐ŸŒ -- a class of Locality-Sensitive Hash algorithms for the (Probability) Jaccard Similarity.

  • pyglass ๐Ÿ“ ๐ŸŒ -- a library for fast inference of graph index for approximate similarity search.

    • It is highly performant.
    • No third-party library dependencies; does not rely on OpenBLAS / MKL or any other computing framework.
    • Sophisticated memory management and data structure design; very low memory footprint.
    • Supports multiple graph algorithms, like HNSW and NSG.
    • Supports multiple hardware platforms, like X86 and ARM. Support for GPU is on the way.
  • sdhash ๐Ÿ“ ๐ŸŒ -- a tool which allows two arbitrary blobs of data to be compared for similarity based on common strings of binary data. It is designed to provide quick results during triage and initial investigation phases.

  • Shifted-Hamming-Distance ๐Ÿ“ ๐ŸŒ -- Shifted Hamming Distance (SHD) is an edit-distance based filter that can quickly check whether the minimum number of edits (including insertions, deletions and substitutions) between two strings is smaller than a user-defined threshold T (the number of allowed edits between the two strings). Testing whether two strings differ by only a small amount is a prevalent operation used in many applications. Perhaps its biggest usage is in DNA or protein mapping, where a short DNA or protein string is compared against an enormous database in order to find similar matches. In such applications, a query string is usually compared against multiple candidate strings in the database. Only candidates that are similar to the query are considered matches and recorded. SHD expands the basic Hamming distance computation, which only detects substitutions, into a full-fledged edit-distance filter, which counts not only substitutions but insertions and deletions as well.

  • simhash-cpp ๐Ÿ“ ๐ŸŒ -- Simhash Near-Duplicate Detection enables the identification of all fingerprints that are nearly identical to a query fingerprint. In this context, a fingerprint is an unsigned 64-bit integer. It also comes with an auxiliary function designed to generate a fingerprint given a char* and a length. This fingerprint is generated with a tokenizer and a hash function (both of which may be provided as template parameters). Using a cyclic hash function, it then performs simhash on a moving window of tokens (as defined by the tokenizer).

  • spherical-k-means ๐Ÿ“ ๐ŸŒ -- the spherical K-means algorithm in Matlab and C++. The C++ version emphasizes a multithreaded implementation and features three ways of running the algorithm. It can be executed with a single-thread (same as the Matlab implementation), or using OpenMP or Galois (http://iss.ices.utexas.edu/?p=projects/galois). The purpose of this code is to optimize and compare the different parallel paradigms to maximize the efficiency of the algorithm.

  • ssdeep ๐Ÿ“ ๐ŸŒ -- fuzzy hashing library, can be used to assist with identifying almost identical files using context triggered piecewise hashing.

  • SSIM ๐Ÿ“ ๐ŸŒ -- the structural similarity index measure (SSIM) is a popular method to predict perceived image quality. Published in April 2004, with over 46,000 Google Scholar citations, it has been re-implemented hundreds, perhaps thousands, of times, and is widely used as a measurement of image quality for image processing algorithms (even in places where it does not make sense, leading to even worse outcomes!). Unfortunately, if you try to reproduce results in papers, or simply grab a few SSIM implementations and compare results, you will soon find that it is (nearly?) impossible to find two implementations that agree, and even harder to find one that agrees with the original from the author. Chris Lomont ran into this issue many times, so he finally decided to write it up once and for all (and provide clear code that matches the original results, hoping to help reverse the mess that is current SSIM). Most of the problems come from the original implementation being in MATLAB, which not everyone can use. Running the same code in open source Octave, which claims to be MATLAB compatible, even returns wrong results! This large and inconsistent variation among SSIM implementations makes it hard to trust or compare published numbers between papers. The original paper doesn't define how to handle color images, and doesn't specify what color space the grayscale values represent (linear? gamma compressed?), adding to the inconsistent results. The lack of color handling causes some visibly distorted image pairs to be rated as visually perfect by SSIM as published. The write-up demonstrates so many issues when using SSIM with color images that it states "we advise not to use SSIM with color images". All of this is a shame, since the underlying concept works well for the given compute complexity.

    A good first step to cleaning up this mess is trying to get widely used implementations to match the author results for their published test values, and this requires clearly specifying the algorithm at the computational level, which the authors did not. Chris Lomont explains some of these choices, and most importantly, provides original, MIT-licensed, single-file C++ header and single-file C# implementations; each reproduces the original author code better than any other version he has found.

  • ssimulacra2 ๐Ÿ“ ๐ŸŒ -- Structural SIMilarity Unveiling Local And Compression Related Artifacts metric developed by Jon Sneyers. SSIMULACRA 2 is based on the concept of the multi-scale structural similarity index measure (MS-SSIM), computed in a perceptually relevant color space, adding two other (asymmetric) error maps, and aggregating using two different norms.

  • tiny-dnn ๐Ÿ“ ๐ŸŒ -- a C++14 implementation of deep learning. It is suitable for deep learning on limited computational resources, embedded systems and IoT devices.

  • tlsh ๐Ÿ“ ๐ŸŒ -- TLSH - Trend Micro Locality Sensitive Hash - is a fuzzy matching library. Given a byte stream with a minimum length of 50 bytes TLSH generates a hash value which can be used for similarity comparisons. Similar objects will have similar hash values which allows for the detection of similar objects by comparing their hash values. Note that the byte stream should have a sufficient amount of complexity. For example, a byte stream of identical bytes will not generate a hash value.

  • usearch ๐Ÿ“ ๐ŸŒ -- smaller & faster Single-File Similarity Search Engine for vectors & texts.

  • VQMT ๐Ÿ“ ๐ŸŒ -- VQMT (Video Quality Measurement Tool) provides fast implementations of the following objective metrics:

    • MS-SSIM: Multi-Scale Structural Similarity,
    • PSNR: Peak Signal-to-Noise Ratio,
    • PSNR-HVS: Peak Signal-to-Noise Ratio taking into account Contrast Sensitivity Function (CSF),
    • PSNR-HVS-M: Peak Signal-to-Noise Ratio taking into account Contrast Sensitivity Function (CSF) and between-coefficient contrast masking of DCT basis functions.
    • SSIM: Structural Similarity,
    • VIFp: Visual Information Fidelity, pixel domain version

    The above metrics are implemented in C++ with the help of OpenCV and are based on the original Matlab implementations provided by their developers.

  • xgboost ๐Ÿ“ ๐ŸŒ -- an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Kubernetes, Hadoop, SGE, MPI, Dask) and can solve problems beyond billions of examples.

text tokenization (as a preprocessing step for LDA et al):

i.e. breaking text into words when you receive a text stream without spaces. Also useful for Asian languages, which don't use spaces, e.g. Chinese.

  • Bi-Sent2Vec ๐Ÿ“ ๐ŸŒ -- provides cross-lingual numerical representations (features) for words, short texts, or sentences, which can be used as input to any machine learning task with applications geared towards cross-lingual word translation, cross-lingual sentence retrieval as well as cross-lingual downstream NLP tasks. The library is a cross-lingual extension of Sent2Vec. Bi-Sent2Vec vectors are also well suited to monolingual tasks as indicated by a marked improvement in the monolingual quality of the word embeddings. (For more details, see paper)

  • BlingFire ๐Ÿ“ ๐ŸŒ -- we are a team at Microsoft called Bling (Beyond Language Understanding), sharing our FInite State machine and REgular expression manipulation library (FIRE). We use Fire for many linguistic operations inside Bing such as Tokenization, Multi-word expression matching, Unknown word-guessing, Stemming / Lemmatization just to mention a few.

    Fire can also be used to improve FastText: see here.

    Bling Fire Tokenizer provides state of the art performance for Natural Language text tokenization.

  • chewing_text_cud ๐Ÿ“ ๐ŸŒ -- a text processing / filtering library for use in NLP/search/content analysis research pipelines.

  • cppjieba ๐Ÿ“ ๐ŸŒ -- the C++ version of the Chinese "Jieba" project:

    • Supports loading a custom user dictionary, using the '|' separator when multipathing or the ';' separator for separate, multiple, dictionaries.
    • Supports 'utf8' encoding.
    • The project comes with a relatively complete unit test, and the stability of the core function Chinese word segmentation (utf8) has been tested by the online environment.
  • fastBPE ๐Ÿ“ ๐ŸŒ -- text tokenization / ngrams

  • fastText ๐Ÿ“ ๐ŸŒ -- fastText is a library for efficient learning of word representations and sentence classification. Includes language detection features.

  • fribidi ๐Ÿ“ ๐ŸŒ -- GNU FriBidi: the Free Implementation of the Unicode Bidirectional Algorithm. One of the missing links stopping the penetration of free software in the Middle East is the lack of support for the Arabic and Hebrew alphabets. In order to have proper Arabic and Hebrew support, the bidi algorithm needs to be implemented. It is our hope that this library will stimulate more free software in the Middle Eastern countries.

  • friso ๐Ÿ“ ๐ŸŒ -- high performance Chinese tokenizer with both GBK and UTF-8 charset support based on MMSEG algorithm.

  • fxt ๐Ÿ“ ๐ŸŒ -- a large scale feature extraction tool for text-based machine learning.

  • koan ๐Ÿ“ ๐ŸŒ -- a word2vec negative sampling implementation with correct CBOW update. kลan only depends on Eigen.

    Although continuous bag of word (CBOW) embeddings can be trained more quickly than skipgram (SG) embeddings, it is a common belief that SG embeddings tend to perform better in practice. This was observed by the original authors of Word2Vec [1] and also in subsequent work [2]. However, we found that popular implementations of word2vec with negative sampling such as word2vec and gensim do not implement the CBOW update correctly, thus potentially leading to misconceptions about the performance of CBOW embeddings when trained correctly.

  • libchewing ๐Ÿ“ ๐ŸŒ -- The Chewing (้…ท้Ÿณ) is an intelligent phonetic input method (Zhuyin/Bopomofo) and is one of the most popular choices for Traditional Chinese users. Chewing was inspired by other proprietary intelligent Zhuyin input methods on Microsoft Windows, namely Wang-Xin by Eten, Microsoft New Zhuyin, and Nature Zhuyin (aka Going).

  • libchopshop ๐Ÿ“ ๐ŸŒ -- NLP/text processing with automated stop word detection and stemmer-based filtering. This library / toolkit is engineered to be able to provide both of the (often more or less disparate) n-gram token streams / vectors required for (1) initializing / training FTS databases, neural nets, etc. and (2) executing effective queries / matches on these engines.

  • libcppjieba ๐Ÿ“ ๐ŸŒ -- source code extracted from the CppJieba project to form a separate project, making it easier to understand and use.

  • libdtm ๐Ÿ“ ๐ŸŒ -- LibDTM (Dynamic Topic Models and the Document Influence Model) implements topics that change over time (Dynamic Topic Models) and a model of how individual documents predict that change. This code is the result of work by David M. Blei and Sean M. Gerrish.

  • libpostal ๐Ÿ“ ๐ŸŒ -- a C library for parsing/normalizing street addresses around the world using statistical NLP and open data. The goal of this project is to understand location-based strings in every language, everywhere.

  • libtextcat ๐Ÿ“ ๐ŸŒ -- text language detection

  • many-stop-words ๐Ÿ“ ๐ŸŒ -- Many Stop Words is a simple Python package that provides a single function for loading sets of stop words for different languages.

  • mecab ๐Ÿ“ ๐ŸŒ -- MeCab (Yet Another Part-of-Speech and Morphological Analyzer) is a high-performance morphological analysis engine, designed to be independent of languages, dictionaries, and corpora, using Conditional Random Fields ([CRF](http://www.cis.upenn.edu/~pereira/papers/crf.pdf)) to estimate the parameters.

  • ngrams-weighted ๐Ÿ“ ๐ŸŒ -- implements the method to compute N-gram IDF weights for all valid word N-grams in the given corpus (document set).

  • sally ๐Ÿ“ ๐ŸŒ -- a Tool for Embedding Strings in Vector Spaces. This mapping is referred to as embedding and allows for applying techniques of machine learning and data mining for analysis of string data. Sally can be applied to several types of strings, such as text documents, DNA sequences or log files, where it can handle common formats such as directories, archives and text files of string data. Sally implements a standard technique for mapping strings to a vector space that can be referred to as generalized bag-of-words model. The strings are characterized by a set of features, where each feature is associated with one dimension of the vector space. The following types of features are supported by Sally: bytes, tokens (words), n-grams of bytes and n-grams of tokens.

  • scws-chinese-word-segmentation ๐Ÿ“ ๐ŸŒ -- SCWS (Simple Chinese Word Segmentation) is a mechanical Chinese word segmentation engine based on a word-frequency dictionary, which can basically correctly segment a whole paragraph of Chinese text into words. A word is the smallest morpheme unit in Chinese, but written Chinese does not separate words with spaces the way English does, so how to segment words accurately and quickly has always been a difficult problem in Chinese word segmentation. Supported Chinese encodings include GBK, UTF-8, etc.

    There are not many innovative elements in the word segmentation algorithm. It uses a word-frequency dictionary collected by the project itself, supplemented by certain proper names, personal names, and place names, with basic segmentation aided by rules for recognizing constructs such as digits and dates. After small-scale testing, the accuracy is between 90% and 95%, which can basically satisfy some uses in small search engines, keyword extraction and other occasions. The first prototype version was released in late 2005.

  • sent2vec ๐Ÿ“ ๐ŸŒ -- a tool and pre-trained models related to the Bi-Sent2vec. The cross-lingual extension of Sent2Vec can be found here. This library provides numerical representations (features) for words, short texts, or sentences, which can be used as input to any machine learning task.

  • sentencepiece ๐Ÿ“ ๐ŸŒ -- text tokenization

  • sentence-tokenizer ๐Ÿ“ ๐ŸŒ -- text tokenization

  • SheenBidi ๐Ÿ“ ๐ŸŒ -- implements the Unicode Bidirectional Algorithm available at http://www.unicode.org/reports/tr9. It is a sophisticated implementation which provides developers with an easy way to use the UBA in their applications.

  • stopwords ๐Ÿ“ ๐ŸŒ -- default English stop words from different sources.

  • ucto ๐Ÿ“ ๐ŸŒ -- text tokenization

    • libfolia ๐Ÿ“ ๐ŸŒ -- working with the Format for Linguistic Annotation (FoLiA). Provides a high-level API to read, manipulate, and create FoLiA documents.
    • uctodata ๐Ÿ“ ๐ŸŒ -- data for ucto library
  • word2vec ๐Ÿ“ ๐ŸŒ -- Word2Vec in C++ 11.

  • word2vec-GloVe ๐Ÿ“ ๐ŸŒ -- an implementation of the GloVe (Global Vectors for Word Representation) model for learning word representations.

  • worde_butcher ๐Ÿ“ ๐ŸŒ -- a tool for text segmentation, keyword extraction and speech tagging. Butchers any text into prime word / phrase cuts, deboning all incoming based on our definitive set of stopwords for all languages.

  • wordfreq ๐Ÿ“ ๐ŸŒ -- wordfreq is a Python library for looking up the frequencies of words in many languages, based on many sources of data.

  • wordfrequency ๐Ÿ“ ๐ŸŒ -- FrequencyWords: Frequency Word List Generator and processed files.

  • you-token-to-me ๐Ÿ“ ๐ŸŒ -- text tokenization

regex matchers (manual edit - pattern recognition)

  • hyperscan ๐Ÿ“ ๐ŸŒ -- Hyperscan is a high-performance multiple regex matching library.

  • libgnurx ๐Ÿ“ ๐ŸŒ -- the POSIX regex functionality from glibc extracted into a separate library, for Win32.

  • libwildmatch ๐Ÿ“ ๐ŸŒ -- wildmatch is a BSD-licensed C/C++ library for git/rsync-style pattern matching.

  • pcre ๐Ÿ“ ๐ŸŒ -- PCRE2 : Perl-Compatible Regular Expressions. The PCRE2 library is a set of C functions that implement regular expression pattern matching using the same syntax and semantics as Perl 5. PCRE2 has its own native API, as well as a set of wrapper functions that correspond to the POSIX regular expression API. It comes in three forms, for processing 8-bit, 16-bit, or 32-bit code units, in either literal or UTF encoding.

  • pdfgrep ๐Ÿ“ ๐ŸŒ -- a tool to search text in PDF files. It works similarly to grep.

  • ragel ๐Ÿ“ ๐ŸŒ -- State Machine Compiler

  • re2 ๐Ÿ“ ๐ŸŒ -- RE2, a regular expression library.

  • re2c ๐Ÿ“ ๐ŸŒ -- a lexer generator for C/C++, Go and Rust. Its main goal is generating fast lexers: at least as fast as their reasonably optimized hand-coded counterparts. Instead of using traditional table-driven approach, re2c encodes the generated finite state automata directly in the form of conditional jumps and comparisons. The resulting programs are faster and often smaller than their table-driven analogues, and they are much easier to debug and understand. re2c applies quite a few optimizations in order to speed up and compress the generated code. Another distinctive feature is its flexible interface: instead of assuming a fixed program template, re2c lets the programmer write most of the interface code and adapt the generated lexer to any particular environment.

  • RE-flex ๐Ÿ“ ๐ŸŒ -- the regex-centric, fast lexical analyzer generator for C++ with full Unicode support. Faster than Flex. Accepts Flex specifications. Generates reusable source code that is easy to understand. Introduces indent/dedent anchors, lazy quantifiers, functions for lex/syntax error reporting and more. Seamlessly integrates with Bison and other parsers.

    The RE/flex matcher tracks line numbers, column numbers, and indentations, whereas Flex does not (option noyylineno) and neither do the other regex matchers (except PCRE2 and Boost.Regex when used with RE/flex). Tracking this information incurs some overhead. RE/flex also automatically decodes UTF-8/16/32 input and accepts std::istream, strings, and wide strings as input.

    RE/flex runs equally fast or slightly faster than the best times of Flex.

  • tre ๐Ÿ“ ๐ŸŒ -- TRE is a lightweight, robust, and efficient POSIX compliant regexp matching library with some exciting features such as approximate (fuzzy) matching. The matching algorithm used in TRE uses linear worst-case time in the length of the text being searched, and quadratic worst-case time in the length of the used regular expression.

  • ugrep ๐Ÿ“ ๐ŸŒ -- search for anything in everything... ultra fast. "grep for arbitrary binary files."

  • yara-pattern-matcher ๐Ÿ“ ๐ŸŒ -- for automated and user-specified pattern recognition in custom document & metadata cleaning / processing tasks

OCR: quality improvements, language detect, ...

  • Awesome-Document-Image-Rectification ๐Ÿ“ ๐ŸŒ -- a comprehensive list of awesome document image rectification methods based on deep learning.

  • Awesome-Image-Quality-Assessment ๐Ÿ“ ๐ŸŒ -- a comprehensive collection of IQA papers, datasets and codes. We also provide PyTorch implementations of mainstream metrics in IQA-PyTorch

  • Capture2Text ๐Ÿ“ ๐ŸŒ -- Linux CLI port of Capture2Text v4.5.1 (Ubuntu) - the OCR results from Capture2Text were generally better than standard Tesseract, so it seemed ideal to make this run on Linux.

  • chewing_text_cud ๐Ÿ“ ๐ŸŒ -- a text processing / filtering library for use in NLP/search/content analysis research pipelines.

  • EasyOCR ๐Ÿ“ ๐ŸŒ -- ready-to-use OCR with 80+ supported languages and all popular writing scripts including: Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.

  • EasyOCR-cpp ๐Ÿ“ ๐ŸŒ -- custom C++ implementation of EasyOCR. This C++ project implements the pre/post processing to run an OCR pipeline consisting of a text detector, CRAFT, and a CRNN-based text recognizer. Unlike the Python EasyOCR, which is API based, this repo provides a set of classes to show how you can integrate OCR in any C++ program for maximum flexibility.

  • fastText ๐Ÿ“ ๐ŸŒ -- fastText is a library for efficient learning of word representations and sentence classification. Includes language detection features.

  • fribidi ๐Ÿ“ ๐ŸŒ -- GNU FriBidi: the Free Implementation of the Unicode Bidirectional Algorithm. One of the missing links stopping the penetration of free software in the Middle East is the lack of support for the Arabic and Hebrew alphabets. In order to have proper Arabic and Hebrew support, the bidi algorithm needs to be implemented. It is our hope that this library will stimulate more free software in the Middle Eastern countries.

  • hunspell ๐Ÿ“ ๐ŸŒ -- a free spell checker and morphological analyzer library and command-line tool, designed for quick and high quality spell checking and correcting for languages with word-level writing system, including languages with rich morphology, complex word compounding and character encoding.

  • hunspell-dictionaries ๐Ÿ“ ๐ŸŒ -- Collection of normalized and installable hunspell dictionaries.

  • hunspell-hyphen ๐Ÿ“ ๐ŸŒ -- hyphenation library to use converted TeX hyphenation patterns with hunspell.

  • IMGUR5K-Handwriting-Dataset ๐Ÿ“ ๐ŸŒ -- the IMGUR5K Handwriting Dataset for OCR/image preprocessing benchmarks.

  • InversePerspectiveMapping ๐Ÿ“ ๐ŸŒ -- C++ class for the computation of plane-to-plane homographies, aka bird's-eye view or IPM, particularly relevant in the field of Advanced Driver Assistance Systems.

  • ipa-dict ๐Ÿ“ ๐ŸŒ -- monolingual wordlists with pronunciation information in IPA. This project aims to provide a series of dictionaries consisting of wordlists with accompanying phonemic pronunciation information in International Phonetic Alphabet (IPA) transcription for as many words as possible in as many languages / dialects / variants as possible. The dictionary data is available in a number of human- and machine-readable formats, in order to make it as useful as possible for various other applications.

  • JamSpell ๐Ÿ“ ๐ŸŒ -- a spell checking library which considers a word's surroundings (context) for better correction (accuracy) and is fast (near 5K words per second).

  • libpinyin ๐Ÿ“ ๐ŸŒ -- the libpinyin project aims to provide the algorithms core for intelligent sentence-based Chinese pinyin input methods.

  • libpostal ๐Ÿ“ ๐ŸŒ -- a C library for parsing/normalizing street addresses around the world using statistical NLP and open data. The goal of this project is to understand location-based strings in every language, everywhere.

  • libtextcat ๐Ÿ“ ๐ŸŒ -- text language detection

  • LSWMS ๐Ÿ“ ๐ŸŒ -- LSWMS (Line Segment detection using Weighted Mean-Shift): line segment detection with OpenCV, originally published by Marcos Nieto Doncel.

  • marian ๐Ÿ“ ๐ŸŒ -- an efficient Neural Machine Translation framework written in pure C++ with minimal dependencies.

  • nuspell ๐Ÿ“ ๐ŸŒ -- a fast and safe spelling checker software program. It is designed for languages with rich morphology and complex word compounding. Nuspell is written in modern C++ and it supports Hunspell dictionaries.

  • ocreval ๐Ÿ“ ๐ŸŒ -- ocreval contains 17 tools for measuring the performance of and experimenting with OCR output. ocreval is a modern port of the ISRI Analytic Tools for OCR Evaluation, with UTF-8 support and other improvements.

  • ocr-evaluation-tools ๐Ÿ“ ๐ŸŒ -- 19 tools for measuring the performance and quality of OCR output.

  • OTB ๐Ÿ“ ๐ŸŒ -- Orfeo ToolBox (OTB) is an open-source project for state-of-the-art remote sensing. Built on the shoulders of the open-source geospatial community, it can process high resolution optical, multispectral and radar images at the terabyte scale. A wide variety of applications are available: from ortho-rectification or pansharpening, all the way to classification, SAR processing, and much more!

  • pinyin ๐Ÿ“ ๐ŸŒ -- pฤซnyฤซn is a tool for converting Chinese characters to pinyin. It can be used for Chinese phonetic notation, sorting, and retrieval.

  • retinex ๐Ÿ“ ๐ŸŒ -- the Retinex algorithm for intrinsic image decomposition. The provided code computes image gradients, and assembles a sparse linear "Ax = b" system. The system is solved using Eigen.

  • scws-chinese-word-segmentation ๐Ÿ“ ๐ŸŒ -- SCWS (Simple Chinese Word Segmentation) is a mechanical Chinese word segmentation engine based on a word-frequency dictionary, which can basically correctly segment a whole paragraph of Chinese text into words. A word is the smallest morpheme unit in Chinese, but written Chinese does not separate words with spaces the way English does, so how to segment words accurately and quickly has always been a difficult problem in Chinese word segmentation. Supported Chinese encodings include GBK, UTF-8, etc.

    There are not many innovative elements in the word segmentation algorithm. It uses a word-frequency dictionary collected by the project itself, supplemented by certain proper names, personal names, and place names, with basic segmentation aided by rules for recognizing constructs such as digits and dates. After small-scale testing, the accuracy is between 90% and 95%, which can basically satisfy some uses in small search engines, keyword extraction and other occasions. The first prototype version was released in late 2005.

  • SheenBidi ๐Ÿ“ ๐ŸŒ -- implements the Unicode Bidirectional Algorithm, available at http://www.unicode.org/reports/tr9. It is a sophisticated implementation which provides developers with an easy way to use UBA in their applications.

  • SymSpell ๐Ÿ“ ๐ŸŒ -- spelling correction & fuzzy search: 1 million times faster through Symmetric Delete spelling correction algorithm. The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent.

  • SymspellCPP ๐Ÿ“ ๐ŸŒ -- a C++ port from https://github.com/wolfgarbe/SymSpell v6.5

  • tesslinesplit ๐Ÿ“ ๐ŸŒ -- a standalone program for using Tesseract's line segmentation algorithm to split up document images.

  • unpaper ๐Ÿ“ ๐ŸŒ -- a post-processing tool for scanned sheets of paper, especially for book pages that have been scanned from previously created photocopies. The main purpose is to make scanned book pages better readable on screen after conversion to PDF. The program also tries to detect misaligned centering and rotation of pages and will automatically straighten each page by rotating it to the correct angle (a.k.a. deskewing).

OCR page image preprocessing, [scanner] tooling: getting the pages to the OCR engine

  • Awesome-Document-Image-Rectification ๐Ÿ“ ๐ŸŒ -- a comprehensive list of awesome document image rectification methods based on deep learning.

  • Awesome-Image-Quality-Assessment ๐Ÿ“ ๐ŸŒ -- a comprehensive collection of IQA papers, datasets and codes. We also provide PyTorch implementations of mainstream metrics in IQA-PyTorch

  • butteraugli ๐Ÿ“ ๐ŸŒ -- a tool for measuring perceived differences between images. Butteraugli is a project that estimates the psychovisual similarity of two images. It gives a score for the images that is reliable in the domain of barely noticeable differences. Butteraugli not only gives a scalar score, but also computes a spatial map of the level of differences. One of the main motivations for this project is the statistical differences in location and density of different color receptors, particularly the low density of blue cones in the fovea. Another motivation comes from more accurate modeling of ganglion cells, particularly the frequency space inhibition.

  • Capture2Text ๐Ÿ“ ๐ŸŒ -- Linux CLI port of Capture2Text v4.5.1 (Ubuntu) - the OCR results from Capture2Text were generally better than standard Tesseract, so it seemed ideal to make this run on Linux.

  • ccv-nnc ๐Ÿ“ ๐ŸŒ -- C-based/Cached/Core Computer Vision Library. A Modern Computer Vision Library.

  • CImg ๐Ÿ“ ๐ŸŒ -- a small C++ toolkit for image processing.

  • colorm ๐Ÿ“ ๐ŸŒ -- ColorM is a C++11 header-only color conversion and manipulation library for CSS colors with an API similar to chroma.js's API.

  • ColorSpace ๐Ÿ“ ๐ŸŒ -- library for converting between color spaces and comparing colors.

  • color-util ๐Ÿ“ ๐ŸŒ -- a header-only C++11 library for handling colors, including color space converters between RGB, XYZ, Lab, etc. and color difference calculators such as CIEDE2000.

  • DocLayNet ๐Ÿ“ ๐ŸŒ -- DocLayNet provides page-by-page layout segmentation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique pages from 6 document categories. It provides several unique features compared to related work such as PubLayNet or DocBank, e.g. Human Annotation: DocLayNet is hand-annotated by well-trained experts, providing a gold-standard in layout segmentation through human recognition and interpretation of each page layout.

  • doxa ๐Ÿ“ ๐ŸŒ -- Δoxa Binarization Framework (ΔBF) is an image binarization framework which focuses primarily on local adaptive thresholding algorithms, aimed at providing the building blocks one might use to advance the state of handwritten manuscript binarization.

    Supported Algorithms:

    • Otsu - "A threshold selection method from gray-level histograms", 1979.
    • Bernsen - "Dynamic thresholding of gray-level images", 1986.
    • Niblack - "An Introduction to Digital Image Processing", 1986.
    • Sauvola - "Adaptive document image binarization", 1999.
    • Wolf - "Extraction and Recognition of Artificial Text in Multimedia Documents", 2003.
    • Gatos - "Adaptive degraded document image binarization", 2005. (Partial)
    • NICK - "Comparison of Niblack inspired Binarization methods for ancient documents", 2009.
    • Su - "Binarization of Historical Document Images Using the Local Maximum and Minimum", 2010.
    • T.R. Singh - "A New local Adaptive Thresholding Technique in Binarization", 2011.
    • Bataineh - "An adaptive local binarization method for document images based on a novel thresholding method and dynamic windows", 2011. (unreproducible)
    • ISauvola - "ISauvola: Improved Sauvola's Algorithm for Document Image Binarization", 2016.
    • WAN - "Binarization of Document Image Using Optimum Threshold Modification", 2018.

    Optimizations:

    • Shafait - "Efficient Implementation of Local Adaptive Thresholding Techniques Using Integral Images", 2008.
    • Petty - An algorithm for efficiently calculating the min and max of a local window. Unpublished, 2019.
    • Chan - "Memory-efficient and fast implementation of local adaptive binarization methods", 2019.

    Performance Metrics:

    • Overall Accuracy
    • F-Measure
    • Peak Signal-To-Noise Ratio (PSNR)
    • Negative Rate Metric (NRM)
    • Matthews Correlation Coefficient (MCC)
    • Distance-Reciprocal Distortion Measure (DRDM) - "An Objective Distortion Measure for Binary Document Images Based on Human Visual Perception", 2002.

    Native Image Support:

    • Portable Any-Map: PBM (P4), 8-bit PGM (P5), PPM (P6), PAM (P7)
  • EasyOCR ๐Ÿ“ ๐ŸŒ -- ready-to-use OCR with 80+ supported languages and all popular writing scripts including: Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.

  • EasyOCR-cpp ๐Ÿ“ ๐ŸŒ -- custom C++ implementation of EasyOCR. This C++ project implements the pre/post-processing to run an OCR pipeline consisting of the CRAFT text detector and a CRNN-based text recognizer. Unlike the Python EasyOCR, which is API based, this repo provides a set of classes to show how you can integrate OCR in any C++ program for maximum flexibility.

  • farver-OKlab ๐Ÿ“ ๐ŸŒ -- provides very fast, vectorised functions for conversion of colours between different colour spaces, colour comparisons (distance between colours), encoding/decoding, and channel manipulation in colour strings.

  • fCWT ๐Ÿ“ ๐ŸŒ -- the fast Continuous Wavelet Transform (fCWT) is a highly optimized C++ library for very fast calculation of the CWT in C++, Matlab, and Python. fCWT has been featured on the January 2022 cover of NATURE Computational Science. In this article, fCWT is compared against eight competitor algorithms, tested on noise resistance and validated on synthetic electroencephalography and in vivo extracellular local field potential data.

  • FFmpeg ๐Ÿ“ ๐ŸŒ -- a collection of libraries and tools to process multimedia content such as audio, video, subtitles and related metadata.

  • gegl ๐Ÿ“ ๐ŸŒ -- GEGL (Generic Graphics Library) is a data flow based image processing framework, providing floating point processing and non-destructive image processing capabilities to GNU Image Manipulation Program and other projects. With GEGL you chain together processing operations to represent the desired image processing pipeline. GEGL provides operations for image loading and storing, color adjustments, GIMP's artistic filters and more forms of image processing. GEGL can be used on the command line with the same syntax that can be used for creating processing flows interactively with text from GIMP using gegl-graph.

  • gmic ๐Ÿ“ ๐ŸŒ -- a Full-Featured Open-Source Framework for Image Processing. It provides several different user interfaces to convert/manipulate/filter/visualize generic image datasets, ranging from 1d scalar signals to 3d+t sequences of multi-spectral volumetric images, hence including 2d color images.

  • gmic-community ๐Ÿ“ ๐ŸŒ -- community contributions for the GMIC Full-Featured Open-Source Framework for Image Processing. It provides several different user interfaces to convert/manipulate/filter/visualize generic image datasets, ranging from 1d scalar signals to 3d+t sequences of multi-spectral volumetric images, hence including 2d color images.

  • graph-coloring ๐Ÿ“ ๐ŸŒ -- a C++ Graph Coloring Package. This project has two primary uses:

    • As an executable for finding the chromatic number for an input graph (in edge list or edge matrix format)
    • As a library for finding the particular coloring of an input graph (represented as a map<string,vector<string>> edge list)
  • GraphicsMagick ๐Ÿ“ ๐ŸŒ -- provides a comprehensive collection of utilities, programming interfaces, and GUIs, to support file format conversion, image processing, and 2D vector rendering. GraphicsMagick is originally based on ImageMagick from ImageMagick Studio (which was originally written by John Cristy at Dupont). The goal of GraphicsMagick is to provide the highest quality product possible while encouraging open and active participation from all interested developers.

  • gtsam ๐Ÿ“ ๐ŸŒ -- Georgia Tech Smoothing and Mapping Library (GTSAM) is a C++ library that implements smoothing and mapping (SAM) in robotics and vision, using Factor Graphs and Bayes Networks as the underlying computing paradigm rather than sparse matrices.

  • guetzli ๐Ÿ“ ๐ŸŒ -- a JPEG encoder that aims for excellent compression density at high visual quality. Guetzli-generated images are typically 20-30% smaller than images of equivalent quality generated by libjpeg. Guetzli generates only sequential (nonprogressive) JPEGs because of the faster decompression speeds they offer.

  • hsluv-c ๐Ÿ“ ๐ŸŒ -- HSLuv (revision 4) is a human-friendly alternative to HSL. HSLuv is very similar to CIELUV, a color space designed for perceptual uniformity based on human experiments. When accessed by polar coordinates, it becomes functionally similar to HSL with a single problem: its chroma component doesn't fit into a specific range. HSLuv extends CIELUV with a new saturation component that allows you to span all the available chroma as a neat percentage.

  • ImageMagick ๐Ÿ“ ๐ŸŒ -- ImageMagickยฎ can create, edit, compose, or convert digital images. It can read and write images in a variety of formats (over 200) including PNG, JPEG, GIF, WebP, HEIC, SVG, PDF, DPX, EXR, and TIFF. ImageMagick can resize, flip, mirror, rotate, distort, shear and transform images, adjust image colors, apply various special effects, or draw text, lines, polygons, ellipses, and Bรฉzier curves.

  • Image-Smoothing-Algorithm-Based-on-Gradient-Analysis ๐Ÿ“ ๐ŸŒ -- the implementation of an image smoothing algorithm that was proposed in this publication. To achieve edge-preserving smoothing, the algorithm filters using two components of the gradient vectors: their magnitudes (or lengths) and their directions. The method discriminates between two types of boundaries in a given neighborhood: regular and irregular ones.

  • IMGUR5K-Handwriting-Dataset ๐Ÿ“ ๐ŸŒ -- the IMGUR5K Handwriting Dataset for OCR/image preprocessing benchmarks.

  • InversePerspectiveMapping ๐Ÿ“ ๐ŸŒ -- C++ class for the computation of plane-to-plane homographies, aka bird's-eye view or IPM, particularly relevant in the field of Advanced Driver Assistance Systems.

  • ITK ๐Ÿ“ ๐ŸŒ -- The Insight Toolkit (ITK) is an open-source, cross-platform toolkit for N-dimensional scientific image processing, segmentation, and registration. Segmentation is the process of identifying and classifying data found in a digitally sampled representation. Typically the sampled representation is an image acquired from such medical instrumentation as CT or MRI scanners. Registration is the task of aligning or developing correspondences between data. For example, in the medical environment, a CT scan may be aligned with a MRI scan in order to combine the information contained in both.

  • jasper ๐Ÿ“ ๐ŸŒ -- JasPer Image Processing/Coding Tool Kit

  • jpeg2dct ๐Ÿ“ ๐ŸŒ -- Faster Neural Networks Straight from JPEG: jpeg2dct subroutines -- this module is useful for reproducing results presented in the paper Faster Neural Networks Straight from JPEG (ICLR workshop 2018).

  • lcms2 ๐Ÿ“ ๐ŸŒ -- lcms2mt is a thread-safe fork of lcms (a.k.a. Little CMS). Little CMS intends to be a small-footprint color management engine, with special focus on accuracy and performance. It uses the International Color Consortium standard (ICC), which is the modern standard for color management. The ICC specification is widely used and is referred to in many international and other de-facto standards. It was approved as an International Standard, ISO 15076-1, in 2005. Little CMS is a full implementation of ICC specification 4.3; it fully supports all kinds of V2 and V4 profiles, including abstract, devicelink and named color profiles.

  • leptonica ๐Ÿ“ ๐ŸŒ -- supports many operations that are useful on images.

    Features:

    • Rasterop (aka bitblt)
    • Affine transforms (scaling, translation, rotation, shear) on images of arbitrary pixel depth
    • Projective and bilinear transforms
    • Binary and grayscale morphology, rank order filters, and convolution
    • Seedfill and connected components
    • Image transformations with changes in pixel depth, both at the same scale and with scale change
    • Pixelwise masking, blending, enhancement, arithmetic ops, etc.

    Documentation:

    • LeptonicaDocsSite ๐Ÿ“ ๐ŸŒ -- unofficial Reference Documentation for the Leptonica image processing library (www.leptonica.org).
    • UnofficialLeptDocs ๐Ÿ“ ๐ŸŒ -- unofficial Sphinx-generated documentation for the Leptonica image processing library.
  • libchiaroscuramente ๐Ÿ“ ๐ŸŒ -- a collection of C/C++ functions (components) to help improving / enhancing your images for various purposes (e.g. helping an OCR engine detect and recognize the text in the page scan image)

  • libdip ๐Ÿ“ ๐ŸŒ -- DIPlib is a C++ library for quantitative image analysis.

  • libimagequant ๐Ÿ“ ๐ŸŒ -- Palette quantization library that powers pngquant and other PNG optimizers. libimagequant converts RGBA images to palette-based 8-bit indexed images, including alpha component. It's ideal for generating tiny PNG images and nice-looking GIFs. Image encoding/decoding isn't handled by the library itself, bring your own encoder.

  • libinsane ๐Ÿ“ ๐ŸŒ -- the library to access scanners on both Linux and Windows.

  • libjpegqs ๐Ÿ“ ๐ŸŒ -- JPEG Quant Smooth tries to recreate the lost precision of DCT coefficients based on the quantization table in the JPEG image. You may not notice jpeg artifacts on the screen without zooming in, but you may notice them after printing. Also, when editing compressed images, artifacts can accumulate; if you run this program before editing, the result will be better.

  • libpano13 ๐Ÿ“ ๐ŸŒ -- the pano13 library, part of the Panorama Tools by Helmut Dersch of the University of Applied Sciences Furtwangen.

  • libpillowfight ๐Ÿ“ ๐ŸŒ -- simple C Library containing various image processing algorithms.

    Available algorithms:

    • ACE (Automatic Color Equalization; Parallelized implementation)

    • Canny edge detection

    • Compare: Compare two images (grayscale) and makes the pixels that are different really visible (red).

    • Gaussian blur

    • Scan borders: Tries to detect the borders of a page in an image coming from a scanner.

    • Sobel operator

    • SWT (Stroke Width Transformation)

    • Unpaper's algorithms

      • Blackfilter
      • Blurfilter
      • Border
      • Grayfilter
      • Masks
      • Noisefilter
  • libprecog ๐Ÿ“ ๐ŸŒ -- PRLib - Pre-Recognition Library. The main aim of the library is to prepare images for OCR (text recognition). Image processing can really help to improve recognition quality.

  • libprecog-data ๐Ÿ“ ๐ŸŒ -- PRLib (a.k.a. libprecog) test data.

  • libprecog-manuals ๐Ÿ“ ๐ŸŒ -- PRLib (a.k.a. libprecog) related papers.

  • libraqm ๐Ÿ“ ๐ŸŒ -- a small library that encapsulates the logic for complex text layout and provides a convenient API.

  • libvips ๐Ÿ“ ๐ŸŒ -- a demand-driven, horizontally threaded image processing library which has around 300 operations covering arithmetic, histograms, convolution, morphological operations, frequency filtering, colour, resampling, statistics and others. It supports a large range of numeric types, from 8-bit int to 128-bit complex. Images can have any number of bands. It supports a good range of image formats, including JPEG, JPEG2000, JPEG-XL, TIFF, PNG, WebP, HEIC, AVIF, FITS, Matlab, OpenEXR, PDF, SVG, HDR, PPM / PGM / PFM, CSV, GIF, Analyze, NIfTI, DeepZoom, and OpenSlide. It can also load images via ImageMagick or GraphicsMagick, letting it work with formats like DICOM.

  • libxbr-standalone ๐Ÿ“ ๐ŸŒ -- this standalone XBR/hqx Library implements the xBR pixel art scaling filter developed by Hyllian, and now also the hqx filter developed by Maxim Stepin. Original source for the xBR implementation: http://git.videolan.org/gitweb.cgi/ffmpeg.git/?p=ffmpeg.git;a=blob;f=libavfilter/vf_xbr.c;h=5c14565b3a03f66f1e0296623dc91373aeac1ed0;hb=HEAD

  • local_adaptive_binarization ๐Ÿ“ ๐ŸŒ -- uses an improved contrast maximization version of Niblack/Sauvola et al's method to binarize document images. It is also able to perform the more classical Niblack as well as Sauvola et al. methods. Details can be found in the ICPR 2002 paper.

  • LSWMS ๐Ÿ“ ๐ŸŒ -- LSWMS (Line Segment detection using Weighted Mean-Shift): line segment detection with OpenCV, originally published by Marcos Nieto Doncel.

  • magsac ๐Ÿ“ ๐ŸŒ -- (MAGSAC++ has been included in OpenCV) the MAGSAC and MAGSAC++ algorithms for robust model fitting without using a single inlier-outlier threshold.

  • oidn-OpenImageDenoise ๐Ÿ“ ๐ŸŒ -- Intelยฎ Open Image Denoise is an open source library of high-performance, high-quality denoising filters for images rendered with ray tracing.

  • olena ๐Ÿ“ ๐ŸŒ -- a platform dedicated to image processing. At the moment it is mainly composed of a C++ library: Milena. This library features many tools to easily perform image processing tasks. Its main characteristic is its genericity: it allows you to write an algorithm once and run it over many kinds of images (gray scale, color, 1D, 2D, 3D, ...).

  • OpenColorIO ๐Ÿ“ ๐ŸŒ -- OpenColorIO (OCIO) is a complete color management solution geared towards motion picture production with an emphasis on visual effects and computer animation. OCIO provides a straightforward and consistent user experience across all supporting applications while allowing for sophisticated back-end configuration options suitable for high-end production usage. OCIO is compatible with the Academy Color Encoding Specification (ACES) and is LUT-format agnostic, supporting many popular formats.

  • OpenCP ๐Ÿ“ ๐ŸŒ -- a library for computational photography.

  • opencv ๐Ÿ“ ๐ŸŒ -- OpenCV: Open Source Computer Vision Library

  • opencv_contrib ๐Ÿ“ ๐ŸŒ -- OpenCV's extra modules. This is where you'll find new, bleeding edge OpenCV module development.

  • opencv_extra ๐Ÿ“ ๐ŸŒ -- extra data for OpenCV: Open Source Computer Vision Library

  • OTB ๐Ÿ“ ๐ŸŒ -- Orfeo ToolBox (OTB) is an open-source project for state-of-the-art remote sensing. Built on the shoulders of the open-source geospatial community, it can process high resolution optical, multispectral and radar images at the terabyte scale. A wide variety of applications are available: from ortho-rectification or pansharpening, all the way to classification, SAR processing, and much more!

  • pdiff ๐Ÿ“ ๐ŸŒ -- perceptualdiff (pdiff): a program that compares two images using a perceptually based image metric.

  • Pillow ๐Ÿ“ ๐ŸŒ -- the friendly PIL (Python Imaging Library) fork by Jeffrey A. Clark (Alex) and contributors. PIL is the Python Imaging Library by Fredrik Lundh and Contributors. This library provides extensive file format support, an efficient internal representation, and fairly powerful image processing capabilities.

  • pillow-resize ๐Ÿ“ ๐ŸŒ -- a C++ port of the resize method from the Pillow Python library. It is written in C++ using OpenCV for matrix support. The main difference with respect to OpenCV's resize method is the use of an anti-aliasing filter, which is missing in OpenCV and whose absence can introduce artifacts, in particular with strong down-sampling.

  • pixman ๐Ÿ“ ๐ŸŒ -- a library that provides low-level pixel manipulation features such as image compositing and trapezoid rasterization.

  • poisson_blend ๐Ÿ“ ๐ŸŒ -- a simple, readable implementation of Poisson Blending, that demonstrates the concepts explained in my article, seamlessly blending a source image and a target image, at some specified pixel location.

  • pylene ๐Ÿ“ ๐ŸŒ -- Pylene is a fork of Olena/Milena, an image processing library targeting genericity and efficiency. It mostly provides Mathematical Morphology building blocks for image processing pipelines.

  • radon-tf ๐Ÿ“ ๐ŸŒ -- simple implementation of the radon transform. Faster when using more than one thread to execute it. No inverse function is provided. CPU implementation only.

  • RandomizedRedundantDCTDenoising ๐Ÿ“ ๐ŸŒ -- demonstrates the paper S. Fujita, N. Fukushima, M. Kimura, and Y. Ishibashi, "Randomized redundant DCT: Efficient denoising by using random subsampling of DCT patches," Proc. Siggraph Asia, Technical Brief, Nov. 2015. In this paper, DCT-based denoising is accelerated by using a randomized algorithm. The DCT is based on the fastest algorithm and is SIMD-vectorized using SSE. Some modifications improve denoising performance in terms of PSNR. The code is 100x faster than OpenCV's implementation (cv::xphoto::dctDenoising) of the paper. Optionally, DHT (discrete Walsh-Hadamard transform) can be used instead of DCT for faster computation.

  • retinex ๐Ÿ“ ๐ŸŒ -- the Retinex algorithm for intrinsic image decomposition. The provided code computes image gradients, and assembles a sparse linear "Ax = b" system. The system is solved using Eigen.

  • rotate ๐Ÿ“ ๐ŸŒ -- provides several classic, commonly used and novel rotation algorithms (aka block swaps), which were documented since around 1981 up to 2021: three novel rotation algorithms were introduced in 2021, notably the trinity rotation.

  • rotate_detection ๐Ÿ“ ๐ŸŒ -- angle rotation detection on scanned documents. Designed for embedding in systems using tesseract OCR. The detection algorithm is based on Rényi entropy.

  • scantailor ๐Ÿ“ ๐ŸŒ -- scantailor_advanced is the ScanTailor version that merges the features of the ScanTailor Featured and ScanTailor Enhanced versions, brings new ones and fixes. ScanTailor is an interactive post-processing tool for scanned pages. It performs operations such as page splitting, deskewing, adding/removing borders, selecting content, ... and many others.

  • scilab ๐Ÿ“ ๐ŸŒ -- Scilab includes hundreds of mathematical functions. It has a high-level programming language allowing access to advanced data structures, 2-D and 3-D graphical functions.

  • simd-imgproc ๐Ÿ“ ๐ŸŒ -- the Simd Library is an image processing and machine learning library designed for C and C++ programmers. It provides many useful high performance algorithms for image processing such as: pixel format conversion, image scaling and filtration, extraction of statistic information from images, motion detection, object detection (HAAR and LBP classifier cascades) and classification, neural network.

    The algorithms are optimized using different SIMD CPU extensions where available. The library supports the following CPU extensions: SSE, AVX, AVX-512 and AMX for x86/x64, VMX (Altivec) and VSX (Power7) for PowerPC (big-endian), NEON for ARM.

  • SSIM ๐Ÿ“ ๐ŸŒ -- the structural similarity index measure (SSIM) is a popular method to predict perceived image quality. Published in April 2004, with over 46,000 Google Scholar citations, it has been re-implemented hundreds, perhaps thousands, of times, and is widely used as a measurement of image quality for image processing algorithms (even in places where it does not make sense, leading to even worse outcomes!). Unfortunately, if you try to reproduce results in papers, or simply grab a few SSIM implementations and compare results, you will soon find that it is (nearly?) impossible to find two implementations that agree, and even harder to find one that agrees with the original from the author.

    Chris Lomont ran into this issue many times, so he finally decided to write it up once and for all, and to provide clear code that matches the original results, hoping to help reverse the mess that is current SSIM. Most of the problems come from the original implementation being in MATLAB, which not everyone can use; running the same code in open-source Octave, which claims to be MATLAB compatible, even returns wrong results! This large and inconsistent variation among SSIM implementations makes it hard to trust or compare published numbers between papers. The original paper doesn't define how to handle color images and doesn't specify what color space the grayscale values represent (linear? gamma compressed?), adding to the inconsistencies in results; as published, SSIM can rate images with obvious color distortions as visually perfect. The write-up demonstrates so many issues when using SSIM with color images that it concludes "we advise not to use SSIM with color images". All of this is a shame, since the underlying concept works well for the given compute complexity.

    A good first step to cleaning up this mess is trying to get widely used implementations to match the author results for their published test values, and this requires clearly specifying the algorithm at the computational level, which the authors did not. Chris Lomont explains some of these choices and, most importantly, provides original, MIT-licensed, single-file C++ header and single-file C# implementations; each reproduces the original author code better than any other version he has found.

  • tesslinesplit ๐Ÿ“ ๐ŸŒ -- a standalone program for using Tesseract's line segmentation algorithm to split up document images.

  • twain_library ๐Ÿ“ ๐ŸŒ -- the DTWAIN Library, Version 5.x, from Dynarithmic Software. DTWAIN is an open source programmer's library that will allow applications to acquire images from TWAIN-enabled devices using a simple Application Programmer's Interface (API).

  • unblending ๐Ÿ“ ๐ŸŒ -- a C++ library for decomposing a target image into a set of semi-transparent layers associated with advanced color-blend modes (e.g., "multiply" and "color-dodge"). Output layers can be imported to Adobe Photoshop, Adobe After Effects, GIMP, Krita, etc. and are useful for performing complex edits that are otherwise difficult.

  • unpaper ๐Ÿ“ ๐ŸŒ -- a post-processing tool for scanned sheets of paper, especially for book pages that have been scanned from previously created photocopies. The main purpose is to make scanned book pages better readable on screen after conversion to PDF. The program also tries to detect misaligned centering and rotation of pages and will automatically straighten each page by rotating it to the correct angle (a.k.a. deskewing).

  • vivid ๐Ÿ“ ๐ŸŒ -- vivid is a simple-to-use C++ color library.

  • VQMT ๐Ÿ“ ๐ŸŒ -- VQMT (Video Quality Measurement Tool) provides fast implementations of the following objective metrics:

    • MS-SSIM: Multi-Scale Structural Similarity,
    • PSNR: Peak Signal-to-Noise Ratio,
    • PSNR-HVS: Peak Signal-to-Noise Ratio taking into account Contrast Sensitivity Function (CSF),
    • PSNR-HVS-M: Peak Signal-to-Noise Ratio taking into account Contrast Sensitivity Function (CSF) and between-coefficient contrast masking of DCT basis functions.
    • SSIM: Structural Similarity,
    • VIFp: Visual Information Fidelity, pixel domain version

    The above metrics are implemented in C++ with the help of OpenCV and are based on the original Matlab implementations provided by their developers.

  • wavelib ๐Ÿ“ ๐ŸŒ -- C implementation of Discrete Wavelet Transform (DWT,SWT and MODWT), Continuous Wavelet transform (CWT) and Discrete Packet Transform ( Full Tree Decomposition and Best Basis DWPT).

  • wdenoise ๐Ÿ“ ๐ŸŒ -- Wavelet Denoising in ANSI C using empirical bayes thresholding and a host of other thresholding methods.

  • xbrzscale ๐Ÿ“ ๐ŸŒ -- xBRZ upscaling commandline tool. This tool allows you to scale your graphics with xBRZ algorithm, see https://en.wikipedia.org/wiki/Pixel-art_scaling_algorithms#xBR_family

image export, image / [scanned] document import

  • brunsli ๐Ÿ“ ๐ŸŒ -- a lossless JPEG repacking library. Brunsli allows for a 22% decrease in file size while allowing the original JPEG to be recovered byte-by-byte.

  • CImg ๐Ÿ“ ๐ŸŒ -- a small C++ toolkit for image processing.

  • CxImage ๐Ÿ“ ๐ŸŒ -- venerated library for reading and creating many image file formats.

  • FFmpeg ๐Ÿ“ ๐ŸŒ -- a collection of libraries and tools to process multimedia content such as audio, video, subtitles and related metadata.

  • fpng ๐Ÿ“ ๐ŸŒ -- a very fast C++ .PNG image reader/writer for 24/32bpp images. fpng was written to see just how fast you can write .PNG's without sacrificing too much compression. The files written by fpng conform to the PNG standard, are readable using any PNG decoder, and load or validate successfully using libpng, wuffs, lodepng, stb_image, and pngcheck. PNG files written using fpng can also be read using fpng faster than other PNG libraries, due to its explicit use of Length-Limited Prefix Codes and an optimized decoder that exploits the properties of these codes.

  • giflib-turbo ๐Ÿ“ ๐ŸŒ -- GIFLIB-Turbo is a faster drop-in replacement for GIFLIB. The original GIF codecs were written for a much different world and took great pains to use as little memory as possible and to accommodate a slow and unreliable input stream of data. Those constraints are no longer a problem for the vast majority of users and they were hurting the performance. Another feature holding back the performance of the original GIFLIB was that the original codec was designed to work with image data a line at a time and used a separate LZW dictionary to manage the strings of repeating symbols. The GIFLIB-Turbo codec uses the output image as the dictionary; this allows much faster 'unwinding' of the codes since they are all stored in the right direction to just be copied to the new location.

  • grok-jpeg2000 ๐Ÿ“ ๐ŸŒ -- World's Leading Open Source JPEG 2000 Codec

    Features:

    • support for new High Throughput JPEG 2000 (HTJ2K) standard
    • fast random-access sub-image decoding using TLM and PLT markers
    • full encode/decode support for ICC colour profiles
    • full encode/decode support for XML,IPTC, XMP and EXIF meta-data
    • full encode/decode support for monochrome, sRGB, palette, YCC, extended YCC, CIELab and CMYK colour spaces
    • full encode/decode support for JPEG,PNG,BMP,TIFF,RAW,PNM and PAM image formats
    • full encode/decode support for 1-16 bit precision images
  • guetzli ๐Ÿ“ ๐ŸŒ -- a JPEG encoder that aims for excellent compression density at high visual quality. Guetzli-generated images are typically 20-30% smaller than images of equivalent quality generated by libjpeg. Guetzli generates only sequential (nonprogressive) JPEGs because of the faster decompression speeds they offer.

  • icer_compression ๐Ÿ“ ๐ŸŒ -- implements the NASA ICER image compression algorithm as a C library. ICER is a progressive, wavelet-based image compression algorithm designed to be resistant to data loss, making it suitable for encoding images transmitted over unreliable delivery channels, such as those in satellite radio communications.

  • jbig2dec ๐Ÿ“ ๐ŸŒ -- a decoder library and example utility implementing the JBIG2 bi-level image compression spec. Also known as ITU T.88 and ISO IEC 14492, and included by reference in Adobe's PDF version 1.4 and later.

  • jbig2enc ๐Ÿ“ ๐ŸŒ -- an encoder for JBIG2. JBIG2 encodes bi-level (1 bpp) images using a number of clever tricks to get better compression than G4. This encoder can:

    • Generate JBIG2 files, or fragments for embedding in PDFs
    • Perform generic region encoding
    • Perform symbol extraction, classification and text region coding
    • Perform refinement coding, and
    • Compress multipage documents

    It uses the Leptonica library.

  • jpeginfo ๐Ÿ“ ๐ŸŒ -- prints information and tests integrity of JPEG/JFIF files.

  • JPEG-XL ๐Ÿ“ ๐ŸŒ -- JPEG XL reference implementation (encoder and decoder), called libjxl. JPEG XL was standardized in 2022 as ISO/IEC 18181. The core codestream is specified in 18181-1, the file format in 18181-2. Decoder conformance is defined in 18181-3, and 18181-4 is the reference software.

  • knusperli ๐Ÿ“ ๐ŸŒ -- Knusperli reduces blocking artifacts in decoded JPEG images by interpreting quantized DCT coefficients in the image data as an interval, rather than a fixed value, and choosing the value from that interval that minimizes discontinuities at block boundaries.

  • lerc ๐Ÿ“ ๐ŸŒ -- LERC (Limited Error Raster Compression) is an open-source image or raster format which supports rapid encoding and decoding for any pixel type (not just RGB or Byte). Users set the maximum compression error per pixel while encoding, so the precision of the original input image is preserved (within user defined error bounds).
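
    The limited-error idea can be illustrated with a tiny sketch (this is not LERC's actual on-disk format, which adds block partitioning and bit-packing; the `quantize`/`dequantize` names are illustrative): quantizing with a step of 2 × maxError guarantees the per-value reconstruction error stays within the user's bound.

    ```cpp
    #include <cmath>

    // Conceptual sketch of limited-error quantization (not the real LERC codec):
    // with step = 2 * maxError, |v - dequantize(quantize(v))| <= maxError,
    // because rounding to the nearest step costs at most half a step.
    long quantize(double v, double maxError) {
        return std::lround(v / (2.0 * maxError));
    }

    double dequantize(long q, double maxError) {
        return static_cast<double>(q) * 2.0 * maxError;
    }
    ```

    Storing the quantized integers (which compress well) instead of the raw floats is what makes the "user-defined error bound" trade-off possible.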

  • libaom ๐Ÿ“ ๐ŸŒ -- AV1 Codec Library

  • libavif ๐Ÿ“ ๐ŸŒ -- a friendly, portable C implementation of the AV1 Image File Format, as described here: https://aomediacodec.github.io/av1-avif/

  • libde265 ๐Ÿ“ ๐ŸŒ -- libde265 is an open source implementation of the h.265 video codec. It is written from scratch and has a plain C API to enable a simple integration into other software. libde265 supports WPP and tile-based multithreading and includes SSE optimizations. The decoder includes all features of the Main profile and correctly decodes almost all conformance streams (see [wiki page]).

  • libgd ๐Ÿ“ ๐ŸŒ -- GD is a library for the dynamic creation of images by programmers. GD has support for: WebP, JPEG, PNG, AVIF, HEIF, TIFF, BMP, GIF, TGA, WBMP, XPM.

  • libgif ๐Ÿ“ ๐ŸŒ -- a library for manipulating GIF files.

  • libheif ๐Ÿ“ ๐ŸŒ -- High Efficiency Image File Format (HEIF) :: a visual media container format standardized by the Moving Picture Experts Group (MPEG) for storage and sharing of images and image sequences. It is based on the well-known ISO Base Media File Format (ISOBMFF) standard. The HEIF Reader/Writer Engine is an implementation of the HEIF standard that demonstrates its powerful features and capabilities.

  • libheif-alt ๐Ÿ“ ๐ŸŒ -- an ISO/IEC 23008-12:2017 HEIF and AVIF (AV1 Image File Format) file format decoder and encoder. HEIF and AVIF are new image file formats employing HEVC (h.265) or AV1 image coding, respectively, for the best compression ratios currently possible.

  • libjpeg ๐Ÿ“ ๐ŸŒ -- the Independent JPEG Group's JPEG software

  • libjpeg-turbo ๐Ÿ“ ๐ŸŒ -- a JPEG image codec that uses SIMD instructions to accelerate baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and MIPS systems, as well as progressive JPEG compression on x86, x86-64, and Arm systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg, all else being equal. On other types of systems, libjpeg-turbo can still outperform libjpeg by a significant amount, by virtue of its highly-optimized Huffman coding routines. In many cases, the performance of libjpeg-turbo rivals that of proprietary high-speed JPEG codecs.

  • libkra ๐Ÿ“ ๐ŸŒ -- a C++ library for importing Krita's KRA & KRZ formatted documents.

  • libpng ๐Ÿ“ ๐ŸŒ -- LIBPNG: Portable Network Graphics support, official libpng repository.

  • libtiff ๐Ÿ“ ๐ŸŒ -- TIFF Software Distribution

  • libwebp ๐Ÿ“ ๐ŸŒ -- a library to encode and decode images in WebP format.

  • mozjpeg ๐Ÿ“ ๐ŸŒ -- the Mozilla JPEG Encoder Project improves JPEG compression efficiency achieving higher visual quality and smaller file sizes at the same time. It is compatible with the JPEG standard, and the vast majority of the world's deployed JPEG decoders. MozJPEG is a patch for libjpeg-turbo.

  • OpenEXR ๐Ÿ“ ๐ŸŒ -- a high dynamic-range (HDR) image file format developed by Industrial Light & Magic (ILM) for use in computer imaging applications. OpenEXR is a lossless format for multi-layered images. Professional use. (I've used it before; nice file format.)

  • openexr-images ๐Ÿ“ ๐ŸŒ -- collection of images associated with the OpenEXR distribution.

  • OpenImageIO ๐Ÿ“ ๐ŸŒ -- Reading, writing, and processing images in a wide variety of file formats, using a format-agnostic API, aimed at VFX applications.

    Also includes:

    • an ImageCache class that transparently manages a cache so that it can access truly vast amounts of image data (tens of thousands of image files totaling multiple TB) very efficiently using only a tiny amount (tens of megabytes at most) of runtime memory.
    • ImageBuf and ImageBufAlgo functions, which constitute a simple class for storing and manipulating whole images in memory, plus a collection of the most useful computations you might want to do involving those images, including many image processing operations.

    The primary target audience for OIIO is VFX studios and developers of tools such as renderers, compositors, viewers, and other image-related software you'd find in a production pipeline.

  • openjpeg ๐Ÿ“ ๐ŸŒ -- OPENJPEG Library and Applications -- OpenJPEG is an open-source JPEG 2000 codec written in the C language. It has been developed in order to promote the use of JPEG 2000, a still-image compression standard from the Joint Photographic Experts Group (JPEG). Since April 2015, it has been officially recognized by ISO/IEC and ITU-T as a JPEG 2000 Reference Software.

  • pdiff ๐Ÿ“ ๐ŸŒ -- perceptualdiff (pdiff): a program that compares two images using a perceptually based image metric.

  • pmt-png-tools ๐Ÿ“ ๐ŸŒ -- pngcrush and other PNG and MNG tools

  • psd_sdk ๐Ÿ“ ๐ŸŒ -- a C++ library that directly reads Photoshop PSD files. The library supports:

    • Groups
    • Nested layers
    • Smart Objects
    • User and vector masks
    • Transparency masks and additional alpha channels
    • 8-bit, 16-bit, and 32-bit data in grayscale and RGB color mode
    • All compression types known to Photoshop

    Additionally, limited export functionality is also supported.

  • qoi ๐Ÿ“ ๐ŸŒ -- QOI: the “Quite OK Image Format” for fast, lossless image compression; a single-file MIT-licensed library for C/C++. Compared to stb_image and stb_image_write, QOI offers 20x-50x faster encoding, 3x-4x faster decoding and 20% better compression. It's also stupidly simple and fits in about 300 lines of C.

  • SFML ๐Ÿ“ ๐ŸŒ -- Simple and Fast Multimedia Library (SFML) is a simple, fast, cross-platform and object-oriented multimedia API. It provides access to windowing, graphics, audio and network.

  • tinyexr ๐Ÿ“ ๐ŸŒ -- Tiny OpenEXR: tinyexr is a small, single header-only library to load and save OpenEXR (.exr) images.

  • twain_library ๐Ÿ“ ๐ŸŒ -- the DTWAIN Library, Version 5.x, from Dynarithmic Software. DTWAIN is an open source programmer's library that will allow applications to acquire images from TWAIN-enabled devices using a simple Application Programmer's Interface (API).

  • Imath ๐ŸŒ -- float16 support lib for OpenEXR format

    • optional; reason: considered overkill for the projects I'm currently involved in, including Qiqqa. Those can use Apache Tika, ImageMagick or other thirdparty pipelines to convert to & from supported formats.
  • OpenImageIO ๐ŸŒ -- a library for reading, writing, and processing images in a wide variety of file formats, using a format-agnostic API, aimed at VFX applications.

    • tentative/pending; reason: considered nice & cool but still overkill. Qiqqa tooling can use Apache Tika, ImageMagick or other thirdparty pipelines to convert to & from supported formats.
  • cgohlke::imagecodecs ๐ŸŒ (not included; see also DICOM slot above)

  • DICOM to NIfTI (not included; see also DICOM slot above)

  • GDCM-Grassroots-DICOM ๐ŸŒ

    • removed; reason: not a frequently used format; the filter codes can be found in other libraries. Overkill. Qiqqa tooling can use Apache Tika, ImageMagick or other thirdparty pipelines to convert to & from supported formats.

Monte Carlo simulations, LDA, keyword inference/extraction, etc.

  • ceres-solver ๐Ÿ“ ๐ŸŒ -- a library for modeling and solving large, complicated optimization problems. It is a feature rich, mature and performant library which has been used in production at Google since 2010. Ceres Solver can solve two kinds of problems: (1) Non-linear Least Squares problems with bounds constraints, and (2) General unconstrained optimization problems.

  • gibbs-lda ๐Ÿ“ ๐ŸŒ -- modified GibbsLDA++: A C/C++ Implementation of Latent Dirichlet Allocation by Xuan-Hieu Phan and Cam-Tu Nguyen.

  • lda ๐Ÿ“ ๐ŸŒ -- variational EM for latent Dirichlet allocation (LDA), David Blei et al

  • lda-3-variants ๐Ÿ“ ๐ŸŒ -- three modified open source versions of LDA with collapsed Gibbs Sampling: GibbsLDA++, ompi-lda and online_twitter_lda.

  • lda-bigartm ๐Ÿ“ ๐ŸŒ -- BigARTM is a powerful tool for topic modeling based on a novel technique called Additive Regularization of Topic Models. This technique effectively builds multi-objective models by adding weighted sums of regularizers to the optimization criterion. BigARTM is known to combine very different objectives well, including sparsing, smoothing, topic decorrelation and many others. Such a combination of regularizers significantly improves several quality measures at once, with almost no loss of perplexity.

  • lda-Familia ๐Ÿ“ ๐ŸŒ -- Familia: A Configurable Topic Modeling Framework for Industrial Text Engineering (Di Jiang and Yuanfeng Song and Rongzhong Lian and Siqi Bao and Jinhua Peng and Huang He and Hua Wu) (2018)

  • LightLDA ๐Ÿ“ ๐ŸŒ -- a distributed system for large scale topic modeling. It implements a distributed sampler that enables very large data sizes and models. LightLDA improves sampling throughput and convergence speed via a fast O(1) Metropolis-Hastings algorithm, and allows a small cluster to tackle very large data and model sizes through model scheduling and a data-parallel architecture. LightLDA is implemented in C++ for performance.

  • mcmc ๐Ÿ“ ๐ŸŒ -- Monte Carlo

  • mmc ๐Ÿ“ ๐ŸŒ -- Monte Carlo

  • multiverso ๐Ÿ“ ๐ŸŒ -- a parameter server based framework for training machine learning models on big data across large numbers of machines. It is currently a standard C++ library and provides a series of friendly programming interfaces. With it, machine learning researchers and practitioners do not need to worry about system routine issues such as distributed model storage and operation, inter-process and inter-thread communication, multi-threading management, and so on. Instead, they can focus on the core machine learning logic: data, model, and training.

  • ncnn ๐Ÿ“ ๐ŸŒ -- high-performance neural network inference computing framework optimized for mobile platforms (i.e. small footprint)

  • OptimizationTemplateLibrary ๐Ÿ“ ๐ŸŒ -- Optimization Template Library (OTL)

  • pke ๐Ÿ“ ๐ŸŒ -- python keyphrase extraction (PKE) is an open source, Python-based keyphrase extraction toolkit. It provides an end-to-end keyphrase extraction pipeline in which each component can be easily modified or extended to develop new models. pke also allows for easy benchmarking of state-of-the-art keyphrase extraction models, and ships with supervised models trained on the SemEval-2010 dataset.

  • RAKE ๐Ÿ“ ๐ŸŒ -- the Rapid Automatic Keyword Extraction (RAKE) algorithm as described in: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In M. W. Berry & J. Kogan (Eds.), Text Mining: Theory and Applications: John Wiley & Sons.

  • stan ๐Ÿ“ ๐ŸŒ -- Stan is a C++ package providing (1) full Bayesian inference using the No-U-Turn sampler (NUTS), a variant of Hamiltonian Monte Carlo (HMC), (2) approximate Bayesian inference using automatic differentiation variational inference (ADVI), and (3) penalized maximum likelihood estimation (MLE) using L-BFGS optimization. It is built on top of the Stan Math library.

  • stateline ๐Ÿ“ ๐ŸŒ -- a framework for distributed Markov Chain Monte Carlo (MCMC) sampling written in C++. It implements random walk Metropolis-Hastings with parallel tempering to improve chain mixing, provides an adaptive proposal distribution to speed up convergence, and allows the user to factorise their likelihoods (eg. over sensors or data).
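
    The core random-walk Metropolis-Hastings loop that such frameworks build on (minus stateline's parallel tempering and adaptive proposals) fits in a few lines. A minimal sketch, targeting an unnormalized log-density (the `rw_metropolis` name and signature are illustrative, not stateline's API):

    ```cpp
    #include <cmath>
    #include <random>
    #include <utility>

    // Random-walk Metropolis-Hastings over a 1-D unnormalized log-density.
    // Returns (sample mean, sample variance) of the chain after burn-in.
    std::pair<double, double> rw_metropolis(double (*log_target)(double),
                                            int burn, int n, unsigned seed) {
        std::mt19937 rng(seed);
        std::normal_distribution<double> step(0.0, 1.0);      // proposal kernel
        std::uniform_real_distribution<double> unif(0.0, 1.0);
        double x = 0.0, sum = 0.0, sumsq = 0.0;
        for (int i = 0; i < burn + n; ++i) {
            double cand = x + step(rng);
            // Accept with probability min(1, pi(cand) / pi(x)), in log space.
            if (std::log(unif(rng)) < log_target(cand) - log_target(x))
                x = cand;
            if (i >= burn) { sum += x; sumsq += x * x; }
        }
        double mean = sum / n;
        return {mean, sumsq / n - mean * mean};
    }
    ```

    Targeting log π(x) = -x²/2 (a standard normal) and drawing a few tens of thousands of samples should recover a mean near 0 and a variance near 1; the production-grade libraries above add better mixing, adaptation and distribution on top of exactly this kernel.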

  • waifu2x-ncnn-vulkan ๐Ÿ“ ๐ŸŒ -- waifu2x ncnn Vulkan: an ncnn project implementation of the waifu2x converter. Runs fast on Intel / AMD / Nvidia / Apple-Silicon with Vulkan API.

  • warpLDA ๐Ÿ“ ๐ŸŒ -- a cache efficient implementation for Latent Dirichlet Allocation.

  • worde_butcher ๐Ÿ“ ๐ŸŒ -- a tool for text segmentation, keyword extraction and speech tagging. Butchers any text into prime word / phrase cuts, deboning all incoming based on our definitive set of stopwords for all languages.

  • yake ๐Ÿ“ ๐ŸŒ -- Yet Another Keyword Extractor (Yake) is a light-weight unsupervised automatic keyword extraction method which rests on text statistical features extracted from single documents to select the most important keywords of a text. Our system does not need to be trained on a particular set of documents, nor does it depend on dictionaries, external corpora, text size, language or domain.

  • other topic modeling code on the Net:

Random generators & all things random

  • EigenRand ๐Ÿ“ ๐ŸŒ -- EigenRand: The Fastest C++11-compatible random distribution generator for Eigen. EigenRand is a header-only library for Eigen, providing vectorized random number engines and vectorized random distribution generators. Since the classic Random functions of Eigen rely on the old C function rand(), there is no way to control the random numbers and no guarantee of the quality of the generated numbers. In addition, Eigen's Random is slow because rand() is hard to vectorize. EigenRand provides a variety of random distribution functions similar to the C++11 standard's random functions, which can be vectorized and easily integrated into Eigen's expressions of Matrix and Array. You can get a 5~10x speedup by simply replacing Eigen's old Random or unvectorizable C++11 random number generators with EigenRand.
  • fastPRNG ๐Ÿ“ ๐ŸŒ -- a single header-only FAST 32/64 bit PRNG (pseudo-random generator), highly optimized to obtain faster code from compilers, it's based on xoshiro / xoroshiro (Blackman/Vigna), xorshift and other Marsaglia algorithms.
  • libchaos ๐Ÿ“ ๐ŸŒ -- an advanced library for randomization, hashing and statistical analysis (devoted to chaos machines), written to help with the development of software for scientific research. The project goal is to implement & analyze various algorithms for randomization and hashing, while maintaining simplicity and security, making them suitable for use in your own code. Popular tools like TestU01, Dieharder and Hashdeep are obsolete or their development has been stopped. Libchaos aims to replace them.
  • libprng ๐Ÿ“ ๐ŸŒ -- a collection of C/C++ PRNGs (pseudo-random number generators) + supporting code.
  • pcg-cpp-random ๐Ÿ“ ๐ŸŒ -- a C++ implementation of the PCG family of random number generators, which are fast, statistically excellent, and offer a number of useful features.
  • pcg-c-random ๐Ÿ“ ๐ŸŒ -- a C implementation of the PCG family of random number generators, which are fast, statistically excellent, and offer a number of useful features.
  • prvhash ๐Ÿ“ ๐ŸŒ -- PRVHASH is a hash function that generates a uniform pseudo-random number sequence derived from the message. PRVHASH is conceptually similar (in the sense of using a pseudo-random number sequence as a hash) to the keccak and RadioGatun schemes, but is a completely different implementation of that concept. PRVHASH is both a "randomness extractor" and an "extendable-output function" (XOF).
  • randen ๐Ÿ“ ๐ŸŒ -- What if we could default to attack-resistant random generators without excessive CPU cost? We introduce 'Randen', a new generator with security guarantees; it outperforms MT19937, pcg64_c32, Philox, ISAAC and ChaCha8 in real-world benchmarks. This is made possible by AES hardware acceleration and a large Feistel permutation.
  • random ๐Ÿ“ ๐ŸŒ -- random for modern C++ with a convenient API.
  • RNGSobol ๐Ÿ“ ๐ŸŒ -- Sobol quasi-random number generator (C++). Note that unlike pseudo-random numbers, quasi-random numbers care about the dimensionality of points.
  • trng4 ๐Ÿ“ ๐ŸŒ -- Tina's Random Number Generator Library (TRNG) is a state-of-the-art C++ pseudo-random number generator library for sequential and parallel Monte Carlo simulations. Its design principles are based on the extensible random number generator facility that was introduced in the C++11 standard. The TRNG library features an object oriented design, is easy to use and has been speed optimized. Its implementation does not depend on any communication library or hardware architecture.
  • Xoshiro-cpp ๐Ÿ“ ๐ŸŒ -- a header-only pseudorandom number generator library for modern C++. Based on David Blackman and Sebastiano Vigna's xoshiro/xoroshiro generators.
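
    For reference, the xoshiro256** step that fastPRNG and Xoshiro-cpp build on (Blackman/Vigna) is only a handful of operations; a minimal sketch (the `Xoshiro256ss` wrapper is illustrative, not either library's API):

    ```cpp
    #include <cstdint>

    // Minimal xoshiro256** (Blackman/Vigna). The state must not be all zero;
    // the reference implementation seeds it with a splitmix64 stream.
    struct Xoshiro256ss {
        uint64_t s[4];

        static uint64_t rotl(uint64_t x, int k) {
            return (x << k) | (x >> (64 - k));
        }

        uint64_t next() {
            uint64_t result = rotl(s[1] * 5, 7) * 9;  // the "**" scrambler
            uint64_t t = s[1] << 17;
            s[2] ^= s[0];
            s[3] ^= s[1];
            s[1] ^= s[2];
            s[0] ^= s[3];
            s[2] ^= t;
            s[3] = rotl(s[3], 45);
            return result;
        }
    };
    ```

    The linear xorshift/rotate state transition is what vectorizes so well, which is exactly what the libraries above exploit.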

Regression, curve fitting, polynomials, splines, geometrics, interpolation, math

  • baobzi ๐Ÿ“ ๐ŸŒ -- an adaptive fast function approximator based on tree search. Word salad aside, baobzi is a tool to convert very CPU intensive function calculations into relatively cheap ones (at the cost of memory). This is similar to functions like chebeval in MATLAB, but can be significantly faster since the order of the polynomial fit can be much much lower to meet similar tolerances. It also isn't constrained for use only in MATLAB. Internally, baobzi represents your function by a grid of binary/quad/oct/N trees, where the leaves represent the function in some small sub-box of the function's domain with chebyshev polynomials. When you evaluate your function at a point with baobzi, it searches the tree for the box containing your point and evaluates using this approximant.
  • blaze ๐Ÿ“ ๐ŸŒ -- a high-performance C++ math library for dense and sparse arithmetic. With its state-of-the-art Smart Expression Template implementation Blaze combines the elegance and ease of use of a domain-specific language with HPC-grade performance, making it one of the most intuitive and fastest C++ math libraries available.
  • Clipper2 ๐Ÿ“ ๐ŸŒ -- a Polygon Clipping and Offsetting library.
  • fastops ๐Ÿ“ ๐ŸŒ -- vector operations library, which enables acceleration of bulk calls of certain math functions on AVX and AVX2 hardware. Currently supported operations are exp, log, sigmoid and tanh. The library itself implements operations using AVX and AVX2, but will work on any hardware with at least SSE2 support.
  • fastrange ๐Ÿ“ ๐ŸŒ -- a fast alternative to the modulo reduction. It has accelerated some operations in Google's Tensorflow by 10% to 20%. Further reading : http://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/ See also: Daniel Lemire, Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation, January 2019 Article No. 3 https://doi.org/10.1145/3230636
  • figtree ๐Ÿ“ ๐ŸŒ -- FIGTree is a library that provides a C/C++ and MATLAB interface for speeding up the computation of the Gauss Transform.
  • fityk ๐Ÿ“ ๐ŸŒ -- a program for nonlinear fitting of analytical functions (especially peak-shaped) to data (usually experimental data). To put it differently, it is primarily peak fitting software, but can handle other types of functions as well. Apart from the actual fitting, the program helps with data processing and provides an ergonomic graphical interface (and also a command line interface and scripting API -- but if the program is popular in some fields, it's thanks to its graphical interface). It is reportedly used in crystallography, chromatography, photoluminescence and photoelectron spectroscopy, and infrared and Raman spectroscopy, to name but a few. Fityk offers various nonlinear fitting methods, simple background subtraction and other manipulations of the dataset, easy placement of peaks and changing of peak parameters, support for analysis of series of datasets, automation of common tasks with scripts, and much more.
  • float_compare ๐Ÿ“ ๐ŸŒ -- C++ header providing floating point value comparators with user-specifiable tolerances and behaviour.
  • fmath ๐Ÿ“ ๐ŸŒ -- fast approximate implementations of the exponential and logarithm functions: includes fmath::log, fmath::exp, fmath::expd.
  • gmt ๐Ÿ“ ๐ŸŒ -- GMT (Generic Mapping Tools) is an open source collection of about 100 command-line tools for manipulating geographic and Cartesian data sets (including filtering, trend fitting, gridding, projecting, etc.) and producing high-quality illustrations ranging from simple x-y plots via contour maps to artificially illuminated surfaces, 3D perspective views and animations. The GMT supplements add another 50 more specialized and discipline-specific tools. GMT supports over 30 map projections and transformations and requires support data such as GSHHG coastlines, rivers, and political boundaries and optionally DCW country polygons.
  • half ๐Ÿ“ ๐ŸŒ -- IEEE 754-based half-precision floating point library forked from http://half.sourceforge.net/. This is a C++ header-only library to provide an IEEE 754 conformant 16-bit half-precision floating-point type along with corresponding arithmetic operators, type conversions and common mathematical functions. It aims for both efficiency and ease of use, trying to accurately mimic the behaviour of the built-in floating-point types at the best performance possible.
  • hilbert_curves ๐Ÿ“ ๐ŸŒ -- the world's fastest implementations of 2D and 3D hilbert curve functions.
  • hilbert_hpp ๐Ÿ“ ๐ŸŒ -- contains two implementations of the hilbert curve encoding & decoding algorithm described by John Skilling in his paper "Programming the Hilbert Curve".
  • ifopt ๐Ÿ“ ๐ŸŒ -- a modern, light-weight, [Eigen]-based C++ interface to Nonlinear Programming solvers, such as Ipopt and Snopt.
  • ink-stroke-modeler ๐Ÿ“ ๐ŸŒ -- smoothes raw freehand input and predicts the input's motion to minimize display latency. It turns noisy pointer input from touch/stylus/etc. into the beautiful stroke patterns of brushes/markers/pens/etc. Be advised that this library was designed to model handwriting, and as such, prioritizes smooth, good-looking curves over precise recreation of the input.
  • Ipopt ๐Ÿ“ ๐ŸŒ -- Ipopt (Interior Point OPTimizer, pronounced eye-pea-Opt) is a software package for large-scale nonlinear optimization. It is designed to find (local) solutions of mathematical optimization problems.
  • libhilbert ๐Ÿ“ ๐ŸŒ -- an implementation of the Chenyang, Hong, Nengchao 2008 IEEE N-dimensional Hilbert mapping algorithm. The Hilbert generating genes are statically compiled into the library, thus producing a rather large executable size. This library supports the forward and backward mapping algorithms from R_N -> R_1 and R_1 -> R_N. The library is used straightforwardly; for guidance and documentation, see hilbertKey.h.
  • libInterpolate ๐Ÿ“ ๐ŸŒ -- a C++ interpolation library, which provides classes to perform various types of 1D and 2D function interpolation (linear, spline, etc.).
  • lmfit ๐Ÿ“ ๐ŸŒ -- Levenberg-Marquardt least squares minimization and curve fitting, to minimize arbitrary user-provided functions or to fit user-provided data. No need to provide derivatives.
  • lol ๐Ÿ“ ๐ŸŒ -- the header-only part of the Lol (Math) Engine framework.
  • lolremez ๐Ÿ“ ๐ŸŒ -- LolRemez is a Remez algorithm implementation to approximate functions using polynomials.
  • magsac ๐Ÿ“ ๐ŸŒ -- (MAGSAC++ had been included in OpenCV) the MAGSAC and MAGSAC++ algorithms for robust model fitting without using a single inlier-outlier threshold.
  • mathtoolbox ๐Ÿ“ ๐ŸŒ -- mathematical tools (interpolation, dimensionality reduction, optimization, etc.) written in C++11 and Eigen.
  • mlinterp ๐Ÿ“ ๐ŸŒ -- a fast C++ routine for linear interpolation in arbitrary dimensions (i.e., multilinear interpolation).
  • nlopt-util ๐Ÿ“ ๐ŸŒ -- a single-header utility library for calling NLopt optimization in a single line using Eigen::VectorXd.
  • openlibm ๐Ÿ“ ๐ŸŒ -- OpenLibm is an effort to have a high quality, portable, standalone C mathematical library (libm). The project was born out of a need to have a good libm for the Julia programming language that worked consistently across compilers and operating systems, and in 32-bit and 64-bit environments.
  • polatory ๐Ÿ“ ๐ŸŒ -- a fast and memory-efficient framework for RBF (radial basis function) interpolation. Polatory can perform kriging prediction via RBF interpolation (dual kriging). Although different terminologies are used, both methods produce the same results.
  • qHilbert ๐Ÿ“ ๐ŸŒ -- a vectorized speedup of Hilbert curve generation using SIMD intrinsics. A Hilbert curve is a space filling self-similar curve that provides a mapping from 2D space to 1D and from 1D to 2D space while preserving locality between mappings. Hilbert curves split a finite 2D space into recursive quadrants (similar to a full quad-tree) and traverse each quadrant in recursive "U" shapes at each iteration such that every quadrant gets fully visited before moving onto the next one. qHilbert is an attempt at a vectorized speedup of mapping multiple linear 1D indices into planar 2D points in parallel that is based on the Butz Algorithm's utilization of Gray code.
  • radon-tf ๐Ÿ“ ๐ŸŒ -- simple implementation of the radon transform. Faster when using more than one thread to execute it. No inverse function is provided. CPU implementation only.
  • RNGSobol ๐Ÿ“ ๐ŸŒ -- Sobol quasi-random number generator (C++). Note that unlike pseudo-random numbers, quasi-random numbers care about the dimensionality of points.
  • rotate ๐Ÿ“ ๐ŸŒ -- provides several classic, commonly used and novel rotation algorithms (aka block swaps), which were documented since around 1981 up to 2021: three novel rotation algorithms were introduced in 2021, notably the trinity rotation.
  • RRD ๐Ÿ“ ๐ŸŒ -- RRD: Rotation-Sensitive Regression for Oriented Scene Text Detection
  • RRPN ๐Ÿ“ ๐ŸŒ -- Arbitrary-Oriented Scene Text Detection via Rotation Proposals (https://arxiv.org/abs/1703.01086)
  • rtl ๐Ÿ“ ๐ŸŒ -- RANSAC Template Library (RTL) is an open-source robust regression tool, especially for the RANSAC family. RTL aims to provide fast, accurate, and easy ways to estimate any model parameters from data contaminated with outliers (incorrect data). RTL includes recent RANSAC variants with their performance evaluation on several models with synthetic and real data. RANdom SAmple Consensus (RANSAC) is an iterative method to make any parameter estimator robust against outliers. For example, in line fitting, RANSAC enables estimating the line parameters even though the data points include wrong observations far from the true line.
  • ruy ๐Ÿ“ ๐ŸŒ -- a matrix multiplication library. Its focus is to cover the matrix multiplication needs of neural network inference engines. Its initial user has been TensorFlow Lite, where it is used by default on the ARM CPU architecture. ruy supports both floating-point and 8-bit-integer-quantized matrices.
  • scilab ๐Ÿ“ ๐ŸŒ -- Scilab includes hundreds of mathematical functions. It has a high-level programming language allowing access to advanced data structures, 2-D and 3-D graphical functions.
  • sequential-line-search ๐Ÿ“ ๐ŸŒ -- a C++ library for performing the sequential line search method (which is a human-in-the-loop variant of Bayesian optimization), following the paper "Yuki Koyama, Issei Sato, Daisuke Sakamoto, and Takeo Igarashi. 2017. Sequential Line Search for Efficient Visual Design Optimization by Crowds. ACM Trans. Graph. 36, 4, pp.48:1--48:11 (2017). (a.k.a. Proceedings of SIGGRAPH 2017), DOI: https://doi.org/10.1145/3072959.3073598"
  • Sophus ๐Ÿ“ ๐ŸŒ -- a C++ implementation of Lie groups commonly used for 2d and 3d geometric problems (i.e. for Computer Vision or Robotics applications). Among others, this package includes the special orthogonal groups SO(2) and SO(3) to present rotations in 2d and 3d as well as the special Euclidean group SE(2) and SE(3) to represent rigid body transformations (i.e. rotations and translations) in 2d and 3d.
  • spline ๐Ÿ“ ๐ŸŒ -- a lightweight C++ cubic spline interpolation library.
  • splinter ๐Ÿ“ ๐ŸŒ -- SPLINTER (SPLine INTERpolation) is a library for multivariate function approximation with splines. The library can be used for function approximation, regression, data smoothing, data reduction, and much more. Spline approximations are represented by a speedy C++ implementation of the tensor product B-spline. The B-spline consists of piecewise polynomial basis functions, offering a high flexibility and smoothness. The B-spline can be fitted to data using ordinary least squares (OLS), possibly with regularization. The library also offers construction of penalized splines (P-splines).
  • sse2neon ๐Ÿ“ ๐ŸŒ -- converts Intel SSE intrinsics to Arm/Aarch64 NEON intrinsics, shortening the time needed to get an Arm working program that then can be used to extract profiles and to identify hot paths in the code.
  • sse-popcount ๐Ÿ“ ๐ŸŒ -- SIMD popcount; sample programs for my article http://0x80.pl/articles/sse-popcount.html / Faster Population Counts using AVX2 Instructions (https://arxiv.org/abs/1611.07612)
  • theoretica ๐Ÿ“ ๐ŸŒ -- a numerical and automatic math library for scientific research and graphical applications. Theoretica is a header-only mathematical library which provides algorithms for systems simulation, statistical analysis of lab data and numerical approximation, using a functional oriented paradigm to mimic mathematical notation and formulas. The aim of the library is to provide simple access to powerful algorithms while keeping an elegant and transparent interface, enabling the user to focus on the problem at hand.
  • tinynurbs ๐Ÿ“ ๐ŸŒ -- a lightweight header-only C++14 library for Non-Uniform Rational B-Spline curves and surfaces. The API is simple to use and the code is readable while being efficient.
  • tinyspline ๐Ÿ“ ๐ŸŒ -- TinySpline is a small, yet powerful library for interpolating, transforming, and querying arbitrary NURBS, B-Splines, and Bézier curves.
  • tweeny ๐Ÿ“ ๐ŸŒ -- an inbetweening library designed for the creation of complex animations for games and other beautiful interactive software. It leverages features of modern C++ to empower developers with an intuitive API for declaring tweenings of any type of value, as long as they support arithmetic operations. The goal of Tweeny is to provide means to create fluid interpolations when animating position, scale, rotation, frames or other values of screen objects, by setting their values as the tween starting point and then, after each tween step, plugging back the result.

Solvers, Clustering, Monte Carlo, Decision Trees

  • ArborX ๐Ÿ“ ๐ŸŒ -- a library designed to provide performance portable algorithms for geometric search, similarly to nanoflann and Boost Geometry.
  • baobzi ๐Ÿ“ ๐ŸŒ -- an adaptive fast function approximator based on tree search. Word salad aside, baobzi is a tool to convert very CPU intensive function calculations into relatively cheap ones (at the cost of memory). This is similar to functions like chebeval in MATLAB, but can be significantly faster since the order of the polynomial fit can be much much lower to meet similar tolerances. It also isn't constrained for use only in MATLAB. Internally, baobzi represents your function by a grid of binary/quad/oct/N trees, where the leaves represent the function in some small sub-box of the function's domain with chebyshev polynomials. When you evaluate your function at a point with baobzi, it searches the tree for the box containing your point and evaluates using this approximant.
  • brown-cluster ๐Ÿ“ ๐ŸŒ -- the Brown hierarchical word clustering algorithm. Runs in $O(N C^2)$, where $N$ is the number of word types and $C$ is the number of clusters. Algorithm by Brown, et al.: Class-Based n-gram Models of Natural Language, http://acl.ldc.upenn.edu/J/J92/J92-4003.pdf
  • CppNumericalSolvers ๐Ÿ“ ๐ŸŒ -- a header-only C++17 BFGS / L-BFGS-B optimization library.
  • dbscan ๐Ÿ“ ๐ŸŒ -- Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Related Algorithms: a fast C++ (re)implementation of several density-based algorithms with a focus on the DBSCAN family for clustering spatial data. The package includes: DBSCAN, HDBSCAN, OPTICS/OPTICSXi, FOSC, Jarvis-Patrick clustering, LOF (Local outlier factor), GLOSH (Global-Local Outlier Score from Hierarchies), kd-tree based kNN search, Fixed-radius NN search
  • FIt-SNE ๐Ÿ“ ๐ŸŒ -- FFT-accelerated implementation of t-Stochastic Neighborhood Embedding (t-SNE), which is a highly successful method for dimensionality reduction and visualization of high dimensional datasets. A popular implementation of t-SNE uses the Barnes-Hut algorithm to approximate the gradient at each iteration of gradient descent.
  • fityk ๐Ÿ“ ๐ŸŒ -- a program for nonlinear fitting of analytical functions (especially peak-shaped) to data (usually experimental data). To put it differently, it is primarily peak-fitting software, but it can handle other types of functions as well. Apart from the actual fitting, the program helps with data processing and provides an ergonomic graphical interface (plus a command-line interface and a scripting API -- though if the program is popular in some fields, it is thanks to its graphical interface). It is reportedly used in crystallography, chromatography, photoluminescence and photoelectron spectroscopy, and infrared and Raman spectroscopy, to name but a few. Fityk offers various nonlinear fitting methods, simple background subtraction and other manipulations of the dataset, easy placement of peaks and changing of peak parameters, support for analysis of series of datasets, automation of common tasks with scripts, and much more.
  • genieclust ๐Ÿ“ ๐ŸŒ -- a faster and more powerful version of Genie (Fast and Robust Hierarchical Clustering with Noise Point Detection) โ€“ a robust and outlier resistant clustering algorithm (see Gagolewski, Bartoszuk, Cena, 2016).
  • gram_savitzky_golay ๐Ÿ“ ๐ŸŒ -- Savitzky-Golay filtering based on Gram polynomials, as described in General Least-Squares Smoothing and Differentiation by the Convolution (Savitzky-Golay) Method
  • hdbscan ๐Ÿ“ ๐ŸŒ -- a fast parallel implementation for HDBSCAN* [1] (hierarchical DBSCAN). The implementation stems from our parallel algorithms [2] developed at MIT, and presented at SIGMOD 2021. Our approach is based on generating a well-separated pair decomposition followed by using Kruskal's minimum spanning tree algorithm and bichromatic closest pair computations. We also give a new parallel divide-and-conquer algorithm for computing the dendrogram, which is used to visualize the clusters of different scales that arise for HDBSCAN*.
  • hdbscan-cpp ๐Ÿ“ ๐ŸŒ -- Fast and Efficient Implementation of HDBSCAN in C++ using STL. HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications with Noise. Performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection. In practice this means that HDBSCAN returns a good clustering straight away with little or no parameter tuning -- and the primary parameter, minimum cluster size, is intuitive and easy to select. HDBSCAN is ideal for exploratory data analysis; it's a fast and robust algorithm that you can trust to return meaningful clusters (if there are any).
  • ifopt ๐Ÿ“ ๐ŸŒ -- a modern, light-weight, [Eigen]-based C++ interface to Nonlinear Programming solvers, such as Ipopt and Snopt.
  • Ipopt ๐Ÿ“ ๐ŸŒ -- Ipopt (Interior Point OPTimizer, pronounced eye-pea-Opt) is a software package for large-scale nonlinear optimization. It is designed to find (local) solutions of mathematical optimization problems.
  • kiwi ๐Ÿ“ ๐ŸŒ -- Kiwi is an efficient C++ implementation of the Cassowary constraint solving algorithm, based on the seminal Cassowary paper (https://constraints.cs.washington.edu/solvers/cassowary-tochi.pdf). It is not a refactoring of the original C++ solver: Kiwi has been designed from the ground up to be lightweight and fast. Kiwi ranges from 10x to 500x faster than the original Cassowary solver, with typical use cases gaining a 40x improvement. Memory savings are consistently > 5x.
  • LBFGS-Lite ๐Ÿ“ ๐ŸŒ -- a header-only L-BFGS unconstrained optimizer.
  • liblbfgs ๐Ÿ“ ๐ŸŒ -- libLBFGS: C library of limited-memory BFGS (L-BFGS), a C port of the implementation of Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method written by Jorge Nocedal. The original FORTRAN source code is available at: http://www.ece.northwestern.edu/~nocedal/lbfgs.html
  • lmfit ๐Ÿ“ ๐ŸŒ -- Levenberg-Marquardt least-squares minimization and curve fitting: minimizes arbitrary user-provided functions, or fits user-provided data. No need to provide derivatives.
  • LMW-tree ๐Ÿ“ ๐ŸŒ -- LMW-tree: learning m-way tree is a generic template library written in C++ that implements several algorithms that use the m-way nearest neighbor tree structure to store their data. The algorithms are primarily focused on computationally efficient clustering. Clustering is an unsupervised machine learning process that finds interesting patterns in data: it places similar items into clusters and dissimilar items into different clusters. The data structures and algorithms can also be used for nearest neighbor search, supervised learning and other machine learning applications. The package includes EM-tree, K-tree, k-means, TSVQ, repeated k-means, clustering, random projections, random indexing, hashing and bit signatures. See the related PhD thesis for more details on m-way nn trees, these algorithms and the representations used.
  • mapreduce ๐Ÿ“ ๐ŸŒ -- the MapReduce-MPI (MR-MPI) library. MapReduce is the operation popularized by Google for computing on large distributed data sets. See the Wikipedia entry on MapReduce for an overview of what a MapReduce is. The MR-MPI library is a simple, portable implementation of MapReduce that runs on any serial desktop machine or large parallel machine using MPI message passing.
  • mathtoolbox ๐Ÿ“ ๐ŸŒ -- mathematical tools (interpolation, dimensionality reduction, optimization, etc.) written in C++11 and Eigen.
  • mcmc-jags ๐Ÿ“ ๐ŸŒ -- JAGS (Just Another Gibbs Sampler), a program for analysis of Bayesian Graphical models by Gibbs Sampling.
  • MicroPather ๐Ÿ“ ๐ŸŒ -- a path finder and A* solver (astar or a-star) written in platform independent C++ that can be easily integrated into existing code. MicroPather focuses on being a path finding engine for video games but is a generic A* solver.
  • Multicore-TSNE ๐Ÿ“ ๐ŸŒ -- Multicore t-SNE is a multicore modification of Barnes-Hut t-SNE by L. Van der Maaten with Python CFFI-based wrappers. This code also works faster than sklearn.TSNE on 1 core (as of version 0.18).
  • nlopt ๐Ÿ“ ๐ŸŒ -- a library for nonlinear local and global optimization, for functions with and without gradient information. It is designed as a simple, unified interface and packaging of several free/open-source nonlinear optimization libraries.
  • nlopt-util ๐Ÿ“ ๐ŸŒ -- a single-header utility library for calling NLopt optimization in a single line using Eigen::VectorXd.
  • openlibm ๐Ÿ“ ๐ŸŒ -- OpenLibm is an effort to have a high quality, portable, standalone C mathematical library (libm). The project was born out of a need to have a good libm for the Julia programming language that worked consistently across compilers and operating systems, and in 32-bit and 64-bit environments.
  • or-tools ๐Ÿ“ ๐ŸŒ -- Google Optimization Tools (a.k.a., OR-Tools) is an open-source, fast and portable software suite for solving combinatorial optimization problems. The suite includes a constraint programming solver, a linear programming solver and various graph algorithms.
  • osqp ๐Ÿ“ ๐ŸŒ -- the Operator Splitting Quadratic Program Solver.
  • osqp-cpp ๐Ÿ“ ๐ŸŒ -- a C++ wrapper for OSQP, an ADMM-based solver for quadratic programming. Compared with OSQP's native C interface, the wrapper provides a more convenient input format using Eigen sparse matrices and handles the lifetime of the OSQPWorkspace struct. This package has similar functionality to osqp-eigen.
  • osqp-eigen ๐Ÿ“ ๐ŸŒ -- a simple C++ wrapper for osqp library.
  • paramonte ๐Ÿ“ ๐ŸŒ -- ParaMonte (Plain Powerful Parallel Monte Carlo Library) is a serial/parallel library of Monte Carlo routines for sampling mathematical objective functions of arbitrary-dimensions, in particular, the posterior distributions of Bayesian models in data science, Machine Learning, and scientific inference, with the design goal of unifying the automation (of Monte Carlo simulations), user-friendliness (of the library), accessibility (from multiple programming environments), high-performance (at runtime), and scalability (across many parallel processors).
  • rgf ๐Ÿ“ ๐ŸŒ -- Regularized Greedy Forest (RGF) is a tree ensemble machine learning method described in this paper. RGF can deliver better results than gradient boosted decision trees (GBDT) on a number of datasets and it has been used to win a few Kaggle competitions. Unlike the traditional boosted decision tree approach, RGF works directly with the underlying forest structure. RGF integrates two ideas: one is to include tree-structured regularization into the learning formulation; and the other is to employ the fully-corrective regularized greedy algorithm.
  • RNGSobol ๐Ÿ“ ๐ŸŒ -- Sobol quasi-random number generator (C++). Note that unlike pseudo-random numbers, quasi-random numbers care about the dimensionality of points.
  • scilab ๐Ÿ“ ๐ŸŒ -- Scilab includes hundreds of mathematical functions. It has a high-level programming language allowing access to advanced data structures, 2-D and 3-D graphical functions.
  • sequential-line-search ๐Ÿ“ ๐ŸŒ -- a C++ library for performing the sequential line search method (which is a human-in-the-loop variant of Bayesian optimization), following the paper "Yuki Koyama, Issei Sato, Daisuke Sakamoto, and Takeo Igarashi. 2017. Sequential Line Search for Efficient Visual Design Optimization by Crowds. ACM Trans. Graph. 36, 4, pp.48:1--48:11 (2017). (a.k.a. Proceedings of SIGGRAPH 2017), DOI: https://doi.org/10.1145/3072959.3073598"
  • somoclu ๐Ÿ“ ๐ŸŒ -- a massively parallel implementation of self-organizing maps. It exploits multicore CPUs, it is able to rely on MPI for distributing the workload in a cluster, and it can be accelerated by CUDA. A sparse kernel is also included, which is useful for training maps on vector spaces generated in text mining processes.
  • theoretica ๐Ÿ“ ๐ŸŒ -- a numerical and automatic math library for scientific research and graphical applications. Theoretica is a header-only mathematical library which provides algorithms for systems simulation, statistical analysis of lab data and numerical approximation, using a functional oriented paradigm to mimic mathematical notation and formulas. The aim of the library is to provide simple access to powerful algorithms while keeping an elegant and transparent interface, enabling the user to focus on the problem at hand.
  • uno-solver ๐Ÿ“ ๐ŸŒ -- a modern, modular solver for nonlinearly constrained nonconvex optimization.

Distance Metrics, Image Quality Metrics, Image Comparison

  • edit-distance ๐Ÿ“ ๐ŸŒ -- a fast implementation of the edit distance (Levenshtein distance). The algorithm used in this library is proposed by Heikki Hyyrรถ, "Explaining and extending the bit-parallel approximate string matching algorithm of Myers", (2001) http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.7158&rep=rep1&type=pdf.

  • figtree ๐Ÿ“ ๐ŸŒ -- FIGTree is a library that provides a C/C++ and MATLAB interface for speeding up the computation of the Gauss Transform.

  • flip ๐Ÿ“ ๐ŸŒ -- ๊ŸปLIP: A Tool for Visualizing and Communicating Errors in Rendered Images, implements the LDR-๊ŸปLIP and HDR-๊ŸปLIP image error metrics.

  • glfw ๐Ÿ“ ๐ŸŒ -- GLFW is an Open Source, multi-platform library for OpenGL, OpenGL ES and Vulkan application development. It provides a simple, platform-independent API for creating windows, contexts and surfaces, reading input, handling events, etc.

  • imagedistance ๐Ÿ“ ๐ŸŒ -- given two images, calculate their distance in several criteria.

  • libdip ๐Ÿ“ ๐ŸŒ -- DIPlib is a C++ library for quantitative image analysis.

  • libxcam ๐Ÿ“ ๐ŸŒ -- libXCam is a project for extended camera features and focus on image quality improvement and video analysis. There are lots features supported in image pre-processing, image post-processing and smart analysis. This library makes GPU/CPU/ISP working together to improve image quality. OpenCL is used to improve performance in different platforms.

  • magsac ๐Ÿ“ ๐ŸŒ -- (MAGSAC++ had been included in OpenCV) the MAGSAC and MAGSAC++ algorithms for robust model fitting without using a single inlier-outlier threshold.

  • mecab ๐Ÿ“ ๐ŸŒ -- MeCab (Yet Another Part-of-Speech and Morphological Analyzer) is a high-performance morphological analysis engine, designed to be independent of languages, dictionaries, and corpora, using Conditional Random Fields (CRF, http://www.cis.upenn.edu/~pereira/papers/crf.pdf) to estimate the parameters.

  • pg_similarity ๐Ÿ“ ๐ŸŒ -- pg_similarity is an extension to support similarity queries on PostgreSQL. The implementation is tightly integrated in the RDBMS in the sense that it defines operators so instead of the traditional operators (= and <>) you can use ~~~ and ~!~ (any of these operators represents a similarity function).

  • poisson_blend ๐Ÿ“ ๐ŸŒ -- a simple, readable implementation of Poisson Blending, that demonstrates the concepts explained in my article, seamlessly blending a source image and a target image, at some specified pixel location.

  • polatory ๐Ÿ“ ๐ŸŒ -- a fast and memory-efficient framework for RBF (radial basis function) interpolation. Polatory can perform kriging prediction via RBF interpolation (dual kriging). Although different terminologies are used, both methods produce the same results.

  • radon-tf ๐Ÿ“ ๐ŸŒ -- simple implementation of the radon transform. Faster when using more than one thread to execute it. No inverse function is provided. CPU implementation only.

  • RapidFuzz ๐Ÿ“ ๐ŸŒ -- rapid fuzzy string matching in Python and C++ using the Levenshtein Distance.

  • rotate ๐Ÿ“ ๐ŸŒ -- provides several classic, commonly used and novel rotation algorithms (aka block swaps), which were documented since around 1981 up to 2021: three novel rotation algorithms were introduced in 2021, notably the trinity rotation.

  • Shifted-Hamming-Distance ๐Ÿ“ ๐ŸŒ -- Shifted Hamming Distance (SHD) is an edit-distance-based filter that can quickly check whether the minimum number of edits (including insertions, deletions and substitutions) between two strings is smaller than a user-defined threshold T (the number of allowed edits between the two strings). Testing whether two strings differ by a small amount is a prevalent operation used in many applications. Perhaps its biggest use is in DNA or protein mapping, where a short DNA or protein string is compared against an enormous database in order to find similar matches. In such applications, a query string is usually compared against multiple candidate strings in the database; only candidates that are similar to the query are considered matches and recorded. SHD expands the basic Hamming distance computation, which only detects substitutions, into a full-fledged edit-distance filter, which counts not only substitutions but insertions and deletions as well.

  • vmaf ๐Ÿ“ ๐ŸŒ -- VMAF (Video Multi-Method Assessment Fusion) is an Emmy-winning perceptual video quality assessment algorithm developed by Netflix. It also provides a set of tools that allows a user to train and test a custom VMAF model.

  • ZLMediaKit ๐Ÿ“ ๐ŸŒ -- a high-performance operational-level streaming media service framework based on C++11, supporting multiple protocols (RTSP/RTMP/HLS/HTTP-FLV/WebSocket-FLV/GB28181/HTTP-TS/WebSocket-TS/HTTP-fMP4/WebSocket-fMP4/MP4/WebRTC) and protocol conversion.

    The pg_similarity extension (see above) supports a set of similarity algorithms; the best-known algorithms are covered. Be aware that each algorithm is suited to a specific domain. The following algorithms are provided:

    • Cosine Distance;
    • Dice Coefficient;
    • Euclidean Distance;
    • Hamming Distance;
    • Jaccard Coefficient;
    • Jaro Distance;
    • Jaro-Winkler Distance;
    • L1 Distance (also known as City Block or Manhattan Distance);
    • Levenshtein Distance;
    • Matching Coefficient;
    • Monge-Elkan Coefficient;
    • Needleman-Wunsch Coefficient;
    • Overlap Coefficient;
    • Q-Gram Distance;
    • Smith-Waterman Coefficient;
    • Smith-Waterman-Gotoh Coefficient;
    • Soundex Distance.

database "backend storage"

  • arangodb ๐Ÿ“ ๐ŸŒ -- a scalable open-source multi-model database natively supporting graph, document and search. All supported data models & access patterns can be combined in queries allowing for maximal flexibility.

  • arrow ๐Ÿ“ ๐ŸŒ -- Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. The reference Arrow libraries contain many distinct software components:

    • Columnar vector and table-like containers (similar to data frames) supporting flat or nested types

    • Conversions to and from other in-memory data structures

    • Integration tests for verifying binary compatibility between the implementations (e.g. sending data from Java to C++)

    • IO interfaces to local and remote filesystems

    • Readers and writers for various widely-used file formats (such as Parquet, CSV)

    • Reference-counted off-heap buffer memory management, for zero-copy memory sharing and handling memory-mapped files

    • Self-describing binary wire formats (streaming and batch/file-like) for remote procedure calls (RPC) and interprocess communication (IPC)

  • BitFunnel ๐Ÿ“ ๐ŸŒ -- the BitFunnel index used by Bing's super-fresh, news, and media indexes. The algorithm is described in BitFunnel: Revisiting Signatures for Search.

  • csv-parser ๐Ÿ“ ๐ŸŒ -- Vince's CSV Parser: there's plenty of other CSV parsers in the wild, but I had a hard time finding what I wanted. Inspired by Python's csv module, I wanted a library with simple, intuitive syntax. Furthermore, I wanted support for special use cases such as calculating statistics on very large files. Thus, this library was created with the following goals in mind.

  • csvquote ๐Ÿ“ ๐ŸŒ -- smart and simple CSV processing on the command line. This program can be used at the start and end of a text processing pipeline so that regular unix command line tools can properly handle CSV data that contain commas and newlines inside quoted data fields. Without this program, embedded special characters would be incorrectly interpreted as separators when they are inside quoted data fields.

  • datamash ๐Ÿ“ ๐ŸŒ -- GNU Datamash is a command-line program which performs basic numeric, textual and statistical operations on input textual data files. It is designed to be portable and reliable, and aid researchers to easily automate analysis pipelines, without writing code or even short scripts.

  • duckdb ๐Ÿ“ ๐ŸŒ -- DuckDB is a high-performance analytical database system. It is designed to be fast, reliable, portable, and easy to use. DuckDB provides a rich SQL dialect, with support far beyond basic SQL. DuckDB supports arbitrary and nested correlated subqueries, window functions, collations, complex types (arrays, structs), and more.

  • Extensible-Storage-Engine ๐Ÿ“ ๐ŸŒ -- ESE is an embedded / ISAM-based database engine, that provides rudimentary table and indexed access. However the library provides many other strongly layered and thus reusable sub-facilities as well: A Synchronization / Locking library, a Data-structures / STL-like library, an OS-abstraction layer, and a Cache Manager, as well as the full-blown database engine itself.

  • fast-cpp-csv-parser ๐Ÿ“ ๐ŸŒ -- a small, easy-to-use and fast header-only library for reading comma separated value (CSV) files.

  • groonga ๐Ÿ“ ๐ŸŒ -- an open-source fulltext search engine and column store.

  • harbour-core ๐Ÿ“ ๐ŸŒ -- Harbour is the free software implementation of a multi-platform, multi-threading, object-oriented, scriptable programming language, backward compatible with Clipper/xBase. Harbour consists of a compiler and runtime libraries with multiple UI and database backends, its own make system and a large collection of libraries and interfaces to many popular APIs.

  • IdGenerator ๐Ÿ“ ๐ŸŒ -- a numeric ID generator using the snowflake algorithm, developed in response to commonly occurring performance problems. An example use is when you, as an architecture designer, want to solve the problem of unique database primary keys, especially in multi-database distributed systems. You want the primary key of the data table to use the least storage space, while keeping the index speed and the Select, Insert, and Update queries fast. Meanwhile there may be more than 50 application instances, and concurrent requests can reach 100,000 per second. You do not want to rely on Redis auto-increment operations to obtain continuous primary key IDs, because continuous IDs pose business data security risks.

  • iODBC ๐Ÿ“ ๐ŸŒ -- the iODBC Driver Manager provides you with everything you need to develop ODBC-compliant applications under Unix without having to pay royalties to other parties. An ODBC driver is still needed to complete your connection architecture; you may build a driver with the iODBC components or obtain an ODBC driver from a commercial vendor.

  • libcsv2 ๐Ÿ“ ๐ŸŒ -- CSV file format reader/writer library.

  • lib_nas_lockfile ๐Ÿ“ ๐ŸŒ -- lockfile management on NAS and other disparate network filesystem storage. To be combined with SQLite to create a proper Qiqqa Sync operation.

  • libsiridb ๐Ÿ“ ๐ŸŒ -- SiriDB Connector C (libsiridb) is a library which can be used to communicate with SiriDB using the C program language. This library contains useful functions but does not handle the connection itself.

  • libsl3 ๐Ÿ“ ๐ŸŒ -- a C++ interface for SQLite 3.x. libsl3 is designed to enable comfortable and efficient communication with a SQLite database based on its natural language, which is SQL.

  • libsql ๐Ÿ“ ๐ŸŒ -- libSQL is an open source, open contribution fork of SQLite. We aim to evolve it to suit many more use cases than SQLite was originally designed for, and plan to use third-party OSS code wherever it makes sense.

    SQLite has solidified its place in modern technology stacks, embedded in nearly any computing device you can think of. Its open source nature and public domain availability make it a popular choice for modification to meet specific use cases. But despite having its code available, SQLite famously doesn't accept external contributors, so community improvements cannot be widely enjoyed. There have been other forks in the past, but they all focus on a specific technical difference. We aim to be a community where people can contribute from many different angles and motivations. We want to see a world where everyone can benefit from all of the great ideas and hard work that the SQLite community contributes back to the codebase.

  • libsqlfs ๐Ÿ“ ๐ŸŒ -- a POSIX style file system on top of an SQLite database. It allows applications to have access to a full read/write file system in a single file, complete with its own file hierarchy and name space. This is useful for applications which needs structured storage, such as embedding documents within documents, or management of configuration data or preferences.

  • ligra-graph ๐Ÿ“ ๐ŸŒ -- LIGRA: a Lightweight Graph Processing Framework for Shared Memory; works on both uncompressed and compressed graphs and hypergraphs.

  • mcmd ๐Ÿ“ ๐ŸŒ -- MCMD (M-Command): a set of commands developed for the purpose of high-speed processing of large-scale structured tabular data in CSV format. It can efficiently process large-scale data with hundreds of millions of rows of records on a standard PC.

  • mydumper ๐Ÿ“ ๐ŸŒ -- a MySQL Logical Backup Tool. It consists of two tools:

    • mydumper, which is responsible for exporting a consistent backup of MySQL databases, and
    • myloader, which reads the backup from mydumper, connects to the destination database and imports the backup.
  • mysql-connector-cpp ๐Ÿ“ ๐ŸŒ -- MySQL Connector/C++ is the C++ interface for communicating with MySQL servers.

  • nanodbc ๐Ÿ“ ๐ŸŒ -- a small C++ wrapper for the native C ODBC API.

  • ormpp ๐Ÿ“ ๐ŸŒ -- modern C++ ORM, C++17, support mysql, postgresql, sqlite.

  • otl ๐Ÿ“ ๐ŸŒ -- Oracle Template Library (STL-like wrapper for SQL DB queries; supports many databases besides Oracle)

  • percona-server ๐Ÿ“ ๐ŸŒ -- Percona Server for MySQL is a free, fully compatible, enhanced, and open source drop-in replacement for any MySQL database. It provides superior performance, scalability, and instrumentation.

  • qlever ๐Ÿ“ ๐ŸŒ -- a SPARQL engine that can efficiently index and query very large knowledge graphs with up to 100 billion triples on a single standard PC or server. In particular, QLever is fast for queries that involve large intermediate or final results, which are notoriously hard for engines like Blazegraph or Virtuoso.

  • siridb-server ๐Ÿ“ ๐ŸŒ -- SiriDB Server is a highly scalable, robust and super fast time series database. SiriDB uses a unique mechanism to operate without a global index and allows server resources to be added on the fly. SiriDB's unique query language includes dynamic grouping of time series for easy analysis over large numbers of time series. SiriDB is scalable on the fly and has no downtime while updating or expanding your database; this scalability enables you to enlarge the database time after time without losing speed. SiriDB is designed to deliver unprecedented performance without downtime. A SiriDB cluster distributes time series across multiple pools. Each pool supports active replicas for load balancing and redundancy; when one of the replicas is not available the database is still accessible.

  • sqawk ๐Ÿ“ ๐ŸŒ -- apply SQL on CSV files in the shell: sqawk imports CSV files into an on-the-fly SQLite database, and runs a user-supplied query on the data.

  • sqlcipher ๐Ÿ“ ๐ŸŒ -- SQLCipher is a standalone fork of the SQLite database library that adds 256 bit AES encryption of database files and other security features.

  • sqlean ๐Ÿ“ ๐ŸŒ -- The ultimate set of SQLite extensions: SQLite has few functions compared to other database management systems. SQLite authors see this as a feature rather than a problem, because SQLite has an extension mechanism in place. There are a lot of SQLite extensions out there, but they are incomplete, inconsistent and scattered across the internet. sqlean brings them together, neatly packaged into domain modules, documented, tested, and built for Linux, Windows and macOS.

  • sqleet ๐Ÿ“ ๐ŸŒ -- an encryption extension for SQLite3. The encryption is transparent (on-the-fly) and based on modern cryptographic algorithms designed for high performance in software and robust side-channel resistance.

  • sqlite ๐Ÿ“ ๐ŸŒ -- the complete SQLite database engine.

  • sqlite3-compression-encryption-vfs ๐Ÿ“ ๐ŸŒ -- CEVFS: Compression & Encryption VFS for SQLite 3 is a SQLite 3 Virtual File System for compressing and encrypting data at the pager level. Once set up, you use SQLite as you normally would and the compression and encryption is transparently handled during database read/write operations via the SQLite pager.

  • sqlite3pp ๐Ÿ“ ๐ŸŒ -- a minimal ORM wrapper for SQLite et al.

  • sqlite-amalgamation ๐Ÿ“ ๐ŸŒ -- the SQLite amalgamation, which is the recommended method of building SQLite into larger projects.

  • SQLiteCpp ๐Ÿ“ ๐ŸŒ -- a smart and easy to use C++ SQLite3 wrapper. SQLiteC++ offers an encapsulation around the native C APIs of SQLite, with a few intuitive and well documented C++ classes.

  • sqlite-fts5-snowball ๐Ÿ“ ๐ŸŒ -- a simple extension for use with FTS5 within SQLite. It allows FTS5 to use Martin Porter's Snowball stemmers (libstemmer), which are available in several languages. Check http://snowballstem.org/ for more information about them.

  • sqlite_fts_tokenizer_chinese_simple ๐Ÿ“ ๐ŸŒ -- an extension of sqlite3 fts5 that supports Chinese and Pinyin. It fully provides a solution to the multi-phonetic word problem of full-text retrieval on WeChat mobile terminal: solution 4 in the article, very simple and efficient support for Chinese and Pinyin searches.

    On this basis we also support more accurate phrase matching through cppjieba. See the introduction article at https://www.wangfenjin.com/posts/simple-jieba-tokenizer/

  • SQLiteHistograms ๐Ÿ“ ๐ŸŒ -- an SQLite extension library for creating histogram tables, tables of ratio between histograms and interpolation tables of scatter point tables.

  • sqliteodbc ๐Ÿ“ ๐ŸŒ -- SQLite ODBC Driver for the wonderful SQLite 2.8.* and SQLite 3.* Database Engine/Library.

  • sqlite-parquet-vtable ๐Ÿ“ ๐ŸŒ -- an SQLite virtual table extension to expose Parquet files as SQL tables. You may also find csv2parquet useful. This blog post provides some context on why you might use this.

  • sqlite-stats ๐Ÿ“ ๐ŸŒ -- provides common statistical functions for SQLite.

  • sqlite_wrapper ๐Ÿ“ ๐ŸŒ -- an easy-to-use, lightweight and concurrency-friendly SQLite wrapper written in C++17.

  • sqlite_zstd_vfs ๐Ÿ“ ๐ŸŒ -- SQLite VFS extension providing streaming storage compression using Zstandard (Zstd), transparently compressing pages of the main database file as they're written out and later decompressing them as they're read in. It runs page de/compression on background threads and occasionally generates dictionaries to improve subsequent compression.

  • sqlpp11 ๐Ÿ“ ๐ŸŒ -- a type safe embedded domain specific language for SQL queries and results in C++.

  • unixODBC ๐Ÿ“ ๐ŸŒ -- an Open Source ODBC sub-system and an ODBC SDK for Linux, Mac OSX, and UNIX.

  • unqlite ๐Ÿ“ ๐ŸŒ -- UnQLite is a Transactional Embedded Database Engine, an in-process software library which implements a self-contained, serverless, zero-configuration, transactional NoSQL database engine. UnQLite is a document store database similar to MongoDB, Redis, CouchDB etc. as well a standard Key/Value store similar to BerkeleyDB, LevelDB, etc.

    Unlike most other NoSQL databases, UnQLite does not have a separate server process. UnQLite reads and writes directly to ordinary disk files. A complete database with multiple collections is contained in a single disk file. The database file format is cross-platform; you can freely copy a database between 32-bit and 64-bit systems or between big-endian and little-endian architectures.

    • BSD licensed product.
    • Built with a powerful disk storage engine which support O(1) lookup.
    • Cross-platform file format.
    • Document store (JSON) database via Jx9.
    • Pluggable run-time interchangeable storage engine.
    • Serverless, NoSQL database engine.
    • Simple, Clean and easy to use API.
    • Single database file, does not use temporary files.
    • Standard Key/Value store.
    • Supports cursors for linear record traversal.
    • Supports on-disk as well as in-memory databases.
    • Supports terabyte-sized databases.
    • Thread-safe and fully reentrant.
    • Transactional (ACID) database.
    • UnQLite is a self-contained C library without external dependencies.
    • Zero configuration.
  • upscaledb ๐Ÿ“ ๐ŸŒ -- a.k.a. hamsterdb: a thread-safe key/value database engine. It supports a B+Tree index structure, uses memory-mapped I/O (if available), offers fast cursors and variable-length keys, and can create in-memory databases.

  • zsv ๐Ÿ“ ๐ŸŒ -- the world's fastest (SIMD) CSV parser, with an extensible CLI for SQL querying, format conversion and more.

LMDB, NoSQL and key/value stores

  • arrow ๐Ÿ“ ๐ŸŒ -- Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. The reference Arrow libraries contain many distinct software components:

    • Columnar vector and table-like containers (similar to data frames) supporting flat or nested types

    • Conversions to and from other in-memory data structures

    • Integration tests for verifying binary compatibility between the implementations (e.g. sending data from Java to C++)

    • IO interfaces to local and remote filesystems

    • Readers and writers for various widely-used file formats (such as Parquet, CSV)

    • Reference-counted off-heap buffer memory management, for zero-copy memory sharing and handling memory-mapped files

    • Self-describing binary wire formats (streaming and batch/file-like) for remote procedure calls (RPC) and interprocess communication (IPC)

  • comdb2-bdb ๐Ÿ“ ๐ŸŒ -- a clustered RDBMS built on Optimistic Concurrency Control techniques. It provides multiple isolation levels, including Snapshot and Serializable Isolation.

  • ctsa ๐Ÿ“ ๐ŸŒ -- a Univariate Time Series Analysis and ARIMA Modeling Package in ANSI C: CTSA is a C software package for univariate time series analysis. ARIMA and Seasonal ARIMA models were added in 2014; SARIMAX and Auto ARIMA followed in 2020, superseding the older ARIMA and SARIMA functions. The software is still in beta.

  • ejdb ๐Ÿ“ ๐ŸŒ -- an embeddable JSON database engine published under MIT license, offering a single file database, online backups support, a simple but powerful query language (JQL), based on the TokyoCabinet-inspired KV store iowow.

  • FASTER ๐Ÿ“ ๐ŸŒ -- helps manage large application state easily, resiliently, and with high performance by offering (1) FASTER Log, a high-performance concurrent persistent recoverable log, iterator, and random reader library, and (2) FASTER KV, a concurrent key-value store + cache designed for point lookups and heavy updates. FASTER supports data larger than memory by leveraging fast external storage (local or cloud). It also supports consistent recovery using a fast non-blocking checkpointing technique that lets applications trade off performance for commit latency. Both FASTER KV and FASTER Log offer orders-of-magnitude higher performance than comparable solutions on standard workloads.

  • gdbm ๐Ÿ“ ๐ŸŒ -- GNU dbm is a set of NoSQL database routines that use extendable hashing and work similarly to the standard UNIX dbm routines.

  • iowow ๐Ÿ“ ๐ŸŒ -- a C11 file storage utility library and persistent key/value storage engine, supporting multiple key-value databases within a single file, online database backups and Write Ahead Logging (WAL). It offers good performance compared to its main competitors: LMDB, LevelDB, Kyoto Cabinet.

  • libmdbx ๐Ÿ“ ๐ŸŒ -- one of the fastest embeddable key-value ACID databases, operating without a WAL. libmdbx surpasses the legendary LMDB in terms of reliability, features and performance.

  • libsiridb ๐Ÿ“ ๐ŸŒ -- SiriDB Connector C (libsiridb) is a library which can be used to communicate with SiriDB using the C programming language. This library contains useful functions but does not handle the connection itself.

  • Lightning.NET ๐Ÿ“ ๐ŸŒ -- .NET library for OpenLDAP's LMDB key-value store.

  • lmdb ๐Ÿ“ ๐ŸŒ -- OpenLDAP LMDB is an outrageously fast key/value store with semantics that make it highly interesting for many applications. Of specific note, besides speed, is the full support for transactions and good read/write concurrency. LMDB is also famed for its robustness when used correctly.

  • lmdb-safe ๐Ÿ“ ๐ŸŒ -- a safe, modern & performant C++ wrapper of LMDB. LMDB is an outrageously fast key/value store with semantics that make it highly interesting for many applications. Of specific note, besides speed, is the full support for transactions and good read/write concurrency. LMDB is also famed for its robustness when used correctly. The design of LMDB is elegant and simple, which aids both performance and stability. The downside of this elegant design is a nontrivial set of rules that need to be followed to not break things; in other words, LMDB delivers great things, but only if you use it exactly right. This is by conscious design. The lmdb-safe library aims to deliver the full LMDB performance while programmatically making sure the LMDB semantics are adhered to, with very limited overhead.

  • lmdb.spreads.net ๐Ÿ“ ๐ŸŒ -- Low-level zero-overhead and the fastest LMDB .NET wrapper with some additional native methods useful for Spreads.

  • lmdb-store ๐Ÿ“ ๐ŸŒ -- an ultra-fast NodeJS interface to LMDB; probably the fastest and most efficient NodeJS key-value/database interface that exists for full storage and retrieval of structured JS data (objects, arrays, etc.) in a true persisted, scalable, ACID compliant database. It provides a simple interface for interacting with LMDB.

  • lmdbxx ๐Ÿ“ ๐ŸŒ -- lmdb++: a comprehensive C++11 wrapper for the LMDB embedded database library, offering both an error-checked procedural interface and an object-oriented resource interface with RAII semantics.

  • mmkv ๐Ÿ“ ๐ŸŒ -- an efficient, small, easy-to-use mobile key-value storage framework used in the WeChat application. It's currently available on Android, iOS/macOS, Win32 and POSIX.

  • PGM-index ๐Ÿ“ ๐ŸŒ -- the Piecewise Geometric Model index (PGM-index) is a data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes while providing the same worst-case query time guarantees.
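
    The core idea is easy to sketch. The toy C++ below is illustrative only (it is not the PGM-index API, which builds many optimal piecewise segments under a configurable error bound): it fits a single linear model over a sorted array, records the maximum prediction error `eps`, and then binary-searches only the small window that bound guarantees.

    ```cpp
    #include <algorithm>
    #include <cassert>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Toy learned index: position ~ slope * key + intercept over sorted,
    // distinct keys; lookups scan only [pred - eps, pred + eps].
    struct ToyLearnedIndex {
        std::vector<int64_t> keys;
        double slope = 0.0, intercept = 0.0;
        int64_t eps = 0;

        explicit ToyLearnedIndex(std::vector<int64_t> k) : keys(std::move(k)) {
            int64_t n = (int64_t)keys.size();
            if (n >= 2 && keys.back() != keys.front()) {
                slope = double(n - 1) / double(keys.back() - keys.front());
                intercept = -slope * double(keys.front());
            }
            for (int64_t i = 0; i < n; ++i) {
                int64_t d = predict(keys[i]) - i;   // measured prediction error
                eps = std::max(eps, d < 0 ? -d : d);
            }
        }
        int64_t predict(int64_t key) const {
            int64_t p = (int64_t)(slope * double(key) + intercept);
            return std::clamp<int64_t>(p, 0, (int64_t)keys.size() - 1);
        }
        // Index of `key`, or -1 when absent.
        int64_t find(int64_t key) const {
            int64_t p = predict(key);
            int64_t lo = std::max<int64_t>(0, p - eps);
            int64_t hi = std::min<int64_t>((int64_t)keys.size(), p + eps + 1);
            auto it = std::lower_bound(keys.begin() + lo, keys.begin() + hi, key);
            if (it != keys.begin() + hi && *it == key) return it - keys.begin();
            return -1;
        }
    };

    int main() {
        ToyLearnedIndex idx({2, 3, 5, 7, 11, 13, 17, 19, 23, 29});
        std::printf("eps=%lld, find(7)=%lld, find(4)=%lld\n",
                    (long long)idx.eps, (long long)idx.find(7),
                    (long long)idx.find(4));
        return 0;
    }
    ```

    The real structure replaces the single line with an optimal piecewise-linear model, which is what shrinks the search window (and the space) so dramatically.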

  • pmemkv ๐Ÿ“ ๐ŸŒ -- pmemkv is a local/embedded key-value datastore optimized for persistent memory. Rather than being tied to a single language or backing implementation, pmemkv provides different options for language bindings and storage engines.

  • pmemkv-bench ๐Ÿ“ ๐ŸŒ -- benchmark for libpmemkv and its underlying libraries, based on leveldb's db_bench. The pmemkv_bench utility provides some standard read, write & remove benchmarks. It's based on the db_bench utility included with LevelDB and RocksDB, although the list of supported parameters is slightly different.

  • qlever ๐Ÿ“ ๐ŸŒ -- a SPARQL engine that can efficiently index and query very large knowledge graphs with up to 100 billion triples on a single standard PC or server. In particular, QLever is fast for queries that involve large intermediate or final results, which are notoriously hard for engines like Blazegraph or Virtuoso.

  • sdsl-lite ๐Ÿ“ ๐ŸŒ -- The Succinct Data Structure Library (SDSL) is a powerful and flexible C++11 library implementing succinct data structures. In total, the library contains the highlights of 40 research publications. Succinct data structures can represent an object (such as a bitvector or a tree) in space close to the information-theoretic lower bound of the object while supporting operations of the original object efficiently. The theoretical time complexities of operations on the classical data structure and on the equivalent succinct data structure are (most of the time) identical.
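
    To make the rank/select flavor of such structures concrete, here is a toy C++ bitvector with O(1) rank support. This is illustrative only, not sdsl-lite's code or API: the real library gets the extra space down to o(n) bits, far below the one-counter-per-word overhead used here.

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Toy rank structure: raw bits plus one running popcount per 64-bit
    // word, so rank1(i) -- the number of 1-bits in positions [0, i) --
    // costs O(1) word operations instead of a scan.
    struct ToyRankBitVector {
        std::vector<uint64_t> words;   // raw bits, 64 per word
        std::vector<uint64_t> presum;  // presum[w] = ones in words[0..w)

        explicit ToyRankBitVector(const std::vector<bool>& bits) {
            words.assign(bits.size() / 64 + 1, 0);
            for (size_t i = 0; i < bits.size(); ++i)
                if (bits[i]) words[i / 64] |= uint64_t(1) << (i % 64);
            presum.assign(words.size() + 1, 0);
            for (size_t w = 0; w < words.size(); ++w)
                presum[w + 1] = presum[w] +
                                (uint64_t)__builtin_popcountll(words[w]);
        }
        // Number of set bits in positions [0, i).
        uint64_t rank1(size_t i) const {
            uint64_t inside = (i % 64)
                ? (uint64_t)__builtin_popcountll(words[i / 64] << (64 - i % 64))
                : 0;  // shift keeps only the low (i % 64) bits of the word
            return presum[i / 64] + inside;
        }
    };

    int main() {
        std::vector<bool> bits(10, false);
        bits[0] = bits[3] = bits[5] = true;
        ToyRankBitVector rv(bits);
        std::printf("rank1(4)=%llu, rank1(10)=%llu\n",
                    (unsigned long long)rv.rank1(4),
                    (unsigned long long)rv.rank1(10));
        return 0;
    }
    ```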

  • siridb-server ๐Ÿ“ ๐ŸŒ -- SiriDB Server is a highly scalable, robust and very fast time series database. SiriDB uses a unique mechanism to operate without a global index and allows server resources to be added on the fly, with no downtime while updating or expanding the database, so it can be enlarged time after time without losing speed. SiriDB's query language includes dynamic grouping of time series for easy analysis over large numbers of series. A SiriDB cluster distributes time series across multiple pools; each pool supports active replicas for load balancing and redundancy, so the database remains accessible when one of the replicas is unavailable.

  • unqlite ๐Ÿ“ ๐ŸŒ -- UnQLite is a Transactional Embedded Database Engine, an in-process software library which implements a self-contained, serverless, zero-configuration, transactional NoSQL database engine. UnQLite is a document store database similar to MongoDB, Redis, CouchDB etc. as well a standard Key/Value store similar to BerkeleyDB, LevelDB, etc.

    Unlike most other NoSQL databases, UnQLite does not have a separate server process. UnQLite reads and writes directly to ordinary disk files. A complete database with multiple collections is contained in a single disk file. The database file format is cross-platform; you can freely copy a database between 32-bit and 64-bit systems or between big-endian and little-endian architectures.

    • BSD licensed product.
    • Built with a powerful disk storage engine which supports O(1) lookup.
    • Cross-platform file format.
    • Document store (JSON) database via Jx9.
    • Pluggable run-time interchangeable storage engine.
    • Serverless, NoSQL database engine.
    • Simple, Clean and easy to use API.
    • Single database file, does not use temporary files.
    • Standard Key/Value store.
    • Supports cursors for linear record traversal.
    • Supports on-disk as well as in-memory databases.
    • Supports terabyte-sized databases.
    • Thread-safe and fully reentrant.
    • Transactional (ACID) database.
    • UnQLite is a self-contained C library without external dependencies.
    • Zero configuration.

SQLite specific modules & related materials

  • duckdb ๐Ÿ“ ๐ŸŒ -- DuckDB is a high-performance analytical database system. It is designed to be fast, reliable, portable, and easy to use. DuckDB provides a rich SQL dialect, with support far beyond basic SQL. DuckDB supports arbitrary and nested correlated subqueries, window functions, collations, complex types (arrays, structs), and more.

  • libdist ๐Ÿ“ ๐ŸŒ -- string distance related functions (Damerau-Levenshtein, Jaro-Winkler, longest common substring & subsequence) implemented as SQLite run-time loadable extension, with UTF-8 support.
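
    As an illustration of what such an extension computes, here is a minimal C++ sketch of the optimal-string-alignment variant of the Damerau-Levenshtein distance. This is not libdist's code, and unlike libdist it operates on raw bytes rather than UTF-8 code points.

    ```cpp
    #include <algorithm>
    #include <cassert>
    #include <cstdio>
    #include <string>
    #include <vector>

    // Optimal-string-alignment distance: Levenshtein's DP recurrence plus
    // a single extra case for transposing two adjacent characters.
    int osa_distance(const std::string& a, const std::string& b) {
        const int n = (int)a.size(), m = (int)b.size();
        std::vector<std::vector<int>> d(n + 1, std::vector<int>(m + 1, 0));
        for (int i = 0; i <= n; ++i) d[i][0] = i;   // delete all of a
        for (int j = 0; j <= m; ++j) d[0][j] = j;   // insert all of b
        for (int i = 1; i <= n; ++i) {
            for (int j = 1; j <= m; ++j) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                d[i][j] = std::min({d[i - 1][j] + 1,          // deletion
                                    d[i][j - 1] + 1,          // insertion
                                    d[i - 1][j - 1] + cost}); // substitution
                if (i > 1 && j > 1 &&
                    a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                    d[i][j] = std::min(d[i][j],
                                       d[i - 2][j - 2] + 1);  // transposition
            }
        }
        return d[n][m];
    }

    int main() {
        std::printf("osa(\"kitten\", \"sitting\") = %d\n",
                    osa_distance("kitten", "sitting"));
        return 0;
    }
    ```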

  • lib_nas_lockfile ๐Ÿ“ ๐ŸŒ -- lockfile management on NAS and other disparate network filesystem storage. To be combined with SQLite to create a proper Qiqqa Sync operation.

  • libsl3 ๐Ÿ“ ๐ŸŒ -- a C++ interface for SQLite 3.x. libsl3 is designed to enable comfortable and efficient communication with a SQLite database based on its natural language, which is SQL.

  • libsql ๐Ÿ“ ๐ŸŒ -- libSQL is an open source, open contribution fork of SQLite. We aim to evolve it to suit many more use cases than SQLite was originally designed for, and plan to use third-party OSS code wherever it makes sense.

    SQLite has solidified its place in modern technology stacks, embedded in nearly any computing device you can think of. Its open source nature and public domain availability make it a popular choice for modification to meet specific use cases. But despite having its code available, SQLite famously doesn't accept external contributors, so community improvements cannot be widely enjoyed. There have been other forks in the past, but they all focus on a specific technical difference. We aim to be a community where people can contribute from many different angles and motivations. We want to see a world where everyone can benefit from all of the great ideas and hard work that the SQLite community contributes back to the codebase.

  • libsqlfs ๐Ÿ“ ๐ŸŒ -- a POSIX style file system on top of an SQLite database. It allows applications to have access to a full read/write file system in a single file, complete with its own file hierarchy and name space. This is useful for applications which need structured storage, such as embedding documents within documents, or management of configuration data or preferences.

  • sqlcipher ๐Ÿ“ ๐ŸŒ -- SQLCipher is a standalone fork of the SQLite database library that adds 256 bit AES encryption of database files and other security features.

  • sqlean ๐Ÿ“ ๐ŸŒ -- The ultimate set of SQLite extensions: SQLite has few functions compared to other database management systems. SQLite authors see this as a feature rather than a problem, because SQLite has an extension mechanism in place. There are a lot of SQLite extensions out there, but they are incomplete, inconsistent and scattered across the internet. sqlean brings them together, neatly packaged into domain modules, documented, tested, and built for Linux, Windows and macOS.

  • sqleet ๐Ÿ“ ๐ŸŒ -- an encryption extension for SQLite3. The encryption is transparent (on-the-fly) and based on modern cryptographic algorithms designed for high performance in software and robust side-channel resistance.

  • sqlite ๐Ÿ“ ๐ŸŒ -- the complete SQLite database engine.

  • sqlite3-compression-encryption-vfs ๐Ÿ“ ๐ŸŒ -- CEVFS: Compression & Encryption VFS for SQLite 3 is a SQLite 3 Virtual File System for compressing and encrypting data at the pager level. Once set up, you use SQLite as you normally would and the compression and encryption is transparently handled during database read/write operations via the SQLite pager.

  • sqlite3pp ๐Ÿ“ ๐ŸŒ -- a minimal ORM wrapper for SQLite et al.

  • sqlite-amalgamation ๐Ÿ“ ๐ŸŒ -- the SQLite amalgamation, which is the recommended method of building SQLite into larger projects.

  • SQLiteCpp ๐Ÿ“ ๐ŸŒ -- a smart and easy to use C++ SQLite3 wrapper. SQLiteC++ offers an encapsulation around the native C APIs of SQLite, with a few intuitive and well documented C++ classes.

  • sqlite-fts5-snowball ๐Ÿ“ ๐ŸŒ -- a simple extension for use with FTS5 within SQLite. It allows FTS5 to use Martin Porter's Snowball stemmers (libstemmer), which are available in several languages. Check http://snowballstem.org/ for more information about them.

  • sqlite_fts_tokenizer_chinese_simple ๐Ÿ“ ๐ŸŒ -- an extension of sqlite3 fts5 that supports Chinese and Pinyin. It fully provides a solution to the multi-phonetic word problem of full-text retrieval on WeChat mobile terminal: solution 4 in the article, very simple and efficient support for Chinese and Pinyin searches.

    On this basis we also support more accurate phrase matching through cppjieba. See the introduction article at https://www.wangfenjin.com/posts/simple-jieba-tokenizer/

  • SQLiteHistograms ๐Ÿ“ ๐ŸŒ -- an SQLite extension library for creating histogram tables, tables of ratio between histograms and interpolation tables of scatter point tables.

  • sqliteodbc ๐Ÿ“ ๐ŸŒ -- SQLite ODBC Driver for the wonderful SQLite 2.8.* and SQLite 3.* Database Engine/Library.

  • sqlite-parquet-vtable ๐Ÿ“ ๐ŸŒ -- an SQLite virtual table extension to expose Parquet files as SQL tables. You may also find csv2parquet useful. This blog post provides some context on why you might use this.

  • sqlite-stats ๐Ÿ“ ๐ŸŒ -- provides common statistical functions for SQLite.

  • sqlite_wrapper ๐Ÿ“ ๐ŸŒ -- an easy-to-use, lightweight and concurrency-friendly SQLite wrapper written in C++17.

  • sqlite_zstd_vfs ๐Ÿ“ ๐ŸŒ -- SQLite VFS extension providing streaming storage compression using Zstandard (Zstd), transparently compressing pages of the main database file as they're written out and later decompressing them as they're read in. It runs page de/compression on background threads and occasionally generates dictionaries to improve subsequent compression.

metadata & text (OCR et al) -- language detect, suggesting fixes, ...

  • chewing_text_cud ๐Ÿ“ ๐ŸŒ -- a text processing / filtering library for use in NLP/search/content analysis research pipelines.

  • cld1-language-detect ๐Ÿ“ ๐ŸŒ -- the CLD (Compact Language Detection) library, extracted from the source code for Google's Chromium library. CLD1 probabilistically detects languages in Unicode UTF-8 text.

  • cld2-language-detect ๐Ÿ“ ๐ŸŒ -- CLD2 probabilistically detects over 80 languages in Unicode UTF-8 text, either plain text or HTML/XML. For mixed-language input, CLD2 returns the top three languages found and their approximate percentages of the total text bytes. Optionally, it also returns a vector of text spans with the language of each identified. The design target is web pages of at least 200 characters (about two sentences); CLD2 is not designed to do well on very short text.

  • cld3-language-detect ๐Ÿ“ ๐ŸŒ -- CLD3 is a neural network model for language identification. The inference code extracts character ngrams from the input text and computes the fraction of times each of them appears. The model outputs BCP-47-style language codes; for some languages, output is differentiated by script. Language and script names are from Unicode CLDR.

  • compact_enc_det ๐Ÿ“ ๐ŸŒ -- Compact Encoding Detection (CED for short) is a library written in C++ that scans the given raw bytes and detects the most likely text encoding.

  • cppjieba ๐Ÿ“ ๐ŸŒ -- the C++ version of the Chinese "Jieba" project:

    • Supports loading a custom user dictionary, using '|' to separate multiple dictionary paths and ';' to separate multiple dictionaries.
    • Supports 'utf8' encoding.
    • The project comes with a relatively complete unit test suite, and the stability of the core Chinese word segmentation function (utf8) has been verified in production use.
  • cpp-unicodelib ๐Ÿ“ ๐ŸŒ -- a C++17 single-file header-only Unicode library.

  • detect-character-encoding ๐Ÿ“ ๐ŸŒ -- detect character encoding using ICU. Tip: If you don't need ICU in particular, consider using ced, which is based on Google's lighter compact_enc_det library.

  • enca ๐Ÿ“ ๐ŸŒ -- Enca (Extremely Naive Charset Analyser) consists of two main components: libenca, an encoding detection library, and enca, a command line frontend, integrating libenca and several charset conversion libraries and tools (GNU recode, UNIX98 iconv, perl Unicode::Map, cstocs).

  • fastBPE ๐Ÿ“ ๐ŸŒ -- text tokenization / ngrams

  • fastText ๐Ÿ“ ๐ŸŒ -- fastText is a library for efficient learning of word representations and sentence classification. Includes language detection features.

  • glyph_name ๐Ÿ“ ๐ŸŒ -- a library for computing Unicode sequences from glyph names according to the Adobe Glyph Naming convention: https://github.com/adobe-type-tools/agl-specification

  • libchardet ๐Ÿ“ ๐ŸŒ -- based on Mozilla's Universal Charset Detector library; detects the character set used to encode data.

  • libchopshop ๐Ÿ“ ๐ŸŒ -- NLP/text processing with automated stop word detection and stemmer-based filtering. This library / toolkit is engineered to be able to provide both of the (often more or less disparate) n-gram token streams / vectors required for (1) initializing / training FTS databases, neural nets, etc. and (2) executing effective queries / matches on these engines.

  • libcppjieba ๐Ÿ“ ๐ŸŒ -- source code extracted from the CppJieba project to form a separate project, making it easier to understand and use.

  • libiconv ๐Ÿ“ ๐ŸŒ -- provides conversion between many platform, language or country dependent character encodings to & from Unicode. This library provides an iconv() implementation, for use on systems which don't have one, or whose implementation cannot convert from/to Unicode. It provides support for the encodings: European languages (ASCII, ISO-8859-{1,2,3,4,5,7,9,10,13,14,15,16}, KOI8-R, KOI8-U, KOI8-RU, CP{1250,1251,1252,1253,1254,1257}, CP{850,866,1131}, Mac{Roman,CentralEurope,Iceland,Croatian,Romania}, Mac{Cyrillic,Ukraine,Greek,Turkish}, Macintosh), Semitic languages (ISO-8859-{6,8}, CP{1255,1256}, CP862, Mac{Hebrew,Arabic}), Japanese (EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP, ISO-2022-JP-2, ISO-2022-JP-1, ISO-2022-JP-MS), Chinese (EUC-CN, HZ, GBK, CP936, GB18030, EUC-TW, BIG5, CP950, BIG5-HKSCS, BIG5-HKSCS:2004, BIG5-HKSCS:2001, BIG5-HKSCS:1999, ISO-2022-CN, ISO-2022-CN-EXT), Korean (EUC-KR, CP949, ISO-2022-KR, JOHAB), Armenian (ARMSCII-8), Georgian (Georgian-Academy, Georgian-PS), Tajik (KOI8-T), Kazakh (PT154, RK1048), Thai (ISO-8859-11, TIS-620, CP874, MacThai), Laotian (MuleLao-1, CP1133), Vietnamese (VISCII, TCVN, CP1258), Platform specifics (HP-ROMAN8, NEXTSTEP), Full Unicode (UTF-8, UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, UTF-7, C99, JAVA, UCS-2-INTERNAL, UCS-4-INTERNAL). It also provides support for a few extra encodings: European languages (CP{437,737,775,852,853,855,857,858,860,861,863,865,869,1125}), Semitic languages (CP864), Japanese (EUC-JISX0213, Shift_JISX0213, ISO-2022-JP-3), Chinese (BIG5-2003), Turkmen (TDS565), Platform specifics (ATARIST, RISCOS-LATIN1). It has also some limited support for transliteration, i.e. when a character cannot be represented in the target character set, it can be approximated through one or several similarly looking characters.
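
    A minimal sketch of the iconv() call sequence libiconv implements (open a converter, convert with in/out pointers and remaining-byte counters, close), with error handling trimmed to the basics; real code must also grow the output buffer on E2BIG and handle EINVAL/EILSEQ for truncated or invalid input.

    ```cpp
    #include <cassert>
    #include <cstdio>
    #include <iconv.h>
    #include <string>

    // Convert an ISO-8859-1 (Latin-1) byte string to UTF-8 via iconv().
    std::string latin1_to_utf8(const std::string& in) {
        iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
        if (cd == (iconv_t)-1) return {};
        std::string out(in.size() * 2 + 4, '\0');  // Latin-1 -> UTF-8 at most doubles
        char* src = const_cast<char*>(in.data());  // iconv's historical signature
        char* dst = &out[0];
        size_t srcleft = in.size(), dstleft = out.size();
        size_t rc = iconv(cd, &src, &srcleft, &dst, &dstleft);
        iconv_close(cd);
        if (rc == (size_t)-1) return {};
        out.resize(out.size() - dstleft);          // trim the unused tail
        return out;
    }

    int main() {
        std::string utf8 = latin1_to_utf8("h\xE9llo");  // 0xE9 = e-acute in Latin-1
        std::printf("%s (%zu bytes)\n", utf8.c_str(), utf8.size());
        return 0;
    }
    ```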

  • libnatspec ๐Ÿ“ ๐ŸŒ -- The National Specificity Library is designed to smooth out national peculiarities when using software. Its primary objectives are: (1) addressing encoding issues in the most popular scenarios, and (2) providing various auxiliary tools that facilitate software localization.

  • libpinyin ๐Ÿ“ ๐ŸŒ -- the libpinyin project aims to provide the algorithms core for intelligent sentence-based Chinese pinyin input methods.

  • libpostal ๐Ÿ“ ๐ŸŒ -- a C library for parsing/normalizing street addresses around the world using statistical NLP and open data. The goal of this project is to understand location-based strings in every language, everywhere.

  • libtextcat ๐Ÿ“ ๐ŸŒ -- text language detection

  • libunibreak ๐Ÿ“ ๐ŸŒ -- an implementation of the line breaking and word breaking algorithms as described in Unicode Standard Annex 14 (http://www.unicode.org/reports/tr14/) and Unicode Standard Annex 29 (http://www.unicode.org/reports/tr29/).

  • line_detector ๐Ÿ“ ๐ŸŒ -- line segment detector (lsd) &. edge drawing line detector (edl) &. hough line detector (standard &. probabilistic) for detection.

  • marian ๐Ÿ“ ๐ŸŒ -- an efficient Neural Machine Translation framework written in pure C++ with minimal dependencies.

  • pinyin ๐Ÿ“ ๐ŸŒ -- pīnyīn is a tool for converting Chinese characters to pinyin. It can be used for Chinese phonetic notation, sorting, and retrieval.

  • sentencepiece ๐Ÿ“ ๐ŸŒ -- text tokenization

  • sentence-tokenizer ๐Ÿ“ ๐ŸŒ -- text tokenization

  • simdutf ๐Ÿ“ ๐ŸŒ -- delivers Unicode validation and transcoding at billions of characters per second, providing fast Unicode functions such as ASCII, UTF-8, UTF-16LE/BE and UTF-32 validation, with and without error identification, Latin1 to UTF-8 transcoding and vice versa, etc. The functions are accelerated using SIMD instructions (e.g., ARM NEON, SSE, AVX, AVX-512, RISC-V Vector Extension, etc.). When your strings contain hundreds of characters, we can often transcode them at speeds exceeding a billion characters per second. You should expect high speeds not only with English strings (ASCII) but also Chinese, Japanese, Arabic, and so forth. We handle the full character range (including, for example, emojis).
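
    For reference, the rules such a validator must enforce can be sketched as a plain scalar UTF-8 checker. This illustrates the logic only; it is not simdutf's code, which performs the same checks with SIMD at a fraction of the cost per byte.

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <cstdio>
    #include <string>

    // Scalar UTF-8 validation: correct continuation bytes, no overlong
    // encodings, no UTF-16 surrogates, nothing above U+10FFFF, and no
    // truncated sequences at the end of the buffer.
    bool validate_utf8(const unsigned char* s, size_t len) {
        size_t i = 0;
        while (i < len) {
            unsigned char c = s[i];
            if (c < 0x80) { i += 1; continue; }  // ASCII fast path
            size_t n;
            uint32_t cp, min_cp;
            if ((c & 0xE0) == 0xC0)      { n = 2; cp = c & 0x1F; min_cp = 0x80; }
            else if ((c & 0xF0) == 0xE0) { n = 3; cp = c & 0x0F; min_cp = 0x800; }
            else if ((c & 0xF8) == 0xF0) { n = 4; cp = c & 0x07; min_cp = 0x10000; }
            else return false;                   // stray continuation / bad lead byte
            if (i + n > len) return false;       // truncated sequence
            for (size_t k = 1; k < n; ++k) {
                if ((s[i + k] & 0xC0) != 0x80) return false;
                cp = (cp << 6) | (s[i + k] & 0x3F);
            }
            if (cp < min_cp) return false;                   // overlong encoding
            if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
            if (cp > 0x10FFFF) return false;                 // beyond Unicode range
            i += n;
        }
        return true;
    }

    bool validate_utf8(const std::string& s) {
        return validate_utf8(reinterpret_cast<const unsigned char*>(s.data()),
                             s.size());
    }

    int main() {
        std::printf("ascii ok: %d, overlong rejected: %d\n",
                    (int)validate_utf8(std::string("hello")),
                    (int)!validate_utf8(std::string("\xC0\xAF")));
        return 0;
    }
    ```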

  • uchardet ๐Ÿ“ ๐ŸŒ -- uchardet is an encoding and language detector library, which attempts to determine the encoding of the text. It can reliably detect many charsets. Moreover it also works as a very good and fast language detector.

  • ucto ๐Ÿ“ ๐ŸŒ -- text tokenization

    • libfolia ๐Ÿ“ ๐ŸŒ -- working with the Format for Linguistic Annotation (FoLiA). Provides a high-level API to read, manipulate, and create FoLiA documents.
    • uctodata ๐Ÿ“ ๐ŸŒ -- data for ucto library
  • uni-algo ๐Ÿ“ ๐ŸŒ -- handles Unicode conversion and processing problems (ill-formed sequences are not the only issue) properly and always according to The Unicode Standard. In C/C++ there is no safe type for UTF-8/UTF-16 that guarantees the data will be well-formed, which makes the problem worse. Plenty of Unicode libraries for C/C++ implement Unicode algorithms of varying quality, but many don't handle ill-formed UTF sequences at all: in the best case you get an exception/error, in the worst case undefined behavior. The biggest problem is that in 99% of cases everything will be fine, which is unacceptable for security reasons. This library doesn't work with types/strings/files/streams as such; it works with the data inside them and makes that data safe when needed. See https://hsivonen.fi/broken-utf-8 for more about ill-formed sequences; it is slightly outdated because ICU (International Components for Unicode) now uses a W3C-conformant implementation too, but the information remains useful. This library uses a W3C-conformant implementation as well.

  • unicode-cldr ๐Ÿ“ ๐ŸŒ -- Unicode CLDR Project: provides key building blocks for software to support the world's languages, with the largest and most extensive standard repository of locale data available. This data is used by a wide spectrum of companies for their software internationalization and localization, adapting software to the conventions of different languages for such common software tasks.

  • unicode-cldr-data ๐Ÿ“ ๐ŸŒ -- the JSON distribution of CLDR locale data for internationalization. While XML (not JSON) is the "official" format for all CLDR data, this data is programmatically generated from the corresponding XML, using the CLDR tooling. This JSON data is generated using only data that has achieved draft="contributed" or draft="approved" status in the CLDR. This is the same threshold as is used by the ICU (International Components for Unicode).

  • unicode-icu ๐Ÿ“ ๐ŸŒ -- the International Components for Unicode.

  • unicode-icu-data ๐Ÿ“ ๐ŸŒ -- International Components for Unicode: Data Repository. This is an auxiliary repository for the International Components for Unicode.

  • unicode-icu-demos ๐Ÿ“ ๐ŸŒ -- ICU Demos contains sample applications built using the International Components for Unicode (ICU) C++ library ICU4C.

  • unilib ๐Ÿ“ ๐ŸŒ -- an embeddable C++17 Unicode library.

  • utfcpp ๐Ÿ“ ๐ŸŒ -- UTF-8 with C++ in a Portable Way

  • win-iconv ๐Ÿ“ ๐ŸŒ -- an iconv implementation using Win32 API to convert.

  • worde_butcher ๐Ÿ“ ๐ŸŒ -- a tool for text segmentation, keyword extraction and speech tagging. Butchers any text into prime word / phrase cuts, deboning all incoming based on our definitive set of stopwords for all languages.

  • xmunch ๐Ÿ“ ๐ŸŒ -- xmunch essentially does what the 'munch' command of, for example, hunspell does, but it is not compatible with hunspell affix definition files. So why use it then? What makes xmunch different from the other tools is the ability to extend incomplete word-lists. For hunspell's munch to identify a stem and add an affix mark, every word formed by the affix with the stem has to be in the original word-list. This makes sense for a compression tool. However, if your word-list is incomplete, you have to add all possible word-forms of a word first, before any compression is done. Using xmunch instead, you can define a subset of forms which are required to be in the word-list to allow the word to be used as a stem. Like this, you can extend the word-list.
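
    The subset idea can be sketched as follows (illustrative C++ only; xmunch's actual affix-definition format and implementation differ): a candidate stem is accepted when a required subset of its affixed forms appears in the word-list, and an accepted stem then licenses all of its affixed forms.

    ```cpp
    #include <cassert>
    #include <cstdio>
    #include <set>
    #include <string>
    #include <vector>

    // Extend an incomplete word-list: any word whose required suffixed
    // forms are all present is treated as a stem, and every suffixed form
    // of that stem is added to the output.
    std::set<std::string> extend_wordlist(
            const std::set<std::string>& words,
            const std::vector<std::string>& all_suffixes,
            const std::vector<std::string>& required_suffixes) {
        std::set<std::string> out = words;
        for (const std::string& w : words) {
            bool is_stem = true;
            for (const std::string& suf : required_suffixes)
                if (!words.count(w + suf)) { is_stem = false; break; }
            if (!is_stem) continue;
            for (const std::string& suf : all_suffixes) out.insert(w + suf);
        }
        return out;
    }

    int main() {
        // "walk" qualifies as a stem ("walks" is present); "jump" does not.
        std::set<std::string> words = {"walk", "walks", "walked", "jump"};
        auto extended = extend_wordlist(words, {"s", "ed", "ing"}, {"s"});
        for (const auto& w : extended) std::printf("%s\n", w.c_str());
        return 0;
    }
    ```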

  • you-token-to-me ๐Ÿ“ ๐ŸŒ -- text tokenization

  • ztd.text ๐Ÿ“ ๐ŸŒ -- an implementation of an up-and-coming proposal percolating through SG16, P1629 (Standard Text Encoding). It will also include implementations of some downstream ideas covered in previous work in this area, including Zach Laine's Boost.Text (proposed), rmf's libogonek, and Tom Honermann's text_view.

PDF (XML) metadata editing

for round-trip annotation and other "external application editing" of known documents; metadata embedding / export

  • PDFGen ๐Ÿ“ ๐ŸŒ -- a simple PDF Creation/Generation library, contained in a single C-file with header and no external library dependencies. Useful for embedding into other programs that require rudimentary PDF output.
  • pdfgrep ๐Ÿ“ ๐ŸŒ -- a tool to search text in PDF files. It works similarly to grep.
  • pdfium ๐Ÿ“ ๐ŸŒ -- the PDF library used by the Chromium project.
  • podofo ๐Ÿ“ ๐ŸŒ -- a library to work with the PDF file format and includes also a few tools. The name comes from the first two letters of PDF (Portable Document Format). The PoDoFo library is a free portable C++ library which includes classes to parse a PDF file and modify its contents into memory. The changes can be written back to disk easily. PoDoFo is designed to avoid loading large PDF objects into memory until they are required and can write large streams immediately to disk, so it is possible to manipulate quite large files with it.
  • poppler ๐Ÿ“ ๐ŸŒ -- Poppler is a library for rendering PDF files, and examining or modifying their structure. Poppler originally came from the XPDF sources.
  • qpdf ๐Ÿ“ ๐ŸŒ -- QPDF is a command-line tool and C++ library that performs content-preserving transformations on PDF files. It supports linearization, encryption, and numerous other features. It can also be used for splitting and merging files, creating PDF files, and inspecting files for study or analysis. QPDF does not render PDFs or perform text extraction, and it does not contain higher-level interfaces for working with page contents. It is a low-level tool for working with the structure of PDF files and can be a valuable tool for anyone who wants to do programmatic or command-line-based manipulation of PDF files.
  • sioyek ๐Ÿ“ ๐ŸŒ -- a PDF viewer with a focus on textbooks and research papers.
  • sumatrapdf ๐Ÿ“ ๐ŸŒ -- SumatraPDF is a multi-format (PDF, EPUB, MOBI, CBZ, CBR, FB2, CHM, XPS, DjVu) reader for Windows.
  • XMP-Toolkit-SDK ๐Ÿ“ ๐ŸŒ -- the XMP Toolkit allows you to integrate XMP functionality into your product, supplying an API for locating, adding, or updating the XMP metadata in a file.
  • xpdf ๐Ÿ“ ๐ŸŒ -- Xpdf is an open source viewer for Portable Document Format (PDF) files.

web scraping (document extraction, cleaning, metadata extraction, BibTeX, ...)

(see also investigation notes in Qiqqa docs)

  • boost-url ๐Ÿ“ ๐ŸŒ -- a library for manipulating (RFC3986) Uniform Resource Identifiers (URIs) and Locators (URLs).
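
    The component split defined by RFC 3986 can be sketched with the well-known regular expression from its Appendix B. This illustrates the grammar only; libraries like Boost.URL use hand-written parsers with much stricter validation, not std::regex.

    ```cpp
    #include <cassert>
    #include <cstdio>
    #include <optional>
    #include <regex>
    #include <string>

    struct UriParts {
        std::string scheme, authority, path, query, fragment;
    };

    // Split a URI reference into its five RFC 3986 components using the
    // Appendix B regex; groups 2/4/5/7/9 hold scheme, authority, path,
    // query and fragment respectively.
    std::optional<UriParts> split_uri(const std::string& uri) {
        static const std::regex re(
            R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)");
        std::smatch m;
        if (!std::regex_match(uri, m, re)) return std::nullopt;
        return UriParts{m[2].str(), m[4].str(), m[5].str(),
                        m[7].str(), m[9].str()};
    }

    int main() {
        auto p = split_uri("http://example.com:80/a/b?q=1#top");
        if (p)
            std::printf("scheme=%s authority=%s path=%s query=%s fragment=%s\n",
                        p->scheme.c_str(), p->authority.c_str(),
                        p->path.c_str(), p->query.c_str(),
                        p->fragment.c_str());
        return 0;
    }
    ```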

  • cURL ๐Ÿ“ ๐ŸŒ -- the ubiquitous libcurl.

  • curl-impersonate ๐Ÿ“ ๐ŸŒ -- a special build of curl that can impersonate the four major browsers: Chrome, Edge, Safari & Firefox. curl-impersonate is able to perform TLS and HTTP handshakes that are identical to that of a real browser.

  • curlpp ๐Ÿ“ ๐ŸŒ -- cURLpp is a C++ wrapper for libcURL.

  • curl-www ๐Ÿ“ ๐ŸŒ -- the curl.se web site contents.

  • easyexif ๐Ÿ“ ๐ŸŒ -- EasyEXIF is a tiny, lightweight C++ library that parses basic (EXIF) information out of JPEG files. It uses only the std::string library and is otherwise pure C++. You pass it the binary contents of a JPEG file, and it parses several of the most important EXIF fields for you.

  • everything-curl ๐Ÿ“ ๐ŸŒ -- Everything curl is an extensive guide for all things curl. The project, the command-line tool, the library, how everything started and how it came to be the useful tool it is today. It explains how we work on developing it further, what it takes to use it, how you can contribute with code or bug reports and why millions of existing users use it.

  • exif ๐Ÿ“ ๐ŸŒ -- a small command-line utility to show EXIF information hidden in JPEG files, demonstrating the power of libexif.

  • exiv2 ๐Ÿ“ ๐ŸŒ -- a C++ library and a command-line utility to read, write, delete and modify Exif, IPTC, XMP and ICC image metadata.

  • extract ๐Ÿ“ ๐ŸŒ -- clone of git://git.ghostscript.com/extract.git

  • faup ๐Ÿ“ ๐ŸŒ -- Faup stands for Finally An Url Parser and is a library and command line tool to parse URLs and normalize fields with two constraints: (1) work with real-life URLs (resilient to badly formatted ones), and (2) be fast: no allocation for string parsing and characters are read only once.

  • GQ-gumbo-css-selectors ๐Ÿ“ ๐ŸŒ -- GQ is a CSS Selector Engine for Gumbo Parser written in C++11. Using Gumbo Parser as a backend, GQ can parse input HTML and allow users to select and modify elements in the parsed document with CSS Selectors and the provided simple, but powerful mutation API.

  • gumbo-libxml ๐Ÿ“ ๐ŸŒ -- LibXML2 bindings for the Gumbo HTML5 parser: this provides a libxml2 API on top of the Gumbo parser. It lets you use a modern parser - Gumbo now passes all html5lib tests, including the template tag, and should be fully conformant with the HTML5 spec - with the full ecosystem of libxml tools, including XPath, tree modification, DTD validation, etc.

  • gumbo-parser ๐Ÿ“ ๐ŸŒ -- HTML parser

  • gumbo_pp ๐Ÿ“ ๐ŸŒ -- a C++ wrapper over Gumbo that provides a higher level query mechanism.

  • gumbo-query ๐Ÿ“ ๐ŸŒ -- HTML DOM access in C/C++

  • hescape ๐Ÿ“ ๐ŸŒ -- a C library for fast HTML escaping using the SSE instruction pcmpestri. Hescape provides only one API, hesc_escape_html().

  • houdini ๐Ÿ“ ๐ŸŒ -- Houdini - The Escapist: a zero-dependency, modular API for escaping and unescaping text for the web. HTML escaping follows the OWASP suggestion; all other entities are left as-is. HTML unescaping is fully RFC-compliant: yes, that's all 253 different entities for you, plus decimal/hex code point specifiers. URI escaping and unescaping is fully RFC-compliant. URL escaping and unescaping is the same as for generic URIs, except that spaces are changed to +.
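As an aside, the OWASP escaping rule mentioned above is small enough to sketch directly. The following is an illustrative, self-contained C++ snippet of that rule only, not Houdini's actual API:

```cpp
#include <string>

// Escape the five characters OWASP recommends replacing in HTML text:
// & < > " '  ->  &amp; &lt; &gt; &quot; &#39;
std::string escape_html(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&#39;";  break;
            default:   out += c;        break;
        }
    }
    return out;
}
```

Libraries like houdini and hescape do the same work, but vectorized and with full entity tables for the unescape direction.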

  • htmlstreamparser ๐Ÿ“ ๐ŸŒ -- used in a demo of zsync2

  • http-parser ๐Ÿ“ ๐ŸŒ -- a parser for HTTP messages written in C. It parses both requests and responses. The parser is designed to be used in high-performance HTTP applications. It does not make any syscalls nor allocations, it does not buffer data, and it can be interrupted at any time. Depending on your architecture, it only requires about 40 bytes of data per message stream (in a web server that is per connection).

  • lexbor ๐Ÿ“ ๐ŸŒ -- fast HTML5 fully-conformant HTML + CSS parser.

  • libcpr ๐Ÿ“ ๐ŸŒ -- wrapper library for cURL. C++ Requests is a simple wrapper around libcurl inspired by the excellent Python Requests project. Despite its name, libcurl's easy interface is anything but, and misusing it is a common source of error and frustration. Using the more expressive language facilities of C++11, this library captures the essence of making network calls in a few concise idioms.

  • libexif ๐Ÿ“ ๐ŸŒ -- a library for parsing, editing, and saving EXIF data. In addition, it has gettext support. All EXIF tags described in EXIF standard 2.1 (and most from 2.2) are supported. Many maker notes from Canon, Casio, Epson, Fuji, Nikon, Olympus, Pentax and Sanyo cameras are also supported.

  • libexpat ๐Ÿ“ ๐ŸŒ -- XML read/write

  • libhog ๐Ÿ“ ๐ŸŒ -- hog a.k.a. hound - fetch the (PDF, EPUB, HTML) document you seek using maximum effort: hog is a tool for fetching files from the internet, specifically PDFs. Intended to be used when you browse the 'Net and decide you want to download a given PDF from any site: this can be done through the browser itself, but is sometimes convoluted or nigh impossible (FTP links require another tool, PDFs stored on servers whose SSL certificates have expired are a hassle for the user-in-a-hurry, etc.) and hog is meant to cope with all of these.

  • libidn2 ๐Ÿ“ ๐ŸŒ -- international domain name parsing

  • libpsl ๐Ÿ“ ๐ŸŒ -- handles the Public Suffix List (a collection of Top Level Domain (TLD) suffixes, e.g. .com and .net, Country Code Top Level Domains (ccTLDs) like .de and .cn, and brand Top Level Domains like .apple and .google). Can be used to:

    • avoid privacy-leaking "super domain" certificates (see post from Jeffry Walton)
    • avoid privacy-leaking "supercookies"
    • highlight the relevant parts of a domain in a user interface
    • sorting domain lists by site
  • libxml2 ๐Ÿ“ ๐ŸŒ -- libxml: XML read/write

  • LLhttp-parser ๐Ÿ“ ๐ŸŒ -- a port of http_parser to TypeScript, intended as its replacement. llparse is used to generate the output C source file, which can be compiled and linked with the embedder's program (such as Node.js).

  • picohttpparser ๐Ÿ“ ๐ŸŒ -- PicoHTTPParser is a tiny, primitive, fast HTTP request/response parser. Unlike most parsers, it is stateless and does not allocate memory by itself. All it does is accept a pointer to the buffer and an output structure, and set up the pointers in the latter to point at the relevant portions of the buffer.

  • qs_parse ๐Ÿ“ ๐ŸŒ -- a set of simple and easy functions for parsing URL query strings, such as those generated in an HTTP GET form submission.
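The core of query-string parsing is a split on '&' and '='. The following self-contained C++ sketch shows the idea; it is not the qs_parse API, and unlike a real parser it does not percent-decode the values:

```cpp
#include <map>
#include <string>

// Split a raw query string ("key=val&key2=val2") into a map.
// Keys without '=' map to an empty value. Illustrative only.
std::map<std::string, std::string> parse_query(const std::string& qs) {
    std::map<std::string, std::string> out;
    size_t pos = 0;
    while (pos < qs.size()) {
        size_t amp = qs.find('&', pos);
        if (amp == std::string::npos) amp = qs.size();
        size_t eq = qs.find('=', pos);
        if (eq != std::string::npos && eq < amp)
            out[qs.substr(pos, eq - pos)] = qs.substr(eq + 1, amp - eq - 1);
        else
            out[qs.substr(pos, amp - pos)] = "";
        pos = amp + 1;
    }
    return out;
}
```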

  • robotstxt ๐Ÿ“ ๐ŸŒ -- Google robots.txt Parser and Matcher Library. The Robots Exclusion Protocol (REP) is a standard that enables website owners to control which URLs may be accessed by automated clients (i.e. crawlers) through a simple text file with a specific syntax. It's one of the basic building blocks of the internet as we know it and what allows search engines to operate. Because the REP was only a de-facto standard for the past 25 years, different implementers implement parsing of robots.txt slightly differently, leading to confusion. This project aims to fix that by releasing the parser that Google uses.

  • sist2 ๐Ÿ“ ๐ŸŒ -- sist2 (Simple incremental search tool) is a fast, low memory usage, multi-threaded application, which scans drives and directory trees, extracts text and metadata from common file types, generates thumbnails and comes with OCR support (with tesseract) and Named-Entity Recognition (using pre-trained client-side tensorflow models).

  • tidy-html5 ๐Ÿ“ ๐ŸŒ -- clean up HTML documents before archiving/processing

  • URI-Encode-C ๐Ÿ“ ๐ŸŒ -- an optimized C library for percent encoding/decoding text, i.e. a URI encoder/decoder written in C based on RFC3986.
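The RFC 3986 rule such encoders implement - leave unreserved characters (ALPHA / DIGIT / "-" / "." / "_" / "~") alone, percent-encode everything else - fits in a few lines. A hedged C++ sketch of the rule, not the URI-Encode-C API:

```cpp
#include <cctype>
#include <string>

// Percent-encode every byte except the RFC 3986 "unreserved" set.
std::string percent_encode(const std::string& in) {
    static const char* hex = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : in) {
        if (std::isalnum(c) || c == '-' || c == '.' || c == '_' || c == '~') {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}
```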

  • url ๐Ÿ“ ๐ŸŒ -- URI parsing and other utility functions

  • URL-Detector ๐Ÿ“ ๐ŸŒ -- Url Detector is a library created by the LinkedIn Security Team to detect and extract URLs in a long piece of text. Keep in mind that for security purposes, it's better to over-detect URLs: instead of complying with RFC 3986 (http://www.ietf.org/rfc/rfc3986.txt), it tries to detect based on browser behavior, optimizing detection for URLs that are visitable through the address bar of Chrome, Firefox, Internet Explorer, and Safari. It is also able to identify the parts of the identified URLs.

  • url-parser ๐Ÿ“ ๐ŸŒ -- parse URLs much like Node's url module.

  • wget2 ๐Ÿ“ ๐ŸŒ -- GNU Wget2 is the successor of GNU Wget, a file and recursive website downloader. Designed and written from scratch, it wraps around libwget, which provides the basic functions needed by a web client. Wget2 works multi-threaded and uses many features to allow fast operation. In many cases Wget2 downloads much faster than Wget 1.x thanks to HTTP/2, HTTP compression, parallel connections and use of the If-Modified-Since HTTP header.

  • xml-pugixml ๐Ÿ“ ๐ŸŒ -- light-weight, simple and fast XML parser for C++ with XPath support.

audio files & processing

Not just speech processing & speech recognition, but sometimes data is easier "visualized" as audio (sound).

  • AudioFile ๐Ÿ“ ๐ŸŒ -- a simple header-only C++ library for reading and writing audio files. (WAV, AIFF)
  • dr_libs ๐Ÿ“ ๐ŸŒ -- single file audio decoding libraries for C and C++ (FLAC, MP3, WAV)
  • flac ๐Ÿ“ ๐ŸŒ -- software that reduces the amount of storage space needed to store digital audio signals without removing any information in doing so. The files read and produced by this software are called FLAC files. As these files (which follow the FLAC format) can be read from and written to by other software as well, this software is often referred to as the FLAC reference implementation.
  • libsndfile ๐Ÿ“ ๐ŸŒ -- a C library for reading and writing files containing sampled audio data, e.g. Ogg, Vorbis and FLAC.
  • minimp3 ๐Ÿ“ ๐ŸŒ -- a minimalistic, single-header library for decoding MP3. minimp3 is designed to be small, fast (with SSE and NEON support), and accurate (ISO conformant).
  • opus ๐Ÿ“ ๐ŸŒ -- an audio codec for interactive speech and audio transmission over the Internet. Opus can handle a wide range of interactive audio applications, including Voice over IP, videoconferencing, in-game chat, and even remote live music performances. It can scale from low bit-rate narrowband speech to very high quality stereo music.
  • qoa ๐Ÿ“ ๐ŸŒ -- QOA - the "Quite OK Audio Format" for fast, lossy audio compression - is a single-file library for C/C++. More info at: https://qoaformat.org
  • r8brain-free-src ๐Ÿ“ ๐ŸŒ -- high-quality professional audio sample rate converter (SRC) / resampler C++ library. Features routines for SRC, both up- and downsampling, to/from any sample rate, including non-integer sample rates: it can be also used for conversion to/from SACD/DSD sample rates, and even go beyond that. Also suitable for fast general-purpose 1D time-series resampling / interpolation (with relaxed filter parameters).
  • sac ๐Ÿ“ ๐ŸŒ -- a state-of-the-art lossless audio compression model. Lossless audio compression is a complex problem, because PCM data is highly non-stationary and uses high sample resolution (typically >=16bit). That's why classic context modelling suffers from context dilution problems. Sac employs a simple OLS-NLMS predictor per frame including bias correction. Prediction residuals are encoded using a sophisticated bitplane coder including SSE and various forms of probability estimation. Meta-parameters of the predictor are optimized via binary search (or DDS) on a per-frame basis. This results in a highly asymmetric codec design. We throw a lot of muscle at the problem and achieve only small gains - by practically predicting noise.
  • silk-codec ๐Ÿ“ ๐ŸŒ -- a library to convert PCM to Tencent Silk files and vice versa.
  • silk-v3-decoder ๐Ÿ“ ๐ŸŒ -- decodes Silk v3 audio files (like WeChat amr, aud files, qq slk files) and converts to other formats (like mp3).
  • Solo ๐Ÿ“ ๐ŸŒ -- Agora SOLO is a speech codec, developed based on Silk with BWE (Bandwidth Extension) and MDC (Multi Description Coding). With these technologies, SOLO is able to resist weak networks at low bitrates. The main reason for SOLO to use bandwidth extension is to reduce computational complexity.
  • speex ๐Ÿ“ ๐ŸŒ -- a patent-free voice codec. Unlike other codecs like MP3 and Ogg Vorbis, Speex is designed to compress voice at bitrates in the 2-45 kbps range. Possible applications include VoIP, internet audio streaming, archiving of speech data (e.g. voice mail), and audio books.

file format support

  • AudioFile ๐Ÿ“ ๐ŸŒ -- a simple header-only C++ library for reading and writing audio files. (WAV, AIFF)

  • basez ๐Ÿ“ ๐ŸŒ -- encode data into/decode data from base16, base32, base32hex, base64 or base64url stream per RFC 4648; MIME base64 Content-Transfer-Encoding per RFC 2045; or PEM Printable Encoding per RFC 1421.
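For the base64 case, the RFC 4648 encoding loop looks roughly like the sketch below: pack three input bytes into 24 bits, emit four 6-bit alphabet indices, and pad with '='. This is an illustrative C++ snippet of the scheme, unrelated to basez's actual implementation:

```cpp
#include <string>

// RFC 4648 base64 encoding with '=' padding.
std::string base64_encode(const std::string& in) {
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    size_t i = 0;
    while (i + 2 < in.size()) {               // full 3-byte groups
        unsigned v = (unsigned char)in[i] << 16 |
                     (unsigned char)in[i + 1] << 8 |
                     (unsigned char)in[i + 2];
        out += tbl[v >> 18]; out += tbl[(v >> 12) & 63];
        out += tbl[(v >> 6) & 63]; out += tbl[v & 63];
        i += 3;
    }
    size_t rest = in.size() - i;              // 0, 1 or 2 trailing bytes
    if (rest == 1) {
        unsigned v = (unsigned char)in[i] << 16;
        out += tbl[v >> 18]; out += tbl[(v >> 12) & 63]; out += "==";
    } else if (rest == 2) {
        unsigned v = (unsigned char)in[i] << 16 |
                     (unsigned char)in[i + 1] << 8;
        out += tbl[v >> 18]; out += tbl[(v >> 12) & 63];
        out += tbl[(v >> 6) & 63]; out += '=';
    }
    return out;
}
```

base32/base16 follow the same pattern with 5-bit and 4-bit groups respectively.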

  • CHM-lib ๐Ÿ“ ๐ŸŒ -- as I have several HTML pages stored in this format. See also MHTML: mht-rip

  • cpp-base64 ๐Ÿ“ ๐ŸŒ -- base64 encoding and decoding with C++

  • csv-parser ๐Ÿ“ ๐ŸŒ -- Vince's CSV Parser: there are plenty of other CSV parsers in the wild, but I had a hard time finding what I wanted. Inspired by Python's csv module, I wanted a library with simple, intuitive syntax. Furthermore, I wanted support for special use cases such as calculating statistics on very large files. Thus, this library was created with those goals in mind.

  • cvmatio ๐Ÿ“ ๐ŸŒ -- an open source Matlab v7 MAT file parser written in C++, giving users the ability to interact with binary MAT files in their own projects.

  • datamash ๐Ÿ“ ๐ŸŒ -- GNU Datamash is a command-line program which performs basic numeric, textual and statistical operations on input textual data files. It is designed to be portable and reliable, and to aid researchers in easily automating analysis pipelines without writing code or even short scripts.

  • djvulibre ๐Ÿ“ ๐ŸŒ -- DjVu (pronounced "déjà vu") is a set of compression technologies, a file format, and a software platform for the delivery over the Web of digital documents, scanned documents, and high resolution images.

  • extract ๐Ÿ“ ๐ŸŒ -- clone of git://git.ghostscript.com/extract.git

  • fast-cpp-csv-parser ๐Ÿ“ ๐ŸŒ -- a small, easy-to-use and fast header-only library for reading comma separated value (CSV) files.
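The tricky part any CSV parser must get right is quoted fields and doubled-quote escapes. Below is a toy C++ sketch of that core logic only; real parsers such as the two above also handle embedded newlines, custom separators, type conversion, and large-file streaming:

```cpp
#include <string>
#include <vector>

// Split one CSV record into fields, honouring double-quoted fields
// and "" escapes inside them. Single-line records only.
std::vector<std::string> split_csv_line(const std::string& line) {
    std::vector<std::string> fields;
    std::string cur;
    bool quoted = false;
    for (size_t i = 0; i < line.size(); ++i) {
        char c = line[i];
        if (quoted) {
            if (c == '"' && i + 1 < line.size() && line[i + 1] == '"') {
                cur += '"'; ++i;              // "" -> literal quote
            } else if (c == '"') {
                quoted = false;               // closing quote
            } else {
                cur += c;
            }
        } else if (c == '"') {
            quoted = true;                    // opening quote
        } else if (c == ',') {
            fields.push_back(cur); cur.clear();
        } else {
            cur += c;
        }
    }
    fields.push_back(cur);                    // last field
    return fields;
}
```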

  • fastgron ๐Ÿ“ ๐ŸŒ -- fastgron makes JSON greppable super fast! fastgron transforms JSON into discrete assignments to make it easier to grep for what you want and see the absolute 'path' to it. It eases the exploration of APIs that return large blobs of JSON but lack documentation.

  • FFmpeg ๐Ÿ“ ๐ŸŒ -- a collection of libraries and tools to process multimedia content such as audio, video, subtitles and related metadata.

  • file ๐Ÿ“ ๐ŸŒ -- file filetype recognizer tool & mimemagic

  • flac ๐Ÿ“ ๐ŸŒ -- software that reduces the amount of storage space needed to store digital audio signals without removing any information in doing so. The files read and produced by this software are called FLAC files. As these files (which follow the FLAC format) can be read from and written to by other software as well, this software is often referred to as the FLAC reference implementation.

  • gmt ๐Ÿ“ ๐ŸŒ -- GMT (Generic Mapping Tools) is an open source collection of about 100 command-line tools for manipulating geographic and Cartesian data sets (including filtering, trend fitting, gridding, projecting, etc.) and producing high-quality illustrations ranging from simple x-y plots via contour maps to artificially illuminated surfaces, 3D perspective views and animations. The GMT supplements add another 50 more specialized and discipline-specific tools. GMT supports over 30 map projections and transformations and requires support data such as GSHHG coastlines, rivers, and political boundaries and optionally DCW country polygons.

  • gumbo-libxml ๐Ÿ“ ๐ŸŒ -- LibXML2 bindings for the Gumbo HTML5 parser: this provides a libxml2 API on top of the Gumbo parser. It lets you use a modern parser - Gumbo now passes all html5lib tests, including the template tag, and should be fully conformant with the HTML5 spec - with the full ecosystem of libxml tools, including XPath, tree modification, DTD validation, etc.

  • gumbo-parser ๐Ÿ“ ๐ŸŒ -- HTML parser

  • gumbo_pp ๐Ÿ“ ๐ŸŒ -- a C++ wrapper over Gumbo that provides a higher level query mechanism.

  • gumbo-query ๐Ÿ“ ๐ŸŒ -- HTML DOM access in C/C++

  • http-parser ๐Ÿ“ ๐ŸŒ -- a parser for HTTP messages written in C. It parses both requests and responses. The parser is designed to be used in high-performance HTTP applications. It does not make any syscalls nor allocations, it does not buffer data, and it can be interrupted at any time. Depending on your architecture, it only requires about 40 bytes of data per message stream (in a web server that is per connection).

  • id3-tagparser ๐Ÿ“ ๐ŸŒ -- a C++ library for reading and writing MP4 (iTunes), ID3, Vorbis, Opus, FLAC and Matroska tags.

  • jq ๐Ÿ“ ๐ŸŒ -- a lightweight and flexible command-line JSON processor.

  • jtc ๐Ÿ“ ๐ŸŒ -- jtc stands for: JSON transformational chains (it used to be JSON test console) and is a cli tool to extract, manipulate and transform source JSON, offering powerful ways to select one or multiple elements from a source JSON and apply various actions on the selected elements at once (wrap selected elements into a new JSON, filter in/out, sort elements, update elements, insert new elements, remove, copy, move, compare, transform, swap around and many other operations).

  • lexbor ๐Ÿ“ ๐ŸŒ -- fast HTML5 fully-conformant HTML + CSS parser.

  • libaom ๐Ÿ“ ๐ŸŒ -- AV1 Codec Library

  • libarchive ๐Ÿ“ ๐ŸŒ -- a portable, efficient C library that can read and write streaming archives in a variety of formats. It also includes implementations of the common tar, cpio, and zcat command-line tools that use the libarchive library.

  • libase ๐Ÿ“ ๐ŸŒ -- a tiny library for interpreting the Adobe Swatch Exchange (.ase) file format for color palettes, in use since Adobe Creative Suite 3.

  • libass ๐Ÿ“ ๐ŸŒ -- libass is a portable subtitle renderer for the ASS/SSA (Advanced Substation Alpha/Substation Alpha) subtitle format.

  • libavif ๐Ÿ“ ๐ŸŒ -- a friendly, portable C implementation of the AV1 Image File Format, as described here: https://aomediacodec.github.io/av1-avif/

  • libcmime ๐Ÿ“ ๐ŸŒ -- MIME extract/insert/encode/decode: use for MHTML support

  • libcsv2 ๐Ÿ“ ๐ŸŒ -- CSV file format reader/writer library.

  • libde265 ๐Ÿ“ ๐ŸŒ -- libde265 is an open source implementation of the h.265 video codec. It is written from scratch and has a plain C API to enable a simple integration into other software. libde265 supports WPP and tile-based multithreading and includes SSE optimizations. The decoder includes all features of the Main profile and correctly decodes almost all conformance streams (see [wiki page]).

  • libexpat ๐Ÿ“ ๐ŸŒ -- XML read/write

  • libheif ๐Ÿ“ ๐ŸŒ -- High Efficiency Image File Format (HEIF) :: a visual media container format standardized by the Moving Picture Experts Group (MPEG) for storage and sharing of images and image sequences. It is based on the well-known ISO Base Media File Format (ISOBMFF) standard. The HEIF Reader/Writer Engine is an implementation of the HEIF standard that demonstrates its powerful features and capabilities.

  • libheif-alt ๐Ÿ“ ๐ŸŒ -- an ISO/IEC 23008-12:2017 HEIF and AVIF (AV1 Image File Format) file format decoder and encoder. HEIF and AVIF are new image file formats employing HEVC (h.265) or AV1 image coding, respectively, for the best compression ratios currently possible.

  • libics ๐Ÿ“ ๐ŸŒ -- the reference library for ICS (Image Cytometry Standard), an open standard for writing images of any dimensionality and data type to file, together with associated information regarding the recording equipment or recorded subject.

    ICS stands for Image Cytometry Standard, and was first proposed in: P. Dean, L. Mascio, D. Ow, D. Sudar, J. Mullikin, "Proposed standard for image cytometry data files", Cytometry, n.11, pp.561-569, 1990.

    It writes two files: one is the header, with an '.ics' extension, and the other is the actual image data, with an '.ids' extension.

    ICS version 2.0 extends this standard to allow for a more versatile placement of the image data. It can now be placed either in the same '.ics' file or embedded in any other file, by specifying the file name and the byte offset for the data.

    The advantage of ICS over other open standards such as TIFF is that it allows data of any type and dimensionality to be stored. A TIFF file can contain a collection of 2D images; it's up to the user to determine how these relate to each other. An ICS file can contain, for example, a 5D image in which the 4th dimension is the light frequency and the 5th is time. Also, all of the information regarding the microscope settings (or whatever instrument was used to acquire the image) and the sample preparation can be included in the file.

  • libmetalink ๐Ÿ“ ๐ŸŒ -- a library to read Metalink XML download description format. It supports both Metalink version 3 and Metalink version 4 (RFC 5854).

  • libmobi ๐Ÿ“ ๐ŸŒ -- a library for handling Mobipocket/Kindle (MOBI) ebook format documents.

  • libpsd ๐Ÿ“ ๐ŸŒ -- a library for Adobe Photoshop .psd file's decoding and rendering.

  • libsndfile ๐Ÿ“ ๐ŸŒ -- a C library for reading and writing files containing sampled audio data, e.g. Ogg, Vorbis and FLAC.

  • libwarc ๐Ÿ“ ๐ŸŒ -- C++ library to parse WARC files. WARC is the official storage format of the Internet Archive for storing scraped content. WARC format used: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf

  • libxml2 ๐Ÿ“ ๐ŸŒ -- libxml: XML read/write

  • libzip ๐Ÿ“ ๐ŸŒ -- a C library for reading, creating, and modifying zip and zip64 archives.

  • LLhttp-parser ๐Ÿ“ ๐ŸŒ -- a port of http_parser to TypeScript, intended as its replacement. llparse is used to generate the output C source file, which can be compiled and linked with the embedder's program (such as Node.js).

  • mcmd ๐Ÿ“ ๐ŸŒ -- MCMD (M-Command): a set of commands for handling large scale CSV data. MCMD (pronounced M-Command) is a set of commands developed for the purpose of high-speed processing of large-scale structured tabular data in CSV format. It makes it possible to efficiently process large-scale data with hundreds of millions of rows of records on a standard PC.

  • metalink-cli ๐Ÿ“ ๐ŸŒ -- a small program which generates a metalink record on stdout for every file given on the command line, using the mirror list from stdin.

  • metalink-mini-downloader ๐Ÿ“ ๐ŸŒ -- a small metalink downloader written in C++, using boost, libcurl and expat. It can either be compiled so that it downloads a specific file and then (optionally) launches it or be compiled into a "downloader template", which can later be used to create a custom downloader by replacing text strings inside the executable (they are marked in a special way, to make this easy).

  • mht-rip ๐Ÿ“ ๐ŸŒ -- as I have several HTML pages stored in this MHTML format. See also CHM: CHM-lib

  • mime-mega ๐Ÿ“ ๐ŸŒ -- MIME extract/insert/encode/decode: use for MHTML support

  • mimetic ๐Ÿ“ ๐ŸŒ -- MIME: use for MHTML support

  • minizip-ng ๐Ÿ“ ๐ŸŒ -- a zip manipulation library written in C that is supported on Windows, macOS, and Linux. Minizip was originally developed by Gilles Vollant in 1998. It was first included in the zlib distribution as an additional code contribution starting in zlib 1.1.2. Since that time, it has been continually improved upon and contributed to by many people. The original project can still be found in the zlib distribution that is maintained by Mark Adler.

  • netpbm ๐Ÿ“ ๐ŸŒ -- a toolkit for manipulation of graphic images, including conversion of images between a variety of different formats. There are over 300 separate tools in the package including converters for about 100 graphics formats. Examples of the sort of image manipulation we're talking about are: Shrinking an image by 10%; Cutting the top half off of an image; Making a mirror image; Creating a sequence of images that fade from one image to another, etc.

  • OpenEXR ๐Ÿ“ ๐ŸŒ -- a high dynamic-range (HDR) image file format developed by Industrial Light & Magic (ILM) for use in computer imaging applications.

  • openexr-images ๐Ÿ“ ๐ŸŒ -- collection of images associated with the OpenEXR distribution.

  • pdf2htmlEX ๐Ÿ“ ๐ŸŒ -- convert PDF to HTML without losing text or format.

  • picohttpparser ๐Ÿ“ ๐ŸŒ -- PicoHTTPParser is a tiny, primitive, fast HTTP request/response parser. Unlike most parsers, it is stateless and does not allocate memory by itself. All it does is accept a pointer to the buffer and an output structure, and set up the pointers in the latter to point at the relevant portions of the buffer.

  • pisa_formatter ๐Ÿ“ ๐ŸŒ -- converts a list of documents to the pisa-engine binary format: {.docs, .freqs, .sizes}. Its input should be a text file where each line is a document. Each document starts with the document name (which should not contain whitespace) followed by a list of ASCII terms separated by whitespace which define the document. It also generates a binary .terms file which holds the information needed to convert from term to index and is used by the query_transformer executable. This file stores all the unique terms from all the documents.

  • psd_sdk ๐Ÿ“ ๐ŸŒ -- a C++ library that directly reads Photoshop PSD files. The library supports:

    • Groups
    • Nested layers
    • Smart Objects
    • User and vector masks
    • Transparency masks and additional alpha channels
    • 8-bit, 16-bit, and 32-bit data in grayscale and RGB color mode
    • All compression types known to Photoshop

    Additionally, limited export functionality is also supported.

  • qs_parse ๐Ÿ“ ๐ŸŒ -- a set of simple and easy functions for parsing URL query strings, such as those generated in an HTTP GET form submission.

  • robotstxt ๐Ÿ“ ๐ŸŒ -- Google robots.txt Parser and Matcher Library. The Robots Exclusion Protocol (REP) is a standard that enables website owners to control which URLs may be accessed by automated clients (i.e. crawlers) through a simple text file with a specific syntax. It's one of the basic building blocks of the internet as we know it and what allows search engines to operate. Because the REP was only a de-facto standard for the past 25 years, different implementers implement parsing of robots.txt slightly differently, leading to confusion. This project aims to fix that by releasing the parser that Google uses.

  • SFML ๐Ÿ“ ๐ŸŒ -- Simple and Fast Multimedia Library (SFML) is a simple, fast, cross-platform and object-oriented multimedia API. It provides access to windowing, graphics, audio and network.

  • silk-codec ๐Ÿ“ ๐ŸŒ -- a library to convert PCM to Tencent Silk files and vice versa.

  • silk-v3-decoder ๐Ÿ“ ๐ŸŒ -- decodes Silk v3 audio files (like WeChat amr, aud files, qq slk files) and converts to other formats (like mp3).

  • sqawk ๐Ÿ“ ๐ŸŒ -- apply SQL on CSV files in the shell: sqawk imports CSV files into an on-the-fly SQLite database, and runs a user-supplied query on the data.

  • taglib ๐Ÿ“ ๐ŸŒ -- TagLib is a library for reading and editing the metadata of several popular audio formats. Currently it supports both ID3v1 and ID3v2 for MP3 files, Ogg Vorbis comments and ID3 tags in FLAC, MPC, Speex, WavPack, TrueAudio, WAV, AIFF, MP4, APE, and ASF files.

  • ticpp ๐Ÿ“ ๐ŸŒ -- TinyXML++: XML read/write

  • tidy-html5 ๐Ÿ“ ๐ŸŒ -- clean up HTML documents before archiving/processing

  • tinyexr ๐Ÿ“ ๐ŸŒ -- Tiny OpenEXR: tinyexr is a small, single header-only library to load and save OpenEXR (.exr) images.

  • upskirt-markdown ๐Ÿ“ ๐ŸŒ -- MarkDown renderer

  • url-parser ๐Ÿ“ ๐ŸŒ -- parse URLs much like Node's url module.

  • warc2text ๐Ÿ“ ๐ŸŒ -- Extracts plain text, language identification and more metadata from WARC records.

  • xlnt ๐Ÿ“ ๐ŸŒ -- a modern C++ library for manipulating spreadsheets in memory and reading/writing them from/to XLSX files as described in ECMA 376 4th edition.

  • xml-pugixml ๐Ÿ“ ๐ŸŒ -- light-weight, simple and fast XML parser for C++ with XPath support.

  • xmunch ๐Ÿ“ ๐ŸŒ -- xmunch essentially does what the 'munch' command of, for example, hunspell does, but it is not compatible with hunspell affix definition files. So why use it then? What makes xmunch different from the other tools is its ability to extend incomplete word-lists. For hunspell's munch to identify a stem and add an affix mark, every word formed by the affix with the stem has to be in the original word-list. This makes sense for a compression tool. However, if your word list is incomplete, you have to add all possible word-forms of a word first, before any compression is done. Using xmunch instead, you can define a subset of forms which are required to be in the word-list to allow the word to be used as a stem. Like this, you can extend the word-list.

  • zsv ๐Ÿ“ ๐ŸŒ -- the world's fastest (SIMD) CSV parser, with an extensible CLI for SQL querying, format conversion and more.

  • gmime ๐ŸŒ (alternative repo here) -- multipart MIME library; serves as a fundamental building block for full MHTML file format I/O support

    • removed; reason: GNOME libraries are horrible to integrate with other codebases.

BibTeX and similar library metadata formats' support

  • bibtex-robust-decoder ๐Ÿ“ ๐ŸŒ -- BibTeX parser which is robust: it will cope well with various BibTeX input errors which may be caused by manual entry of a BibTeX record.
  • bibtool ๐Ÿ“ ๐ŸŒ -- a tool for manipulating BibTeX databases. BibTeX provides a means to integrate citations into LaTeX documents. BibTool allows the manipulation of BibTeX files which goes beyond the possibilities -- and intentions -- of BibTeX.
  • bibutils ๐Ÿ“ ๐ŸŒ -- the bibutils set interconverts between various bibliography formats using a common MODS-format XML intermediate. For example, one can convert RIS-format files to Bibtex by doing two transformations: RIS->MODS->Bibtex. By using a common intermediate for N formats, only 2N programs are required and not Nยฒ-N.
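The 2N-versus-N²-N claim is easy to check: with a common intermediate you write one importer and one exporter per format, while direct pairwise conversion needs a program for every ordered pair of distinct formats. In code:

```cpp
// Number of converter programs needed for N bibliography formats.
int pairwise(int n)  { return n * n - n; } // one per ordered pair of distinct formats
int via_pivot(int n) { return 2 * n; }     // N importers + N exporters via the MODS pivot
// For N = 10 formats: 90 direct converters vs. only 20 via the intermediate.
```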

export / output file formats, text formatting, etc.

  • bustache ๐Ÿ“ ๐ŸŒ -- C++20 implementation of {{ mustache }}, compliant with spec v1.1.3.
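The smallest slice of the mustache spec - plain {{name}} interpolation - can be sketched in a few lines of C++. bustache implements the full spec (sections, partials, escaping, and more); this toy version only substitutes flat variables:

```cpp
#include <map>
#include <string>

// Replace {{key}} placeholders with values from a map; unknown keys
// become empty strings. Toy sketch: substituted values are re-scanned,
// so values must not themselves contain "{{".
std::string render(std::string tmpl,
                   const std::map<std::string, std::string>& vars) {
    size_t open;
    while ((open = tmpl.find("{{")) != std::string::npos) {
        size_t close = tmpl.find("}}", open + 2);
        if (close == std::string::npos) break;
        std::string key = tmpl.substr(open + 2, close - open - 2);
        auto it = vars.find(key);
        tmpl.replace(open, close - open + 2,
                     it == vars.end() ? "" : it->second);
    }
    return tmpl;
}
```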

  • fast_double_parser ๐Ÿ“ ๐ŸŒ -- a fast function to parse ASCII strings containing decimal numbers into double-precision (binary64) floating-point values: 4x faster than strtod(). Given the string "1.0e10", it returns a 64-bit floating-point value equal to 10000000000. Accuracy is not sacrificed: the function matches exactly (down to the smallest bit) the result of a standard function like strtod. Unless you need support for RFC 7159 (the JSON standard), we encourage users to adopt the fast_float library instead, as it has more functionality.
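The correctness bar described above can be stated against the standard strtod() baseline these projects compare themselves to. This sketch uses only the C standard library, not fast_double_parser's own API:

```cpp
#include <cstdlib>

// Baseline parse: a conforming strtod() returns the nearest binary64
// value for the decimal input. Libraries like fast_double_parser and
// fast_float aim to match this result bit-for-bit, only faster.
double parse_baseline(const char* s) {
    char* end = nullptr;
    return std::strtod(s, &end);
}
```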

  • fast_float ๐Ÿ“ ๐ŸŒ -- fast and exact implementation of the C++ from_chars functions for float and double types: 4x faster than strtod

  • fast-hex ๐Ÿ“ ๐ŸŒ -- a fast, SIMD (vectorized) hex string encoder/decoder.

  • fmt ๐Ÿ“ ๐ŸŒ -- advanced C++ data-to-text formatter. The modern answer to classic printf().

  • hypertextcpp ๐Ÿ“ ๐ŸŒ -- string/text template engine & source-to-source compiler.

  • inja ๐Ÿ“ ๐ŸŒ -- a template engine for modern C++, loosely inspired by jinja for python. It has an easy and yet powerful template syntax with all variables, loops, conditions, includes, callbacks, and comments you need, nested and combined as you like.

  • libfort ๐Ÿ“ ๐ŸŒ -- a simple cross-platform library to create formatted text tables.

  • libqrencode ๐Ÿ“ ๐ŸŒ -- generate QR codes from anything (e.g. URLs). libqrencode is a fast and compact library for encoding data in a QR Code, a 2D symbology that can be scanned by handheld terminals such as smartphones. The capacity of a QR Code is up to 7000 digits or 4000 characters, with high robustness. libqrencode supports QR Code model 2, described in JIS (Japanese Industrial Standards) X0510:2004 or ISO/IEC 18004. Most of the features in the specification are implemented: numeric, alphanumeric, Japanese kanji (Shift-JIS) or any 8-bit code, optimized encoding of a string, structured append of symbols, and Micro QR Code (experimental).

  • PDFGen ๐Ÿ“ ๐ŸŒ -- a simple PDF Creation/Generation library, contained in a single C-file with header and no external library dependencies. Useful for embedding into other programs that require rudimentary PDF output.

  • quirc ๐Ÿ“ ๐ŸŒ -- a library for extracting and decoding QR codes, which are a type of high-density matrix barcodes, from images. It features a fast, robust and tolerant recognition algorithm. It can correctly recognise and decode QR codes which are rotated and/or oblique to the camera. It can also distinguish and decode multiple codes within the same image.

  • see-phit ๐Ÿ“ ๐ŸŒ -- a compile time HTML templating library written in modern C++/14. You write plain HTML as C++ string literals and it is parsed at compile time into a DOM like data structure. It makes your "stringly typed" HTML text into an actual strongly typed DSL.

  • sile-typesetter ๐Ÿ“ ๐ŸŒ -- SILE is a typesetting system; its job is to produce beautiful printed documents. Conceptually, SILE is similar to TeXโ€”from which it borrows some concepts and even syntax and algorithmsโ€”but the similarities end there. Rather than being a derivative of the TeX family SILE is a new typesetting and layout engine written from the ground up using modern technologies and borrowing some ideas from graphical systems such as Adobe InDesign.

  • tabulate ๐Ÿ“ ๐ŸŒ -- Table Maker for Modern C++, for when you want to display table formatted data in the terminal/console text window.

  • textflowcpp ๐Ÿ“ ๐ŸŒ -- a simple way to wrap a string at different line lengths, optionally with indents.

  • upskirt-markdown ๐Ÿ“ ๐ŸŒ -- MarkDown renderer

  • variadic_table ๐Ÿ“ ๐ŸŒ -- for "pretty-printing" a formatted table of data to the console. It uses "variadic templates" to allow you to specify the types of data in each column.

FTS (Full Text Search) and related: SOLR/Lucene et al: document content search

We'll be using SOLR mostly, but listed here are some interface libraries, plus a few interesting alternatives.

  • Bi-Sent2Vec ๐Ÿ“ ๐ŸŒ -- provides cross-lingual numerical representations (features) for words, short texts, or sentences, which can be used as input to any machine learning task with applications geared towards cross-lingual word translation, cross-lingual sentence retrieval as well as cross-lingual downstream NLP tasks. The library is a cross-lingual extension of Sent2Vec. Bi-Sent2Vec vectors are also well suited to monolingual tasks as indicated by a marked improvement in the monolingual quality of the word embeddings. (For more details, see paper)

  • BitFunnel ๐Ÿ“ ๐ŸŒ -- the BitFunnel index used by Bing's super-fresh, news, and media indexes. The algorithm is described in BitFunnel: Revisiting Signatures for Search.

  • completesearch ๐Ÿ“ ๐ŸŒ -- a fast and interactive search engine for context-sensitive prefix search on a given collection of documents. It not only provides search results, like a regular search engine, but also completions for the last (possibly only partially typed) query word that lead to a hit.

  • edit-distance ๐Ÿ“ ๐ŸŒ -- a fast implementation of the edit distance (Levenshtein distance). The algorithm used in this library is proposed by Heikki Hyyrรถ, "Explaining and extending the bit-parallel approximate string matching algorithm of Myers", (2001) http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.7158&rep=rep1&type=pdf.

  • fxt ๐Ÿ“ ๐ŸŒ -- a large scale feature extraction tool for text-based machine learning.

  • groonga ๐Ÿ“ ๐ŸŒ -- an open-source fulltext search engine and column store.

  • iresearch ๐Ÿ“ ๐ŸŒ -- the IResearch search engine is meant to be treated as a standalone index that is capable of both indexing and storing individual values verbatim. Indexed data is treated on a per-version/per-revision basis, i.e. existing data version/revision is never modified and updates/removals are treated as new versions/revisions of the said data. This allows for trivial multi-threaded read/write operations on the index. The index exposes its data processing functionality via a multi-threaded 'writer' interface that treats each document abstraction as a collection of fields to index and/or store. The index exposes its data retrieval functionality via 'reader' interface that returns records from an index matching a specified query. The queries themselves are constructed query trees built directly using the query building blocks available in the API. The querying infrastructure provides the capability of ordering the result set by one or more ranking/scoring implementations. The ranking/scoring implementation logic is plugin-based and lazy-initialized during runtime as needed, allowing for addition of custom ranking/scoring logic without the need to even recompile the IResearch library.

  • libchopshop ๐Ÿ“ ๐ŸŒ -- NLP/text processing with automated stop word detection and stemmer-based filtering. This library / toolkit is engineered to be able to provide both of the (often more or less disparate) n-gram token streams / vectors required for (1) initializing / training FTS databases, neural nets, etc. and (2) executing effective queries / matches on these engines.

  • libunibreak ๐Ÿ“ ๐ŸŒ -- an implementation of the line breaking and word breaking algorithms as described in [Unicode Standard Annex 14](http://www.unicode.org/reports/tr14/) and [Unicode Standard Annex 29](http://www.unicode.org/reports/tr29/).

  • Manticore -- while the userbase is much smaller than for the Lucene Gang (Lucene/SOLR/ES/OpenSearch), this still got me. Can't say exactly why. All the other Lucene/SOLR alternatives out there didn't appeal to me (old tech, slow dev, ...).

    • manticore-columnar ๐Ÿ“ ๐ŸŒ -- Manticore Columnar Library is a column-oriented storage and secondary indexing library, aiming to provide decent performance with low memory footprint at big data volume. When used in combination with Manticore Search it can be beneficial for those looking for:

      1. log analytics, including rich free-text search capabilities (which is missing in e.g. Clickhouse - a great tool for metrics analytics)
      2. faster / lower resource consumption log/metrics analytics. Since the library and Manticore Search are both written in C++ with low-level optimizations in mind, in many cases the performance / RAM consumption is better than in Lucene / SOLR / Elasticsearch
      3. running log / metric analytics in docker / kubernetes. Manticore Search + the library can work with as little as 30 megabytes of RAM, which Elasticsearch / Clickhouse can't. It also starts in less than a second, or a few seconds in the worst case. Since the overhead is so small, you can afford to run more nodes of Manticore Search + the library than of Elasticsearch. More nodes and a quicker start mean higher availability and agility.
      4. powerful SQL for logs/metrics analytics and everything else Manticore Search can give you
    • manticore-plugins ๐Ÿ“ ๐ŸŒ -- Manticore Search plugins and UDFs (user defined functions) -- Manticore Search can be extended with help of plugins and custom functions (aka user defined functions or UDFs).

    • manticoresearch ๐Ÿ“ ๐ŸŒ -- Manticore Search is an easy to use open source fast database for search. Good alternative for Elasticsearch. What distinguishes it from other solutions is:

      • It's very fast and therefore more cost-efficient than alternatives
      • Modern MPP architecture and smart query parallelization capabilities allow it to fully utilize all your CPU cores, lowering response time as much as possible when needed
      • Powerful and fast full-text search which works fine for small and big datasets
      • Traditional row-wise storage for small, medium and big size datasets
      • Columnar storage support via the Manticore Columnar Library for bigger datasets (much bigger than can fit in RAM)
      • Easy to use secondary indexes (you don't need to create them manually)
      • Cost-based optimizer for search queries
      • SQL-first: Manticore's native syntax is SQL. It speaks SQL over HTTP and uses the MySQL protocol (you can use your preferred MySQL client)
      • JSON over HTTP: to provide a more programmatic way to manage your data and schemas, Manticore provides an HTTP JSON protocol
      • Written fully in C++: starts fast, doesn't take much RAM, and low-level optimizations provide good performance
      • Real-time inserts: after an INSERT is made, the document is accessible immediately
      • Interactive courses for easier learning
      • Built-in replication and load balancing
      • Can sync from MySQL/PostgreSQL/ODBC/xml/csv out of the box
      • Not fully ACID-compliant, but supports transactions and binlog for safe writes
  • mitlm ๐Ÿ“ ๐ŸŒ -- the MIT Language Modeling Toolkit (MITLM) is a set of tools designed for the efficient estimation of statistical n-gram language models involving iterative parameter estimation. It achieves much of its efficiency through the use of a compact vector representation of n-grams.

  • pg_similarity ๐Ÿ“ ๐ŸŒ -- pg_similarity is an extension to support similarity queries on PostgreSQL. The implementation is tightly integrated in the RDBMS in the sense that it defines operators so instead of the traditional operators (= and <>) you can use ~~~ and ~!~ (any of these operators represents a similarity function).

  • pisa ๐Ÿ“ ๐ŸŒ -- a text search engine able to run on large-scale collections of documents. It allows researchers to experiment with state-of-the-art techniques, providing an ideal environment for rapid development. PISA is a text search engine, though the "PISA Project" is a set of tools that help experiment with indexing and query processing. Given a text collection, PISA can build an inverted index over this corpus, allowing the corpus to be searched. The inverted index, put simply, is an efficient data structure that represents the document corpus by storing, for each unique term, the list of documents containing it (see here). At query time, PISA keeps its index in main memory for rapid retrieval.

  • pisa_formatter ๐Ÿ“ ๐ŸŒ -- converts list of documents to the pisa-engine binary format: {.docs, .freqs, .sizes}. Its input should be a text file where each line is a document. Each document starts with the document name (which should not have whitespaces) followed by a list of ascii terms separated by whitespaces which define the document. This also generates a binary .terms file which has the information to convert from term to index and is used by the query_transformer executable. This file stores all the unique terms from all the documents.

  • sent2vec ๐Ÿ“ ๐ŸŒ -- a tool and pre-trained models related to the Bi-Sent2vec. The cross-lingual extension of Sent2Vec can be found here. This library provides numerical representations (features) for words, short texts, or sentences, which can be used as input to any machine learning task.

  • sist2 ๐Ÿ“ ๐ŸŒ -- sist2 (Simple incremental search tool) is a fast, low memory usage, multi-threaded application, which scans drives and directory trees, extracts text and metadata from common file types, generates thumbnails and comes with OCR support (with tesseract) and Named-Entity Recognition (using pre-trained client-side tensorflow models).

  • sqlite-fts5-snowball ๐Ÿ“ ๐ŸŒ -- a simple extension for use with FTS5 within SQLite. It allows FTS5 to use Martin Porter's Snowball stemmers (libstemmer), which are available in several languages. Check http://snowballstem.org/ for more information about them.

  • sqlite_fts_tokenizer_chinese_simple ๐Ÿ“ ๐ŸŒ -- an extension of SQLite3 FTS5 that supports Chinese and Pinyin. It fully solves the multi-phonetic-word problem of full-text retrieval on the WeChat mobile client (solution 4 in the article), with very simple and efficient support for Chinese and Pinyin searches.

    On this basis we also support more accurate phrase matching through cppjieba. See the introduction article at https://www.wangfenjin.com/posts/simple-jieba-tokenizer/

  • typesense ๐Ÿ“ ๐ŸŒ -- a fast, typo-tolerant, in-memory fuzzy search engine for building delightful search experiences: an Open Source alternative to Algolia and an easier-to-use alternative to Elasticsearch. โšก๐Ÿ”โœจ

stemmers

language detection / inference

  • cld1-language-detect ๐Ÿ“ ๐ŸŒ -- the CLD (Compact Language Detection) library, extracted from the source code for Google's Chromium library. CLD1 probabilistically detects languages in Unicode UTF-8 text.
  • cld2-language-detect ๐Ÿ“ ๐ŸŒ -- CLD2 probabilistically detects over 80 languages in Unicode UTF-8 text, either plain text or HTML/XML. For mixed-language input, CLD2 returns the top three languages found and their approximate percentages of the total text bytes. Optionally, it also returns a vector of text spans with the language of each identified. The design target is web pages of at least 200 characters (about two sentences); CLD2 is not designed to do well on very short text.
  • cld3-language-detect ๐Ÿ“ ๐ŸŒ -- CLD3 is a neural network model for language identification. The inference code extracts character ngrams from the input text and computes the fraction of times each of them appears. The model outputs BCP-47-style language codes, shown in the table below. For some languages, output is differentiated by script. Language and script names from Unicode CLDR.
  • libchardet ๐Ÿ“ ๐ŸŒ -- is based on Mozilla's Universal Charset Detector library and detects the character set used to encode data.
  • uchardet ๐Ÿ“ ๐ŸŒ -- uchardet is an encoding and language detector library, which attempts to determine the encoding of the text. It can reliably detect many charsets. Moreover it also works as a very good and fast language detector.

scripting user-tunable tasks such as OCR preprocessing, metadata extraction, metadata cleaning & other [post-]processing, ...

  • cel-cpp ๐Ÿ“ ๐ŸŒ -- C++ implementation of the Common Expression Language (CEL). For background, see the cel-spec repo. CEL implements common semantics for expression evaluation, enabling different applications to more easily interoperate. Key applications are (1) security policy: organizations have complex infrastructure and need common tooling to reason about the system as a whole, and (2) protocols: expressions are a useful data type and require interoperability across programming languages and platforms.
  • cel-spec ๐Ÿ“ ๐ŸŒ -- Common Expression Language specification: the Common Expression Language (CEL) implements common semantics for expression evaluation, enabling different applications to more easily interoperate. Key Applications are (1) Security policy: organizations have complex infrastructure and need common tooling to reason about the system as a whole and (2) Protocols: expressions are a useful data type and require interoperability across programming languages and platforms.
  • chibi-scheme ๐Ÿ“ ๐ŸŒ -- Chibi-Scheme is a very small library intended for use as an extension and scripting language in C programs. In addition to support for lightweight VM-based threads, each VM itself runs in an isolated heap allowing multiple VMs to run simultaneously in different OS threads.
  • cppdap ๐Ÿ“ ๐ŸŒ -- a C++11 library ("SDK") implementation of the Debug Adapter Protocol, providing an API for implementing a DAP client or server. cppdap provides C++ type-safe structures for the full DAP specification, and provides a simple way to add custom protocol messages.
  • cpython ๐Ÿ“ ๐ŸŒ -- Python version 3. Note: Building a complete Python installation requires the use of various additional third-party libraries, depending on your build platform and configure options. Not all standard library modules are buildable or useable on all platforms.
  • duktape ๐Ÿ“ ๐ŸŒ -- Duktape is an embeddable Javascript engine with a focus on portability and compact footprint. Duktape is ECMAScript E5/E5.1 compliant, with some semantics updated from ES2015+; it has partial support for ECMAScript 2015 (E6) and ECMAScript 2016 (E7), supports ES2015 TypedArray and Node.js Buffer bindings, and comes with a built-in debugger.
  • ECMA262 ๐Ÿ“ ๐ŸŒ -- ECMAScript :: the source for the current draft of ECMA-262, the ECMAScriptยฎ Language Specification.
  • exprtk ๐Ÿ“ ๐ŸŒ -- C++ Mathematical Expression Toolkit Library is a simple to use, easy to integrate and extremely efficient run-time mathematical expression parsing and evaluation engine. The parsing engine supports numerous forms of functional and logic processing semantics and is easily extensible.
  • guile ๐Ÿ“ ๐ŸŒ -- Guile is Project GNU's extension language library. Guile is an implementation of the Scheme programming language, packaged as a library that can be linked into applications to give them their own extension language. Guile supports other languages as well, giving users of Guile-based applications a choice of languages.
  • harbour-core ๐Ÿ“ ๐ŸŒ -- Harbour is the free software implementation of a multi-platform, multi-threading, object-oriented, scriptable programming language, backward compatible with Clipper/xBase. Harbour consists of a compiler and runtime libraries with multiple UI and database backends, its own make system and a large collection of libraries and interfaces to many popular APIs.
  • itcl ๐Ÿ“ ๐ŸŒ -- Itcl is an object oriented extension for Tcl.
  • jerryscript ๐Ÿ“ ๐ŸŒ -- JerryScript is a lightweight JavaScript engine intended to run on very resource-constrained devices such as microcontrollers, with full ECMAScript 5.1 support and a small memory footprint.
