Datashare

A self-hosted search engine to find stories in any files.

Status

  • CI checks: CircleCI
  • Translations: Crowdin
  • Documentation: User Guide, Storybook

Datashare is an open‑source, self‑hosted document search and analysis platform built by the International Consortium of Investigative Journalists (ICIJ). It ingests heterogeneous data (PDFs, emails, spreadsheets, images, archives, etc.), extracts text (including via OCR), enriches it with metadata and named entities, and exposes everything through a powerful search UI and REST API. Because Datashare runs on your own machines, you keep full control over sensitive material—no external cloud services required.

📣 Help shape the next content extraction features in Datashare! Please take 10 minutes to fill out our user survey. It directly influences our roadmap and lets you opt in to early previews and beta testing.

Main Features

  • 🔍 Full‑text search: Index & query PDFs, emails, office docs, images, archives, and more.
  • 🖼️ OCR on scans & images: Turn visual text into searchable text.
  • 🧠 Named‑entity extraction: Auto-detect people, orgs, locations, emails, etc.
  • Stars & tags: Mark and organize key documents.
  • 🧰 Advanced filters & operators: Combine facets with boolean, wildcard, and fuzzy queries.
  • 🤝 Team/server mode: Multi-user deployment with shared tags and recommendations.
  • 🔌 Plugin architecture: Extend Datashare with custom modules.

Developer Guide

This section explains how to set up a development environment, build the project, run tests, and manage database migrations. It assumes you are comfortable with Java/Maven projects and basic service orchestration.

Requirements

Languages & tooling

  • JDK 17
  • Apache Maven 3.8+: primary build tool for the backend
  • GNU Make (optional but recommended): convenient shortcuts (make dist, make update-db, etc.)
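
A quick way to confirm the tooling listed above is a small pre-flight check. The sketch below is an illustration rather than part of the project: the `version_ge` helper is a hypothetical name, and it assumes `sort -V` (GNU coreutils version sort) is available. It only prints a report and never aborts.

```shell
#!/usr/bin/env bash
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Hypothetical helper; relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Java: the requirement names JDK 17, so compare the major version only.
if command -v java >/dev/null 2>&1; then
  java_version="$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')"
  if [ "${java_version%%.*}" = "17" ]; then
    echo "Java $java_version: ok"
  else
    echo "Java $java_version: expected major version 17"
  fi
else
  echo "java: not found on PATH"
fi

# Maven: the requirement is 3.8 or newer.
if command -v mvn >/dev/null 2>&1; then
  mvn_version="$(mvn -v 2>/dev/null | awk '/^Apache Maven/ {print $3; exit}')"
  if version_ge "$mvn_version" "3.8"; then
    echo "Maven $mvn_version: ok"
  else
    echo "Maven $mvn_version: expected 3.8 or newer"
  fi
else
  echo "mvn: not found on PATH"
fi

# GNU Make is optional, so only report whether it is present.
command -v make >/dev/null 2>&1 && echo "make: found" || echo "make: not found (optional)"
```

Comparing with `sort -V` rather than plain string comparison matters for versions such as 3.10, which sorts before 3.8 lexically but is newer.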

Services

These services must be running for a complete development environment:

  • PostgreSQL 13+
    • Available on host postgres:5432
    • Two DBs expected by default: datashare (dev) and test (tests)
    • A role with privileges, e.g. user: test, password: test
  • Elasticsearch 7.x
    • Available on host elasticsearch:9200
    • 8.x is not officially supported
  • Redis 5+
    • Available on host redis:6379
    • Used to store sessions and orchestrate async tasks.
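
Before building or running tests, it can help to confirm that the three services answer on the hosts and ports listed above. The following is a minimal sketch, not a project script: `check_port` and the `POSTGRES_ADDR`, `ELASTICSEARCH_ADDR`, and `REDIS_ADDR` variables are hypothetical names introduced for illustration, and it assumes `bash` and the coreutils `timeout` command are installed. It checks TCP reachability only, not credentials or service versions.

```shell
#!/usr/bin/env bash
# check_port HOST PORT: succeeds if a TCP connection opens within 2 seconds.
# Uses bash's /dev/tcp pseudo-device, so no extra client tools are needed.
check_port() {
  timeout 2 bash -c ': </dev/tcp/"$0"/"$1"' "$1" "$2" 2>/dev/null
}

# Defaults mirror the host:port pairs listed above; the *_ADDR variables are
# hypothetical overrides for setups where the services live elsewhere.
for entry in \
  "PostgreSQL=${POSTGRES_ADDR:-postgres:5432}" \
  "Elasticsearch=${ELASTICSEARCH_ADDR:-elasticsearch:9200}" \
  "Redis=${REDIS_ADDR:-redis:6379}"
do
  name="${entry%%=*}"
  addr="${entry#*=}"
  if check_port "${addr%:*}" "${addr##*:}"; then
    echo "$name reachable at $addr"
  else
    echo "$name NOT reachable at $addr"
  fi
done
```

For example, `POSTGRES_ADDR=localhost:5432 ./check-services.sh` (a hypothetical file name) would probe a local PostgreSQL instead of the `postgres` host.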

Build

The project is modular. Typical steps:

# 1. Validate the build and resolve deps
mvn validate

# 2. Build shared testing utilities (some modules depend on these)
mvn -pl commons-test -am install

# 3. Apply DB migrations so your dev DB schema matches the code
mvn -pl datashare-db liquibase:update

# 4. Build everything (skips both compiling and running tests)
mvn package -Dmaven.test.skip=true

Run Tests

Datashare has both unit and integration tests. Integration tests expect Postgres, Elasticsearch, and Redis to be reachable.

# Run the whole test suite
mvn test

# Or run a single module
mvn -pl datashare-api test

# Or a single test class
mvn -pl datashare-api -Dtest=org.icij.datashare.PropertiesProviderTest test

Database Migrations

Datashare uses Liquibase to version and apply schema changes.

Apply latest migrations:

make update-db

Start from scratch (danger: drops data):

make reset-db

Adding a new changeset:

  1. Create a new XML/YAML changeset under datashare-db/src/main/resources/db/changelog/
  2. Reference it in the master changelog file
  3. Run make update-db locally to verify
  4. Commit both the changeset and updated master file
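
Step 1 above can be sketched as a small helper that writes a skeleton changeset in Liquibase's YAML format. Everything specific here is an assumption made for illustration: the `new_changeset` function, the `example_table` and `example_column` names, and the file naming scheme are hypothetical, and the project's actual changesets may follow different conventions.

```shell
#!/usr/bin/env bash
# new_changeset DIR ID AUTHOR: write a skeleton Liquibase YAML changeset
# to DIR/ID.yml and print the path of the created file.
# The table and column names below are placeholders to replace.
new_changeset() {
  changeset_file="$1/$2.yml"
  mkdir -p "$1"
  cat > "$changeset_file" <<EOF
databaseChangeLog:
  - changeSet:
      id: $2
      author: $3
      changes:
        - addColumn:
            tableName: example_table
            columns:
              - column:
                  name: example_column
                  type: varchar(255)
EOF
  echo "$changeset_file"
}
```

A call such as `new_changeset datashare-db/src/main/resources/db/changelog 042-example your-name` would create the file. It still has to be referenced in the master changelog and verified with make update-db, as described in steps 2 and 3.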

Frontend

The web UI is built with Vue 3 and maintained in a separate repository. When building the backend, you must also build the client and copy its compiled files into the ./app directory. The backend bundles these static assets using FluentHTTP, which serves resources from ./app (relative to the repo root). If this folder is missing or empty, only the API is available and no UI is served.

Prerequisites for Frontend Dev

  • Node.js 20.19+
  • Yarn 1

Build workflow

  1. Clone & enter the client repo

    git clone https://github.com/ICIJ/datashare-client.git
    cd datashare-client

  2. Install and build

    yarn
    yarn build

    The build outputs a production bundle into dist/.

  3. Copy (or symlink) into backend

    rm -rf ../datashare/app
    mkdir -p ../datashare/app
    cp -r dist/* ../datashare/app/
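
Step 3 mentions symlinking as an alternative to copying, which avoids re-copying after every client rebuild. The sketch below shows one way to do that. The `link_client_build` function is a hypothetical helper, not part of either repository, and it assumes the backend follows symbolic links when serving ./app, which is worth verifying in your setup.

```shell
#!/usr/bin/env bash
# link_client_build DIST_DIR APP_DIR: replace APP_DIR with a symlink that
# points at the client's build output (DIST_DIR).
link_client_build() {
  if [ ! -d "$1" ]; then
    echo "missing build output: $1 (run 'yarn build' first)" >&2
    return 1
  fi
  # Remove any previous copy or link; without a trailing slash, rm removes
  # an existing symlink itself, not the directory it points to.
  rm -rf "$2"
  # Link to the absolute path so the link stays valid from any working directory.
  ln -s "$(cd "$1" && pwd)" "$2"
}
```

Run from the client repository, this would be `link_client_build dist ../datashare/app`, using the same relative layout as the copy commands above.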

License

Datashare is distributed under the GNU Affero General Public License v3.0.

About ICIJ

The International Consortium of Investigative Journalists (ICIJ) is a global network of reporters and media organizations collaborating on cross‑border investigations (e.g., Panama Papers, Luanda Leaks, Uber Files, Pandora Papers). The tech team at ICIJ builds tools like Datashare to empower investigative journalism at scale, handling millions of documents securely and efficiently. We open‑sourced Datashare to empower solo reporters and small newsrooms with advanced investigative tools, enable larger organizations to audit, extend, and self‑host the platform, and foster collaboration within the investigative community to continually improve the software.

Contact & Community