A self-hosted search engine to find stories in any files.
Datashare is an open‑source, self‑hosted document search and analysis platform built by the International Consortium of Investigative Journalists (ICIJ). It ingests heterogeneous data (PDFs, emails, spreadsheets, images, archives, etc.), extracts text (including via OCR), enriches it with metadata and named entities, and exposes everything through a powerful search UI and REST API. Because Datashare runs on your own machines, you keep full control over sensitive material—no external cloud services required.
📣 Help shape the next content extraction features in Datashare! Please take 10 minutes to fill out our user survey. It directly influences our roadmap and lets you opt in to early previews and beta testing.
- 🔍 Full‑text search: Index & query PDFs, emails, office docs, images, archives, and more.
- 🖼️ OCR on scans & images: Turn visual text into searchable text.
- 🧠 Named‑entity extraction: Auto-detect people, orgs, locations, emails, etc.
- ⭐ Stars & tags: Mark and organize key documents.
- 🧰 Advanced filters & operators: Combine facets with boolean, wildcard, and fuzzy queries.
- 🤝 Team/server mode: Multi-user deployment with shared tags and recommendations.
- 🔌 Plugin architecture: Extend Datashare with custom modules.
This section explains how to set up a development environment, build the project, run tests, and manage database migrations. It assumes you are comfortable with Java/Maven projects and basic service orchestration.
Languages & tooling
- JDK 17
- Apache Maven 3.8+: primary build tool for the backend
- GNU Make (optional but recommended): convenient shortcuts (`make dist`, `make update-db`, etc.)
Services
These services must be running for a complete developer environment. You might want
- PostgreSQL 13+
  - Available on host `postgres:5432`
  - Two DBs expected by default: `datashare` (dev) and `test` (tests)
  - A role with privileges, e.g. user: `test`, password: `test`
- Elasticsearch 7.x
  - Available on host `elasticsearch:9200`
  - 8.x is not officially supported
- Redis 5+
  - Available on host `redis:6379`
  - Used to store sessions and orchestrate async tasks.
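Before building or running integration tests, it can help to confirm that the three services answer on the hostnames listed above. The following is a minimal sketch, not part of Datashare: the function names are illustrative, and it assumes `bash` (for its `/dev/tcp` pseudo-device) and the coreutils `timeout` command are installed. It only tests that a TCP connection opens, not that the service behind it is healthy.

```shell
#!/bin/sh
# Sketch: probe backing services given as host:port endpoints.
# Function names are illustrative; they are not part of Datashare.

# Print the host part of a "host:port" endpoint.
endpoint_host() { printf '%s\n' "${1%:*}"; }

# Print the port part of a "host:port" endpoint.
endpoint_port() { printf '%s\n' "${1##*:}"; }

# Return 0 if a TCP connection to host ($1), port ($2) opens within 2 seconds.
# Relies on bash's /dev/tcp pseudo-device and coreutils `timeout`.
port_open() {
  timeout 2 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Report the status of each endpoint; return non-zero if any is unreachable.
check_services() {
  status=0
  for endpoint in "$@"; do
    if port_open "$(endpoint_host "$endpoint")" "$(endpoint_port "$endpoint")"; then
      echo "ok      $endpoint"
    else
      echo "missing $endpoint"
      status=1
    fi
  done
  return "$status"
}

# Only probe when endpoints are passed on the command line.
if [ "$#" -gt 0 ]; then
  check_services "$@"
fi
```

Saved under a hypothetical name such as `check-services.sh`, it could be run as `sh check-services.sh postgres:5432 elasticsearch:9200 redis:6379`.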
The project is modular. Typical steps:
# 1. Validate the build and resolve deps
mvn validate
# 2. Build shared testing utilities (some modules depend on these)
mvn -pl commons-test -am install
# 3. Apply DB migrations so your dev DB schema matches the code
mvn -pl datashare-db liquibase:update
# 4. Build everything (excluding tests)
mvn package -Dmaven.test.skip=true
Datashare has both unit and integration tests. Integration tests expect Postgres, Elasticsearch, and Redis to be reachable.
# Run the whole test suite
mvn test
# Or run a single module
mvn -pl datashare-api test
# Or a single test class
mvn -pl datashare-api -Dtest=org.icij.datashare.PropertiesProviderTest test
Datashare uses Liquibase to version and apply schema changes.
Apply latest migrations:
make update-db
Start from scratch (danger: drops data):
make reset-db
Adding a new changeset:
- Create a new XML/YAML changeset under `datashare-db/src/main/resources/db/changelog/`
- Reference it in the master changelog file
- Run `make update-db` locally to verify
- Commit both the changeset and updated master file
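As an illustration of the first step, the sketch below scaffolds a YAML changeset file using Liquibase's standard `databaseChangeLog` / `changeSet` layout. Everything specific in it is a hypothetical placeholder: the helper name, the changeset id, the author, and the `example_table` / `example_note` names. The project's actual naming conventions and master changelog format are not shown here, so check existing changesets before copying this shape.

```shell
#!/bin/sh
# Sketch: scaffold a YAML Liquibase changeset file.
# The id, author, table, and column names are hypothetical placeholders.

# Write a one-changeSet YAML file named after the id ($2) into directory ($1),
# recording author ($3), and print the path of the file created.
new_changeset() {
  dir=$1 id=$2 author=$3
  file="$dir/$id.yml"
  mkdir -p "$dir"
  cat > "$file" <<EOF
databaseChangeLog:
  - changeSet:
      id: $id
      author: $author
      changes:
        - addColumn:
            tableName: example_table
            columns:
              - column:
                  name: example_note
                  type: varchar(255)
EOF
  printf '%s\n' "$file"
}

# Example call, run from the repository root:
# new_changeset datashare-db/src/main/resources/db/changelog 42-add-example-note dev
```

The generated file still has to be referenced from the master changelog by hand before `make update-db` will pick it up.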
The web UI is built with Vue 3 and maintained in a separate repository. When building the backend, you must also build the client and copy its compiled files into the `./app` directory. The backend bundles these static assets using FluentHTTP, which serves resources from `./app` (relative to the repo root). If this folder is missing or empty, only the API will be available, with no UI.
- Node.js 20.19+
- Yarn 1
- Clone & enter the client repo
  git clone https://github.com/ICIJ/datashare-client.git
  cd datashare-client
- Install and build
  yarn
  yarn build
  The build outputs a production bundle into `dist/`.
- Copy (or symlink) into backend
  rm -rf ../datashare/app
  mkdir -p ../datashare/app
  cp -r dist/* ../datashare/app/
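Since a missing or empty `./app` folder leaves only the API available, a quick check after copying can catch a failed build or a wrong path. This is a minimal sketch with illustrative helper names that are not part of Datashare. It only tests that the directory exists and is non-empty, not that the bundle is valid.

```shell
#!/bin/sh
# Sketch: detect whether the backend would have UI assets to serve.
# Helper names are illustrative; they are not part of Datashare.

# Return 0 if directory $1 exists and contains at least one entry.
has_ui_assets() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Print a one-line summary for directory $1 (defaults to ./app).
report_ui_assets() {
  dir=${1:-./app}
  if has_ui_assets "$dir"; then
    echo "UI assets found in $dir"
  else
    echo "No UI assets in $dir: only the API will be available"
  fi
}
```

After sourcing the file, calling `report_ui_assets` from the backend repo root checks the default `./app` location.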
Datashare is distributed under the GNU Affero General Public License v3.0.
The International Consortium of Investigative Journalists (ICIJ) is a global network of reporters and media organizations collaborating on cross‑border investigations (e.g., Panama Papers, Luanda Leaks, Uber Files, Pandora Papers). The tech team at ICIJ builds tools like Datashare to empower investigative journalism at scale, handling millions of documents securely and efficiently. We open‑sourced Datashare to empower solo reporters and small newsrooms with advanced investigative tools, enable larger organizations to audit, extend, and self‑host the platform, and foster collaboration within the investigative community to continually improve the software.
Contact & Community
- Issues & feature requests: GitHub Issues
- Email: [email protected]
- Security reports: please email us and avoid filing public issues for vulnerabilities.