What is it?

Dokugaku (読学) is a tool to help Japanese learners read manga and novels.

You upload the manga and novels you want to read and it will show you (1) what percentage of the words in that work you already know and (2) what words are most common in that particular work, in the series it belongs to or across all uploaded works. This way you can make an informed decision about what works to read first and what words would be most useful to learn. It also includes a manga reader and novel reader with support for third-party tools such as Yomitan.

Upload your manga and novels

Auto-fill the volume number, title and author so it is easier to upload the next volume in a series.
Skip the Mokuro (OCR) step if your host computer can get it done faster.

Browse your manga and novels

Choose your next work based on the percentage of vocab you already know.
Filter the results by title, author and read status.

Manage your progress

Keep track of your progress in the works you are currently reading.
Choose the next words to learn based on their frequency across all your unread works.

Get frequency lists

View the most frequent unknown vocab in a particular work or in an entire series.
Mark words as known, ignored (within that specific work/series) or excluded (everywhere).
Search for a specific word or filter vocab by minimum frequency, ignored status or JLPT level.
Automatically save your filters per work or series.

Get glossaries

View the unknown vocab for the pages you are reading next in order of their occurrence.
Check the frequency score for each word to decide if it's worth making a flashcard for.

Read your manga with dictionary lookups

Use tools such as Yomitan to look up vocab in the OCR'd text bubbles.
Lock overlapping text boxes for easier lookups.
Show one page or two pages at a time.
Show the manga in fullscreen mode.
Navigate between pages and view modes with keyboard shortcuts.
Automatically save your progress on every page change.

Read your novels with dictionary lookups

Read your novels vertically or horizontally.
Automatically save your reading direction preference per work.
Customise the font size and line height.
Save your progress by clicking the bookmark buttons.
Select text to get a character count.
Hover on a bookmark to see the paragraph number

Upload your known vocab

Mark thousands of words as known in one go.

Scope of this project

This project is expressly intended to be used locally by a single user (it is called Dokugaku after all). While the backend has been designed with multiple users in mind, I would strongly advise against it due to the risk of copyright infringement.

How to install

Make sure you have Docker installed.
Click on the green 'Code' button in the top right and download the ZIP file.
Use your text editor of choice to create a file named .env inside of the dokugaku folder you just unzipped. It should contain the following variables:

DB_PG_PASSWORD=***
ICHIRAN_PG_PASSWORD=***

ALLOW_OTHER_DEVICES=***
HOST_IP=***

WEB_PORT=3000
GRAPHQL_PORT=3001
ICHIRAN_PORT=3002
MOKURO_PORT=3003
WORK_PROCESSOR_PORT=3004

For DB_PG_PASSWORD and ICHIRAN_PG_PASSWORD supply a password of your own choosing.

If you want to use Dokugaku on this device only, set ALLOW_OTHER_DEVICES=0 and HOST_IP=localhost; if you want to use Dokugaku on multiple devices within your local network, set ALLOW_OTHER_DEVICES=1 and set HOST_IP to the local IP address of the device that is running Dokugaku. You'll want to make sure this is a static IP.

Upon naming the file .env and saving it, it might disappear in your file manager if you are using macOS or Linux. This is standard behaviour for files starting with a dot. On macOS you can show them like this.

In your terminal of choice, type cd (including the space), drag and drop the Dokugaku folder into the terminal window and hit enter. Then type docker compose up and hit enter.
Go to http://localhost:3000 in your browser. If ALLOW_OTHER_DEVICES was set to 1, you can access Dokugaku by going to the IP you entered for HOST_IP with :3000 added at the end (e.g. 192.168.0.0:3000).

Uploading files

Manga

File formats

The manga page files can be .jpg, .jpeg, .png or .webp.

Page numbers

When manga files are uploaded each vocab item is assigned to the page on which it was found. It is assumed that the (alphabetically) first image is page 1, the second image is page 2, etc. Unfortunately this assumption is not always correct, so it is worthwhile to check this before you upload the images. It will sometimes be necessary to remove some of the first few pages to make the page numbers line up properly.

Cover

The alphabetically first image will be used as the cover image.

Pre-mokuro'd manga

It probably makes more sense to run Mokuro (the OCR processor) on your host machine instead of leaving it to the Mokuro container, especially if you are using Apple Silicon. In that case, check the relevant checkbox and upload the .json files along with the images.

Novels

File formats

The image files can be .jpg, .jpeg, .png or .webp. The text files can be .html, .md or .txt.

Multiple files

It is possible to upload a novel across multiple files (for example, when using Calibre's edit function to extract a novel's .html files). These separate files will automatically be stitched together. It is important that they are named in alphabetically ascending order, otherwise they will be stitched together in the wrong order.

Text formatting

Uploaded text will be converted into basic semantic HTML. For the best result it is often necessary to make some manual adjustments to the text before uploading, especially when dealing with .html files that were extracted via Calibre (which have unpredictable markup).

Headings

A top-level heading (h1 or #) will automatically be inserted based on the title provided in the upload form. This means that headings should start at level two (h2 or ##).

Paragraphs

In .html files paragraphs are distinguished by their p tags. In .md files paragraphs must be separated by a blank line. In .txt files a single newline is sufficient, but a blank line works as well.

Blank lines

Sometimes it is desirable to have extra blank lines between paragraphs. This can be accomplished in .md files by inserting a thematic break and in .html files by inserting an hr tag.

Emphasis dots

Emphasis dots can be added in .md files by wrapping the relevant text in single asterisks or underscores and in .html files by wrapping it in em tags.

Indentation

Blockquotes (or indentation in a more general sense) can be added in .md files by prefacing the relevant lines with > and in .html files by wrapping the paragraph in blockquote tags.

Images

Images can be uploaded alongside the text file(s) and included in the text file(s) with the usual markdown or HTML syntax. Instead of supplying a relative or absolute path, supply only the filename. By default images are displayed as a block, but they can be displayed inline by adding an "inline" title attribute:

In .md files:

![](filename.jpg)

![](filename.jpg "inline")

In .html files:

<img src="filename.jpg" />

<img src="filename.jpg" title="inline" />

Checklist for `.html` files extracted from Calibre

There is no title at the start of the work
Headings start at h2
Blank lines (usually <p><br /></p>) are replaced with hr
Text with emphasis dots (look for text-emphasis in the .css files to find the class name) is wrapped in em tags
Indented paragraphs (look for div with classes like start-3em) are wrapped in blockquote tags
SVG images (<svg><image /></svg>) are replaced with img tags
Images that contain only text are replaced with the appropriate text string (where possible)
Images that are supposed to be displayed inline contain title="inline"
Pointless elements such as empty a tags are removed

'Page numbers'

In the case of novels the pageNumber variable actually refers to the paragraph number. It is used across Dokugaku for the same purposes as the page numbers in manga (i.e. tracking reading progress and ordering glossary words) so it did not seem worth it to muddy up the types by adding a different property name to the mix.

Processing time estimate

There is currently no way to track the processing of an uploaded work other than keeping an eye on the work-processor logs, where you will find an up-to-date estimate for the segmentation stage. Unfortunately, there is currently no way to track progress within the (often lengthy!) OCR stage beyond a simple pass/fail message.

Managing vocab

Words can be marked as excluded, ignored and known. In all three cases they will be filtered out of frequency lists, glossaries and recommended vocab, but they are intended for different use cases.

Excluded

Excluding words is intended to be used for words that have no business being in a frequency list or glossary. This includes particles (e.g. が, わよ), grammatical constructs (e.g. そう, たい) and exclamations (e.g. え, あー). Marking a word as excluded means it will be filtered out of all frequency lists and glossaries and will not count toward the frequency score in the list of recommended vocab.

Ignored

Ignoring words is intended to be used for words that have been spuriously parsed. In most cases this will be names (e.g. 臼井儀人 being interpreted as four individual single-kanji words), but it can also happen when a word is written is an unusual script (e.g. カゼ being parsed as 'casein' instead of the intended 風邪 'cold'). Marking a word as ignored is work or series specific; it will only be filtered out of the frequency lists and glossaries of the same work (and of other works within the same series) and instances in unrelated works will still count towards the frequency score in the list of recommended vocab.

Known

Self-explanatory. These words are filtered out from all frequency lists, glossaries and recommended vocab because they are (fortunately!) no longer worth learning. Marking these words manually would be cumbersome, which is why there is an option to upload a list of known words and have them automatically marked as known.

Uploading known words

On the 'known words' page there is an option to mark words as known in bulk by uploading them as a (white)space-separated or comma-separated list.

Exporting from Anki

In the Browse window you can use filters to select only those words you feel you truly know. For example, the filter prop:ivl>=30 prop:r>0.9 -is:suspended will select non-suspended cards with an interval of 30 days or more and a retention rate of 90% or more.
Go to Notes > Export Notes... and Notes in Plain Text (.txt) as the export format. There is no need to check any of the boxes.
Open the resulting .txt in a spreadsheet app of your choice.
remove all columns except for the one that represents the Japanese headword with kanji and without furigana syntax
Export the remaining column as a .csv file.
If you open this file in a text editor, you will be able to copy/paste the contents into the upload form.

Reading manga

Overlapping text boxes

Sometimes Mokuro (the OCR engine) determines text box boundaries incorrectly, causing text boxes to overlap and making it impossible to scan the affected characters with Yomitan or other dictionary tools. To prevent this, it is possible to 'lock' a text box so that it will always remain visible and on top of the other text boxes. Simply hold Shift and click on a text box to lock it and Shift-click it again to unlock it.

Keyboard shortcuts

Key	Description
`←`	Go to next page
`[`	Go to last page
`→`	Go to previous page
`]`	Go to first page
`0`	Reset zoom level
`1`	Enable single plage mode
`2`	Enable double page mode
`F`	Enable fullscreen mode

Roadmap

New features

word search across the entire corpus, linking to the appropriate page or paragraph
dashboard to track the processing status of uploaded works

Improvements

improve vocab pagination (specifically when there are no more records to load)
improve the management of common environment variables between backend and frontend
optimise the performance of Docker images
improve validation and error handling
lift component state up; improve stories

Name		Name	Last commit message	Last commit date
Latest commit History 286 Commits
.vscode		.vscode
apps/web		apps/web
packages		packages
services		services
.dockerignore		.dockerignore
.eslintrc.js		.eslintrc.js
.gitignore		.gitignore
.npmrc		.npmrc
README.md		README.md
docker-compose.dev.yaml		docker-compose.dev.yaml
docker-compose.yaml		docker-compose.yaml
graphql.config.ts		graphql.config.ts
package.json		package.json
pnpm-lock.yaml		pnpm-lock.yaml
pnpm-workspace.yaml		pnpm-workspace.yaml
tsconfig.json		tsconfig.json
turbo.json		turbo.json

nienkeboomsma/dokugaku

Folders and files

Latest commit

History

Repository files navigation

What is it?

Scope of this project

How to install

Uploading files

Manga

File formats

Page numbers

Cover

Pre-mokuro'd manga

Novels

File formats

Multiple files

Text formatting

Headings

Paragraphs

Blank lines

Emphasis dots

Indentation

Images

Checklist for .html files extracted from Calibre

'Page numbers'

Processing time estimate

Managing vocab

Excluded

Ignored

Known

Uploading known words

Exporting from Anki

Reading manga

Overlapping text boxes

Keyboard shortcuts

Roadmap

New features

Improvements

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Checklist for `.html` files extracted from Calibre

Packages