Dokugaku (読学) is a tool to help Japanese learners read manga and novels.
You upload the manga and novels you want to read and it will show you (1) what percentage of the words in that work you already know and (2) what words are most common in that particular work, in the series it belongs to or across all uploaded works. This way you can make an informed decision about what works to read first and what words would be most useful to learn. It also includes a manga reader and novel reader with support for third-party tools such as Yomitan.
Upload your manga and novels
- Auto-fill the volume number, title and author so it is easier to upload the next volume in a series.
- Skip the Mokuro (OCR) step if your host computer can get it done faster.
Browse your manga and novels
- Choose your next work based on the percentage of vocab you already know.
- Filter the results by title, author and read status.
Manage your progress
- Keep track of your progress in the works you are currently reading.
- Choose the next words to learn based on their frequency across all your unread works.
Get frequency lists
- View the most frequent unknown vocab in a particular work or in an entire series.
- Mark words as known, ignored (within that specific work/series) or excluded (everywhere).
- Search for a specific word or filter vocab by minimum frequency, ignored status or JLPT level.
- Automatically save your filters per work or series.
Get glossaries
- View the unknown vocab for the pages you are reading next in order of their occurrence.
- Check the frequency score for each word to decide if it's worth making a flashcard for.
Read your manga with dictionary lookups
- Use tools such as Yomitan to look up vocab in the OCR'd text bubbles.
- Lock overlapping text boxes for easier lookups.
- Show one page or two pages at a time.
- Show the manga in fullscreen mode.
- Navigate between pages and view modes with keyboard shortcuts.
- Automatically save your progress on every page change.
Read your novels with dictionary lookups
- Read your novels vertically or horizontally.
- Automatically save your reading direction preference per work.
- Customise the font size and line height.
- Save your progress by clicking the bookmark buttons.
- Select text to get a character count.
- Hover on a bookmark to see the paragraph number.
Upload your known vocab
- Mark thousands of words as known in one go.
This project is expressly intended to be used locally by a single user (it is called Dokugaku after all). While the backend has been designed with multiple users in mind, I would strongly advise against hosting it for multiple users due to the risk of copyright infringement.
- Make sure you have Docker installed.
- Click on the green 'Code' button in the top right and download the ZIP file.
- Use your text editor of choice to create a file named `.env` inside of the `dokugaku` folder you just unzipped. It should contain the following variables:
DB_PG_PASSWORD=***
ICHIRAN_PG_PASSWORD=***
ALLOW_OTHER_DEVICES=***
HOST_IP=***
WEB_PORT=3000
GRAPHQL_PORT=3001
ICHIRAN_PORT=3002
MOKURO_PORT=3003
WORK_PROCESSOR_PORT=3004
For `DB_PG_PASSWORD` and `ICHIRAN_PG_PASSWORD`, supply a password of your own choosing.
If you want to use Dokugaku on this device only, set `ALLOW_OTHER_DEVICES=0` and `HOST_IP=localhost`; if you want to use Dokugaku on multiple devices within your local network, set `ALLOW_OTHER_DEVICES=1` and set `HOST_IP` to the local IP address of the device that is running Dokugaku. You'll want to make sure this is a static IP.
Upon naming the file `.env` and saving it, it might disappear in your file manager if you are using macOS or Linux. This is standard behaviour for files starting with a dot. On macOS you can show them in Finder by pressing Cmd + Shift + . (period).
- In your terminal of choice, type `cd ` (including the space), drag and drop the Dokugaku folder into the terminal window and hit enter. Then type `docker compose up` and hit enter.
- Go to `http://localhost:3000` in your browser. If `ALLOW_OTHER_DEVICES` was set to 1, you can access Dokugaku by going to the IP you entered for `HOST_IP` with `:3000` added at the end (e.g. 192.168.0.0:3000).
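The `.env` file described above can also be generated rather than typed by hand. The following is a minimal sketch, not part of Dokugaku itself: `build_env` is a hypothetical helper that fills in the variables listed above and uses Python's `secrets` module to pick random passwords.

```python
import secrets


def build_env(allow_other_devices: bool, host_ip: str = "localhost") -> str:
    """Return the text of a .env file with freshly generated passwords.

    Hypothetical helper: the variable names come from the list above,
    everything else is an assumption made for illustration.
    """
    lines = [
        f"DB_PG_PASSWORD={secrets.token_urlsafe(24)}",
        f"ICHIRAN_PG_PASSWORD={secrets.token_urlsafe(24)}",
        f"ALLOW_OTHER_DEVICES={1 if allow_other_devices else 0}",
        # HOST_IP only matters when other devices are allowed to connect.
        f"HOST_IP={host_ip if allow_other_devices else 'localhost'}",
        "WEB_PORT=3000",
        "GRAPHQL_PORT=3001",
        "ICHIRAN_PORT=3002",
        "MOKURO_PORT=3003",
        "WORK_PROCESSOR_PORT=3004",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the file contents; redirect the output to dokugaku/.env to save it.
    print(build_env(allow_other_devices=False), end="")
```

For use across the local network, call `build_env(True, "<your static local IP>")` instead.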
The manga page files can be `.jpg`, `.jpeg`, `.png` or `.webp`.
When manga files are uploaded each vocab item is assigned to the page on which it was found. It is assumed that the (alphabetically) first image is page 1, the second image is page 2, etc. Unfortunately this assumption is not always correct, so it is worthwhile to check this before you upload the images. It will sometimes be necessary to remove some of the first few pages to make the page numbers line up properly.
The alphabetically first image will be used as the cover image.
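To check the page assignment before uploading, it can help to list the files in the order they will be numbered. The sketch below assumes a plain character-by-character sort; whether Dokugaku's own ordering treats case or locale differently is not confirmed here, and `page_map` is a hypothetical helper.

```python
def page_map(filenames):
    """Map 1-based page numbers to filenames in plain alphabetical order."""
    return {number: name for number, name in enumerate(sorted(filenames), start=1)}


if __name__ == "__main__":
    files = ["page_1.jpg", "page_2.jpg", "page_10.jpg", "cover.jpg"]
    for number, name in page_map(files).items():
        print(number, name)
```

With these example names the cover becomes page 1 and `page_10.jpg` is numbered before `page_2.jpg`, which is the kind of mismatch worth fixing (by removing or renaming files) before you upload.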
It probably makes more sense to run Mokuro (the OCR processor) on your host machine instead of leaving it to the Mokuro container, especially if you are using Apple Silicon. In that case, check the relevant checkbox and upload the `.json` files along with the images.
The image files can be `.jpg`, `.jpeg`, `.png` or `.webp`. The text files can be `.html`, `.md` or `.txt`.
It is possible to upload a novel across multiple files (for example, when using Calibre's edit function to extract a novel's `.html` files). These separate files will automatically be stitched together. It is important that they are named in alphabetically ascending order, otherwise they will be stitched together in the wrong order.
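Numbered filenames only sort alphabetically in the intended order when the numbers are zero-padded. As a rough sketch (the helper name and the padding width are assumptions, not something Dokugaku provides), a rename plan could be computed like this:

```python
import re


def zero_padded_names(filenames, width=3):
    """Return {old_name: new_name} with every run of digits left-padded with zeros."""
    return {
        name: re.sub(r"\d+", lambda match: match.group().zfill(width), name)
        for name in filenames
    }


if __name__ == "__main__":
    plan = zero_padded_names(["chapter2.html", "chapter10.html", "chapter1.html"])
    for old, new in sorted(plan.items(), key=lambda item: item[1]):
        print(f"{old} -> {new}")
```

The sketch only computes the new names; the actual renaming is left to you so that the result can be reviewed first.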
Uploaded text will be converted into basic semantic HTML. For the best result it is often necessary to make some manual adjustments to the text before uploading, especially when dealing with `.html` files that were extracted via Calibre (which have unpredictable markup).
A top-level heading (`h1` or `#`) will automatically be inserted based on the title provided in the upload form. This means that headings should start at level two (`h2` or `##`).
In `.html` files paragraphs are distinguished by their `p` tags. In `.md` files paragraphs must be separated by a blank line. In `.txt` files a single newline is sufficient, but a blank line works as well.
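The difference between the `.md` and `.txt` rules can be illustrated with a small sketch. This is not Dokugaku's actual parser, only an approximation of the paragraph rules described above; `split_paragraphs` is a hypothetical name.

```python
import re


def split_paragraphs(text: str, extension: str) -> list[str]:
    """Split text into paragraphs following the rules described above."""
    if extension == ".md":
        # Markdown: only a blank line ends a paragraph.
        chunks = re.split(r"\n\s*\n", text)
    elif extension == ".txt":
        # Plain text: every newline ends a paragraph; blank lines are dropped below.
        chunks = text.splitlines()
    else:
        raise ValueError(f"unsupported extension: {extension}")
    return [chunk.strip() for chunk in chunks if chunk.strip()]


if __name__ == "__main__":
    sample = "一行目\n二行目\n\n三行目"
    print(split_paragraphs(sample, ".md"))
    print(split_paragraphs(sample, ".txt"))
```

The same input yields two paragraphs when treated as `.md` and three when treated as `.txt`, which matters because paragraph numbers are used to track reading progress in novels.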
Sometimes it is desirable to have extra blank lines between paragraphs. This can be accomplished in `.md` files by inserting a thematic break and in `.html` files by inserting an `hr` tag.
Emphasis dots can be added in `.md` files by wrapping the relevant text in single asterisks or underscores and in `.html` files by wrapping it in `em` tags.
Blockquotes (or indentation in a more general sense) can be added in `.md` files by prefacing the relevant lines with `>` and in `.html` files by wrapping the paragraph in `blockquote` tags.
Images can be uploaded alongside the text file(s) and included in the text file(s) with the usual markdown or HTML syntax. Instead of supplying a relative or absolute path, supply only the filename. By default images are displayed as a block, but they can be displayed inline by adding an `"inline"` title attribute:
In `.md` files:
![](filename.jpg)
![](filename.jpg "inline")
In `.html` files:
<img src="filename.jpg" />
<img src="filename.jpg" title="inline" />
- There is no title at the start of the work
- Headings start at `h2`
- Blank lines (usually `<p><br /></p>`) are replaced with `hr`
- Text with emphasis dots (look for `text-emphasis` in the `.css` files to find the class name) is wrapped in `em` tags
- Indented paragraphs (look for `div` with classes like `start-3em`) are wrapped in `blockquote` tags
- SVG images (`<svg><image /></svg>`) are replaced with `img` tags
- Images that contain only text are replaced with the appropriate text string (where possible)
- Images that are supposed to be displayed inline contain `title="inline"`
- Pointless elements such as empty `a` tags are removed
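Two of the more mechanical items in the list above (blank-line paragraphs and empty `a` tags) can be handled with a couple of substitutions. The sketch below is only an illustration: `tidy_calibre_html` is a hypothetical helper, regular expressions are a blunt tool for HTML, and Calibre's markup varies, so check the result by hand.

```python
import re


def tidy_calibre_html(html: str) -> str:
    """Apply two simple clean-ups to HTML extracted via Calibre."""
    # Paragraphs that contain nothing but a line break become thematic breaks.
    html = re.sub(r"<p\b[^>]*>\s*<br\s*/?>\s*</p>", "<hr />", html)
    # Anchors without any content are dropped.
    html = re.sub(r"<a\b[^>]*>\s*</a>", "", html)
    return html


if __name__ == "__main__":
    raw = '<p>一</p><p><br /></p><p><a id="x"></a>二</p>'
    print(tidy_calibre_html(raw))
```

The remaining items (emphasis dots, indentation, SVG images) depend on class names that differ per book, so they are left out of this sketch.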
In the case of novels the `pageNumber` variable actually refers to the paragraph number. It is used across Dokugaku for the same purposes as the page numbers in manga (i.e. tracking reading progress and ordering glossary words) so it did not seem worth it to muddy up the types by adding a different property name to the mix.
There is currently no way to track the processing of an uploaded work other than keeping an eye on the `work-processor` logs, where you will find an up-to-date estimate for the segmentation stage. Unfortunately, the (often lengthy!) OCR stage reports no progress beyond a simple pass/fail message.
Words can be marked as excluded, ignored and known. In all three cases they will be filtered out of frequency lists, glossaries and recommended vocab, but they are intended for different use cases.
Excluding words is intended to be used for words that have no business being in a frequency list or glossary. This includes particles (e.g. が, わよ), grammatical constructs (e.g. そう, たい) and exclamations (e.g. え, あー). Marking a word as excluded means it will be filtered out of all frequency lists and glossaries and will not count toward the frequency score in the list of recommended vocab.
Ignoring words is intended to be used for words that have been spuriously parsed. In most cases this will be names (e.g. 臼井 儀人 being interpreted as four individual single-kanji words), but it can also happen when a word is written in an unusual script (e.g. カゼ being parsed as 'casein' instead of the intended 風邪 'cold'). Marking a word as ignored is work or series specific; it will only be filtered out of the frequency lists and glossaries of the same work (and of other works within the same series) and instances in unrelated works will still count towards the frequency score in the list of recommended vocab.
Marking words as known is self-explanatory. These words are filtered out from all frequency lists, glossaries and recommended vocab because they are (fortunately!) no longer worth learning. Marking these words manually would be cumbersome, which is why there is an option to upload a list of known words and have them automatically marked as known.
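The difference in scope between the three statuses described above can be summarised in a few lines. This is a simplified model for illustration only; the `VocabState` class, the `is_listed` function and the scope identifiers are hypothetical and do not reflect Dokugaku's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class VocabState:
    known: set = field(default_factory=set)      # applies everywhere
    excluded: set = field(default_factory=set)   # applies everywhere
    ignored: dict = field(default_factory=dict)  # work/series id -> set of words


def is_listed(word: str, scope: str, state: VocabState) -> bool:
    """Return True if `word` should still show up in lists for `scope`."""
    if word in state.known or word in state.excluded:
        return False
    # Ignored words are only hidden within the work or series they were ignored in.
    return word not in state.ignored.get(scope, set())


if __name__ == "__main__":
    state = VocabState(known={"猫"}, excluded={"が"}, ignored={"series-a": {"臼"}})
    print(is_listed("臼", "series-a", state))  # hidden inside series-a
    print(is_listed("臼", "series-b", state))  # still listed elsewhere
```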
On the 'known words' page there is an option to mark words as known in bulk by uploading them as a (white)space-separated or comma-separated list.
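As an illustration of what such a list looks like once it is split up, the sketch below separates the input on any mix of whitespace and commas. It is an approximation with a hypothetical function name, not the parsing code Dokugaku uses.

```python
import re


def parse_known_words(raw: str) -> list[str]:
    """Split a whitespace- or comma-separated string into individual words."""
    return [word for word in re.split(r"[\s,]+", raw) if word]


if __name__ == "__main__":
    print(parse_known_words("猫, 犬\n鳥\t魚"))
```

Note that this sketch only treats the ASCII comma as a separator; whether the upload form also accepts the Japanese comma (、) is not covered here.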
- In the Browse window you can use filters to select only those words you feel you truly know. For example, the filter `prop:ivl>=30 prop:r>0.9 -is:suspended` will select non-suspended cards with an interval of 30 days or more and a retention rate above 90%.
- Go to `Notes > Export Notes...` and choose `Notes in Plain Text (.txt)` as the export format. There is no need to check any of the boxes.
- Open the resulting `.txt` in a spreadsheet app of your choice.
- Remove all columns except for the one that represents the Japanese headword with kanji and without furigana syntax.
- Export the remaining column as a `.csv` file.
- If you open this file in a text editor, you will be able to copy/paste the contents into the upload form.
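The spreadsheet steps above can also be scripted. The sketch below assumes the export comes from Anki (which the filter syntax above appears to be), that it is tab-separated, and that any header lines start with `#`; the function name and the column index are hypothetical and depend on your note type.

```python
import csv
import io


def headwords_from_export(text: str, column: int = 0) -> str:
    """Pull one column out of a tab-separated export and join it with commas."""
    data_lines = (line for line in io.StringIO(text) if not line.startswith("#"))
    rows = csv.reader(data_lines, delimiter="\t")
    words = [row[column].strip() for row in rows if len(row) > column]
    return ",".join(word for word in words if word)


if __name__ == "__main__":
    export = "#separator:tab\n食べる\tたべる\tto eat\n猫\tねこ\tcat\n"
    print(headwords_from_export(export, column=0))
```

The sketch does not strip furigana syntax, so pick a column that already holds the plain headword with kanji.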
Sometimes Mokuro (the OCR engine) determines text box boundaries incorrectly, causing text boxes to overlap and making it impossible to scan the affected characters with Yomitan or other dictionary tools. To prevent this, it is possible to 'lock' a text box so that it will always remain visible and on top of the other text boxes. Simply hold `Shift` and click on a text box to lock it and `Shift`-click it again to unlock it.
Key | Description |
---|---|
`←` | Go to next page |
`[` | Go to last page |
`→` | Go to previous page |
`]` | Go to first page |
`0` | Reset zoom level |
`1` | Enable single page mode |
`2` | Enable double page mode |
`F` | Enable fullscreen mode |
- word search across the entire corpus, linking to the appropriate page or paragraph
- dashboard to track the processing status of uploaded works
- improve vocab pagination (specifically when there are no more records to load)
- improve the management of common environment variables between backend and frontend
- optimise the performance of Docker images
- improve validation and error handling
- lift component state up; improve stories