A better way to sort by difficulty #453

Open
quinnlas opened this issue Jul 14, 2024 · 1 comment
Labels: enhancement (New feature or request)

Comments


Is your feature request related to a problem? Please describe.

Despite being intermediate in my TL, I was having difficulty sorting the books I had uploaded to find the easiest one.

Describe the solution you'd like

I believe the best metric for book difficulty would be:

unique unknown words / total words

measured over a short number of pages.

This reflects how often you will need to look words up while reading. It would be nice if you could sort the active book lists by this.
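
For concreteness, here is a minimal sketch of the metric itself, assuming you already have the words from a few pages as a flat list plus a set of the user's known words; the names are placeholders for illustration, not Lute's actual API:

```python
def lookup_ratio(words, known_words):
    """Unique unknown words / total words."""
    unknown = {w.lower() for w in words if w.lower() not in known_words}
    return len(unknown) / len(words) if words else 0.0

# 3 unique unknown words (cat, sat, mat) out of 6 total -> 0.5
print(lookup_ratio(["The", "cat", "sat", "on", "the", "mat"], {"the", "on"}))
```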

Describe alternatives you've considered

I ran the project from source and tried several alternatives; the above was by far the most useful for finding a good book to read. Here are my thoughts on the options I tried.

Unique unknown / unique total over 5 pages (default)
This doesn't seem very useful to me. I think the denominator should definitely be total words since that takes into account the repetition of common words.

Unique unknown / unique total over the whole book
This has the same issue while also being slower. I'm not suggesting calculating stats over a large number of pages anyway, but I thought I'd mention that if a hashmap of unique words to their counts were saved for each book, it would be a lot quicker to calculate any "whole book" stats. So while this method was slow with a basic implementation, it wouldn't need to be with a better one. But it's still not that useful.
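
For what it's worth, the caching idea could be as simple as keeping one Counter per book; `build_word_counts` and `known_words` below are illustrative names, not anything that exists in the codebase:

```python
from collections import Counter

def build_word_counts(book_text):
    """One-time pass over the whole book; the result could be persisted per book."""
    return Counter(w.lower() for w in book_text.split())

def whole_book_stats(word_counts, known_words):
    """Any "whole book" stat then becomes a cheap pass over the cached counts."""
    if not word_counts:
        return {}
    total = sum(word_counts.values())
    unique_total = len(word_counts)
    unknown = {w: c for w, c in word_counts.items() if w not in known_words}
    return {
        "unique_unknown / unique_total": len(unknown) / unique_total,
        "unique_unknown / total": len(unknown) / total,
        "total_unknown / total": sum(unknown.values()) / total,
    }
```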

Total unknown / total over the whole book
The issue with "total unknown" is that it doesn't tell you how often the unknown words are repeated, which in turn means you don't know how many lookups you'll have to do. For example, I had a short Wikipedia article with 54 new words, 45 of them unique, in only 110 words total, meaning a very large percentage of them would need to be looked up. But you could have a book where the new words are repeated often, so you only need to look each one up once. That would be way easier, with the same calculated difficulty.
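
Plugging in the article's numbers next to a made-up book with the same totals but heavy repetition shows why the two metrics diverge (the second set of counts is purely hypothetical):

```python
# total words / total unknown occurrences / unique unknown words
article         = {"total": 110, "total_unknown": 54, "unique_unknown": 45}
repetitive_book = {"total": 110, "total_unknown": 54, "unique_unknown": 9}  # hypothetical

for name, b in (("article", article), ("repetitive book", repetitive_book)):
    print(name,
          "total_unknown/total =", round(b["total_unknown"] / b["total"], 2),
          "unique_unknown/total =", round(b["unique_unknown"] / b["total"], 2))
# article          total_unknown/total = 0.49   unique_unknown/total = 0.41
# repetitive book  total_unknown/total = 0.49   unique_unknown/total = 0.08
# Same score on the first metric, very different lookup burden on the second.
```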

Unique unknown / total over the whole book
This does give you the "lookup percent" across the whole book. The downside is that it heavily favors very long books; in my case, the TL translation of War and Peace, which is obviously not a good choice for an intermediate learner. Given a long enough book, almost all words will be repeated often enough to bring down the calculated difficulty, but we really just want to know how difficult the book will be immediately when you start reading it.
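
Some made-up numbers illustrate the dilution effect (the counts are invented; only the shape of the comparison matters):

```python
long_book = {"total": 500_000, "unique_unknown": 5_000}  # a War and Peace-sized text
novella   = {"total": 10_000,  "unique_unknown": 800}

for name, b in (("long book", long_book), ("novella", novella)):
    print(name, b["unique_unknown"] / b["total"])
# long book 0.01, novella 0.08 -- the long book scores as "easier" even though
# its opening pages may be far harder.
```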

Unique unknown / total over 5 pages (Winner)
This works really well in my experience. I'm not sure of the ideal number of pages, but you would want it to be enough to compensate for progress on the current page, since that progress will bring the number down. Another idea could be to skip the current page entirely and use the next X pages (a rough sketch follows below).

Too low an X value won't account for variation in difficulty between pages, but too high a number would start to cause the issues of the previous method. 5 seemed to work well for me, in any case.

This metric doesn't account for variation of difficulty in, say, different chapters of a book (or on any pages it didn't consider), but books tend not to have that, so I think it's ok.
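
Here is roughly what sorting the book list by this metric could look like, skipping the page currently in progress; `book.current_page`, `book.page_count` and `get_page_words(...)` are placeholder names for illustration, not the real Lute internals:

```python
def difficulty(book, known_words, window=5):
    """Unique unknown / total words over the next `window` pages after the current one."""
    total, unknown = 0, set()
    first = book.current_page + 1                    # skip the page in progress
    last = min(first + window, book.page_count + 1)  # don't run past the end of the book
    for page_number in range(first, last):
        words = get_page_words(book, page_number)    # placeholder helper
        total += len(words)
        unknown.update(w.lower() for w in words if w.lower() not in known_words)
    return len(unknown) / total if total else 0.0

def sort_books_by_difficulty(books, known_words):
    """Easiest (lowest lookup ratio) first."""
    return sorted(books, key=lambda b: difficulty(b, known_words))
```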

jzohrab added the enhancement (New feature or request) label · Aug 12, 2024
jzohrab added this to Lute-v3 · Aug 12, 2024
jzohrab (Collaborator) commented Aug 12, 2024

@quinnlas - a belated thanks for this issue, which dropped off my radar. Good analysis too; it's hard to find the right metric for something like this, especially since it combines a few dimensions (X unique items that can repeat Y times).

If the next 5 pages had 1000 words, say with 2 unknown words repeated 10 times each and the rest of the text consisting of 100 repeated known words:

unique unknown / total over 5 pages = 2 / 1000 = 0.002
unique unknown / unique total over 5 pages = 2 / 102 ≈ 0.02

I think that currently the code does something like this: it actually renders the pages and then does its calculation based on the statuses of the words it sees. This is important for languages that are character-based, because the characters get combined into multi-word terms. But your logic should still be fine.
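
If the calculation really does run over rendered terms rather than raw words, the same ratio still works; something along these lines, where the RenderedTerm shape and the status convention are assumptions for illustration, not the actual Lute data model:

```python
from dataclasses import dataclass

@dataclass
class RenderedTerm:
    text: str
    status: int  # assumed convention: 0 = unknown, higher values = some degree of known

def lookup_ratio_from_rendered(terms):
    """Unique unknown terms / total rendered terms, so a multi-word term counts as one item."""
    unknown = {t.text.lower() for t in terms if t.status == 0}
    return len(unknown) / len(terms) if terms else 0.0
```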
