
echo "gibberish string" | enca --language XX --certainty 100 --> Returns probabilistic output on source charset and undergone transformation(s) #64

Open
porg opened this issue Oct 28, 2024 · 0 comments


This is a feature request.

Use case

  • You have a gibberish text string in front of you, e.g. "éä".
  • You know the source language with absolute certainty, e.g. "he" (Hebrew), because it's the title of a song sung in Hebrew.
  • You have no idea how many wrong interpretations or conversions the string has undergone between its creation and what you now see on screen. For example, I found this in a song title in Apple Music, which came from the ID3 tag of an MP3 file. ID3 exists in several versions that allow different charsets, and who knows what the tag has been through (ID3 updates over decades of iTunes/Apple Music use, syncing between different ID3 versions present in the same file, etc.) or which charset Apple Music now assumes when reading and displaying it.

User goal

I'd like a mode where you can paste a text snippet via stdin into enca and state --language <language-code> --certainty <percentage> and enca cleverly tells me something like:

  • 85% plausibility: charsetX → charsetY → charsetZ
  • 79% plausibility: charsetX falsely interpreted as charsetY → then converted to charsetZ

I haven't the faintest idea whether that's feasible at all, nor by which heuristics or analysis something like this could be done. But hey 😂 in the age of machine learning and huge data-correlating engines, I can at least formulate the use case and hope it may be possible.
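
To make the idea a bit more concrete, here is a rough Python sketch of a brute-force approach. It is not how enca works and not a proposal for its implementation. It assumes a single wrong step (text stored in one charset, read as another), uses a small hand-picked candidate list, and scores each repair by the share of Hebrew letters in the result. The names `CANDIDATES`, `hebrew_ratio` and `guess_chains` are made up for illustration, and the score is a crude heuristic, not a real probability.

```python
from itertools import permutations

# Assumed candidate list for illustration; a real tool would need many more.
CANDIDATES = ["latin-1", "cp1252", "mac_roman", "cp1255", "iso8859_8", "cp862", "utf-8"]


def hebrew_ratio(text):
    """Share of non-space characters that are Hebrew letters (U+05D0..U+05EA)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hebrew = sum(1 for ch in chars if "\u05d0" <= ch <= "\u05ea")
    return hebrew / len(chars)


def guess_chains(garbled, threshold=0.8):
    """Try every (wrong, right) charset pair and keep the plausible repairs.

    Each result is (score, right, wrong, repaired), meaning: the text was
    probably stored as `right` but falsely interpreted as `wrong`.
    """
    results = []
    for wrong, right in permutations(CANDIDATES, 2):
        try:
            # Undo the false interpretation: recover the raw bytes, then
            # decode them with the charset they may originally have been in.
            repaired = garbled.encode(wrong).decode(right)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this pair cannot have produced the garbled string
        score = hebrew_ratio(repaired)
        if score >= threshold:
            results.append((score, right, wrong, repaired))
    results.sort(key=lambda r: r[0], reverse=True)
    return results


if __name__ == "__main__":
    for score, right, wrong, repaired in guess_chains("éä"):
        print(f"{score:.0%} plausibility: {right} falsely interpreted as {wrong} -> {repaired}")
```

For "éä" this reports several candidate pairs, among them cp1255 falsely interpreted as latin-1. With only two characters, several pairs score equally, so a longer sample and a better language model than "counts Hebrew letters" would be needed to rank them. Longer chains (charsetX → charsetY → charsetZ) would mean repeating the encode/decode step, at the cost of a quickly growing search space.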
