CLI UTF-8 decomposer for text analysis, capable of displaying Unicode code point names and categories, along with ASCII control characters, UTF-16 surrogate pair pieces, parts of invalid UTF-8 sequences as separate bytes, etc.
Indispensable when you need to quickly tell apart otherwise indistinguishable Unicode code points.
pipx install holms
curl -sS https://raw.githubusercontent.com/es7s/holms/master/install.sh | sh
Usage: holms run [OPTIONS] [INPUT]
Read data from INPUT file, find all valid UTF-8 byte sequences, decode them and display as
separate Unicode code points. Use '-' as INPUT to read from stdin instead.
Plain text output
> holms run -u - <<<'1₂³⅘↉⏨'
0 U+ 31 ▕ 1 ▏ Nd DIGIT ONE
1 U+2082 ▕ ₂ ▏ No SUBSCRIPT TWO
4 U+ B3 ▕ ³ ▏ No SUPERSCRIPT THREE
6 U+2158 ▕ ⅘ ▏ No VULGAR FRACTION FOUR FIFTHS
9 U+2189 ▕ ↉ ▏ No VULGAR FRACTION ZERO THIRDS
c U+23E8 ▕ ⏨ ▏ So DECIMAL EXPONENT SYMBOL
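Each row above combines the byte offset, the code point, its general category abbreviation and its official name. The latter two can also be obtained from Python's standard `unicodedata` module; a minimal illustrative sketch (not holms' actual code):

```python
import unicodedata as ud

# Category abbreviation and official Unicode name for the first
# three characters of the example input above.
for ch in "1\u2082\u00b3":
    print(f"U+{ord(ch):04X} {ud.category(ch)} {ud.name(ch)}")
# U+0031 Nd DIGIT ONE
# U+2082 No SUBSCRIPT TWO
# U+00B3 No SUPERSCRIPT THREE
```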
> holms run -u - <<<'🌯👄🤡🎈🐳🐍'
00 U1F32F ▕🌯 ▏ So BURRITO
04 U1F444 ▕👄 ▏ So MOUTH
08 U1F921 ▕🤡 ▏ So CLOWN FACE
0c U1F388 ▕🎈 ▏ So BALLOON
10 U1F433 ▕🐳 ▏ So SPOUTING WHALE
14 U1F40D ▕🐍 ▏ So SNAKE
> holms run -u - <<<'aаͣāãâȧäåₐᵃa'
00 U+ 61 ▕ a ▏ Ll LATIN SMALL LETTER A
01 U+ 430 ▕ а ▏ Ll CYRILLIC SMALL LETTER A
03 U+ 363 ▕ ͣ ▏ Mn COMBINING LATIN SMALL LETTER A
05 U+ 101 ▕ ā ▏ Ll LATIN SMALL LETTER A WITH MACRON
07 U+ E3 ▕ ã ▏ Ll LATIN SMALL LETTER A WITH TILDE
09 U+ E2 ▕ â ▏ Ll LATIN SMALL LETTER A WITH CIRCUMFLEX
0b U+ 227 ▕ ȧ ▏ Ll LATIN SMALL LETTER A WITH DOT ABOVE
0d U+ E4 ▕ ä ▏ Ll LATIN SMALL LETTER A WITH DIAERESIS
0f U+ E5 ▕ å ▏ Ll LATIN SMALL LETTER A WITH RING ABOVE
11 U+2090 ▕ ₐ ▏ Lm LATIN SUBSCRIPT SMALL LETTER A
14 U+1D43 ▕ ᵃ ▏ Lm MODIFIER LETTER SMALL A
17 U+FF41 ▕a ▏ Ll FULLWIDTH LATIN SMALL LETTER A
> holms run -u - <<<'%‰∞8᪲?¿‽⚠⚠️'
00 U+ 25 ▕ % ▏ Po PERCENT SIGN
01 U+2030 ▕ ‰ ▏ Po PER MILLE SIGN
04 U+221E ▕ ∞ ▏ Sm INFINITY
07 U+ 38 ▕ 8 ▏ Nd DIGIT EIGHT
08 U+1AB2 ▕ ᪲ ▏ Mn COMBINING INFINITY
0b U+ 3F ▕ ? ▏ Po QUESTION MARK
0c U+ BF ▕ ¿ ▏ Po INVERTED QUESTION MARK
0e U+203D ▕ ‽ ▏ Po INTERROBANG
11 U+26A0 ▕ ⚠ ▏ So WARNING SIGN
14 U+26A0 ▕ ⚠ ▏ So WARNING SIGN
17 U+FE0F ▕ ️ ▏ Mn VARIATION SELECTOR-16
The application works in two modes: buffered (the default if INPUT is a file) and unbuffered (the default when reading from stdin). Options '-b'/'-u' explicitly override the output mode regardless of the default setting.
In buffered mode the result begins to appear only after EOF is encountered (i.e., the WHOLE file has been read to the buffer). This is suitable for short and predictable inputs and produces the most compact output with fixed column sizes.
The unbuffered mode comes in handy when input is an endless piped stream: the results will be displayed in real time, as soon as the type of each byte sequence is determined, but the output column widths are not fixed and can vary as the process goes further.
Despite the name, the app actually uses a tiny (4-byte) input buffer, but that's the only way to handle a UTF-8 stream and distinguish valid sequences from broken ones; in a truly unbuffered mode the output would consist only of ASCII-7 characters (0x00-0x7F) and unrecognized binary data (0x80-0xFF), which is not what the application was made for.
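The tiny-buffer behaviour described above matches how an incremental UTF-8 decoder works: it holds at most a few pending bytes until it can tell whether a sequence is valid. A rough sketch using the Python stdlib (illustrative only, not holms' implementation):

```python
import codecs

# Feed a stream one byte at a time; the decoder buffers an incomplete
# multibyte sequence and emits U+FFFD for bytes that cannot start one.
dec = codecs.getincrementaldecoder("utf-8")("replace")
stream = b"a\xc2\x80\xff"  # 'a', U+0080 split over two bytes, invalid 0xFF
pieces = [dec.decode(bytes([b])) for b in stream]
pieces.append(dec.decode(b"", final=True))
print(pieces)  # ['a', '', '\x80', '\ufffd', '']
```

Note the empty string after the 0xC2 byte: the decoder withholds output until the continuation byte arrives, which is exactly the "tiny buffer" the paragraph above refers to.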
Options:
-b, --buffered / -u, --unbuffered
Explicitly set to wait for EOF before processing the
output (buffered), or to stream the results in
parallel with reading, as soon as possible
(unbuffered). See BUFFERING section above for the
details.
-m, --merge Replace all sequences of repeating characters with one
of each, together with the initial length of the sequence.
-g, --group Group the input by code points (=count unique), sort
descending and display counts instead of normal
output. Implies '--merge' and forces buffered ('-b')
mode. Specifying the option twice ('-gg') results in
grouping by code point category instead, while doing
it thrice ('-ggg') makes the app group the input by
super categories.
-f, --format Comma-separated list of columns to show (order is
preserved). Run 'holms format' to see the details.
-n, --names Display names instead of abbreviations. Affects `cat`
and `block` columns, but only if the column in question is
already present on the screen. Note that these columns
can still display only the beginning of the attribute,
unless '-r' is provided.
-a, --all Display ALL columns.
-r, --rigid By default some columns can be shrunk below their
nominal width, if all current values fit and there is
still space left. This option disables column
shrinking (they will still be expanded when needed).
--decimal Use decimal byte offsets instead of hexadecimal.
--alt Use alternative notation for control characters: caret
notation for ASCII C0, octal notation for ASCII C1.
--oneline Discard all newline characters (0x0a LINE FEED) from
the input.
--no-table Do not format results as a table, just apply the
colors to characters (equivalent to '-f char', implies
'-b'). Compatible with '--merge', '--format' and even
'--group'.
--no-override Do not replace control/whitespace code point markers
with distinguishable characters ('▯' → '↵', '␣', etc.).
Run 'holms legend' to see the details.
-?, --help Show this message and exit.
The '-f'/'--format' option can be used to specify which columns to display. As an
alternative, the '-a'/'--all' option enables displaying all currently available
columns.
The '-m'/'--merge' option is also demonstrated; it tells the app to collapse
runs of repeated characters into a single output line while counting them:
Plain text output
> holms run -m phpstan.txt
000 U+2B ▕ + ▏ Sm PLUS SIGN
001+ U+2D ▕ - ▏ Pd 27× HYPHEN-MINUS
01c U+2B ▕ + ▏ Sm PLUS SIGN
01d U+20 ▕ ␣ ▏ Zs SPACE
01e U+2B ▕ + ▏ Sm PLUS SIGN
01f+ U+2D ▕ - ▏ Pd 27× HYPHEN-MINUS
03a U+2B ▕ + ▏ Sm PLUS SIGN
03b U+ A ▕ ↵ ▏ Cc ASCII C0 [LF] LINE FEED
03c U+7C ▕ | ▏ Sm VERTICAL LINE
03d+ U+20 ▕ ␣ ▏ Zs 27× SPACE
...
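Under the hood, '--merge' amounts to run-length collapsing of adjacent identical code points. A minimal sketch of the idea (assumed behaviour, not holms' actual implementation):

```python
from itertools import groupby

def merge_runs(s: str) -> list[tuple[str, int]]:
    # Collapse each run of a repeated character into (char, count),
    # similar to how '-m' folds repeats into a single output row.
    return [(ch, sum(1 for _ in grp)) for ch, grp in groupby(s)]

print(merge_runs("+---+ +--"))
# [('+', 1), ('-', 3), ('+', 1), (' ', 1), ('+', 1), ('-', 2)]
```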
There is an official Unicode Consortium data file included in the repository for
test purposes, named confusables.txt. In the next example we extract line #3620
using sed, delete all TAB (0x09) characters and feed the result to the
application. The result demonstrates various Unicode dot/bullet code points:
Plain text output
> sed confusables.txt -Ee 's/\t//g' -e '3620!d' |
holms run -
00 U+ B7 ▕ · ▏ Po MIDDLE DOT
02 U+1427 ▕ ᐧ ▏ Lo CANADIAN SYLLABICS FINAL MIDDLE DOT
05 U+ 387 ▕ · ▏ Po GREEK ANO TELEIA
07 U+2022 ▕ • ▏ Po BULLET
0a U+2027 ▕ ‧ ▏ Po HYPHENATION POINT
0d U+2219 ▕ ∙ ▏ Sm BULLET OPERATOR
10 U+22C5 ▕ ⋅ ▏ Sm DOT OPERATOR
13 U+30FB ▕・ ▏ Po KATAKANA MIDDLE DOT
16 U10101 ▕ 𐄁 ▏ Po AEGEAN WORD SEPARATOR DOT
1a U+FF65 ▕ ・ ▏ Po HALFWIDTH KATAKANA MIDDLE DOT
1d U+ A ▕ ↵ ▏ Cc ASCII C0 [LF] LINE FEED
The '-g'/'--group' option can be used to count unique code points, and to compute
the occurrence rate of each one:
Plain text output
> holms run -g ./tests/data/confusables.txt
U+ 20 ▕ ␣ ▏ Zs 12.5% ███ 62732× SPACE
U+ 9 ▕ ⇥ ▏ Cc 7.3% █▊ 36745× ASCII C0 [HT] HORIZONTAL TABULATION
U+ 41 ▕ A ▏ Lu 6.1% █▍ 30555× LATIN CAPITAL LETTER A
U+ 49 ▕ I ▏ Lu 5.2% █▏ 26063× LATIN CAPITAL LETTER I
U+ 45 ▕ E ▏ Lu 5.0% █▏ 24992× LATIN CAPITAL LETTER E
U+ 54 ▕ T ▏ Lu 3.7% ▉ 18776× LATIN CAPITAL LETTER T
U+ 4C ▕ L ▏ Lu 3.7% ▉ 18763× LATIN CAPITAL LETTER L
U+200E ▕ ▯ ▏ Cf 3.7% ▉ 18494× LEFT-TO-RIGHT MARK
U+ A ▕ ↵ ▏ Cc 2.9% ▋ 14609× ASCII C0 [LF] LINE FEED
U+ 43 ▕ C ▏ Lu 2.9% ▋ 14450× LATIN CAPITAL LETTER C
...
When used twice ('-gg') or thrice ('-ggg'), the application groups the input by
code point category or code point super category, respectively, which can be used
e.g. for frequency domain analysis:
Plain text output
> holms run -gg ./tests/data/confusables.txt
53.1% ██████████ 266233× Uppercase_Letter
12.5% ██▎ 62748× Space_Separator
10.2% █▉ 51356× Control
8.5% █▌ 42511× Decimal_Number
3.7% ▋ 18497× Format
3.0% ▌ 14832× Other_Letter
2.0% ▎ 9778× Math_Symbol
1.8% ▎ 9261× Close_Punctuation
1.8% ▎ 9259× Open_Punctuation
1.5% ▎ 7525× Other_Punctuation
...
> holms run -ggg ./tests/data/confusables.txt
56.7% ██████████ 284074× Letter
13.9% ██▍ 69853× Other(C)
12.5% ██▏ 62750× Separator(Z)
8.5% █▌ 42796× Number
5.9% █ 29571× Punctuation
2.2% ▍ 11072× Symbol
0.2% ▏ 965× Mark
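The same statistics can be approximated with the Python stdlib: `unicodedata.category()` yields the two-letter abbreviation, whose first letter is the super category counted by '-ggg'. An illustrative sketch:

```python
import unicodedata
from collections import Counter

# Count code points by general category (like '-gg'), then fold the
# two-letter categories into their one-letter super categories ('-ggg').
text = "Abc 123 \u221e!"
by_cat = Counter(unicodedata.category(ch) for ch in text)
by_super = Counter(cat[0] for cat in by_cat.elements())
print(by_cat.most_common())
print(by_super.most_common())
```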
When '--format' is specified as exactly one 'char' column ('--format=char'),
the application omits all other columns and prints the original file contents,
while highlighting each character with a color that indicates its Unicode
category.
Note that ASCII control codes, as well as Unicode ones, are kept untouched and
invisible.
Plain text output
> sed chars.txt -nEe 1,12p |
holms run --format=char -
! " # $ % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
@ A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z [ \ ] ^ _
` a b c d e f g h i j k l m n o
p q r s t u v w x y z { | } ~
¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯
° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
à á â ã ä å æ ç è é ê ë ì í î ï
ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
ASCII Latin letters (A-Za-z) are deliberately colored 50% gray instead of the
regular white: this can be extremely helpful when the task is to find non-ASCII
character(s) in a massive text of plain ASCII ones, or vice versa.
Below is a real example of broken characters resulting from two operations applied in the wrong order: UTF-8 decoding and URL %-based unescaping. This error is different from incorrect codepage selection, which messes up the whole text or a part of it; here all byte sequences are valid UTF-8 encoded code points, but the result differs from the original and is nevertheless completely unreadable.
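This class of error is easy to reproduce. A minimal reconstruction, assuming the percent-escapes were unquoted as Latin-1 after the text had already been decoded (the Latin-1 choice stands in for whatever 8-bit decoding the broken pipeline actually applied):

```python
from urllib.parse import quote, unquote

original = "\u0424"            # 'Ф', encoded in UTF-8 as 0xD0 0xA4
escaped = quote(original)      # '%D0%A4'

# Wrong order: unescape each byte as if it were its own character.
broken = unquote(escaped, encoding="latin-1")
print(broken)                  # 'Ð¤' - valid code points, unreadable text
# Right order: unescape to bytes, then decode those bytes as UTF-8.
print(unquote(escaped))        # 'Ф'
```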
While developing the application I encountered strange (as it seemed at first) behaviour of the Python interpreter, which encoded C1 control bytes as two bytes of UTF-8, while C0 control bytes were emitted as single bytes, as if they had been encoded in plain ASCII. A bit of research followed.
According to ISO/IEC 6429 (ECMA-48), there are two types of ASCII control codes
(to be precise, many more, but for our purposes that's mostly irrelevant): C0 and
C1. The first type includes ASCII code points 0x00-0x1F and 0x7F (some authors
also include the regular space character 0x20 in this list), and its
characteristic property is that all C0 code points are encoded in UTF-8 exactly
the same as in 7-bit US-ASCII (ISO/IEC 646). This helps to disambiguate exactly
what type of encoding is used, even for broken byte sequences, when the task is
to tell whether a byte represents a sole code point or is actually part of a
multibyte UTF-8 sequence.
C1 control codes, however, are represented by bytes 0x80-0x9F, which are also
valid continuation bytes for multibyte UTF-8 sequences. In order to distinguish
the first from the second, UTF-8 encodes them as two-byte sequences instead
(0x80 → 0xC2 0x80, etc.); this applies not only to control codes, but to all
other ISO/IEC 8859 code points starting from 0x80.
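This asymmetry is easy to verify from any Python shell (standard library behaviour, not specific to holms):

```python
# C0 control codes keep their single-byte ASCII form in UTF-8...
assert "\x1b".encode("utf-8") == b"\x1b"       # ESC
assert "\x0a".encode("utf-8") == b"\x0a"       # LF
# ...while C1 codes (and every Latin-1 code point from 0x80 up)
# become two-byte sequences.
assert "\x80".encode("utf-8") == b"\xc2\x80"   # C1 PADDING CHARACTER
assert "\x9f".encode("utf-8") == b"\xc2\x9f"   # C1 APC
assert "\xff".encode("utf-8") == b"\xc3\xbf"   # LATIN SMALL LETTER Y WITH DIAERESIS
print("all checks passed")
```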
With this in mind, let's see how the application reflects these differences. The first command produces several 8-bit ASCII C1 control codes, which are classified as raw binary/non-UTF-8 data, while the second command's output consists of the very same code points, but encoded in UTF-8 (thanks to Python's transparent Unicode support, we don't even need to bother much about encodings):
Plain text output
> (printf "\x80\x90\x9f" && python3 -c 'print("\x80\x90\x9f", end="")') |
holms run --names --decimal --all -
⏨0 #0 0x 80 -- ▕ ▯ ▏ NON UTF-8 BYTE 0x80 -- Binary
⏨1 #1 0x 90 -- ▕ ▯ ▏ NON UTF-8 BYTE 0x90 -- Binary
⏨2 #2 0x 9f -- ▕ ▯ ▏ NON UTF-8 BYTE 0x9F -- Binary
⏨3 #3 0x c2 80 U+80 ▕ ▯ ▏ ASCII C1 [PC] PADDING CHARACTER Latin-1 Supplem‥ Control
⏨5 #4 0x c2 90 U+90 ▕ ▯ ▏ ASCII C1 [DCS] DEVICE CONTROL STRING Latin-1 Supplem‥ Control
⏨7 #5 0x c2 9f U+9F ▕ ▯ ▏ ASCII C1 [APC] APPLICATION PROGRAM COMMAND Latin-1 Supplem‥ Control
The image below illustrates the color scheme developed specifically for the app, to simplify distinguishing code points of one category from another.
The most frequently encountered control codes also have unique character replacements, which makes it possible to recognize them without reading the label or memorizing code point identifiers: