There are two aspects to this analysis: comparing the results of actual sorts in the en_US locale, and comparing the LC_COLLATE section of the Operating System locale data files.
Comparing the results of actual sorts should catch any changes to default sorting that are not defined in the OS collation data. A simple Perl script generates a text file containing 91 different strings for every legal Unicode character. The Unix "sort" utility processes this file with the collation locale set to en_US. This process is repeated on each release from the past 10 years, and the Unix "diff" utility then compares the sorted output files and counts how many characters have different positions after sorting. The results show how many individual code points changed position in the sorted data across Operating System releases, and which Unicode Blocks contain the changed code points.
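As a minimal sketch of the sorting step (not the project's actual script), the helper below sorts its input with only the collation locale overridden. The name `sort_with_collation` is hypothetical. It clears `LC_ALL` first, because a set `LC_ALL` takes precedence over `LC_COLLATE`. If the requested locale is not installed, `sort` may quietly fall back to byte order, so checking `locale -a` beforehand is a reasonable precaution.

```shell
# sort_with_collation LOCALE [FILE...]
# Hypothetical helper: sort input using LOCALE for collation only.
sort_with_collation() (
    collation=$1
    shift
    # LC_ALL, when set, overrides LC_COLLATE, so clear it in this subshell.
    unset LC_ALL
    LC_COLLATE=$collation sort "$@"
)

# Example (file names are placeholders): save one sorted list per system.
# sort_with_collation en_US.UTF-8 strings.txt > "sorted-$(hostname).txt"
```

Running the same helper with `C` as the locale gives plain byte order, which is a handy baseline when checking that a locale is really in effect.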
The Operating System locale data files from /usr/share/i18n/locales are compared directly. The results show the total number of lines in the data files that are changed, and which locales contain the changes.
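The data file comparison can be sketched roughly as follows, assuming two directories that each hold a copy of /usr/share/i18n/locales from a different release. The function names and the output format are hypothetical, and the project's own scripts may count changes differently. Only the text between `LC_COLLATE` and `END LC_COLLATE` is compared, so a file without such a section yields empty output here and would need separate handling.

```shell
# Hypothetical helpers; the project's own scripts may work differently.

# collate_section FILE: print only the LC_COLLATE section of a locale source file.
collate_section() {
    sed -n '/^LC_COLLATE/,/^END LC_COLLATE/p' "$1"
}

# compare_collation OLD_DIR NEW_DIR
# For each file in OLD_DIR print "NAME (removed)" if NEW_DIR lacks it, or
# "NAME N" when N > 0 lines differ (old and new sides are both counted).
compare_collation() (
    old_dir=$1
    new_dir=$2
    old_tmp=$(mktemp)
    new_tmp=$(mktemp)
    for old in "$old_dir"/*; do
        [ -e "$old" ] || continue
        name=$(basename "$old")
        if [ ! -e "$new_dir/$name" ]; then
            echo "$name (removed)"
            continue
        fi
        collate_section "$old" > "$old_tmp"
        collate_section "$new_dir/$name" > "$new_tmp"
        n=$(diff "$old_tmp" "$new_tmp" | grep -c '^[<>]' || true)
        if [ "$n" -gt 0 ]; then
            echo "$name $n"
        fi
    done
    rm -f "$old_tmp" "$new_tmp"
)
```

Unchanged locales print nothing, so the output lists only the locales that would appear in a summary of changes.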
Analysis for ICU only compares the results of actual sorts (no raw locale data is analyzed), but it uses multiple locales: en-US, ja-JP, zh-Hans-CN, ru-RU, fr-FR, de-DE, and es-ES. The methodology for comparing the sorted strings is identical to that used for the GNU C Library.
GLIBC Version | SUMMARY: Unicode Blocks in Diff of en_US Sort | DETAIL: Codepoints in Diff of en_US Sort | SUMMARY: Locales in Diff of OS Collation Data | DETAIL: Lines in Diff of OS Collation Data | DETAIL: Number of Locales | Operating System | AMI |
---|---|---|---|---|---|---|---|
2.11.1-0ubuntu7.10 | | | | | 281 | Ubuntu 10.04.4 LTS | ami-0baf7662 |
2.12.1-0ubuntu10.4 | 0 | 0 | et_EE | 1987 | 284 | Ubuntu 10.10 | ami-c412cead |
2.13-0ubuntu13.1 | 0 | 0 | | 0 | 287 | Ubuntu 11.04 | ami-6d9f3604 |
2.13-20ubuntu5.1 | (16 blocks) | 982 (Full Diff) | dz_BT, iso14651_t1_common, se_NO | 2555 | 295 | Ubuntu 11.10 | ami-4fad7426 |
2.15-0ubuntu10.18 | 0 | 0 | hu_HU, ug_CN | 243 | 301 | Ubuntu 12.04.5 LTS | ami-024a2614 |
2.15-0ubuntu20 | 0 | 0 | | 0 | 301 | Ubuntu 12.10 | ami-02df496b |
2.17-0ubuntu5 | 0 | 0 | la_AU (removed), tlh_GB (removed) | 0 | 299 | Ubuntu 13.04 | ami-12314d7b |
2.17-93ubuntu4 | 0 | 0 | | 0 | 299 | Ubuntu 13.10 | ami-137e4f7a |
2.19-0ubuntu6.15 | 0 | 0 | | 0 | 299 | Ubuntu 14.04.6 LTS | ami-000b3a073fc20e415 |
2.19-10ubuntu2 | 0 | 0 | | 0 | 300 | Ubuntu 14.10 | ami-12a3247a |
2.21-0ubuntu4 | (39 blocks) | 22743 (Full Diff) | | 0 | 301 | Ubuntu 15.04 | ami-04a6816e |
2.21-0ubuntu4 | 0 | 0 | | 0 | 302 | Ubuntu 15.10 | ami-002f0f6a |
2.23-0ubuntu11.3 | 0 | 0 | cs_CZ, et_EE, gd_GB, hsb_DE, sv_SE, uk_UA, ia (removed) | 4061 | 326 | Ubuntu 16.04.7 LTS | ami-0b0ea68c435eb488d |
2.24-3ubuntu2.2 | 0 | 0 | C, eo, kk_KZ, ln_CD, iw_IL (removed), pap_AN (removed) | 392728 | 328 | Ubuntu 16.10 | ami-055d7213 |
2.24-9ubuntu2.2 | 0 | 0 | C | 33 | 328 | Ubuntu 17.04 | ami-10d4f76b |
2.26-0ubuntu2.1 | Malayalam | 7 (Full Diff) | hu_HU, iso14651_t1_common, the_NP | 176 | 336 | Ubuntu 17.10 | ami-10eadd6a |
2.27-3ubuntu1.4 | (19 blocks) | 279 (Full Diff) | bs_BA, cmn_TW, cs_CZ, de_DE, et_EE, fr_CA, hr_HR, hsb_DE, hu_HU, is_IS, iso14651_t1_common, ky_KG, lb_LU, lt_LT, lv_LV, om_KE, pl_PL, sr_RS, tr_TR, uk_UA | 6523 | 345 | Ubuntu 18.04.6 LTS | ami-0279c3b3186e54acd |
2.28-0ubuntu1 | (265 blocks) | 75183 (Full Diff) | (More than 20 languages) | 94308 | 347 | Ubuntu 18.10 | ami-00191485461dfb374 |
2.29-0ubuntu2 | 0 | 0 | | 0 | 347 | Ubuntu 19.04 | ami-001084c942f9e0391 |
2.30-0ubuntu2.1 | 0 | 0 | | 0 | 347 | Ubuntu 19.10 | ami-013728cad753192a4 |
2.31-0ubuntu9.2 | 0 | 0 | | 0 | 348 | Ubuntu 20.04.3 LTS | ami-083654bd07b5da81d |
2.32-0ubuntu3 | 0 | 0 | ckb_IQ, or_IN | 738 | 348 | Ubuntu 20.10 | ami-00630aa67c689d2ab |
2.33-0ubuntu5 | 0 | 0 | | 0 | 348 | Ubuntu 21.04 | ami-02bd521ab3d72d1c6 |
2.34-0ubuntu3 | 0 | 0 | sv_SE | 2 | 348 | Ubuntu 21.10 | ami-00482f016b2410dc8 |
2.35-0ubuntu3 | 0 | 0 | C | 822 | 349 | Ubuntu 22.04 LTS | ami-0ba8e031ca32ab37f |
The filter.sh script was used to run an additional comparison between sorted lists using only strings composed entirely of ISO-8859-1 characters, across all of the above versions of Ubuntu. Note that ISO-8859-1 is a superset of ASCII, so pure ASCII was also covered by this comparison.
Glibc 2.28 is the only version which changed comparisons of any pure ASCII strings in this test. Glibc 2.27 did not change pure ASCII, but it changed ISO-8859-1 strings. No other versions of glibc made sort order changes for the ISO-8859-1 strings generated in this test.
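One way to restrict a list of UTF-8 strings to those made up entirely of ISO-8859-1 characters is sketched below. This is not the project's filter.sh. The function name `keep_latin1` is hypothetical, and the sketch assumes an `iconv` that exits non-zero when a character cannot be converted, as glibc's does.

```shell
# keep_latin1: read UTF-8 lines on stdin and print only those whose
# characters all exist in ISO-8859-1. Hypothetical helper for illustration.
keep_latin1() {
    while IFS= read -r line; do
        # iconv fails (non-zero exit) if any character has no ISO-8859-1 form.
        if printf '%s\n' "$line" | iconv -f UTF-8 -t ISO-8859-1 >/dev/null 2>&1; then
            printf '%s\n' "$line"
        fi
    done
}

# Example (file names are placeholders):
# keep_latin1 < sorted-full.txt > sorted-latin1.txt
```

Starting one `iconv` process per line keeps the sketch simple but would be slow on millions of strings, so a real filter would likely match byte ranges in a single pass instead.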
Note: Generated with an older version of scripts; not yet updated. This Red Hat table may be missing some changes.
GLIBC Version | Total Detected en_US Sort Order Changes | Unicode Blocks of Detected en_US Sort Order Changes | Total Detected Collation Data File Changes | Locales of Detected Data File Changes | Number of Locales | Operating System | AMI |
---|---|---|---|---|---|---|---|
2.5-49.el5_5.7 | | | | | 231 | Red Hat Enterprise Linux Server release 5.5 (Tikanga) | ami-eb84ed82 |
2.5-123 | 0 | | 0 | | 231 | Red Hat Enterprise Linux Server release 5.11 (Tikanga) | ami-3268da5a |
2.12-1.7.el6_0.8 | 22908 | 4 Basic Latin, 10 Latin-1 Supplement, 18 Latin Extended-A, 131 Latin Extended-B, 9 IPA Extensions, 206 Cyrillic, 16 Cyrillic Supplement, 76 Armenian, 26 Hebrew, 45 Arabic, 108 Devanagari, 86 Bengali, 79 Gurmukhi, 82 Gujarati, 58 Tamil, 93 Telugu, 86 Kannada, 82 Malayalam, 80 Sinhala, 130 Myanmar, 82 Georgian, 246 Latin Extended Additional, 1 Miscellaneous Symbols, 38 Georgian Supplement, 55 Tifinagh, 20902 CJK Unified Ideographs, 34 Arabic Presentation Forms-A, 125 Arabic Presentation Forms-B | 16282 | (More than 20 languages) | 275 | Red Hat Enterprise Linux Server release 6.0 (Santiago) | ami-09680160 |
2.12-1.212.el6_10.3 | 0 | | 42 | fi_FI | 275 | Red Hat Enterprise Linux Server release 6.10 (Santiago) | ami-0351faf7328fdb373 |
2.17-55.el7_0.5 | 107 | 107 Tibetan | 2168 | dz_BT, hu_HU, iso14651_t1_common, se_NO, ug_CN, no_NO (removed) | 300 | Red Hat Enterprise Linux Server release 7.0 (Maipo) | ami-60a1e808 |
2.17-317.el7 | 0 | | 0 | | 300 | Red Hat Enterprise Linux Server release 7.9 (Maipo) | ami-005b7876121b7244d |
2.28-42.el8_0.1 | 282167 | (Blocks not listed for this many en_US sort order changes) | 112164 | (More than 20 languages) | 341 | Red Hat Enterprise Linux release 8.0 (Ootpa) | ami-043fbed28a389c721 |
2.28-164.el8 | 0 | | 10 | C | 341 | Red Hat Enterprise Linux release 8.5 (Ootpa) | ami-06644055bed38ebd9 |
2.34-7.el9_b | 0 | | 543 | C, or_IN, sv_SE | 343 | Red Hat Enterprise Linux release 9.0 Beta (Plow) | ami-0fb33ec3ead0b8e3f |
For every legal Unicode code point, the following 91 string patterns are generated:
(Each Unicode character is substituted for the wine glass in the strings below.)
S-199: 🍷
S-200: 🍷B
S-201: 🍷O
S-202: 🍷3
S-203: 🍷.
S-204: 🍷 
S-205: 🍷様
S-206: 🍷ク
S-210: B🍷
S-211: O🍷
S-212: 3🍷
S-213: .🍷
S-214:  🍷
S-215: 様🍷
S-216: ク🍷
S-299: 🍷🍷
S-300: 🍷BB
S-301: 🍷OO
S-302: 🍷33
S-303: 🍷..
S-304: 🍷  
S-305: 🍷様様
S-306: 🍷クク
S-310: B🍷B
S-311: O🍷O
S-312: 3🍷3
S-313: .🍷.
S-314:  🍷 
S-315: 様🍷様
S-316: ク🍷ク
S-320: BB🍷
S-321: OO🍷
S-322: 33🍷
S-323: ..🍷
S-324:   🍷
S-325: 様様🍷
S-326: クク🍷
S-330: 🍷🍷B
S-331: 🍷🍷O
S-332: 🍷🍷3
S-333: 🍷🍷.
S-334: 🍷🍷 
S-335: 🍷🍷様
S-336: 🍷🍷ク
S-340: 🍷B🍷
S-341: 🍷O🍷
S-342: 🍷3🍷
S-343: 🍷.🍷
S-344: 🍷 🍷
S-345: 🍷様🍷
S-346: 🍷ク🍷
S-350: B🍷🍷
S-351: O🍷🍷
S-352: 3🍷🍷
S-353: .🍷🍷
S-354:  🍷🍷
S-355: 様🍷🍷
S-356: ク🍷🍷
S-380: 3B🍷
S-399: 🍷🍷🍷
S-400: 🍷🍷BB
S-401: 🍷🍷OO
S-402: 🍷🍷33
S-403: 🍷🍷..
S-404: 🍷🍷  
S-405: 🍷🍷様様
S-406: 🍷🍷クク
S-410: B🍷🍷B
S-411: O🍷🍷O
S-412: 3🍷🍷3
S-413: .🍷🍷.
S-414:  🍷🍷 
S-415: 様🍷🍷様
S-416: ク🍷🍷ク
S-420: BB🍷🍷
S-421: OO🍷🍷
S-422: 33🍷🍷
S-423: ..🍷🍷
S-424:   🍷🍷
S-425: 様様🍷🍷
S-426: クク🍷🍷
S-480: 3B🍷B
S-481: 3B-🍷
S-499: 🍷🍷🍷🍷
S-580: BB🍷🍷[tab]
S-581: [tab]BB🍷🍷
S-582: BB-🍷🍷
S-583: 🙂👍🍷❤™
S-584: 🍷🍷.33
S-585: 3B-🍷B
S-599: 🍷🍷🍷🍷🍷
These patterns are based on some knowledge of collation algorithms and areas where change is common or likely, informed by a review of actual changes in past versions of glibc. For example: we intentionally generate interactions between character classes like consonants, vowels, numbers, punctuation and whitespace; we generate similar strings of different lengths; we generate some strings with CJK characters only; and we include a few miscellaneous strings to add some specific extra patterns based on known past corner case changes. Some characters may behave differently when doubled so we also include combinations with letters twice in a row. While not comprehensive, this set of strings has caught a very high number of changes across many versions of glibc going back more than 10 years.
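To make the pattern scheme concrete, here is a small sketch that prints the first 15 patterns (S-199 through S-216) for a single character. It is not the project's Perl generator. The function name `gen_patterns` is hypothetical, and writing the two CJK companion characters as 様 and ク is an assumption.

```shell
# gen_patterns CHAR
# Print patterns S-199 through S-216 for one character (illustration only).
gen_patterns() {
    c=$1
    # S-199: the character on its own.
    printf '%s\n' "$c"
    # S-200 .. S-206: the character followed by one companion.
    for x in B O 3 . ' ' 様 ク; do
        printf '%s\n' "$c$x"
    done
    # S-210 .. S-216: one companion followed by the character.
    for x in B O 3 . ' ' 様 ク; do
        printf '%s\n' "$x$c"
    done
}

# Example: gen_patterns 'é'
```

The remaining patterns follow the same idea with doubled characters, doubled companions, and the handful of special cases in the S-38x, S-48x and S-58x ranges.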
The test suite generates a sorted list of all strings (around 25 million) on each system. It then uses the Unix "diff" utility to find a minimal set of differences between the sorted lists and creates reports summarizing those differences.
Each pattern is numbered, and the pattern numbers are referenced in the report produced by this code. You can see lists of exactly which strings changed, as well as summaries of which patterns appeared in which Unicode blocks.
This is fairly thorough but not completely comprehensive. Unicode collation can change the sort order based on combinations of characters; for example, some languages have characters that modify the letter before or after them. Even so, the test is helpful because it gives more perspective on how collation is changing across multiple versions of glibc.
Example:
$ dpkg -l libc6
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-======================-================-================-==================================================
ii libc6:amd64 2.27-3ubuntu1.4 amd64 GNU C Library: Shared libraries
$ ( echo 1-; echo 11; echo 1-1; echo 111; echo 1a; echo 1b; echo 1-aa; echo 1-a) | LC_COLLATE=en_US.UTF-8 sort
1-
11
1-1
111
1a
1-a
1-aa
1b
From a different version:
$ dpkg -l libc6
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-======================-================-================-==================================================
ii libc6:amd64 2.28-0ubuntu1 amd64 GNU C Library: Shared libraries
$ ( echo 1-; echo 11; echo 1-1; echo 111; echo 1a; echo 1b; echo 1-aa; echo 1-a) | LC_COLLATE=en_US.UTF-8 sort
1-
1-1
11
111
1-a
1a
1-aa
1b
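The two listings above can be compared the same way as full sorted files. The sketch below counts how many lines `diff` has to remove from the old list to reach the new one. The name `count_moved` is hypothetical, the `-d` option assumes GNU diff (where it requests a minimal set of changes), and the project's reports may count differences in another way.

```shell
# count_moved OLD NEW
# Count the lines of OLD that diff removes to turn OLD into NEW, i.e. the
# strings whose relative position changed between two sorted lists.
count_moved() {
    # In diff's default output format, removed lines start with "<".
    diff -d "$1" "$2" | grep -c '^<' || true
}

# Example (file names are placeholders):
# count_moved sorted-2.27.txt sorted-2.28.txt
```

For the eight strings shown above, saved one per line in the order each version printed them, this reports 2: moving `11` and `1a` is enough to turn the 2.27 order into the 2.28 order.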
The script table.sh generates the table above.
The data is generated by running the following command with the DNS name or IP address of a Linux server:
test-host.sh [ubuntu|rhel] $USER@$HOST
I searched public community AMIs on AWS to find old versions of linux. Older
versions of RHEL might not have an ec2-user account (I just used root), and
newer versions of RHEL might not come with perl or glibc-locale-source installed
by default. Newer versions of Ubuntu require keyboard input when running some
dpkg commands (a warning about this appears when running the test-host.sh
script).
sudo yum install perl
sudo yum install glibc-locale-source-$(rpm -q glibc --queryformat '%{version}-%{release}')