Commit 3502252

seisman, michaelgrund, and yvonnefroehlich authored
Support non-ASCII characters in ISO-8859-x charset encodings (#3310)
Co-authored-by: Michael Grund <[email protected]>
Co-authored-by: Yvonne Fröhlich <[email protected]>
1 parent e746156 commit 3502252

7 files changed: +215 -38 lines changed


doc/techref/encodings.md

+30-8
Original file line number | Diff line number | Diff line change
@@ -1,14 +1,12 @@
11
# Supported Encodings and Non-ASCII Characters
22

3-
GMT supports a number of encodings and each encoding contains a set of ASCII and non-ASCII
4-
characters. Below are some of the most common encodings and characters that are supported.
3+
GMT supports a number of encodings and each encoding contains a set of ASCII and
4+
non-ASCII characters. In PyGMT, you can use any of these ASCII and non-ASCII characters
5+
in arguments and text strings. When using non-ASCII characters in PyGMT, the easiest way
6+
is to copy and paste the character from the encoding tables below.
57

6-
In PyGMT, you can use any of these ASCII and non-ASCII characters in arguments and text
7-
strings. When using non-ASCII characters in PyGMT, the easiest way is to copy and paste
8-
the character from the tables below.
9-
10-
**Note**: The special character &#xfffd; (REPLACEMENT CHARACTER) is used to indicate that
11-
the character is not defined in the encoding.
8+
**Note**: The special character &#xfffd; (REPLACEMENT CHARACTER) is used to indicate
9+
that the character is not defined in the encoding.
1210

1311
## Adobe ISOLatin1+ Encoding
1412

@@ -106,3 +104,27 @@ the Unicode character set.
106104
| **\35x** | &#x27a8; | &#x27a9; | &#x27aa; | &#x27ab; | &#x27ac; | &#x27ad; | &#x27ae; | &#x27af; |
107105
| **\36x** | &#xfffd; | &#x27b1; | &#x27b2; | &#x27b3; | &#x27b4; | &#x27b5; | &#x27b6; | &#x27b7; |
108106
| **\37x** | &#x27b8; | &#x27b9; | &#x27ba; | &#x27bb; | &#x27bc; | &#x27bd; | &#x27be; | &#xfffd; |
107+
108+
## ISO/IEC 8859
109+
110+
GMT also supports the ISO/IEC 8859 standard for 8-bit character encodings. Refer to
111+
<https://en.wikipedia.org/wiki/ISO/IEC_8859> for descriptions of the different parts of
112+
the standard.
113+
114+
For a list of the characters in each part of the standard, refer to the following links:
115+
116+
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-1>
117+
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-2>
118+
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-3>
119+
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-4>
120+
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-5>
121+
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-6>
122+
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-7>
123+
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-8>
124+
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-9>
125+
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-10>
126+
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-11>
127+
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-13>
128+
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-14>
129+
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-15>
130+
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-16>
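
Because every part of ISO/IEC 8859 reuses the same byte values for different characters, it can help to see one code point decoded under several parts. The sketch below uses only Python's built-in codecs, not PyGMT, and the helper name `decode_octal` is hypothetical and chosen for illustration.

```python
def decode_octal(code: int, part: int) -> str:
    """Decode a single byte value using the ISO-8859-<part> codec from Python's stdlib."""
    return bytes([code]).decode(f"iso8859_{part}")


code = 0o340  # octal 340 is byte 0xE0
for part in (1, 4, 5):
    # Prints à for part 1, ā for part 4, and р (Cyrillic er) for part 5.
    print(f"ISO-8859-{part}: \\{code:03o} -> {decode_octal(code, part)}")
```

The octal form (`\340`) is the notation GMT uses in text strings, which is why the tables above are indexed by octal codes.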

pygmt/encodings.py

+22-10
@@ -1,13 +1,13 @@
11
"""
2-
Adobe character encodings supported by GMT.
2+
Character encodings supported by GMT.
33
4-
Currently, only Adobe Symbol, Adobe ZapfDingbats, and Adobe ISOLatin1+ encodings are
5-
supported.
4+
Currently, Adobe Symbol, Adobe ZapfDingbats, Adobe ISOLatin1+ and ISO-8859-x (x can be
5+
1-11, 13-16) encodings are supported. Adobe Standard encoding is not supported.
66
7-
The corresponding Unicode characters in each Adobe character encoding are generated
8-
from the mapping table and conversion script in the GMT-octal-codes
9-
(https://github.com/seisman/GMT-octal-codes) repository. Refer to that repository for
10-
details.
7+
The corresponding Unicode characters in each Adobe character encoding are generated from
8+
the mapping tables and conversion scripts in the
9+
`GMT-octal-codes repository <https://github.com/seisman/GMT-octal-codes>`__. Refer to
10+
that repository for details.
1111
1212
Some code points are undefined and are assigned with the replacement character
1313
(``\ufffd``).
@@ -16,14 +16,17 @@
1616
----------
1717
1818
- GMT-octal-codes: https://github.com/seisman/GMT-octal-codes
19-
- GMT official documentation: https://docs.generic-mapping-tools.org/dev/reference/octal-codes.html
19+
- GMT documentation: https://docs.generic-mapping-tools.org/dev/reference/octal-codes.html
2020
- Adobe Postscript Language Reference: https://www.adobe.com/jp/print/postscript/pdfs/PLRM.pdf
21-
- ISOLatin1+: https://en.wikipedia.org/wiki/PostScript_Latin_1_Encoding
21+
- Adobe ISOLatin1+: https://en.wikipedia.org/wiki/PostScript_Latin_1_Encoding
2222
- Adobe Symbol: https://en.wikipedia.org/wiki/Symbol_(typeface)
23-
- Zapf Dingbats: https://en.wikipedia.org/wiki/Zapf_Dingbats
23+
- Adobe ZapfDingbats: https://en.wikipedia.org/wiki/Zapf_Dingbats
2424
- Adobe Glyph List: https://github.com/adobe-type-tools/agl-aglfn
25+
- ISO-8859: https://en.wikipedia.org/wiki/ISO/IEC_8859
2526
"""
2627

28+
import codecs
29+
2730
# Dictionary of character mappings for different encodings.
2831
charset: dict = {}
2932

@@ -129,3 +132,12 @@
129132
strict=False,
130133
)
131134
)
135+
136+
# ISO-8859-x charsets and x can be 1-11, 13-16.
137+
for i in range(1, 17):
138+
if i == 12: # ISO-8859-12 was abandoned.
139+
continue
140+
charset[f"ISO-8859-{i}"] = {
141+
code: codecs.decode(bytes([code]), f"iso8859_{i}", errors="replace")
142+
for code in [*range(0o040, 0o200), *range(0o240, 0o400)]
143+
}
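
The loop added above can be tried on its own. The following standalone sketch rebuilds the same kind of tables outside PyGMT (the function name `build_iso8859_tables` is an assumption for illustration) and shows how an undefined code point ends up as the replacement character.

```python
import codecs


def build_iso8859_tables() -> dict:
    """Build {encoding name: {code: character}} for ISO-8859-1..11 and 13..16."""
    # Printable ASCII range plus the upper half; control ranges are skipped.
    codes = [*range(0o040, 0o200), *range(0o240, 0o400)]
    tables = {}
    for part in range(1, 17):
        if part == 12:  # ISO-8859-12 was abandoned and has no Python codec.
            continue
        tables[f"ISO-8859-{part}"] = {
            code: codecs.decode(bytes([code]), f"iso8859_{part}", errors="replace")
            for code in codes
        }
    return tables


tables = build_iso8859_tables()
print(len(tables))  # number of supported parts
print(repr(tables["ISO-8859-4"][0o340]))  # a defined code point
print(repr(tables["ISO-8859-3"][0o245]))  # undefined in ISO-8859-3, so U+FFFD
```

Passing `errors="replace"` is what turns undefined bytes into `\ufffd` instead of raising `UnicodeDecodeError`.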

pygmt/helpers/__init__.py

+1
@@ -15,6 +15,7 @@
1515
unique_name,
1616
)
1717
from pygmt.helpers.utils import (
18+
_check_encoding,
1819
_validate_data_input,
1920
args_in_kwargs,
2021
build_arg_list,

pygmt/helpers/utils.py

+123-10
@@ -115,6 +115,78 @@ def _validate_data_input(
115115
raise GMTInvalidInput("data must provide x, y, and z columns.")
116116

117117

118+
def _check_encoding(
119+
argstr: str,
120+
) -> Literal[
121+
"ascii",
122+
"ISOLatin1+",
123+
"ISO-8859-1",
124+
"ISO-8859-2",
125+
"ISO-8859-3",
126+
"ISO-8859-4",
127+
"ISO-8859-5",
128+
"ISO-8859-6",
129+
"ISO-8859-7",
130+
"ISO-8859-8",
131+
"ISO-8859-9",
132+
"ISO-8859-10",
133+
"ISO-8859-11",
134+
"ISO-8859-13",
135+
"ISO-8859-14",
136+
"ISO-8859-15",
137+
"ISO-8859-16",
138+
]:
139+
"""
140+
Check the charset encoding of a string.
141+
142+
All characters in the string must be in the same charset encoding, otherwise the
143+
default ``ISOLatin1+`` encoding is returned. Characters in the Adobe Symbol and
144+
ZapfDingbats encodings are also checked because they're independent of the choice of
145+
encodings.
146+
147+
Parameters
148+
----------
149+
argstr
150+
The string to be checked.
151+
152+
Returns
153+
-------
154+
encoding
155+
The encoding of the string.
156+
157+
Examples
158+
--------
159+
>>> _check_encoding("123ABC+-?!") # ASCII characters only
160+
'ascii'
161+
>>> _check_encoding("12AB±β①②") # Characters in ISOLatin1+
162+
'ISOLatin1+'
163+
>>> _check_encoding("12ABāáâãäåβ①②") # Characters in ISO-8859-4
164+
'ISO-8859-4'
165+
>>> _check_encoding("12ABŒā") # Mix characters in ISOLatin1+ (Œ) and ISO-8859-4 (ā)
166+
'ISOLatin1+'
167+
>>> _check_encoding("123AB中文") # Characters not in any charset encoding
168+
'ISOLatin1+'
169+
"""
170+
# Return "ascii" if the string only contains ASCII characters.
171+
if all(32 <= ord(c) <= 126 for c in argstr):
172+
return "ascii"
173+
# Loop through all supported encodings and check if all characters in the string
174+
# are in the charset of the encoding. If all characters are in the charset, return
175+
# the encoding. The ISOLatin1+ encoding is checked first because it is the default
176+
# and most common encoding.
177+
adobe_chars = set(charset["Symbol"].values()) | set(
178+
charset["ZapfDingbats"].values()
179+
)
180+
for encoding in ["ISOLatin1+"] + [f"ISO-8859-{i}" for i in range(1, 17)]:
181+
if encoding == "ISO-8859-12": # ISO-8859-12 was abandoned. Skip it.
182+
continue
183+
if all(c in (set(charset[encoding].values()) | adobe_chars) for c in argstr):
184+
return encoding # type: ignore[return-value]
185+
# Return the "ISOLatin1+" encoding if the string contains characters from multiple
186+
# charset encodings or contains characters that are not in any charset encoding.
187+
return "ISOLatin1+"
188+
189+
118190
def data_kind(
119191
data: Any = None, required: bool = True
120192
) -> Literal["arg", "file", "geojson", "grid", "image", "matrix", "vectors"]:
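
The detection strategy in `_check_encoding` can be approximated without PyGMT. The sketch below is a simplified stand-in, not the committed function: it assumes only the ISO-8859-x tables (built from Python's codecs) and leaves out the Adobe ISOLatin1+, Symbol, and ZapfDingbats tables, so a string such as `"±"` would be reported as ISO-8859-1 here rather than ISOLatin1+. The name `guess_encoding` is hypothetical.

```python
import codecs

PARTS = [part for part in range(1, 17) if part != 12]  # ISO-8859-12 was abandoned


def iso8859_chars(part: int) -> set:
    """Return the set of characters defined in ISO-8859-<part> (codes 32-127, 160-255)."""
    chars = set()
    for code in [*range(0o040, 0o200), *range(0o240, 0o400)]:
        char = codecs.decode(bytes([code]), f"iso8859_{part}", errors="replace")
        if char != "\ufffd":  # skip undefined code points
            chars.add(char)
    return chars


def guess_encoding(text: str) -> str:
    """Return the first encoding whose charset contains every character in ``text``."""
    if all(32 <= ord(c) <= 126 for c in text):
        return "ascii"
    for part in PARTS:
        charset = iso8859_chars(part)
        if all(c in charset for c in text):
            return f"ISO-8859-{part}"
    # Mixed or unsupported characters: fall back to the default, as the commit does.
    return "ISOLatin1+"


print(guess_encoding("123ABC+-?!"))
print(guess_encoding("12ABāáâãäå"))
print(guess_encoding("Привет"))
print(guess_encoding("123AB中文"))
```

Since the first matching encoding wins, the order of the candidate list matters. The committed code puts ISOLatin1+ first for that reason.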
@@ -192,17 +264,41 @@ def data_kind(
192264
return kind
193265

194266

195-
def non_ascii_to_octal(argstr: str) -> str:
267+
def non_ascii_to_octal(
268+
argstr: str,
269+
encoding: Literal[
270+
"ascii",
271+
"ISOLatin1+",
272+
"ISO-8859-1",
273+
"ISO-8859-2",
274+
"ISO-8859-3",
275+
"ISO-8859-4",
276+
"ISO-8859-5",
277+
"ISO-8859-6",
278+
"ISO-8859-7",
279+
"ISO-8859-8",
280+
"ISO-8859-9",
281+
"ISO-8859-10",
282+
"ISO-8859-11",
283+
"ISO-8859-13",
284+
"ISO-8859-14",
285+
"ISO-8859-15",
286+
"ISO-8859-16",
287+
] = "ISOLatin1+",
288+
) -> str:
196289
r"""
197290
Translate non-ASCII characters to their corresponding octal codes.
198291
199-
Currently, only characters in the ISOLatin1+ charset and Symbol/ZapfDingbats fonts
200-
are supported.
292+
Currently, only non-ASCII characters in the Adobe ISOLatin1+, Adobe Symbol, Adobe
293+
ZapfDingbats, and ISO-8859-x (x can be 1-11 or 13-16) encodings are supported.
294+
The Adobe Standard encoding is not supported yet.
201295
202296
Parameters
203297
----------
204298
argstr
205299
The string to be translated.
300+
encoding
301+
The encoding of characters in the string.
206302
207303
Returns
208304
-------
@@ -219,9 +315,11 @@ def non_ascii_to_octal(argstr: str) -> str:
219315
'@%34%\\041@%%@%34%\\176@%%@%34%\\241@%%@%34%\\376@%%'
220316
>>> non_ascii_to_octal("ABC ±120° DEF α ♥")
221317
'ABC \\261120\\260 DEF @~\\141@~ @%34%\\252@%%'
318+
>>> non_ascii_to_octal("12ABāáâãäåβ①②", encoding="ISO-8859-4")
319+
'12AB\\340\\341\\342\\343\\344\\345@~\\142@~@%34%\\254@%%@%34%\\255@%%'
222320
""" # noqa: RUF002
223-
# Return the string if it only contains printable ASCII characters from 32 to 126.
224-
if all(32 <= ord(c) <= 126 for c in argstr):
321+
# Return the input string if it only contains ASCII characters.
322+
if encoding == "ascii" or all(32 <= ord(c) <= 126 for c in argstr):
225323
return argstr
226324

227325
# Dictionary mapping non-ASCII characters to octal codes
@@ -232,15 +330,15 @@ def non_ascii_to_octal(argstr: str) -> str:
232330
mapping.update(
233331
{c: f"@%34%\\{i:03o}@%%" for i, c in charset["ZapfDingbats"].items()}
234332
)
235-
# Adobe ISOLatin1+ charset. Put at the end.
236-
mapping.update({c: f"\\{i:03o}" for i, c in charset["ISOLatin1+"].items()})
333+
# ISOLatin1+ or ISO-8859-x charset.
334+
mapping.update({c: f"\\{i:03o}" for i, c in charset[encoding].items()})
237335

238336
# Remove any printable characters
239337
mapping = {k: v for k, v in mapping.items() if k not in string.printable}
240338
return argstr.translate(str.maketrans(mapping))
241339

242340

243-
def build_arg_list(
341+
def build_arg_list( # noqa: PLR0912
244342
kwdict: dict[str, Any],
245343
confdict: dict[str, str] | None = None,
246344
infile: str | pathlib.PurePath | Sequence[str | pathlib.PurePath] | None = None,
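
The character-to-octal translation in `non_ascii_to_octal` reduces to a reverse lookup plus `str.translate`. The following is a minimal sketch for the ISO-8859-x case only, assuming the tables come from Python's codecs. It omits the `@~...@~` and `@%34%...@%%` wrappers that the commit uses for Symbol and ZapfDingbats characters, and the name `iso8859_to_octal` is hypothetical.

```python
import codecs


def iso8859_to_octal(text: str, part: int) -> str:
    """Replace non-ASCII characters defined in ISO-8859-<part> with \\ooo octal codes."""
    lookup = {}
    for code in range(0o240, 0o400):  # upper half only; ASCII is left untouched
        char = codecs.decode(bytes([code]), f"iso8859_{part}", errors="replace")
        if char != "\ufffd":  # skip undefined code points
            lookup[char] = f"\\{code:03o}"
    # str.maketrans accepts a dict mapping single characters to replacement strings.
    return text.translate(str.maketrans(lookup))


print(iso8859_to_octal("12ABāáâãäå", part=4))
print(iso8859_to_octal("ABC ±120°", part=1))
```

Characters that are not in the chosen table pass through unchanged, so the encoding should be detected before the conversion.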
@@ -310,6 +408,10 @@ def build_arg_list(
310408
... )
311409
... )
312410
['f1.txt', 'f2.txt', '-A0', '-B', '--FORMAT_DATE_MAP=o dd', '->out.txt']
411+
>>> build_arg_list(dict(B="12ABāβ①②"))
412+
['-B12AB\\340@~\\142@~@%34%\\254@%%@%34%\\255@%%', '--PS_CHAR_ENCODING=ISO-8859-4']
413+
>>> build_arg_list(dict(B="12ABāβ①②"), confdict=dict(PS_CHAR_ENCODING="ISO-8859-5"))
414+
['-B12AB\\340@~\\142@~@%34%\\254@%%@%34%\\255@%%', '--PS_CHAR_ENCODING=ISO-8859-5']
313415
>>> print(build_arg_list(dict(R="1/2/3/4", J="X4i", watre=True)))
314416
Traceback (most recent call last):
315417
...
@@ -324,11 +426,22 @@ def build_arg_list(
324426
elif value is True:
325427
gmt_args.append(f"-{key}")
326428
elif is_nonstr_iter(value):
327-
gmt_args.extend(non_ascii_to_octal(f"-{key}{_value}") for _value in value)
429+
gmt_args.extend(f"-{key}{_value}" for _value in value)
328430
else:
329-
gmt_args.append(non_ascii_to_octal(f"-{key}{value}"))
431+
gmt_args.append(f"-{key}{value}")
432+
433+
# Convert non-ASCII characters (if any) in the arguments to octal codes
434+
encoding = _check_encoding("".join(gmt_args))
435+
if encoding != "ascii":
436+
gmt_args = [non_ascii_to_octal(arg, encoding=encoding) for arg in gmt_args]
330437
gmt_args = sorted(gmt_args)
331438

439+
# Set --PS_CHAR_ENCODING=encoding if necessary
440+
if encoding not in {"ascii", "ISOLatin1+"} and not (
441+
confdict and "PS_CHAR_ENCODING" in confdict
442+
):
443+
gmt_args.append(f"--PS_CHAR_ENCODING={encoding}")
444+
332445
if confdict:
333446
gmt_args.extend(f"--{key}={value}" for key, value in confdict.items())
334447

pygmt/src/text.py

+21-10
@@ -6,6 +6,7 @@
66
from pygmt.clib import Session
77
from pygmt.exceptions import GMTInvalidInput
88
from pygmt.helpers import (
9+
_check_encoding,
910
build_arg_list,
1011
data_kind,
1112
fmt_docstring,
@@ -59,13 +60,12 @@ def text_( # noqa: PLR0912
5960
- ``x``/``y``, and ``text``
6061
- ``position`` and ``text``
6162
62-
The text strings passed via the ``text`` parameter can contain ASCII
63-
characters and non-ASCII characters defined in the ISOLatin1+ encoding
64-
(i.e., IEC_8859-1), and the Symbol and ZapfDingbats character sets.
65-
See :gmt-docs:`reference/octal-codes.html` for the full list of supported
66-
non-ASCII characters.
63+
The text strings passed via the ``text`` parameter can contain ASCII characters and
64+
non-ASCII characters defined in the Adobe ISOLatin1+, Adobe Symbol, Adobe
65+
ZapfDingbats and ISO-8859-x (x can be 1-11, 13-16) encodings. Refer to
66+
:doc:`techref/encodings` for the full list of supported non-ASCII characters.
6767
68-
Full option list at :gmt-docs:`text.html`
68+
Full option list at :gmt-docs:`text.html`.
6969
7070
{aliases}
7171
@@ -226,13 +226,24 @@ def text_( # noqa: PLR0912
226226
kwargs["t"] = ""
227227

228228
# Append text at last column. Text must be passed in as str type.
229+
confdict = {}
229230
if kind == "vectors":
230-
extra_arrays.append(
231-
np.vectorize(non_ascii_to_octal)(np.atleast_1d(text).astype(str))
232-
)
231+
text = np.atleast_1d(text).astype(str)
232+
encoding = _check_encoding("".join(text))
233+
if encoding != "ascii":
234+
text = np.vectorize(non_ascii_to_octal, excluded={"encoding"})(
235+
text, encoding=encoding
236+
)
237+
extra_arrays.append(text)
238+
239+
if encoding not in {"ascii", "ISOLatin1+"}:
240+
confdict = {"PS_CHAR_ENCODING": encoding}
233241

234242
with Session() as lib:
235243
with lib.virtualfile_in(
236244
check_kind="vector", data=textfiles, x=x, y=y, extra_arrays=extra_arrays
237245
) as vintbl:
238-
lib.call_module(module="text", args=build_arg_list(kwargs, infile=vintbl))
246+
lib.call_module(
247+
module="text",
248+
args=build_arg_list(kwargs, infile=vintbl, confdict=confdict),
249+
)
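
In `text_`, all strings are joined before the encoding check because a single `PS_CHAR_ENCODING` value applies to the whole call. The sketch below illustrates that effect with a different technique from the commit: it tries to encode the joined text with Python's codecs instead of looking characters up in PyGMT's tables. The name `common_iso8859` is hypothetical.

```python
from typing import Optional


def common_iso8859(strings: list) -> Optional[str]:
    """Return the first ISO-8859-x encoding that can represent every string, if any."""
    joined = "".join(strings)
    if joined.isascii():
        return "ascii"
    for part in (p for p in range(1, 17) if p != 12):
        try:
            joined.encode(f"iso8859_{part}")
        except UnicodeEncodeError:
            continue  # at least one character is missing from this part
        return f"ISO-8859-{part}"
    return None  # no single part covers all strings


print(common_iso8859(["ā"]))
print(common_iso8859(["ж"]))
print(common_iso8859(["ā", "ж"]))  # each string fits one part, but not the same one
```

When no common encoding exists, the committed code falls back to ISOLatin1+, and characters outside that charset are left unconverted.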
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,5 @@
1+
outs:
2+
- md5: a0f35a1d58c95e6589c7397e7660e946
3+
size: 17089
4+
hash: md5
5+
path: test_text_nonascii_iso8859.png

pygmt/tests/test_text.py

+13
@@ -434,3 +434,16 @@ def test_text_quotation_marks():
434434
fig.basemap(projection="X4c/2c", region=[0, 4, 0, 2], frame=0)
435435
fig.text(x=2, y=1, text='\\234 ‘ ’ " “ ”', font="20p") # noqa: RUF001
436436
return fig
437+
438+
439+
@pytest.mark.mpl_image_compare
440+
def test_text_nonascii_iso8859():
441+
"""
442+
Test passing text strings with non-ASCII characters in ISO-8859-4 encoding.
443+
"""
444+
fig = Figure()
445+
fig.basemap(region=[0, 10, 0, 10], projection="X10c", frame=["WSEN+tAāáâãäåB"])
446+
fig.text(position="TL", text="position-text:1ÉĘËĖ2")
447+
fig.text(x=1, y=1, text="xytext:1éęëė2")
448+
fig.text(x=[5, 5], y=[3, 5], text=["xytext1:ųúûüũūαζ∆❡", "xytext2:íîī∑π∇✉"])
449+
return fig
