Automatically detect character encoding of YAML files and ignore files #630

Open
wants to merge 6 commits into
base: master
9 changes: 8 additions & 1 deletion .github/workflows/ci.yaml
@@ -57,6 +57,13 @@ jobs:
       - run: pip install .
       # https://github.com/AndreMiras/coveralls-python-action/issues/18
       - run: echo -e "[run]\nrelative_files = True" > .coveragerc
-      - run: coverage run -m unittest discover
+      - run: >-
+          python
+          -X warn_default_encoding
+          -W error::EncodingWarning
+          -m coverage
+          run
+          -m unittest
+          discover
       - name: Coveralls
         uses: AndreMiras/coveralls-python-action@develop
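
The new CI command turns Python's opt-in `EncodingWarning` into an error, so any test that opens a text file without an explicit `encoding` fails. The sketch below illustrates that effect by starting child interpreters with the same two flags. The helper name `exit_code_under_strict_flags` and the file name `example.yaml` are made up for this illustration, and the warning only exists on Python 3.10 or newer.

```python
import os
import subprocess
import sys
import tempfile

# The same interpreter flags that the CI step passes to python.
FLAGS = ['-X', 'warn_default_encoding', '-W', 'error::EncodingWarning']


def exit_code_under_strict_flags(snippet):
    """Run snippet in a child interpreter started with FLAGS (hypothetical helper)."""
    completed = subprocess.run(
        [sys.executable, *FLAGS, '-c', snippet],
        capture_output=True,
        check=False,
    )
    return completed.returncode


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'example.yaml')
    with open(path, 'wb') as f:
        f.write(b'key: value\n')

    # No encoding argument: relies on the locale's default encoding.
    implicit = exit_code_under_strict_flags(f'open({path!r}).close()')
    # Explicit encoding: nothing to warn about.
    explicit = exit_code_under_strict_flags(
        f"open({path!r}, encoding='utf_8').close()"
    )
```

On Python 3.10 or newer, `implicit` is expected to be non-zero because the warning is raised as an exception, while `explicit` stays at zero.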
32 changes: 32 additions & 0 deletions docs/character_encoding_override.rst
@@ -0,0 +1,32 @@
Character Encoding Override
===========================

When yamllint reads a file, it will try to automatically detect that file’s
Owner

Idea:

Suggested change
- When yamllint reads a file, it will try to automatically detect that file’s
+ When yamllint reads a file (either configuration or a file to lint), it will try
+ to automatically detect that file’s …

(this goes along with my next comment ↓)

character encoding. In order for the automatic detection to work properly,
files must follow these two rules (see `this section of the YAML specification
for details <https://yaml.org/spec/1.2.2/#52-character-encodings>`_):

* The file must be encoded in UTF-8, UTF-16 or UTF-32.

* The file must begin with either a byte order mark or an ASCII character.
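
The two rules above are what make detection from the first few bytes possible. The following is a minimal sketch based on the table in the linked section of the YAML specification; it is not taken from yamllint's own decoder, and the function name `detect_encoding` is an assumption made for this example.

```python
import codecs


def detect_encoding(data):
    """Guess a Python codec name from the first bytes of a YAML stream.

    Illustrative sketch only; the checks follow the YAML 1.2.2 table of
    byte patterns, with UTF-32 tested before UTF-16 because the UTF-32LE
    BOM starts with the UTF-16LE BOM.
    """
    if data.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
        return 'utf_32'  # this decoder reads and strips the BOM
    if len(data) >= 4 and data[:3] == b'\x00\x00\x00':
        return 'utf_32_be'  # 00 00 00 xx: ASCII first character
    if len(data) >= 4 and data[1:4] == b'\x00\x00\x00':
        return 'utf_32_le'  # xx 00 00 00: ASCII first character
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return 'utf_16'  # this decoder reads and strips the BOM
    if len(data) >= 2 and data[0] == 0:
        return 'utf_16_be'  # 00 xx: ASCII first character
    if len(data) >= 2 and data[1] == 0:
        return 'utf_16_le'  # xx 00: ASCII first character
    if data.startswith(codecs.BOM_UTF8):
        return 'utf_8_sig'  # strips the UTF-8 BOM
    return 'utf_8'  # default when nothing else matches


raw = 'key: value\n'.encode('utf_16_le')
decoded = raw.decode(detect_encoding(raw))
```

A file that starts with a non-ASCII character and has no byte order mark falls through to the UTF-8 default in this sketch, which is why both rules matter.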

Previous versions of yamllint did not try to autodetect the character encoding
of files. Instead, they assumed that files used the current locale’s character
encoding. This meant that older versions of yamllint would
sometimes correctly decode files that didn’t follow those two rules. For the
sake of backwards compatibility, the current version of yamllint allows you to
disable automatic character encoding detection by setting the
``YAMLLINT_FILE_ENCODING`` environment variable. If you set the
``YAMLLINT_FILE_ENCODING`` environment variable to `the name of one of
Python’s standard character encodings
<https://docs.python.org/3/library/codecs.html#standard-encodings>`_, then
yamllint will use that character encoding instead of trying to autodetect the
character encoding.
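
The override described above amounts to checking the environment variable before any detection happens. The sketch below shows that control flow only; `decode_with_override` and its `environ` parameter are hypothetical names, and the `utf_8_sig` fallback is a stand-in for the real autodetection step, not yamllint's behavior.

```python
import os


def decode_with_override(data, environ=os.environ):
    """Decode bytes, honoring YAMLLINT_FILE_ENCODING when it is set.

    Hypothetical helper for illustration only.
    """
    override = environ.get('YAMLLINT_FILE_ENCODING')
    if override is not None:
        # The value must be a codec name that Python knows, e.g. 'cp1252'.
        return data.decode(override)
    # Stand-in for automatic detection: assume UTF-8 with an optional BOM.
    return data.decode('utf_8_sig')


legacy_bytes = 'clé: valeur\n'.encode('latin_1')
text = decode_with_override(
    legacy_bytes, {'YAMLLINT_FILE_ENCODING': 'latin_1'}
)
```

Passing the environment as a parameter is a choice made here so the example can be exercised without changing the process environment.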

The ``YAMLLINT_FILE_ENCODING`` environment variable should only be used as a
stopgap solution. If you need to use ``YAMLLINT_FILE_ENCODING``, then you
should really update your YAML files so that their character encoding can
automatically be detected, or else you may run into compatibility problems.
Future versions of yamllint may remove support for the
``YAMLLINT_FILE_ENCODING`` environment variable, and other YAML processors may
misinterpret your YAML files.
15 changes: 15 additions & 0 deletions docs/configuration.rst
@@ -228,6 +228,21 @@ or:

.. note:: However, this is mutually exclusive with the ``ignore`` key.

.. note:: Files on the ``ignore-from-file`` list should use either UTF-8,
   UTF-16 or UTF-32. Additionally, they should start with either an ASCII
   character or a byte order mark.

   If you have an ignore file that doesn’t follow those two rules, then you can
   set the ``YAMLLINT_FILE_ENCODING`` environment variable to the name of the
   character encoding that you want yamllint to use for ignore files.
   Specifically, ``YAMLLINT_FILE_ENCODING`` should be set to `the name of one
   of Python’s standard character encodings
   <https://docs.python.org/3/library/codecs.html#standard-encodings>`_. Please
   note, this should only be used as a temporary solution in order to make it
   easier to migrate from older versions of yamllint to newer versions of
   yamllint. See :doc:`Character Encoding Override
   <character_encoding_override>` for details.

Comment on lines +231 to +245
Owner

Is it worth mentioning here? Why not, you decide!
By the way, the same also applies to the configuration file .yamllint.

If you keep it, maybe we can simplify it like this?

.. note:: Files on the ``ignore-from-file`` list should use either UTF-8, UTF-16
   or UTF-32. See :doc:`Character Encoding Override
   <character_encoding_override>` for details and workarounds.

If you need to know the exact list of files that yamllint would process,
without really linting them, you can use ``--list-files``:

1 change: 1 addition & 0 deletions docs/index.rst
@@ -27,3 +27,4 @@ Table of contents
   development
   text_editors
   integration
   character_encoding_override
188 changes: 154 additions & 34 deletions tests/common.py
@@ -1,4 +1,5 @@
# Copyright (C) 2016 Adrien Vergé
# Copyright (C) 2023–2025 Jason Yundt
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
@@ -13,20 +14,173 @@
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

import codecs
import contextlib
from io import StringIO
import os
import shutil
import sys
import tempfile
import unittest
import warnings
from codecs import CodecInfo

import yaml

from yamllint import linter
from yamllint.config import YamlLintConfig


# Encoding related stuff:
UTF_CODECS = (
    'utf_32_be',
    'utf_32_be_sig',
    'utf_32_le',
    'utf_32_le_sig',
    'utf_16_be',
    'utf_16_be_sig',
    'utf_16_le',
    'utf_16_le_sig',
    'utf_8',
    'utf_8_sig'
)


def encode_utf_32_be_sig(obj):
    return (
        codecs.BOM_UTF32_BE + codecs.encode(obj, 'utf_32_be', 'strict'),
        len(obj)
    )


def encode_utf_32_le_sig(obj):
    return (
        codecs.BOM_UTF32_LE + codecs.encode(obj, 'utf_32_le', 'strict'),
        len(obj)
    )


def encode_utf_16_be_sig(obj):
    return (
        codecs.BOM_UTF16_BE + codecs.encode(obj, 'utf_16_be', 'strict'),
        len(obj)
    )


def encode_utf_16_le_sig(obj):
    return (
        codecs.BOM_UTF16_LE + codecs.encode(obj, 'utf_16_le', 'strict'),
        len(obj)
    )


test_codec_infos = {
    'utf_32_be_sig':
        CodecInfo(encode_utf_32_be_sig, codecs.getdecoder('utf_32')),
    'utf_32_le_sig':
        CodecInfo(encode_utf_32_le_sig, codecs.getdecoder('utf_32')),
    'utf_16_be_sig':
        CodecInfo(encode_utf_16_be_sig, codecs.getdecoder('utf_16')),
    'utf_16_le_sig':
        CodecInfo(encode_utf_16_le_sig, codecs.getdecoder('utf_16')),
}


def register_test_codecs():
    codecs.register(test_codec_infos.get)


def unregister_test_codecs():
    if sys.version_info >= (3, 10, 0):
        codecs.unregister(test_codec_infos.get)
    else:
        warnings.warn(
            "This version of Python doesn’t allow us to unregister codecs.",
            stacklevel=1
        )
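
For readers unfamiliar with codec search functions, here is a small standalone sketch of the same registration pattern. It uses a made-up codec name, `demo_utf_16_le_sig`, and a made-up encoder so that it cannot clash with the test codecs in this diff; only `codecs.register`, `codecs.unregister`, `codecs.CodecInfo`, and `codecs.getdecoder` are real library calls.

```python
import codecs


def encode_utf_16_le_with_bom(text, errors='strict'):
    """Encode as UTF-16LE and prepend the BOM (illustrative encoder)."""
    data = codecs.BOM_UTF16_LE + codecs.encode(text, 'utf_16_le', errors)
    # An encoder returns the output bytes and the number of characters consumed.
    return data, len(text)


DEMO_CODECS = {
    'demo_utf_16_le_sig': codecs.CodecInfo(
        encode_utf_16_le_with_bom,
        codecs.getdecoder('utf_16'),  # reads the BOM and strips it
        name='demo_utf_16_le_sig',
    ),
}


def search(name):
    """Search function: return a CodecInfo, or None for unknown names."""
    return DEMO_CODECS.get(name)


codecs.register(search)
encoded = 'a: 1'.encode('demo_utf_16_le_sig')
decoded = encoded.decode('demo_utf_16_le_sig')

# codecs.unregister() only exists on Python 3.10 and newer.
if hasattr(codecs, 'unregister'):
    codecs.unregister(search)
```

Python asks each registered search function in turn, so returning `None` for names the function does not own leaves the built-in codecs untouched.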


def is_test_codec(codec):
    return codec in test_codec_infos.keys()


def test_codec_built_in_equivalent(test_codec):
    return_value = test_codec
    for suffix in ('_sig', '_be', '_le'):
        return_value = return_value.replace(suffix, '')
    return return_value


def uses_bom(codec):
    for suffix in ('_32', '_16', '_sig'):
        if codec.endswith(suffix):
            return True
    return False


def encoding_detectable(string, codec):
    """
    Returns True if encoding can be detected after string is encoded

    Encoding detection only works if you’re using a BOM or the first character
    is ASCII. See yamllint.decoder.auto_decode()’s docstring.
    """
    return uses_bom(codec) or (len(string) > 0 and string[0].isascii())
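
The suffix check in `uses_bom()` mirrors how Python's built-in codecs behave: the endianness-neutral names and the `_sig` variant write a byte order mark when encoding, while the `_le` and `_be` names do not. A quick way to see this, using only the standard library (the helper `starts_with_bom` is a name invented for this sketch):

```python
import codecs

ALL_BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF32_LE,
)


def starts_with_bom(data):
    """Return True if data begins with any UTF byte order mark."""
    return data.startswith(ALL_BOMS)


text = 'key: value\n'
bom_written = {
    codec: starts_with_bom(text.encode(codec))
    for codec in (
        'utf_8', 'utf_8_sig',
        'utf_16', 'utf_16_le', 'utf_16_be',
        'utf_32', 'utf_32_le', 'utf_32_be',
    )
}
```

This is why the diff defines its own `*_be_sig` and `*_le_sig` test codecs: the standard library has no name that combines a fixed endianness with a BOM for UTF-16 or UTF-32.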


# Workspace related stuff:
class Blob:
    def __init__(self, text, encoding):
        self.text = text
        self.encoding = encoding


def build_temp_workspace(files):
    tempdir = tempfile.mkdtemp(prefix='yamllint-tests-')

    for path, content in files.items():
        path = os.fsencode(os.path.join(tempdir, path))
        if not os.path.exists(os.path.dirname(path)):
            os.makedirs(os.path.dirname(path))

        if isinstance(content, list):
            os.mkdir(path)
        elif isinstance(content, str) and content.startswith('symlink://'):
            os.symlink(content[10:], path)
        else:
            if isinstance(content, Blob):
                content = content.text.encode(content.encoding)
            elif isinstance(content, str):
                content = content.encode('utf_8')
            with open(path, 'wb') as f:
                f.write(content)

    return tempdir


@contextlib.contextmanager
def temp_workspace(files):
    """Provide a temporary workspace that is automatically cleaned up."""
    backup_wd = os.getcwd()
    wd = build_temp_workspace(files)

    try:
        os.chdir(wd)
        yield
    finally:
        os.chdir(backup_wd)
        shutil.rmtree(wd)


def temp_workspace_with_files_in_many_codecs(path_template, text):
    workspace = {}
    for codec in UTF_CODECS:
        if encoding_detectable(text, codec):
            workspace[path_template.format(codec)] = Blob(text, codec)
    return workspace
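
To make the idea of one text stored under many encodings concrete, here is a self-contained sketch that does not import anything from this diff. The helper `write_in_codecs`, the directory prefix, and the file naming are invented for the illustration, and only built-in codec names are used.

```python
import os
import tempfile


def write_in_codecs(text, codec_names, directory):
    """Write text once per codec and return {codec: path} (hypothetical helper)."""
    paths = {}
    for codec in codec_names:
        path = os.path.join(directory, f'{codec}.yaml')
        # Binary mode keeps full control over the bytes that reach the disk.
        with open(path, 'wb') as f:
            f.write(text.encode(codec))
        paths[codec] = path
    return paths


with tempfile.TemporaryDirectory(prefix='yamllint-demo-') as directory:
    paths = write_in_codecs(
        'key: value\n', ('utf_8', 'utf_16_le', 'utf_32_be'), directory
    )
    sizes = {codec: os.path.getsize(path) for codec, path in paths.items()}
```

The same 11 characters occupy 11, 22, and 44 bytes respectively, which shows that each file really carries a different byte layout for a linter to decode.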


# Miscellaneous stuff:
class RuleTestCase(unittest.TestCase):
    def build_fake_config(self, conf):
        if conf is None:
@@ -81,37 +235,3 @@ def __exit__(self, *exc_info):
    @property
    def returncode(self):
        return self._raises_ctx.exception.code
-
-
-def build_temp_workspace(files):
-    tempdir = tempfile.mkdtemp(prefix='yamllint-tests-')
-
-    for path, content in files.items():
-        path = os.path.join(tempdir, path).encode('utf-8')
-        if not os.path.exists(os.path.dirname(path)):
-            os.makedirs(os.path.dirname(path))
-
-        if isinstance(content, list):
-            os.mkdir(path)
-        elif isinstance(content, str) and content.startswith('symlink://'):
-            os.symlink(content[10:], path)
-        else:
-            mode = 'wb' if isinstance(content, bytes) else 'w'
-            with open(path, mode) as f:
-                f.write(content)
-
-    return tempdir
-
-
-@contextlib.contextmanager
-def temp_workspace(files):
-    """Provide a temporary workspace that is automatically cleaned up."""
-    backup_wd = os.getcwd()
-    wd = build_temp_workspace(files)
-
-    try:
-        os.chdir(wd)
-        yield
-    finally:
-        os.chdir(backup_wd)
-        shutil.rmtree(wd)