
Fix BufferedTokenizer to properly resume after a buffer full condition respecting the encoding of the input string #16968

Open
wants to merge 9 commits into base: main
Conversation

andsel
Contributor

@andsel andsel commented Jan 28, 2025

Release notes

[rn:skip]

What does this PR do?

This is a second take at fixing the processing of tokens from the tokenizer after a buffer full error. The first attempt (#16482) was rolled back because of the encoding error reported in #16694: it failed to return the tokens in the same encoding as the input.
This PR does a couple of things:

  • accumulates the tokens, so that after a buffer full condition it can resume with the next token after the offending one.
  • respects the encoding of the input string. It uses the concat method instead of addAll, which avoids converting RubyString to String and back to RubyString, and when returning the head StringBuilder it enforces the input charset on the result.
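The two behaviours above can be illustrated with a plain-Ruby sketch. The real fix lives in the Java class BufferedTokenizerExt, so the class and method names below are purely hypothetical, and the discard-on-overflow details differ from the production code:

```ruby
# Hypothetical sketch of an encoding-preserving tokenizer that resumes
# cleanly after a buffer full condition; NOT the actual Logstash code.
class SketchBufferedTokenizer
  def initialize(delimiter = "\n", size_limit = nil)
    @delimiter  = delimiter
    @size_limit = size_limit
    @input      = +""    # mutable accumulator
    @dropping   = false  # true while skipping the tail of an oversized token
  end

  def extract(data)
    @input.force_encoding(data.encoding) if @input.empty?
    @input << data                     # concat: no encoding round-trip
    entities = @input.split(@delimiter, -1)
    @input   = entities.pop || @input  # remainder keeps its encoding
    if @dropping
      if entities.empty?
        # Still inside the oversized token: discard its continuation.
        @input = +"".force_encoding(data.encoding)
        return []
      end
      entities.shift                   # drop the bad token's last fragment
      @dropping = false                # ...and resume with the next tokens
    end
    if @size_limit && @input.bytesize > @size_limit
      @input    = +"".force_encoding(data.encoding)
      @dropping = true
      raise "input buffer full"
    end
    entities
  end

  def flush
    @input
  end
end
```

After the overflow raises, the next call with a delimiter-terminated chunk skips the remainder of the offending token and yields the following ones, all still carrying the input's encoding.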

Why is it important/What is the impact to the user?

Permits effective use of the tokenizer even in contexts where a line is bigger than the configured limit.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files (and/or docker env variables)
  • I have added tests that prove my fix is effective or that my feature works

Author's Checklist

  • [ ]

How to test this PR locally

The test plan has two sides:

How to test the encoding is respected

Startup a REPL with Logstash and exercise the tokenizer:

$> bin/logstash -i irb
> buftok = FileWatch::BufferedTokenizer.new
> buftok.extract("\xA3".force_encoding("ISO8859-1")); buftok.flush.bytes
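The point of the REPL check is that the flushed bytes come back untranscoded. The same invariant can be sanity-checked in plain Ruby, independently of the FileWatch::BufferedTokenizer class (an ordinary String stands in for the tokenizer's accumulator here):

```ruby
# An encoding-preserving accumulator must hand back the raw Latin-1 byte.
buffer = +"".force_encoding(Encoding::ISO_8859_1)
buffer << "\xA3".force_encoding("ISO-8859-1")   # accumulate, no transcoding
raise unless buffer.bytes == [0xA3]
raise unless buffer.encoding == Encoding::ISO_8859_1
# A buggy path that round-trips through UTF-8 would change the bytes:
raise unless buffer.encode("UTF-8").bytes == [0xC2, 0xA3]
```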

or use the following script

require 'socket'

hostname = 'localhost'
port = 1234

socket = TCPSocket.open(hostname, port)

text = "\xA3" # the £ symbol in ISO-8859-1 aka Latin-1
text.force_encoding("ISO-8859-1")
socket.puts(text)

socket.close

with the Logstash run as

bin/logstash -e "input { tcp { port => 1234 codec => line { charset => 'ISO8859-1' } } } output { stdout { codec => rubydebug } }"

In the output the £ has to be present as a single character, and not as a mojibake sequence such as Â£.
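The mojibake arises when the already-UTF-8 bytes of £ are interpreted as Latin-1 a second time; a minimal reproduction of both the correct and the broken path:

```ruby
# Correct path: decode the Latin-1 byte once.
pound = "\xA3".force_encoding("ISO-8859-1").encode("UTF-8")
raise unless pound == "£" && pound.bytes == [0xC2, 0xA3]
# Broken path: re-read the UTF-8 bytes as Latin-1 and encode again.
mojibake = pound.dup.force_encoding("ISO-8859-1").encode("UTF-8")
raise unless mojibake == "Â£"
```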

Related issues

@andsel andsel self-assigned this Jan 28, 2025
@andsel andsel force-pushed the fix/buffered_tokenizer_clean_state_in_case_of_line_too_big_respecting_character_encoding branch from 69bd4f4 to a84656e Compare January 28, 2025 16:02

It looks like this PR modifies one or more .asciidoc files. These files are being migrated to Markdown, and any changes merged now will be lost. See the migration guide for details.


📃 DOCS PREVIEW: https://logstash_bk_16968.docs-preview.app.elstc.co/diff

@andsel andsel force-pushed the fix/buffered_tokenizer_clean_state_in_case_of_line_too_big_respecting_character_encoding branch from b682c20 to b42ca05 Compare January 31, 2025 12:40

@elasticmachine
Collaborator

💚 Build Succeeded

History

cc @andsel

@andsel andsel changed the title Fix/buffered tokenizer clean state in case of line too big respecting character encoding Fix BufferedTokenizer to properly resume after a buffer full condition respecting the encoding of the input string Jan 31, 2025
@andsel andsel added the bug label Jan 31, 2025
@andsel andsel marked this pull request as ready for review January 31, 2025 15:57
Development

Successfully merging this pull request may close these issues.

Character encoding issues with refactored BufferedTokenizerExt