Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly update and trim codepoint indices after trimming data #62

Merged
merged 6 commits into from
Feb 8, 2024

Conversation

sidkurella
Copy link
Contributor

@sidkurella sidkurella commented Feb 8, 2024

Shift codepoint indices when trimming from front of string

Trim codepoint indices to be consistent with trimmed data string

Closes #61

sidkurella and others added 4 commits February 8, 2024 09:46
These codepoint indices are now invalid when trimming data from the
front of the string. Shift these left to match the new byte offsets
after trim.

Fixes potential panic or invalid data when using UTF-8 encoded strings
with nested struct members.
…when-trim-front

Shift codepoint indices when trimming data from front
Trim these indices to be consistent with the trimmed string.
This prevents a potential index out of bounds when
padding is trimmed.
…es-when-trimming-line-data

Trim codepoint indices when trimming line data
@ianlopshire
Copy link
Owner

@sidkurella I think there is still a bug in this where padding at the start of a nested struct will cause a misalignment. Consider this test case:

func TestDecode_Nested(t *testing.T) {
	type Nested struct {
		First  string `fixed:"1,3"`
		Second string `fixed:"4,6"`
	}

	type Test struct {
		Nested Nested `fixed:"1,6"`
	}

	for _, tt := range []struct {
		name string
		raw  []byte

		expected Test
	}{
		{
			name: "Multi-byte with trimmed values",
			raw:  []byte(" ☃2B☃\n"),
			expected: Test{
				Nested: Nested{
					First:  "☃2",
					Second: "B☃",
				},
			},
		},
	} {
		t.Run(tt.name, func(t *testing.T) {
			d := NewDecoder(bytes.NewReader(tt.raw))
			d.SetUseCodepointIndices(true)
			var s Test
			err := d.Decode(&s)
			if err != nil {
				t.Errorf("Unexpected err: %v", err)
			}
			if !reflect.DeepEqual(tt.expected, s) {
				t.Errorf("Decode(%v) want %v, have %v", tt.raw, tt.expected, s)
			}
		})
	}
}

A mechanism must be added to prevent trimming on values decoded into the nested struct. I've opened the PR draft #63 as an example.

@sidkurella
Copy link
Contributor Author

@sidkurella I think there is still a bug in this where padding at the start of a nested struct will cause a misalignment. Consider this test case:

func TestDecode_Nested(t *testing.T) {
	type Nested struct {
		First  string `fixed:"1,3"`
		Second string `fixed:"4,6"`
	}

	type Test struct {
		Nested Nested `fixed:"1,6"`
	}

	for _, tt := range []struct {
		name string
		raw  []byte

		expected Test
	}{
		{
			name: "Multi-byte with trimmed values",
			raw:  []byte(" ☃2B☃\n"),
			expected: Test{
				Nested: Nested{
					First:  "☃2",
					Second: "B☃",
				},
			},
		},
	} {
		t.Run(tt.name, func(t *testing.T) {
			d := NewDecoder(bytes.NewReader(tt.raw))
			d.SetUseCodepointIndices(true)
			var s Test
			err := d.Decode(&s)
			if err != nil {
				t.Errorf("Unexpected err: %v", err)
			}
			if !reflect.DeepEqual(tt.expected, s) {
				t.Errorf("Decode(%v) want %v, have %v", tt.raw, tt.expected, s)
			}
		})
	}
}

A mechanism must be added to prevent trimming on values decoded into the nested struct. I've opened the PR draft #63 as an example

Yeah, I think you're right, if you're decoding into a nested struct I think you just don't want to do any padding trimming at all

I feel like that's an issue also in the pure-ASCII case too, no?

@sidkurella
Copy link
Contributor Author

		{
			name: "Pure ASCII with trimmed values",
			raw:  []byte(" A2BC\n"),
			expected: Test{
				Nested: Nested{
					First:  "A2",
					Second: "BC",
				},
			},
		},

Yeah, just tested with this, this fails too.

This may break existing users though?

@ianlopshire
Copy link
Owner

So it seems there are two issues at play: (a) the potential panic caused by not correcting the relevant indices and (b) trimming for nested structs.

I feel fine fixing the first; any current users affected would be seeing a panic. I'm hesitant to change the behavior of the second without some sort of opt-in setting.

Would adding none as an alignment work for your usecase?

type Nested struct {
	First  string `fixed:"1,3"`
	Second string `fixed:"4,6"`
}

type Test struct {
	Nested `fixed:"1,6,none"`
}

@sidkurella
Copy link
Contributor Author

Yeah, that works

@ianlopshire
Copy link
Owner

Cool,

Let's use this PR to solve the first of those issues, and I can update the second PR to add the none tag. I'll review this one now.

Copy link
Owner

@ianlopshire ianlopshire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trimming logic cluttering up the rawValueFromLine and making it hard to understand. Can we factor out the trimming logic into methods on rawValue?

func (rawValue) trimPrefix(prefix string) rawValue {}

func (rawValue) trimSuffix(suffix string) rawValue {}

func (rawValue) trim(cutset string) rawValue {}

Alternatively, you could use the newRawValue function to rebuild the indices after trimming. I think that would be a performance hit, but I'd be fine with it for now.

@sidkurella
Copy link
Contributor Author

The trimming logic cluttering up the rawValueFromLine and making it hard to understand. Can we factor out the trimming logic into methods on rawValue?

func (rawValue) trimPrefix(prefix string) rawValue {}

func (rawValue) trimSuffix(suffix string) rawValue {}

func (rawValue) trim(cutset string) rawValue {}

Alternatively, you could use the newRawValue function to rebuild the indices after trimming. I think that would be a performance hit, but I'd be fine with it for now.

Updated

@ianlopshire ianlopshire merged commit 66e7247 into ianlopshire:master Feb 8, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Potential panic or invalid data when using UTF-8 codepoint boundaries when decoding into a nested struct
2 participants