Potential panic or invalid data when using UTF-8 codepoint boundaries when decoding into a nested struct #61

sidkurella · 2024-02-08T14:57:40Z

Hello,

I have noticed a bug that causes a panic when decoding into a nested struct while using codepoint indices, rather than byte offsets, as the field boundaries. Take the following test as an example:

func TestDecodeSetUseCodepointIndices_Nested(t *testing.T) {
	type Nested struct {
		First  string `fixed:"1,3"`
		Second string `fixed:"4,6"`
	}

	type Test struct {
		First  string `fixed:"1,3"`
		Second Nested `fixed:"4,9"`
		Third  string `fixed:"10,12"`
		Fourth Nested `fixed:"13,18"`
		Fifth  string `fixed:"19,21"`
	}

	for _, tt := range []struct {
		name     string
		raw      []byte
		expected Test
	}{
		{
			name: "Multi-byte characters",
			raw:  []byte("123x☃x456x☃x789x☃x012\n"),
			expected: Test{
				First:  "123",
				Second: Nested{First: "x☃x", Second: "456"},
				Third:  "x☃x",
				Fourth: Nested{First: "789", Second: "x☃x"},
				Fifth:  "012",
			},
		},
	} {
		t.Run(tt.name, func(t *testing.T) {
			d := NewDecoder(bytes.NewReader(tt.raw))
			d.SetUseCodepointIndices(true)
			var s Test
			err := d.Decode(&s)
			if err != nil {
				t.Errorf("Unexpected err: %v", err)
			}
			if !reflect.DeepEqual(tt.expected, s) {
				t.Errorf("Decode(%v) want %v, have %v", tt.raw, tt.expected, s)
			}
		})
	}
}

Currently, this test panics because the codepoint indices are not adjusted when data is trimmed from the front of the string in decode.go:rawValueFromLine.

I believe the issue is here (decode.go Ln. 217):

	if value.codepointIndices != nil {
		if len(value.codepointIndices) == 0 || startPos > len(value.codepointIndices) {
			return rawValue{data: ""}
		}
		var relevantIndices []int
		var lineData string
		if endPos >= len(value.codepointIndices) {
			relevantIndices = value.codepointIndices[startPos-1:]
			lineData = value.data[relevantIndices[0]:]
		} else {
			relevantIndices = value.codepointIndices[startPos-1 : endPos]
			lineData = value.data[relevantIndices[0]:value.codepointIndices[endPos]]
		}
	} else { // truncated
	}

Note that lineData is trimmed from the left but the codepoint indices are not adjusted to match, which can cause an index out of bounds, or reading from the wrong part of the data string.
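
To illustrate the idea, here is a minimal standalone sketch of rebasing the indices after trimming. It is not the library's code or the exact change in the PR; the helper names codepointOffsets and sliceByCodepoints are hypothetical, and the sketch assumes 1-based, inclusive positions with start >= 1, as in the struct tags above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// codepointOffsets (hypothetical helper) returns the byte offset at which
// each code point of s begins.
func codepointOffsets(s string) []int {
	offsets := make([]int, 0, utf8.RuneCountInString(s))
	for i := range s { // ranging over a string yields the start byte of each rune
		offsets = append(offsets, i)
	}
	return offsets
}

// sliceByCodepoints (hypothetical helper) returns the code points
// start..end (1-based, inclusive) of data, plus offsets that are rebased so
// they index into the returned substring rather than into data.
func sliceByCodepoints(data string, offsets []int, start, end int) (string, []int) {
	if len(offsets) == 0 || start > len(offsets) {
		return "", nil
	}
	var sub string
	var relevant []int
	if end >= len(offsets) {
		relevant = offsets[start-1:]
		sub = data[relevant[0]:]
	} else {
		relevant = offsets[start-1 : end]
		sub = data[relevant[0]:offsets[end]]
	}
	// The step the issue is about: sub was trimmed from the left by
	// relevant[0] bytes, so every offset must shift by the same amount.
	rebased := make([]int, len(relevant))
	for i, off := range relevant {
		rebased[i] = off - relevant[0]
	}
	return sub, rebased
}

func main() {
	line := "123x☃x456"
	offsets := codepointOffsets(line)

	// Outer field covering code points 4..9, then nested fields inside it.
	outer, outerOffsets := sliceByCodepoints(line, offsets, 4, 9)
	first, _ := sliceByCodepoints(outer, outerOffsets, 1, 3)
	second, _ := sliceByCodepoints(outer, outerOffsets, 4, 6)

	fmt.Printf("%q %q %q\n", outer, first, second)
	// prints: "x☃x456" "x☃x" "456"
}
```

Without the rebasing loop, the nested slices would use offsets that still point into the full line, which is how the out-of-bounds read or wrong data arises.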

I have created a fix in PR #60 for your review.

sidkurella mentioned this issue Feb 8, 2024

Correctly update and trim codepoint indices after trimming data #62

Merged

ianlopshire closed this as completed in #62 Feb 8, 2024
