Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Extra content at the end of the document #161

Merged
merged 2 commits into from
Jul 7, 2024

Conversation

naitoh
Copy link
Contributor

@naitoh naitoh commented Jul 4, 2024

Why?

XML with additional content at the end of the document is invalid.

https://www.w3.org/TR/2006/REC-xml11-20060816/#document

[1]   document   ::=   ( prolog element Misc* ) - ( Char* RestrictedChar Char* )

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-Misc

[27]    Misc       ::=          Comment | PI | S

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PI

[16]    PI         ::=          '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PITarget

[17]    PITarget           ::=          Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))

## Why?

XML with multiple root elements is invalid.

See: ruby#160 (comment)
lib/rexml/parsers/baseparser.rb Outdated Show resolved Hide resolved
lib/rexml/parsers/baseparser.rb Outdated Show resolved Hide resolved
lib/rexml/parsers/baseparser.rb Outdated Show resolved Hide resolved
test/parser/test_base_parser.rb Outdated Show resolved Hide resolved
test/parser/test_base_parser.rb Outdated Show resolved Hide resolved
test/parser/test_base_parser.rb Outdated Show resolved Hide resolved
test/parser/test_base_parser.rb Outdated Show resolved Hide resolved
@naitoh naitoh force-pushed the fix_multiple_root_elements branch from 641d9d1 to 4e9de51 Compare July 5, 2024 04:19
@naitoh naitoh requested a review from kou July 5, 2024 04:30
naitoh added a commit to naitoh/rexml that referenced this pull request Jul 6, 2024
## Why?
XML declaration must be the first item.

https://www.w3.org/TR/2006/REC-xml11-20060816/#document

```
[1]   document   ::=   ( prolog element Misc* ) - ( Char* RestrictedChar Char* )
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-prolog

```
[22]   prolog   ::=   	XMLDecl Misc* (doctypedecl Misc*)?
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-XMLDecl

```
[23]   XMLDecl  ::=   '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
```

See: ruby#161 (comment)
test/parse/test_processing_instruction.rb Outdated Show resolved Hide resolved
test/parse/test_comment.rb Outdated Show resolved Hide resolved
test/parse/test_element.rb Outdated Show resolved Hide resolved
test/parse/test_element.rb Outdated Show resolved Hide resolved
test/parse/test_text.rb Outdated Show resolved Hide resolved
return [ :start_element, tag, attributes ]
end
else
text = @source.read_until("<")
if text.chomp!("<")
@source.position -= "<".bytesize
end
if @tags.empty? and @have_root
if text.strip != ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strip allocates a new string. Can we avoid it?
For example: /\A\s*\z/.match?(text)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.
I see.

## Why?

XML with additional content at the end of the document is invalid.

https://www.w3.org/TR/2006/REC-xml11-20060816/#document

```
[1]   document   ::=   ( prolog element Misc* ) - ( Char* RestrictedChar Char* )
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-Misc

```
[27]   	Misc	   ::=   	Comment | PI | S
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PI

```
[16]   	PI	   ::=   	'<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PITarget

```
[17]   	PITarget	   ::=   	Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))
```
@naitoh naitoh force-pushed the fix_multiple_root_elements branch from 4e9de51 to c094825 Compare July 7, 2024 00:40
@naitoh naitoh requested a review from kou July 7, 2024 00:45
naitoh added a commit to naitoh/rexml that referenced this pull request Jul 7, 2024
## Why?
XML declaration must be the first item.

https://www.w3.org/TR/2006/REC-xml11-20060816/#document

```
[1]   document   ::=   ( prolog element Misc* ) - ( Char* RestrictedChar Char* )
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-prolog

```
[22]   prolog   ::=   	XMLDecl Misc* (doctypedecl Misc*)?
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-XMLDecl

```
[23]   XMLDecl  ::=   '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
```

See: ruby#161 (comment)
@kou kou merged commit eb45c8d into ruby:master Jul 7, 2024
61 checks passed
@kou
Copy link
Member

kou commented Jul 7, 2024

Thanks.

kou pushed a commit that referenced this pull request Jul 7, 2024
## Why?
XML declaration must be the first item.

https://www.w3.org/TR/2006/REC-xml11-20060816/#document

```
[1]   document   ::=   ( prolog element Misc* ) - ( Char* RestrictedChar Char* )
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-prolog

```
[22]   prolog   ::=   	XMLDecl Misc* (doctypedecl Misc*)?
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-XMLDecl

```
[23]   XMLDecl  ::=   '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
```

See: #161 (comment)
@naitoh naitoh deleted the fix_multiple_root_elements branch July 7, 2024 21:02
@DmitryPogrebnoy
Copy link
Contributor

DmitryPogrebnoy commented Oct 15, 2024

@naitoh @kou After this change, parsing of '<threads><thread id="1"/></threads><threads><thread id="1"/></threads>' using REXML::Parsers::PullParser.new(socket).pull throw an illegal seek exception. Should I create a separate issue for this problem?

The main cause is this if statement at lib/rexml/parsers/baseparser.rb:498

if @tags.empty? and @have_root
  raise ParseException.new("Malformed XML: Extra tag at the end of the document (got '<#{tag}')", @source)
end

@kou
Copy link
Member

kou commented Oct 16, 2024

<threads><thread id="1"/></threads><threads><thread id="1"/></threads> is an invalid XML.
Are you suggesting that we should support an invalid XML?

@DmitryPogrebnoy
Copy link
Contributor

DmitryPogrebnoy commented Oct 16, 2024

@kou Well, I use socket to get xml messages from the server and parse them using PullParser. Each message is complete and valid. Before change it worked like a charm. Now it doesn't work anymore.
If the current way with REXML::Parsers::PullParser.new(socket).pull does not work, what is the right approach for such a use case?

@kou
Copy link
Member

kou commented Oct 16, 2024

OK. Could you open a new issue for it?

@DmitryPogrebnoy
Copy link
Contributor

@kou Yep, here it is - #214

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants