Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose byte range(s) for attributes? #113

Closed
Jayonas opened this issue Jan 16, 2024 · 5 comments
Closed

Expose byte range(s) for attributes? #113

Jayonas opened this issue Jan 16, 2024 · 5 comments

Comments

@Jayonas
Copy link
Contributor

Jayonas commented Jan 16, 2024

I'm evaluating XML libraries for use on a project, and one thing that I'm looking for is the ability to get the byte ranges where various XML items came from in the original string/document, for use in error reporting. I found Node.range(), which is exactly what I'm looking for. However, attributes aren't Nodes so they don't benefit from this API. I found Attribute.position(), but it only provides the first byte rather than a range. Would you consider adding a range() method to Attribute? Or does something like that already exist and I missed it?

Even better -- would you consider adding separate ranges for an attribute's name and value? It would be nice to be able to pinpoint one or the other in different kinds of error messages. E.g. an "invalid attribute" error could pinpoint the name, whereas an "invalid value" error could pinpoint the value. I don't personally care very much whether the name's range includes its prefix (if any) or not, but I realize that's a thing that would need to be considered.

Another library that I've looked at is your xmlparser, which is far and away the best I've seen in terms of pinpointing the ranges of various items, not just separate prefix/name/value ranges for attributes, but even separate name/prefix for elements, separate full/contents-only ranges for comments and CDATA, and so forth. Though as you point out in its documentation, that library is very low-level and therefore very limited in actual XML processing and validation capabilities. I noticed that you recently went so far as to deprecate xmlparser in favor of simply merging it into this roxmltree library. For anyone that has been using xmlparser and the extra range information it provides, they might find it easier to transition to this library if it also made that extra information available. The main hang up for me is attribute ranges, but I would also applaud exposing subranges of various types of Nodes as well.

@RazrFalcon
Copy link
Owner

The reason there are no attribute ranges is memory overhead. We basically have to add 8 bytes to each attribute, which would significantly increase memory usage. Same with attribute value ranges.
See https://github.com/RazrFalcon/roxmltree?tab=readme-ov-file#memory-overhead

I don't mind replacing an attribute position with a range. But adding name and value ranges would require additional 24 bytes. Same with more granular positions.

@Jayonas
Copy link
Contributor Author

Jayonas commented Jan 16, 2024

Thanks for your response! Even just replacing an attribute's position with a range would be a big boost for me, so I'm very happy to hear that you don't mind doing that. It's been surprisingly challenging for me to find an XML library that's both relatively feature complete and also provides good range information (or even a way for me to cobble it together myself). I'm pretty sure that having an Attribute.range() API would sell me on roxmltree.

Regarding more granular ranges like separate attribute names and values, I definitely understand your concern about memory usage and realize that most users probably wouldn't want to make that trade-off. It's a shame for people like me that would be willing to pay the memory cost, though. What do you think about potentially using a feature flag to enable extra range tracking? I realize that would add some complexity to the crate, and maybe more than you'd be comfortable with, but I wonder if it might not be too bad? I have not dived into the code myself to look, but assuming that the underlying parser (xmlparser) already discovers this information, then maybe the only stuff the feature flag would need to enable/disable would be:

  1. extra fields on a few structs
  2. code to copy data into those fields from the parser
  3. accessor methods for those fields
  4. extra tests to verify the data

That's obviously not trivial, but it also doesn't seem especially complex at first glance either (though again, I acknowledge that I haven't checked out the code myself, and that XML always seems to be more complicated than you'd think). Any chance that it's at least worth consideration? Maybe the same idea could even be used to save memory by placing the range data that already exists behind a feature flag as well (assuming that it isn't required by crate internals); such a flag could default to enabled (for backward compatibility) but people who don't even care about the basic ranges could save memory just by turning the flag off.

Anyway, thanks again for your reply, and I'm excited about the possibility of an Attribute.range() API regardless.

@Jayonas
Copy link
Contributor Author

Jayonas commented May 21, 2024

Created PR #119 for adding Attribute.range(). I also marked Attribute.position() as deprecated; let me know if you'd rather not do that, or have any other concerns.

I also have a branch for adding subranges for an attribute's qname and value behind a new feature flag, and would be happy for any feedback.

Let me know if I missed any protocol or such. Despite being a relatively experienced dev, I'm pretty new to contributing on GitHub like this. E.g. I couldn't figure out how to link the PR to this issue, I guess only this repo's maintainers can do that?

@RazrFalcon
Copy link
Owner

Thanks. Looks good to me.

As for qname/value ranges, you have an interesting trick to reduce memory usage. I like it.
div_range is a bit misleading name. I immediately thought about <div/> and not =. I would rather rename it, but I don't have good candidates for now.
Also, I think we could do some tricks and maybe use Range<u8> or Range<u16> at most. What chances that attribute name is larger than 255 or 65535 characters? usize would be an overkill.
I think we can settle for Range<u16> for now with some guard rails to make sure ranges larger than 65535 are clamped. And I will accept such PR.

I couldn't figure out how to link the PR to this issue, I guess only this repo's maintainers can do that?

I don't think you can "link" it. You can mention an issue by id/number in a PR and GitHub will auto-reference it. That's about it.

@Jayonas
Copy link
Contributor Author

Jayonas commented May 22, 2024

Good point that Range<usize> is actually overkill for this. I think we can get by with just two u8s, and once I started thinking of them as separate values instead of a single Range then better names for them became much more obvious: qname_len and eq_len. I submitted PR #120 for it. Let me know if you think either or both of those need to be u16 instead (maxing out a u8 for either of them already seems pretty extreme to me, but I suspect that you have more experience seeing wacky XML data).

Also -- will your automated build/test workflow on GitHub need to be manually updated to build/test with the new feature enabled? If so, I'm guessing that I won't be able to do that for you as part of the PR?

Thanks for your time and feedback, and accepting these PRs. I'm happy to be able to have this functionality in my app without needing my own fork.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants