Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] HttpCollectImpl XML parsing assumes UTF-8 #2852

Open
1 task done
pjfanning opened this issue Dec 2, 2024 · 6 comments
Open
1 task done

[BUG] HttpCollectImpl XML parsing assumes UTF-8 #2852

pjfanning opened this issue Dec 2, 2024 · 6 comments
Assignees
Labels
bug Something isn't working good first issue Good for newcomers

Comments

@pjfanning
Copy link
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Document document = db.parse(new ByteArrayInputStream(resp.getBytes(StandardCharsets.UTF_8)));

If you have a String, you don't need to convert to byte array (which is almost a waste of memory).

DocumentBuilder has a parse(InputSource) method.
https://docs.oracle.com/javase/8/docs/api/javax/xml/parsers/DocumentBuilder.html#parse-org.xml.sax.InputSource-

InputSources can be constructed to wrap StringWriters that wrap the String.

Expected Behavior

Don't convert Strings to byte arrays unnecessarily wasting memory and causing parse issues. Imagine if the XML has an XML declaration that has an encoding that is not UTF-8. If you already have the String, the parser will ignore the value. If you convert to a byte array, the parser will use the XML encoding value but you have explicitly converted to UTF-8 in your code so these encodings may not match.

Steps To Reproduce

No response

Environment

HertzBeat version(s): latest

Debug logs

No response

Anything else?

No response

@pjfanning
Copy link
Contributor Author

The underlying issue is more that you convert to a String in the first place.

// todo This code converts an InputStream directly to a String. For large data in Prometheus exporters,
// this could create large objects, potentially impacting JVM memory space significantly.
// Option 1: Parse using InputStream, but this requires significant code changes;
// Option 2: Manually trigger garbage collection, similar to how it's done in Dubbo for large inputs.
String resp = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);

Using an InputStream or a byte array should be more more efficient than a String. It would definitely not be worse.

@tomsun28
Copy link
Contributor

Sorry for missing that, got it thanks, we will consider parsing directly from the bytes stream later, just like described in todo

@tomsun28 tomsun28 added the good first issue Good for newcomers label Jan 26, 2025
@Suvrat1629
Copy link
Contributor

@tomsun28
Instead of converting the InputStream into a String, we directly parse the InputStream as an XML stream and then use DocumentBuilder.parse() method directly with the InputStream, which avoids the intermediate String conversion.
Is this approach right?
Can i work on this issue?

@tomsun28
Copy link
Contributor

@tomsun28 Instead of converting the InputStream into a String, we directly parse the InputStream as an XML stream and then use DocumentBuilder.parse() method directly with the InputStream, which avoids the intermediate String conversion. Is this approach right? Can i work on this issue?

of course, welcome.

@Suvrat1629
Copy link
Contributor

@tomsun28
Well I think to convert

String resp = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);

to

InputStream inputStream = response.getEntity().getContent();

would require some significant changes to the code.

@pjfanning
Copy link
Contributor Author

The parseResponseBySiteMap method cannot readily be changed to read an InputStream due to its failover code (to handle non-XML).

// if XML parsing fails, parse in TXT format
if (!isXmlFormat) {
try {
String[] urls = resp.split("\n");
// validate whether the given value is a URL
if (IpDomainUtil.isHasSchema(urls[0])) {
siteUrls.addAll(Arrays.asList(urls));
}
} catch (Exception e) {
log.warn(e.getMessage(), e);
}
}

InputStreams can only be read once while Strings can be used as input more than once.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers
Projects
Development

No branches or pull requests

3 participants