(PE-38408) Remove expensive Regexes from puppet profiler Java impl #2880

justinstoller · 2024-09-23T23:08:25Z

We have triggered a nasty performance regression in the JVM with some of our Regex usage in the MetricsPuppetProfiler. Regardless of the performance regression, our Regex usage is unnecessary and unneedlessly expensive.

justinstoller · 2024-09-25T15:40:53Z

src/java/com/puppetlabs/puppetserver/MetricsPuppetProfiler.java

+                Timer metric = metricsByID.get(getMetricName(sliceOfArrayToList(metricId, 2)));
+                this.catalog_timers.put(metricId[1], metric);
+            }
+        } else if ("puppetdb".equals(metricId[0])) {


This puppetdb clause does not contain all the necessary complexity of the former PUPPETDB_PATTERN

These are the root metrics we report on based the PUPPETDB_PATTERN:

puppetdb.catalog.munge puppetdb.command.submit.replace facts puppetdb.command.submit.replace catalog puppetdb.command.submit.store report puppetdb.facts.encode puppetdb.query puppetdb.resource.search puppetdb.report.convert_to_wire_format_hash

The code in the current PR dumbs this down to:

puppetdb.catalog puppetdb.command puppetdb.facts puppetdb.query puppetdb.resource puppetdb.report

Going through the PuppetDB Ruby code (at this commit puppetlabs/puppetdb@8a1e2ef) I see these metrics created:

[:puppetdb, :aliases, :map_to_title] [:puppetdb, :catalog, :munge] [:puppetdb, :catalog, :save, request.key] [:puppetdb, :command, :submit] [:puppetdb, :command, :submit, catalog_command_name, catalog_version] [:puppetdb, :command, :submit, command_name, version] [:puppetdb, :command, :submit, inputs_command_name, inputs_version] [:puppetdb, :edges, :munge] [:puppetdb, :edges, :synthesize] [:puppetdb, :edges, :synthesize, :make_unique] [:puppetdb, :edges, :synthesize, :primary_synthesis] [:puppetdb, :edges, :synthesize, :resource_table, :build] [:puppetdb, :events_list, :build] [:puppetdb, :facts, :encode] [:puppetdb, :facts, :find, :parse_response, request.key] [:puppetdb, :facts, :find, :query_nodes, request.key] [:puppetdb, :facts, :find, request.key] [:puppetdb, :facts, :save, request.key] [:puppetdb, :facts, :search, :parse_query_response, request.key,] [:puppetdb, :facts, :search, :query_request, request.key] [:puppetdb, :facts, :search, request.key] [:puppetdb, :keys, :filter_extraneous] [:puppetdb, :logs_list, :build] [:puppetdb, :metaparams, :sort] [:puppetdb, :metrics_list, :build] [:puppetdb, :namevar_aliases, :add] [:puppetdb, :parameters, :add_missing] [:puppetdb, :payload, :format] [:puppetdb, :query, query] [:puppetdb, :report, :convert_to_wire_format_hash] [:puppetdb, :report, :process] [:puppetdb, :resource, :search, :build_up_collected_objects, request.key] [:puppetdb, :resource, :search, :parse_query_response, request.key] [:puppetdb, :resource, :search, :query, request.key] [:puppetdb, :resource, :search, request.key] [:puppetdb, :resources_list, :build] [:puppetdb, :titles, :stringify]

Are the current metrics correct? We could add a few while we're here. Some easier ones to add would be:

puppetdb.catalog.save puppetdb.edges.munge puppetdb.edges.synthesize puppetdb.facts.find puppetdb.facts.save puppetdb.facts.search puppetdb.payload.format puppetdb.report.process

But I don't know if they duplicate some existing reported metrics, or aren't very valuable.

Thoughts @austb, @rbrw , @Sharpie ?

The reports.process, catalog.save and facts.save metrics would likely be useful as they are the highest level metrics closest to "how long did the puppetdb terminus take?".

We should be sure to keep all the versions of submit like puppetdb.command.submit.store report as those, along with puppetdb.facts.encode, puppetdb.catalog.munge, and puppetdb.report.convert_to_wire_format_hash can help support determine if its spending its time in our Ruby code or trying to send the data over the net.

puppetdb.payload.format would be good to include as well in case our json conversion is slow. We do have to do some string manipulation there, so it isn't just Ruby's to_json function.

I'm less familiar with the terminus's use for querying, but I imagine puppetdb.query, puppetdb.facts.find, and puppetdb.facts.search would be good to include assuming those terminus functions are used somewhere.

To me, the rest fall into the nice-to-have category. If their portion is what's slow, they would be helpful to determine that, but I don't know how much value they'd have outside that.

@Sharpie , how does adding those new puppetdb metrics sound?

Here's the updated metrics:

"puppetdb-metrics": [ { "metric": "catalog_save", "count": 2, "mean": 243, "aggregate": 486 }, { "metric": "resource_search", "count": 8, "mean": 46, "aggregate": 368 }, { "metric": "query", "count": 1, "mean": 355, "aggregate": 355 }, { "metric": "facts_find", "count": 2, "mean": 117, "aggregate": 234 }, { "metric": "catalog_munge", "count": 2, "mean": 94, "aggregate": 188 }, { "metric": "facts_save", "count": 2, "mean": 76, "aggregate": 152 }, { "metric": "command_submit_replace_catalog", "count": 2, "mean": 54, "aggregate": 108 }, { "metric": "command_submit_replace_facts", "count": 2, "mean": 49, "aggregate": 98 }, { "metric": "report_process", "count": 2, "mean": 41, "aggregate": 82 }, { "metric": "command_submit_store_report", "count": 2, "mean": 30, "aggregate": 60 }, { "metric": "payload_format", "count": 8, "mean": 2, "aggregate": 16 }, { "metric": "report_convert_to_wire_format_hash", "count": 2, "mean": 4, "aggregate": 8 }, { "metric": "facts_encode", "count": 2, "mean": 2, "aggregate": 4 } ], "inline-metrics": [] } }, "active_alerts": [], "service_name": "puppet-profiler" }

The updated set of metrics looks good to me 👍

donoghuc · 2024-09-25T16:17:23Z

src/java/com/puppetlabs/puppetserver/MetricsPuppetProfiler.java

+    }
+
+    private void updateMetricsTrackers(String[] metricId, Map<String, Timer> metricsByID) {
+        if ("functions".equals(metricId[0])) {


Just a note on how to start reviewing this because it was hard to identify for me... We need to look for areas like https://github.com/puppetlabs/puppet/blob/1fba2505e6fd5c7ed17257c07667da98d6ed1dfb/lib/puppet/pops/functions/function.rb#L115 and evaluate if all the cases are handled correctly. Using array indices for each of these fields requires that we have a good sense of the possible arrays that are coming in.

I think there will be some in puppet and puppetdb. Not sure if there are any in puppetserver as well?

Overall, i like the idea of doing the handling when an event comes in such that there is no cost for querying :).

src/java/com/puppetlabs/puppetserver/MetricsPuppetProfiler.java

justinstoller · 2024-09-25T22:28:12Z

src/java/com/puppetlabs/puppetserver/MetricsPuppetProfiler.java

+            } else if ("static_compile_inlining".equals(metricId[1])) {
+                Timer metric = metricsByID.get(getMetricName(sliceOfArrayToList(metricId, 3)));
+                this.inlining_timers.put(metricId[2], metric);
+            } else {


The previous regex based selection explicitly grabbed these submetrics: static_compile_postprocessing, static_compile, compile, and find_node. This else will grab all metrics that aren't evaluate_resource or static_compile_inlining, don't know if that is a good thing or a bad thing. Thoughts @Sharpie ?

Does this mean these more specific metrics will no longer be counted in catalog timers? Does that matter?

They will still be there, but we'll report a lot more now. I don't know if that will be helpful or just noise. Here are the ones I can easily find with grep:

init_server_facts find_facts iterate_on_generators evaluate_definitions evaluate_collections validate_final finish_catalog validate_pre_finish evaluate_generators evaluate_node_classes evaluate_ast_node evaluate_main create_settings_scope set_node_params

justinstoller · 2024-09-25T23:53:54Z

I think I've addressed the logic issues with the existing puppetdb metrics. Looking forward to hear from Charlie, et al if others would also be valuable. I need to add some tests, it appears that we test the function, catalog, and resource metrics from the profiler interface, but test the puppetdb metrics only via the graphite exporter (unaffected by this PR).

justinstoller · 2024-09-26T23:48:55Z

The unit/integration tests for puppetdb metrics I was thinking of are testing allow listing, not actual creation of those metrics (which is hard since we don't have a PDB during our unit/integration tests.

However there is an acceptance test to validate these metrics, and that's been helpful locally to make sure these are formatted correctly. The acceptance test is passing with a package built prior a small refactor (puppetserver 7.17.3.SNAPSHOT.2024.09.26T1533 on builds).

I'll add those metrics Austin mentioned (unless I hear otherwise from Charlie) and then build a final package to acceptance test and that others can validate if they wish.

We have triggered a nasty performance regression in the JVM[1] with some of our Regex usage in the MetricsPuppetProfiler (particularly the PuppetDB Patterns). Regardless of the performance regression, our Regex usage is unnecessary and unneedlessly expensive (I believe it was taking up ~3% of samples from JFR in an unrelated support escalation). This patch moves from iterating over all metrics and doing regex matches to pull out only those that match the desired pattern when metrics are requested, to caching references to desired metrics in maps as they are recorded. The keys of these maps should be the same as the match group from the previous patterns.

Based on discussion with Austin Blatt these additional metrics could be helpful in diagnosing issues with PuppetDB. > The reports.process, catalog.save and facts.save metrics would likely be > useful as they are the highest level metrics closest to "how long did the > puppetdb terminus take?". > > puppetdb.payload.format would be good to include as well in case our json > conversion is slow. We do have to do some string manipulation there, so it > isn't just Ruby's to_json function. > > I'm less familiar with the terminus's use for querying, but I imagine > puppetdb.query, puppetdb.facts.find, and puppetdb.facts.search would be > good to include assuming those terminus functions are used somewhere.

clj-kondo version 2024.03.05 fixed a bug which suppressed the :java-static-field-call warning in some cases. Linting is now erroring in CI and this patch should resolve that issue.

justinstoller · 2024-09-27T19:30:16Z

Okay. I think this is ready to go. There's a package internally available with 7.17.3.SNAPSHOT.2024.09.27T1145 that works for me.

I had to update our acceptance tests - it looks like our PDB integration isn't correct. I have a feeling that mean our PDB tests aren't really doing anything in CI, but first we need to get CI working and then I take up updating the PDB integration acceptance tests after that.

fwiw, these are those local changes to the acceptance suite: https://github.com/justinstoller/puppetserver/compare/PE-38408...justinstoller:puppetserver:pdbupdate?expand=1

donoghuc · 2024-09-27T20:37:57Z

src/java/com/puppetlabs/puppetserver/MetricsPuppetProfiler.java

    }

    @Override
    public void shutdown() {
    }

+    private List<String> sliceOfArrayToList(String[] idSegments, int lengthOfID) {


I cant see how subList does not work for this?

Seems like for example the getMetricName could be called with the result of a metricId.subList(0,2) as the result of subList I think would be mutable?

We start with a String[] which doesn't have the subList() method defined, but we can turn an array into a List with Arrays.asList().

I originally tried Arrays.asList(ids).subList(0,2). But the problem is asList() returns a fixed sized List and subList() simply returns a view of the original fixed size list. So we end up erring when getMetricName attempts to prepend the hostname to the list here: https://github.com/puppetlabs/puppetserver/pull/2880/files#diff-300731d432bcbd4678957c962c2aa20d068698fff7a4441434c3d1ac5e55fdfdR196

I should probably clean up getMetricName() but I didn't want to lead with a larger refactor than was strictly necessary (at least after I had changed getTimers)

Ah, i see, i misread that parameter type.

donoghuc · 2024-09-27T20:47:23Z

src/java/com/puppetlabs/puppetserver/MetricsPuppetProfiler.java

+	    } else {
+		String thirdElemet = safeGet(metricId, 2);
+
+                if (


Its probably overkill, but i wonder if it would be worth comparing metrics from two reference workflows. Matching the regex to this logic is hurting my brain. It may not matter too much though... These are hard to reason about 🫠

The regex matching definitely hurt my brain as well! I got it wrong a couple of times. I can run the acceptance test and collect what metrics are reported before and after this change. I've ran it after to make sure they look right (see this comment #2880 (comment)) , but haven't done a full comparison. I'll do that.

Here are the previous metrics from 7.17.2:

{"service_version":"7.17.2","service_status_version":1,"detail_level":"debug","state":"running","status":{"experimental":{"function-metrics":[{"function":"puppetdb_query","count":1,"mean":369,"aggregate":369},{"function":"lookup","count":2,"mean":45,"aggregate":90}],"resource-metrics":[{"resource":"Class[main]","count":2,"mean":193,"aggregate":386}],"catalog-metrics":[{"metric":"compile","count":2,"mean":886,"aggregate":1772},{"metric":"find_node","count":2,"mean":11,"aggregate":22}],"puppetdb-metrics":[{"metric":"query","count":1,"mean":366,"aggregate":366},{"metric":"resource_search","count":8,"mean":31,"aggregate":248},{"metric":"command_submit_replace_catalog","count":2,"mean":68,"aggregate":136},{"metric":"command_submit_store_report","count":2,"mean":51,"aggregate":102},{"metric":"command_submit_replace_facts","count":2,"mean":43,"aggregate":86},{"metric":"catalog_munge","count":2,"mean":9,"aggregate":18},{"metric":"report_convert_to_wire_format_hash","count":2,"mean":5,"aggregate":10},{"metric":"facts_encode","count":2,"mean":2,"aggregate":4}],"inline-metrics":[]}},"active_alerts":[],"service_name":"puppet-profiler"}

and here are the metrics returned after this pr:

{"service_version":"7.17.3-SNAPSHOT","service_status_version":1,"detail_level":"debug","state":"running","status":{"experimental":{"function-metrics":[{"function":"puppetdb_query","count":1,"mean":411,"aggregate":411},{"function":"lookup","count":2,"mean":46,"aggregate":92}],"resource-metrics":[{"resource":"Node[weakest-byword.delivery.puppetlabs.net]","count":1,"mean":421,"aggregate":421},{"resource":"Class[main]","count":2,"mean":184,"aggregate":368},{"resource":"Node[resource-exporter.test]","count":1,"mean":208,"aggregate":208}],"catalog-metrics":[{"metric":"compile","count":2,"mean":912,"aggregate":1824},{"metric":"evaluate_ast_node","count":2,"mean":321,"aggregate":642},{"metric":"evaluate_main","count":2,"mean":188,"aggregate":376},{"metric":"evaluate_generators","count":2,"mean":115,"aggregate":230},{"metric":"iterate_on_generators","count":3,"mean":75,"aggregate":225},{"metric":"evaluate_collections","count":2,"mean":108,"aggregate":216},{"metric":"create_settings_scope","count":2,"mean":54,"aggregate":108},{"metric":"set_node_params","count":2,"mean":14,"aggregate":28},{"metric":"find_facts","count":2,"mean":7,"aggregate":14},{"metric":"find_node","count":2,"mean":6,"aggregate":12},{"metric":"finish_catalog","count":2,"mean":1,"aggregate":2},{"metric":"init_server_facts","count":1,"mean":0,"aggregate":0},{"metric":"validate_pre_finish","count":2,"mean":0,"aggregate":0},{"metric":"evaluate_node_classes","count":2,"mean":0,"aggregate":0},{"metric":"validate_final","count":2,"mean":0,"aggregate":0},{"metric":"evaluate_definitions","count":3,"mean":0,"aggregate":0}],"puppetdb-metrics":[{"metric":"query","count":1,"mean":409,"aggregate":409},{"metric":"resource_search","count":8,"mean":36,"aggregate":288},{"metric":"facts_find","count":2,"mean":139,"aggregate":278},{"metric":"catalog_save","count":2,"mean":131,"aggregate":262},{"metric":"facts_save","count":2,"mean":91,"aggregate":182},{"metric":"command_submit_replace_catalog","count":2,"mean":67,"aggregate":134},{"metric":"command_submit_replace_facts","count":2,"mean":56,"aggregate":112},{"metric":"report_process","count":2,"mean":50,"aggregate":100},{"metric":"command_submit_store_report","count":2,"mean":39,"aggregate":78},{"metric":"payload_format","count":8,"mean":3,"aggregate":24},{"metric":"facts_encode","count":2,"mean":10,"aggregate":20},{"metric":"catalog_munge","count":2,"mean":10,"aggregate":20},{"metric":"report_convert_to_wire_format_hash","count":2,"mean":4,"aggregate":8}],"inline-metrics":[]}},"active_alerts":[],"service_name":"puppet-profiler"}

These aren't pretty printed, not as an f-u, but because I think putting them pretty printed in a github comment would make the content very hard to compare. I put them into files, piped them through jq, and compared two terminals side by side. I think they look correct.

donoghuc · 2024-09-27T20:56:24Z

src/java/com/puppetlabs/puppetserver/MetricsPuppetProfiler.java

@@ -90,8 +176,8 @@ private boolean shouldTime(String[] metric_id) {
        return true;
    }

-    private List<Timer> getTimers(String[] metric_id) {
-        List<Timer> timers = new ArrayList<Timer>();
+    private Map<String, Timer> getOrCreateTimersByIDs(String[] metric_id) {


This name is not used anywhere else right? Its not important for

puppetserver/src/clj/puppetlabs/services/jruby/jruby_metrics_core.clj

Line 212 in ebc41eb

(->> (.getTimers metric-registry)

?

I'm not sure i'm following the "or" in the name here. It seems like this just creates a new map and returns it unconditionally. Am I missing something?

This name is not used anywhere else right? Its not important for

It's not, the method is private and external callers are calling this method on the MetricRegistry https://www.javadoc.io/doc/io.dropwizard.metrics/metrics-core/4.1.0-rc2/com/codahale/metrics/MetricRegistry.html#getTimers--

I think the two cases of "getTimers()" here where inspired by that method. But, the call to MetricRegister.timer() in this method is actually the place in this class were we create timers (if they don't exist). See https://www.javadoc.io/doc/io.dropwizard.metrics/metrics-core/4.1.0-rc2/com/codahale/metrics/MetricRegistry.html#timer-java.lang.String-

The getOrCreateTimers is in reference to that. A more accurate name would probably be getOrCreateTimersAndReturnThemById , maybe?

Ah, thanks for the reference. I think leaving it as is will be fine!

donoghuc · 2024-09-27T21:12:31Z

Do you feel strongly about this going in 7.x? Would it make sense just to target main?

justinstoller · 2024-09-27T23:04:03Z

I don't have strong feelings. The escalation that started this research was using 7.x and so I started there.

donoghuc · 2024-09-30T18:51:17Z

I think given the main benefit will be in 7.x (even though its close to the end) i think leaving it at that stream is best. Thanks!

justinstoller requested review from a team as code owners September 23, 2024 23:08

justinstoller commented Sep 25, 2024

View reviewed changes

donoghuc reviewed Sep 25, 2024

View reviewed changes

justinstoller commented Sep 25, 2024

View reviewed changes

src/java/com/puppetlabs/puppetserver/MetricsPuppetProfiler.java Outdated Show resolved Hide resolved

justinstoller commented Sep 25, 2024

View reviewed changes

justinstoller added 3 commits September 27, 2024 11:31

(maint) Address new clj-kondo issues

f85021c

clj-kondo version 2024.03.05 fixed a bug which suppressed the :java-static-field-call warning in some cases. Linting is now erroring in CI and this patch should resolve that issue.

justinstoller force-pushed the PE-38408 branch from c3f314e to f85021c Compare September 27, 2024 18:42

donoghuc reviewed Sep 27, 2024

View reviewed changes

donoghuc approved these changes Sep 30, 2024

View reviewed changes

justinstoller merged commit acd44da into puppetlabs:7.x Sep 30, 2024
12 checks passed

justinstoller deleted the PE-38408 branch September 30, 2024 21:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(PE-38408) Remove expensive Regexes from puppet profiler Java impl #2880

(PE-38408) Remove expensive Regexes from puppet profiler Java impl #2880

justinstoller commented Sep 23, 2024

justinstoller Sep 25, 2024

justinstoller Sep 25, 2024 •

edited

Loading

austb Sep 26, 2024

justinstoller Sep 26, 2024

justinstoller Sep 27, 2024

Sharpie Sep 30, 2024

donoghuc Sep 25, 2024

justinstoller Sep 25, 2024

donoghuc Sep 27, 2024

justinstoller Sep 27, 2024

justinstoller commented Sep 25, 2024

justinstoller commented Sep 26, 2024

justinstoller commented Sep 27, 2024 •

edited

Loading

donoghuc Sep 27, 2024

justinstoller Sep 27, 2024

donoghuc Sep 30, 2024

donoghuc Sep 27, 2024

justinstoller Sep 27, 2024

justinstoller Sep 27, 2024

donoghuc Sep 27, 2024

donoghuc Sep 27, 2024

justinstoller Sep 27, 2024

donoghuc Sep 30, 2024

donoghuc commented Sep 27, 2024

justinstoller commented Sep 27, 2024

donoghuc commented Sep 30, 2024

(PE-38408) Remove expensive Regexes from puppet profiler Java impl #2880

(PE-38408) Remove expensive Regexes from puppet profiler Java impl #2880

Conversation

justinstoller commented Sep 23, 2024

Choose a reason for hiding this comment

justinstoller Sep 25, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

justinstoller commented Sep 25, 2024

justinstoller commented Sep 26, 2024

justinstoller commented Sep 27, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

donoghuc commented Sep 27, 2024

justinstoller commented Sep 27, 2024

donoghuc commented Sep 30, 2024

justinstoller Sep 25, 2024 •

edited

Loading

justinstoller commented Sep 27, 2024 •

edited

Loading