ZTC-1478: track the number of times a logger has been replaced with a child logger #47

bcottrell8738 · 2024-06-12T02:44:19Z

Every time we call the add_fields! macro, it adds one to the depth of the nested structure of Arcs inside the logger object. If the structure gets too deeply nested, it causes a stack overflow on drop.

Fixing this to make it keep the structure flat rather than nested would be difficult; so, this commit makes it keep track of how many times it has replaced a logger with a child logger, and it stops you (with a panic) before you reach a dangerous level of nesting.

The panic string is intended to be informative enough so that you get a hint as to where to go look in your code.

bcottrell8738 · 2024-06-12T02:57:07Z

foundations/src/telemetry/log/internal.rs

 // NOTE: we intentionally use a lock without poisoning here to not
 // panic the threads if they just share telemetry with failed thread.
-pub(crate) type SharedLog = Arc<parking_lot::RwLock<Logger>>;
+pub(crate) type SharedLog = Arc<parking_lot::RwLock<LogContext>>;


The comment above explains why we cannot get away with just a single Arc and a single RwLock, with both the Logger and the u32 living underneath the RwLock.

If we really wanted to, we could get rid of the outer RwLock, and just have SharedLog be defined as an Arc<LogContext>. That would mean we would need interior mutability inside LogContext for updating the generation number; so that could be changed to an AtomicU32.

I initially did it that way; but I found myself writing voluminous comments everywhere in order to justify/excuse what I was doing. It felt like a code smell, to me. So I added the extra RwLock layer. I feel like this way of doing it is more readily comprehensible to people who might be reading the code.

Personally, I think it would be better to use an AtomicU32 on generation. It's weird to me to have a struct that has only partial interior mutability, as it forces what you did here - two layers of locks. Since we're forced into the Arc<RwLock<Logger>> for storing the logger itself, I think the next best choice is making the other field(s) on LogContext have their own locks.

One benefit is that it would reduce work on the hot path of actually writing logs, while it would probably be a similar amount of work (not sure how locking a RwLock compares to incrementing + reading an atomic) on the add_fields! path.

I've updated the PR accordingly. My gut says "no, it was better the way it was before" but I am curious what you (and others) think.

I think this is an improvement. As you note, it's not mission critical that generation is exactly in sync - we don't care if it triggers at 1005 and not 1000, really. Maybe we could even used Ordering::Relaxed, but I'm no expert on what exactly that does and whether it's safe. I really like that we don't have an extra lock on the hot path of logging now. overwrite_from is a little gross, but I think it's worth it.

OK! I'm glad we did this, rather than just me making assumptions about what would be best! :-)

With the current implementation, the generation field is indeed accessed always under the logger lock, which creates the necessary happens-before relationship. Those atomic accesses could be replaced by relaxed operations without any impact.

As for what those mean, basically it's how strong of a memory barrier the compiler needs to issue depending on the architecture of the processor. For e.g., with strong consistency you don't really get any more expensive than with relaxed in x86_64. But for ARM, strong consistency is more expensive than relaxed because the memory model of the processor is weaker than x86_64.

As for atomic increments vs lock acquisition, the former may be cheaper in some architectures (e.g. x86_64 has fetch_and_inc which is commutative and much more scalable in terms of concurrency than compare_and_swap, with the latter being the primitive used for locks).

OmegaJak · 2024-06-12T15:50:30Z

foundations/tests/logging.rs

+// panics (with an error that gives you a hint as to where to go look in your
+// code).
+#[with_test_telemetry(test)]
+fn test_limit_num_generations(_ctx: TestTelemetryContext) {


I think an important property of MAX_LOG_GENERATION that we want to verify (but I don't think the current test is) is: "MAX_LOG_GENERATION is small enough to prevent a stackoverflow on drop". It seems like this test is verifying "if we call add_fields more than the max, we get our own panic". Could we also have a test that verifies that "if you call add_fields just under the max, it doesn't panic on drop"?

Unit test case is added!

OmegaJak · 2024-06-12T15:56:08Z

foundations/src/telemetry/log/internal.rs

-    let log = parent.read().new(slog::o!());
-
-    Arc::new(parking_lot::RwLock::new(log))
+    let log_lock = parent.read();


This doesn't strike me as being functionally the same...

Unless I'm missing something, I think these changes will make it so that fork_log still writes to the same output (since it's just cloning the arcs and locks, not creating a new log with a new output)

I'm not sure why it wouldn't be functionally the same. Calling Logger::new() with an empty KV will clone the drain and add zero fields to the KV. Which is basically what <LogContext as Clone>::clone will do, too.

I'm open to being convinced otherwise! but I will leave this as is for now.

Ah - I mistakenly thought o! was making a new output or some such, not the KV. Looking at slog's source now, I agree, I think you're right and what you're doing is functionally the same.

OmegaJak · 2024-06-12T16:11:09Z

foundations/src/telemetry/log/internal.rs

 // NOTE: we intentionally use a lock without poisoning here to not
 // panic the threads if they just share telemetry with failed thread.
-pub(crate) type SharedLog = Arc<parking_lot::RwLock<Logger>>;
+pub(crate) type SharedLog = Arc<parking_lot::RwLock<LogContext>>;


Personally, I think it would be better to use an AtomicU32 on generation. It's weird to me to have a struct that has only partial interior mutability, as it forces what you did here - two layers of locks. Since we're forced into the Arc<RwLock<Logger>> for storing the logger itself, I think the next best choice is making the other field(s) on LogContext have their own locks.

One benefit is that it would reduce work on the hot path of actually writing logs, while it would probably be a similar amount of work (not sure how locking a RwLock compares to incrementing + reading an atomic) on the add_fields! path.

jgpaiva

Nice, this looks way better than if I had done it.
My only comment is around documenting this new behaviour, maybe the documentation for add_fields and set_verbosity should warn the user about the potential panic lurking inside?

bcottrell8738 · 2024-06-12T20:57:03Z

Nice, this looks way better than if I had done it. My only comment is around documenting this new behaviour, maybe the documentation for add_fields and set_verbosity should warn the user about the potential panic lurking inside?

How does that look?

jgpaiva · 2024-06-12T22:09:52Z

Nice, this looks way better than if I had done it. My only comment is around documenting this new behaviour, maybe the documentation for add_fields and set_verbosity should warn the user about the potential panic lurking inside?

How does that look?

LGTM!

…child Every time we call the add_fields! macro, it adds one to the depth of the nested structure of Arcs inside the logger object. If the structure gets too deeply nested, it causes a stack overflow on drop. Fixing this to make it keep the structure flat rather than nested would be difficult; so, this commit makes it keep track of how many times it has replaced a logger with a child logger, and it stops you (with a panic) before you reach a dangerous level of nesting. The panic string is intended to be informative enough so that you get a hint as to where to go look in your code.

…d with a child logger (cloudflare#47)" This reverts commit 341befe. (This change is being reimplemented in a different way, see the next commit)

bcottrell8738 commented Jun 12, 2024

View reviewed changes

bcottrell8738 marked this pull request as ready for review June 12, 2024 03:02

nmldiegues requested review from ipostelnik, inikulin, nmldiegues, norwoodj, Noah-Kennedy and fisherdarling June 12, 2024 14:59

OmegaJak reviewed Jun 12, 2024

View reviewed changes

jgpaiva approved these changes Jun 12, 2024

View reviewed changes

nmldiegues approved these changes Jun 12, 2024

View reviewed changes

fisherdarling merged commit 341befe into cloudflare:main Jun 13, 2024
16 of 17 checks passed

bcottrell8738 mentioned this pull request Jun 14, 2024

Release 3.4.0 #48

Closed

bcottrell8738 mentioned this pull request Jul 9, 2024

ZTC-1478: rework PR #47 in light of the fact that the next release we do is going to have to be a major version bump anyway, no matter what #52

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ZTC-1478: track the number of times a logger has been replaced with a child logger #47

ZTC-1478: track the number of times a logger has been replaced with a child logger #47

bcottrell8738 commented Jun 12, 2024

bcottrell8738 Jun 12, 2024

OmegaJak Jun 12, 2024

bcottrell8738 Jun 12, 2024

OmegaJak Jun 12, 2024

bcottrell8738 Jun 12, 2024

nmldiegues Jun 12, 2024

OmegaJak Jun 12, 2024

bcottrell8738 Jun 12, 2024

OmegaJak Jun 12, 2024

bcottrell8738 Jun 12, 2024

OmegaJak Jun 12, 2024

OmegaJak Jun 12, 2024

jgpaiva left a comment

bcottrell8738 commented Jun 12, 2024

jgpaiva commented Jun 12, 2024

ZTC-1478: track the number of times a logger has been replaced with a child logger #47

ZTC-1478: track the number of times a logger has been replaced with a child logger #47

Conversation

bcottrell8738 commented Jun 12, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jgpaiva left a comment

Choose a reason for hiding this comment

bcottrell8738 commented Jun 12, 2024

jgpaiva commented Jun 12, 2024