Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZTC-1478: track the number of times a logger has been replaced with a child logger #47

Merged
merged 1 commit into from
Jun 13, 2024
Merged

ZTC-1478: track the number of times a logger has been replaced with a child logger #47

merged 1 commit into from
Jun 13, 2024

Conversation

bcottrell8738
Copy link
Contributor

Every time we call the add_fields! macro, it adds one to the depth of the nested structure of Arcs inside the logger object. If the structure gets too deeply nested, it causes a stack overflow on drop.

Fixing this to make it keep the structure flat rather than nested would be difficult; so, this commit makes it keep track of how many times it has replaced a logger with a child logger, and it stops you (with a panic) before you reach a dangerous level of nesting.

The panic string is intended to be informative enough so that you get a hint as to where to go look in your code.

// NOTE: we intentionally use a lock without poisoning here to not
// panic the threads if they just share telemetry with failed thread.
pub(crate) type SharedLog = Arc<parking_lot::RwLock<Logger>>;
pub(crate) type SharedLog = Arc<parking_lot::RwLock<LogContext>>;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment above explains why we cannot get away with just a single Arc and a single RwLock, with both the Logger and the u32 living underneath the RwLock.

If we really wanted to, we could get rid of the outer RwLock, and just have SharedLog be defined as an Arc<LogContext>. That would mean we would need interior mutability inside LogContext for updating the generation number; so that could be changed to an AtomicU32.

I initially did it that way; but I found myself writing voluminous comments everywhere in order to justify/excuse what I was doing. It felt like a code smell, to me. So I added the extra RwLock layer. I feel like this way of doing it is more readily comprehensible to people who might be reading the code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I think it would be better to use an AtomicU32 on generation. It's weird to me to have a struct that has only partial interior mutability, as it forces what you did here - two layers of locks. Since we're forced into the Arc<RwLock<Logger>> for storing the logger itself, I think the next best choice is making the other field(s) on LogContext have their own locks.

One benefit is that it would reduce work on the hot path of actually writing logs, while it would probably be a similar amount of work (not sure how locking a RwLock compares to incrementing + reading an atomic) on the add_fields! path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the PR accordingly. My gut says "no, it was better the way it was before" but I am curious what you (and others) think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is an improvement. As you note, it's not mission critical that generation is exactly in sync - we don't care if it triggers at 1005 and not 1000, really. Maybe we could even used Ordering::Relaxed, but I'm no expert on what exactly that does and whether it's safe. I really like that we don't have an extra lock on the hot path of logging now. overwrite_from is a little gross, but I think it's worth it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK! I'm glad we did this, rather than just me making assumptions about what would be best! :-)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the current implementation, the generation field is indeed accessed always under the logger lock, which creates the necessary happens-before relationship. Those atomic accesses could be replaced by relaxed operations without any impact.

As for what those mean, basically it's how strong of a memory barrier the compiler needs to issue depending on the architecture of the processor. For e.g., with strong consistency you don't really get any more expensive than with relaxed in x86_64. But for ARM, strong consistency is more expensive than relaxed because the memory model of the processor is weaker than x86_64.

As for atomic increments vs lock acquisition, the former may be cheaper in some architectures (e.g. x86_64 has fetch_and_inc which is commutative and much more scalable in terms of concurrency than compare_and_swap, with the latter being the primitive used for locks).

// panics (with an error that gives you a hint as to where to go look in your
// code).
#[with_test_telemetry(test)]
fn test_limit_num_generations(_ctx: TestTelemetryContext) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think an important property of MAX_LOG_GENERATION that we want to verify (but I don't think the current test is) is: "MAX_LOG_GENERATION is small enough to prevent a stackoverflow on drop". It seems like this test is verifying "if we call add_fields more than the max, we get our own panic". Could we also have a test that verifies that "if you call add_fields just under the max, it doesn't panic on drop"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit test case is added!

let log = parent.read().new(slog::o!());

Arc::new(parking_lot::RwLock::new(log))
let log_lock = parent.read();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't strike me as being functionally the same...

Unless I'm missing something, I think these changes will make it so that fork_log still writes to the same output (since it's just cloning the arcs and locks, not creating a new log with a new output)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why it wouldn't be functionally the same. Calling Logger::new() with an empty KV will clone the drain and add zero fields to the KV. Which is basically what <LogContext as Clone>::clone will do, too.

I'm open to being convinced otherwise! but I will leave this as is for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah - I mistakenly thought o! was making a new output or some such, not the KV. Looking at slog's source now, I agree, I think you're right and what you're doing is functionally the same.

// NOTE: we intentionally use a lock without poisoning here to not
// panic the threads if they just share telemetry with failed thread.
pub(crate) type SharedLog = Arc<parking_lot::RwLock<Logger>>;
pub(crate) type SharedLog = Arc<parking_lot::RwLock<LogContext>>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I think it would be better to use an AtomicU32 on generation. It's weird to me to have a struct that has only partial interior mutability, as it forces what you did here - two layers of locks. Since we're forced into the Arc<RwLock<Logger>> for storing the logger itself, I think the next best choice is making the other field(s) on LogContext have their own locks.

One benefit is that it would reduce work on the hot path of actually writing logs, while it would probably be a similar amount of work (not sure how locking a RwLock compares to incrementing + reading an atomic) on the add_fields! path.

Copy link
Contributor

@jgpaiva jgpaiva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this looks way better than if I had done it.
My only comment is around documenting this new behaviour, maybe the documentation for add_fields and set_verbosity should warn the user about the potential panic lurking inside?

@bcottrell8738
Copy link
Contributor Author

Nice, this looks way better than if I had done it. My only comment is around documenting this new behaviour, maybe the documentation for add_fields and set_verbosity should warn the user about the potential panic lurking inside?

How does that look?

@jgpaiva
Copy link
Contributor

jgpaiva commented Jun 12, 2024

Nice, this looks way better than if I had done it. My only comment is around documenting this new behaviour, maybe the documentation for add_fields and set_verbosity should warn the user about the potential panic lurking inside?

How does that look?

LGTM!

…child

Every time we call the add_fields! macro, it adds one to the depth of the
nested structure of Arcs inside the logger object. If the structure gets
too deeply nested, it causes a stack overflow on drop.

Fixing this to make it keep the structure flat rather than nested would be
difficult; so, this commit makes it keep track of how many times it has
replaced a logger with a child logger, and it stops you (with a panic)
before you reach a dangerous level of nesting.

The panic string is intended to be informative enough so that you get a hint
as to where to go look in your code.
@fisherdarling fisherdarling merged commit 341befe into cloudflare:main Jun 13, 2024
16 of 17 checks passed
@bcottrell8738 bcottrell8738 mentioned this pull request Jun 14, 2024
bcottrell8738 added a commit to bcottrell8738/foundations that referenced this pull request Jul 9, 2024
…d with a child logger (cloudflare#47)"

This reverts commit 341befe.

(This change is being reimplemented in a different way, see the next commit)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants