
Sync with Prometheus up to 2023-04-14 (7309ac27) #480

Merged: 36 commits into main from sync-upstream-14-apr-2023 on Apr 18, 2023

Conversation

@zenador (Contributor) commented Apr 14, 2023

This brings in changes from prometheus/prometheus up to the 2023-04-14 commit prometheus/prometheus@7309ac2

There are major merge conflicts in rules/manager and tsdb/compact (both code and tests); you might want to rewrite some of that, especially the latter.

codesome and others added 30 commits March 13, 2023 13:13
M-map chunks replayed on startup are discarded if there
was no WAL and no snapshot loaded, because no series have
been created in the Head that they can map to. So m-map
chunks are only loaded from disk if either a snapshot was
loaded or there is a WAL on disk.

Signed-off-by: Ganesh Vernekar <[email protected]>
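A minimal sketch of the condition described above; the helper names are hypothetical stand-ins, not the actual tsdb.Head API:

```
package main

// Hypothetical stubs standing in for the real tsdb.Head internals.
func snapshotLoaded(dir string) bool { return false }
func walExists(dir string) bool      { return false }
func loadMmapChunks(dir string)      {}

// openHead sketches the decision from the commit message: m-map chunks
// are replayed only when a snapshot was loaded or a WAL is on disk,
// because otherwise no series exist in the Head for them to map to.
func openHead(dir string) {
	if snapshotLoaded(dir) || walExists(dir) {
		loadMmapChunks(dir)
	}
}

func main() { openHead("data") }
```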
If the snapshot was enabled with some ooo m-map chunks on disk,
and the WBL was removed between restarts, then we should still be
able to query the ooo m-map chunks after a restart. This test shows
that we are currently unable to query those ooo m-map chunks after
a restart in this situation.

Signed-off-by: Ganesh Vernekar <[email protected]>
Without this fix, if snapshots were enabled and the WBL went missing
between restarts, TSDB did not recognize that there were ooo
m-map chunks on disk, and we could not query them until those chunks
were compacted into blocks.

Signed-off-by: Ganesh Vernekar <[email protected]>
It took a `Labels` whose memory could be re-used, but in practice
this hardly ever helped, especially after converting `relabel.Process`
to `relabel.ProcessBuilder`.

Comparing the parameter to `nil` was a bug: `EmptyLabels` is not `nil`,
so the slice was reallocated multiple times by `append`.

Lastly, `Builder.Labels()` now estimates the final size based on the
labels added and deleted.

Signed-off-by: Bryan Boreham <[email protected]>
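The pitfall in miniature: with the slice-backed implementation, `EmptyLabels()` is a non-nil, zero-length slice, so a `== nil` check never detects it. A self-contained sketch, with `Label`/`Labels` as trimmed stand-ins for the upstream types:

```
package main

import "fmt"

type Label struct{ Name, Value string }
type Labels []Label

// EmptyLabels mirrors the upstream helper: empty, but not nil.
func EmptyLabels() Labels { return Labels{} }

func main() {
	ls := EmptyLabels()
	// A `ls == nil` check (the bug) is false here, so code relying on
	// it to detect a missing buffer kept appending into a zero-capacity
	// slice, reallocating on every grow step.
	fmt.Println(ls == nil, len(ls), cap(ls)) // false 0 0
}
```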
Larger messages cost less, because the per-message overheads at sender
and receiver are spread over more samples.

Example: scraping 1.5 million series every 15 seconds will send 50
messages per second.

Previous default was 500.

Signed-off-by: Bryan Boreham <[email protected]>
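(Worked out: 1,500,000 series / 15 s = 100,000 samples/s; at 2,000 samples per message that is 50 messages/s, versus 200 messages/s at the old default of 500.)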
Keep the target throughput and total pending samples the same.

Signed-off-by: Bryan Boreham <[email protected]>
Deleted labels are remembered even if they were not in `base` or were
removed from `add`, so the estimate `base+add-del` could go negative.

Signed-off-by: Bryan Boreham <[email protected]>
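Worked example of the negative estimate: with 2 labels in base, none added, and 3 remembered deletions that match nothing, base+add-del gives 2 + 0 - 3 = -1, hence the estimate needs a floor of zero.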
Bumps [github.com/Azure/go-autorest/autorest/adal](https://github.com/Azure/go-autorest) from 0.9.22 to 0.9.23.
- [Release notes](https://github.com/Azure/go-autorest/releases)
- [Changelog](https://github.com/Azure/go-autorest/blob/main/CHANGELOG.md)
- [Commits](Azure/go-autorest@autorest/adal/v0.9.22...autorest/adal/v0.9.23)

---
updated-dependencies:
- dependency-name: github.com/Azure/go-autorest/autorest/adal
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
labels: simplify call to get Labels from Builder
Update OOO min/max time properly after replaying m-map chunks
remote-write: raise default samples per send to 2,000
This makes it more consistent with other commands like import rules. We
unfortunately don't have strict rules and uniformity across promtool,
but I think it's better to only have the http config on the relevant check
commands, to avoid giving the impression that Prometheus can e.g. check
the config over the wire.

Signed-off-by: Julien Pivotto <[email protected]>
…b.com/Azure/go-autorest/autorest/adal-0.9.23

build(deps): bump github.com/Azure/go-autorest/autorest/adal from 0.9.22 to 0.9.23
Co-authored-by: Björn Rabenstein <[email protected]>
Signed-off-by: Justin Lei <[email protected]>
Add support for native histograms to concreteSeriesIterator
tsdb: Improve a couple of histogram documentation comments
This is a method used by some downstream projects; it was created to
optimize the implementation in `labels_string.go` but we should have one
for both implementations so the same code works with either.

Signed-off-by: Bryan Boreham <[email protected]>
labels: add ScratchBuilder.Overwrite for slice implementation
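A hedged sketch of what an `Overwrite` for the slice-backed implementation can look like (trimmed stand-in types; consult labels.go upstream for the real method):

```
package main

type Label struct{ Name, Value string }
type Labels []Label

// ScratchBuilder is a minimal stand-in for labels.ScratchBuilder.
type ScratchBuilder struct{ add Labels }

func (b *ScratchBuilder) Add(name, value string) {
	b.add = append(b.add, Label{Name: name, Value: value})
}

// Overwrite copies the built labels into ls, reusing ls's backing
// array when capacity allows, so a caller can recycle one Labels
// value across loop iterations without reallocating.
func (b *ScratchBuilder) Overwrite(ls *Labels) {
	*ls = append((*ls)[:0], b.add...)
}

func main() {
	var ls Labels
	b := ScratchBuilder{}
	b.Add("job", "node")
	b.Overwrite(&ls)
}
```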
In other words: Instead of having a “polymorphous” `Point` that can
either contain a float value or a histogram value, use an `FPoint` for
floats and an `HPoint` for histograms.

This seemingly small change has a _lot_ of repercussions throughout
the codebase.

The idea here is to avoid the increase in size of `Point` arrays that
happened after native histograms had been added.

The higher-level data structures (`Sample`, `Series`, etc.) are still
“polymorphous”. The same idea could be applied to them, but at each
step the trade-offs needed to be evaluated.

The idea with this change is to do the minimum necessary to get back
to pre-histogram performance for functions that do not touch
histograms. Here are comparisons for the `changes` function. The test
data doesn't include histograms yet. Ideally, there would be no change
in the benchmark result at all.

First runtime v2.39 compared to directly prior to this commit:

```
name                                                  old time/op    new time/op    delta
RangeQuery/expr=changes(a_one[1d]),steps=1-16            391µs ± 2%     542µs ± 1%  +38.58%  (p=0.000 n=9+8)
RangeQuery/expr=changes(a_one[1d]),steps=10-16           452µs ± 2%     617µs ± 2%  +36.48%  (p=0.000 n=10+10)
RangeQuery/expr=changes(a_one[1d]),steps=100-16         1.12ms ± 1%    1.36ms ± 2%  +21.58%  (p=0.000 n=8+10)
RangeQuery/expr=changes(a_one[1d]),steps=1000-16        7.83ms ± 1%    8.94ms ± 1%  +14.21%  (p=0.000 n=10+10)
RangeQuery/expr=changes(a_ten[1d]),steps=1-16           2.98ms ± 0%    3.30ms ± 1%  +10.67%  (p=0.000 n=9+10)
RangeQuery/expr=changes(a_ten[1d]),steps=10-16          3.66ms ± 1%    4.10ms ± 1%  +11.82%  (p=0.000 n=10+10)
RangeQuery/expr=changes(a_ten[1d]),steps=100-16         10.5ms ± 0%    11.8ms ± 1%  +12.50%  (p=0.000 n=8+10)
RangeQuery/expr=changes(a_ten[1d]),steps=1000-16        77.6ms ± 1%    87.4ms ± 1%  +12.63%  (p=0.000 n=9+9)
RangeQuery/expr=changes(a_hundred[1d]),steps=1-16       30.4ms ± 2%    32.8ms ± 1%   +8.01%  (p=0.000 n=10+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=10-16      37.1ms ± 2%    40.6ms ± 2%   +9.64%  (p=0.000 n=10+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=100-16      105ms ± 1%     117ms ± 1%  +11.69%  (p=0.000 n=10+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=1000-16     783ms ± 3%     876ms ± 1%  +11.83%  (p=0.000 n=9+10)
```

And then runtime v2.39 compared to after this commit:

```
name                                                  old time/op    new time/op    delta
RangeQuery/expr=changes(a_one[1d]),steps=1-16            391µs ± 2%     547µs ± 1%  +39.84%  (p=0.000 n=9+8)
RangeQuery/expr=changes(a_one[1d]),steps=10-16           452µs ± 2%     616µs ± 2%  +36.15%  (p=0.000 n=10+10)
RangeQuery/expr=changes(a_one[1d]),steps=100-16         1.12ms ± 1%    1.26ms ± 1%  +12.20%  (p=0.000 n=8+10)
RangeQuery/expr=changes(a_one[1d]),steps=1000-16        7.83ms ± 1%    7.95ms ± 1%   +1.59%  (p=0.000 n=10+8)
RangeQuery/expr=changes(a_ten[1d]),steps=1-16           2.98ms ± 0%    3.38ms ± 2%  +13.49%  (p=0.000 n=9+10)
RangeQuery/expr=changes(a_ten[1d]),steps=10-16          3.66ms ± 1%    4.02ms ± 1%   +9.80%  (p=0.000 n=10+9)
RangeQuery/expr=changes(a_ten[1d]),steps=100-16         10.5ms ± 0%    10.8ms ± 1%   +3.08%  (p=0.000 n=8+10)
RangeQuery/expr=changes(a_ten[1d]),steps=1000-16        77.6ms ± 1%    78.1ms ± 1%   +0.58%  (p=0.035 n=9+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=1-16       30.4ms ± 2%    33.5ms ± 4%  +10.18%  (p=0.000 n=10+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=10-16      37.1ms ± 2%    40.0ms ± 1%   +7.98%  (p=0.000 n=10+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=100-16      105ms ± 1%     107ms ± 1%   +1.92%  (p=0.000 n=10+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=1000-16     783ms ± 3%     775ms ± 1%   -1.02%  (p=0.019 n=9+9)
```

In summary, the runtime doesn't really improve with this change for
queries with just a few steps. For queries with many steps, this
commit essentially reinstates the old performance. This is good
because the many-step queries are the ones that matter most (longest
absolute runtime).

In terms of allocations, though, this commit doesn't make a dent at
all (numbers not shown). The reason is that most of the allocations
happen in the sampleRingIterator (in the storage package), which has
to be addressed in a separate commit.

Signed-off-by: beorn7 <[email protected]>
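The resulting split, approximately as it looks in the promql package (a sketch reconstructed from the description above; see promql/value.go upstream for the authoritative definitions):

```
package promqlsketch

import "github.com/prometheus/prometheus/model/histogram"

// FPoint is a flat float point; HPoint carries a histogram pointer.
// Keeping the pointer out of FPoint is what restores the pre-histogram
// size of float point slices.
type FPoint struct {
	T int64
	F float64
}

type HPoint struct {
	T int64
	H *histogram.FloatHistogram
}
```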
Previously, we had one “polymorphous” `sample` type in the `storage`
package. This commit breaks it up into `fSample`, `hSample`, and
`fhSample`, each still implementing the `tsdbutil.Sample` interface.

This reduces allocations in `sampleRing.Add` but inflicts the penalty
of the interface wrapper, which makes things worse in total.

This commit therefore just demonstrates the step taken. The next
commit will tackle the interface overhead problem.

Signed-off-by: beorn7 <[email protected]>
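Roughly the shape of the three types (unexported in the storage package upstream; the tsdbutil.Sample accessor methods are omitted here):

```
package samplesketch

import "github.com/prometheus/prometheus/model/histogram"

// One struct per sample kind instead of one polymorphous sample type;
// each implements tsdbutil.Sample via accessor methods (not shown).
type fSample struct {
	t int64
	f float64
}

type hSample struct {
	t int64
	h *histogram.Histogram
}

type fhSample struct {
	t  int64
	fh *histogram.FloatHistogram
}
```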
This utilizes the fact that most sampleRings will only ever contain
samples of one type. In that case, the generic interface is
circumvented, and a bespoke buffer for the single sample type that
actually occurs is used. Should a sampleRing later receive a sample of
a different kind, it transparently switches to the generic behavior.

Signed-off-by: beorn7 <[email protected]>
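One way to picture that specialization (hypothetical names and a deliberately simplified buffer; the real ring logic lives in storage/buffer.go):

```
package ringsketch

import "github.com/prometheus/prometheus/model/histogram"

type fPoint struct {
	t int64
	f float64
}

type hPoint struct {
	t int64
	h *histogram.Histogram
}

// ring keeps a typed buffer for the common single-type case and falls
// back to a generic interface buffer only once a second sample type
// shows up.
type ring struct {
	fBuf []fPoint      // fast path while only floats are seen
	iBuf []interface{} // generic fallback, nil until needed
}

func (r *ring) addF(t int64, f float64) {
	if r.iBuf != nil {
		r.iBuf = append(r.iBuf, fPoint{t, f})
		return
	}
	r.fBuf = append(r.fBuf, fPoint{t, f})
}

func (r *ring) addH(t int64, h *histogram.Histogram) {
	if r.iBuf == nil && r.fBuf != nil {
		// First non-float sample: transparently switch to the generic
		// buffer, migrating the floats collected so far.
		for _, p := range r.fBuf {
			r.iBuf = append(r.iBuf, p)
		}
		r.fBuf = nil
	}
	r.iBuf = append(r.iBuf, hPoint{t, h})
}
```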
In the past, every sample value was a float, so it was fine to call a
variable holding such a float "value" or "sample". With native
histograms, a sample might have a histogram value. And a histogram
value is still a value. Calling a float value just "value" or "sample"
or "V" is therefore misleading. Over the last few commits, I already
renamed many variables, but this cleans up a few more places where the
changes are more invasive.

Note that we do not attempt any renaming in the JSON APIs or in the
protobufs; that would be quite a disruption. Internally, however, we
can name variables as we like, and we should go with the option that
avoids misunderstandings.

Signed-off-by: beorn7 <[email protected]>
This commit does what I would have expected Go generics to do
for me. However, the PromQL benchmarks show a significant runtime and
allocation increase with `genericAdd`, so this replaces it with
hand-coded non-generic versions.

Signed-off-by: beorn7 <[email protected]>
histograms: Optimize query performance
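A reconstruction of the genericAdd trade-off mentioned above (illustrative only, not the upstream code): a single generic helper versus hand-written per-type twins, which is what the commit settled on after benchmarking.

```
package addsketch

type fSample struct {
	t int64
	f float64
}

// Generic form, conceptually like the replaced genericAdd: one helper
// covering every sample type via a type parameter.
func genericAdd[T any](buf []T, s T) []T {
	return append(buf, s)
}

// Hand-coded form kept instead: one non-generic copy per concrete
// sample type, which benchmarked faster in PromQL.
func addF(buf []fSample, s fSample) []fSample {
	return append(buf, s)
}
```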
@CLAassistant commented Apr 14, 2023

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
8 out of 11 committers have signed the CLA.

✅ beorn7
✅ codesome
✅ bboreham
✅ aknuds1
✅ leizor
✅ zenador
✅ csmarchbanks
✅ roidelapluie
❌ alexqyle
❌ soonping-amzn
❌ bwplotka

@krajorama krajorama self-requested a review April 17, 2023 07:33
Review thread on tsdb/compact.go (outdated, resolved)
@krajorama (Contributor) left a comment:

LGTM. I've taken a closer look at the rules directory and tsdb/compact.go, as per the author's request.

Also checked the diff of diffs between prometheus and mimir-prometheus.

For sure needs another set of eyes.

```
// populateBlock fills the index and chunk writers of output blocks with new data gathered as the union
// of the provided blocks.
type BlockPopulator interface {
	PopulateBlock(ctx context.Context, metrics *CompactorMetrics, logger log.Logger, chunkPool chunkenc.Pool, mergeFunc storage.VerticalChunkSeriesMergeFunc, concurrencyOpts LeveledCompactorConcurrencyOptions, blocks []BlockReader, minT, maxT int64, outBlocks []shardedBlock) error
}
```
@krajorama (Contributor) commented Apr 17, 2023:

Note: this is not the signature from upstream, but it's unclear right now how to reuse the upstream signature directly. I'd leave a more sophisticated solution to a new PR and live with the current straightforward conflict resolution.

Original: https://github.com/prometheus/prometheus/blob/7309ac272195cb856b879306d6a27af7641d3346/tsdb/compact.go#L667

```
PopulateBlock(
	ctx context.Context,
	metrics *CompactorMetrics,
	logger log.Logger,
	chunkPool chunkenc.Pool,
	mergeFunc storage.VerticalChunkSeriesMergeFunc,
	blocks []BlockReader,
	meta *BlockMeta,
	indexw IndexWriter,
	chunkw ChunkWriter,
) error
```

@zenador (Contributor, author) commented:

@pstibrany since you wrote #13, you'd probably know how best to fix the conflicts. Would it make sense for us to request a signature change from upstream, as in prometheus/prometheus#11711?

A Contributor commented:

Once we change the signature of this interface, it's no longer compatible with any implementation provided by upstream, so I'm not sure there's any benefit to having this interface at all.

I tried to think about this last week, and I was wondering if we could refactor the caller stack, leaving this interface as is, and calling it multiple times, once for each outBlock we have.

We reach this code through CompactWithBlockPopulator->write, so maybe we could leave that path as it is upstream, and have our own implementation, CompactWithSplittingAndPopulator, that would use writeWithSplitting and call the populator multiple times?

My main reason to keep thinking about this is to keep at least the public API of the Compactor & BlockPopulator compatible with upstream, to reduce issues in case we want to import some other third-party code that also depends on Prometheus.

A Member commented:

> @pstibrany since you wrote #13, you'd probably know how best to fix the conflicts. Would it make sense for us to request a signature change from upstream, as in prometheus/prometheus#11711?

When I discussed the possibility of upstreaming #13 with @codesome, he was very skeptical about that, since Prometheus wouldn't use it.

> call the populator multiple times?

That's certainly an option, but requires multiple traversals of the input blocks :(

Perhaps we can have our own extension, MultiBlockPopulator, and let the compactor work with both BlockPopulator (from upstream) and MultiBlockPopulator.
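A rough sketch of that suggestion, with placeholder types standing in for the real tsdb ones and parameter lists abbreviated from the signatures quoted above (hypothetical, not merged code):

```
package compactsketch

import "context"

// Placeholder stand-ins for the tsdb types in the signatures above.
type (
	BlockMeta    struct{}
	IndexWriter  interface{}
	ChunkWriter  interface{}
	BlockReader  interface{}
	shardedBlock struct{}
)

// BlockPopulator keeps the upstream single-output signature
// (abbreviated), so third-party implementations written against
// Prometheus keep working.
type BlockPopulator interface {
	PopulateBlock(ctx context.Context, blocks []BlockReader, meta *BlockMeta, indexw IndexWriter, chunkw ChunkWriter) error
}

// MultiBlockPopulator is the hypothetical fork-only extension covering
// the sharded (split-compaction) output path in one traversal.
type MultiBlockPopulator interface {
	PopulateBlocks(ctx context.Context, blocks []BlockReader, minT, maxT int64, outBlocks []shardedBlock) error
}
```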

@zenador (Contributor, author) commented:

Yep, maybe not upstream the whole thing, but just enough so that we wouldn't need to modify the signature in our fork.

A Contributor commented:

Added an issue for tracking: #484. Also added it to the Native Histograms EPIC so it's not lost.

A Contributor commented:

> That's certainly an option, but requires multiple traversals of the input blocks

I was thinking about implementing the BlockPopulator interface with a closure created in writeWithSplitting, which would traverse the blocks only once.

@colega (Contributor) commented Apr 18, 2023:

This LGTM. I think we should spend some time thinking about how to tackle the compactor changes; this feels more like a refactor task than a merge-upstream task. Anyway, I'll leave any decision here to @pstibrany, as he has more context about the compactor change.

@pstibrany (Member) commented:

I am fine with merging this in its current form, but I agree that we should fix this soon and use the upstream version of BlockPopulator.

One alternative could be to fork our sharding compactor code (reusing as much as possible) and use it from Mimir, while keeping only the upstream Prometheus compactor in mimir-prometheus.

@krajorama krajorama merged commit 21131bf into main Apr 18, 2023
@krajorama krajorama deleted the sync-upstream-14-apr-2023 branch April 18, 2023 08:21
@codesome (Member) commented:

@pstibrany if you have any interface change/update/extension suggestions for upstream, please let us know.
