panic: unsupported for aggregate min: *reads.stringMultiShardArrayCursor #26142

Open
ttftw opened this issue Mar 14, 2025 · 15 comments

Labels: area/flux, area/storage, kind/bug, team/edge

@ttftw commented Mar 14, 2025

I have asked for help in Slack and will follow up with customer support, but I thought I would also post this here.

I am currently using InfluxDB Cloud as our primary data store. All data comes in from an MQTT server. I recently spun up a local InfluxDB instance in a Docker container and am sending a copy of the data to this new local server for testing. The local server adds a few new tags, but besides that, everything else should be the same: same bucket name, same data structure, etc. I'm doing this so I can drop the new server into existing Grafana dashboards and test some things before we push changes to the production server.

When I run the same queries from the cloud against the local server, I have some that cause influxdb to panic.

from(bucket: "bucket")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "Sample")
  |> filter(fn: (r) => r["serial"] == "xxx.xxx")
  |> filter(fn: (r) => r["channel"] == "Cellular")
  |> aggregateWindow(every: v.windowPeriod, fn: min, createEmpty: false)

This runs fine on the cloud server, but when I run it locally, I get

panic: unsupported for aggregate min: *reads.stringMultiShardArrayCursor

and if I comment out the aggregateWindow, it returns the data as I'd expect. One thing to note: when I run this query, the _value column's data type is a double in the cloud and a long on the local instance, but besides that, the tables look identical when I comment out the aggregateWindow.
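
For comparison, a query along these lines (a sketch, assuming Flux 0.141+ for the types package, using the same placeholder bucket and tag values as above) can label each row's _value type so the cloud and local results can be checked side by side:

import "types"

from(bucket: "bucket")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "Sample")
  |> filter(fn: (r) => r["serial"] == "xxx.xxx")
  |> filter(fn: (r) => r["channel"] == "Cellular")
  // Add a valueType column describing the stored type of _value in each table.
  |> map(fn: (r) => ({r with valueType:
        if types.isType(v: r._value, type: "float") then "float"
        else if types.isType(v: r._value, type: "int") then "int"
        else if types.isType(v: r._value, type: "string") then "string"
        else "other"}))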

Oddly enough, I just noticed that if I change the query from

|> filter(fn: (r) => r["channel"] == "Cellular")

to

|> filter(fn: (r) => r["_field"] == "Status.Cellular")

which targets the same data rows in a different way, then I do not get the panic and the data returns as expected with the aggregateWindow.

Any insight as to why this query would be successful in the cloud version but panic in the OSS version will be much appreciated.

InfluxDB v2.7.11
Server: fbf5d4a

Logs:

influxdb-1 | ts=2025-03-14T15:39:37.538278Z lvl=info msg="Execute source panic" log_id=0vHGw81W000 service=storage-reads error="panic: unsupported for aggregate min: *reads.stringMultiShardArrayCursor" stacktrace="goroutine 207758 [running]:\nruntime/debug.Stack()\n\t/go/src/runtime/debug/stack.go:24 +0x5e\ngithub.com/influxdata/flux/execute.(*executionState).recover(0xc002241170)\n\t/go/pkg/mod/github.com/influxdata/[email protected]/execute/recover.go:32 +0x1fd\npanic({0x7f33d4a24b40?, 0xc0036f9650?})\n\t/go/src/runtime/panic.go:770 +0x132\ngithub.com/influxdata/influxdb/v2/storage/reads.newWindowMinArrayCursor({0x7f33d502de40?, 0xc00735d850?}, {{0x0, 0x2540be400, 0x0}, {0x0, 0x2540be400, 0x0}, 0x0, 0x0, ...})\n\t/root/project/storage/reads/array_cursor.gen.go:158 +0x40c\ngithub.com/influxdata/influxdb/v2/storage/reads.newWindowAggregateArrayCursor({0xc0069e5dc0?, 0x9?}, 0x9?, {{0x0, 0x2540be400, 0x0}, {0x0, 0x2540be400, 0x0}, 0x0, ...}, ...)\n\t/root/project/storage/reads/array_cursor.go:43 +0x9c\ngithub.com/influxdata/influxdb/v2/storage/reads.(*windowAggregateResultSet).createCursor(0xc0069e3770, {{0x0, 0x0, 0x0}, {0x7f3384608495, 0x6, 0x3f7b6b}, {0xc002071340, 0x7, 0x7}, ...})\n\t/root/project/storage/reads/aggregate_resultset.go:131 +0x2a6\ngithub.com/influxdata/influxdb/v2/storage/reads.(*windowAggregateResultSet).Next(0xc0069e3770)\n\t/root/project/storage/reads/aggregate_resultset.go:84 +0x138\ngithub.com/influxdata/influxdb/v2/storage/flux.(*windowAggregateIterator).handleRead(0xc00295a100, 0x7f33d5035218?, {0x7f33d5048520, 0xc0069e3770})\n\t/root/project/storage/flux/reader.go:742 +0x3aa\ngithub.com/influxdata/influxdb/v2/storage/flux.(*windowAggregateIterator).Do(0xc00295a100, 0xc0016b3ae0)\n\t/root/project/storage/flux/reader.go:688 +0x365\ngithub.com/influxdata/influxdb/v2/query/stdlib/influxdata/influxdb.(*Source).processTables(0xc002070f20, {0x7f33d5035218, 0xc0036fe090}, {0x7f33d5020118, 0xc00295a100}, 0x182cb58e09149c46)\n\t/root/project/query/stdlib/influxdata/influxdb/source.go:69 +0x9a\ngithub.com/influxdata/influxdb/v2/query/stdlib/influxdata/influxdb.(*readWindowAggregateSource).run(0xc002070f20, {0x7f33d5035218, 0xc0036fe090})\n\t/root/project/query/stdlib/influxdata/influxdb/source.go:303 +0x106\ngithub.com/influxdata/influxdb/v2/query/stdlib/influxdata/influxdb.(*Source).Run(0xc002070f20, {0x7f33d5035218, 0xc0036fe090})\n\t/root/project/query/stdlib/influxdata/influxdb/source.go:50 +0xa3\ngithub.com/influxdata/flux/execute.(*executionState).do.func2({0x7f33d5036d70, 0xc002070f20})\n\t/go/pkg/mod/github.com/influxdata/[email protected]/execute/executor.go:535 +0x375\ncreated by github.com/influxdata/flux/execute.(*executionState).do in goroutine 234\n\t/go/pkg/mod/github.com/influxdata/[email protected]/execute/executor.go:515 +0xf8\n"

@davidby-influx added the kind/bug, area/storage, area/flux, and team/edge labels on Mar 18, 2025
devanbenz added a commit that referenced this issue on Mar 19, 2025:
This PR alleviates the erroneous panic we are seeing in #26142. There should not be a panic; instead we should return an error.
@devanbenz

@ttftw I've opened a PR to remove the erroneous panic. That said, I'll also need to take a look at the query to see why it's being transformed to use the improper cursor type. Thank you for the detailed bug report. I'm going to continue looking into it today.

@devanbenz

Also, if possible, would you be willing to send over some line protocol for the schema you're using so I can write data to a local InfluxDB to help with debugging? In the meantime, I'll try my best to reproduce some mock data with the query you've provided.

@ttftw commented Mar 19, 2025

@devanbenz Here is a sample of data that generated the error above:

Sample,alarm=False,channel=Cellular,client=None,driver=Status,project=None,serial=PV112,site=Demo Status.Cellular=-17i 1742420358

@devanbenz

@ttftw thank you - another question: how exactly are you running the Flux query? Are you using the Chronograf UI or running it some other way?

@ttftw commented Mar 20, 2025

I've tried and confirmed the error from Grafana dashboards/Explore and from the Data Explorer tool in InfluxDB.

@devanbenz commented Mar 25, 2025

I've been attempting to reproduce this issue locally without much luck. I assume that this data was line protocol you dumped from a TSM file?

Sample,alarm=False,channel=Cellular,client=None,driver=Status,project=None,serial=PV112,site=Demo Status.Cellular=-17i 1742420358

Just ensuring I am reading it correctly. I assume that your tag set is the following:

alarm=False,channel=Cellular,client=None,driver=Status,project=None,serial=PV112,site=Demo

Can you verify that I have the tags right? If possible, I may request that you give me a clone of some mock data in the form of a TSM file from where you're seeing the issue. I would like to try and reproduce this issue myself.

@devanbenz commented Mar 25, 2025

Okay, so I was able to reproduce this issue. I basically wrote some points in line protocol using a script. I modified my points so that they had a string type for _value, which is just the field value as outlined in our documentation here: https://docs.influxdata.com/influxdb/cloud/reference/key-concepts/data-elements/#fields. Obviously I only have one field, matching channel, but I wonder if you have multiple fields and there's a conflict between them somehow... My suspicion is that you have some sneaky _values locally that are of string type?

Documents/InfluxData/issue_26142 via 🐍 v3.10.12 on ☁  [email protected] took 18s
❯ python3 repro_lp.py
Data saved to influxdb_data.txt

Sample of generated data:
Sample,alarm=true,channel=WiFi,client=Client1,driver=Status,project=Project3,serial=PV112,site=Production WiFi="bar" 1742324017
Sample,alarm=true,channel=Cellular,client=None,driver=Sensor,project=None,serial=PV112,site=Production Cellular="baz" 1742324077
Sample,alarm=false,channel=Cellular,client=Client3,driver=Monitor,project=None,serial=PV112,site=Staging Cellular="bar" 1742324137
Sample,alarm=false,channel=Ethernet,client=Client2,driver=Control,project=Project1,serial=PV112,site=Development Ethernet="baz" 1742324198
Sample,alarm=true,channel=Bluetooth,client=Client1,driver=Status,project=Project1,serial=PV112,site=Production Bluetooth="foo" 1742324258

I've gone ahead and added a PR for this case so that instead of panicking we return an error: #26165

On your local instance where you're seeing the issue, could you run your Flux query without the aggregation?

from(bucket: "bucket")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "Sample")
  |> filter(fn: (r) => r["serial"] == "xxx.xxx")
  |> filter(fn: (r) => r["channel"] == "Cellular")

And send over a screen grab of the Simple Table graph, like so?

Image

I have a feeling that somehow the query is attempting to aggregate on string values instead of numerical values.
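
If it helps, a query along these lines (a sketch assuming Flux 0.141+ for the types package, with the bucket and tag values from your report) should surface any string-typed series hiding behind the channel == "Cellular" filter:

import "types"

from(bucket: "bucket")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "Sample")
  |> filter(fn: (r) => r["channel"] == "Cellular")
  // Keep only rows whose _value is stored as a string.
  |> filter(fn: (r) => types.isType(v: r._value, type: "string"))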

@ttftw commented Mar 26, 2025

Image

Here are the columns that come back from that query. Without the aggregate window, the data comes back. With the aggregate window, I get the error.

If I change the query to:

from(bucket: "bucket")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "Sample")
  |> filter(fn: (r) => r["serial"] == "xxx.xxx")
  |> filter(fn: (r) => r["_field"] == "Status.Cellular")
  |> aggregateWindow(every: v.windowPeriod, fn: min, createEmpty: false)

This targets the same rows and runs fine. For some reason, changing the filter to |> filter(fn: (r) => r["channel"] == "Cellular") causes this error.

This is the data schema that comes back when I use the _field query above with the aggregate window:

Image

@devanbenz commented Mar 26, 2025

And what if you send the query just like this?

from(bucket: "bucket")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "Sample")

Image

I wonder if there is a mix of _value types. I see the following when I write the two points:

Sample,alarm=False,channel=Cellular,client=None,driver=Status,project=None,serial=PV112,site=Demo Status.Cellular=-17i 1742420358

and

Sample,alarm=False,channel=Wifi,client=None,driver=Status,project=None,serial=PV112,site=Demo Status.Wifi="foo" 1742420359

When querying this using the window aggregate, I see the following (with my fix to bubble up an error instead of a panic):

Image

Other than that, I cannot repro with the line protocol data you provided. It appears that adjusting your query works. I'll remove the panics and close the issue if we can't dig up anything more.

@ttftw commented Mar 27, 2025

> from(bucket: "bucket")
>   |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
>   |> filter(fn: (r) => r["_measurement"] == "Sample")

This works, but there are a handful of tables that have a different data type for _value. Some _value's are strings and some are longs.

> Other than that I cannot repro with the data in line protocol you provided. It appears that adjusting your query works. I'll remove the panics and close the issue if we can't dig up anything more.

I might be misunderstanding, but your last screenshot seems to be reproducing the bug I see?
In that bucket, I have data like you describe above, where the channel tag value changes and the field's _value type is a string for some records and a long for others. These different datasets have different field names, but the same names have the same data type. If I filter by channel == cellular, that table only returns field names whose _values are longs, but if I filter by channel == cellular_name, then the field _value type will be a string. When I target the data by channel == cellular and try to aggregate, I get this panic that says something about aggregating strings, even though the type for these fields' _values is long.

If I change the query from channel == "cellular" to field == "status.cellular", these two queries return the exact same dataset with the same data types, but I do not get the panic when aggregating and filtering by field name, while I do when filtering by the channel tag. The same panic happens when I filter on other channel tag values, too.

Something else I just noticed: if I run this query:

from(bucket: "bucket")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "Sample")
  |> filter(fn: (r) => r["serial"] == "xxx")
  |> filter(fn: (r) => r["channel"] == "Cellular")
  |> group(columns: ["_field"])
  |> aggregateWindow(every: 1m, fn: min, createEmpty: false)

It works, but if I comment out |> group(columns: ["_field"]), then I get the panic.

Also, to reiterate from my original post, all of this works correctly with no panics or errors when I run these same queries against a duplicate set of this data in our InfluxDB Cloud database.

@devanbenz

Sorry - I meant with the single line:

Sample,alarm=False,channel=Cellular,client=None,driver=Status,project=None,serial=PV112,site=Demo Status.Cellular=-17i 1742420358

I could not reproduce. Adding additional lines with my own data and mixing strings with longs/ints caused the issue, which sounds like what you have: strings for some of the _fields in your bucket.

In the cloud version, do you have mixed types for the _field as well? I'm tempted to close this issue, as it appears you have found a workaround with the Flux query and we will be removing the panics in an upcoming release. I've also found that just calling |> group() resolves the issue as well.
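
For reference, that would look something like this (a sketch based on the query from the original report; note that a bare group() merges all matched series into a single table before the aggregate, so the output shape differs from the per-field grouping shown earlier):

from(bucket: "bucket")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "Sample")
  |> filter(fn: (r) => r["serial"] == "xxx.xxx")
  |> filter(fn: (r) => r["channel"] == "Cellular")
  // Regroup before aggregating; group(columns: ["_field"]) keeps per-field tables instead.
  |> group()
  |> aggregateWindow(every: v.windowPeriod, fn: min, createEmpty: false)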

@ttftw commented Mar 27, 2025

Yes, in the cloud version the data mix is the same. The data that's coming in is going to both cloud and OSS versions, so it's basically a duplicate dataset.

I have found a workaround, yes, but this doesn't feel like a confidence-inspiring solution for continuing to use InfluxDB, whether Cloud or OSS. It seems from this thread that this is a bug, right? If so, shouldn't this be a valid issue that gets resolved rather than closed without a fix? This kind of data and query seems like a typical use case that should work without an error. Or is this not being fixed because 2.0 is being deprecated?

I'm trying to drop an InfluxDB OSS datasource into Grafana to test some queries against existing dashboards that already work with InfluxDB Cloud, but when I swap the datasource to InfluxDB OSS, it throws these errors. Even if it's not panicking, it's still going to error and not return the data, so this is something we will have to work around going forward: I'll have to edit dozens of queries across dozens of dashboards just to get dashboards that already work to work around this bug.

@devanbenz

I'm going to take a deeper dive into why it looks like the filter is pruning data in Cloud 2 whereas it's not in OSS for that specific query. I was able to set up a full reproduction case using Grafana + OSS running locally + Cloud 2 running locally. I'll update this ticket with more information as I work on it. Please stand by.

@ttftw commented Mar 28, 2025

Thank you. I appreciate the help.

@devanbenz

After taking a deeper dive into this, it appears to be a difference in how cursors work under the hood between our Cloud 2 and OSS 2.x codebases with Flux. Resolving it would require a larger change to Flux and OSS 2.x and, unfortunately, with Flux being in maintenance mode, we are no longer working toward extending support in a way that would require such a large change.
