
Cannot query rows larger than 100MB limit. #2

Open
plearatblast opened this issue Jun 21, 2022 · 6 comments

Comments

@plearatblast

First off, your session-ized table (sessions.view) is amazing. Thank you for this work. I've been using that sessions.view root SQL to create daily GA4 tables for use outside of Looker.

The recent issue I have encountered isn't a problem with your code base, but since you might have more influence than I do over the root cause, I'm raising it here: the problem is a documented BigQuery limitation, "Cannot query rows larger than 100MB limit." I'm experimenting with a reduction in the session_list_with_event_history CTE to get around this, but I suspect you and others will eventually bump into this constraint. If you have any other alternatives in the works, I'd love to hear them.

@mike-walker4-king

Hey, we're also encountering this issue on the session view. Did you come up with a workaround for it?

@paullear

> Hey, we're also encountering this issue on the session view. Did you come up with a workaround for it?

I would love for the Looker folks to raise this with the BQ folks.
I have found no workaround as yet. It seems to occur only on certain days on one or two of my sites, so I suspect it's related to an excessive number of events being unnested.

@davidbrinegar
Contributor

Hello,
I have not been able to reproduce this issue locally, but my first assumption is that the problem lies in the event_packing CTE, which packs all session event data (including all event parameters) into the session row. If a session has a large number of events, or a large number of custom event parameters are being populated, I could see this producing a very large row.
There is some room to modify this, but the join logic would likely need to be reworked, and I have not yet scoped the impact that would have on the whole block. A significant amount of the structure is built around being able to easily unnest and/or query a parent row from unnested elements. If you isolated the event params + sl_key into a separate table, it might be fairly straightforward to point the event_param elements to that new table, joined on sl_key, as sketched below. Filtering on event_param data would be slower because of the join, but session-level metrics would be unaffected.
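A minimal sketch of that split, assuming the session table exposes sl_key alongside a repeated event_params column (sl_key, event_params, and session_list_with_event_history come from this thread; the dataset and table names are hypothetical placeholders):

-- Hypothetical side table holding only the repeated event params, keyed on sl_key.
create or replace table my_dataset.session_event_params as
select
  sl.sl_key,
  sl.event_params  -- repeated event-parameter structs, removed from the wide session row
from my_dataset.session_list_with_event_history sl;

-- The session row stays narrow; event params are joined back only when needed.
select
  s.*,
  ep.event_params
from my_dataset.sessions s
left join my_dataset.session_event_params ep
  on s.sl_key = ep.sl_key;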

Would it be feasible to test the main session query in isolation, excluding the event_params and/or the entire event_packing CTE (along with their corresponding selected columns and joins in the final selection), to see whether the row-limit error is tied to this portion?

@plearatblast
Author

plearatblast commented Sep 9, 2022

Thanks for the reply and insight.

I had come to the same conclusion and had started testing a revised sessions view SQL that carves out just the session_list_with_event_history's sl.event_params and then joins it back using sl_key.

I was able to recreate, test, and then temporarily resolve the error by simply commenting out the sl.event_params line (line 133) in the sessions.view.lkml file, but I fear that just moving the event parameters into their own subquery might still recreate a wide-row scenario if a single user has too many events (or events with parameters of excessive length). Were you planning on leaving the events unnested in one struct?
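For reference, the temporary fix is just the one commented-out line; only sl.event_params and the line number come from this thread, and the neighboring select-list columns shown here are hypothetical:

-- sessions.view.lkml, final select list (surrounding columns are illustrative)
se.session_data,
-- sl.event_params,  -- line 133: commented out to stay under the 100MB row limit
sf.session_event_count,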

What I will do next is try to isolate a specific user and examine the events and parameters themselves. Perhaps that will add insight.
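A minimal sketch of that isolation step, assuming the standard GA4 BigQuery export schema (events_* tables with a user_pseudo_id column); the project, dataset, and date are placeholders:

-- Surface the users generating the most events on a given day.
select
  user_pseudo_id,
  count(*) as event_count
from `my-project.analytics_123456789.events_*`
where _table_suffix = '20220908'
group by user_pseudo_id
order by event_count desc
limit 10;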

@plearatblast
Author

Interesting "fix" I found today, which may seem a bit arbitrary, but it was based on finding a half dozen visitors with > 5,000 events during a session. Your results may vary. :)

The code update is simple: at the end of your sessions view SQL I added ONE LINE:

...
from session_event_packing se
left join session_tags sa
on se.sl_key = sa.sl_key
left join session_facts sf
on se.sl_key = sf.sl_key
left join device_geo d
on se.sl_key = d.sl_key
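-- added line: cap per-session event counts so packed rows stay under BigQuery's 100MB limit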
WHERE sf.session_event_count < 10000
;;

That removed a single visitor (out of 800K) who had 64,000+ events. Good Lord!

Once added, the query ran fine without hitting any 100MB limit errors -- and that run still included the 5 visitors with over 5,000 events.

@davidbrinegar
Contributor

That is fantastic, and very helpful information.
The events are being ranked in the initial CTE, so perhaps, after determining a reasonable cap, QUALIFY could be applied to event_rank in that first query to limit the number of events captured per session, as sketched below. This would still retain some data about those high-event-count sessions without crippling the overall query.
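A minimal sketch of that cap, assuming the initial CTE computes event_rank with a window function (event_rank comes from this thread; the source table, partition key, ordering column, and the 5,000-event cap are assumptions):

with ranked_events as (
  select
    e.*,
    row_number() over (
      partition by e.sl_key        -- assumed session key
      order by e.event_timestamp   -- assumed ordering column
    ) as event_rank
  from events e
  where true  -- BigQuery requires a WHERE, GROUP BY, or HAVING clause alongside QUALIFY
  -- keep at most 5,000 events per session; later CTEs then pack a bounded row
  qualify event_rank <= 5000
)
select * from ranked_events;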
