feat: add basic telemetry to pipelines 2.0 #5929
Conversation
Left a few comments. Honestly, I'm not convinced the exponential sampling will do much; I'd go with a constant sampling rate (or, if we want an "advanced" solution, an adaptive strategy based on the number of events per hour/day).
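For illustration, a minimal sketch of what a constant sampling rate could look like (the SAMPLE_RATE value and the function name are hypothetical, not part of this PR):

import random

SAMPLE_RATE = 0.1  # hypothetical: send roughly 1 in 10 run events

def should_send_event() -> bool:
    # Constant-rate sampling: every event has the same chance of being sent,
    # independent of how many runs happened before.
    return random.random() < SAMPLE_RATE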
I verified that, on the PostHog side, the SQL feature allows us to aggregate these values in two ways:
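-- Option 1: sum the runs property over all 'Pipeline run (2.x)' events per person and pipeline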
WITH FilteredEvents AS (
SELECT
person_id AS person_id,
properties.pipeline_id AS pipeline_id,
sum(properties.runs) AS total_runs
FROM
events
WHERE
event = 'Pipeline run (2.x)'
GROUP BY
person_id,
pipeline_id
)
SELECT *
FROM
FilteredEvents
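
-- Option 2: keep only the runs value from the most recent event per person and pipeline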
WITH FilteredEvents AS (
SELECT
person_id AS person_id,
properties.pipeline_id AS pipeline_id,
properties.runs AS runs,
ROW_NUMBER() OVER (PARTITION BY (person_id, pipeline_id) ORDER BY timestamp DESC) AS rn
FROM
events
WHERE
event = 'Pipeline run (2.x)'
)
SELECT
person_id,
pipeline_id,
runs
FROM
FilteredEvents
WHERE
rn = 1

Right now the values of …

Note that, in either case, PostHog does NOT allow making graphs out of this data. It can only be shown in table form. See the bottom of https://posthog.com/docs/product-analytics/sql
Downsampling by number of runs messes up our weekly active users metric. Limiting the number of events sent per day per user would work, but not limiting based on the number of runs. For now, it's best not to include it. 👍
pipeline_id is currently just based on the memory address, but ideally we would have a pipeline id that doesn't change if I run another pipeline with the same configuration later, right? Or do we want to rely on the memory address to see how long a certain pipeline is running without interruption?

The docstring for dynamic_system_specs should be updated. Otherwise looks good to me. Let's talk about pipeline_id and then I'll be ready to approve.
# Data sent to Posthog
return "Pipeline run (2.x)", {
    "pipeline_id": str(id(pipeline)),
Ideally this would be the same pipeline_id if a pipeline with the exact same configuration is run again later, or even on a different machine. Can't we create a hash based on pipeline_description? If it's out of scope for this PR, let's at least create an issue.
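For illustration, a minimal sketch of such a configuration-based id, assuming the pipeline description can be serialized to a plain dictionary (the function name, the serialization step and the truncation length are assumptions, not code from this PR):

import hashlib
import json

def stable_pipeline_id(pipeline_description: dict) -> str:
    # Hash a canonical JSON serialization of the pipeline description, so the
    # same configuration yields the same id on any machine.
    canonical = json.dumps(pipeline_description, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]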
Happy to create the new issue 😊
Please include the first, second and third run in the exponential downsampling, too, @ZanSara. Reason: we want to see whether users ask at least 3 queries on a RAG pipeline, for measuring Haystack 2.0 adoption. @julian-risch you wrote:
Currently, we only track when pipelines are initiated. Tracking how often pipelines are run adds information. The information on how often people use a pipeline is very important to us. Of course, we need to adjust how WAU is computed, but it is not impossible to compute WAU in the same way as before.
@Timoeller Sure, nothing against counting pipeline runs. 👍 Some more context: my comment was about the idea of "Pipeline will send their first, 10th, 100th, 1000th run event, and then a run event every 10.000." in an earlier version of the PR. With that approach we would easily have users who are active with dozens of queries per week but who wouldn't send an event every week, and thus wouldn't be counted as WAU in every week, just in weeks when they pass a 10,000 mark. Sara and I discussed it, and the PR and its description were updated afterward. Having at least one event per day is much better for counting WAU. 🙂
After a quick sync with @Timoeller, I will also add tracking for the 2nd and 3rd event of a Pipeline. I don't expect a lot of useful data from them, but I have no objections against trying it out for a few versions and re-assessing after we have some hard numbers 🙂
Looks good to me! 👍 Thanks for adjusting the downsampling. We can iterate on it later if we notice any need for changes. And great to see that you synced with Timo and integrated the feedback. 👍
I'm not super happy with this PR: the logic is quite complex already, and I can imagine we will keep iterating. Since we ended up tracking timestamps, I wonder if we should give the rate limiter I mentioned before a shot: we send all the events and start buffering once the sends per minute go above a certain threshold. This could be implemented in the Telemetry class without adding any complexity to the sending functions.
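A minimal sketch of that idea, assuming a telemetry object with a single sending method (the class name, the threshold and the buffer handling are assumptions, not code from this PR):

import time
from collections import deque

class ThrottledTelemetry:
    """Hypothetical: send events immediately until a per-minute threshold
    is hit, then buffer them to be flushed later."""

    def __init__(self, max_events_per_minute: int = 10):
        self.max_events_per_minute = max_events_per_minute
        self._sent_timestamps = deque()  # send times within the last minute
        self._buffer = []                # events held back while over the limit

    def send_event(self, event_name: str, properties: dict):
        now = time.time()
        # Drop send times older than one minute from the sliding window.
        while self._sent_timestamps and now - self._sent_timestamps[0] > 60:
            self._sent_timestamps.popleft()
        if len(self._sent_timestamps) < self.max_events_per_minute:
            self._sent_timestamps.append(now)
            self._do_send(event_name, properties)  # the actual PostHog call
        else:
            self._buffer.append((event_name, properties))  # flush later

    def _do_send(self, event_name: str, properties: dict):
        ...  # placeholder for the real sending logic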
Also, I'm not sure about server-side aggregation: we always send pipeline._telemetry_runs; did you check if PostHog lets you compute rollups using deltas? Otherwise we would keep adding the total runs to the total at every event send.
Yes, see #5929 (comment): with their SQL support you can do basically anything, and I have made 2 working insights (see the other comment for the links) where we either (1) sum the runs values of all events, or (2) take only the latest runs value per person and pipeline.
This PR currently assumes we're doing 2. The catch is that those tables cannot (yet) be converted into nice graphs by PostHog for some reason, but the data is there and it's aggregated properly.
What threshold do you have in mind? Would 10 do? And what's the interval?
I've adapted the strategy again: now it always sends the first event and then at most one more every minute. As discussed with @masci, this is good enough to track pipeline usage. To track user activation (for example, users that made 1, 2 or 3 pipeline runs) we can use another set of events, to be done in another PR. It will be easier to manage them in Posthog this way.
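For illustration, a minimal sketch of that client-side downsampling, assuming the pipeline keeps track of when it last sent an event (the attribute and function names are assumptions, not code from this PR):

import datetime

MIN_INTERVAL = datetime.timedelta(minutes=1)

def should_send_run_event(pipeline) -> bool:
    # Always send the very first run event, then at most one per minute.
    last_sent = getattr(pipeline, "_last_telemetry_sent", None)
    now = datetime.datetime.now()
    if last_sent is None or now - last_sent >= MIN_INTERVAL:
        pipeline._last_telemetry_sent = now
        return True
    return False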
Related Issues
Proposed Changes:
Adds telemetry to Pipeline.run() for Haystack 2.0 pipelines.

This PR introduces Haystack-side downsampling of Pipeline.run() events by always sending the first event and then at most one event every minute.

In this first iteration the data collected about the pipeline is fairly limited: only a list of component names and types. Later on we can define what we want to collect about which component and adapt the telemetry code to collect it (see the related epic). Note that, because of this, right now we would not get ANY information about the document stores used.
The event also contains mostly the same basic system information that was collected in v1, namely some library versions (transformers, torch, pytest, etc.), system specs, OS, and so on.
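As a purely illustrative example (the property keys below are assumptions, not the schema actually sent), the payload could look roughly like this:

event_name = "Pipeline run (2.x)"
properties = {
    "pipeline_id": str(id(pipeline)),  # from the snippet discussed above
    "components": [  # names and types only, no configuration details
        {"name": "retriever", "type": "InMemoryBM25Retriever"},
        {"name": "generator", "type": "OpenAIGenerator"},
    ],
    "transformers_version": "4.x",  # example library/system info
    "torch_version": "2.x",
    "os": "Linux",
}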
The events sent can be found in PostHog under the name Pipeline run (2.x): https://eu.posthog.com/events#q=%7B%22kind%22%3A%22DataTableNode%22%2C%22full%22%3Atrue%2C%22source%22%3A%7B%22kind%22%3A%22EventsQuery%22%2C%22select%22%3A%5B%22*%22%2C%22event%22%2C%22person%22%2C%22coalesce(properties.%24current_url%2C%20properties.%24screen_name)%20--%20Url%20%2F%20Screen%22%2C%22properties.%24lib%22%2C%22timestamp%22%5D%2C%22orderBy%22%3A%5B%22timestamp%20DESC%22%5D%2C%22after%22%3A%22-24h%22%2C%22event%22%3A%22Pipeline%20run%20(2.x)%22%7D%2C%22propertiesViaUrl%22%3Atrue%2C%22showSavedQueries%22%3Atrue%2C%22showPersistentColumnConfigurator%22%3Atrue%7D

How did you test it?
Manual tests.
Notes for the reviewer
n/a
Checklist
The PR title uses one of the conventional commit types: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.