This repository has been archived by the owner on Dec 1, 2021. It is now read-only.
In linehaul, there are 3 states that any particular event can be in:
1. The user agent is parseable for data.
2. The user agent is unknown.
3. The user agent is known, but it's not parseable for data.
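As a rough illustration of the three states above, here is a minimal sketch of a classifier. The patterns, the `State` enum, and the `classify` function are all hypothetical names made up for this example; they are not Linehaul's actual parser or rules.

```python
import re
from enum import Enum
from typing import Dict, Optional, Tuple


class State(Enum):
    PARSED = "parsed"    # (1) user agent is parseable for data
    UNKNOWN = "unknown"  # (2) user agent is not recognized at all
    IGNORED = "ignored"  # (3) user agent is known, but carries no usable data


# Hypothetical patterns, for illustration only; not Linehaul's real rules.
PARSERS = {
    "pip": re.compile(r"^pip/(?P<version>\S+)"),
}
IGNORED = [re.compile(r"^example-health-check/")]


def classify(user_agent: str) -> Tuple[State, Optional[Dict[str, str]]]:
    """Return the state of an event plus any data parsed from its user agent."""
    for installer, pattern in PARSERS.items():
        match = pattern.match(user_agent)
        if match:
            data = {"installer": installer, "version": match.group("version")}
            return State.PARSED, data
    if any(pattern.match(user_agent) for pattern in IGNORED):
        return State.IGNORED, None
    return State.UNKNOWN, None
```

The question in this issue is then what to do with each return value: save it, save a row with the data missing, or drop the event.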
For (1) the correct outcome is obvious: we have data, so we want to save it in BigQuery.
For (2), what we currently do is record a download, but with all of the data that typically comes from the user agent missing. Thus the BigQuery table more accurately reflects all of the downloads, but projects querying the data need to be more careful about how they query it (it's easy to do something like `py3_downloads / total_downloads`, however that would incorrectly give a smaller percentage, since it would count unknown as py2). Prior to Linehaul v3, the behavior was to throw away this event and not log anything for it.
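To make the pitfall in (2) concrete, here is a small sketch with made-up rows. The `python_major` field and both helper functions are assumptions for this example, not the real BigQuery schema; `None` stands for a download whose user agent was unknown.

```python
from typing import Dict, List, Optional

# Made-up sample: two py3 downloads, one py2 download, one unknown user agent.
rows: List[Dict[str, Optional[int]]] = [
    {"python_major": 3},
    {"python_major": 3},
    {"python_major": 2},
    {"python_major": None},
]


def py3_share_naive(rows: List[Dict[str, Optional[int]]]) -> float:
    """py3_downloads / total_downloads: unknown rows silently count against py3."""
    py3 = sum(1 for row in rows if row["python_major"] == 3)
    return py3 / len(rows)


def py3_share_known(rows: List[Dict[str, Optional[int]]]) -> Optional[float]:
    """Same ratio, but only over rows where the Python version is known."""
    known = [row for row in rows if row["python_major"] is not None]
    if not known:
        return None  # nothing to compare against
    py3 = sum(1 for row in known if row["python_major"] == 3)
    return py3 / len(known)
```

With these four rows the naive ratio is 2/4, while the ratio over known rows is 2/3, which is the gap a careless query would introduce.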
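For (3), Linehaul v3 and earlier throw away the event (we implement this as "ignored" user agents). The list of these can be found at:
https://github.com/pypa/linehaul/blob/420354cf789b064f0d38ce02573f6af51aa0306a/linehaul/ua/parser.py#L260-L294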
So ultimately the question is: do these behaviors make sense? That boils down to whether we want BigQuery to most accurately reflect every download, or whether we want to filter the data down to what is more usable for specific questions (but of course, less usable for other questions that are likely to be more of an edge case).
Thoughts?
A related question is the categorization of some of the user agents we do have. Things like:
Browser
Should these be only real browsers? What about programmatic "browsers" like requests, curl, etc.?
OS
For Homebrew specifically, we attempt to parse information about the target OS out of the user agent, and set the installer to Homebrew. Does this make sense? Should we try to expand this to other OSes?
Various ignored user agents
Do these categories make sense? Are there other categories where we want to bake the long tail of downloaders into a generic name, or should we always try to include relevant information from each user agent to give more specificity (e.g. instead of "OS", break it down to Homebrew, OpenBSD, FreeBSD portsnap, etc.)?
I would think it would make sense to give people access to this information in some way. In addition, it would likely help if it were possible to provide an additional layer of categorization and metadata, such as broad OS categories, with further detail that might include versions. The ability to distinguish known CI agents from others would also provide some interesting insights. I have no idea if this is possible with BigQuery.
So the way we would expose stuff is to simply include it in the data we send to BigQuery. You can think of BigQuery as a traditional database for the purposes of this question. It's a table with rows and columns, and this question ultimately has two parts:
1. What columns do we want to exist in that table, and how do we map the user agents to those columns?
   - More columns make it harder to query the database (since the data you want may exist in different columns for different user agents).
2. Do we want to skip adding rows to the database for any user agents, or do we want to add a row for every download?
   - Logging something for every download makes the data harder to query for common queries (e.g. for "py2 vs py3", you have to filter out unknown rows), but skipping some events makes some questions impossible to query for.
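The column trade-off in the first part can be sketched like this. The column names (`installer_name`, `python_version`, `distro_name`, `system_name`) and both helpers are hypothetical, chosen only to show the shape of the problem; they are not a proposal for the real table.

```python
from typing import Dict, Optional

# Hypothetical fixed set of columns; every row has all of them.
COLUMNS = ("installer_name", "python_version", "distro_name", "system_name")


def to_row(details: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Flatten parsed user agent details into the fixed columns (missing -> None)."""
    return {column: details.get(column) for column in COLUMNS}


def platform_of(row: Dict[str, Optional[str]]) -> Optional[str]:
    """Answering "which platform?" means checking every column that might hold it."""
    for column in ("distro_name", "system_name"):
        if row[column] is not None:
            return row[column]
    return None


rows = [
    to_row({"installer_name": "pip", "python_version": "3.6.4", "distro_name": "Ubuntu"}),
    to_row({"installer_name": "Homebrew", "system_name": "Darwin"}),
    to_row({}),  # an unknown user agent, logged as a row with every column empty
]
platforms = [platform_of(row) for row in rows]
```

Each extra column like `system_name` adds another place a query has to look, and the empty last row is what "add a row for every download" looks like for an unknown user agent.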