Determine how to handle various User Agent situations #44

In linehaul, there are 3 states that any particular event can be in:

1. The user agent is parseable for data.
2. The user agent is unknown.
3. The user agent is known, but it's not parseable for data.
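
As a rough illustration of the three states above, here is a minimal sketch. The names `UAState` and `classify`, and the two example patterns, are hypothetical and chosen only for this example; this is not Linehaul's actual parser.

```python
import enum
import re


class UAState(enum.Enum):
    PARSED = "parsed"    # (1) data could be extracted from the user agent
    UNKNOWN = "unknown"  # (2) nothing recognizes the user agent
    IGNORED = "ignored"  # (3) recognized, but there is no data to extract


# Hypothetical patterns, for illustration only.
PARSEABLE_PATTERNS = [re.compile(r"^pip/(?P<version>\S+)")]
IGNORED_PATTERNS = [re.compile(r"^Mozilla/")]


def classify(user_agent):
    """Return (state, data) for a raw user agent string."""
    for pattern in PARSEABLE_PATTERNS:
        match = pattern.search(user_agent)
        if match:
            return UAState.PARSED, match.groupdict()
    if any(pattern.search(user_agent) for pattern in IGNORED_PATTERNS):
        return UAState.IGNORED, None
    return UAState.UNKNOWN, None
```

The open question in this issue is what to do with the `UNKNOWN` and `IGNORED` results: record a row with the user agent fields missing, or drop the event.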

For (1) the correct outcome is obvious: we have data, so we want to save it in BigQuery.

For (2), we currently record a download, but with all of the data that typically comes from the user agent missing. The BigQuery table thus more accurately reflects *all* of the downloads, but projects querying the data need to be more careful about how they query it: it's easy to write something like ``py3_downloads / total_downloads``, which would incorrectly give a smaller percentage, since it effectively counts unknown downloads as py2. Prior to Linehaul v3, the behavior was to throw away this event and *not* log anything for it.
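
To make the pitfall concrete, here is a small sketch with made-up counts (the numbers and the dictionary keys are assumptions for illustration, not real download data):

```python
# Made-up counts, for illustration only.
downloads = {"py3": 600, "py2": 300, "unknown": 100}

total = sum(downloads.values())
known = total - downloads["unknown"]

# Naive share: unknown rows end up in the denominator.
naive_py3_share = downloads["py3"] / total
# Share restricted to rows where the Python version is actually known.
known_py3_share = downloads["py3"] / known

print(f"{naive_py3_share:.1%} vs {known_py3_share:.1%}")  # 60.0% vs 66.7%
```

With unknown rows in the table, the naive ratio understates the py3 share unless the query explicitly excludes them.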

For (3), Linehaul v3 and earlier throw away the event (we implement this as "ignored" user agents). The list of these can be found at:

https://github.com/pypa/linehaul/blob/420354cf789b064f0d38ce02573f6af51aa0306a/linehaul/ua/parser.py#L260-L294

So ultimately the question is: do these behaviors make sense? That boils down to whether we want BigQuery to most accurately reflect *every* download, or whether we want to filter the data down to what is more usable for specific questions (at the cost, of course, of being less usable for other questions that are likely to be more of an edge case).

Thoughts?
