Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

984: scrub eap_spans_str_attrs #6694

Merged
merged 5 commits into from
Dec 20, 2024
Merged

984: scrub eap_spans_str_attrs #6694

merged 5 commits into from
Dec 20, 2024

Conversation

kylemumma
Copy link
Member

@kylemumma kylemumma commented Dec 19, 2024

this job deletes all attributes from the spans_str_attrs storage if

  1. its name issentry.user and its an ip
  2. or if its name is sentry.user.ip

https://github.com/getsentry/snuba/blob/master/snuba/datasets/configuration/events_analytics_platform/storages/spans_str_attrs.yaml

@kylemumma kylemumma marked this pull request as ready for review December 19, 2024 23:17
@kylemumma kylemumma requested a review from a team as a code owner December 19, 2024 23:17
on_cluster = f"ON CLUSTER '{cluster_name}'" if cluster_name else ""
return f"""ALTER TABLE spans_str_attrs_3_local
{on_cluster}
DELETE WHERE (attr_key = 'sentry.user.ip' OR attr_key = 'sentry.user')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are scrubbing right? 'scrubbed' for user.ip and 'ip:scrubbed` for user

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can delete too, it's file

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it wont let me alter table update bc the columns are in the primary key or something, i have to do a delete

@kylemumma
Copy link
Member Author

  1. I added a real test to make sure it only deletes what we want
  2. I made the changes volo requested

Copy link

codecov bot commented Dec 20, 2024

❌ 2 Tests Failed:

Tests completed Failed Passed Skipped
1202 2 1200 3
View the top 2 failed tests by shortest run time
tests.datasets.test_errors_replacer.TestReplacer::test_process_offset_twice
Stack Traces | 0.297s run time
Traceback (most recent call last):
  File ".../tests/datasets/test_errors_replacer.py", line 370, in test_process_offset_twice
    assert self.replacer.process_message(message) is None
AssertionError: assert (ReplacementMessageMetadata(partition_index=1, offset=42, consumer_group='consumer_group'), UnmergeGroupsReplacement(state_name=<ReplacerState.ERRORS: 'errors'>, timestamp=datetime.datetime(2024, 12, 20, 0, 26, 31, 715551), hashes=['aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'], all_columns=[FlattenedColumn(None, 'project_id', schemas.UInt(64, modifiers=None)), FlattenedColumn(None, 'timestamp', schemas.DateTime(modifiers=None)), FlattenedColumn(None, 'event_id', schemas.UUID(modifiers=None)), FlattenedColumn(None, 'platform', schemas.String(modifiers=None)), FlattenedColumn(None, 'environment', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'release', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'dist', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'ip_address_v4', schemas.IPv4(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'ip_address_v6', schemas.IPv6(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'user', schemas.String(modifiers=None)), FlattenedColumn(None, 'user_id', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'user_name', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'user_email', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'sdk_name', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'sdk_version', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'http_method', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'http_referer', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn('tags', 'key', schemas.Array(schemas.String(modifiers=None), modifiers=None)), FlattenedColumn('tags', 'value', schemas.Array(schemas.String(modifiers=None), modifiers=None)), FlattenedColumn('contexts', 'key', schemas.Array(schemas.String(modifiers=None), modifiers=None)), FlattenedColumn('contexts', 'value', schemas.Array(schemas.String(modifiers=None), modifiers=None)), FlattenedColumn(None, 'transaction_name', schemas.String(modifiers=None)), FlattenedColumn(None, 'span_id', schemas.UInt(64, modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'trace_id', schemas.UUID(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'partition', schemas.UInt(16, modifiers=None)), FlattenedColumn(None, 'offset', schemas.UInt(64, modifiers=None)), FlattenedColumn(None, 'message_timestamp', schemas.DateTime(modifiers=None)), FlattenedColumn(None, 'retention_days', schemas.UInt(16, modifiers=None)), FlattenedColumn(None, 'deleted', schemas.UInt(8, modifiers=None)), FlattenedColumn(None, 'group_id', schemas.UInt(64, modifiers=None)), FlattenedColumn(None, 'primary_hash', schemas.UUID(modifiers=None)), FlattenedColumn(None, 'received', schemas.DateTime(modifiers=None)), FlattenedColumn(None, 'message', schemas.String(modifiers=None)), FlattenedColumn(None, 'title', schemas.String(modifiers=None)), FlattenedColumn(None, 'culprit', schemas.String(modifiers=None)), FlattenedColumn(None, 'level', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'location', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'version', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'type', schemas.String(modifiers=None)), FlattenedColumn('exception_stacks', 'type', schemas.Array(schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn('exception_stacks', 'value', schemas.Array(schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn('exception_stacks', 'mechanism_type', schemas.Array(schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn('exception_stacks', 'mechanism_handled', schemas.Array(schemas.UInt(8, modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn('exception_frames', 'abs_path', schemas.Array(schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn('exception_frames', 'colno', schemas.Array(schemas.UInt(32, modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn('exception_frames', 'filename', schemas.Array(schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn('exception_frames', 'function', schemas.Array(schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn('exception_frames', 'lineno', schemas.Array(schemas.UInt(32, modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn('exception_frames', 'in_app', schemas.Array(schemas.UInt(8, modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn('exception_frames', 'package', schemas.Array(schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn('exception_frames', 'module', schemas.Array(schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn('exception_frames', 'stack_level', schemas.Array(schemas.UInt(16, modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn(None, 'exception_main_thread', schemas.UInt(8, modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'sdk_integrations', schemas.Array(schemas.String(modifiers=None), modifiers=None)), FlattenedColumn('modules', 'name', schemas.Array(schemas.String(modifiers=None), modifiers=None)), FlattenedColumn('modules', 'version', schemas.Array(schemas.String(modifiers=None), modifiers=None)), FlattenedColumn(None, 'trace_sampled', schemas.UInt(8, modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'num_processing_errors', schemas.UInt(64, modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'replay_id', schemas.UUID(modifiers=SchemaModifiers(nullable=True, readonly=False)))], project_id=1, previous_group_id=1, new_group_id=2)) is None
 +  where (ReplacementMessageMetadata(partition_index=1, offset=42, consumer_group='consumer_group'), UnmergeGroupsReplacement(state_name=<ReplacerState.ERRORS: 'errors'>, timestamp=datetime.datetime(2024, 12, 20, 0, 26, 31, 715551), hashes=['aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'], all_columns=[FlattenedColumn(None, 'project_id', schemas.UInt(64, modifiers=None)), FlattenedColumn(None, 'timestamp', schemas.DateTime(modifiers=None)), FlattenedColumn(None, 'event_id', schemas.UUID(modifiers=None)), FlattenedColumn(None, 'platform', schemas.String(modifiers=None)), FlattenedColumn(None, 'environment', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'release', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'dist', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'ip_address_v4', schemas.IPv4(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'ip_address_v6', schemas.IPv6(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'user', schemas.String(modifiers=None)), FlattenedColumn(None, 'user_id', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'user_name', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'user_email', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'sdk_name', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'sdk_version', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'http_method', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'http_referer', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn('tags', 'key', schemas.Array(schemas.String(modifiers=None), modifiers=None)), FlattenedColumn('tags', 'value', schemas.Array(schemas.String(modifiers=None), modifiers=None)), FlattenedColumn('contexts', 'key', schemas.Array(schemas.String(modifiers=None), modifiers=None)), FlattenedColumn('contexts', 'value', schemas.Array(schemas.String(modifiers=None), modifiers=None)), FlattenedColumn(None, 'transaction_name', schemas.String(modifiers=None)), FlattenedColumn(None, 'span_id', schemas.UInt(64, modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'trace_id', schemas.UUID(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'partition', schemas.UInt(16, modifiers=None)), FlattenedColumn(None, 'offset', schemas.UInt(64, modifiers=None)), FlattenedColumn(None, 'message_timestamp', schemas.DateTime(modifiers=None)), FlattenedColumn(None, 'retention_days', schemas.UInt(16, modifiers=None)), FlattenedColumn(None, 'deleted', schemas.UInt(8, modifiers=None)), FlattenedColumn(None, 'group_id', schemas.UInt(64, modifiers=None)), FlattenedColumn(None, 'primary_hash', schemas.UUID(modifiers=None)), FlattenedColumn(None, 'received', schemas.DateTime(modifiers=None)), FlattenedColumn(None, 'message', schemas.String(modifiers=None)), FlattenedColumn(None, 'title', schemas.String(modifiers=None)), FlattenedColumn(None, 'culprit', schemas.String(modifiers=None)), FlattenedColumn(None, 'level', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'location', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'version', schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'type', schemas.String(modifiers=None)), FlattenedColumn('exception_stacks', 'type', schemas.Array(schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn('exception_stacks', 'value', schemas.Array(schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn('exception_stacks', 'mechanism_type', schemas.Array(schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn('exception_stacks', 'mechanism_handled', schemas.Array(schemas.UInt(8, modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn('exception_frames', 'abs_path', schemas.Array(schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn('exception_frames', 'colno', schemas.Array(schemas.UInt(32, modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn('exception_frames', 'filename', schemas.Array(schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn('exception_frames', 'function', schemas.Array(schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn('exception_frames', 'lineno', schemas.Array(schemas.UInt(32, modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn('exception_frames', 'in_app', schemas.Array(schemas.UInt(8, modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn('exception_frames', 'package', schemas.Array(schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn('exception_frames', 'module', schemas.Array(schemas.String(modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn('exception_frames', 'stack_level', schemas.Array(schemas.UInt(16, modifiers=SchemaModifiers(nullable=True, readonly=False)), modifiers=None)), FlattenedColumn(None, 'exception_main_thread', schemas.UInt(8, modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'sdk_integrations', schemas.Array(schemas.String(modifiers=None), modifiers=None)), FlattenedColumn('modules', 'name', schemas.Array(schemas.String(modifiers=None), modifiers=None)), FlattenedColumn('modules', 'version', schemas.Array(schemas.String(modifiers=None), modifiers=None)), FlattenedColumn(None, 'trace_sampled', schemas.UInt(8, modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'num_processing_errors', schemas.UInt(64, modifiers=SchemaModifiers(nullable=True, readonly=False))), FlattenedColumn(None, 'replay_id', schemas.UUID(modifiers=SchemaModifiers(nullable=True, readonly=False)))], project_id=1, previous_group_id=1, new_group_id=2)) = <bound method ReplacerWorker.process_message of <snuba.replacer.ReplacerWorker object at 0x7f4c9355d390>>(Message({Partition(topic=Topic(name='replacements'), index=1): 43}))
 +    where <bound method ReplacerWorker.process_message of <snuba.replacer.ReplacerWorker object at 0x7f4c9355d390>> = <snuba.replacer.ReplacerWorker object at 0x7f4c9355d390>.process_message
 +      where <snuba.replacer.ReplacerWorker object at 0x7f4c9355d390> = <tests.datasets.test_errors_replacer.TestReplacer object at 0x7f4cb875ed50>.replacer
tests.manual_jobs.test_scrub_users_from_eap_spans_str_attrs::test_simple_case
Stack Traces | 0.323s run time
Traceback (most recent call last):
  File ".../tests/manual_jobs/test_scrub_users_from_eap_spans_str_attrs.py", line 157, in test_simple_case
    assert get_attributes("sentry.user").values == [
AssertionError: assert ['email:192.168.0.45', 'ip:1.2.3.4', 'ip:127.0.0.1', 'ip:2.3.4.5', 'ip:3.4.5.6', 'normal-user'] == ['ip:1.2.3.4', 'ip:2.3.4.5', 'ip:3.4.5.6', 'normal-user']
  At index 0 diff: 'email:192.168.0.45' != 'ip:1.2.3.4'
  Left contains 2 more items, first extra item: 'ip:3.4.5.6'
  Full diff:
  - ['ip:1.2.3.4', 'ip:2.3.4.5', 'ip:3.4.5.6', 'normal-user']
  + ['email:192.168.0.45', 'ip:1.2.3.4', 'ip:127.0.0.1', 'ip:2.3.4.5', 'ip:3.4.5.6', 'normal-user']
  ?  ++++++++++++++++++++++              ++++++++++++++++

To view more test analytics, go to the Test Analytics Dashboard
📢 Thoughts on this report? Let us know!

Copy link
Member

@volokluev volokluev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

once you fix test failures, lgtm

@volokluev volokluev enabled auto-merge (squash) December 20, 2024 10:29
@volokluev volokluev merged commit 956b6c2 into master Dec 20, 2024
31 checks passed
@volokluev volokluev deleted the krm/scrubstrattr branch December 20, 2024 10:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants