
Centralize what flags as "malicious" #95

Closed
Robin5605 opened this issue Jul 11, 2023 · 9 comments · May be fixed by #132

@Robin5605 (Contributor)

The way the current system works is that the API returns all packages that have been scanned within the constraints given in the request (it quite literally just dumps the SQLAlchemy query result). This means the consumer (the bot, in this case) has to filter through the response for the packages it wants to display (here, packages with a score greater than or equal to 5).

It has been expressed numerous times that what constitutes "malicious" should live in a centralized location (such as this API). There are a few ways of going about this, and I'd like to get ideas on the table in this issue. A basic solution we could start with is a field in constants.py that we can tweak (though worth noting we would have to redeploy to change it). The API response would then return a list of packages scanned, and a list of malicious packages.
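The constants.py starting point could look roughly like the sketch below. The `MALICIOUS_THRESHOLD` name, the `PackageScan` shape, and the response fields are all hypothetical, not the project's actual code:

```python
# A minimal sketch of the "threshold in constants.py" idea; the names
# MALICIOUS_THRESHOLD, PackageScan, and the response keys are hypothetical.
from dataclasses import dataclass

MALICIOUS_THRESHOLD = 5  # tweakable, but changing it requires a redeploy


@dataclass
class PackageScan:
    name: str
    version: str
    score: int


def build_response(scans: list[PackageScan]) -> dict:
    """Return both the full scan list and the subset flagged malicious."""
    malicious = [s for s in scans if s.score >= MALICIOUS_THRESHOLD]
    return {
        "all_scans": [s.name for s in scans],
        "malicious": [s.name for s in malicious],
    }
```

With this shape the bot can display the `malicious` list directly instead of re-filtering by score client-side.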

We can also discuss having the API itself dispatch a webhook to the appropriate channels instead of having the bot poll the API every 60 seconds. I'm leaning more towards this approach.

@import-pandas-as-numpy (Member) commented Jul 11, 2023

This seems like a desirable end goal, albeit I'm not keen on redeploying just to fix the weighting.
I think ultimately the thing we care most about is malicious packages; having a way to query this (and a way to query changes to it) can help inform our actual detection metrics.

I pitched the idea of an additional table to track malicious packages (to the outrage of everyone), but I do think that segregating detections from the global package list can help us aggregate data better about what exactly we're detecting and why. This will become a lot more relevant when we introduce additional detection schemas such as the AST idea, where we'll want to know what context something was detected in, since both detection systems will be using YARA.

Do I think it's a perfect idea? Nah. But if we look at how we're trying to do the BigQuery dataset polling, where the bot simply revolves every 'x' seconds and moves the query window to cover the latest notifications, it seems like we might be able to apply that here. To clarify, because I think it's kind of confusing: being able to query our table every minute, or providing a callback for the cronjob to run that query as well, might be useful. (i.e. when the cronjob runs and adds jobs to the package queue, it could also query the current state of the database using the last time it was run and now, and report all the detections via a webhook.)

This simplifies the model at least in my head.
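The cronjob-callback idea above could be sketched as follows. The detection row shape, the query callable, and the webhook sender are all hypothetical stand-ins:

```python
# A rough sketch of the cronjob callback: each run reports every detection
# recorded between the previous run and now, then persists "now" as the new
# watermark. fetch_detections_between and send are hypothetical stand-ins.
import datetime
from typing import Callable


def report_new_detections(
    fetch_detections_between: Callable,  # (start, end) -> list of (name, score)
    send: Callable[[str], None],         # webhook sender
    last_run: datetime.datetime,
    now: datetime.datetime,
) -> datetime.datetime:
    for name, score in fetch_detections_between(last_run, now):
        send(f"{name} flagged with score {score}")
    return now  # store this as the next run's starting point
```

The returned timestamp plays the same role as the moving BigQuery query window: each run only covers what happened since the previous one.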

@Robin5605 (Contributor Author)

Here's another idea that was proposed:
We have an endpoint that, when hit, will send an embed webhook to some configurable webhook URL with "all packages scanned in the last 60 seconds". We could then configure a Kubernetes cron job to run at some configurable interval and hit that endpoint.

As for malicious packages, we could simply dispatch those as they come in from clients (so it'd be real-time).
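The digest half of that endpoint could be sketched like this; the pair layout and message format are hypothetical:

```python
# A sketch of the cron-hit endpoint's handler: gather everything scanned
# within the last 60 seconds and build one digest message to post to the
# configured webhook URL. All names here are hypothetical.
import datetime

REPORT_WINDOW = datetime.timedelta(seconds=60)


def scan_digest(scans, now: datetime.datetime) -> str:
    """scans: iterable of (package_name, finished_at) pairs."""
    cutoff = now - REPORT_WINDOW
    recent = sorted(name for name, finished_at in scans if finished_at >= cutoff)
    return f"{len(recent)} package(s) scanned in the last 60s: {', '.join(recent)}"
```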

@import-pandas-as-numpy (Member)

> Here's another idea that was proposed: We have an endpoint that, when hit, will send an embed webhook to some configurable webhook URL with "all packages scanned in the last 60 seconds". We could then configure a Kubernetes cron job to run at some configurable interval and hit that endpoint.
>
> As for malicious packages, we could simply dispatch those as they come in from clients (so it'd be real-time).

This doesn't make sense to me (dispatching from the clients), as we'd lose the premise of this: the score threshold. Unless we're pushing it down to the clients themselves through get job. Which, ehhhh... I mean, I guess?

@Robin5605 (Contributor Author) commented Jul 11, 2023

> This doesn't make sense to me (dispatching from the clients), as we'd lose the premise of this: the score threshold.

The way this would work is that clients would send their results up to the API as usual (including the score, the rules matched, etc.). If the submitted score exceeds some threshold we've set server-side, we'd trigger a webhook (from the server). This requires no change in client behaviour.
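A minimal sketch of that server-side flow, assuming a hypothetical result dict shape and webhook sender (not the project's real API):

```python
# Sketch of server-side dispatch: clients submit results unchanged, and the
# API itself fires the webhook when the score crosses the threshold.
# The result dict shape and send_webhook callable are assumptions.
MALICIOUS_THRESHOLD = 5


def handle_client_result(result: dict, send_webhook) -> bool:
    """Called when a client submits a scan result; returns True if reported."""
    if result["score"] >= MALICIOUS_THRESHOLD:
        rules = ", ".join(result["rules_matched"])
        send_webhook(f"{result['name']} scored {result['score']} (rules: {rules})")
        return True
    return False
```

Because the comparison happens in the API, the threshold stays centralized and clients never need to know it exists.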

@Robin5605 (Contributor Author)

> albeit I'm not keen on redeploying just to fix the weighting.

As for this, we could probably just save it in the database and have an endpoint to tweak it. Perhaps a function in the bot could hit this endpoint so we can tweak the weights without having to make an HTTP request ourselves.
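The database-backed threshold could be sketched as a small accessor pair that the endpoint (and a bot command wrapping it) would call; the single-row settings store here is a hypothetical stand-in:

```python
# Sketch of keeping the threshold in the database rather than constants.py,
# so it can be tweaked without a redeploy. SettingsStore is a hypothetical
# stand-in for a one-row settings table.
class SettingsStore:
    """Stand-in for a one-row settings table in the database."""

    def __init__(self, threshold: int = 5):
        self._threshold = threshold

    def get_threshold(self) -> int:
        return self._threshold  # a real store would SELECT the settings row

    def set_threshold(self, value: int) -> None:
        if value < 0:
            raise ValueError("threshold must be non-negative")
        self._threshold = value  # a real store would UPDATE the settings row
```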

@import-pandas-as-numpy (Member)

Both sound reasonable to me, thanks for clarifying. You're cleared hot for implementation unless anyone has some other nits.

Robin5605 self-assigned this Jul 12, 2023
@Robin5605 (Contributor Author)

On further thought, this might be more difficult than anticipated to do within the API 🤔
That is, as long as we keep the interaction-based "reporting" functionality.
We'd need to set up interactions over HTTP with Discord, then create endpoints to handle all of that. It might be easier to just keep using the bot, honestly. I'll leave this up for discussion for now in case someone has any better ideas.

@Robin5605 (Contributor Author)

I propose a new endpoint, perhaps something along the lines of GET /scans, that will return a list of all packages scanned and a list of packages that were malicious.
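A possible response shape for that endpoint, with hypothetical field names:

```python
# Hypothetical response builder for the proposed GET /scans endpoint: both
# the full scan list and the malicious subset, so the bot no longer needs
# to filter by score client-side. Field names are assumptions.
def get_scans(scans: list[dict], threshold: int = 5) -> dict:
    """scans: rows with at least 'name' and 'score' keys."""
    return {
        "all_scans": [s["name"] for s in scans],
        "malicious": [s["name"] for s in scans if s["score"] >= threshold],
    }
```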

@Robin5605 (Contributor Author)

Going to close this as superseded by #260.
