Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(circleci-plugin): incremental data collection #7986

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

Nickcw6
Copy link
Contributor

@Nickcw6 Nickcw6 commented Aug 29, 2024

Summary

What does this PR do?

Incremental data collection for the CircleCI plugin.

Haven't swapped the extractor or converter to be incremental as this is enough for now to make a substantial difference in pipeline duration, but something to consider.

Does this close any open issues?

N/A

Screenshots

Full data collection:
Screenshot 2024-08-29 at 18 16 16

Incremental data collection ran afterwards:
Screenshot 2024-08-29 at 18 28 47

Other Information

Had to make some adjustments to the existing NewStatefulApiCollectorForFinalizableEntity function, namely:

  • Original implementation assumed all results on the page were invalid (ie. created before createdAfter) or valid (created after createdAfter), extended this to check each item in turn if required to prevent duplicate records.
  • For CircleCI jobs API there is no nice created_at property - we have to use started_at which may be null, therefore also added checks for time.IsZero(). We still collect the data if this is zero.

There are some annoying design choices around the jobs collection worth noting, we use the workflow/{id}/jobs endpoint to collect all jobs for a workflow on a single endpoint - as mentioned above this lacks a definitive created_at property, so we use started_at which may be null. Couple of ways round this:

  1. We swap to use the job/{id} endpoint.
    • This returns different fields so would be more work to change this for the extractor etc
    • This returns a 404 for some statuses so we'd never collect this data
    • Would drastically increase the duration of what is an already slow collection (e.g. instead of making 1 request for 5 jobs, we'd need to make 5 separate requests).
  2. If started_at is null for a selection of statuses we care about (e.g. running), then we give this a temp timestamp of time.Now()
    • Records would be recollected & corrected on the next run.
    • We're interfering with raw data in the DB which is dirty & could confuse users.
  3. We check for IsZero() and collect these records anyway
    • Data collected exactly as per the API
    • Some of the jobs we collect have minimal value (eg. we'd collect blocked or unauthorized jobs)
    • Potentially unintended side-effects for other plugins
  4. We use the created_at of the workflow (created_date on _tool_circleci_workflows) and extend _tool_circleci_jobs with a workflow_created_date column
    • As we collect all jobs for a workflow simultaneously we would only need to consider this timestamp
    • Data collected as per the API

I couldn't get 4) to work nicely in the GetCreated function, so have opted for 3), but open to other ideas/suggestions on how this could be implemented.

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. component/plugins This issue or PR relates to plugins devops Something about CI/CD (devops) pr-type/feature-development This PR is to develop a new feature labels Aug 29, 2024
@klesh
Copy link
Contributor

klesh commented Sep 5, 2024

Hi, @Nickcw6 . Thanks for your contribution. Do you think it is ready for review?

@Nickcw6
Copy link
Contributor Author

Nickcw6 commented Sep 5, 2024

Hey @klesh - yes it is

@klesh
Copy link
Contributor

klesh commented Sep 6, 2024

Great work! I’ve noticed that you’re also collecting unfinished jobs. However, let’s revisit this based on some past discussions with other PPMCs and Committers. We concluded that perhaps we don’t need to collect them at all.

Most jobs tend to finish relatively quickly compared to our data collection frequency. Plus, the cherry on top: we don’t analyze pending jobs at all! 😂 So, for simplicity’s sake, I propose that we ignore pending jobs during collection.

Keep up the good work, and let’s streamline our process!

@Nickcw6 Nickcw6 force-pushed the feat/circleci-plugin-incremental branch from bca1883 to 93f1170 Compare September 18, 2024 15:57
@Nickcw6
Copy link
Contributor Author

Nickcw6 commented Sep 18, 2024

@klesh Thanks! I have pushed a small refactor commit.

Re: unfinished jobs - not sure I necessarily agree in this case. I'm not super familiar with other CI platforms but jobs in CircleCI are often statistically significant & not always that quick; in our case e.g. we have some relatively long running build jobs, and some of our deployment matchers are configured to look for a specific deploy-to-prod job. We definitely want to ensure these are collected at some point.

To ensure we still collect these, whilst not collecting unfinished jobs, we'd need to:

  • Add the logic to exclude any unfinished jobs
  • Check for unfinished workflows instead of unfinished jobs

100% agree we should be as aligned as possible but IMO this adds more complexity than the current approach of just collecting everything & recollecting any unfinished jobs... Unless I am missing something - open to suggestions :)

@klesh
Copy link
Contributor

klesh commented Sep 24, 2024

If I understand your point correctly:

  1. Workflows might take a long time to complete, while the jobs within them finish relatively quickly.
  2. You’re looking to collect data from those completed jobs within the still-running workflows for analysis.

Could you clarify a couple of points for me?

  1. How long do these workflows typically take to finish, or are there cases where they never finish?
  2. What type of analysis are you aiming to perform that requires data from the completed jobs within an ongoing workflow?

@Nickcw6
Copy link
Contributor Author

Nickcw6 commented Oct 9, 2024

@klesh CircleCI is functionally layered as such (where the item to the left contains items to the right):

Pipelines -> Workflows -> Jobs -> Commands

Workflows and jobs (and to a lesser extent pipelines) are what are statistically significant. Commands are typically small blocks of code that make up jobs - there is no way to collect these and little value in doing so anyway (not sure if these are maybe referred to as jobs in other CI tools which is contributing to a misunderstanding here?)

For some context & a more concrete example: a pipeline is created upon merge to main. Within this pipeline we often have a generic workflow called deploy-to-env. Within this will be several jobs required for the deployment, one of which could be called update-prod-ecs.

As it is this update-prod-ecs job that actually indicates the production deployment has occurred, we need to ensure it is collected and should ideally try to collect its finalised state as soon as possible. We want to collect the finished deploy-to-env workflow too, as this details the entire deployment flow.

If a workflow is unfinished, then at least some of the jobs it contains will also be unfinished. With the current approach we'd mark both as such and recollect both next time. If we don't collect the unfinished jobs, then we'd need a mechanism to recollect these jobs based on the workflow being unfinished, which in my mind is more complicated than the current implementation.

In terms of your questions:

  1. Workflow timings can vary depending on what is occurring - a build workflow may take up to 20 minutes, whereas a deployment pipeline may be ~2 minutes.
  2. This is more a case of the statically relevant data (ie. a successful deployment has occurred) being available and so we should collect it so the data is as up to date as possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/plugins This issue or PR relates to plugins devops Something about CI/CD (devops) pr-type/feature-development This PR is to develop a new feature size:L This PR changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants