cdk version 2.33 onwards is getting stuck #22923

Closed
galsasi1989 opened this issue Nov 15, 2022 · 15 comments
Labels
@aws-cdk/aws-s3 Related to Amazon S3 bug This issue is a bug. p2

Comments

@galsasi1989

galsasi1989 commented Nov 15, 2022

Describe the bug

Deploying an S3 bucket with CDK 2.32.1 works fine.
My CDK app is written in TypeScript (Node v16) and runs from Jenkins inside a Docker container.

Jenkins is running CDK CLI version 2.44.0.
When I upgrade the packages in package.json to 2.33.0 or later, the same deployment command gets stuck and the pipeline hangs.

Am I missing something? Are there any breaking changes in 2.33.0? I couldn't find anything useful in the release notes.

Thanks,
Gal

Expected Behavior

Using the latest versions of the CDK packages (aws-cdk in devDependencies and aws-cdk-lib in dependencies) should work, so that I can deploy the S3 bucket.

Current Behavior

With the CDK packages at version 2.32.1 everything works fine and I am able to deploy the S3 bucket.
After upgrading to version 2.33.0 or any later version, cdk synth/diff/deploy hangs.

Reproduction Steps

The Jenkins pipeline runs inside Docker containers.
The Docker server is installed on the Jenkins agent.
The first container in the pipeline is based on Python 3.8. Inside it, another Docker container based on Node.js v16 (Alpine) runs with CDK CLI version 2.44.0 installed.

This is the package.json:

{
    "name": "general",
    "version": "0.1.0",
    "bin": {
        "general": "bin/general.js"
    },
    "scripts": {
        "build": "tsc",
        "watch": "tsc -w",
        "test": "jest",
        "cdk": "cdk"
    },
    "devDependencies": {
        "@types/jest": "^27.5.2",
        "@types/node": "^10.17.27",
        "@types/prettier": "2.6.0",
        "aws-cdk": "2.32.1",
        "jest": "^27.5.1",
        "ts-jest": "^27.1.4",
        "ts-node": "^10.9.1",
        "typescript": "~3.9.7"
    },
    "dependencies": {
        "aws-cdk-lib": "2.32.1",
        "constructs": "^10.0.0",
        "@aws-cdk/aws-glue-alpha": "^2.32.1-alpha.0",
        "source-map-support": "^0.5.21"
    }
}

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.44.0

Framework Version

No response

Node.js Version

16

OS

Ubuntu 18/20

Language

Typescript

Language Version

No response

Other information

No response

@galsasi1989 galsasi1989 added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Nov 15, 2022
@github-actions github-actions bot added the @aws-cdk/aws-s3 Related to Amazon S3 label Nov 15, 2022
@galsasi1989 galsasi1989 changed the title cdk version 2.33 onwards in getting stack cdk version 2.33 onwards in getting stuck Nov 15, 2022
@galsasi1989 galsasi1989 changed the title cdk version 2.33 onwards in getting stuck cdk version 2.33 onwards is getting stuck Nov 15, 2022
@peterwoodworth
Contributor

Can you please put together a repo which we can clone to reproduce this issue? There's a lot of parts and pieces to this, and no changes were made between 2.32 and 2.33 which look suspicious.

@peterwoodworth peterwoodworth added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed needs-triage This issue or PR still needs to be triaged. labels Nov 15, 2022
@galsasi1989
Author

Hi @peterwoodworth

I have just created a project with samples from my code: https://github.com/galsasi1989/cdk-sample-issue
Inside this repository you can find a general directory with the cdk code and a Jenkinsfile

Let me know if you need additional information

Thanks!

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Nov 16, 2022
@peterwoodworth
Contributor

Hey @galsasi1989,

It seems to me like you might've left some stuff out of your reproduction. This repo only creates an S3 bucket, a related custom resource for deleting the objects inside the bucket, and some roles. I'm not familiar with Jenkins or how you might've set up your pipeline, so I'll need a little bit more help here 🙂

@peterwoodworth
Contributor

I'm also curious to know the exact error you're running into. Can you paste the error message you're receiving?

@otaviomacedo otaviomacedo removed their assignment Nov 18, 2022
@galsasi1989
Author

Hi @peterwoodworth
That is indeed the case. I am only talking about running cdk to create/update an S3 bucket.
There is actually no error: the pipeline is just stuck until I stop it manually, and I don't see any change in CloudFormation.

Here is a step-by-step guide on how to set up Jenkins: https://www.jenkins.io/doc/tutorials/tutorial-for-installing-jenkins-on-AWS/
My Jenkins is running on Ubuntu 18/20, if you prefer to run it on the same OS.

At the end, you should have 2 machines - 1 server and 1 agent.
You need to ssh into the agent, make sure the OS is up to date, and install Docker and git (if either is missing).

After Jenkins is up and running, you need to click on Manage Jenkins -> Manage Plugins. Click on 'Available' tab and search for the following plugins:

  1. docker plugin
  2. docker-build-step
  3. timestamper
  4. AWS (for simplicity, you can install all of them)
  5. git
  6. Cleanup workspace plugin

Install the plugins listed above by clicking on 'Install without restart'.

Then you'll need to build 2 Docker images (I will add the Dockerfiles to the repository above):

  1. python based docker image - use the tag: 3.8.6-pipeline
  2. nodejs based docker image - use the tag: 16.17.1-alpine-pipeline

Finally, create a pipeline job in Jenkins that runs the Jenkinsfile from the repository above. The Jenkinsfile contains the cdk commands that I tried to run without success.

@FarrOut

FarrOut commented Nov 20, 2022

Hey @peterwoodworth, if it helps, feel free to take a look at this project. It's still pretty modest, but hopefully it helps with getting the Jenkins master up.

For the initial setup admin password, you can find it in the CloudWatch LogGroup.

@anuprajg

anuprajg commented Nov 20, 2022

Interesting issue, which helped solve my problem: reverting to v2.32.1 worked for me as well. I see the same issue from v2.33 up to and including the latest, 2.51.1.

However, in my case the issue happens when I use Triggers to invoke a Lambda function during CDK deployment. Maybe that helps in identifying the root cause.

Here's my reduced Stack: https://gist.github.com/anuprajg/3925fa431891108c204de72aebc3a39d which is basically a Hello World Lambda function + Trigger to execute it during deployment.

When I comment out line 27, which creates the trigger, cdk bootstrap/deploy works fine. With the trigger, the pipeline hangs at line 26.

Also note that deploying the same stack with cdk 2.51.1 from my local machine (OSX) does go through. So the issue must have something to do with the Jenkins environment plus the cdk changes between v2.32.1 and v2.33 related to Triggers (maybe).

@anuprajg

v2.33 has some fixes related to Custom Resource Provider #17460. Could that cause issues in a Jenkins environment?

@galsasi1989 In your project, if you remove the autoDeleteObjects, does it work with newer cdk versions?

autoDeleteObjects: removalPolicy === RemovalPolicy.DESTROY ? true : false,

@galsasi1989
Author

Hi @anuprajg

Thanks for your reply! When I set autoDeleteObjects to false (hard-coded), it works even with cdk version 2.51.1.
And you're right: behind the scenes, CloudFormation invokes a Lambda function, so what you're saying makes a lot of sense.

Regarding running cdk locally, again, you're right. It works locally (OSX and WSL Ubuntu 20.04). It only gets stuck in Jenkins.
My Jenkins server and agents are running on-premises. Do you have any idea why it might be getting stuck in Jenkins?
The Jenkins server version is 2.361.2.

Thanks!

@rix0rrr
Contributor

rix0rrr commented Nov 22, 2022

@galsasi1989,

You didn't post representative output, so it's hard to say what's wrong. You are saying that cdk synth, cdk diff, and cdk deploy all fail, correct? That means it must not have started a CloudFormation deployment yet, correct?

  • Can you run again with cdk deploy -v and paste the output?
  • Is your Jenkins worker running inside Docker? From your above comments, I think so, right?

@rix0rrr
Contributor

rix0rrr commented Nov 22, 2022

Aha it might be this: #21379

@rix0rrr
Contributor

rix0rrr commented Nov 22, 2022

  • That other thread has a workaround: set the temporary directory to a place inside the working tree. (I assume it will have to be a mounted volume...?)
  • Another workaround is upgrading the Linux kernel to 5.11 (as discussed here, the problem is solved in 5.11: https://lore.kernel.org/stable/[email protected]/)

@rix0rrr rix0rrr closed this as completed Nov 22, 2022
@github-actions
Contributor

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

@galsasi1989
Author

Hi @rix0rrr

Thanks for your help with this issue! It was very helpful after we spent long days or even weeks on this issue.

Can you please give us a high-level description of the communication between cdk and the Linux kernel? What was changed in cdk and how is it related to the kernel version?

In addition, I think it's very important to validate that all system requirements are met when I install my cdk project's dependencies (via pip, npm or other tools), and to throw as clear an exception as possible, so at least we will have a clue next time.
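
As an illustration of the kind of check being asked for here, the following is a rough sketch of a preflight script a project could run itself. It is not something the CDK provides; the function name `kernelAtLeast` is hypothetical, and the 5.11 threshold is taken from the workaround mentioned earlier in this thread. Containers share the host's kernel, so `os.release()` inside Docker reports the host kernel version.

```typescript
import * as os from "node:os";

// Returns true when a release string such as "5.4.0-131-generic"
// is at least the given major.minor version.
function kernelAtLeast(release: string, major: number, minor: number): boolean {
  const match = /^(\d+)\.(\d+)/.exec(release);
  if (!match) {
    return false; // unknown format: treat as not satisfying the check
  }
  const maj = Number(match[1]);
  const min = Number(match[2]);
  return maj > major || (maj === major && min >= minor);
}

// Only meaningful on Linux; other platforms are skipped.
if (os.platform() === "linux" && !kernelAtLeast(os.release(), 5, 11)) {
  console.warn(
    `Linux kernel ${os.release()} is older than 5.11; ` +
      "file copies inside Docker may hang (see the workarounds in this thread)."
  );
}
```

A warning is used rather than an exception because an older kernel only causes the hang in combination with the Docker setup described in this thread.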

@rix0rrr
Contributor

rix0rrr commented Nov 24, 2022

The CDK behavior is as follows:

  • Setting autoDeleteObjects creates a Custom Resource that will clear the bucket on stack deletion.
  • The CDK copies files when it needs to generate a code bundle for the Custom Resource provider. This code bundle consists of your code plus an index file we add for you.
  • After these source files are generated, the files are then copied into the cdk.out directory as part of asset staging. This is the same for all assets. The directory these files are copied into depends on the hash of all source files going into it, so the source bundle needs to be complete before this step can start.
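
To illustrate why the bundle has to be complete before staging can start, here is a simplified sketch of a content-addressed staging directory. This is not the CDK's actual fingerprinting code: the `fingerprint` and `stagingDirFor` helpers, the SHA-256 choice, and the file names are assumptions made for illustration only.

```typescript
import * as crypto from "node:crypto";
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hash every file (relative path + contents) under `dir`, in sorted order,
// so the result depends only on what the bundle contains.
function fingerprint(dir: string): string {
  const hash = crypto.createHash("sha256");
  const walk = (current: string): void => {
    const entries = fs
      .readdirSync(current, { withFileTypes: true })
      .sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
    for (const entry of entries) {
      const full = path.join(current, entry.name);
      if (entry.isDirectory()) {
        walk(full);
      } else {
        hash.update(path.relative(dir, full));
        hash.update(fs.readFileSync(full));
      }
    }
  };
  walk(dir);
  return hash.digest("hex");
}

// The staging directory name is derived from the bundle's hash.
function stagingDirFor(bundleDir: string, outDir: string): string {
  return path.join(outDir, `asset.${fingerprint(bundleDir)}`);
}

// Demo: adding the generated index file changes the target directory.
const bundleDir = fs.mkdtempSync(path.join(os.tmpdir(), "bundle-"));
fs.writeFileSync(path.join(bundleDir, "handler.js"), "exports.handler = async () => {};\n");
const partial = stagingDirFor(bundleDir, "cdk.out");
fs.writeFileSync(path.join(bundleDir, "index.js"), "module.exports = require('./handler');\n");
const complete = stagingDirFor(bundleDir, "cdk.out");
console.log(partial === complete ? "same directory" : "directory changed once the bundle was complete");
```

Because the target name changes whenever a file is added or modified, the copy into cdk.out cannot begin until the first step has finished writing the bundle.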

The change was:

  • We used to do the first step, copying of source files, inside the node_modules directory. This was actually incorrect, as the node_modules directory should be considered a read-only repository of library code. So we changed the code generation to be moved to the system's temporary directory.
  • From Docker's point of view, in the old situation the file used to be created on a volume mount, but in the new situation is now created in a directory that's fully inside the container's overlayfs file system.
  • (This is why the workaround is moving the $TMP dir back to a location inside a Docker volume mount)

The problem was:

  • Because of a combination of Docker and kernel behavior, the second copy operation would appear to copy 0 bytes.
  • The NodeJS copyFile function keeps on retrying the call to copy more and more bytes over, getting 0 every time, and waiting until the copy is complete. This never finishes, and so the build appears to hang.
  • In later kernel versions, this bug has been fixed so the copy operation returns an actual number of bytes instead of 0, allowing the copy to succeed.
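
The hang described above can be modeled with a few lines of code. This is a simplified model only, not Node's or libuv's actual implementation: the `ChunkCopier` type, the `copyWithStallGuard` function, and the `maxStalls` limit are all hypothetical names introduced for this sketch. Without the stall guard, a copier that always reports 0 bytes keeps the loop spinning forever, which matches the symptom in this issue.

```typescript
// A chunk copier reports how many bytes it managed to copy in one call.
type ChunkCopier = (offset: number, remaining: number) => number;

// Copy `total` bytes by calling `copyChunk` repeatedly. The guard gives up
// after `maxStalls` consecutive calls that make no progress.
function copyWithStallGuard(total: number, copyChunk: ChunkCopier, maxStalls = 3): number {
  let copied = 0;
  let stalls = 0;
  while (copied < total) {
    const n = copyChunk(copied, total - copied);
    if (n === 0) {
      stalls += 1;
      if (stalls >= maxStalls) {
        throw new Error(`copy stalled at ${copied}/${total} bytes`);
      }
      continue; // an unguarded loop would retry here indefinitely
    }
    stalls = 0;
    copied += n;
  }
  return copied;
}

// A well-behaved copier moves up to 4096 bytes per call.
const healthy: ChunkCopier = (_offset, remaining) => Math.min(4096, remaining);
// Models the buggy kernel/Docker combination: every call reports 0 bytes.
const broken: ChunkCopier = () => 0;

console.log(copyWithStallGuard(10000, healthy));
try {
  copyWithStallGuard(10000, broken);
} catch (err) {
  console.log((err as Error).message);
}
```

The guard turns a silent hang into an error that names the stalled byte offset.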

Full props to @nburtsev for figuring this out. I'm not sure I myself would have been able to put all of this together.


In summary:

The CDK does not directly communicate with the kernel--we just perform filesystem copies. Bugs in the interaction of other pieces of software cause the file copy to loop endlessly if the right combination of circumstances is hit.
