
silently failing to download and stage source release file #75

Open
eedwards-sk opened this issue Aug 17, 2018 · 23 comments

eedwards-sk commented Aug 17, 2018

with concourse 4.0.0:

I have a pipeline which grabs a github release archive:

- name: homebrew
  type: github-release
  check_every: 12h
  source:
    owner: Homebrew
    repository: brew

and

    - get: homebrew
      params:
        include_source_tarball: true

The pipeline was working fine. Then I put my computer to sleep with the stack 'stopped', resumed my PC, and woke the stack back up. The pipeline ran again as expected and showed a valid version for the homebrew resource.

However, when it gets to the step in the pipeline where it uses the file, there's no .tar.gz file to be found!

+ cd /opt/concourse/local/worker/volumes/live/7df5d723-4cb4-4239-7b73-d3c6857ebe6c/volume/homebrew
+ ls -la
total 0
drwxr-xr-x  2 eedwards  staff   64 Aug 16 10:41 .
drwxr-xr-x  6 root      staff  192 Aug 17 09:40 ..

So for whatever reason the check container is reporting success, and concourse is acting like it pulled and staged the resource correctly, but the binary file is missing!
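
For comparison, a healthy get of this resource would be expected to contain the metadata files plus the tarball; a rough sketch (assuming current versions of the resource write the source tarball as source.tar.gz when include_source_tarball is true):

body
commit_sha
source.tar.gz
tag
version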

Also, a fly check-resource succeeds, but the file is still missing!
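
(For reference, triggering the check by hand looks something like this; the target and pipeline names are placeholders:

fly -t my-target check-resource --resource my-pipeline/homebrew
)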


eedwards-sk commented Aug 23, 2018

Just happened again. In the darwin worker logs, you can see where it seems to be mapping the input:

{"src_path":"/opt/concourse/local/worker/volumes/live/3644b133-40fd-492b-79a3-a5fbac783f42/volume","dst_path":"/tmp/build/bd904192/homebrew","mode":1}

However, the volume path is empty.

Concourse reports the check as successful, but a check-resource still doesn't provide the file.

Edit: Even disabling the available versions and then re-enabling one doesn't help. Concourse thinks it properly staged the file, I guess? I'm not sure if this is a bug in the resource or not, but it's really bad... because I don't have a workaround short of destroying the pipeline!


DanielJonesEB commented Mar 25, 2019

We've got the same thing - our Control Tower pipeline was watching for new Concourse releases and reckons it's got 5.0.1, but the input directory has no release assets in it:

# ls -la concourse-github-release/
total 16
drwxr-xr-x    1 root     root            48 Mar 25 17:59 .
drwxr-xr-x    1 root     root            56 Mar 25 19:32 ..
-rw-r--r--    1 root     root            68 Mar 25 17:59 body
-rw-r--r--    1 root     root            40 Mar 25 17:59 commit_sha
-rw-r--r--    1 root     root             6 Mar 25 17:59 tag
-rw-r--r--    1 root     root             5 Mar 25 17:59 version

# cat concourse-github-release/tag
v5.0.1

The pipeline isn't publicly visible, sadly. Are GitHub releases atomic? Do the files get added to a release after it exists as an entity?

@DanielJonesEB

Recreating the (one) worker in our deployment fixed it, presumably by forcibly dropping the cache. Would be nice if there was a way of doing this without recreating the VM?


kayrus commented Mar 16, 2021

I also have an issue where release binaries are not downloaded. Any clue on how I can debug this?

@DanielJonesEB

@kayrus Restart your workers, and if you're seeing the same issue as us, it should fix it.


kayrus commented Mar 16, 2021

@DanielJonesEB unfortunately I don't have the ability to restart the nodes. Is there a simple way to clear the cache by entering the container?

@DanielJonesEB

@kayrus I don't think so. You'd need to delete the volume that represents that version of the resource, and to do that you'd need to be outside of a check container, and 'at the same level' as the worker process.
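
A rough sketch of how you might locate it (handles and paths here are illustrative; the work dir depends on how your workers are configured):

# with fly, list volumes and note the handle of the resource volume in question
fly -t my-target volumes | grep resource
# on the worker that owns it, the volume's contents live under the worker's work dir
ls /concourse-work-dir/volumes/live/<handle>/volume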


kayrus commented Mar 17, 2021

@DanielJonesEB I can get access to the worker FS, but I cannot reboot it; it runs too many prod jobs. Where should I look for the volume?

@DanielJonesEB

@kayrus I've asked one of my colleagues to comment on where to find the volumes. Can you not add another worker? How many workers do you have? If you can't restart one without causing production problems, that's probably a sign that your system is a little fragile and should have more capacity.


kayrus commented Mar 17, 2021

@DanielJonesEB I don't control it, I just use it, and I have admin privileges. I really don't want to break it. It is running in a k8s cluster and I'd like to carefully clean up the releases cache and figure out why it doesn't download the binaries.
I have 10 worker nodes:

$ kubectl -n concourse get pods | grep concourse-worker-services | wc -l
10

I looped over each of them to find the volume, which I see via fly5 hijack:

bash-5.0# pwd
/tmp/build/get
bash-5.0# find 
.
./tag
./version
./commit_sha
./body

I see my commit, but no binary files from the release:

bash-5.0# cat ./commit_sha && echo
14d3a0c8342a5093e250cec4658d991a1a632d76

Here is the command I'm using to identify the source worker:

$ for i in `kubectl -n concourse get pods | grep concourse-worker-services | awk '{print $1}'`; do echo $i; kubectl -n concourse exec -ti $i -- find /concourse-work-dir -name get -exec grep -r 14d3a0c8 {} \; ; done

However it returns no results. I'm not aware of the low-level Concourse architecture, so I might be looking in the wrong place. I appreciate your help though.

@will-gant

What about safely draining that worker with fly land-worker and then restarting it once you've confirmed that it's running no containers?
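
Something along these lines (the worker name is a placeholder):

fly -t my-target workers                          # find the worker's name
fly -t my-target land-worker -w some-worker       # stop new containers landing on it
fly -t my-target containers | grep some-worker    # wait until nothing is running there
# restart the worker process/pod, then prune the old registration if it stays stalled
fly -t my-target prune-worker -w some-worker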


kayrus commented Mar 17, 2021

@will-gant How can I identify which worker is being used?
P.S. I'm trying to fix the symptoms, not the root cause. Perhaps you could look into providing a way to clear the releases cache in a more API-driven fashion; clear-task-cache doesn't help at all.

@will-gant

If you run fly containers you should be able to find the particular step you're having issues with (you've got a "build" column that, read in conjunction with the "pipeline", "type" and "name" columns, should be enough for you to identify it). Then you'll be able to see the corresponding worker ID.

@will-gant

Perhaps you could look into providing a way to clear the releases cache in a more API-driven fashion; clear-task-cache doesn't help at all.

Ah, I'm not an author of the resource - just a colleague of @DanielJonesEB :-)


kayrus commented Mar 17, 2021

@will-gant found it:

$ fly5 -t services containers | grep my-broken-release
ccdca9cc-190d-4dcf-476e-8033bc522fc2  node-name-gl2ms  pipeline                   shell-build                         127    851195  get    my-broken-release

I entered the worker, but couldn't find anything related to my-broken-release:

$ find / -name 'my*-broken-release*'

However I found the ccdca9cc-190d-4dcf-476e-8033bc522fc2 dir:

# find / -name '*ccdca9cc-190d-4dcf-476e-8033bc522fc2*'
/sys/fs/cgroup/memory/garden/ccdca9cc-190d-4dcf-476e-8033bc522fc2
/sys/fs/cgroup/perf_event/garden/ccdca9cc-190d-4dcf-476e-8033bc522fc2
/sys/fs/cgroup/hugetlb/garden/ccdca9cc-190d-4dcf-476e-8033bc522fc2
/sys/fs/cgroup/cpuset/garden/ccdca9cc-190d-4dcf-476e-8033bc522fc2
/sys/fs/cgroup/blkio/garden/ccdca9cc-190d-4dcf-476e-8033bc522fc2
/sys/fs/cgroup/pids/garden/ccdca9cc-190d-4dcf-476e-8033bc522fc2
/sys/fs/cgroup/devices/garden/ccdca9cc-190d-4dcf-476e-8033bc522fc2
/sys/fs/cgroup/freezer/garden/ccdca9cc-190d-4dcf-476e-8033bc522fc2
/sys/fs/cgroup/cpu,cpuacct/garden/ccdca9cc-190d-4dcf-476e-8033bc522fc2
/sys/fs/cgroup/net_cls,net_prio/garden/ccdca9cc-190d-4dcf-476e-8033bc522fc2
/sys/fs/cgroup/systemd/garden/ccdca9cc-190d-4dcf-476e-8033bc522fc2
/run/runc/ccdca9cc-190d-4dcf-476e-8033bc522fc2
/concourse-work-dir/depot/ccdca9cc-190d-4dcf-476e-8033bc522fc2

but there is nothing about the problem release inside.


kayrus commented Mar 17, 2021

Ok, from inside the failed release container (via hijack) I was able to determine the volume which contains the failed release:

bash-5.0# mount | grep /tmp/build/get
/dev/sdb on /tmp/build/get type btrfs (rw,seclabel,relatime,space_cache,subvolid=20481,subvol=/volumes/live/793b81c7-2225-41b0-518a-3f65ee353b1e/volume)

Then I look for the target worker:

$ for i in `kubectl -n concourse get pods | grep concourse-worker | awk '{print $1}'`; do echo $i; kubectl -n concourse exec -ti $i -- find /concourse-work-dir -name 793b81c7-2225-41b0-518a-3f65ee353b1e ; done

Entering the worker and listing the files:

$ ls -la /concourse-work-dir/volumes/live/793b81c7-2225-41b0-518a-3f65ee353b1e/volume/
total 16
drwxr-xr-x. 1 4294967294 4294967294 48 Mar 17 10:05 .
drwxr-xr-x. 1 root       root       72 Mar 17 10:05 ..
-rw-r--r--. 1 4294967294 4294967294 17 Mar 17 10:05 body
-rw-r--r--. 1 4294967294 4294967294 40 Mar 17 10:05 commit_sha
-rw-r--r--. 1 4294967294 4294967294  6 Mar 17 10:05 tag
-rw-r--r--. 1 4294967294 4294967294  5 Mar 17 10:05 version

I then delete this subvolume from the target node:

btrfs subvolume delete /concourse-work-dir/volumes/live/793b81c7-2225-41b0-518a-3f65ee353b1e/volume

I restart the task and it still has the same issue: no releases, and a new failed release subvolume is now provisioned on another node.


kayrus commented Mar 17, 2021

Removing all related buggy subvolumes from every worker now ends with failed to create volume.
Now back to my initial question: how can I force Concourse to redownload the release?

@crsimmons

There isn't currently any way to clear the cache and force a resource refresh. This is a feature that has been talked about for years and the issue is still open (concourse/concourse#1038) but it hasn't been implemented.

I suspect you also need to remove references to the deleted volume from Concourse's postgres DB, but I'm not familiar enough with the schema to point you to where.

@DanielJonesEB

Yeah, we need to get Concourse to realise that it doesn't have the volume, and therefore to download it again (now that the binaries are uploaded to GitHub). Recreating the worker does this (it has to download the files from somewhere to be able to use them), and I was hoping that deleting the volume itself would do the trick. It could be that there's a reference to it somewhere that makes Concourse think that it's already downloaded it, and so it won't re-download it, but will error when trying to access the volume that's now been deleted?

@crsimmons Would we expect volumes to be streamed from other workers? If the same cached volume with no binaries in exists on other workers, would we need to worry about bouncing those too?


kayrus commented Mar 17, 2021

@crsimmons I found traces of the release in postgres and removed them:

concourse=> select id from resource_caches where metadata LIKE '%14d3a0c8342a5093e250cec4658d991a1a632d76%';
   id
--------
 632031
(1 row)
concourse=> delete from resource_cache_uses where resource_cache_id = 632031;
DELETE 1
concourse=> delete from resource_caches where id = 632031;
DELETE 1

I restarted the pipeline and clearly saw that the problem release was getting updated compared to the others, but it still doesn't contain the binaries. The repo is public and I can clearly see the binaries on the release page. The config for the pipeline is:

- name: problem-release.release
  type: github-release
  check_every: 5m
  source:
    access_token: ((github-com-access-token))
    owner: our-public-org
    repository: our-public-repo

Other releases have the same format and no problem. Perhaps it is somehow related to the fact that we renamed the repo from previous-name to our-public-repo?


kayrus commented Mar 17, 2021

Other releases have the same format and no problem. Perhaps it is somehow related to the fact that we renamed the repo from previous-name to our-public-repo?

I found the issue. The pipeline sources were updated in GitHub, but never applied to Concourse, so Concourse happily kept using the old glob patterns. These globs were not visible in the Concourse pipeline UI, but I clearly saw the new repo URL. I applied the pipeline to Concourse manually and now I can see the release being downloaded. At least now I know the Concourse architecture better :)
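
For anyone else hitting this: the get step's glob patterns decide which release assets end up in the volume, so if they no longer match anything (here, because the assets were renamed along with the repo) the get still succeeds but writes only the metadata files. A hypothetical illustration of the kind of stale params involved (the real patterns weren't shared, so these names are made up):

    - get: problem-release.release
      params:
        globs:
        - previous-name_*.tar.gz    # stale pattern from before the rename, matches no assets
# after the rename, the pattern needs to match the new asset names, e.g.
    - get: problem-release.release
      params:
        globs:
        - our-public-repo_*.tar.gz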

Thanks everyone for the help

@DanielJonesEB

@kayrus Glad you got it sorted! I would really recommend making all your pipelines 'self-set' as their first step.

Here's a running example and the corresponding config.

Stick something like this at the beginning of your pipelines, or encourage your developers to do so:

jobs:
- name: set-pipeline
  serial: true
  plan:
  - get: git-repo
    trigger: true
  - set_pipeline: self
    file: git-repo/ci/pipeline.yml


kayrus commented Mar 17, 2021

@DanielJonesEB will keep this in mind. Thanks.
