
silently failing to download and stage source release file #75

Open
eedwards-sk opened this issue Aug 17, 2018 · 23 comments

eedwards-sk commented Aug 17, 2018

with concourse 4.0.0:

I have a pipeline which grabs a github release archive:

- name: homebrew
  type: github-release
  check_every: 12h
  source:
    owner: Homebrew
    repository: brew

and

    - get: homebrew
      params:
        include_source_tarball: true

The pipeline was working fine. Then I put my computer to sleep with the stack 'stopped', resumed my PC, and woke the stack back up. The pipeline ran again as expected and showed a valid version for the homebrew resource.

However, when it gets to the step in the pipeline where it uses the file, there's no .tar.gz file to be found!

+ cd /opt/concourse/local/worker/volumes/live/7df5d723-4cb4-4239-7b73-d3c6857ebe6c/volume/homebrew
+ ls -la
total 0
drwxr-xr-x  2 eedwards  staff   64 Aug 16 10:41 .
drwxr-xr-x  6 root      staff  192 Aug 17 09:40 ..

So for whatever reason the check container is reporting success, and concourse is acting like it pulled and staged the resource correctly, but the binary file is missing!
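
For comparison, a healthy get of this resource would be expected to contain the metadata files plus the tarball; a rough sketch (assuming current versions of the resource write the source tarball as source.tar.gz when include_source_tarball is true):

body
commit_sha
source.tar.gz
tag
version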

Also, a fly check-resource succeeds, but the file is still missing!
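
(For reference, triggering the check by hand looks something like this; the target and pipeline names are placeholders:

fly -t my-target check-resource --resource my-pipeline/homebrew
)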


eedwards-sk commented Aug 23, 2018

Just happened again. In the darwin worker logs, you can see where it seems to be mapping the input:

{"src_path":"/opt/concourse/local/worker/volumes/live/3644b133-40fd-492b-79a3-a5fbac783f42/volume","dst_path":"/tmp/build/bd904192/homebrew","mode":1}

However, the volume path is empty.

Concourse reports the check as successful, but a check-resource still doesn't provide the file.

Edit: Even disabling the available versions and then re-enabling one doesn't help. Concourse thinks it properly staged the file, I guess? I'm not sure if this is a bug in the resource or not, but it's really bad... because I don't have a workaround short of destroying the pipeline!


DanielJonesEB commented Mar 25, 2019

We've got the same thing - our Control Tower pipeline was watching for new Concourse releases and reckons it's got 5.0.1, but the input directory has no release assets in it:

# ls -la concourse-github-release/
total 16
drwxr-xr-x    1 root     root            48 Mar 25 17:59 .
drwxr-xr-x    1 root     root            56 Mar 25 19:32 ..
-rw-r--r--    1 root     root            68 Mar 25 17:59 body
-rw-r--r--    1 root     root            40 Mar 25 17:59 commit_sha
-rw-r--r--    1 root     root             6 Mar 25 17:59 tag
-rw-r--r--    1 root     root             5 Mar 25 17:59 version

# cat concourse-github-release/tag
v5.0.1

The pipeline isn't publicly visible, sadly. Are GitHub releases atomic? Do the files get added to a release after it exists as an entity?

@DanielJonesEB

Recreating the (one) worker in our deployment fixed it, presumably by forcibly dropping the cache. Would be nice if there was a way of doing this without recreating the VM?


kayrus commented Mar 16, 2021

I also have an issue where release binaries are not downloaded. Any clue on how I can debug this?

@DanielJonesEB

@kayrus Restart your workers, and if you're seeing the same issue as us, it should fix it.


kayrus commented Mar 16, 2021

@DanielJonesEB unfortunately I don't have the ability to restart the nodes. Is there a simple way to clear the cache by entering the container?

@DanielJonesEB

@kayrus I don't think so. You'd need to delete the volume that represents that version of the resource, and to do that you'd need to be outside of a check container, and 'at the same level' as the worker process.
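
A rough sketch of how you might locate it (handles and paths here are illustrative; the work dir depends on how your workers are configured):

# with fly, list volumes and note the handle of the resource volume in question
fly -t my-target volumes | grep resource
# on the worker that owns it, the volume's contents live under the worker's work dir
ls /concourse-work-dir/volumes/live/<handle>/volume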


kayrus commented Mar 17, 2021

@DanielJonesEB I can get access to the worker FS, but I cannot reboot it; it runs too many prod jobs. Where should I look for the volume?

@DanielJonesEB

@kayrus I've asked one of my colleagues to comment on where to find the volumes. Can you not add another worker? How many workers do you have? If you can't restart one without causing production problems, that's probably a sign that your system is a little fragile and should have more capacity.


kayrus commented Mar 17, 2021

@DanielJonesEB I don't control it, I just use it, and I have admin privileges. I really don't want to break it. It is running in a k8s cluster and I'd like to carefully clean up the releases cache and figure out why it doesn't download the binaries.
I have 10 worker nodes:

$ kubectl -n concourse get pods | grep concourse-worker-services | wc -l
10

I looped over each of them to find the volume, which I see via fly5 hijack:

bash-5.0# pwd
/tmp/build/get
bash-5.0# find 
.
./tag
./version
./commit_sha
./body

I see my commit, but no binary files from the release:

bash-5.0# cat ./commit_sha && echo
14d3a0c8342a5093e250cec4658d991a1a632d76

Here is the command I'm using to identify the source worker:

$ for i in `kubectl -n concourse get pods | grep concourse-worker-services | awk '{print $1}'`; do echo $i; kubectl -n concourse exec -ti $i -- find /concourse-work-dir -name get -exec grep -r 14d3a0c8 {} \; ; done

However it returns no results. I'm not aware of the low-level Concourse architecture, so I might be looking in the wrong place. I appreciate your help though.

@will-gant

What about safely draining that worker with fly land-worker and then restarting it once you've confirmed that it's running no containers?
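
Something along these lines (the worker name is a placeholder):

fly -t my-target workers                          # find the worker's name
fly -t my-target land-worker -w some-worker       # stop new containers landing on it
fly -t my-target containers | grep some-worker    # wait until nothing is running there
# restart the worker process/pod, then prune the old registration if it stays stalled
fly -t my-target prune-worker -w some-worker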


kayrus commented Mar 17, 2021

@will-gant How can I identify which worker is being used?
P.S. I'm trying to fix the symptoms, not the root cause. Perhaps you could look into providing a way to clear the releases cache in a more API-driven fashion; clear-task-cache doesn't help at all.

@will-gant

If you run fly containers you should be able to find the particular step you're having issues with (you've got a "build" column that, read in conjunction with the "pipeline", "type" and "name" columns, should be enough for you to identify it). Then you'll be able to see the corresponding worker ID.

@will-gant

Perhaps you could look into providing a way to clear the releases cache in a more API-driven fashion; clear-task-cache doesn't help at all.

Ah, I'm not an author of the resource - just a colleague of @DanielJonesEB :-)


kayrus commented Mar 17, 2021

@will-gant found it:

$ fly5 -t services containers | grep my-broken-release
ccdca9cc-190d-4dcf-476e-8033bc522fc2  node-name-gl2ms  pipeline                   shell-build                         127    851195  get    my-broken-release

I entered the worker, but couldn't find anything related to my-broken-release:

$ find / -name 'my*-broken-release*'

However I found the ccdca9cc-190d-4dcf-476e-8033bc522fc2 dir:

# find / -name '*ccdca9cc-190d-4dcf-476e-8033bc522fc2*'
/sys/fs/cgroup/memory/garden/ccdca9cc-190d-4dcf-476e-8033bc522fc2
/sys/fs/cgroup/perf_event/garden/ccdca9cc-190d-4dcf-476e-8033bc522fc2
/sys/fs/cgroup/hugetlb/garden/ccdca9cc-190d-4dcf-476e-8033bc522fc2
/sys/fs/cgroup/cpuset/garden/ccdca9cc-190d-4dcf-476e-8033bc522fc2
/sys/fs/cgroup/blkio/garden/ccdca9cc-190d-4dcf-476e-8033bc522fc2
/sys/fs/cgroup/pids/garden/ccdca9cc-190d-4dcf-476e-8033bc522fc2
/sys/fs/cgroup/devices/garden/ccdca9cc-190d-4dcf-476e-8033bc522fc2
/sys/fs/cgroup/freezer/garden/ccdca9cc-190d-4dcf-476e-8033bc522fc2
/sys/fs/cgroup/cpu,cpuacct/garden/ccdca9cc-190d-4dcf-476e-8033bc522fc2
/sys/fs/cgroup/net_cls,net_prio/garden/ccdca9cc-190d-4dcf-476e-8033bc522fc2
/sys/fs/cgroup/systemd/garden/ccdca9cc-190d-4dcf-476e-8033bc522fc2
/run/runc/ccdca9cc-190d-4dcf-476e-8033bc522fc2
/concourse-work-dir/depot/ccdca9cc-190d-4dcf-476e-8033bc522fc2

but there is nothing about the problem release inside.


kayrus commented Mar 17, 2021

Ok, from inside the failed release container (via hijack) I was able to determine the volume which contains the failed release:

bash-5.0# mount | grep /tmp/build/get
/dev/sdb on /tmp/build/get type btrfs (rw,seclabel,relatime,space_cache,subvolid=20481,subvol=/volumes/live/793b81c7-2225-41b0-518a-3f65ee353b1e/volume)

Then I look for the target worker:

$ for i in `kubectl -n concourse get pods | grep concourse-worker | awk '{print $1}'`; do echo $i; kubectl -n concourse exec -ti $i -- find /concourse-work-dir -name 793b81c7-2225-41b0-518a-3f65ee353b1e ; done

Entering the worker and listing the files:

$ ls -la /concourse-work-dir/volumes/live/793b81c7-2225-41b0-518a-3f65ee353b1e/volume/
total 16
drwxr-xr-x. 1 4294967294 4294967294 48 Mar 17 10:05 .
drwxr-xr-x. 1 root       root       72 Mar 17 10:05 ..
-rw-r--r--. 1 4294967294 4294967294 17 Mar 17 10:05 body
-rw-r--r--. 1 4294967294 4294967294 40 Mar 17 10:05 commit_sha
-rw-r--r--. 1 4294967294 4294967294  6 Mar 17 10:05 tag
-rw-r--r--. 1 4294967294 4294967294  5 Mar 17 10:05 version

I then delete this subvolume from the target node:

btrfs subvolume delete /concourse-work-dir/volumes/live/793b81c7-2225-41b0-518a-3f65ee353b1e/volume

I restart the task and it still has the same issue: no releases, and a new failed release subvolume is now provisioned on another node.


kayrus commented Mar 17, 2021

Removing all related buggy subvolumes from every worker now ends with failed to create volume.
Now back to my initial question: how can I force Concourse to redownload the release?

@crsimmons

There isn't currently any way to clear the cache and force a resource refresh. This is a feature that has been talked about for years and the issue is still open (concourse/concourse#1038) but it hasn't been implemented.

I suspect you also need to remove references to the deleted volume from Concourse's postgres DB, but I'm not familiar enough with the schema to point you to where.

@DanielJonesEB

Yeah, we need to get Concourse to realise that it doesn't have the volume, and therefore to download it again (now that the binaries are uploaded to GitHub). Recreating the worker does this (it has to download the files from somewhere to be able to use them), and I was hoping that deleting the volume itself would do the trick. It could be that there's a reference to it somewhere that makes Concourse think that it's already downloaded it, and so it won't re-download it, but will error when trying to access the volume that's now been deleted?

@crsimmons Would we expect volumes to be streamed from other workers? If the same cached volume with no binaries in exists on other workers, would we need to worry about bouncing those too?


kayrus commented Mar 17, 2021

@crsimmons I found traces of the release in postgres and removed them:

concourse=> select id from resource_caches where metadata LIKE '%14d3a0c8342a5093e250cec4658d991a1a632d76%';
   id
--------
 632031
(1 row)
concourse=> delete from resource_cache_uses where resource_cache_id = 632031;
DELETE 1
concourse=> delete from resource_caches where id = 632031;
DELETE 1

I restarted the pipeline and clearly saw that the problem release was getting updated compared to the others, but it still doesn't contain the binaries. The repo is public and I can clearly see the binaries on the release page. The config for the pipeline is:

- name: problem-release.release
  type: github-release
  check_every: 5m
  source:
    access_token: ((github-com-access-token))
    owner: our-public-org
    repository: our-public-repo

Other releases have the same format and no problem. Perhaps it is somehow related to the fact that we renamed the repo from previous-name to our-public-repo?


kayrus commented Mar 17, 2021

Other releases have the same format and no problem. Perhaps it is somehow related to the fact that we renamed the repo from previous-name to our-public-repo?

I found the issue. The pipeline sources were updated in GitHub, but never applied to Concourse, so Concourse happily kept using the old glob patterns. These globs were not visible in the Concourse pipeline UI, but I clearly saw the new repo URL. I applied the pipeline to Concourse manually and now I can see the release being downloaded. At least now I know the Concourse architecture better :)
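
For anyone else hitting this: the get step's glob patterns decide which release assets end up in the volume, so if they no longer match anything (here, because the assets were renamed along with the repo) the get still succeeds but writes only the metadata files. A hypothetical illustration of the kind of stale params involved (the real patterns weren't shared, so these names are made up):

    - get: problem-release.release
      params:
        globs:
        - previous-name_*.tar.gz    # stale pattern from before the rename, matches no assets
# after the rename, the pattern needs to match the new asset names, e.g.
    - get: problem-release.release
      params:
        globs:
        - our-public-repo_*.tar.gz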

Thanks everyone for the help

@DanielJonesEB

@kayrus Glad you got it sorted! I would really recommend making all your pipelines 'self-set' as their first step.

Here's a running example and the corresponding config.

Stick something like this at the beginning of your pipelines, or encourage your developers to do so:

jobs:
- name: set-pipeline
  serial: true
  plan:
  - get: git-repo
    trigger: true
  - set_pipeline: self
    file: git-repo/ci/pipeline.yml


kayrus commented Mar 17, 2021

@DanielJonesEB will keep this in mind. Thanks.
