Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zookeeper: Make Create+Put on a new znode atomic #148

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

amirma
Copy link

@amirma amirma commented Feb 24, 2017

Writes to a new Zookeepr znode should take advantage of Zookeeper's
atomic create + write primitive. If not, it is possible that a read
that was triggered by a watch will return an empty string. The previous
workaround for this does not always work (e.g., when an empty string
is returned due to a race) and can potentially cause call-stack overflow.
This change-set fixes this race and removes the workaround.
It also adds a call to Zookeepeer's Sync() on a Get operation,
only when an empty string (or SOH) is returned to guard against an older
version of libkv doing create+write in a non atomic fashion.

This change-set addresses github.com/docker-archive/classicswarm/issues/1915

Signed-off-by: Amir Malekpour [email protected]

Writes to a new Zookeepr znode should take advantage of Zookeeper's
atomic create + write primitive. If not, it is possible that a read
that was triggered by a watch will return an empty string. The previous
workaround for this does not always work (e.g., when an empty string
is returnedi due to a race) and can potentially cause call-stack overflow.
This change-set fixes this race and removes the workaround.
It also adds a call to Zookeepeer's Sync() on a Get operation,
only when an empty string (or SOH) is returned to guard against an older
version of libkv doing create+write in a non atomic fashion.

This change-set addresses github.com/docker-archive/classicswarm/issues/1915

Signed-off-by: Amir Malekpour <[email protected]>
Copy link

@dongluochen dongluochen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If possible, we should add test to validate the changes.

// SOH control character instead of the actual value
if string(resp) == SOH {
return s.Get(store.Normalize(key))
if (string(resp) == SOH || string(resp) == "") && i < syncRetryLimit {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel i < syncRetryLimit here is unnecessary.

Copy link
Author

@amirma amirma Mar 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongluochen The i < syncRetryLimit is to prevent a Sync in the very last iteration of the for loop. Without this, we would unnecessarily call sync and then exit the loop. (So effectively we call sync not more than syncRetryLimit number of times).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.

@dongluochen
Copy link

I'm not familiar with this repo. CI is failing for last few PRs.

0.08s$ script/travis_consul.sh 0.6.3
--2017-02-24 01:32:49--  https://releases.hashicorp.com/consul/0.6.3/consul_0.6.3_linux_amd64.zip
Resolving releases.hashicorp.com (releases.hashicorp.com)... 151.101.33.183
Connecting to releases.hashicorp.com (releases.hashicorp.com)|151.101.33.183|:443... connected.
Unable to establish SSL connection.

// SOH control character instead of the actual value
if string(resp) == SOH {
return s.Get(store.Normalize(key))
if (string(resp) == SOH || string(resp) == "") && i < syncRetryLimit {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.

defaultTimeout = 10 * time.Second
defaultTimeout = 10 * time.Second

syncRetryLimit = 5

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to choose 5? I don't know but it seems high. How much delay this may introduce?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongluochen No specific reason, except in my test environment 3 retries would always be needed to get the correct value, and I rounded this up to 5.
Note that the sync currently happens on SOH or empty string getting returned, and not on all reads. Also this is to work around older versions of libkv which have this bug. The versions that have this patch will do create+write atomically and should not return SOH or empty so the sync should never be needed.

@dongluochen
Copy link

Ping @docker/core-libkv-maintainers.

CI is also failing.

@mcapuccini
Copy link

@dongluochen my PR fixes it #154

Copy link

@nishanttotla nishanttotla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amirma are you still working on this?

@amirma
Copy link
Author

amirma commented Jun 21, 2017

@nishanttotla We've merged this in our production repository and I can confirm it has fixed the issue. I think the patch is ready to merge as is.

@nishanttotla
Copy link

@amirma thanks for the prompt reply! Seems like your build is failing. Could you take a look at that? It would be nice to merge this.

@amirma
Copy link
Author

amirma commented Jun 21, 2017

@nishanttotla I just pushed another build and it failed due to a new (I believe infrastructure-related) reason. Any insights?

@nishanttotla
Copy link

It seems like some package paths/locations may have changed:

27.62s$ go get -t -v ./...
github.com/stretchr/testify (download)
github.com/boltdb/bolt (download)
github.com/hashicorp/consul (download)
github.com/coreos/etcd (download)
github.com/coreos/go-semver (download)
github.com/ugorji/go (download)
Fetching https://golang.org/x/net/context?go-get=1
Parsing meta tags from https://golang.org/x/net/context?go-get=1 (status code 200)
get "golang.org/x/net/context": found meta tag main.metaImport{Prefix:"golang.org/x/net", VCS:"git", RepoRoot:"https://go.googlesource.com/net"} at https://golang.org/x/net/context?go-get=1
get "golang.org/x/net/context": verifying non-authoritative meta tag
Fetching https://golang.org/x/net?go-get=1
Parsing meta tags from https://golang.org/x/net?go-get=1 (status code 200)
golang.org/x/net (download)
github.com/samuel/go-zookeeper (download)
github.com/docker/libkv/store
github.com/boltdb/bolt
github.com/hashicorp/consul/vendor/github.com/hashicorp/go-cleanhttp
github.com/hashicorp/consul/vendor/github.com/hashicorp/serf/coordinate
github.com/coreos/etcd/pkg/pathutil
github.com/coreos/etcd/pkg/types
github.com/coreos/go-semver/semver
github.com/ugorji/go/codec
golang.org/x/net/context
github.com/docker/libkv
github.com/stretchr/testify/vendor/github.com/davecgh/go-spew/spew
github.com/stretchr/testify/vendor/github.com/pmezard/go-difflib/difflib
github.com/stretchr/testify/vendor/github.com/stretchr/objx
github.com/samuel/go-zookeeper/zk
github.com/coreos/etcd/version
github.com/hashicorp/consul/api
github.com/stretchr/testify/assert
github.com/docker/libkv/store/boltdb
github.com/docker/libkv/store/zookeeper
github.com/stretchr/testify/mock
github.com/docker/libkv/testutils
github.com/docker/libkv/store/mock
github.com/docker/libkv/store/consul
github.com/coreos/etcd/client
github.com/docker/libkv/store/etcd
0.08s$ script/travis_consul.sh 0.6.3
--2017-02-24 01:32:49--  https://releases.hashicorp.com/consul/0.6.3/consul_0.6.3_linux_amd64.zip
Resolving releases.hashicorp.com (releases.hashicorp.com)... 151.101.33.183
Connecting to releases.hashicorp.com (releases.hashicorp.com)|151.101.33.183|:443... connected.
Unable to establish SSL connection.
unzip:  cannot find or open consul_0.6.3_linux_amd64.zip, consul_0.6.3_linux_amd64.zip.zip or consul_0.6.3_linux_amd64.zip.ZIP.
script/travis_consul.sh: line 18: ./consul: No such file or directory

Ping @docker/core-libkv-maintainers, the CI might need fixing.

@abronan
Copy link
Contributor

abronan commented Aug 29, 2017

The fix is included to my fork: abronan/libkv along with other patches, thanks @amirma.

@nishanttotla
Copy link

@abronan should we merge this PR, or might it need more changes? I think this is useful and fixes a long standing issue.

@abronan
Copy link
Contributor

abronan commented Aug 29, 2017

@nishanttotla I think it is safe to merge it yes, just merge the other PRs that are fixing the build first (the commit history on my fork could be of help).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants