Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Etcd membership module #17736

Merged
merged 62 commits into from
Aug 1, 2023
Merged

Conversation

lucyge2022
Copy link
Contributor

What changes are proposed in this pull request?

Add membership module for dora membership management.

Why are the changes needed?

Allow usage of etcd cluster or static configuration instead of Master for membership management

Does this PR introduce any user facing changes?

No

@lucyge2022 lucyge2022 marked this pull request as ready for review July 5, 2023 22:25
@alluxio-bot
Copy link
Contributor

Automated checks report:

  • PR title follows the conventions: FAIL
    • The title of the PR does not pass all the checks. Please fix the following issues:
      • First word of title ("Etcd") is not an imperative verb. Please use one of the valid words
  • Commits associated with Github account: PASS

Some checks failed. Please fix the reported issues and reply 'alluxio-bot, check this please' to re-run checks.

@lucyge2022 lucyge2022 changed the title Etcd membership Add Etcd membership module Jul 6, 2023
@alluxio-bot
Copy link
Contributor

Automated checks report:

  • PR title follows the conventions: PASS
  • Commits associated with Github account: PASS

All checks passed!

Copy link
Contributor

@jenoudet jenoudet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there plans to introduce dependency injection through Guice to this new code? That would match the rest of the worker code.

dora/core/common/src/main/java/alluxio/Constants.java Outdated Show resolved Hide resolved
dora/core/common/src/main/java/alluxio/Constants.java Outdated Show resolved Hide resolved
Comment on lines 93 to 154
public void connect() {
connect(false);
}

public void connect(boolean force) {
if (mConnected.get() && !force) {
return;
}
mConnected.set(false);
// create client using endpoints
Client client = Client.builder().endpoints(mEndpoints)
.build();
if (mConnected.compareAndSet(false, true)) {
mClient = client;
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Client creation should equal client connection, as is the case for gRPC clients and raw etcd clients. wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i actually refer to AbstractClient.connect() which is called within retryRpcInternal, and got triggered on first rpc call made to grpc services. Therefore AlluxioEtcdClient.connect() is getting called on first usage of itself, like getEtcdClient(), start a new ServiceDiscovery() instance where the instantiation is embedded in AlluxioEtcdClient constructor, so in a way, AlluxioEtcdClient.connect gets called at time of instantiation.

Copy link
Contributor

@dbw9580 dbw9580 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Lucy, amazing work! I have some comments from a perspective of a consumer (e.g. worker selection policies) of this API.

* @return all registered workers
* @throws IOException
*/
public List<WorkerInfo> getAllMembers() throws IOException;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make the return type a dedicated data structure? e.g. a ClusterMembershipView that allows doing things more efficiently?

  • checking if a worker is a member of the cluster

This is more efficient if done with a Set instead of a List.

  • checking if two view snapshots are the same, and
  • calculating the diff between two view snapshots

These two are useful for worker selection policies to tell if it is necessary to invalidate their cached view of the cluster members. For example WorkerLocationPolicy.ConsistentHashProvider currently caches the list of workers to save the cost of rebuilding a hash ring every time when called to decide the preferred worker for a file.

  • seletively querying workers by their properties, e.g. liveness

So that you can move the getLiveMemebers getFailedMember methods to the view object and make this interface cleaner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like public ClusterMembershipView getView() {
return new ClusterMembershipView()
}
ClusterMembershipView {
equals();
getLiveMembers
getFailedMembers
}
like this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes exactly!

dora/core/common/pom.xml Outdated Show resolved Hide resolved
Copy link
Contributor

@apc999 apc999 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the PR! this is my 1st batch of comments mostly on style and readability. I didn't get into details of the logics

private static final Lock INSTANCE_LOCK = new ReentrantLock();
@GuardedBy("INSTANCE_LOCK")
private static final AtomicReference<AlluxioEtcdClient> ALLUXIO_ETCD_CLIENT = new AtomicReference<>();
protected AtomicBoolean mConnected = new AtomicBoolean(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this mMember to be protected and not final ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

think u meant mConnected?
changed to private final

dora/core/common/src/main/java/alluxio/MembershipType.java Outdated Show resolved Hide resolved
@lucyge2022 lucyge2022 force-pushed the etcd-membership branch 4 times, most recently from 1d56796 to f8ffeed Compare July 24, 2023 23:50
Copy link
Contributor

@Xenorith Xenorith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleanup related comments

Copy link
Contributor

@Xenorith Xenorith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non java changes look good to me

Copy link
Contributor

@jiacheliu3 jiacheliu3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey sorry I just submitted a 1st batch of comments as I haven't fully read the longest few classes. I will try to finish very soon.

try {
GetResponse getResp = mClient.getKVClient().get(
ByteSequence.from(mBarrierPath, StandardCharsets.UTF_8)).get();
LOG.info("get key:{}, [{}]", mBarrierPath, getResp.getKvs());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this be noisy for INFO log?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually this class is not used anywhere, but keeping it thinking it might be useful in future, will remove it as part of this PR

@@ -83,7 +84,8 @@ enum Command {
SUMMARY, // Report cluster summary
UFS, // Report under filesystem information
JOBSERVICE, // Report job service metrics information
PROXY // Report proxy information in the cluster
PROXY, // Report proxy information in the cluster
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use bin/alluxio fsadmin report capacity to list all workers and their capacity. Does that overlap with what you are doing here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't report any worker capacity info, only the live/fail status of each worker

Copy link
Contributor

@jiacheliu3 jiacheliu3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about the delay! I just added the rest of my comments. Honestly majority of those are about formatting and readability, with a few sanity checks for potential catches. PTAL, thanks!

case STATIC:
return new StaticMembershipManager(conf);
case ETCD:
return new EtcdMembershipManager(conf);
case NOOP:
return new NoOpMembershipManager();
default:
throw new IOException("Unrecognized Membership Type.");
throw new IllegalStateException("Unrecognized Membership Type");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: what this type is when unrecognized

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would directly throw exception without returning any type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i mean i think you can log the unrecognized type in the exception for better troubleshooting, just a nit

Comment on lines 34 to 38
listen-peer-urls: http://172.31.30.204:2380

# List of comma separated URLs to listen on for client traffic.
#give ip/hostname of this etcd instance
listen-client-urls: http://172.31.30.204:2379,http://127.0.0.1:2379
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think an example fits better in comments. So a user wouldn't just copy this config and start and see mysterious errors/behaviors.

Comment on lines +7 to +8
StandardOutput=append:/var/log/etcd.log
StandardError=append:/var/log/etcd.err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mac has /var/log but that belongs to root. I figure it's worth mentioning in a short comment, that one can use this path on Mac but may need sudo. I figure the template file is most useful when one want to test and experience a feature, and that test may very likely happen on Mac instead of linux.

Comment on lines +33 to +38
// revision number of kv pair of registered entity on etcd, used for CASupdate
protected long mRevision;
public final ReentrantLock mLock = new ReentrantLock();
public AtomicBoolean mNeedReconnect = new AtomicBoolean(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add some comments about the thread safety requirements and guarantees, as I can't make a reasonable guess based on these fields here. Do you need volatile for the mRevision as you wanna CAS it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[ADDRESS IN NEXT PR]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

*/
public class HashUtils {

private static final HashFunction HASH_FUNCTION = murmur3_32_fixed();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this hash func is deprecated but it matters little here

@@ -54,6 +54,11 @@ public List<WorkerInfo> getWorkerInfos() {
}
}

@Override
public List<WorkerInfo> getLiveWorkerInfos() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just wanted to make sure, this does not throw any declared exception, meaning all errors will be RuntimeException?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there should be, this needs to be further revamped when developing etcd interaction for scheduler

mConf, ServerUserState.global()));
// TODO(lucy): temporary fallback logic during transition of removing master dependency
if (mMembershipManager instanceof NoOpMembershipManager) {
getExecutorService()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a log here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[ADDRESS IN NEXT PR]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

registerToMaster();
return;
}
while (true) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: a little log here, as you might get stuck in this loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RetryPolicy has a max duration of attempted retry.

Copy link
Contributor

@apc999 apc999 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the contribution! I left comments, most are about coding convention.

dora/core/common/src/main/java/alluxio/MembershipType.java Outdated Show resolved Hide resolved
+ "Choose STATIC for pre-configured members."
+ "Choose ETCD for using etcd for membership management")
.setConsistencyCheckLevel(ConsistencyCheckLevel.WARN)
.setScope(Scope.WORKER)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this membership type will affect client behavior, correct? if yes, it shall be ALL scope

Suggested change
.setScope(Scope.WORKER)
.setScope(Scope.ALL)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

.setDescription("Type of membership manager used for workers."
+ "Choose STATIC for pre-configured members."
+ "Choose ETCD for using etcd for membership management")
.setConsistencyCheckLevel(ConsistencyCheckLevel.WARN)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a critical config that all components need to have the same consistent setting.

Suggested change
.setConsistencyCheckLevel(ConsistencyCheckLevel.WARN)
.setConsistencyCheckLevel(ConsistencyCheckLevel.ENFORCE)

* Base Entity class including information to register to Etcd
* when using EtcdMembershipManager.
*/
public class ServiceEntity implements Closeable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make a ServiceEntity interface with the only required methods,
but leaving this implementation like BasicServiceEntry or DefaultServiceEntry.
Actually will there be any one create an instance of this class? I didn't find one though.

Also, this does not look like threadsafe to me

* MembershipManager configured by a static file.
*/
public class StaticMembershipManager implements MembershipManager {
List<WorkerInfo> mMembers;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

private final?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

* @param conf
* @throws IOException
*/
public StaticMembershipManager(AlluxioConfiguration conf) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

constructors shall be simple and always succeed without throwing Exceptions.

Check alluxio.client.file.cache.LocalCacheManager#create and alluxio.client.file.cache.LocalCacheManager#LocalCacheManager,

create is a static factory method that can through exceptions, while LocalCacheManager is a constructor purely receiving member assignment (absolutely will complete)

Handling a constructor that can throw exceptions is tricky

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

private void register() throws IOException {
Preconditions.checkState(mAddress != null, "worker not started");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Preconditions.checkState(mAddress != null, "worker not started");
Preconditions.checkNotNull(mAddress, "worker not started");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@jiacheliu3 jiacheliu3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the great work! I'm good except a few doc-related commends like https://github.com/Alluxio/alluxio/pull/17736/files#r1275694116

Despite that there are few comments from Bin which can lead to code changes, I will give my stamp in order not to block the progress. I'll leave the rest to your judgements :P

2. use conf/workers as default static worker membership mgr typed file
and use existing util to parse it accordingly
…ration

   of thread safety / atomicity / race condition addressing in doc. Remove
   unwanted lock.
2. Close MembershipManager properly in Worker close & FileSystemContext closeContext
3. address more review comments
Copy link
Contributor

@apc999 apc999 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address in this PR (git related)

  • the change in this PR for dora/core/server/worker/src/main/java/alluxio/worker/AlluxioWorker.java looks like a regression from main branch. please fix it
  • change in dora/tests/pom.xml also looks like a rebase regression

In follow up PR

  • please remember address comments w.r.t. potential stream resource leak in dora/core/common/src/main/java/alluxio/membership/EtcdMembershipManager.java.

Comment on lines +503 to +507
try (AutoCloseable ignoredCloser = mMembershipManager) {
// do nothing as we are closing
} catch (Exception e) {
throw new IOException(e);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this necessary since there is nothing inside the try-catch block?
how is this different from simply calling mMembershipManager.close()?

public static final PropertyKey ALLUXIO_CLUSTER_NAME =
stringBuilder(Name.ALLUXIO_CLUSTER_NAME)
.setDefaultValue("DefaultAlluxioCluster").build();
public static final PropertyKey ETCD_ENDPOINTS =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please address this also in the next PR @lucyge2022

* its atomicity/consistency semantics goes with what ETCD has to offer, this class
* does not add upon any semantics itself.
*
* AlluxioEtcdClient should only be used as singleton wrapping one jetcd Client object,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very helpful javadoc!

* @param watchType
*/
private void addListenerInternal(
String parentPath, StateListener listener, WatchType watchType) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to check if parentPath is not an empty String?

break;
case UNRECOGNIZED: // Fall through
default:
LOG.info("Unrecognized event:{} on watch path of:{}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOG.error

try {
unregisterService(entry.getKey());
} catch (IOException ex) {
LOG.info("Unregister all services failed unregistering for:{}.", entry.getKey(), ex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOG.error?

newLeaseInternal(entity);
entity.mNeedReconnect.set(false);
} catch (IOException e) {
LOG.info("Failed trying to new the lease for service:{}", entity, e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOG.error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

@Override
@VisibleForTesting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove @VisibleForTesting in the following methods.
we just need it in the interface, only for these methods only exposed for testing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -229,7 +238,8 @@ public static String description() {
+ " metrics metrics information\n"
+ " summary cluster summary\n"
+ " ufs under storage system information\n"
+ " jobservice job service metrics information\n";
+ " jobservice job service metrics information\n"
+ " nodestatus node status [worker as of now]\n";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feel free to leave it next pr, but we shall just call this cmd either status or nodes (workers).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[ADDRESS IN NEXT PR]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not part of ReportCommand anymore, I have created another fsadmin cmd, named "nodes", we can do showstatus or other admin functionalities such as remove nodes over there.
introduced in #17890
e.g.
$bin/alluxio fsadmin nodes status

@@ -88,6 +88,7 @@
<dependency>
<groupId>org.testcontainers</groupId>
<artifactId>testcontainers</artifactId>
<version>1.14.3</version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed

@lucyge2022
Copy link
Contributor Author

alluxio-bot, merge this please

@alluxio-bot
Copy link
Contributor

merge failed:
Merge refused because pull request does not have label start with type-

@lucyge2022 lucyge2022 added the type-feature This issue is a feature request label Aug 1, 2023
@lucyge2022
Copy link
Contributor Author

alluxio-bot, merge this please

@alluxio-bot alluxio-bot merged commit 20eac74 into Alluxio:main Aug 1, 2023
11 checks passed
lucyge2022 added a commit to lucyge2022/alluxio that referenced this pull request Aug 9, 2023
//
public static final PropertyKey ALLUXIO_CLUSTER_NAME =
stringBuilder(Name.ALLUXIO_CLUSTER_NAME)
.setDefaultValue("DefaultAlluxioCluster").build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why aren't all the builder fields being set? specifically

  • for ALLUXIO_CLUSTER_NAME: setDescription, setScope, setConsistencyCheckLevel
  • for WORKER_MEMBERSHIP_ETCD_ENDPOINTS: setScope, setConsistencyCheckLevel

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i think i missed it, let me add it, is it causing any problems now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not yet but... 🤷

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-feature This issue is a feature request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants