Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Etcd membership module #17736

Merged
merged 62 commits into from
Aug 1, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
62 commits
Select commit Hold shift + click to select a range
a6829a0
etcd membership WIP
lucyge2022 May 23, 2023
9647bd6
move class
lucyge2022 May 31, 2023
8779d7f
WIP - service discovery changes
lucyge2022 Jun 1, 2023
38e2f53
stash changes locally
lucyge2022 Jun 2, 2023
f158b51
WIP - integrate membership into worker process
lucyge2022 Jun 19, 2023
0565af5
more WIP fixes
lucyge2022 Jun 22, 2023
9b5c2ad
bug fixes + get rid of blockingly wait for master config load to star…
lucyge2022 Jun 26, 2023
8bc5f54
1. refactor for etcd client util classes and membership managers to a…
lucyge2022 Jun 27, 2023
696ff87
1. add membership worker provider for scheduler 2. fixes for static m…
lucyge2022 Jun 28, 2023
cf48792
1. fix string format msg for etcd calls
lucyge2022 Jun 28, 2023
3bd1545
1. fix getlivemembers/getfailedmembers bug
lucyge2022 Jul 1, 2023
0702239
add tests for static membership
lucyge2022 Jul 3, 2023
4db1c5d
clean up - WIP
lucyge2022 Jul 4, 2023
d0b129c
1. fix reconnect logic to create new lease if lease expired bcos any …
lucyge2022 Jul 5, 2023
a5462e7
remove unnecessary lines from rebasing
lucyge2022 Jul 5, 2023
07e63ca
remove unwanted changes
lucyge2022 Jul 5, 2023
6f91e3a
remove unwanted file
lucyge2022 Jul 6, 2023
f66aea0
have a no-op member mgr as default
lucyge2022 Jul 6, 2023
649f1af
modify tests / add file
lucyge2022 Jul 6, 2023
15eec12
fixes for default membershiptype
lucyge2022 Jul 6, 2023
5a98697
add blockmasterclient.connect to make master web service available
lucyge2022 Jul 7, 2023
9bb2752
Update dora/core/common/src/main/java/alluxio/Constants.java
lucyge2022 Jul 10, 2023
74c2fe7
review comment [1]
lucyge2022 Jul 11, 2023
aeea556
review comment [2]
lucyge2022 Jul 12, 2023
3e742a3
make AlluxioEtcdClient apis uniformly throws IOException after retries
lucyge2022 Jul 12, 2023
cd5f6fb
add etcd linux service registration files
lucyge2022 Jul 12, 2023
efe5eb7
license header + compilation fixes
lucyge2022 Jul 12, 2023
6461040
compile + missing license
lucyge2022 Jul 13, 2023
c2a5ffb
WIP - modification to make tests work
lucyge2022 Jul 18, 2023
7212b41
revert java11 syntax back to java8 compatible ones
lucyge2022 Jul 18, 2023
f15a6bc
env changes to be able to run within docker container
Xenorith Jul 18, 2023
29c3ca5
make fsadmin also capable to run in interactive mode
lucyge2022 Jul 20, 2023
348f0da
Revert "env changes to be able to run within docker container"
lucyge2022 Jul 20, 2023
c03708a
1. spotbugs fixes
lucyge2022 Jul 21, 2023
d24bc0d
remove interactive option, add nodestatus cmd in fsadmin report
lucyge2022 Jul 24, 2023
918c078
revert all testcontainer test changes
lucyge2022 Jul 24, 2023
9e3b125
review comments / checkstyle
lucyge2022 Jul 24, 2023
9fe70f0
not enable membership mgr by default
lucyge2022 Jul 24, 2023
11e7ca7
fix filesystemctx to use original getallworkers as default
lucyge2022 Jul 24, 2023
ab2d2d2
more checkstyle fixes
lucyge2022 Jul 24, 2023
925bb78
more review comments
lucyge2022 Jul 24, 2023
6df3eec
review comments
lucyge2022 Jul 25, 2023
3bc5003
address review comment
lucyge2022 Jul 25, 2023
1e9d688
avoid string concatenation
lucyge2022 Jul 25, 2023
80a5879
make sure string.format is used only for one-time codepath and use St…
lucyge2022 Jul 26, 2023
0816af0
add more elaboration in etcd.conf and remove unused file
lucyge2022 Jul 26, 2023
b5f9614
more review comments / comments
lucyge2022 Jul 26, 2023
4a32caf
missing license header
lucyge2022 Jul 26, 2023
4a5b49e
1. PropertyKey field name changes
lucyge2022 Jul 26, 2023
fe0bf2f
method name change
lucyge2022 Jul 26, 2023
a26bf2e
review comments
lucyge2022 Jul 27, 2023
ef18bc5
singleton logic change
lucyge2022 Jul 27, 2023
3bc06b2
move static fields togehter
lucyge2022 Jul 27, 2023
1078c16
review comments
lucyge2022 Jul 27, 2023
c87bce8
1. ServiceDiscoveryRecipe: fix atomicity update guarantee, more elabo…
lucyge2022 Jul 31, 2023
7ffe87f
checkstyle fix
lucyge2022 Jul 31, 2023
cdc733b
findbugs fix
lucyge2022 Jul 31, 2023
66be616
remove unwanted checkins due to rebasing
lucyge2022 Aug 1, 2023
3224225
close resource properly & more review comments
lucyge2022 Aug 1, 2023
7da6f45
ignore tests no longer needed
lucyge2022 Aug 1, 2023
e0327ad
checkstyle
lucyge2022 Aug 1, 2023
28c04c1
comment out tests which needs confirmation of worker registration thr…
lucyge2022 Aug 1, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
69 changes: 69 additions & 0 deletions conf/etcd/etcd.conf.template
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
# This is the configuration file to start a etcd instance
# e.g. /usr/local/bin/etcd --config-file /etc/etcd/etcd.conf
# *******README******
# To make etcd a linux service:
# After installation of etcd, make sure etcd and etcdctl
# are available in /usr/local/bin
# To make etcd a linux service:
# Copy alluxio/conf/etcd/etcd.service.template to /etc/systemd/system/etcd.service
# Copy alluxio/conf/etcd/etcd.conf.template to /etc/etcd/etcd.conf
# For each etcd instance, change the config params in etcd.conf
# accordingly.
# And do:
# #systemctl daemon-reload
# Then etcd could be registered as a linux service
# e.g.
# Check status
# #service etcd status
# Start etcd
# #service etcd start
# Stop etcd
# #service etcd stop


# Human-readable name for this member.
#name: 'etcd1'

# Path to the data directory.
data-dir: /etcd-data-dir/data

# Path to the dedicated wal directory.
wal-dir: /etcd-data-dir/wal


# List of comma separated URLs to listen on for peer traffic.
#give ip/hostname of this etcd instance
listen-peer-urls: http://<hostname_or_ip>:2380

# List of comma separated URLs to listen on for client traffic.
#give ip/hostname of this etcd instance
listen-client-urls: http://<hostname_or_ip>:2379,http://127.0.0.1:2379

# List of this member's peer URLs to advertise to the rest of the cluster.
# The URLs needed to be a comma-separated list.
#give ip/hostname of this etcd instance for remote etcd members communication
initial-advertise-peer-urls: http://<hostname_or_ip>:2380

# List of this member's client URLs to advertise to the public.
# The URLs needed to be a comma-separated list.
#give ip/hostname of this etcd instance for etcd client communication
advertise-client-urls: http://<hostname_or_ip>:2379

# Initial cluster configuration for bootstrapping.
#give all ip/hostnames of members of initial etcd cluster
initial-cluster: etcd0=http://<hostname_or_ip>:2380,etcd1=http://<hostname_or_ip>:2380,etcd2=http://<hostname_or_ip>:2380

# Initial cluster token for the etcd cluster during bootstrap.
#initial-cluster-token: 'etcd-cluster-1'

# Initial cluster state ('new' or 'existing').
initial-cluster-state: 'new'

# Enable debug-level logging for etcd.
#log-level: debug

#logger: zap

# Specify 'stdout' or 'stderr' to skip journald logging even when running under systemd.
# log-outputs: [stderr]

11 changes: 11 additions & 0 deletions conf/etcd/etcd.service.template
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
[Unit]
Description=Etcd Service

[Service]
ExecStart=/usr/local/bin/etcd --config-file /etc/etcd/etcd.conf
KillSignal=SIGTERM
StandardOutput=append:/var/log/etcd.log
StandardError=append:/var/log/etcd.err
Comment on lines +7 to +8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do these two default locations work on Mac?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no this file is for setting up a linux service of etcd

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mac has /var/log but that belongs to root. I figure it's worth mentioning in a short comment, that one can use this path on Mac but may need sudo. I figure the template file is most useful when one want to test and experience a feature, and that test may very likely happen on Mac instead of linux.


[Install]
WantedBy=default.target
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,8 @@
import alluxio.grpc.GrpcServerAddress;
import alluxio.master.MasterClientContext;
import alluxio.master.MasterInquireClient;
import alluxio.membership.MembershipManager;
import alluxio.membership.NoOpMembershipManager;
import alluxio.metrics.MetricsSystem;
import alluxio.network.netty.NettyChannelPool;
import alluxio.network.netty.NettyClient;
Expand Down Expand Up @@ -154,6 +156,8 @@ public class FileSystemContext implements Closeable {
*/
private volatile ConcurrentHashMap<ClientPoolKey, BlockWorkerClientPool>
mBlockWorkerClientPoolMap;
@Nullable
private MembershipManager mMembershipManager;
apc999 marked this conversation as resolved.
Show resolved Hide resolved

/**
* Indicates whether the {@link #mLocalWorker} field has been lazily initialized yet.
Expand Down Expand Up @@ -443,6 +447,11 @@ protected synchronized void initContext(ClientContext ctx,
mBlockMasterClientPool = new BlockMasterClientPool(mMasterClientContext);
mBlockWorkerClientPoolMap = new ConcurrentHashMap<>();
mUriValidationEnabled = ctx.getUriValidationEnabled();
try {
mMembershipManager = MembershipManager.Factory.create(getClusterConf());
} catch (IOException ex) {
LOG.error("Failed to set membership manager.", ex);
apc999 marked this conversation as resolved.
Show resolved Hide resolved
}
}

/**
Expand Down Expand Up @@ -490,6 +499,12 @@ private synchronized void closeContext() throws IOException {
if (mMetricsEnabled) {
MetricsHeartbeatContext.removeHeartbeat(getClientContext());
}
LOG.debug("Closing membership manager.");
try (AutoCloseable ignoredCloser = mMembershipManager) {
// do nothing as we are closing
} catch (Exception e) {
throw new IOException(e);
}
Comment on lines +503 to +507
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this necessary since there is nothing inside the try-catch block?
how is this different from simply calling mMembershipManager.close()?

} else {
LOG.warn("Attempted to close FileSystemContext which has already been closed or not "
+ "initialized.");
Expand Down Expand Up @@ -864,6 +879,17 @@ public List<BlockWorkerInfo> getCachedWorkers() throws IOException {
* @return the info of all block workers
*/
protected List<BlockWorkerInfo> getAllWorkers() throws IOException {
// TODO(lucy) once ConfigHashSync reinit is gotten rid of, will remove the blockReinit
// guard altogether
try (ReinitBlockerResource r = blockReinit()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I have reasons to believe this blockReinit() gives you thread safety against reinit(). But my gut feeling is we should:

  1. Either remove ConfigHashSync together with reinit()
  2. Or add a UT on thread safety, in a follow-up PR

Could you add a TODO on this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will address this as part of subsequent changes. will add TODO to notify

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[ADDRESS IN NEXT PR]

// Use membership mgr
if (mMembershipManager != null && !(mMembershipManager instanceof NoOpMembershipManager)) {
return mMembershipManager.getAllMembers().stream()
.map(w -> new BlockWorkerInfo(w.getAddress(), w.getCapacityBytes(), w.getUsedBytes()))
.collect(toList());
}
}
// Fall back to old way
try (CloseableResource<BlockMasterClient> masterClientResource =
acquireBlockMasterClientResource()) {
return masterClientResource.get().getWorkerInfoList().stream()
Expand Down
4 changes: 4 additions & 0 deletions dora/core/common/pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -87,6 +87,10 @@
<groupId>io.dropwizard.metrics</groupId>
<artifactId>metrics-jvm</artifactId>
</dependency>
<dependency>
<groupId>io.etcd</groupId>
<artifactId>jetcd-core</artifactId>
</dependency>
<dependency>
<groupId>io.grpc</groupId>
<artifactId>grpc-core</artifactId>
Expand Down
41 changes: 41 additions & 0 deletions dora/core/common/src/main/java/alluxio/conf/PropertyKey.java
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,7 @@
import alluxio.master.metastore.MetastoreType;
import alluxio.master.metastore.rocks.DataBlockIndexType;
import alluxio.master.metastore.rocks.IndexType;
import alluxio.membership.MembershipType;
import alluxio.network.ChannelType;
import alluxio.network.netty.FileTransferType;
import alluxio.security.authentication.AuthType;
Expand Down Expand Up @@ -5505,6 +5506,25 @@ public String toString() {
.setConsistencyCheckLevel(ConsistencyCheckLevel.WARN)
.setScope(Scope.WORKER)
.build();
public static final PropertyKey WORKER_MEMBERSHIP_MANAGER_TYPE =
enumBuilder(Name.WORKER_MEMBERSHIP_MANAGER_TYPE, MembershipType.class)
.setDefaultValue(MembershipType.NOOP.name())
.setDescription("Type of membership manager used for workers."
+ "Choose STATIC for pre-configured members."
+ "Choose ETCD for using etcd for membership management"
+ "Default is NOOP which does not enable membership manager at all")
.setConsistencyCheckLevel(ConsistencyCheckLevel.ENFORCE)
.setScope(Scope.ALL)
.build();
public static final PropertyKey WORKER_STATIC_MEMBERSHIP_MANAGER_CONFIG_FILE =
stringBuilder(Name.WORKER_STATIC_MEMBERSHIP_MANAGER_CONFIG_FILE)
.setDefaultValue(format("${%s}/workers", Name.CONF_DIR))
.setDescription("Absolute path of the config file for list"
+ "of worker hostnames/IPs for the cluster. "
+ Name.WORKER_MEMBERSHIP_MANAGER_TYPE + " needs to be set"
+ " to STATIC first.")
.setScope(Scope.ALL)
.build();

//
// Proxy related properties
Expand Down Expand Up @@ -7626,6 +7646,19 @@ public String toString() {
stringBuilder(Name.ZOOKEEPER_JOB_LEADER_PATH)
.setDefaultValue("/alluxio/job_leader").build();

//
// Membership related properties
//
public static final PropertyKey ALLUXIO_CLUSTER_NAME =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public static final PropertyKey ALLUXIO_CLUSTER_NAME =
public static final PropertyKey WORKER_MEMBERSHIP_ETCD_CLUSTER_NAME =

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No actually this IS the name of the alluxio cluster for etcd to manage, purpose is for etcd if it is managing multiple alluxio clusters

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reason is, this property key is only used when WORKER_MEMBERSHIP_TYPE == ETCD, so we should use prefixing in the property key name. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[ADDRESS IN NEXT PR]

stringBuilder(Name.ALLUXIO_CLUSTER_NAME)
.setDefaultValue("DefaultAlluxioCluster").build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use hyphen instead of camel case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why aren't all the builder fields being set? specifically

  • for ALLUXIO_CLUSTER_NAME: setDescription, setScope, setConsistencyCheckLevel
  • for WORKER_MEMBERSHIP_ETCD_ENDPOINTS: setScope, setConsistencyCheckLevel

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i think i missed it, let me add it, is it causing any problems now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not yet but... 🤷

public static final PropertyKey ETCD_ENDPOINTS =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public static final PropertyKey ETCD_ENDPOINTS =
public static final PropertyKey WORKER_MEMBERSHIP_ETCD_ENDPOINTS =

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider in future it might be used for scope outside of worker membership, hence the name

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to stay with a smaller scope (by prefixing in the property key name). And later if you have a larger scope and introduce ETCD_ENDPOINTS, we can also define WORKER_MEMBERSHIP_ETCD_ENDPOINTS = ${ETCD_ENDPOINTS} by default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please address this also in the next PR @lucyge2022

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[ADDRESS IN NEXT PR]

listBuilder(Name.ETCD_ENDPOINTS)
.setDescription("A list of comma-separated http://host:port addresses of "
+ "etcd cluster (e.g. http://localhost:2379,http://etcd1:2379)")
.setScope(Scope.ALL)
.build();

//
// JVM Monitor related properties
//
Expand Down Expand Up @@ -8993,6 +9026,10 @@ public static final class Name {
public static final String WORKER_UFS_INSTREAM_CACHE_MAX_SIZE =
"alluxio.worker.ufs.instream.cache.max.size";
public static final String WORKER_WHITELIST = "alluxio.worker.whitelist";
public static final String WORKER_MEMBERSHIP_MANAGER_TYPE =
"alluxio.worker.membership.manager.type";
public static final String WORKER_STATIC_MEMBERSHIP_MANAGER_CONFIG_FILE =
"alluxio.worker.static.membership.manager.config.file";

//
// Proxy related properties
Expand Down Expand Up @@ -9480,6 +9517,10 @@ public static final class Name {
public static final String ZOOKEEPER_JOB_ELECTION_PATH = "alluxio.zookeeper.job.election.path";
public static final String ZOOKEEPER_JOB_LEADER_PATH = "alluxio.zookeeper.job.leader.path";

// Membership related properties
public static final String ALLUXIO_CLUSTER_NAME = "alluxio.cluster.name";
public static final String ETCD_ENDPOINTS = "alluxio.etcd.endpoints";

//
// JVM Monitor related properties
//
Expand Down
Loading
Loading