-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
broker: new format for [bootstrap] configuration #2578
Conversation
Problem: flux start --size=N where N>1 warns unhelpfully: "warning: setting --bootstrap=selfpmi due to --size option". Drop the warning message.
While writing tests I discovered that libtomlc99 doesn't allow heterogeneous type elements in arrays, so the original example I gave above of mixed string and table I updated the example above to reflect this limitation. Maybe the extra verbosity will be tolerable when we have some approximation of hostlists? |
Yeah, I think it the extra verbosity is perfectly acceptable (perhaps even preferable). |
I don't have the code in front of me but maybe I should post this interface for review, for makeshift "hostset" support (for taking them apart, not constructing them): /* Parse the first bracketed idset embedded in string 's'.
*/
struct idset * idset_parse_embedded (const char *s);
/* Replace the first bracketed idset in 'fmt' with 'id', placing the result in 'buf'
*/
int idset_format (char *buf, size_t *bufsz, const char *fmt, unsigned int id); So basically this would let you expand something like This seems like the bare minimum needed to support a hostset expression in place of a host in the above config. It falls far short of what we're used to in hostlist.c, especially doesn't offer the reverse: take a list of hosts and make a hostset for you. But also, the list is always sorted |
The interface seems fine to me. My first thought it that most use cases of the low-level functions above would immediately want to turn a "hostset" string into a list over which they could iterate, but how that looks or where that function lives will probably fall out of future use of the config. Keeping the libidset functions low-level seems fine to me. It did occur to me that you could also simplify the parsing of the config by allowing the hostnames to be split into different fields, e.g. hosts = [
{
prefix="fluke",
ids = "0",
suffix="",
},
] This would save even more config space because default host prefix and suffix could be set: default_prefix = "fluke"
hosts = [
{ ids = "[0-99]" }
] Maybe Then you can avoid an extra step of parsing what you get out of your toml parser. (Just an idea to throw out there) |
Also I assume Another thought, if you extend the api (or maybe add separate functions) to allow the nth index to be parsed/replaced, then libidset could parse hostset strings with multiple idsets embedded, e.g. |
Right, I was thinking iteration would consist of It would be pretty easy to add an "index" argument to choose the nth idset in a string with multiple if we felt that was likely to be needed. |
Removing WIP from the description as I think this may be ready for another peek. [Er, I thought I had written something earlier about the changes just pushed but I'm not seeing it now so maybe I didn't press the button before leaving yesterday?] To handle an arbitrary number of embedded idsets, I went with a "map" pattern where a user supplied function is called for each formatted string. The function recurses for each level of idset and can use the Unrelated commits tacked on here
|
Any reason why this function can't be called just (I actually think the idea of
As I tried to say above, I think any users of the config will want to just get a list of hosts (er, set of hosts) from the config, not have to parse the idset and immediately turn it into a list with a call to map, strcpy, list_append(), etc. My guess would be if we don't have this higher level interface at first, then the second person to implement it will create it ;-) (maybe not a bad thing, though) I would perhaps not spend too much time on the hostset interface though. We had a |
typo in 1ffadf2: |
Were you thinking we need a generic way to iterate over "hostsets" or whatever, or something specific for the I did write up a throwaway class for a struct hostset *hostset_decode (const char *s);
void hostset_destroy (struct hostset *hostset);
const char *hostset_first (struct hostset *hostset);
const char *hostset_next (struct hostset *hostset);
size_t hostset_count (struct hostset *hostset); |
Yeah, I had assumed there would be more than just a single use case (eventually). I'm fine with waiting to try to make the higher level abstraction until other use cases show up. In testing this code I noticed (realized?) that idsets can be specified out-of-order. I'm not sure if it is a problem, but since they are being used in this particular case to specify an ordered array it could be confusing that |
Well that's a good point. It would be worrisome if, for example, "service nodes" are spaced out numerically, like Maybe we should state that the hosts array order is ignored (not rank order), and add other config to express the TBON topology, or even the "host role" in terms of hosts entries? Example: overlay-levels = [
[ "fluke0" ],
[ "fluke[64,128]" ],
[ "fluke[1-63,65-127,129-191]" ],
] hmmm. On multiple users, maybe we'll need to capture the mapping of hosts to broker ranks somewhere else for general consumption because this only works for a "configured" instance. With PMI, we might need to collect that info. |
Or we could just think of this PR as solving #2572, with more config to come (probably) as we require more configurability for the system instance TBON? Meanwhile maybe a flux_config_bootstrap(7) man page makes sense so we have a place to document the semantics of the hosts array? Thanks for catching the commit typo. Would |
I like your idea of separately specifying the TBON topology.
Yeah, I figured that (and I think an |
Problem: when the broker is started with the 'log-filename' attribute set on the command line, the log file is opened from all broker ranks. This does not match the documentation or the code. Move initialization of logbuf->rank in logbuf_initialize() to before logbuf_register_attrs() is called. Now the test for rank 0 is only true on rank 0. Move the logbuf->h initialization too for consistency. Fixes flux-framework#2581
Problem: logbuf_initialize() calls flux_attr_set_cacheonly() on the "rank" attribute. The broker call to create_dummyattrs() right after the call to logbuf_initialize() handles this, and there is no requirement to set it before that. Drop this call.
Problem: there is no convenient way to decode an idset from a bounded substring. Add idset_ndecode() which accepts a length.
Add idset_format_map() to simplify expansion of strings that have embedded idsets. Given a string with zero or more embedded bracketed idsets, expands idsets, calling a map function for each formatted string. For example, an input string of "r[0-1]n[0-1]" would result in map function calls for "r0n0", "r0n1", "r1n0", "r1n1". Returns number of iterations, or -1 on failure with errno set. Iteration can be halted early from the map function (no error). The map function can also return -1 with errno set to abort iteration with an error.
OK forced a push with some changes so the only exposed idset function is |
I don't think anyone uses this naming convention anymore for hosts, but one other difference between idsets and hostlists is that hostlists preserve leading zeros from the range syntax: ~$ hostlist -e "foo[00-10]"
foo00,foo01,foo02,foo03,foo04,foo05,foo06,foo07,foo08,foo09,foo10 In the future, if support for this is needed perhaps Or, an alternate syntax for specifying sets of hosts could be supported as above hosts = [{
prefix = "foo"
suffix = ""
zeropad = 2
ids = "0-10"
} ] |
I hesitate to ask, but since I'm not in the same office as you ATM and you can't throw things at me -- should we just pull in |
That might be unfortunate in the long run. Let's proceed with the current direction, we could always support the slightly more verbose but more explicit syntax above, or when someone has time work on a hostlist.c replacement. (It seems likely we'll want something ordered and which can support "compress" methods eventually anyway) Edit: to be more clear, it is probably fine to pull hostlist.[ch] back in for the short term, but I think the arguments against that are 1) it has a lot of older, ad-hoc, crufty code cobbled together over time, and may not be the most maintainable thing 2) interfaces exported could be improved 3) we really do want to do a rewrite of this thing at some point, but since it seemingly does the minor thing it is supposed to, the code keeps getting dragged along. On the other hand, my one worry about extending Probably this is getting us off track from our near-term goals so I think we should leave things as is and move on. |
OK then, full speed ahead. I wasn't planning on adding anything else here, unless you have additional suggestions. I would leave the explicit overlay config to another PR. |
Oh, I did have one other thought. One unsatisfying thing is the hostname [bootstrap]
default_port = 8020
default_bind = "tcp://127.0.0.1:%p"
default_connect = "tcp://127.0.0.1:%p"
hosts = [
{ host = "localhost" }
] Any better ideas here? It seems like there needs to be one hosts entry to avoid introducing magical behavior? Allowing the host to be a glob or regex might get confusing if idsets are also supported. Any better ideas? Maybe just allow exactly |
Hm, no. I think I don't fully understand the use case. When is the "generic" config file read by the broker?
Since the hosts entry is a full JSON object, maybe a special key like On second thought, since the default address maps to localhost I do think special |
The use case was for the default installed boot.toml, so that someone could maybe Maybe that case can be handled without a config file, come to think of it. |
OK, I got rid of the "localhost" specialness, and instead now assume that if the hosts array is empty or missing, this is a size=1 instance. I went ahead and made some minor changes to the way the config boot method initializes the overlay, so we can avoid binding to, and therefore configuring, a zeromq socket on leaf nodes, where it will never be used (at least not until we have some strategy for "grow"). Also, for size=1, the single broker now doesn't need zeromq sockets at all, so a) it doesn't need curve keys, and b) it won't start the zeromq service thread. I have a follow-on PR that implements the same overlay changes for the PMI boot method, and would have tacked that on here, except that it seems to have made the subprocess exec service in the broker inexplicably sad, so I want to debug that first. Finally, made one minor tweak to the systemd unit file to set If that sounds OK, I'll squash the incremental development. |
Restarted a builder that failed valgrind with a SIGALRM. |
Found the problem - a new The hydra test failures were really getting in my way so as discussed in #1169, I went ahead and removed them here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A quick look through the PR and things look good to me. Just a couple minor comments. I also poked at things a bit and didn't find any issues.
doc/man5/flux-config-bootstrap.adoc
Outdated
such as when being launched by systemd. | ||
|
||
The default bootstrap mode is PMI. The Flux systemd unit file forces the | ||
broker to use the config method by specifying `--setattr=boot-method=config` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo? boot-method
should be boot.method
?
t/t0001-basic.t
Outdated
echo $ATTR_VAL | grep "^ipc" && | ||
! echo $ATTR_VAL | grep "%B" | ||
' | ||
test_expect_success 'tbon.endpoint fails on bad endpoint' ' | ||
! flux start -o,--setattr=tbon.endpoint='foo://bar' flux getattr tbon.endpoint | ||
! flux start -s2 -o,--setattr=tbon.endpoint='foo://bar' flux getattr tbon.endpoint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you are editing these tests anyway, mind switching from !
to test_must_fail
? See docs for test_must_fail
for a description of why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do. Actually that last one fails under test_must_fail
because rank 1 broker has to be killed by flux-start
. I'll add a comment for housekeepers of the future.
Nice test coverage as well! I did notice from the coverage report that if you want a small bump in coverage, some of the |
Change the [bootstrap] configuration to allow bind URI, and connect URI to be expressed independently, and to determine the rank by matching the hostname, not the local interface addresses. Rename the rank-ordered array from 'tbon-endpoints' to 'hosts' for clarity. A hosts entry is a table containing the keys 'host' (required), 'bind' (optional), and 'connect' (optional). The keys 'default_port', 'default_bind', and 'default_connect' may be defined. The default bind and connect values are used when those keys are not explicitly set in the hosts entry. The 'default_port' value is substituted for the token "%p" if it occurs in bind or connect strings. The 'host' value in a hosts entry entry is substituted for the token "%h" if it occurs in bind or connect strings. The 'host' key may included embedded bracketed idsets. Example 1024 node cluster config: [bootstrap] default_port = 8050 default_bind = "tcp://en0:%p" default_connect = "tcp://e%h:%p" hosts = [ { host="fluke0", bind="tcp://en4:9001", connect="tcp://fluke-mgmt:9001" }, { host = "fluke[1-1023]" }, ] In the example, the bind address is "tcp:/en4:9001" on rank 0 and "tcp://en0:8050" on ranks 1-1023. The connect address for rank 0 is "tcp://fluke-mgmt:9001"; "tcp://efluke1:8050" for rank 1; "tcp://efluke2:8050" for rank 2, etc.. Update existing sharness test. Update installed config file. Fixes flux-framework#2572
Avoid triggering the loading of curve keys and/or starting of the zeromq service thread if the overlay is not going to be used.
Problem: the broker sets the local-uri attribute to local://${rundir}/${rank}/local, but if FLUX_URI is unset, flux_open() tries to open local://${rundir}/local. Set local-uri on the command line in the systemd unit file, to match the flux_open() default.
Problem: instance-level test tries to do too many things in one test and is hard to debug. Spit the test into several subtests, and write output to the file system for post-mortem.
Problem: tests expect tbon.endpoint to be set in a size=1 instance when it need not be. Check this attribute in rank 0 of a size=2 instance, and relax expectations of content. Also: [cleanup] use test_must_fail where appropriate.
When the bootstrap method is PMI, don't bind to a zeromq socket if there are no downstream peers. Change broker_response_sendmsg() to silently discard a response if it has exhausted all other routing option and tries to send it downstream. This is zeromq's default behavior for unroutable messages sent on a ROUTER socket anyway (drop).
Problem: hydra tests fail occasionally As noted in flux-framework#1169, some versions of hydra might have a problem capturing stdio, which makes these tests unreliable. Drop the tests that are just verifying hydra's PMI behavior, and make the ones that remain not dependent on stdio. Fixes flux-framework#1169
Thanks for the helpful comments! Just forced a push with suggested fixes, incremental development squashed, an attempt at more coverage, and a few minor commit message cleanups. |
Problem: t2601-job-shell.t (test 8 - PMI cliques are correct for 2 ppn) just failed in travis with no useful output. This test, and a few tests surrounding it, redirect the stderr of flux job attach to a file, then ignore the file. Allow flux job attach's stderr to be included in the verbose test output
Hmm, had one builder fail in
There would be output in the logs for the test_cmp if that was what failed. Tests before and after are OK. For some reason the test redirects the stderr of |
Codecov Report
@@ Coverage Diff @@
## master #2578 +/- ##
==========================================
+ Coverage 81.45% 81.47% +0.02%
==========================================
Files 247 248 +1
Lines 38306 38438 +132
==========================================
+ Hits 31201 31319 +118
- Misses 7105 7119 +14
|
Whee, OK this is ready IMHO. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, LGTM!
Thanks! |
As discussed in #2572, this PR changes the
[bootstrap]
config file for (hopefully) improved clarity and expressiveness. Example:Here the bind address is
tcp://en4:9001
on rank 0 andtcp://en0:8050
on ranks 1-3. The connect address for rank 0 istcp://fluke-mgmt:9001
,tcp://efluke1:8050
for rank 1,tcp://efluke2:8050
for rank 2, etc..Note: the rank is determined by looking up the hostname in the hosts array, instead of matching IP addresses to local interfaces. The size of the instance is the length of the array. I dropped the ability to explicitly set the rank and size in the config file. That did not seem terribly useful since setting the rank means the config file must be unique for each broker, and the size can be easily inferred if the hosts array is fully specified.