Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

leader election framework for performance metric collection #589

Merged
merged 1 commit into from
Jun 22, 2021

Conversation

AmitRoushan
Copy link

@AmitRoushan AmitRoushan commented Jun 3, 2021

What this PR does / why we need it:
The PR introduce leased based leader election for metric job scheduling in distributed performance metric collection framework

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #
fixes #560

Special notes for your reviewer:

Release note:

1. Added changes in Service framework for hosting multiple services
2. Added lease based leader election with redis as backend

@AmitRoushan AmitRoushan force-pushed the dis-perf-framework branch from b6df07d to f90dc54 Compare June 3, 2021 14:50
@codecov
Copy link

codecov bot commented Jun 3, 2021

Codecov Report

Merging #589 (3155865) into master (ce9c853) will decrease coverage by 0.91%.
The diff coverage is 41.93%.

❗ Current head 3155865 differs from pull request most recent head e0ec7d0. Consider uploading reports for the commit e0ec7d0 to get more accurate results

@@            Coverage Diff             @@
##           master     #589      +/-   ##
==========================================
- Coverage   72.10%   71.18%   -0.92%     
==========================================
  Files         141      144       +3     
  Lines       12814    12794      -20     
  Branches     1535     1528       -7     
==========================================
- Hits         9240     9108     -132     
- Misses       3050     3167     +117     
+ Partials      524      519       -5     
Impacted Files Coverage Δ
delfin/cmd/task.py 0.00% <0.00%> (ø)
delfin/leader_election/tooz/leader_elector.py 24.07% <24.07%> (ø)
delfin/coordination.py 63.93% <24.44%> (-23.08%) ⬇️
delfin/service.py 27.06% <32.25%> (+1.02%) ⬆️
delfin/task_manager/scheduler/schedule_manager.py 55.55% <45.94%> (-4.45%) ⬇️
delfin/leader_election/factory.py 46.66% <46.66%> (ø)
...er/scheduler/schedulers/telemetry/telemetry_job.py 77.27% <58.06%> (-15.04%) ⬇️
delfin/leader_election/interface.py 59.37% <59.37%> (ø)
...duler/schedulers/telemetry/failed_telemetry_job.py 79.74% <61.90%> (-11.03%) ⬇️
delfin/leader_election/tooz/callback.py 66.66% <66.66%> (ø)
... and 15 more

@AmitRoushan AmitRoushan force-pushed the dis-perf-framework branch 6 times, most recently from 678091c to cf34f14 Compare June 14, 2021 01:41
@AmitRoushan AmitRoushan changed the title [WIP] leader election framework for performance metric collection leader election framework for performance metric collection Jun 14, 2021

scheduler_mgr = SchedulerManager()

if plugin == "tooz":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plugin name can be a configurable name, so that it is easy to switch from one plugin type to other in future based on configuration..

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its already a configurable parameter with name leader_election_plugin.


while not self._stop.is_set():
with timeutils.StopWatch() as w:
LOG.info("sending heartbeats for leader election")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This log can be DEBUG leval

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logged as debug as suggested

# adjust time for leader
has_to_sleep_for = has_to_sleep_for / 2

LOG.info('resting after leader watch as leader=%(leader)s '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Log.debug ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logged as debug as suggested

self.group = None

def start(self):
"""Connect to coordination back end."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make it switchable coordination back end based on config, current implimentaion can be only for redis with toox

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created an issue to address the issue as future improvement #610

class LeaderElectionFactory:

@staticmethod
def construct_elector(plugin, leader_key=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we give a docstring here to describe what the function does?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added docstring. Thanks

def set_leader_callback(self, *args, **kwargs):
self.leader = True

def cleanup(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we give a docstring here to describe what the function does?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleanup is abstract method of LeaderElector interface. The docstring is already specified there.

@AmitRoushan AmitRoushan force-pushed the dis-perf-framework branch 2 times, most recently from 7994321 to c012b7b Compare June 17, 2021 02:18
@AmitRoushan AmitRoushan requested a review from NajmudheenCT June 17, 2021 02:21
@AmitRoushan
Copy link
Author

AmitRoushan commented Jun 17, 2021

@@ -80,7 +82,7 @@ def wait_random(low, high):
def _wait(f, *a, **k):
rd = random.randint(0, 100)
secs = low + (high - low) * rd / 100
time.sleep(secs)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the difference between time.sleep and greenthread.sleep

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

greenthread.sleep Yield control to another eligible coroutine until at least seconds have elapsed but with time.sleep call do not do same. It eventually keep multiple coroutine busy when we test performance metric collection with load test (register more than 100 storage).

# check if group exist
groups.index(group)
except Exception:
# create a group if not exist
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we put a debug message here? or for that sake there should be a debug message in any exception block, IMO

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added debug message. Thanks

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment starts with uppercase letter

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated comment as per suggestion

@kumarashit
Copy link
Collaborator

Changes looks good

key = leader_election_key.encode('ascii')
super(Elector, self).__init__(callbacks, key)

self._coordinator = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason, why are we using _ before some variables?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python takes "_"ed variable as private variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the intention. But with single underscore(_), privatization can not be supported in python. If really you want, then either __(double undrscore) or getter(), setter() way required.
Anyway, otherthings looks good to me. Thanks

kumarashit
kumarashit previously approved these changes Jun 18, 2021
Copy link
Collaborator

@kumarashit kumarashit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@PravinRanjan10
Copy link
Contributor

LGTM

NajmudheenCT
NajmudheenCT previously approved these changes Jun 18, 2021
Copy link
Member

@NajmudheenCT NajmudheenCT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@AmitRoushan AmitRoushan dismissed stale reviews from NajmudheenCT and kumarashit via 842b499 June 21, 2021 01:55
@AmitRoushan AmitRoushan force-pushed the dis-perf-framework branch 4 times, most recently from 38f5aa5 to afc6212 Compare June 21, 2021 02:32
@@ -521,7 +523,7 @@ def get_capabilities(context):
"description": "Average time taken for an IO "
"operation in ms"
},
"requests": {
"iops": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes are merged using #609 you may rebase

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebased. Thanks

@AmitRoushan
Copy link
Author

@ThisIsClark Please take a look.


self.group = group

def join_group(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw there is create group and join group, do we need remove a node from a group when a node is down

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we need not remove node for leader election once node is down. But for cleanup reasons we need to remove the node from the group.
Its done in code

Comment on lines 2 to 4
# Copyright 2010 United States Government as represented by the
# Administrator of the National Aeronautics and Space Administration.
# Copyright 2011 Justin Santa Barbara
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to add copyright of these

Copy link
Author

@AmitRoushan AmitRoushan Jun 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copyright string is copied from task cmd

I hope i got your question right.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got that, but task cmd has some other reason to have that copyright, so for these files in your PR, you could refer to ssl_utils.py to set the copyright

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated copyright string as per suggestion. Thanks

@AmitRoushan AmitRoushan requested a review from ThisIsClark June 21, 2021 09:20
@AmitRoushan
Copy link
Author

AmitRoushan commented Jun 21, 2021

@ThisIsClark Addressed your comments. PTAL.

NajmudheenCT
NajmudheenCT previously approved these changes Jun 21, 2021
Copy link
Member

@NajmudheenCT NajmudheenCT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@AmitRoushan AmitRoushan requested a review from NajmudheenCT June 21, 2021 09:58
ThisIsClark
ThisIsClark previously approved these changes Jun 21, 2021
Copy link
Collaborator

@ThisIsClark ThisIsClark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@wisererik wisererik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please follow the python comment style

self.started = True

def ensure_group(self, group):
# request groups
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove comments here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated as per comment

groups = req.get()
try:
# check if group exist
groups.index(group)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment starts with uppercase letter

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated as per comment

# check if group exist
groups.index(group)
except Exception:
# create a group if not exist
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the comment

# check if group exist
groups.index(group)
except Exception:
# create a group if not exist
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment starts with uppercase letter

self.group = group

def join_group(self):
# Join a group
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remvoe comments here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment removed. Thanks

self._stop.clear()

self._coordinator = LeaderElectionCoordinator()
# start leader coordinator
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove comment here

self._coordinator.start()

self._coordinator.ensure_group(self.election_key)
# join coordinator
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove comments here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment removed

# join coordinator
self._coordinator.join_group()

# register callback for elected leader
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove comment here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment removed

callbacks.on_started_leading)

# register internal callback to notify being a leader
self._coordinator. \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove comments here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is special case for internal callback to notify current coordinator is selected as leader.
IMO the comment is required for maintainability.

" happen (which will not end well).", ran_for,
ran_for - wait_until_next_beat)

# check if coordinator is still a leader
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment starts with uppercase letter

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the comment

@AmitRoushan AmitRoushan dismissed stale reviews from ThisIsClark and NajmudheenCT via 9804a91 June 22, 2021 09:38
@AmitRoushan AmitRoushan requested a review from wisererik June 22, 2021 09:46
Copy link
Collaborator

@wisererik wisererik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@NajmudheenCT NajmudheenCT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@NajmudheenCT NajmudheenCT merged commit d4a4a25 into sodafoundation:master Jun 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Performance Metric collection Framework: Distributed scheduling for multi node task management
7 participants