new operational mode - percent with CPU #3351

Open
wants to merge 7 commits into
base: master

Conversation

spacetourist
Contributor

Summary
Implements a further load_balancer module strategy for distributing calls more evenly when dealing with high request volumes.

Details/Solution
The change caches the heartbeat data into the module and performs the following calculation for each request:
( 100 - ( 100 * ( current_sessions + sessions_since_last_heartbeat ) / max_sessions ) ) * CPU idle factor

This disregards the dialog profile counts and allocates simply based on the last known call stats and any changes that have been made locally. The intention is to distribute calls to the last known least loaded server whilst not overloading a single system given the latency of the heartbeat data. AFAIK the minimum heartbeat interval on both sides is 1s. For a system handling hundreds of calls per second to shared destinations this aims to balance the individual routing decisions more evenly.
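As a rough illustration of the calculation described above, here is a minimal sketch. It is not the code from the PR: the function name `ie_score` and its parameters are made up for this example, and it assumes the sessions routed since the last heartbeat are added to `current_sessions` before dividing by `max_sessions`.

```c
/* Hypothetical helper, not the module's code: integrated estimation score.
 * cpu_idle_pct is the idle CPU percentage from the last heartbeat (0..100).
 * Assumption: in-flight sessions are summed with current_sessions first. */
double ie_score(unsigned int current_sessions,
                unsigned int sessions_since_last_heartbeat,
                unsigned int max_sessions,
                double cpu_idle_pct)
{
    double used_pct;

    if (max_sessions == 0)      /* no usable stats: avoid dividing by zero */
        return 0.0;

    used_pct = 100.0 * (current_sessions + sessions_since_last_heartbeat)
               / max_sessions;

    /* free capacity in percent, scaled down by how busy the CPU is */
    return (100.0 - used_pct) * (cpu_idle_pct / 100.0);
}
```

For instance, an instance at 250 of 2500 sessions with a fully idle CPU scores 90; routing another 250 calls to it before the next heartbeat lowers the score to 80 without waiting for fresh stats.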

Compatibility
This should not impact the other module features.

Closing issues
Closes #3297

Member

@liviuchircu liviuchircu left a comment


@spacetourist, I finally took some time to review both the initial issue, as well as this PR. Thanks for the rich description, as well as the data tables, they were really useful in understanding what's wrong here.

As I see it, the bigger issue here is around the "r" flag, freeswitch-enabled destinations set aside. There is something intuitively off about the 100 - 100 / MAX * 100 formula which I cannot find a solid explanation for (we should plot it!), but any relative disparity of the inputs seems to be softened or normalized by this formula, effectively bringing the outputs a lot closer to each other. For example, putting all your data in a single table:

| Final | Float | Transf-2 | Max-Load | Transf-1 | Sessions |
|-------|-------|----------|----------|----------|----------|
| 91 | 91.11 | 100 - 100 / 1125 * 100 | 1125 | .75 * (2500 - (1100 - 100)) | 1100 |
| 91 | 91.66 | 100 - 100 / 1200 * 100 | 1200 | .75 * (2500 - (1000 - 100)) | 1000 |
| 92 | 92.15 | 100 - 100 / 1275 * 100 | 1275 | .75 * (2500 - (900 - 100)) | 900 |
| 92 | 92.59 | 100 - 100 / 1350 * 100 | 1350 | .75 * (2500 - (800 - 100)) | 800 |
| 92 | 92.98 | 100 - 100 / 1425 * 100 | 1425 | .75 * (2500 - (700 - 100)) | 700 |

Like, how on earth did we obtain a 1% difference between least-loaded/most-loaded in the output, coming from a 57% difference between least-loaded/most-loaded in the inputs? The reduction was done in two steps: from 57% -> 26.6% -> 1%. What is intrinsically wrong with this formula and can we mathematically change it in order to obtain better weights?
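To make the two-step reduction easier to poke at, here is a small sketch that recomputes the table's "Transf-1" and "Transf-2" columns. The function names are invented for this example; the inputs (75% idle CPU, 2500 max sessions, profile size 100) are read off the table rows.

```c
/* Step 1 ("Transf-1"): pseudo max_load derived from the heartbeat stats. */
double fs_max_load(double cpu_idle_pct, double max_sess, double sess, double psz)
{
    return (cpu_idle_pct / 100.0) * (max_sess - (sess - psz));
}

/* Step 2 ("Transf-2"): relative weight from the profile size and max_load. */
double relative_weight(double psz, double max_load)
{
    if (max_load <= 0.0)        /* nothing left to give: lowest weight */
        return 0.0;
    return 100.0 - 100.0 * psz / max_load;
}
```

Calling `relative_weight(100, fs_max_load(75, 2500, 1100, 100))` and the same with 700 sessions reproduces the first and last rows. One thing the sketch makes visible: with a profile size of 100 and a max_load above 1000, the subtracted term `100 * psz / max_load` stays below 9 points, so every weight lands in the low 90s regardless of how far apart the session counts are.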

Now, you also sensed this problem based on your empirical evidence (why are my calls going to the more loaded FS?!) and the PERCENT_WITH_CPU approach is a two-fold improvement:

  • first, you change the computation from 2 steps into 1 step. There is no more of that "pseudo max_load" intermediary value, which helps preserve more of the original ratios.
  • secondly, you perform the .cpu_idle multiplication in the last step, after the 100 - 100 * X formula, which will also help to reflect more of the FS instance data into the final weight.

While I am 100% for merging this new "c" (CPU) flag / exclusive with "r" right away, I will leave you a fun question about the "r" mode in general and whether we should enable it in the first place: If I give you two FS instances, one running on a Raspberry Pi at 1/2 calls (50%) and another one on a super-server at 500/1000 calls (50%), would you "relatively" balance a call to any of them? Or is the situation not so relative, after all? :)

@spacetourist
Contributor Author

Morning @liviuchircu - I'm now happy with the state of this PR with no immediate plans for further changes. Having said that, I have the following ideas for future improvements to this module:

  • Synchronising event_heartbeat_interval/fetch_freeswitch_stats (details above, not sure how that could be achieved across the separate modules)
  • CPU weighting - at present the CPU idle factor is heavily weighted, yet it is only a snapshot provided by the FS heartbeat and may not be representative of the actual system load throughout the interval. A module parameter of value 0-100 (100=default) which reduces the weighting of this factor could provide a useful tunable
  • Presenting the sessions since last heartbeat in MI lb_list function

There are also a number of algorithm changes which might be worth a look in time, in particular addressing issues such as your fun question from above. To accommodate systems of wildly different capacities we ought to be looking at the impact of the allocation as well as the pre-allocation state. Using your example, the Pi may have the lowest pre-allocation load, but with that one call occupying a whopping 50% of its overall capacity we'd be able to re-evaluate the decision. Obviously anyone actually running that wild mix of instance sizes would be asking for trouble anyway, but there is clearly a lot more that can be done to increase the module's flexibility.
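One way to picture the "impact of the allocation" idea is to score each instance as if the new call had already been placed on it. The sketch below is purely illustrative: both function names are hypothetical and nothing like this exists in the PR.

```c
/* Score before placing the call: free capacity in percent. */
double pre_alloc_score(unsigned int current_sessions, unsigned int max_sessions)
{
    if (max_sessions == 0)
        return 0.0;
    return 100.0 - 100.0 * current_sessions / max_sessions;
}

/* Hypothetical variant: score the instance as if the call were already on it. */
double post_alloc_score(unsigned int current_sessions, unsigned int max_sessions)
{
    if (max_sessions == 0)
        return 0.0;
    return 100.0 - 100.0 * (current_sessions + 1) / max_sessions;
}
```

With the Pi at 1/2 calls and the super-server at 500/1000, both score 50 before allocation. After hypothetically adding the call, the Pi drops to 0 while the server only drops to 49.9, so the larger instance would win the comparison.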

At this time I'm reasonably confident that these changes will solve my problem, so I'm keen to get your feedback. Cheers

@spacetourist spacetourist requested a review from liviuchircu April 3, 2024 10:02
Member

@liviuchircu liviuchircu left a comment


Hello @spacetourist,

We've discussed this internally and while the new "Integrated Estimation" mode seems to solve your concrete problem, the module will still lack flexibility when it comes to different ratios of sessions to max-sessions. For example, with a sufficiently high max_sessions (e.g. in the order of thousands), and current_sessions in the order of hundreds, the new mode's formula will still output relatively similar weights, without giving the user any control to change it.

So we backtracked a bit and concluded that the problem can be alleviated while the max_load is being computed, during lb_update_max_loads(). In order to give full control to the user over their FreeSWITCH sessions scaling (some want 100 max sessions, others 500 or even 2500!), we could add a new freeswitch_sessions_exponent (default: 1, no change), that would be applied as a power to the current Sessions value. Here is how such an exponent would modify the output max_load:

lb-sessions-exponent

The picture shows 4 possible exponent settings: 1, 1.01, 1.05 and 1.1, which already create a dramatic change in the relative difference between the output max_load values.

In your case, probably a "1.1" value of the modparam would suffice, and it would fix your scenario with typical Max-Sessions values of 2500. The exponentiation would be added to this code section:

    if (psz < dst->fs_sock->stats.max_sess) {
        /* new_modparam: the proposed sessions exponent (1 = no change) */
        dst->rmap[ri].max_load =
            (dst->fs_sock->stats.id_cpu / (float)100) *
            (dst->fs_sock->stats.max_sess -
             (powf(dst->fs_sock->stats.sess, new_modparam) - psz));

This is just a working example as we get closer to the final solution, but the idea remains: the relative mode should be made to work as-is, rather than inventing new, obscure flags. And there is no need to leak all kinds of random information (CPU load? current_sessions? etc.) into get_dst_load(), which is ultimately meant to provide a couple of algorithms for interpreting the max_load of a destination, nothing more.

@spacetourist
Contributor Author

Hi @liviuchircu - that's an interesting idea but I'm not sure it solves some key aspects of my issue.

The main issue is the sheer volume of calls I'm dealing with - I exceed 200 cps on a couple of instances, so I must take that into account between heartbeats to avoid allocating all of those calls to the same instance until the next execution of lb_update_max_loads().

I also have several OpenSIPS instances feeding into the same bank of FreeSWITCH servers, meaning that the profile size is not really relevant to the calculation. In some ways having the max load score being a close contest isn't a bad thing here, provided the "s" flag is also enabled, as calls will be distributed randomly to those instances until the next heartbeat clarifies the real active session counts.

I'll give this some more thought as I agree there are further improvements we could make to:

  • handle instances of different sizes
  • modify the impact of the CPU factor (likely similar to your proposal here)
  • respect the FreeSWITCH Session-Per-Sec setting (also present in the heartbeat)

It may be that what I'm looking for falls too far outside the scope of the module authors' intentions to implement something general here, but I'm keen to work towards a solution which is flexible enough to be both tunable and applicable to the wider community.

@spacetourist
Contributor Author

Having given this some more thought I'm actually feeling more confident that the PR is on the right track as a general solution. The calculations manage hosts of varying sizes by reflecting the overall percentage of each host's capacity in use, and between heartbeats we allocate proportionally to the capacity available on each instance.

Excluding the CPU factor for now, starting with all systems at 10% of max sessions used we'd have an even score and the system would pick one at random (provided we supply the "s" flag):
image

If we allocate 20 sessions and by chance one is added to each instance, we'd see the first instance's score drop far enough to not be included in subsequent operations:
image

Continuing to deliver based on the score we'd start to see it preferring the larger instances before all instances are again equally scored:
image

This proceeds nicely as more sessions are allocated:
image

Assume the heartbeat data is updated and all instances have reached 50% load. At this point we get the same allocation pattern:
image

This is all quite simple but hinges on having a score re-calculated for each allocation. Without that the scores would remain fixed for too long and would most likely overload the smaller instances.
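The per-allocation re-scoring can be mimicked with a toy simulation. Everything in it is an assumption made for illustration: the struct, the function names and the pool sizes are not taken from the PR, the CPU factor is left out, and ties go to the lowest index where the "s" flag would pick randomly.

```c
struct inst {
    unsigned int max_sessions;      /* capacity reported by the heartbeat */
    unsigned int current_sessions;  /* sessions at the last heartbeat */
    unsigned int since_heartbeat;   /* calls routed locally since then */
};

/* Free capacity in percent; count_in_flight = 0 freezes the score
 * at the last heartbeat value. */
double inst_score(const struct inst *in, int count_in_flight)
{
    unsigned int used = in->current_sessions;

    if (in->max_sessions == 0)
        return 0.0;
    if (count_in_flight)
        used += in->since_heartbeat;
    return 100.0 - 100.0 * used / in->max_sessions;
}

/* Route n_calls one at a time to the best-scoring instance.
 * Simplification: ties go to the lowest index instead of a random pick. */
void allocate(struct inst *pool, int n_inst, int n_calls, int count_in_flight)
{
    int c, i, best;

    for (c = 0; c < n_calls; c++) {
        best = 0;
        for (i = 1; i < n_inst; i++)
            if (inst_score(&pool[i], count_in_flight) >
                inst_score(&pool[best], count_in_flight))
                best = i;
        pool[best].since_heartbeat++;
    }
}
```

Starting three instances of capacity 100, 500 and 1000 at 10% load and routing 160 calls with re-scoring enabled gives them 10, 50 and 100 calls respectively, so all three end at 20%. With re-scoring disabled the scores never move, and in this deterministic version every call lands on the first (smallest) instance.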

Am I missing something or does that make sense?

Regarding tunability, I think it would be appropriate to expose options for tuning how the CPU factor is implemented. In practical use a CPU running at 20% to 40% utilisation seems very normal, and I imagine the figures in the heartbeats vary quite a bit according to the time they were captured. Personally I don't see the need to always factor that in, as it will skew the score quite dramatically. I suggest we might wish to provide params for the following:

  • a setting to eliminate the CPU factor from these calculations
  • a threshold value above which the CPU is included as a factor, i.e. if CPU utilisation is >75% then it kicks in to reduce the chance of that instance being selected
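A sketch of how those two options might combine into a single multiplier is shown below. The names `use_cpu` and `cpu_util_threshold` are hypothetical and do not correspond to existing modparams; this is just one reading of the two bullets.

```c
/* Hypothetical tunables, not existing modparams:
 *   use_cpu            - 0 removes the CPU factor entirely
 *   cpu_util_threshold - utilisation (%) above which the factor kicks in */
double cpu_factor(double cpu_idle_pct, int use_cpu, double cpu_util_threshold)
{
    double util_pct = 100.0 - cpu_idle_pct;

    if (!use_cpu || util_pct <= cpu_util_threshold)
        return 1.0;                 /* score is left untouched */

    return cpu_idle_pct / 100.0;    /* busy host: scale the score down */
}
```

With a threshold of 75, an instance at 30% utilisation keeps its full score, while one at 80% utilisation has its score multiplied by 0.2.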

spacetourist and others added 2 commits June 10, 2024 11:35
* ✨ new operational mode - percent with CPU

* 🐛 syntax errors

* 🐛 cherry pick duplicate

* 🐛 prevent divide by zero

* 📝 improve log message

* 📝 improve log message

* 🐛 incorrect type

* 📝 improve log message

* 📝 improve log message

* 📝 improve log message

* 📝 improve log message

* Dev cpufactor (#1)

* new operational mode - percent with CPU

* 🐛 fix print of str type

* remove comment

* modify character choice for new flag

* document new integrated estimation flag usage

* ✨ create CPU factor option flag

* document new integrated estimation flag usage

* 📝 CPU factor is optional, improve description

* capture docs in template rather than README directly
@spacetourist
Contributor Author

I wanted to follow up on this PR now that I have finally moved forward and have this running in production. My results will of course be anecdotal; however, I am pleased to report that the modifications have had the desired effect, and even with four distinct OpenSIPS instances (no data sharing) in front of a bank of 13 media servers I'm seeing call loads balance to within 50 calls (4k-8k concurrent).

I'm only currently pushing ~25% of the traffic through the load balancer so I expect the lines to converge further in the coming weeks. The smoothing effect of having even this proportion of my load allocated is really beneficial and outside of peak moments I see very flat load across the media servers.

Thanks for all the assistance in getting this patch working properly, I'll provide another update and some charts once I have all the traffic using this system. 🚀

@spacetourist
Contributor Author

As mentioned above here are the charts that illustrate the performance of this revised algorithm. Charts are before and after the load balancer changes on the same list of 13 servers:

Before (using manual allocations by service relationship):
image

After (using the load balancer in the new mode):
image

Note that the extra blue line on the second chart is simply an external tool allocating outbound calls to one specific instance on top of the LB call load.


Successfully merging this pull request may close these issues.

[BUG] load_balancer algorithm weaknesses