
rtapp execution gives unreliable actual duty cycle #89

Open
douglas-raillard-arm opened this issue Oct 4, 2019 · 21 comments · May be fixed by #90

@douglas-raillard-arm

The following JSON configuration was used on a Juno R0, big core running with "performance" cpufreq governor and idle states disabled:

{
    "tasks": {
        "pelt1": {
            "loop": 1,
            "policy": "SCHED_OTHER",
            "phases": {
                "p000001": {
                    "loop": 125,
                    "run": 8000,
                    "timer": {
                        "ref": "pelt1",
                        "period": 16000
                    },
                    "cpus": [
                        1
                    ]
                }
            },
            "delay": 0
        }
    },
    "global": {
        "duration": -1,
        "calibration": 247,
        "default_policy": "SCHED_OTHER",
        "logstats": false,
        "ftrace": "loop",
        "lock_pages": false
    }
}

This leads to an active time per period of ~8.5ms in one trace and ~9ms in another, instead of the expected 8ms in both.
traces.tar.gz
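
For reference, here is a quick sanity check of what the configuration above should produce; this is plain arithmetic derived from the JSON values, not rt-app code:

    # Expected behaviour of the "p000001" phase above (run/period in microseconds).
    run_us, period_us, loops = 8000, 16000, 125

    duty_cycle = run_us / period_us      # 0.5 -> ~8ms of busy time per 16ms period
    total_s = loops * period_us / 1e6    # 125 * 16ms = 2s of workload

    print(f"expected duty cycle: {duty_cycle:.0%}, total duration: {total_s:.1f}s")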

This may be related to the "loop" ftrace configuration; I will update the ticket with further information as it becomes available.

douglas-raillard-arm (Author) commented Oct 4, 2019

Actually, it seems more like a calibration issue: rt-app gives very different calibration results from one run to another, but when a good value is found, rt-app seems to execute workloads reliably.

EDIT: both issues are present, and seem to share a common cause.

@credp added the bug label Oct 7, 2019
@douglas-raillard-arm linked a pull request Oct 9, 2019 that will close this issue
@douglas-raillard-arm (Author)

The known-good version I tried was 482e47a

@vingu-linaro (Member)

Hi Douglas,

I have run "rt-app doc/examples/template.json" 20 times on my hikey with the master branch and the calibration stays the same: 151ns. I have also checked the log file of the thread; the variation in the duration of the 10msec run event is 18usec.

vingu-linaro (Member) commented Oct 11, 2019

I also tried with a static link:
The calibration returns 500ns 18 times and 501ns twice (obviously the ldexp implementation is not the same) and the variation of the 10msec run event is 123usec.

I also ran tests on the big core of the hikey960 and the results are quite similar: the calibration always returns 138ns and the variation of the 10msec run event is only 7usec.

@douglas-raillard-arm (Author)

Do you see anything strange in the traces I've attached to the issue?

@vingu-linaro (Member)

I don't see anything strange.
Have you tried to run without enabling the trace? And only using the log file with a memory buffer?

@douglas-raillard-arm (Author)

Have you tried to run without enabling the trace?

Calibration runs as root without tracing. We just run rt-app with idle states disabled, the performance cpufreq governor, and the rest of the userspace frozen.

And only using the log file with a memory buffer?

Do you mean logging to a file in tmpfs, or a specific rt-app option?

@vingu-linaro (Member)

And only using the log file with a memory buffer?

Do you mean logging to a file in tmpfs, or a specific rt-app option?

"log_size" : 1024, as an example so you will not access file while running events
or even
"log_size" : "disable"

@douglas-raillard-arm (Author)

I'll give it a go. Are those the same logs as enabled by logstats (we disable it)? I can't find a reference to logstats in the documentation.

I also found that the util of the task has some kind of low-frequency component (~120ms period) with the current rt-app. That happened on CPU1 of my Juno R0, although CPU2 was apparently unaffected (they are the same kind of big core).

@vingu-linaro (Member)

logstats? Do you mean log_timing()?
For the low-frequency component, I will try to reproduce it.

@douglas-raillard-arm (Author)

@derkling added the "logstats" global option to the JSON produced by LISA, so I assumed rt-app knows about it. It is supposed to enable/disable the generation of slack logs, now that they can also be emitted as ftrace events.

Let me know if you are interested in a trace where this strange util variation is observed and then fixed by the change to the busy loop.

@vingu-linaro (Member)

There is no "logstats" in the master branch (sha1: 9a50d76).
This raises the question: have you tried the master branch of rt-app?

douglas-raillard-arm (Author) commented Oct 21, 2019

Tests when changing the busy loop were conducted on master @ 9a50d76, so the comparison between rt-app versions with the same toolchain is somewhat valid.

But rt-app used within LISA is:

rt-app v1.0-95-g72ab18b (2019-09-05 14:26:07 BST)

This commit SHA1 does not seem to exist in rt-app. @derkling Do you remember which commit the rt-app build you upstreamed in LISA was built from?
EDIT: it's probably 72ab18b (and not g72ab18b), so it's almost the latest master. However, I also cannot find any reference to logstats in the code ...

douglas-raillard-arm (Author) commented Oct 21, 2019

I'll try with "log_size": "disable" rather than logstats.

douglas-raillard-arm (Author) commented Nov 4, 2019

I've tried with "log_size": "disable" and I get the following results on Juno R2.
JSON: rta_ntaskscpumigration.json.txt

Task utilization with the current version of rt-app:
bad_tasks_util

With rt-app from #90:
good_tasks_util

These results seem to be reproducible (I ran multiple iterations with each, always getting similar results).

I also ran an integration cycle with the modified version of rt-app and it removed some failures, most notably on the CPUMigration tests that these graphs are taken from.

vingu-linaro (Member) commented Nov 5, 2019

I have run your json file on my hikey but still can't reproduce your instability. I would say that this is even quite stable.
chart.pdf

Then, the theoretical range of util_avg for your migrX-X tasks is [159-211], but with #90 your range is around [110-150].

Could you try to run my rt-app binary, so we can check whether your instability comes from your compilation environment?
rt-app.gz

@vingu-linaro (Member)

Douglas,
Any update on this problem? Have you tried the binary that I posted?

@vingu-linaro (Member)

Hi Douglas,

Have you made any progress on this topic?

@douglas-raillard-arm (Author)

Hi @vingu-linaro, sorry for the response delay. The current state of things seems to be:

  • I think we get issues more when (at least) two tasks are scheduled at the same time on the same CPU, like in the trace I posted here. There is plenty of idle time, so I can't really explain it, and it may be a wrong lead.
  • PELT can give surprising results at times (although not at that scale AFAIK), so the relation duty cycle <=> utilization might not be so simple. In the meantime, I've added some duty-cycle oriented plots to LISA, so it might be time to revisit the trace I posted here.

note: The CPUMigration test in LISA will soon no longer trigger that issue, since it is going to measure the duty cycle directly from the trace.

@vingu-linaro (Member)

  • I think we get issues more when (at least) two tasks are scheduled at the same time on the same CPU, like in the trace I posted here. There is plenty of idle time, so I can't really explain it, and it may be a wrong lead.

I have just rerun your json file on my hikey and tasks are scheduled simultaneously on the same CPU AFAICT. And the util_avg of all tasks stays quite stable: +/-1 at most.

  • PELT can give surprising results at times (although not at that scale AFAIK), so the relation duty cycle <=> utilization might not be so simple. In the meantime, I've added some duty-cycle oriented plots to LISA, so it might be time to revisit the trace I posted here.

Yes, there is no linear relation between duty cycle and utilization, because the period also impacts the utilization. The formula is:

max utilization: (1 - y^r) / (1 - y^p), with r the running time and p the period (keep in mind that the step is 1024us)
min utilization: max utilization * y^(p-r)
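
As a rough illustration, here is that formula applied to the original pelt1 task from the first comment (run 8000us, period 16000us); a minimal sketch that assumes the standard PELT parameters (1024us step, y^32 = 0.5), so treat the exact numbers as indicative only:

    # Theoretical util_avg bounds from the formula above, for run=8000us / period=16000us.
    y = 0.5 ** (1 / 32)          # PELT decay factor per 1024us step (32ms half-life)
    r = 8000 / 1024              # running time, in 1024us steps
    p = 16000 / 1024             # period, in 1024us steps

    util_max = (1 - y**r) / (1 - y**p)   # peak, right after the run part of the period
    util_min = util_max * y**(p - r)     # trough, right before the next run part

    print(f"theoretical util_avg range: [{util_min * 1024:.0f}, {util_max * 1024:.0f}]")
    # -> roughly [469, 555] for this 50% duty cycle at a 16ms period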


douglas-raillard-arm (Author) commented Jan 15, 2020

PELT can give surprising results at times

I was referring to this kind of behavior
rt-app pelt

The red lines' Y coordinates give the duration of the corresponding rt-app activations. The black lines do the same for the sleep part. The util signal has a weird non-symmetrical oscillation, even in parts where the duty cycle stays stable (between t=3.75 and t=4).

The current PELT simulator we have seems to work well, except in that case, where it gives different results, so I assume there is something tricky going on, but I've not been able to pinpoint what it is exactly. The issue can be reproduced with the Invariance test in LISA [1] (it's not failing all the time, but you should get at least 2% of failed runs or so).

I don't think this issue and the one we are discussing here are related, but who knows ...

[1] https://lisa-linux-integrated-system-analysis.readthedocs.io/en/master/kernel_tests.html#lisa.tests.scheduler.load_tracking.InvarianceItem.test_util_correctness

PS: This kind of plot can be reproduced in a notebook with:

        from lisa.trace import Trace

        task = 'the_rtapp_task_name'
        trace = Trace('trace.dat')
        # plot util
        axis = trace.analysis.load_tracking.plot_task_signals(task, signals=['util'])
        activation_axis = axis.twinx()
        # Plot activation/sleep and activation "background bands". You can also replace "duration=True" by "duty_cycle=True" with similar results
        trace.analysis.tasks.plot_task_activation(task, alpha=0.2, axis=activation_axis, duration=True)
        axis.legend()
