
rtapp execution gives unreliable actual duty cycle #89

Open
douglas-raillard-arm opened this issue Oct 4, 2019 · 21 comments · May be fixed by #90

@douglas-raillard-arm

The following JSON configuration was used on a Juno R0, big core running with "performance" cpufreq governor and idle states disabled:

{
    "tasks": {
        "pelt1": {
            "loop": 1,
            "policy": "SCHED_OTHER",
            "phases": {
                "p000001": {
                    "loop": 125,
                    "run": 8000,
                    "timer": {
                        "ref": "pelt1",
                        "period": 16000
                    },
                    "cpus": [
                        1
                    ]
                }
            },
            "delay": 0
        }
    },
    "global": {
        "duration": -1,
        "calibration": 247,
        "default_policy": "SCHED_OTHER",
        "logstats": false,
        "ftrace": "loop",
        "lock_pages": false
    }
}

This leads to an active time per period of ~8.5ms in one trace and ~9ms in another, instead of the expected 8ms in both.
traces.tar.gz
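
For reference, here is a quick sanity check of what the configuration above should produce; this is plain arithmetic derived from the JSON values, not rt-app code:

    # Expected behaviour of the "p000001" phase above (run/period in microseconds).
    run_us, period_us, loops = 8000, 16000, 125

    duty_cycle = run_us / period_us      # 0.5 -> ~8ms of busy time per 16ms period
    total_s = loops * period_us / 1e6    # 125 * 16ms = 2s of workload

    print(f"expected duty cycle: {duty_cycle:.0%}, total duration: {total_s:.1f}s")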

This may be related to the "loop" ftrace configuration; I will update the ticket with further information as it becomes available.

douglas-raillard-arm (Author) commented Oct 4, 2019

Actually, it seems more like a calibration issue: rt-app gives very different calibration results from one run to another, but when a good value is found, rt-app seems to execute workloads reliably.

EDIT: both issues are present, and seem to share a common cause.

@credp added the bug label Oct 7, 2019
@douglas-raillard-arm linked a pull request Oct 9, 2019 that will close this issue
@douglas-raillard-arm (Author)

The known-good version I tried was 482e47a

@vingu-linaro (Member)

Hi Douglas,

I have run "rt-app doc/examples/template.json" 20 times on my hikey with the master branch and the calibration stays the same: 151ns. I have also checked the log file of the thread; the variation in the duration of the 10msec run event is 18usec.

vingu-linaro (Member) commented Oct 11, 2019

I also tried with a static link:
The calibration returns 500ns 18 times and 501ns twice (obviously the ldexp implementation is not the same) and the variation of the 10msec run event is 123usec.

I also ran tests on the big core of the hikey960 and the results are quite similar: the calibration always returns 138ns and the variation of the 10msec run event is only 7usec.

@douglas-raillard-arm (Author)

Do you see anything strange in the traces I've attached to the issue?

@vingu-linaro (Member)

I don't see anything strange.
Have you tried to run without enabling the trace? And only using the log file with a memory buffer?

@douglas-raillard-arm (Author)

Have you tried to run without enabling the trace?

Calibration runs as root without tracing. We just run rt-app with idle states disabled, the performance cpufreq governor, and the rest of the userspace frozen.

And only using the log file with a memory buffer?

Do you mean logging to a file in tmpfs, or a specific rt-app option?

@vingu-linaro (Member)

And only using the log file with a memory buffer?

Do you mean logging to a file in tmpfs, or a specific rt-app option?

"log_size" : 1024, as an example so you will not access file while running events
or even
"log_size" : "disable"

@douglas-raillard-arm (Author)

I'll give it a go. Are those the same logs as enabled by logstats (we disable it)? I can't find a reference to logstats in the documentation.

I also found that the util of the task has some kind of low-frequency component (~120ms period) with the current rt-app. That happened on CPU1 of my Juno R0, although CPU2 was apparently unaffected (they are the same kind of big core).

@vingu-linaro (Member)

logstats? Do you mean log_timing()?
For the low-frequency component, I will try to reproduce it.

@douglas-raillard-arm (Author)

@derkling added the "logstats" global option to the JSON produced by LISA, so I assumed rt-app knows about it. It is supposed to enable/disable the generation of slack logs, now that they can also be emitted as ftrace events.

Let me know if you are interested in a trace where this strange util variation is observed and then fixed by the change to the busy loop.

@vingu-linaro (Member)

There is no "logstats" in the master branch (sha1: 9a50d76).
This raises the question: have you tried the master branch of rt-app?

douglas-raillard-arm (Author) commented Oct 21, 2019

Tests when changing the busy loop were conducted on master @ 9a50d76, so the comparison between rt-app versions with the same toolchain is somewhat valid.

But rt-app used within LISA is:

rt-app v1.0-95-g72ab18b (2019-09-05 14:26:07 BST)

This commit SHA1 does not seem to exist in rt-app. @derkling Do you remember which commit the rt-app build you upstreamed in LISA was built from?
EDIT: it's probably 72ab18b (and not g72ab18b), so it's almost the latest master. However, I also cannot find any reference to logstats in the code ...

douglas-raillard-arm (Author) commented Oct 21, 2019

I'll try with "log_size": "disable" rather than logstats.

douglas-raillard-arm (Author) commented Nov 4, 2019

I've tried with "log_size": "disable" and I get the following results on Juno R2.
JSON: rta_ntaskscpumigration.json.txt

Task utilization with the current version of rt-app:
bad_tasks_util

With rt-app from #90:
good_tasks_util

These results seem to be reproducible (I ran multiple iterations with each, always getting similar results).

I also ran an integration cycle with the modified version of rt-app and it removed some failures, most notably on the CPUMigration tests that these graphs are taken from.

vingu-linaro (Member) commented Nov 5, 2019

I have run your json file on my hikey but still can't reproduce your instability. I would say that this is even quite stable.
chart.pdf

Then, the theoretical range of util_avg for your migrX-X tasks is [159-211], but with #90 your range is around [110-150].

Could you try to run my rt-app binary, so we can check whether your instability comes from your compilation environment?
rt-app.gz

@vingu-linaro (Member)

Douglas,
Any update on this problem? Have you tried the binary that I posted?

@vingu-linaro (Member)

Hi Douglas,

Have you made any progress on this topic?

@douglas-raillard-arm (Author)

Hi @vingu-linaro, sorry for the response delay. The current state of things seems to be:

  • I think we get issues more when (at least) two tasks are scheduled at the same time on the same CPU, like in the trace I posted here. There is plenty of idle time, so I can't really explain it, and it may be a wrong lead.
  • PELT can give surprising results at times (although not at that scale AFAIK), so the relation duty cycle <=> utilization might not be so simple. In the meantime, I've added some duty-cycle oriented plots to LISA, so it might be time to revisit the trace I posted here.

note: The CPUMigration test in LISA will soon no longer trigger that issue, since it is going to measure the duty cycle directly from the trace.

@vingu-linaro (Member)

  • I think we get issues more when (at least) two tasks are scheduled at the same time on the same CPU, like in the trace I posted here. There is plenty of idle time, so I can't really explain it, and it may be a wrong lead.

I have just rerun your json file on my hikey and tasks are scheduled simultaneously on the same CPU AFAICT. And the util_avg of all tasks stays quite stable: +/-1 at most.

  • PELT can give surprising results at times (although not at that scale AFAIK), so the relation duty cycle <=> utilization might not be so simple. In the meantime, I've added some duty-cycle oriented plots to LISA, so it might be time to revisit the trace I posted here.

Yes, there is no linear relation between duty cycle and utilization, because the period also impacts the utilization. The formula is:

max utilization: (1 - y^r) / (1 - y^p), with r the running time and p the period (keep in mind that the step is 1024us)
min utilization: max utilization * y^(p-r)
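
As a rough illustration, here is that formula applied to the original pelt1 task from the first comment (run 8000us, period 16000us); a minimal sketch that assumes the standard PELT parameters (1024us step, y^32 = 0.5), so treat the exact numbers as indicative only:

    # Theoretical util_avg bounds from the formula above, for run=8000us / period=16000us.
    y = 0.5 ** (1 / 32)          # PELT decay factor per 1024us step (32ms half-life)
    r = 8000 / 1024              # running time, in 1024us steps
    p = 16000 / 1024             # period, in 1024us steps

    util_max = (1 - y**r) / (1 - y**p)   # peak, right after the run part of the period
    util_min = util_max * y**(p - r)     # trough, right before the next run part

    print(f"theoretical util_avg range: [{util_min * 1024:.0f}, {util_max * 1024:.0f}]")
    # -> roughly [469, 555] for this 50% duty cycle at a 16ms period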


douglas-raillard-arm (Author) commented Jan 15, 2020

PELT can give surprising results at times

I was referring to this kind of behavior
rt-app pelt

The red lines' Y coordinates give the duration of the corresponding rt-app activations. The black lines do the same for the sleep part. The util signal has a weird non-symmetrical oscillation, even in parts where the duty cycle stays stable (between t=3.75 and t=4).

The current PELT simulator we have seems to work well, except in that case, where it gives different results, so I assume there is something tricky going on, but I've not been able to pinpoint what it is exactly. The issue can be reproduced with the Invariance test in LISA [1] (it's not failing all the time, but you should get at least 2% of failed runs or so).

I don't think this issue and the one we are discussing here are related, but who knows ...

[1] https://lisa-linux-integrated-system-analysis.readthedocs.io/en/master/kernel_tests.html#lisa.tests.scheduler.load_tracking.InvarianceItem.test_util_correctness

PS: This kind of plot can be reproduced in a notebook with:

        from lisa.trace import Trace

        task = 'the_rtapp_task_name'
        trace = Trace('trace.dat')
        # plot util
        axis = trace.analysis.load_tracking.plot_task_signals(task, signals=['util'])
        activation_axis = axis.twinx()
        # Plot activation/sleep and activation "background bands". You can also replace "duration=True" by "duty_cycle=True" with similar results
        trace.analysis.tasks.plot_task_activation(task, alpha=0.2, axis=activation_axis, duration=True)
        axis.legend()
