Updates #4

Open
wants to merge 16 commits into base: jb-4.2.1

Conversation

mnazzzim

Please check these commits. It would be good if they were in semaphore.

Thank You

mnazzzim and others added 16 commits April 18, 2013 12:09
When switching to a new cpu_base in switch_hrtimer_base(), we
briefly enable preemption by unlocking the cpu_base lock in two
places. During this interval it's possible for the running thread
to be swapped to a different CPU.

Consider the following example:

CPU #0                                 CPU #1
----                                   ----
hrtimer_start()                        ...
 lock_hrtimer_base()
 switch_hrtimer_base()
  this_cpu = 0;
  target_cpu_base = 0;
  raw_spin_unlock(&cpu_base->lock)
<migrate to CPU 1>
...                                    this_cpu == 0
                                       cpu == this_cpu
                                       timer->base = CPU #0
                                       timer->base != LOCAL_CPU

Since the cached this_cpu is no longer accurate, we'll skip the
hrtimer_check_target() check. Once we eventually go to program
the hardware, we'll decide not to do so because the CPU we're
actually running on is not the same as the chosen base. As a
consequence, we may end up missing the hrtimer's deadline.

Fix this by updating the local CPU number each time we retake a
cpu_base lock in switch_hrtimer_base().

Another possibility is to disable preemption across the whole of
switch_hrtimer_base.  This looks suboptimal since preemption
would be disabled while waiting for lock(s).
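
As a rough sketch of the resulting pattern (names follow the mainline
switch_hrtimer_base(); this is not the literal diff):

	raw_spin_unlock(&base->cpu_base->lock);
	raw_spin_lock(&new_base->cpu_base->lock);
	/* we may have migrated while both locks were dropped */
	this_cpu = smp_processor_id();

	if (cpu != this_cpu && hrtimer_check_target(timer, new_base)) {
		cpu = this_cpu;
		raw_spin_unlock(&new_base->cpu_base->lock);
		raw_spin_lock(&base->cpu_base->lock);
		/* refresh again after retaking the original base lock */
		this_cpu = smp_processor_id();
		timer->base = base;
		goto again;
	}
	timer->base = new_base;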

Change-Id: I3f76d528ab2288ef5f950f4d048e7f3fa6cf1228
Signed-off-by: Michael Bohan <[email protected]>
When switching the hrtimer cpu_base, we briefly allow for
preemption to become enabled by unlocking the cpu_base lock.
During this time, the CPU corresponding to the new cpu_base
that was selected may in fact go offline. In this scenario, the
hrtimer is enqueued to a CPU that's not online, and therefore
it never fires.

Consider the following example:

CPU #0                          CPU #1
----                            ----
...                             hrtimer_start()
                                 lock_hrtimer_base()
                                 switch_hrtimer_base()
                                  cpu = hrtimer_get_target() -> 1
                                  spin_unlock(&cpu_base->lock)
                                <migrate thread to CPU #0>
                                <offline>
spin_lock(&new_base->lock)
this_cpu = 0
cpu != this_cpu
enqueue_hrtimer(cpu_base #1)

To prevent this scenario, verify that the CPU corresponding to
the new cpu_base is indeed online before selecting it in
switch_hrtimer_base(). If it's not online, fall back to using the
base of the current CPU.
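
One possible shape of the check, sketched against the mainline
switch_hrtimer_base() flow (the placement in the actual patch may differ):

	raw_spin_lock(&new_base->cpu_base->lock);
	this_cpu = smp_processor_id();

	if (cpu != this_cpu && !cpu_online(cpu)) {
		/*
		 * The CPU chosen for the new base went offline while the
		 * locks were dropped; fall back to the local CPU's base.
		 */
		cpu = this_cpu;
		raw_spin_unlock(&new_base->cpu_base->lock);
		raw_spin_lock(&base->cpu_base->lock);
		timer->base = base;
		goto again;
	}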

Change-Id: I6359cb299ba7e08abf25d3360fcdd925b4c03b69
Signed-off-by: Michael Bohan <[email protected]>
Date: Thu, 4 Apr 2013 10:54:16 -0400

In the __mutex_lock_common() function, an initial entry into
the lock slow path will cause two atomic_xchg instructions to be
issued. Together with the atomic decrement in the fast path, a total
of three atomic read-modify-write instructions will be issued in
rapid succession. This can cause a lot of cache bouncing when many
tasks are trying to acquire the mutex at the same time.

This patch reduces the number of atomic_xchg instructions used by
checking the counter value first before issuing the instruction. The
atomic_read() function is just a simple memory read, whereas the
atomic_xchg() function can be two orders of magnitude or more
costlier. By using atomic_read() to check the value first before
calling atomic_xchg(), we can avoid a lot of unnecessary cache
coherency traffic. The only downside of this change is that a task
on the slow path has a slightly lower chance of getting the mutex
when competing with another task in the fast path.

The same is true for the atomic_cmpxchg() function in the
mutex-spin-on-owner loop. So an atomic_read() is also performed before
calling atomic_cmpxchg().

The mutex locking and unlocking code for the x86 architecture can allow
any negative number to be used in the mutex count to indicate that some
tasks are waiting for the mutex. I am not so sure if that is the case
for the other architectures. So the default is to avoid atomic_xchg()
if the count has already been set to -1. For x86, the check is modified
to include all negative numbers to cover a larger range of values.
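
A sketch of the resulting checks, following the description above (the
helper name and exact hunks may differ from the version applied here):

	/*
	 * True when a plain read of the count suggests nobody is waiting.
	 * This is the x86 flavour: any negative count means waiters are
	 * present; the generic version only trusts the exact value -1.
	 */
	#define MUTEX_SHOW_NO_WAITER(mutex)	(atomic_read(&(mutex)->count) >= 0)

	/* slow-path entry, was:  if (atomic_xchg(&lock->count, -1) == 1) ... */
	if (MUTEX_SHOW_NO_WAITER(lock) &&
	    (atomic_xchg(&lock->count, -1) == 1))
		goto done;

	/* spin-on-owner loop, was:  if (atomic_cmpxchg(&lock->count, 1, 0) == 1) ... */
	if ((atomic_read(&lock->count) == 1) &&
	    (atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
		/* lock acquired while spinning, skip the wait queue */
	}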

The following table shows the scalability data on an 8-node 80-core
Westmere box with a 3.7.10 kernel. The numactl command is used to
restrict the running of the high_systime workloads to 1/2/4/8 nodes
with hyperthreading on and off.

+-----------------+------------------+------------------+----------+
|  Configuration  | Mean Transaction | Mean Transaction | % Change |
|                 |  Rate w/o patch  | Rate with patch  |          |
+-----------------+------------------------------------------------+
|                 |            User Range 1100 - 2000              |
+-----------------+------------------------------------------------+
| 8 nodes, HT on  |      36980       |      148590      | +301.8%  |
| 8 nodes, HT off |      42799       |      145011      | +238.8%  |
| 4 nodes, HT on  |      61318       |      118445      |  +51.1%  |
| 4 nodes, HT off |     158481       |      158592      |   +0.1%  |
| 2 nodes, HT on  |     180602       |      173967      |   -3.7%  |
| 2 nodes, HT off |     198409       |      198073      |   -0.2%  |
| 1 node , HT on  |     149042       |      147671      |   -0.9%  |
| 1 node , HT off |     126036       |      126533      |   +0.4%  |
+-----------------+------------------------------------------------+
|                 |            User Range 200 - 1000               |
+-----------------+------------------------------------------------+
| 8 nodes, HT on  |      41525       |      122349      | +194.6%  |
| 8 nodes, HT off |      49866       |      124032      | +148.7%  |
| 4 nodes, HT on  |      66409       |      106984      |  +61.1%  |
| 4 nodes, HT off |     119880       |      130508      |   +8.9%  |
| 2 nodes, HT on  |     138003       |      133948      |   -2.9%  |
| 2 nodes, HT off |     132792       |      131997      |   -0.6%  |
| 1 node , HT on  |     116593       |      115859      |   -0.6%  |
| 1 node , HT off |     104499       |      104597      |   +0.1%  |
+-----------------+------------------+------------------+----------+
The AIM7 benchmark has a pretty large run-to-run variance due to the
random nature of the subtests executed, so a difference of less than
+-5% may not be really significant.

This patch improves high_systime workload performance at 4 nodes
and up by maintaining transaction rates without significant drop-off
at high node counts.  The patch has practically no impact on 1- and
2-node systems.

The table below shows the percentage time (as reported by perf
record -a -s -g) spent on the __mutex_lock_slowpath() function by
the high_systime workload at 1500 users for 2/4/8-node configurations
with hyperthreading off.

+---------------+-----------------+------------------+---------+
| Configuration | %Time w/o patch | %Time with patch | %Change |
+---------------+-----------------+------------------+---------+
|    8 nodes    |      65.34%     |      0.69%       |  -99%   |
|    4 nodes    |       8.70%     |      1.02%       |  -88%   |
|    2 nodes    |       0.41%     |      0.32%       |  -22%   |
+---------------+-----------------+------------------+---------+
It is obvious that the dramatic performance improvement at 8
nodes was due to the drastic cut in the time spent within the
__mutex_lock_slowpath() function.

The table below shows the improvements in other AIM7 workloads (at 8
nodes, hyperthreading off).

+--------------+---------------+----------------+-----------------+
|   Workload   | mean % change | mean % change  | mean % change   |
|              | 10-100 users  | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests     |     +0.6%     |   +104.2%      |   +185.9%       |
| five_sec     |     +1.9%     |     +0.9%      |     +0.9%       |
| fserver      |     +1.4%     |     -7.7%      |     +5.1%       |
| new_fserver  |     -0.5%     |     +3.2%      |     +3.1%       |
| shared       |    +13.1%     |   +146.1%      |   +181.5%       |
| short        |     +7.4%     |     +5.0%      |     +4.2%       |
+--------------+---------------+----------------+-----------------+
Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Davidlohr Bueso <[email protected]>
Date: Thu, 4 Apr 2013 10:54:17 -0400

The current mutex spinning code allows multiple tasks to spin on a
single mutex concurrently. There are two major problems with this
approach:

 1. This is not very energy efficient as the spinning tasks are not
    doing useful work. The spinning tasks may also block other more
    important or useful tasks from running as preemption is disabled.
    Only one of the spinners will get the mutex at any time. The
    other spinners will have to wait for much longer to get it.

 2. The mutex data structure on x86-64 should be 32 bytes. The spinning
    code spins on lock->owner which, in most cases, should be in the same
    64-byte cache line as the lock->wait_lock spinlock. As a result,
    the mutex spinners contend for the same cache line with other
    CPUs trying to get the spinlock, leading to increased time spent
    on the spinlock as well as on the mutex spinning.

These problems are worse on systems with a large number of CPUs. One
way to reduce the effect of these two problems is to allow only one
task to be spinning on a mutex at any time.

This patch adds a new spinner field in mutex.h to limit the
number of spinners to a single task. That will increase the size of
the mutex by 8 bytes in a 64-bit environment (4 bytes in a 32-bit
environment).
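
A hypothetical sketch of the idea (the field name, its placement and the
use of cmpxchg() are illustrative, not the literal patch):

	struct mutex {
		atomic_t		count;
		spinlock_t		wait_lock;
		struct list_head	wait_list;
		struct task_struct	*owner;
		struct task_struct	*spinner;	/* task currently spinning, if any */
		/* remaining fields and config #ifdefs omitted */
	};

	/* optimistic-spin path: claim the single spinner slot first */
	if (cmpxchg(&lock->spinner, NULL, current) != NULL)
		goto queue;			/* someone is already spinning */

	/* ... existing spin-on-owner loop ... */

	lock->spinner = NULL;			/* give the slot back */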

The AIM7 benchmarks were run on 3.7.10-derived kernels to show the
performance changes with this patch on an 8-socket 80-core system
with hyperthreading off.  The table below shows the mean % change
in performance over a range of users for some AIM7 workloads with
just the less-atomic-operations patch (patch 1) versus that patch
plus this one (patches 1+2).

+--------------+-----------------+-----------------+-----------------+
|   Workload   | mean % change   | mean % change   | mean % change   |
|              | 10-100 users    | 200-1000 users  | 1100-2000 users |
+--------------+-----------------+-----------------+-----------------+
| alltests     |     -0.2%       |     -3.8%       |    -4.2%        |
| five_sec     |     -0.6%       |     -2.0%       |    -2.4%        |
| fserver      |     +2.2%       |    +16.2%       |    +2.2%        |
| high_systime |     -0.3%       |     -4.3%       |    -3.0%        |
| new_fserver  |     +3.9%       |    +16.0%       |    +9.5%        |
| shared       |     -1.7%       |     -5.0%       |    -4.0%        |
| short        |     -7.7%       |     +0.2%       |    +1.3%        |
+--------------+-----------------+-----------------+-----------------+
It can be seen that this patch improves performance for the fserver and
new_fserver workloads while causing a slight drop in performance
for the other workloads.

Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Davidlohr Bueso <[email protected]>
Commit 5a50508 changed this lock from a mutex to an rwsem, which caused
aim7 fork_test performance to drop by 50%. Yuanhan Liu did an analysis and
found it was caused by strict sequential writing. Ingo suggested stealing
the sem write lock from the front task in the wait queue.
https://lkml.org/lkml/2013/1/29/84

This patch does exactly that: write stealing is only allowed when the
first waiter is also a writer. Performance is now fully recovered.
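
A conceptual sketch of the check (the real change lives in lib/rwsem.c
and is more involved; the helper below is hypothetical and the waiter
field layout follows the 3.4-era lib/rwsem.c):

	/*
	 * A writer entering the slow path may steal the lock ahead of the
	 * queue, but only if the task at the head of the queue is itself
	 * a writer, so queued readers are never bypassed.
	 * Caller holds sem->wait_lock.
	 */
	static inline bool rwsem_can_steal_write(struct rw_semaphore *sem)
	{
		struct rwsem_waiter *first;

		if (list_empty(&sem->wait_list))
			return true;

		first = list_first_entry(&sem->wait_list,
					 struct rwsem_waiter, list);
		return first->flags & RWSEM_WAITING_FOR_WRITE;
	}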

Reported-by: [email protected]
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Alex Shi <[email protected]>
Change-Id: I1524cdc6894a4bf8243c6c47a17f225a83d9cec2
ext4: prevent kernel panic in case of uninitialized jinode

In some cases a kernel crash occurs during system suspend/resume:

[ 4095.041351] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[ 4095.050689] pgd = c0004000
[ 4095.053985] [00000000] *pgd=00000000
[ 4095.058807] Internal error: Oops: 5 [#1] PREEMPT SMP
[ 4095.064483] Modules linked in: wl12xx mac80211 pvrsrvkm_sgx540_120 cfg80211 compat [last unloaded: wl12xx_sdio]
[ 4095.064575] CPU: 1    Tainted: G    B        (3.0.31-01807-gfac16a0 #1)
[ 4095.064605] PC is at jbd2_journal_file_inode+0x38/0x118
[ 4095.064666] LR is at mpage_da_map_and_submit+0x48c/0x618
[ 4095.064697] pc : [<c01da5a8>]    lr : [<c01aeac0>]    psr: 60000013
[ 4095.064697] sp : c6e07c80  ip : c6e07ca0  fp : c6e07c9c
[ 4095.064727] r10: 00000001  r9 : c6e06000  r8 : 00000179
[ 4095.064758] r7 : c6e07ca0  r6 : c73b8400  r5 : 00000000  r4 : c59a7d80
[ 4095.064758] r3 : 00000038  r2 : 00000800  r1 : 00000000  r0 : c7754fc0
[ 4095.064788] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
[ 4095.064819] Control: 10c5387d  Table: 86cc804a  DAC: 00000015
[ 4095.064849]
[ 4095.064849] PC: 0xc01da528:
[ 4095.064880] a528  0a000003 e3a05000 e1a00005 e24bd020 e89da9f0 e5951010 e3e06000 e14b22dc
.....
[ 4095.070373] 7fe0: c00a48ac 00000013 00000000 c6e07ff8 c00a48ac c00c0a94 84752f09 60772177
[ 4095.070404] Backtrace:
[ 4095.070465] [<c01da570>] (jbd2_journal_file_inode+0x0/0x118) from [<c01aeac0>] (mpage_da_map_and_submit+0x48c/0x618)
[ 4095.070495]  r7:c6e07ca0 r6:c6e07d00 r5:c6e07d90 r4:c7754fc0
[ 4095.070556] [<c01ae634>] (mpage_da_map_and_submit+0x0/0x618) from [<c01af40c>] (ext4_da_writepages+0x2a4/0x5c8)
[ 4095.070617] [<c01af168>] (ext4_da_writepages+0x0/0x5c8) from [<c0112af4>] (do_writepages+0x34/0x40)
[ 4095.070678] [<c0112ac0>] (do_writepages+0x0/0x40) from [<c01645a4>] (writeback_single_inode+0xd4/0x288)
[ 4095.070709] [<c01644d0>] (writeback_single_inode+0x0/0x288) from [<c0164ed4>] (writeback_sb_inodes+0xb4/0x184)
[ 4095.070770] [<c0164e20>] (writeback_sb_inodes+0x0/0x184) from [<c01655a0>] (writeback_inodes_wb+0xc4/0x13c)
[ 4095.070831] [<c01654dc>] (writeback_inodes_wb+0x0/0x13c) from [<c01658f0>] (wb_writeback+0x2d8/0x464)
[ 4095.070861] [<c0165618>] (wb_writeback+0x0/0x464) from [<c0165cb8>] (wb_do_writeback+0x23c/0x2c4)
[ 4095.070922] [<c0165a7c>] (wb_do_writeback+0x0/0x2c4) from [<c0165df4>] (bdi_writeback_thread+0xb4/0x2dc)
[ 4095.070953] [<c0165d40>] (bdi_writeback_thread+0x0/0x2dc) from [<c00c0b18>] (kthread+0x90/0x98)
[ 4095.071014] [<c00c0a88>] (kthread+0x0/0x98) from [<c00a48ac>] (do_exit+0x0/0x72c)
[ 4095.071044]  r7:00000013 r6:c00a48ac r5:c00c0a88 r4:c78c7ec4
[ 4095.071105] Code: e89da8f0 e5963000 e3130002 1afffffa (e5913000)
[ 4095.071166] ---[ end trace 7fe9f9b727e5cf78 ]---
[ 4095.071197] Kernel panic - not syncing: Fatal exception

The probable reason for this behaviour is that an inode opened in READ
mode has somehow been marked 'dirty' and is then written back by
ext4_da_writepages. Because jinode == NULL, this leads to the kernel panic.

The patch prevents the kernel panic and helps to investigate the problem
by reporting the inode number.
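
A sketch of the described guard, assuming it sits in the
ext4_jbd2_file_inode() wrapper in fs/ext4/ext4_jbd2.h, where the VFS
inode (and hence its number) is still at hand; the real patch may hook
a different spot:

	static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode)
	{
		if (WARN_ONCE(!EXT4_I(inode)->jinode,
			      "EXT4: NULL jinode for inode %lu, skipping journalling\n",
			      inode->i_ino))
			return 0;	/* report the inode instead of oopsing */

		return jbd2_journal_file_inode(handle, EXT4_I(inode)->jinode);
	}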

Change-Id: I1d77a011b580db682b8e2d122ef3d5e44e0ce5c7
Signed-off-by: Volodymyr Mieshkov <[email protected]>
This patch fixes several mempolicy leaks in the tmpfs mount logic.
These leaks are slow - on the order of one object leaked per mount
attempt.

Leak 1 (umount doesn't free mpol allocated in mount):
    while true; do
        mount -t tmpfs -o mpol=interleave,size=100M nodev /mnt
        umount /mnt
    done

Leak 2 (errors parsing remount options will leak mpol):
    mount -t tmpfs -o size=100M nodev /mnt
    while true; do
        mount -o remount,mpol=interleave,size=x /mnt 2> /dev/null
    done
    umount /mnt

Leak 3 (multiple mpol per mount leak mpol):
    while true; do
        mount -t tmpfs -o mpol=interleave,mpol=interleave,size=100M nodev /mnt
        umount /mnt
    done

This patch fixes all of the above.  I could have broken the patch into
three pieces but it seemed easier to review as one.
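
A sketch of the kind of changes involved in mm/shmem.c (simplified, not
the literal hunks):

	/* Leak 1: drop the superblock's mempolicy when the sb goes away */
	static void shmem_put_super(struct super_block *sb)
	{
		struct shmem_sb_info *sbinfo = SHMEM_SB(sb);

		mpol_put(sbinfo->mpol);		/* was simply missing */
		kfree(sbinfo);
		sb->s_fs_info = NULL;
	}

	/* Leaks 2 and 3: whenever a new policy has been parsed (remount, or a
	 * repeated mpol= option), put the one it replaces instead of dropping
	 * the reference on the floor. */
		mpol_put(sbinfo->mpol);		/* mpol_put(NULL) is a no-op */
		sbinfo->mpol = mpol;		/* 'mpol' freshly parsed from the option */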

Signed-off-by: Greg Thelen <[email protected]>
modified for Mako kernel from LKML reference

Change-Id: I44b4bcf90506f5d30406a783378bc601bbe33622
Commit 5a50508 changes to rwsem from mutex, caused aim7 fork_test
performance to dropped 50%. Yuanhan liu did an analysis, found it was
caused by strict sequential writing. Ingo suggest stealing sem writing from
front task in wait queue. https://lkml.org/lkml/2013/1/29/84

So does this patch.
In this patch, I just allow write stealing to happen when the first waiter
is also writer. The performance fully is now fully recovered.

Reported-by: [email protected]
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Alex Shi <[email protected]>

Conflicts:

	lib/rwsem.c
Signed-off-by: Francisco Franco <[email protected]>
Signed-off-by: Seongmin Park <[email protected]>
Change-Id: I304c96eec3fe1ae94260e2e5984523f289bcf42e
Fix possible memory leak detected by kmemleak:

unreferenced object 0xc0f80f00 (size 64):
  comm "swapper/0", pid 1, jiffies 4294937508 (age 82.980s)
  hex dump (first 32 bytes):
    6d 6d 63 30 5f 64 65 74 65 63 74 00 72 79 2e 68  mmc0_detect.ry.h
    00 07 00 00 68 77 63 61 70 2e 68 00 02 00 00 70  ....hwcap.h....p
  backtrace:
    [<c010a1fc>] __kmalloc+0x164/0x220
    [<c01e1630>] kvasprintf+0x38/0x58
    [<c01e1668>] kasprintf+0x18/0x24
    [<c02fcf60>] mmc_alloc_host+0x114/0x1b4
    [<c0311c84>] msmsdcc_probe+0xc14/0x1fd8
    [<c022b40c>] platform_drv_probe+0x14/0x18
    [<c022a144>] driver_probe_device+0x144/0x334
    [<c022a394>] __driver_attach+0x60/0x84
    [<c022884c>] bus_for_each_dev+0x4c/0x78
    [<c0229720>] bus_add_driver+0xd0/0x250
    [<c022a884>] driver_register+0x9c/0x128
    [<c00086bc>] do_one_initcall+0x90/0x160
    [<c06d1904>] kernel_init+0xe8/0x1a4
    [<c000ee4c>] kernel_thread_exit+0x0/0x8
    [<ffffffff>] 0xffffffff
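
The hex dump above shows the leaked object is the "mmc0_detect" string
that kasprintf() builds inside mmc_alloc_host(), apparently used in this
tree as the detect wakelock name. A hypothetical sketch of the fix, with
illustrative field names: keep the pointer on the host and free it when
the host is freed.

	/* mmc_alloc_host() */
	host->wlock_name = kasprintf(GFP_KERNEL, "%s_detect",
				     mmc_hostname(host));
	wake_lock_init(&host->detect_wake_lock, WAKE_LOCK_SUSPEND,
		       host->wlock_name);

	/* mmc_free_host() */
	wake_lock_destroy(&host->detect_wake_lock);
	kfree(host->wlock_name);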

Change-Id: I3b29d71463af849a072cabbe56637adf6db6d0da
Signed-off-by: Sujit Reddy Thumma <[email protected]>
Signed-off-by: Francisco Franco <[email protected]>
…w power mode to freeze processes. Testing phase at the moment.

Signed-off-by: franciscofranco <[email protected]>
Calling wake_lock_destroy from inside a spinlock-protected
region (or, in general, from atomic context)
leads to a 'scheduling while atomic' bug because the
internal wakeup source deletion logic calls
synchronize_rcu, which can sleep. Moreover,
since the internal lists are already protected with
RCU and spinlocks, putting the wake_lock_destroy
call inside a spinlock is redundant.
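
An illustrative sketch of the pattern being fixed (the 'drv' and 'req'
structures here are hypothetical, not a specific driver's hunk):

	/* before: may sleep in atomic context */
	spin_lock_irqsave(&drv->lock, flags);
	list_del(&req->list);
	wake_lock_destroy(&req->wake_lock);	/* calls synchronize_rcu() */
	spin_unlock_irqrestore(&drv->lock, flags);

	/* after: only the list manipulation needs the spinlock */
	spin_lock_irqsave(&drv->lock, flags);
	list_del(&req->list);
	spin_unlock_irqrestore(&drv->lock, flags);
	wake_lock_destroy(&req->wake_lock);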

Change-Id: I10a2239b664a5f43e54495f24fe588fb09282305
Signed-off-by: Anurag Singh <[email protected]>
Signed-off-by: franciscofranco <[email protected]>
Alberto96 pushed a commit to Alberto96/samsung-kernel-aries that referenced this pull request Sep 3, 2013
…optimizations

Recent GCC versions (e.g. GCC-4.7.2) perform optimizations based on
assumptions about the implementation of memset and similar functions.
The current ARM optimized memset code does not return the value of
its first argument, as is usually expected from standard implementations.

For instance in the following function:

void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
{
	memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
	waiter->magic = waiter;
	INIT_LIST_HEAD(&waiter->list);
}

compiled as:

800554d0 <debug_mutex_lock_common>:
800554d0:       e92d4008        push    {r3, lr}
800554d4:       e1a00001        mov     r0, r1
800554d8:       e3a02010        mov     r2, #16 ; 0x10
800554dc:       e3a01011        mov     r1, #17 ; 0x11
800554e0:       eb04426e        bl      80165ea0 <memset>
800554e4:       e1a03000        mov     r3, r0
800554e8:       e583000c        str     r0, [r3, #12]
800554ec:       e5830000        str     r0, [r3]
800554f0:       e5830004        str     r0, [r3, #4]
800554f4:       e8bd8008        pop     {r3, pc}

GCC assumes memset returns the value of pointer 'waiter' in register r0; causing
register/memory corruptions.

This patch fixes the return value of the assembly version of memset.
It adds a 'mov' instruction and merges an additional load+store into
existing load/store instructions.
For ease of review, here is a breakdown of the patch into 4 simple steps:

Step 1
======
Perform the following substitutions:
ip -> r8, then
r0 -> ip,
and insert 'mov ip, r0' as the first statement of the function.
At this point, we have a memset() implementation returning the proper result,
but corrupting r8 on some paths (the ones that were using ip).

Step 2
======
Make sure r8 is saved and restored when (! CALGN(1)+0) == 1:

save r8:
-       str     lr, [sp, #-4]!
+       stmfd   sp!, {r8, lr}

and restore r8 on both exit paths:
-       ldmeqfd sp!, {pc}               @ Now <64 bytes to go.
+       ldmeqfd sp!, {r8, pc}           @ Now <64 bytes to go.
(...)
        tst     r2, #16
        stmneia ip!, {r1, r3, r8, lr}
-       ldr     lr, [sp], #4
+       ldmfd   sp!, {r8, lr}

Step 3
======
Make sure r8 is saved and restored when (! CALGN(1)+0) == 0:

save r8:
-       stmfd   sp!, {r4-r7, lr}
+       stmfd   sp!, {r4-r8, lr}

and restore r8 on both exit paths:
        bgt     3b
-       ldmeqfd sp!, {r4-r7, pc}
+       ldmeqfd sp!, {r4-r8, pc}
(...)
        tst     r2, #16
        stmneia ip!, {r4-r7}
-       ldmfd   sp!, {r4-r7, lr}
+       ldmfd   sp!, {r4-r8, lr}

Step 4
======
Rewrite register list "r4-r7, r8" as "r4-r8".

Signed-off-by: Ivan Djelic <[email protected]>
Reviewed-by: Nicolas Pitre <[email protected]>
Signed-off-by: Dirk Behme <[email protected]>
Signed-off-by: Russell King <[email protected]>
Alberto96 added a commit to Alberto96/samsung-kernel-aries that referenced this pull request Dec 21, 2013
…optimizations
