Updates #4

Open
wants to merge 16 commits into base: jb-4.2.1

Conversation

mnazzzim

Please check these commits. It would be good if they were in semaphore.

Thank You

mnazzzim and others added 16 commits April 18, 2013 12:09
When switching to a new cpu_base in switch_hrtimer_base(), we
briefly enable preemption by unlocking the cpu_base lock in two
places. During this interval it's possible for the running thread
to be swapped to a different CPU.

Consider the following example:

CPU #0                                 CPU #1
----                                   ----
hrtimer_start()                        ...
 lock_hrtimer_base()
 switch_hrtimer_base()
  this_cpu = 0;
  target_cpu_base = 0;
  raw_spin_unlock(&cpu_base->lock)
<migrate to CPU 1>
...                                    this_cpu == 0
                                       cpu == this_cpu
                                       timer->base = CPU #0
                                       timer->base != LOCAL_CPU

Since the cached this_cpu is no longer accurate, we'll skip the
hrtimer_check_target() check. Once we eventually go to program
the hardware, we'll decide not to do so because the CPU we're
actually running on is not the same as the chosen base. As a
consequence, we may end up missing the hrtimer's deadline.

Fix this by updating the local CPU number each time we retake a
cpu_base lock in switch_hrtimer_base().

Another possibility is to disable preemption across the whole of
switch_hrtimer_base.  This looks suboptimal since preemption
would be disabled while waiting for lock(s).
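
As a rough sketch of the resulting pattern (names follow the mainline
switch_hrtimer_base(); this is not the literal diff):

	raw_spin_unlock(&base->cpu_base->lock);
	raw_spin_lock(&new_base->cpu_base->lock);
	/* we may have migrated while both locks were dropped */
	this_cpu = smp_processor_id();

	if (cpu != this_cpu && hrtimer_check_target(timer, new_base)) {
		cpu = this_cpu;
		raw_spin_unlock(&new_base->cpu_base->lock);
		raw_spin_lock(&base->cpu_base->lock);
		/* refresh again after retaking the original base lock */
		this_cpu = smp_processor_id();
		timer->base = base;
		goto again;
	}
	timer->base = new_base;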

Change-Id: I3f76d528ab2288ef5f950f4d048e7f3fa6cf1228
Signed-off-by: Michael Bohan <[email protected]>
When switching the hrtimer cpu_base, we briefly allow for
preemption to become enabled by unlocking the cpu_base lock.
During this time, the CPU corresponding to the new cpu_base
that was selected may in fact go offline. In this scenario, the
hrtimer is enqueued to a CPU that's not online, and therefore
it never fires.

Consider the following example:

CPU #0                          CPU #1
----                            ----
...                             hrtimer_start()
                                 lock_hrtimer_base()
                                 switch_hrtimer_base()
                                  cpu = hrtimer_get_target() -> 1
                                  spin_unlock(&cpu_base->lock)
                                <migrate thread to CPU #0>
                                <offline>
spin_lock(&new_base->lock)
this_cpu = 0
cpu != this_cpu
enqueue_hrtimer(cpu_base #1)

To prevent this scenario, verify that the CPU corresponding to
the new cpu_base is indeed online before selecting it in
switch_hrtimer_base(). If it's not online, fall back to using the
base of the current CPU.
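
One possible shape of the check, sketched against the mainline
switch_hrtimer_base() flow (the placement in the actual patch may differ):

	raw_spin_lock(&new_base->cpu_base->lock);
	this_cpu = smp_processor_id();

	if (cpu != this_cpu && !cpu_online(cpu)) {
		/*
		 * The CPU chosen for the new base went offline while the
		 * locks were dropped; fall back to the local CPU's base.
		 */
		cpu = this_cpu;
		raw_spin_unlock(&new_base->cpu_base->lock);
		raw_spin_lock(&base->cpu_base->lock);
		timer->base = base;
		goto again;
	}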

Change-Id: I6359cb299ba7e08abf25d3360fcdd925b4c03b69
Signed-off-by: Michael Bohan <[email protected]>
Date: Thu, 4 Apr 2013 10:54:16 -0400

In the __mutex_lock_common() function, an initial entry into
the lock slow path will cause two atomic_xchg instructions to be
issued. Together with the atomic decrement in the fast path, a total
of three atomic read-modify-write instructions will be issued in
rapid succession. This can cause a lot of cache bouncing when many
tasks are trying to acquire the mutex at the same time.

This patch reduces the number of atomic_xchg instructions used by
checking the counter value first before issuing the instruction. The
atomic_read() function is just a simple memory read, whereas the
atomic_xchg() function can be two orders of magnitude or more
costlier. By using atomic_read() to check the value first before
calling atomic_xchg(), we can avoid a lot of unnecessary cache
coherency traffic. The only downside of this change is that a task
on the slow path has a slightly lower chance of getting the mutex
when competing with another task in the fast path.

The same is true for the atomic_cmpxchg() function in the
mutex-spin-on-owner loop. So an atomic_read() is also performed before
calling atomic_cmpxchg().

The mutex locking and unlocking code for the x86 architecture can allow
any negative number to be used in the mutex count to indicate that some
tasks are waiting for the mutex. I am not so sure if that is the case
for the other architectures. So the default is to avoid atomic_xchg()
if the count has already been set to -1. For x86, the check is modified
to include all negative numbers to cover a larger range of values.
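
A sketch of the resulting checks, following the description above (the
helper name and exact hunks may differ from the version applied here):

	/*
	 * True when a plain read of the count suggests nobody is waiting.
	 * This is the x86 flavour: any negative count means waiters are
	 * present; the generic version only trusts the exact value -1.
	 */
	#define MUTEX_SHOW_NO_WAITER(mutex)	(atomic_read(&(mutex)->count) >= 0)

	/* slow-path entry, was:  if (atomic_xchg(&lock->count, -1) == 1) ... */
	if (MUTEX_SHOW_NO_WAITER(lock) &&
	    (atomic_xchg(&lock->count, -1) == 1))
		goto done;

	/* spin-on-owner loop, was:  if (atomic_cmpxchg(&lock->count, 1, 0) == 1) ... */
	if ((atomic_read(&lock->count) == 1) &&
	    (atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
		/* lock acquired while spinning, skip the wait queue */
	}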

The following table shows the scalability data on an 8-node 80-core
Westmere box with a 3.7.10 kernel. The numactl command is used to
restrict the running of the high_systime workloads to 1/2/4/8 nodes
with hyperthreading on and off.

+-----------------+------------------+------------------+----------+
|  Configuration  | Mean Transaction | Mean Transaction | % Change |
|                 |  Rate w/o patch  | Rate with patch  |          |
+-----------------+------------------------------------------------+
|                 |            User Range 1100 - 2000              |
+-----------------+------------------------------------------------+
| 8 nodes, HT on  |      36980       |      148590      | +301.8%  |
| 8 nodes, HT off |      42799       |      145011      | +238.8%  |
| 4 nodes, HT on  |      61318       |      118445      |  +51.1%  |
| 4 nodes, HT off |     158481       |      158592      |   +0.1%  |
| 2 nodes, HT on  |     180602       |      173967      |   -3.7%  |
| 2 nodes, HT off |     198409       |      198073      |   -0.2%  |
| 1 node , HT on  |     149042       |      147671      |   -0.9%  |
| 1 node , HT off |     126036       |      126533      |   +0.4%  |
+-----------------+------------------------------------------------+
|                 |            User Range 200 - 1000               |
+-----------------+------------------------------------------------+
| 8 nodes, HT on  |      41525       |      122349      | +194.6%  |
| 8 nodes, HT off |      49866       |      124032      | +148.7%  |
| 4 nodes, HT on  |      66409       |      106984      |  +61.1%  |
| 4 nodes, HT off |     119880       |      130508      |   +8.9%  |
| 2 nodes, HT on  |     138003       |      133948      |   -2.9%  |
| 2 nodes, HT off |     132792       |      131997      |   -0.6%  |
| 1 node , HT on  |     116593       |      115859      |   -0.6%  |
| 1 node , HT off |     104499       |      104597      |   +0.1%  |
+-----------------+------------------+------------------+----------+
The AIM7 benchmark has a pretty large run-to-run variance due to the
random nature of the subtests executed, so a difference of less than
+-5% may not be really significant.

This patch improves high_systime workload performance at 4 nodes
and up by maintaining transaction rates without significant drop-off
at high node counts.  The patch has practically no impact on 1- and
2-node systems.

The table below shows the percentage time (as reported by perf
record -a -s -g) spent on the __mutex_lock_slowpath() function by
the high_systime workload at 1500 users for 2/4/8-node configurations
with hyperthreading off.

+---------------+-----------------+------------------+---------+
| Configuration | %Time w/o patch | %Time with patch | %Change |
+---------------+-----------------+------------------+---------+
|    8 nodes    |      65.34%     |      0.69%       |  -99%   |
|    4 nodes    |       8.70%     |      1.02%       |  -88%   |
|    2 nodes    |       0.41%     |      0.32%       |  -22%   |
+---------------+-----------------+------------------+---------+
It is obvious that the dramatic performance improvement at 8
nodes was due to the drastic cut in the time spent within the
__mutex_lock_slowpath() function.

The table below shows the improvements in other AIM7 workloads (at 8
nodes, hyperthreading off).

+--------------+---------------+----------------+-----------------+
|   Workload   | mean % change | mean % change  | mean % change   |
|              | 10-100 users  | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests     |     +0.6%     |   +104.2%      |   +185.9%       |
| five_sec     |     +1.9%     |     +0.9%      |     +0.9%       |
| fserver      |     +1.4%     |     -7.7%      |     +5.1%       |
| new_fserver  |     -0.5%     |     +3.2%      |     +3.1%       |
| shared       |    +13.1%     |   +146.1%      |   +181.5%       |
| short        |     +7.4%     |     +5.0%      |     +4.2%       |
+--------------+---------------+----------------+-----------------+
Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Davidlohr Bueso <[email protected]>
Date: Thu, 4 Apr 2013 10:54:17 -0400

The current mutex spinning code allows multiple tasks to spin on a
single mutex concurrently. There are two major problems with this
approach:

 1. This is not very energy efficient as the spinning tasks are not
    doing useful work. The spinning tasks may also block other more
    important or useful tasks from running as preemption is disabled.
    Only one of the spinners will get the mutex at any time. The
    other spinners will have to wait for much longer to get it.

 2. The mutex data structure on x86-64 should be 32 bytes. The spinning
    code spins on lock->owner which, in most cases, should be in the same
    64-byte cache line as the lock->wait_lock spinlock. As a result,
    the mutex spinners contend for the same cache line with other
    CPUs trying to get the spinlock, leading to increased time spent
    on the spinlock as well as on the mutex spinning.

These problems are worse on systems with a large number of CPUs. One
way to reduce the effect of these two problems is to allow only one
task to be spinning on a mutex at any time.

This patch adds a new spinner field in mutex.h to limit the
number of spinners to a single task. That will increase the size of
the mutex by 8 bytes in a 64-bit environment (4 bytes in a 32-bit
environment).
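
A hypothetical sketch of the idea (the field name, its placement and the
use of cmpxchg() are illustrative, not the literal patch):

	struct mutex {
		atomic_t		count;
		spinlock_t		wait_lock;
		struct list_head	wait_list;
		struct task_struct	*owner;
		struct task_struct	*spinner;	/* task currently spinning, if any */
		/* remaining fields and config #ifdefs omitted */
	};

	/* optimistic-spin path: claim the single spinner slot first */
	if (cmpxchg(&lock->spinner, NULL, current) != NULL)
		goto queue;			/* someone is already spinning */

	/* ... existing spin-on-owner loop ... */

	lock->spinner = NULL;			/* give the slot back */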

The AIM7 benchmarks were run on 3.7.10-derived kernels to show the
performance changes with this patch on an 8-socket 80-core system
with hyperthreading off.  The table below shows the mean % change
in performance over a range of users for some AIM7 workloads with
just the less-atomic-operations patch (patch 1) versus that patch
plus this one (patches 1+2).

+--------------+-----------------+-----------------+-----------------+
|   Workload   | mean % change   | mean % change   | mean % change   |
|              | 10-100 users    | 200-1000 users  | 1100-2000 users |
+--------------+-----------------+-----------------+-----------------+
| alltests     |     -0.2%       |     -3.8%       |    -4.2%        |
| five_sec     |     -0.6%       |     -2.0%       |    -2.4%        |
| fserver      |     +2.2%       |    +16.2%       |    +2.2%        |
| high_systime |     -0.3%       |     -4.3%       |    -3.0%        |
| new_fserver  |     +3.9%       |    +16.0%       |    +9.5%        |
| shared       |     -1.7%       |     -5.0%       |    -4.0%        |
| short        |     -7.7%       |     +0.2%       |    +1.3%        |
+--------------+-----------------+-----------------+-----------------+
It can be seen that this patch improves performance for the fserver and
new_fserver workloads while causing a slight drop in performance
for the other workloads.

Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Davidlohr Bueso <[email protected]>
Commit 5a50508 changed this lock from a mutex to an rwsem, which caused
aim7 fork_test performance to drop by 50%. Yuanhan Liu did an analysis and
found it was caused by strict sequential writing. Ingo suggested stealing
the sem write lock from the front task in the wait queue.
https://lkml.org/lkml/2013/1/29/84

This patch does exactly that: write stealing is only allowed when the
first waiter is also a writer. Performance is now fully recovered.
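
A conceptual sketch of the check (the real change lives in lib/rwsem.c
and is more involved; the helper below is hypothetical and the waiter
field layout follows the 3.4-era lib/rwsem.c):

	/*
	 * A writer entering the slow path may steal the lock ahead of the
	 * queue, but only if the task at the head of the queue is itself
	 * a writer, so queued readers are never bypassed.
	 * Caller holds sem->wait_lock.
	 */
	static inline bool rwsem_can_steal_write(struct rw_semaphore *sem)
	{
		struct rwsem_waiter *first;

		if (list_empty(&sem->wait_list))
			return true;

		first = list_first_entry(&sem->wait_list,
					 struct rwsem_waiter, list);
		return first->flags & RWSEM_WAITING_FOR_WRITE;
	}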

Reported-by: [email protected]
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Alex Shi <[email protected]>
Change-Id: I1524cdc6894a4bf8243c6c47a17f225a83d9cec2
ext4: prevent kernel panic in case of uninitialized jinode

In some cases a kernel crash occurs during system suspend/resume:

[ 4095.041351] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[ 4095.050689] pgd = c0004000
[ 4095.053985] [00000000] *pgd=00000000
[ 4095.058807] Internal error: Oops: 5 [#1] PREEMPT SMP
[ 4095.064483] Modules linked in: wl12xx mac80211 pvrsrvkm_sgx540_120 cfg80211 compat [last unloaded: wl12xx_sdio]
[ 4095.064575] CPU: 1    Tainted: G    B        (3.0.31-01807-gfac16a0 #1)
[ 4095.064605] PC is at jbd2_journal_file_inode+0x38/0x118
[ 4095.064666] LR is at mpage_da_map_and_submit+0x48c/0x618
[ 4095.064697] pc : [<c01da5a8>]    lr : [<c01aeac0>]    psr: 60000013
[ 4095.064697] sp : c6e07c80  ip : c6e07ca0  fp : c6e07c9c
[ 4095.064727] r10: 00000001  r9 : c6e06000  r8 : 00000179
[ 4095.064758] r7 : c6e07ca0  r6 : c73b8400  r5 : 00000000  r4 : c59a7d80
[ 4095.064758] r3 : 00000038  r2 : 00000800  r1 : 00000000  r0 : c7754fc0
[ 4095.064788] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
[ 4095.064819] Control: 10c5387d  Table: 86cc804a  DAC: 00000015
[ 4095.064849]
[ 4095.064849] PC: 0xc01da528:
[ 4095.064880] a528  0a000003 e3a05000 e1a00005 e24bd020 e89da9f0 e5951010 e3e06000 e14b22dc
.....
[ 4095.070373] 7fe0: c00a48ac 00000013 00000000 c6e07ff8 c00a48ac c00c0a94 84752f09 60772177
[ 4095.070404] Backtrace:
[ 4095.070465] [<c01da570>] (jbd2_journal_file_inode+0x0/0x118) from [<c01aeac0>] (mpage_da_map_and_submit+0x48c/0x618)
[ 4095.070495]  r7:c6e07ca0 r6:c6e07d00 r5:c6e07d90 r4:c7754fc0
[ 4095.070556] [<c01ae634>] (mpage_da_map_and_submit+0x0/0x618) from [<c01af40c>] (ext4_da_writepages+0x2a4/0x5c8)
[ 4095.070617] [<c01af168>] (ext4_da_writepages+0x0/0x5c8) from [<c0112af4>] (do_writepages+0x34/0x40)
[ 4095.070678] [<c0112ac0>] (do_writepages+0x0/0x40) from [<c01645a4>] (writeback_single_inode+0xd4/0x288)
[ 4095.070709] [<c01644d0>] (writeback_single_inode+0x0/0x288) from [<c0164ed4>] (writeback_sb_inodes+0xb4/0x184)
[ 4095.070770] [<c0164e20>] (writeback_sb_inodes+0x0/0x184) from [<c01655a0>] (writeback_inodes_wb+0xc4/0x13c)
[ 4095.070831] [<c01654dc>] (writeback_inodes_wb+0x0/0x13c) from [<c01658f0>] (wb_writeback+0x2d8/0x464)
[ 4095.070861] [<c0165618>] (wb_writeback+0x0/0x464) from [<c0165cb8>] (wb_do_writeback+0x23c/0x2c4)
[ 4095.070922] [<c0165a7c>] (wb_do_writeback+0x0/0x2c4) from [<c0165df4>] (bdi_writeback_thread+0xb4/0x2dc)
[ 4095.070953] [<c0165d40>] (bdi_writeback_thread+0x0/0x2dc) from [<c00c0b18>] (kthread+0x90/0x98)
[ 4095.071014] [<c00c0a88>] (kthread+0x0/0x98) from [<c00a48ac>] (do_exit+0x0/0x72c)
[ 4095.071044]  r7:00000013 r6:c00a48ac r5:c00c0a88 r4:c78c7ec4
[ 4095.071105] Code: e89da8f0 e5963000 e3130002 1afffffa (e5913000)
[ 4095.071166] ---[ end trace 7fe9f9b727e5cf78 ]---
[ 4095.071197] Kernel panic - not syncing: Fatal exception

The probable reason for this behaviour is that an inode opened in READ
mode has somehow been marked 'dirty' and is then written back by
ext4_da_writepages. Because jinode == NULL, this leads to the kernel panic.

The patch prevents the kernel panic and helps to investigate the problem
by reporting the inode number.
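
A sketch of the described guard, assuming it sits in the
ext4_jbd2_file_inode() wrapper in fs/ext4/ext4_jbd2.h, where the VFS
inode (and hence its number) is still at hand; the real patch may hook
a different spot:

	static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode)
	{
		if (WARN_ONCE(!EXT4_I(inode)->jinode,
			      "EXT4: NULL jinode for inode %lu, skipping journalling\n",
			      inode->i_ino))
			return 0;	/* report the inode instead of oopsing */

		return jbd2_journal_file_inode(handle, EXT4_I(inode)->jinode);
	}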

Change-Id: I1d77a011b580db682b8e2d122ef3d5e44e0ce5c7
Signed-off-by: Volodymyr Mieshkov <[email protected]>
This patch fixes several mempolicy leaks in the tmpfs mount logic.
These leaks are slow - on the order of one object leaked per mount
attempt.

Leak 1 (umount doesn't free mpol allocated in mount):
    while true; do
        mount -t tmpfs -o mpol=interleave,size=100M nodev /mnt
        umount /mnt
    done

Leak 2 (errors parsing remount options will leak mpol):
    mount -t tmpfs -o size=100M nodev /mnt
    while true; do
        mount -o remount,mpol=interleave,size=x /mnt 2> /dev/null
    done
    umount /mnt

Leak 3 (multiple mpol per mount leak mpol):
    while true; do
        mount -t tmpfs -o mpol=interleave,mpol=interleave,size=100M nodev /mnt
        umount /mnt
    done

This patch fixes all of the above.  I could have broken the patch into
three pieces but it seemed easier to review as one.
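
A sketch of the kind of changes involved in mm/shmem.c (simplified, not
the literal hunks):

	/* Leak 1: drop the superblock's mempolicy when the sb goes away */
	static void shmem_put_super(struct super_block *sb)
	{
		struct shmem_sb_info *sbinfo = SHMEM_SB(sb);

		mpol_put(sbinfo->mpol);		/* was simply missing */
		kfree(sbinfo);
		sb->s_fs_info = NULL;
	}

	/* Leaks 2 and 3: whenever a new policy has been parsed (remount, or a
	 * repeated mpol= option), put the one it replaces instead of dropping
	 * the reference on the floor. */
		mpol_put(sbinfo->mpol);		/* mpol_put(NULL) is a no-op */
		sbinfo->mpol = mpol;		/* 'mpol' freshly parsed from the option */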

Signed-off-by: Greg Thelen <[email protected]>
modified for Mako kernel from LKML reference

Change-Id: I44b4bcf90506f5d30406a783378bc601bbe33622
Commit 5a50508 changes to rwsem from mutex, caused aim7 fork_test
performance to dropped 50%. Yuanhan liu did an analysis, found it was
caused by strict sequential writing. Ingo suggest stealing sem writing from
front task in wait queue. https://lkml.org/lkml/2013/1/29/84

So does this patch.
In this patch, I just allow write stealing to happen when the first waiter
is also writer. The performance fully is now fully recovered.

Reported-by: [email protected]
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Alex Shi <[email protected]>

Conflicts:

	lib/rwsem.c
Signed-off-by: Francisco Franco <[email protected]>
Signed-off-by: Seongmin Park <[email protected]>
Change-Id: I304c96eec3fe1ae94260e2e5984523f289bcf42e
Fix possible memory leak detected by kmemleak:

unreferenced object 0xc0f80f00 (size 64):
  comm "swapper/0", pid 1, jiffies 4294937508 (age 82.980s)
  hex dump (first 32 bytes):
    6d 6d 63 30 5f 64 65 74 65 63 74 00 72 79 2e 68  mmc0_detect.ry.h
    00 07 00 00 68 77 63 61 70 2e 68 00 02 00 00 70  ....hwcap.h....p
  backtrace:
    [<c010a1fc>] __kmalloc+0x164/0x220
    [<c01e1630>] kvasprintf+0x38/0x58
    [<c01e1668>] kasprintf+0x18/0x24
    [<c02fcf60>] mmc_alloc_host+0x114/0x1b4
    [<c0311c84>] msmsdcc_probe+0xc14/0x1fd8
    [<c022b40c>] platform_drv_probe+0x14/0x18
    [<c022a144>] driver_probe_device+0x144/0x334
    [<c022a394>] __driver_attach+0x60/0x84
    [<c022884c>] bus_for_each_dev+0x4c/0x78
    [<c0229720>] bus_add_driver+0xd0/0x250
    [<c022a884>] driver_register+0x9c/0x128
    [<c00086bc>] do_one_initcall+0x90/0x160
    [<c06d1904>] kernel_init+0xe8/0x1a4
    [<c000ee4c>] kernel_thread_exit+0x0/0x8
    [<ffffffff>] 0xffffffff
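
The hex dump above shows the leaked object is the "mmc0_detect" string
that kasprintf() builds inside mmc_alloc_host(), apparently used in this
tree as the detect wakelock name. A hypothetical sketch of the fix, with
illustrative field names: keep the pointer on the host and free it when
the host is freed.

	/* mmc_alloc_host() */
	host->wlock_name = kasprintf(GFP_KERNEL, "%s_detect",
				     mmc_hostname(host));
	wake_lock_init(&host->detect_wake_lock, WAKE_LOCK_SUSPEND,
		       host->wlock_name);

	/* mmc_free_host() */
	wake_lock_destroy(&host->detect_wake_lock);
	kfree(host->wlock_name);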

Change-Id: I3b29d71463af849a072cabbe56637adf6db6d0da
Signed-off-by: Sujit Reddy Thumma <[email protected]>
Signed-off-by: Francisco Franco <[email protected]>
…w power mode to freeze processes. Testing phase at the moment.

Signed-off-by: franciscofranco <[email protected]>
Calling wake_lock_destroy from inside a spinlock-protected
region (or, in general, from atomic context)
leads to a 'scheduling while atomic' bug because the
internal wakeup source deletion logic calls
synchronize_rcu, which can sleep. Moreover,
since the internal lists are already protected with
RCU and spinlocks, putting the wake_lock_destroy
call inside a spinlock is redundant.
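
An illustrative sketch of the pattern being fixed (the 'drv' and 'req'
structures here are hypothetical, not a specific driver's hunk):

	/* before: may sleep in atomic context */
	spin_lock_irqsave(&drv->lock, flags);
	list_del(&req->list);
	wake_lock_destroy(&req->wake_lock);	/* calls synchronize_rcu() */
	spin_unlock_irqrestore(&drv->lock, flags);

	/* after: only the list manipulation needs the spinlock */
	spin_lock_irqsave(&drv->lock, flags);
	list_del(&req->list);
	spin_unlock_irqrestore(&drv->lock, flags);
	wake_lock_destroy(&req->wake_lock);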

Change-Id: I10a2239b664a5f43e54495f24fe588fb09282305
Signed-off-by: Anurag Singh <[email protected]>
Signed-off-by: franciscofranco <[email protected]>
Alberto96 pushed a commit to Alberto96/samsung-kernel-aries that referenced this pull request Sep 3, 2013
…optimizations

Recent GCC versions (e.g. GCC-4.7.2) perform optimizations based on
assumptions about the implementation of memset and similar functions.
The current ARM optimized memset code does not return the value of
its first argument, as is usually expected from standard implementations.

For instance in the following function:

void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
{
	memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
	waiter->magic = waiter;
	INIT_LIST_HEAD(&waiter->list);
}

compiled as:

800554d0 <debug_mutex_lock_common>:
800554d0:       e92d4008        push    {r3, lr}
800554d4:       e1a00001        mov     r0, r1
800554d8:       e3a02010        mov     r2, #16 ; 0x10
800554dc:       e3a01011        mov     r1, #17 ; 0x11
800554e0:       eb04426e        bl      80165ea0 <memset>
800554e4:       e1a03000        mov     r3, r0
800554e8:       e583000c        str     r0, [r3, #12]
800554ec:       e5830000        str     r0, [r3]
800554f0:       e5830004        str     r0, [r3, #4]
800554f4:       e8bd8008        pop     {r3, pc}

GCC assumes memset returns the value of pointer 'waiter' in register r0; causing
register/memory corruptions.

This patch fixes the return value of the assembly version of memset.
It adds a 'mov' instruction and merges an additional load+store into
existing load/store instructions.
For ease of review, here is a breakdown of the patch into 4 simple steps:

Step 1
======
Perform the following substitutions:
ip -> r8, then
r0 -> ip,
and insert 'mov ip, r0' as the first statement of the function.
At this point, we have a memset() implementation returning the proper result,
but corrupting r8 on some paths (the ones that were using ip).

Step 2
======
Make sure r8 is saved and restored when (! CALGN(1)+0) == 1:

save r8:
-       str     lr, [sp, #-4]!
+       stmfd   sp!, {r8, lr}

and restore r8 on both exit paths:
-       ldmeqfd sp!, {pc}               @ Now <64 bytes to go.
+       ldmeqfd sp!, {r8, pc}           @ Now <64 bytes to go.
(...)
        tst     r2, #16
        stmneia ip!, {r1, r3, r8, lr}
-       ldr     lr, [sp], #4
+       ldmfd   sp!, {r8, lr}

Step 3
======
Make sure r8 is saved and restored when (! CALGN(1)+0) == 0:

save r8:
-       stmfd   sp!, {r4-r7, lr}
+       stmfd   sp!, {r4-r8, lr}

and restore r8 on both exit paths:
        bgt     3b
-       ldmeqfd sp!, {r4-r7, pc}
+       ldmeqfd sp!, {r4-r8, pc}
(...)
        tst     r2, #16
        stmneia ip!, {r4-r7}
-       ldmfd   sp!, {r4-r7, lr}
+       ldmfd   sp!, {r4-r8, lr}

Step 4
======
Rewrite register list "r4-r7, r8" as "r4-r8".

Signed-off-by: Ivan Djelic <[email protected]>
Reviewed-by: Nicolas Pitre <[email protected]>
Signed-off-by: Dirk Behme <[email protected]>
Signed-off-by: Russell King <[email protected]>
Alberto96 added a commit to Alberto96/samsung-kernel-aries that referenced this pull request Dec 21, 2013
…optimizations
