Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Out of memory issue causes kernel to kill process_replication_queue job #14

Open
csjx opened this issue Dec 15, 2017 · 1 comment
Open
Labels

Comments

@csjx
Copy link
Member

csjx commented Dec 15, 2017

On ncei.nceas.ucsb.edu, which is a small test server with 8GB of RAM, we're seeing the kernel oom-killer repeatedly kill the python process that results from the cron job running python manage.py process_replication_queue. Since the job runs once per minute, and the vast majority of them succeed, I suspect it's jobs that are replicating large files across the wire that are causing the issue, and that perhaps there's some code that is spooling data to memory prior to writing to disk. Here's a sample from /var/log/syslog:

Dec 13 06:35:08 ncei kernel: [10412914.856904] gmain invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Dec 13 06:35:08 ncei kernel: [10412914.856912] gmain cpuset=/ mems_allowed=0
Dec 13 06:35:08 ncei kernel: [10412914.856924] CPU: 3 PID: 1175 Comm: gmain Not tainted 4.4.0-91-generic #114-Ubuntu
Dec 13 06:35:08 ncei kernel: [10412914.856927] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
Dec 13 06:35:08 ncei kernel: [10412914.856932]  0000000000000286 9ecc974be8221a0f ffff880035ba79f8 ffffffff813f9c83
Dec 13 06:35:08 ncei kernel: [10412914.856937]  ffff880035ba7bb0 ffff8801091fe600 ffff880035ba7a68 ffffffff8120b88e
Dec 13 06:35:08 ncei kernel: [10412914.856959]  ffff880035ba7a30 ffff880035ba7a40 ffffffff814dae5a 00000000ffffffff
Dec 13 06:35:08 ncei kernel: [10412914.856964] Call Trace:
Dec 13 06:35:08 ncei kernel: [10412914.856981]  [<ffffffff813f9c83>] dump_stack+0x63/0x90
Dec 13 06:35:08 ncei kernel: [10412914.856991]  [<ffffffff8120b88e>] dump_header+0x5a/0x1c5
Dec 13 06:35:08 ncei kernel: [10412914.857001]  [<ffffffff814dae5a>] ? virtballoon_oom_notify+0x2a/0x80
Dec 13 06:35:08 ncei kernel: [10412914.857010]  [<ffffffff81192d62>] oom_kill_process+0x202/0x3c0
Dec 13 06:35:08 ncei kernel: [10412914.857014]  [<ffffffff81193189>] out_of_memory+0x219/0x460
Dec 13 06:35:08 ncei kernel: [10412914.857021]  [<ffffffff81199178>] __alloc_pages_slowpath.constprop.88+0x938/0xad0
Dec 13 06:35:08 ncei kernel: [10412914.857026]  [<ffffffff81199596>] __alloc_pages_nodemask+0x286/0x2a0
Dec 13 06:35:08 ncei kernel: [10412914.857040]  [<ffffffff811e30dc>] alloc_pages_current+0x8c/0x110
Dec 13 06:35:08 ncei kernel: [10412914.857048]  [<ffffffff8118f32b>] __page_cache_alloc+0xab/0xc0
Dec 13 06:35:08 ncei kernel: [10412914.857051]  [<ffffffff8119183a>] filemap_fault+0x14a/0x3f0
Dec 13 06:35:08 ncei kernel: [10412914.857067]  [<ffffffff812a3a96>] ext4_filemap_fault+0x36/0x50
Dec 13 06:35:08 ncei kernel: [10412914.857077]  [<ffffffff811be850>] __do_fault+0x50/0xe0
Dec 13 06:35:08 ncei kernel: [10412914.857081]  [<ffffffff811c2372>] handle_mm_fault+0xfa2/0x1820
Dec 13 06:35:08 ncei kernel: [10412914.857091]  [<ffffffff8125ce77>] ? eventfd_ctx_read+0x67/0x210
Dec 13 06:35:08 ncei kernel: [10412914.857097]  [<ffffffff811ed9e7>] ? kmem_cache_alloc+0x187/0x1f0
Dec 13 06:35:08 ncei kernel: [10412914.857107]  [<ffffffff8106b577>] __do_page_fault+0x197/0x400
Dec 13 06:35:08 ncei kernel: [10412914.857112]  [<ffffffff81255d34>] ? SyS_inotify_add_watch+0x1d4/0x350
Dec 13 06:35:08 ncei kernel: [10412914.857115]  [<ffffffff8106b802>] do_page_fault+0x22/0x30
Dec 13 06:35:08 ncei kernel: [10412914.857127]  [<ffffffff81844178>] page_fault+0x28/0x30
Dec 13 06:35:08 ncei kernel: [10412914.857130] Mem-Info:
Dec 13 06:35:08 ncei kernel: [10412914.857139] active_anon:1949307 inactive_anon:31638 isolated_anon:0
Dec 13 06:35:08 ncei kernel: [10412914.857139]  active_file:78 inactive_file:51 isolated_file:1
Dec 13 06:35:08 ncei kernel: [10412914.857139]  unevictable:915 dirty:0 writeback:0 unstable:0
Dec 13 06:35:08 ncei kernel: [10412914.857139]  slab_reclaimable:12314 slab_unreclaimable:6499
Dec 13 06:35:08 ncei kernel: [10412914.857139]  mapped:35771 shmem:56441 pagetables:6109 bounce:0
Dec 13 06:35:08 ncei kernel: [10412914.857139]  free:25630 free_pcp:208 free_cma:0
Dec 13 06:35:08 ncei kernel: [10412914.857147] Node 0 DMA free:15884kB min:132kB low:164kB high:196kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Dec 13 06:35:08 ncei kernel: [10412914.857159] lowmem_reserve[]: 0 3449 7928 7928 7928
Dec 13 06:35:08 ncei kernel: [10412914.857165] Node 0 DMA32 free:47136kB min:29348kB low:36684kB high:44020kB active_anon:3405552kB inactive_anon:58248kB active_file:304kB inactive_file:196kB unevictable:2136kB isolated(anon):0kB isolated(file):4kB present:3653620kB managed:3572932kB mlocked:2136kB dirty:0kB writeback:0kB mapped:76620kB shmem:110656kB slab_reclaimable:16728kB slab_unreclaimable:10004kB kernel_stack:2912kB pagetables:14628kB unstable:0kB bounce:0kB free_pcp:832kB local_pcp:84kB free_cma:0kB writeback_tmp:0kB pages_scanned:8936 all_unreclaimable? yes
Dec 13 06:35:08 ncei kernel: [10412914.857174] lowmem_reserve[]: 0 0 4478 4478 4478
Dec 13 06:35:08 ncei kernel: [10412914.857179] Node 0 Normal free:39500kB min:38100kB low:47624kB high:57148kB active_anon:4391676kB inactive_anon:68304kB active_file:8kB inactive_file:8kB unevictable:1524kB isolated(anon):0kB isolated(file):0kB present:4718592kB managed:4586184kB mlocked:1524kB dirty:0kB writeback:0kB mapped:66464kB shmem:115108kB slab_reclaimable:32528kB slab_unreclaimable:15976kB kernel_stack:2032kB pagetables:9808kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:12944 all_unreclaimable? yes
Dec 13 06:35:08 ncei kernel: [10412914.857187] lowmem_reserve[]: 0 0 0 0 0
Dec 13 06:35:08 ncei kernel: [10412914.857192] Node 0 DMA: 1*4kB (U) 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15884kB
Dec 13 06:35:08 ncei kernel: [10412914.857209] Node 0 DMA32: 231*4kB (UME) 204*8kB (UE) 297*16kB (UME) 221*32kB (UME) 102*64kB (UME) 48*128kB (UME) 29*256kB (ME) 13*512kB (ME) 6*1024kB (E) 0*2048kB 0*4096kB = 47276kB
Dec 13 06:35:08 ncei kernel: [10412914.857225] Node 0 Normal: 479*4kB (UME) 238*8kB (UMEH) 166*16kB (UMEH) 138*32kB (UMEH) 65*64kB (UMEH) 29*128kB (UEH) 17*256kB (UMEH) 8*512kB (UME) 4*1024kB (EH) 0*2048kB 2*4096kB (H) = 39500kB
Dec 13 06:35:08 ncei kernel: [10412914.857245] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Dec 13 06:35:08 ncei kernel: [10412914.857248] 57215 total pagecache pages
Dec 13 06:35:08 ncei kernel: [10412914.857251] 0 pages in swap cache
Dec 13 06:35:08 ncei kernel: [10412914.857254] Swap cache stats: add 174259381, delete 174259381, find 59360304/74390678
Dec 13 06:35:08 ncei kernel: [10412914.857256] Free swap  = 0kB
Dec 13 06:35:08 ncei kernel: [10412914.857258] Total swap = 0kB
Dec 13 06:35:08 ncei kernel: [10412914.857260] 2097049 pages RAM
Dec 13 06:35:08 ncei kernel: [10412914.857261] 0 pages HighMem/MovableOnly
Dec 13 06:35:08 ncei kernel: [10412914.857263] 53295 pages reserved
Dec 13 06:35:08 ncei kernel: [10412914.857264] 0 pages cma reserved
Dec 13 06:35:08 ncei kernel: [10412914.857266] 0 pages hwpoisoned
Dec 13 06:35:08 ncei kernel: [10412914.857268] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Dec 13 06:35:08 ncei kernel: [10412914.857280] [  351]     0   351    10972     1929      24       3        0             0 systemd-journal
Dec 13 06:35:08 ncei kernel: [10412914.857284] [  407]     0   407    25742      412      18       3        6             0 lvmetad
Dec 13 06:35:08 ncei kernel: [10412914.857289] [  915]   107   915    10757      450      26       3        5          -900 dbus-daemon
Dec 13 06:35:08 ncei kernel: [10412914.857292] [  986]     0   986     1100      309       8       3        0             0 acpid
Dec 13 06:35:08 ncei kernel: [10412914.857296] [  988]     0   988     7252      434      20       3        0             0 cron
Dec 13 06:35:08 ncei kernel: [10412914.857299] [  993]     0   993   132578      697      26       4        0             0 lxcfs
Dec 13 06:35:08 ncei kernel: [10412914.857303] [  997]     0   997     6511      399      19       3       17             0 atd
Dec 13 06:35:08 ncei kernel: [10412914.857307] [  999]   104   999    64099      422      27       4       15             0 rsyslogd
Dec 13 06:35:08 ncei kernel: [10412914.857311] [ 1005]     0  1005    68972      361      36       3        4             0 accounts-daemon
Dec 13 06:35:08 ncei kernel: [10412914.857314] [ 1007]     0  1007     7590      655      18       3        0             0 systemd-logind
Dec 13 06:35:08 ncei kernel: [10412914.857318] [ 1052]   112  1052     1884       79       9       3        0             0 vnstatd
Dec 13 06:35:08 ncei kernel: [10412914.857322] [ 1061]     0  1061    16380      547      36       3        0         -1000 sshd
Dec 13 06:35:08 ncei kernel: [10412914.857325] [ 1087]     0  1087     1306      456       9       3        1             0 iscsid
Dec 13 06:35:08 ncei kernel: [10412914.857329] [ 1088]     0  1088     1431      877       9       3        0           -17 iscsid
Dec 13 06:35:08 ncei kernel: [10412914.857332] [ 1163]     0  1163     3985      347      14       3        0             0 agetty
Dec 13 06:35:08 ncei kernel: [10412914.857336] [ 1183]     0  1183     4868      251      14       3        0             0 irqbalance
Dec 13 06:35:08 ncei kernel: [10412914.857339] [ 1202]     0  1202     3764      432      13       3        0             0 xinetd
Dec 13 06:35:08 ncei kernel: [10412914.857342] [ 1203]     0  1203    69278      946      39       3        7             0 polkitd
Dec 13 06:35:08 ncei kernel: [10412914.857346] [ 1355]     0  1355    73096     2462      44       3        0             0 fail2ban-server
Dec 13 06:35:08 ncei kernel: [10412914.857350] [ 1539]  1000  1539    11311      468      26       3        0             0 systemd
Dec 13 06:35:08 ncei kernel: [10412914.857353] [ 1545]  1000  1545    15276      457      32       3        0             0 (sd-pam)
Dec 13 06:35:08 ncei kernel: [10412914.857356] [ 1718]  1000  1718    12141     5185      28       3        0             0 tmux
Dec 13 06:35:08 ncei kernel: [10412914.857360] [ 1726]  1000  1726     5697      927      16       3        0             0 bash
Dec 13 06:35:08 ncei kernel: [10412914.857364] [31232]   111 31232    27508      167      25       3        0             0 ntpd
Dec 13 06:35:08 ncei kernel: [10412914.857368] [29620]     0 29620    11050      459      22       3        0         -1000 systemd-udevd
Dec 13 06:35:08 ncei kernel: [10412914.857372] [26198]  1002 26198    11312      483      26       3        0             0 systemd
Dec 13 06:35:08 ncei kernel: [10412914.857375] [26199]  1002 26199    15291      473      32       3        0             0 (sd-pam)
Dec 13 06:35:08 ncei kernel: [10412914.857379] [ 9572]  1002  9572     1831      280       9       3       16             0 tail
Dec 13 06:35:08 ncei kernel: [10412914.857383] [10237]     0 10237   111212     3085      38       6       11          -900 snapd
Dec 13 06:35:08 ncei kernel: [10412914.857386] [28237]   113 28237    73676      644      68       3        0          -900 postgres
Dec 13 06:35:08 ncei kernel: [10412914.857390] [28239]   113 28239    73709    33942     127       3        0             0 postgres
Dec 13 06:35:08 ncei kernel: [10412914.857394] [28240]   113 28240    73676      715      95       3        0             0 postgres
Dec 13 06:35:08 ncei kernel: [10412914.857397] [28241]   113 28241    73676     1691      61       3        0             0 postgres
Dec 13 06:35:08 ncei kernel: [10412914.857401] [28242]   113 28242    73770      576      64       3        0             0 postgres
Dec 13 06:35:08 ncei kernel: [10412914.857404] [28243]   113 28243    37456      533      57       3        0             0 postgres
Dec 13 06:35:08 ncei kernel: [10412914.857407] [16097]     0 16097     3344      154      10       3        0             0 mdadm
Dec 13 06:35:08 ncei kernel: [10412914.857411] [21806]     0 21806    24724      901      52       3        0             0 apache2
Dec 13 06:35:08 ncei kernel: [10412914.857415] [14547]  1000 14547     5675      883      16       4        0             0 bash
Dec 13 06:35:08 ncei kernel: [10412914.857419] [ 5773]     0  5773    23842      639      51       3        0             0 sshd
Dec 13 06:35:08 ncei kernel: [10412914.857423] [ 5775]  1004  5775    11312      482      26       3        0             0 systemd
Dec 13 06:35:08 ncei kernel: [10412914.857426] [ 5778]  1004  5778    15338      521      32       3        0             0 (sd-pam)
Dec 13 06:35:08 ncei kernel: [10412914.857430] [ 5857]  1004  5857    23842      552      49       3        0             0 sshd
Dec 13 06:35:08 ncei kernel: [10412914.857433] [ 5858]  1004  5858     5669      881      16       3        0             0 bash
Dec 13 06:35:08 ncei kernel: [10412914.857437] [24535]  1005 24535   183205    32535     196       4        0             0 apache2
Dec 13 06:35:08 ncei kernel: [10412914.857440] [24536]  1005 24536   199979    33775     215       4        0             0 apache2
Dec 13 06:35:08 ncei kernel: [10412914.857444] [24537]    33 24537   408886     2232     101       5        0             0 apache2
Dec 13 06:35:08 ncei kernel: [10412914.857448] [24627]    33 24627   507215     1607     107       5        0             0 apache2
Dec 13 06:35:08 ncei kernel: [10412914.857451] [25674]     0 25674    12555      492      31       4        0             0 cron
Dec 13 06:35:08 ncei kernel: [10412914.857455] [25675]  1005 25675     1127      156       7       3        0             0 sh
Dec 13 06:35:08 ncei kernel: [10412914.857459] [25681]  1005 25681  1923548  1831359    3655      11        0             0 python
Dec 13 06:35:08 ncei kernel: [10412914.857462] [25695]   113 25695    75674     5735     127       3        0             0 postgres
Dec 13 06:35:08 ncei kernel: [10412914.857465] [25881]     0 25881    12555      311      31       4        0             0 cron
Dec 13 06:35:08 ncei kernel: [10412914.857469] [25882]     0 25882    12555      311      31       4        0             0 cron
Dec 13 06:35:08 ncei kernel: [10412914.857472] [25885]  1005 25885     1127      170       8       3        0             0 sh
Dec 13 06:35:08 ncei kernel: [10412914.857476] [25884]  1005 25884     1127      143       8       3        0             0 sh
Dec 13 06:35:08 ncei kernel: [10412914.857479] [25890]  1005 25890     6109      564      15       3        0             0 python
Dec 13 06:35:08 ncei kernel: [10412914.857483] [25891]  1005 25891     6109      908      14       3        0             0 python
Dec 13 06:35:08 ncei kernel: [10412914.857487] [25914]     0 25914     5258      381      16       3        0             0 check_mk_agent
Dec 13 06:35:08 ncei kernel: [10412914.857491] [25934]     0 25934    12555      382      31       4        0             0 cron
Dec 13 06:35:08 ncei kernel: [10412914.857494] [25935]     0 25935    12555      383      31       4        0             0 cron
Dec 13 06:35:08 ncei kernel: [10412914.857497] [25941]  1005 25941     1127       62       8       3        0             0 sh
Dec 13 06:35:08 ncei kernel: [10412914.857501] [25942]  1005 25942     1127       74       8       3        0             0 sh
Dec 13 06:35:08 ncei kernel: [10412914.857504] [25948]  1005 25948     5079      290      14       3        0             0 python
Dec 13 06:35:08 ncei kernel: [10412914.857507] [25949]  1005 25949     5143      309      14       3        0             0 python
Dec 13 06:35:08 ncei kernel: [10412914.857510] [25950]     0 25950     7563      356      19       3        0             0 ps
Dec 13 06:35:08 ncei kernel: [10412914.857514] [25951]     0 25951     3341      166      12       3        0             0 sed
Dec 13 06:35:08 ncei kernel: [10412914.857517] Out of memory: Kill process 25681 (python) score 897 or sacrifice child
Dec 13 06:35:08 ncei kernel: [10412914.861127] Killed process 25681 (python) total-vm:7694192kB, anon-rss:7325436kB, file-rss:0kB
@csjx csjx added the bug label Dec 15, 2017
@rogerdahl
Copy link
Collaborator

rogerdahl commented Jul 28, 2019

Hey Chris, I'm cleaning out some old issues on GitHub. Looking at the syslog for the bug you reported, there's certainly one Python process that stands out:

Dec 13 06:35:08 ncei kernel: [10412914.857459] [25681]  1005 25681  1923548  1831359    3655      11        0             0 python

7.5 GB of reserved virtual memory on an 8GB box would explain why it ends up getting killed. Did you confirm that it was a GMN related process? GMN is designed to stream pretty much everything, and management commands like the queue processing tasks should exit out right away if another command of the same kind is already running.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants