
Job Log failures analysis


Find main failed reason

  1. Open the job log with: less latest/job.log

    • For large log files, to reduce noise, you may want to open the file with: 'grep -v "compaction -" job.log | grep -v "LeveledManifest -" | grep -v "Timed out waiting for server response" | grep -v "stream_session -" | grep -v "range_streamer -" | less'

    • If you would also like to filter out the cassandra-stress (c-s) output, you can add: grep -v "0, 0,"

  2. Skip to the end of the big file with G (press Ctrl+C if less gets stuck calculating line numbers)

  3. Search upward for the keyword 'Cleaning up' (type ?Cleaning up, or press N to repeat a / search in the reverse direction); the main failure reason (backtrace) is shown around there. A non-interactive alternative is sketched below.
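
The non-interactive sketch below assumes the 'Cleaning up' marker is present in the log and simply prints some context before each occurrence (the 40 lines of context are an arbitrary choice):

grep -n -B 40 'Cleaning up' latest/job.log | less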

Search for the root-cause

In many cases, something bad happens during one nemesis and the test proceeds to the next one. As soon as the test reaches the next "harmful" nemesis, the running load may crash. In such cases we need to look for the last good state of the cluster, and then find the nemesis that first put the cluster into the degraded state. For example, if a nemesis like "TerminateAndReplace" failed to add a replacement node, the cluster is left with one node in the "DN" state. Search from the end of the file for the last good state of that node (for example " UN ip_address"), then search for the nemesis that took place after this known good state (by searching for "Set current"). Another good option is to search for "Traceback" or "traceback", which shows Python exceptions raised during the test.
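
The same searches can be done with grep. A minimal sketch, assuming the affected node's IP is 10.0.1.1 (replace it with the real address):

# last time the node was still reported as UN
grep -n ' UN 10.0.1.1' latest/job.log | tail -1
# nemesis switches, to find the nemesis that started after that point
grep -n 'Set current' latest/job.log | less
# python exceptions raised during the test
grep -n -i 'traceback' latest/job.log | less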

Check if coredump occurred

less latest/job.log|grep -i 'Found coredump'

Check Reactor stall time

less latest/job.log |grep -i 'Reactor stall'

There is a tool from Yulia that analyzes the backtraces and generates decoded, de-duplicated stall backtraces. It is very helpful for huge backtrace messages: https://github.com/scylladb/scylla-cluster-tests/pull/751
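
Before decoding, it can help to see how long and how frequent the stalls are. A minimal sketch, assuming the stall messages have the usual "Reactor stalled for N ms" form:

# count stalls grouped by reported duration, most frequent first
grep -o 'Reactor stalled for [0-9]* ms' latest/job.log | sort | uniq -c | sort -rn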

Check if there are backtraces other than Reactor stall

# number of Reactor stall messages
less latest/job.log |grep 'Reactor stall' |wc -l
# total number of Backtrace messages
less latest/job.log |grep 'Backtrace:' |wc -l
# number of Backtrace messages other than `Reactor stall`
less latest/job.log |grep 'Backtrace:' |grep -v 'Reactor stall' |wc -l

Env setup keywords

  • nodes ready: how many nodes are ready
  • SSH access: get the instance IP
  • Node setup failed: node setup failed
  • Destroy nodes:
  • scylla_setup: scylla setup
  • Scylla version: get the version string
  • :9042: CQL server starts
  • :3000: monitoring-related URLs
  • Try to start prometheus API server on port: Prometheus API server port
  • 'I/O Scheduler is not properly configured!'
  • 'Failed to start Scylla Server'
  • 'Failed to write into /proc/irq'
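
To scan the setup phase for several of these keywords in one pass, a sketch (the keyword list below is only an example, extend it as needed):

less latest/job.log |grep -E 'nodes ready|Node setup failed|Scylla version|Failed to start Scylla Server' |less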

More error keywords

less latest/job.log |grep -i exception |less
less latest/job.log |grep -i error |less
less latest/job.log |grep -i fail |less
  • exception
  • error
  • fail
  • killed
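
The same keywords can also be checked in a single pass, a sketch:

less latest/job.log |grep -i -E 'exception|error|fail|killed' |less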

Cassandra-stress output keywords

  • END$
  • FAILURE$
  • Cannot find summary in c-stress results
  • Stress script content:
  • TAG: loader_idx:: when multiple workloads run, identifies the loader/cpu/keyspace idx
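
To pull out the cassandra-stress summary and failure lines, a sketch based on the keywords above:

less latest/job.log |grep -E 'END$|FAILURE$|Cannot find summary in c-stress results' |less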

Filter executed nemesis

By cross-referencing the log timestamps with the nemesis start/finish markers below, you can tell during which nemesis, and at which step, an error occurred.

$ less latest/job.log|egrep '(<<<<|>>>>)'
2019-01-21 12:01:53,677 nemesis          L0395 INFO | sdcm.nemesis.ChaosMonkey: >>>>>>>>>>>>>Started random_disrupt_method terminate_and_replace_node
2019-01-21 12:08:10,801 nemesis          L0399 INFO | sdcm.nemesis.ChaosMonkey: <<<<<<<<<<<<<Finished random_disrupt_method terminate_and_replace_node
2019-01-21 12:14:08,016 nemesis          L0395 INFO | sdcm.nemesis.ChaosMonkey: >>>>>>>>>>>>>Started random_disrupt_method nodetool_refresh
2019-01-21 12:20:14,376 nemesis          L0395 INFO | sdcm.nemesis.ChaosMonkey: >>>>>>>>>>>>>Started random_disrupt_method modify_table
....

More output

$ less latest/job.log|egrep -i '(<<<<|>>>>|nemesis)'

Results

  • .png: Grafana snapshot URL
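
To list the snapshot links directly, a sketch based on the keyword above:

less latest/job.log |grep '\.png' |less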

Rolling Upgrade

$ less sct.log |grep -E 'ing a Node|begin$|ended$|Run some Queries|Populate DB with'