
Job Log failures analysis


Find main failed reason

  1. Open the job log with: less latest/job.log

    • For large log files, to reduce noise, you may want to open the file with: 'grep -v "compaction -" job.log | grep -v "LeveledManifest -" | grep -v "Timed out waiting for server response" | grep -v "stream_session -" | grep -v "range_streamer -" | less'

    • If you would also like to filter out the cassandra-stress (c-s) output, you can add: grep -v "0, 0,"

  2. Skip to the end of the big file with G (press Ctrl+C if less gets stuck calculating line numbers)

  3. Search upward for the keyword 'Cleaning up' (type ?Cleaning up, or press N to repeat a / search in the reverse direction); the main failure reason (backtrace) is shown around there. A non-interactive alternative is sketched below.
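
The non-interactive sketch below assumes the 'Cleaning up' marker is present in the log and simply prints some context before each occurrence (the 40 lines of context are an arbitrary choice):

grep -n -B 40 'Cleaning up' latest/job.log | less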

Search for the root-cause

In many cases, something bad happens during one nemesis and the test proceeds to the next one. As soon as the test reaches the next "harmful" nemesis, the running load may crash. In such cases we need to look for the last good state of the cluster, and then find the nemesis that first put the cluster into the degraded state. For example, if a nemesis like "TerminateAndReplace" failed to add a replacement node, the cluster is left with one node in the "DN" state. Search from the end of the file for the last good state of that node (for example " UN ip_address"), then search for the nemesis that took place after this known good state (by searching for "Set current"). Another good option is to search for "Traceback" or "traceback", which shows Python exceptions raised during the test.
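
The same searches can be done with grep. A minimal sketch, assuming the affected node's IP is 10.0.1.1 (replace it with the real address):

# last time the node was still reported as UN
grep -n ' UN 10.0.1.1' latest/job.log | tail -1
# nemesis switches, to find the nemesis that started after that point
grep -n 'Set current' latest/job.log | less
# python exceptions raised during the test
grep -n -i 'traceback' latest/job.log | less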

Check if coredump occurred

less latest/job.log|grep -i 'Found coredump'

Check Reactor stall time

less latest/job.log |grep -i 'Reactor stall'

There is a tool from Yulia that analyzes the backtraces and generates decoded, de-duplicated stall backtraces. It is very helpful for huge backtrace messages: https://github.com/scylladb/scylla-cluster-tests/pull/751
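
Before decoding, it can help to see how long and how frequent the stalls are. A minimal sketch, assuming the stall messages have the usual "Reactor stalled for N ms" form:

# count stalls grouped by reported duration, most frequent first
grep -o 'Reactor stalled for [0-9]* ms' latest/job.log | sort | uniq -c | sort -rn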

Check if there are backtraces other than Reactor stall

# number of Reactor stall messages
less latest/job.log |grep 'Reactor stall' |wc -l
# total number of Backtrace messages
less latest/job.log |grep 'Backtrace:' |wc -l
# number of Backtrace messages other than `Reactor stall`
less latest/job.log |grep 'Backtrace:' |grep -v 'Reactor stall' |wc -l

Env setup keywords

  • nodes ready: how many nodes are ready
  • SSH access: get the instance IP
  • Node setup failed: node setup failed
  • Destroy nodes:
  • scylla_setup: scylla setup
  • Scylla version: get the version string
  • :9042: CQL server starts
  • :3000: monitoring-related URLs
  • Try to start prometheus API server on port: Prometheus API server port
  • 'I/O Scheduler is not properly configured!'
  • 'Failed to start Scylla Server'
  • 'Failed to write into /proc/irq'
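
To scan the setup phase for several of these keywords in one pass, a sketch (the keyword list below is only an example, extend it as needed):

less latest/job.log |grep -E 'nodes ready|Node setup failed|Scylla version|Failed to start Scylla Server' |less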

More error keywords

less latest/job.log |grep -i exception |less
less latest/job.log |grep -i error |less
less latest/job.log |grep -i fail |less
  • exception
  • error
  • fail
  • killed
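
The same keywords can also be checked in a single pass, a sketch:

less latest/job.log |grep -i -E 'exception|error|fail|killed' |less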

Cassandra-stress output keywords

  • END$
  • FAILURE$
  • Cannot find summary in c-stress results
  • Stress script content:
  • TAG: loader_idx:: when multiple workloads run, identifies the loader/cpu/keyspace idx
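
To pull out the cassandra-stress summary and failure lines, a sketch based on the keywords above:

less latest/job.log |grep -E 'END$|FAILURE$|Cannot find summary in c-stress results' |less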

Filter executed nemesis

By cross-referencing the log timestamps with the nemesis start/finish markers below, you can tell during which nemesis, and at which step, an error occurred.

$ less latest/job.log|egrep '(<<<<|>>>>)'
2019-01-21 12:01:53,677 nemesis          L0395 INFO | sdcm.nemesis.ChaosMonkey: >>>>>>>>>>>>>Started random_disrupt_method terminate_and_replace_node
2019-01-21 12:08:10,801 nemesis          L0399 INFO | sdcm.nemesis.ChaosMonkey: <<<<<<<<<<<<<Finished random_disrupt_method terminate_and_replace_node
2019-01-21 12:14:08,016 nemesis          L0395 INFO | sdcm.nemesis.ChaosMonkey: >>>>>>>>>>>>>Started random_disrupt_method nodetool_refresh
2019-01-21 12:20:14,376 nemesis          L0395 INFO | sdcm.nemesis.ChaosMonkey: >>>>>>>>>>>>>Started random_disrupt_method modify_table
....

More output

$ less latest/job.log|egrep -i '(<<<<|>>>>|nemesis)'

Results

  • .png: Grafana snapshot URL
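
To list the snapshot links directly, a sketch based on the keyword above:

less latest/job.log |grep '\.png' |less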

Rolling Upgrade

$ less sct.log |grep -E 'ing a Node|begin$|ended$|Run some Queries|Populate DB with'