Automated distributed data race detection from distributed logs via SMT constraint solving.
Report Bug
·
Request Feature
Data races have been shown to be a frequent source of distributed concurrency bugs in major distributed programs.
They occur when two memory accesses to the same variable (where at least one of them is a write) are concurrent
and thus, their relative order is unpredictable: if two memory accesses A
and B
are concurrent, then
A
may occur before B
or B
may occur before A
.
SPIDER provides automated distributed data race detection via SMT constraint solving. It receives as input an event trace captured at runtime, and generates a happens-before model that encodes the causal relationships between the events. After building the constraint system, SPIDER resorts to Z3, an SMT solver, to check for data races. It is capable of detecting race conditions even if no bug caused by race conditions manifests itself in the traced execution. SPIDER was initially intended to be run alongside Minha (in fact, it was originally named Minha-Checker) but it became an independent tool.
In order to compile SPIDER on your machine, you first need to have the following packages installed:
- Z3 in your PATH
- falcon-taz - the easiest way to obtain falcon-taz is to follow these steps:
# 1. clone the repo
git clone [email protected]:fntneves/falcon.git
# 2. change directory to the falcon-taz folder
cd falcon/falcon-taz
# 3. checkout the branch fix_message_handler (might not be
# needed in future versions of falcon-taz)
git checkout fix_message_handler
# 4. install the package using Maven
mvn clean install
# 1. clone this repo
git clone [email protected]:jcp19/SPIDER.git
# 2. build a jar using Maven
mvn clean package
After these commands, SPIDER will be available as a .jar in your target/
folder.
SPIDER was designed to inspect event traces from distributed systems and look for data races. Currently, it can only read traces in the Event Trace JSON API. As such, before using SPIDER, you must collect a trace from your target system using a suitable method. The language in which your target system is written is not important as long as there is an effective way to trace your system at runtime - take a look at Minha and falcon-tracer.
Having built the jar file, you can run it with a combination of the following command and flags:
java -jar spider.jar
-d,--dataRaces Check for data races.
-f,--file <arg> File containing the distributed trace.
-m,--messageRaces Check for message races.
-r,--removeRedundancy Removes redundant events before checking for race
conditions.
-f
is the only required flag and it is used for specifying the path to your trace file. The flag -d
indicates that SPIDER will look for data races between different threads while -m
indicates
that SPIDER should look into data races caused by racing messages (still an experimental feature!).
-r
is used to perform a (safe) optimization which decreases the time needed to find data races in the trace.
During its execution, SPIDER outputs a list of pairs of code locations that contain racing instructions.
This section provides a small but descriptive example of how to use SPIDER. We show how to find data races in the following Java program:
class Example1 implements Runnable {
static int counter = 0;
public static void main(String[] args) {
Thread t1 = new Thread(new Ex1());
t1.start();
counter++;
System.out.println("The value of counter is " + counter);
}
public void run() {
counter++;
}
}
This is by no means a complex program (heck, it only runs two simple threads on a single node!) but
it suffers from two data races - that's enough to demonstrate SPIDER's functionality.
The two races in question occur between the instructions on lines 7 and 12 because the
increments to the static variable counter
are not synchronized
and
between the instructions on lines 8 and 12 because the increment operation on line 12 is not
causally related with the reading of counter
's value on line 8. Because of this,
Example1 has two possible outputs:
- Possible Output #1:
The value of counter is 1
- Possible Output #2:
The value of counter is 2
Running this program with a tracing mechanism will produce a trace similar to this one:
{
"thread": "[email protected]",
"type": "START",
"timestamp": 1525270020050
}{
"thread": "[email protected]",
"type": "CREATE",
"child": "[email protected]",
"timestamp": 1525270020069
}{
"thread": "[email protected]",
"type": "START",
"timestamp": 1525270021812
}{
"thread": "[email protected]",
"loc": "demos.Example1.main.7",
"variable": "demos.Example1.counter",
"type": "W",
"timestamp": 1525270021837
}{
"thread": "[email protected]",
"loc": "demos.Example1.main.8",
"variable": "demos.Example1.counter",
"type": "R",
"timestamp": 1525270021852
}{
"thread": "[email protected]",
"loc": "demos.Example1.run.12",
"variable": "demos.Example1.counter",
"type": "W",
"timestamp": 1525270021875
}{
"thread": "[email protected]",
"type": "END",
"timestamp": 1525270022188
}{
"thread": "[email protected]",
"type": "END",
"timestamp": 1525270022200
}
Note: if you are looking for more complex examples, we suggest you take a look at the programs in this repo whose traces are available here.
We are now ready to run SPIDER with the flag -d
to detect data races resulting from concurrent accesses in
different threads, which will produce the following simplified output:
java -jar spider.jar -d -f traces/Example1.log
[main] Loading events from traces/Example1.log
[main] Trace successfully loaded!
[RaceDetector] Generate program order constraints
[RaceDetector] Generate fork-start constraints
[RaceDetector] Generate join-end constraints
[RaceDetector] Generate wait-notify constraints
[RaceDetector] Generate locking constraints
[RaceDetector] Generate communication constraints
[RaceDetector] Generate message handling constraints
Data Race Candidates:
-- ([email protected][email protected],
[email protected][email protected])
-- ([email protected][email protected],
[email protected][email protected])
Actual Data Races:
-- ([email protected][email protected],
[email protected][email protected])
-- ([email protected][email protected],
[email protected][email protected])
=======================
RESULTS
=======================
> Number of events in trace: 8
> Number of constraints in model: 6
> Time to generate constraint model: 0.001 seconds
## DATA RACES:
> Number of data race candidates: 2
> Number of actual data races: 2
> Time to check all candidates: 0.007 seconds
SPIDER's output is structured in the following maner: after some logging messages, it presents the candidate pairs of events which form a data race.
Even though the candidate pairs look complicated, they are actually very easy to read - each element of the pair is an identifier of an event which starts with
W
if it is a write or withR
if it is a read and is followed by the accessed variable, the thread where it occurs, the node where the thread is running and the line of code responsible for the event.
A pair of events constitutes a candidate if both events access the same variable from different threads and at least one of them is a write. After listing the candidate pairs, the output lists the actual data races, i.e. the candidate pairs whose events are concurrent. Finally, a brief summary is shown.
SPIDER builds a causality relation <
between the traced events such that events a
and b
are related (a < b
) if and only if a
must occur before b
in any execution of the system.
Note:
<
is in fact a poset.
Using <
, it is possible to detect which candidate pairs of memory accesses are in fact concurrent -
they are exactly the pairs of the form (a,b)
such that neither a < b
nor b < a
.
If you're more of a visual learner, you can look at the causality relation <
as a directed graph whose nodes
are the traced events and whose edges connect nodes a
and b
iff a < b
. Thus, ignoring the actual
identifiers for the events and threads, the causality relation for the Example1
trace shown above
looks like this:
this image was obtained in Alloy with this model of the causality relation.
Determining wether two events are concurrent can also be done "visually" - two events are concurrent if there
is not a directed path in the graph that contains both events. For example, there's no directed path that goes
through Write0
and Write1
. Thus, they are concurrent.
Internally, SPIDER builds the causality relation as a set of constraints (e.g. for Example1) and queries an SMT solver to decide if two memory accesses are concurrent given those constraints.
If you are curious and want to know more about this project (e.g. how we generate the constraints, benchmarks, etc) check out our TAP 2020 paper (draft).
See the open issues for a list of proposed features and known issues.
SPIDER was developed by
SPIDER couldn't be developed without the work that has been put into some great projects such as