
Detection, characterization, and mitigation of faults on supercomputers are
complicated by the large variety of interacting subsystems. Failures often
manifest as vague observations like ``my job failed'' and may result from
faults in system
hardware/firmware/software, filesystems, networks, resource manager state, and
more.  Data such as system logs, environmental metrics, job history, cluster
state snapshots, published outage notices, and user reports are routinely
collected. These data are typically stored in different locations and formats
for specific use by targeted consumers. Combining data sources for analysis
generally requires a consumer-dependent custom approach.  We present a
vocabulary for describing data, including format and access details; an
annotation schema for attaching observations to a dataset; and tools to aid in
discovery and publication of system-related insights. We also present case studies in
which our analysis tools use information from disparate data sources to
investigate failures and performance issues from user and administrator perspectives.