This repository contains a representative subset of the first-party DNN training workloads on Microsoft's internal Philly clusters. The trace is a sanitized subset of the workload described in "Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads" in ATC’19. This work was done as part of Microsoft Research's Project Fiddle.
This repository includes a Jupyter notebook that highlights the main characteristics of the traces and shows how to parse them (a huge thank you to Keshav Santhanam for putting this together).
We provide the trace as is. If you do use this trace in your research, please make sure to cite our ATC’19 paper (mentioned above).
- Dataset size: 6.6 GB
- Compressed dataset size: 0.98 GB
- Number of files: 5
- Duration: Jobs submitted between 2017-08-07 and 2017-12-22
- Total number of jobs: 117325
Description: Contains information about each job, including each individual successful scheduling attempt.
Format: JSON
Example entry:
{
    "status": "Pass",
    "vc": "ee9e8c",
    "jobid": "application_1506638472019_14199",
    "attempts": [
        {
            "start_time": "2017-10-07 01:12:09",
            "end_time": "2017-10-07 01:13:23",
            "detail": [
                {
                    "ip": "m47",
                    "gpus": [
                        "gpu0",
                        "gpu1",
                        "gpu2",
                        "gpu3",
                        "gpu4",
                        "gpu5",
                        "gpu6",
                        "gpu7"
                    ]
                }
            ]
        },
        {
            "start_time": "2017-10-07 01:13:30",
            "end_time": "2017-10-09 06:53:12",
            "detail": [
                {
                    "ip": "m412",
                    "gpus": [
                        "gpu0",
                        "gpu1",
                        "gpu2",
                        "gpu3",
                        "gpu4",
                        "gpu5",
                        "gpu6",
                        "gpu7"
                    ]
                }
            ]
        }
    ],
    "submitted_time": "2017-10-07 01:11:39",
    "user": "ce2f4c"
}
List of keys:
- `status`: The job's status upon completion. One of `Pass`, `Killed`, or `Failed`.
- `vc`: The hash of the virtual cluster the job was run in.
- `jobid`: The id of the job.
- `attempts`: A list of `dict`s where each `dict` has the following keys:
  - `start_time`: The start time of the attempt.
  - `end_time`: The end time of the attempt.
  - `detail`: A list of `dict`s where each `dict` has the following keys:
    - `ip`: The id of the server the attempt was scheduled on.
    - `gpus`: A list of GPUs used by the attempt.
- `submitted_time`: The time the job was submitted to the scheduler.
- `user`: A hash of the user id.
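The nested `attempts` / `detail` / `gpus` structure described above can be walked as in the following minimal sketch. The `job` dictionary is a copy of the example entry; the helper names `gpus_per_attempt` and `servers_per_attempt` are hypothetical and not part of the trace or the notebook.

```python
# Copy of the example entry above (the GPU lists are gpu0..gpu7).
job = {
    "status": "Pass",
    "vc": "ee9e8c",
    "jobid": "application_1506638472019_14199",
    "attempts": [
        {
            "start_time": "2017-10-07 01:12:09",
            "end_time": "2017-10-07 01:13:23",
            "detail": [{"ip": "m47", "gpus": [f"gpu{i}" for i in range(8)]}],
        },
        {
            "start_time": "2017-10-07 01:13:30",
            "end_time": "2017-10-09 06:53:12",
            "detail": [{"ip": "m412", "gpus": [f"gpu{i}" for i in range(8)]}],
        },
    ],
    "submitted_time": "2017-10-07 01:11:39",
    "user": "ce2f4c",
}


def gpus_per_attempt(job):
    """Total number of GPUs used by each attempt, summed over its servers."""
    return [
        sum(len(server["gpus"]) for server in attempt["detail"])
        for attempt in job["attempts"]
    ]


def servers_per_attempt(job):
    """Server ids (the "ip" field) each attempt was scheduled on."""
    return [
        [server["ip"] for server in attempt["detail"]]
        for attempt in job["attempts"]
    ]


print(gpus_per_attempt(job))     # [8, 8]
print(servers_per_attempt(job))  # [['m47'], ['m412']]
```

An attempt that spans several servers would have several entries in `detail`, which is why the GPU count is summed across them.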
Notes:
- A job may have no recorded scheduling attempts.
- A scheduling attempt may have no recorded `start_time` and/or `end_time`; this could be the result of a logging error.
- If a job has a `None` value for its last attempt's `end_time`, the job was still running at the time the snapshot was taken.
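The notes above imply that any code computing durations has to tolerate missing attempts and missing timestamps. The following is one possible sketch, not the notebook's implementation. It assumes a missing timestamp shows up either as Python `None` (JSON `null`) or as the literal string `"None"`; the helper names are hypothetical.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_time(value):
    """Parse a trace timestamp; return None when it is missing.

    Assumption: a missing value is either None or the string "None".
    """
    if value is None or value == "None":
        return None
    return datetime.strptime(value, TIME_FORMAT)


def attempt_durations(job):
    """Duration of each attempt in seconds, or None if a bound is missing."""
    durations = []
    for attempt in job.get("attempts", []):
        start = parse_time(attempt.get("start_time"))
        end = parse_time(attempt.get("end_time"))
        if start is None or end is None:
            durations.append(None)
        else:
            durations.append((end - start).total_seconds())
    return durations


def queueing_delay(job):
    """Seconds from submission to the first attempt's start, or None."""
    attempts = job.get("attempts", [])
    if not attempts:
        return None
    submitted = parse_time(job.get("submitted_time"))
    start = parse_time(attempts[0].get("start_time"))
    if submitted is None or start is None:
        return None
    return (start - submitted).total_seconds()


# Timestamps taken from the example entry above.
example = {
    "submitted_time": "2017-10-07 01:11:39",
    "attempts": [
        {"start_time": "2017-10-07 01:12:09", "end_time": "2017-10-07 01:13:23"},
        {"start_time": "2017-10-07 01:13:30", "end_time": "2017-10-09 06:53:12"},
    ],
}
print(attempt_durations(example))  # [74.0, 193182.0]
print(queueing_delay(example))     # 30.0
```

Returning `None` rather than zero keeps incomplete attempts distinguishable from very short ones, so downstream statistics can choose to skip them.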
Description: Provides a per-minute record of each GPU's utilization as reported by `nvidia-smi`.
Format: CSV
Columns:
| time | machineId | gpu0_util | gpu1_util | gpu2_util | gpu3_util | gpu4_util | gpu5_util | gpu6_util | gpu7_util |
|---|---|---|---|---|---|---|---|---|---|
Example entry:
2017-10-03 00:08:00 PDT,m29,60.8,99.366666667,100.0,63.333333333,100.0,100.0,100.0,100.0,
Notes:
- Some `gpu*_util` values may be `"NA"`, indicating the GPU was offline at the time of measurement.
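A row of this file can be parsed as in the sketch below. Two details in the example entry matter: the row ends with a trailing comma, which produces an extra empty field, and utilization values may be `"NA"`. The helper names are hypothetical, and treating an empty field the same as `"NA"` is an assumption.

```python
import csv
import io


def parse_util(value):
    """Convert one utilization field to float; "NA" or empty becomes None."""
    value = value.strip()
    if value in ("", "NA"):
        return None
    return float(value)


def parse_gpu_row(fields):
    """Split a row into (time, machineId, list of up to 8 GPU utilizations)."""
    timestamp, machine_id = fields[0], fields[1]
    # Columns 2..9 hold gpu0_util..gpu7_util; the slice also drops the
    # empty field created by the trailing comma.
    utils = [parse_util(value) for value in fields[2:10]]
    return timestamp, machine_id, utils


def mean_utilization(utils):
    """Mean over GPUs that reported a value, or None if none did."""
    online = [u for u in utils if u is not None]
    return sum(online) / len(online) if online else None


line = "2017-10-03 00:08:00 PDT,m29,60.8,99.366666667,100.0,63.333333333,100.0,100.0,100.0,100.0,"
fields = next(csv.reader(io.StringIO(line)))
timestamp, machine_id, utils = parse_gpu_row(fields)
print(machine_id, mean_utilization(utils))
```

Using `csv.reader` rather than `str.split` also strips the quotes if a value is written as `"NA"` with literal quotation marks.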
Description: Provides a per-minute record of each server's CPU utilization.
Format: CSV
Columns:
| time | machine_id | cpu_util |
|---|---|---|
Example entry:
2017-11-27 00:04:00 PST,m29,31.845
Notes:
- Some `cpu_util` values may be `"NA"`, indicating the server was offline at the time of measurement.
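Timestamps in the utilization files carry a zone abbreviation (`PDT` or `PST` in the example entries), unlike the job log timestamps. The sketch below converts them to timezone-aware values using fixed offsets (PST is UTC-8, PDT is UTC-7) instead of relying on `%Z`, whose support for abbreviations varies by platform. It assumes only these two abbreviations occur; the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC offsets for the two abbreviations seen in the example entries.
ZONES = {
    "PST": timezone(timedelta(hours=-8), "PST"),
    "PDT": timezone(timedelta(hours=-7), "PDT"),
}


def parse_timestamp(text):
    """Parse 'YYYY-MM-DD HH:MM:SS ZZZ' into a timezone-aware datetime."""
    stamp, zone = text.rsplit(" ", 1)
    naive = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=ZONES[zone])


def parse_cpu_row(line):
    """Return (time, machine_id, cpu_util); cpu_util is None for "NA"."""
    time_text, machine_id, cpu_text = line.strip().split(",")[:3]
    cpu_text = cpu_text.strip().strip('"')
    cpu_util = None if cpu_text == "NA" else float(cpu_text)
    return parse_timestamp(time_text), machine_id, cpu_util


when, machine_id, cpu_util = parse_cpu_row("2017-11-27 00:04:00 PST,m29,31.845")
print(when.astimezone(timezone.utc), machine_id, cpu_util)
```

Normalizing to UTC avoids the repeated or skipped local hour around the daylight-saving change in early November 2017, which falls inside the trace period.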
Description: Provides a per-minute record of each server's memory utilization.
Format: CSV
Columns:
| time | machine_id | mem_total | mem_free |
|---|---|---|---|
Example entry:
2017-10-03 00:06:00 PDT,m29,528272672.0,2030730.6667
Notes:
- Some `mem_total` and `mem_free` values may be `"NA"`, indicating the server was offline at the time of measurement.
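Because the unit of `mem_total` and `mem_free` is not stated here, a unit-free quantity such as the fraction of memory in use is a convenient thing to derive. The following sketch computes it as `1 - mem_free / mem_total` and propagates `"NA"` as `None`. The helper names are hypothetical, and whether `mem_free` counts cached memory is not specified by the trace description.

```python
def parse_number(value):
    """Convert a field to float; "NA" (with or without quotes) becomes None."""
    value = value.strip().strip('"')
    return None if value == "NA" else float(value)


def memory_used_fraction(mem_total, mem_free):
    """Fraction of memory in use, or None if either value is unavailable."""
    if mem_total is None or mem_free is None or mem_total <= 0:
        return None
    return 1.0 - mem_free / mem_total


line = "2017-10-03 00:06:00 PDT,m29,528272672.0,2030730.6667"
time_text, machine_id, total_text, free_text = line.split(",")
fraction = memory_used_fraction(parse_number(total_text), parse_number(free_text))
print(machine_id, round(fraction, 3))  # m29 0.996
```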
Description: Lists the number of GPUs and per-GPU memory available on each server in the cluster.
Format: CSV
Columns:
| machineId | number of GPUs | single GPU mem |
|---|---|---|
Example entry:
m31,8, 24GB
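The example entry has a space after the second comma and a unit suffix on the memory value, so each field needs trimming before conversion. A minimal sketch, assuming every memory value is a whole number followed by `GB` as in the example; the helper name and the output keys are hypothetical.

```python
import re


def parse_machine_row(line):
    """Parse 'machineId,number of GPUs,single GPU mem' into a dict."""
    machine_id, num_gpus, gpu_mem = [field.strip() for field in line.split(",")]
    match = re.fullmatch(r"(\d+)\s*GB", gpu_mem)
    if match is None:
        raise ValueError(f"unexpected GPU memory value: {gpu_mem!r}")
    return {
        "machine_id": machine_id,
        "num_gpus": int(num_gpus),
        "gpu_mem_gb": int(match.group(1)),
    }


row = parse_machine_row("m31,8, 24GB")
print(row)  # {'machine_id': 'm31', 'num_gpus': 8, 'gpu_mem_gb': 24}
print(row["num_gpus"] * row["gpu_mem_gb"])  # total GPU memory on the server: 192
```

The `machine_id` values here use the same identifiers as the `ip` field in the job log and the machine id column in the utilization files (for example `m29`), so this table can be joined against them.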