- forked from https://github.com/jasoncpatton/chtc-htcondor-es
This package contains a set of scripts that assist in uploading data from a HTCondor pool to ElasticSearch. It queries for both historical and current job ClassAds, converts them to a JSON document, and uploads them to the local ElasticSearch instance.
The majority of the logic is in the conversion of ClassAds to values that are immediately useful in Kibana / ElasticSearch.
This service converts the raw ClassAds into JSON documents; one JSON document per job in the system. While most attributes are copied over verbatim, a few new ones are computed on insertion; this allows easier construction of ElasticSearch queries. This section documents the most commonly used and most useful attributes, along with their meaning.
Generic attributes:
RecordTime
: When the job exited the queue or when the JSON document was last updated, whichever came first. Use this field for time-based queries.GlobalJobId
: Use to uniquely identify individual jobs.BenchmarkJobDB12
: An estimate of the per-core performance of the machine the job last ran on, based on theDB12
benchmark. Higher is better.CoreHr
: The number of core-hours utilized by the job. If the job lasted for 24 hours and utilized 4 cores, the value ofCoreHr
will be 96. This includes all runs of the job, including any time spent on preempted instance.CpuTimeHr
: The amount of CPU time (sum of user and system) attributed to the job, in hours.CommittedCoreHr
: The core-hours only for the last run of the job (excluding preempted attempts).QueueHrs
: Number of hours the job spent in queue before last run.WallClockHr
: Number of hours the job spent running. This is invariant of the number of cores; most users will preferCoreHr
instead.CpuEff
: The total scheduled CPU time (user and system CPU) divided by core hours, in percentage. If the job lasted for 24 hours, utilized 4 cores, and used 72 hours of CPU time, thenCpuEff
would be 75.CpuBadput
: The badput associated with a job, in hours. This is the sum of all unsuccessful job attempts. If a job runs for 2 hours, is preempted, restarts, then completes successfully after 3 hours, theCpuBadput
is 2.MemoryMB
: The amount of RAM used by the job.Processor
: The processor model (from/proc/cpuinfo
) the job last ran on.RequestCpus
: Number of cores utilized by the job.RequestMemory
: Amount of memory requested by the job, in MB.ScheddName
: Name of HTCondor schedd where the job ran.Status
: State of the job; valid values includeCompleted
,Running
,Idle
, orHeld
. Note: due to some validation issues, non-experts should only look at completed jobs.x509userproxysubject
: The DN of the grid certificate associated with the job; for CMS jobs, this is not the best attribute to use to identify a user (preferCRAB_UserHN
).DB12CommittedCoreHr
,DB12CoreHr
,DB12CpuTimeHr
: The job'sCommittedCoreHr
,CoreHr
, andCpuTimeHr
values multiplied by theBenchmarkJobDB12
score.
usage: spider_cms.py [-h] [--process_queue] [--feed_es] [--feed_es_for_queues]
[--feed_amq] [--schedd_filter SCHEDD_FILTER]
[--skip_history] [--read_only] [--dry_run]
[--max_documents_to_process MAX_DOCUMENTS_TO_PROCESS]
[--keep_full_queue_data]
[--es_bunch_size ES_BUNCH_SIZE]
[--query_queue_batch_size QUERY_QUEUE_BATCH_SIZE]
[--upload_pool_size UPLOAD_POOL_SIZE]
[--query_pool_size QUERY_POOL_SIZE]
[--es_hostname ES_HOSTNAME] [--es_port ES_PORT]
[--es_index_template ES_INDEX_TEMPLATE]
[--log_dir LOG_DIR] [--log_level LOG_LEVEL]
[--email_alerts EMAIL_ALERTS] [--collectors COLLECTORS]
[--collectors_file COLLECTORS_FILE]
The collectors file is a json file with the pools and a list of the collectors:
{
"Global":[ "thefirstcollector.example.com:8080","thesecondcollector.other.com"],
"Volunteer":["next.example.com"],
"ITB":["newcollector.virtual.com"]
}