Building for Cray with CLE5

valleydlr edited this page Dec 16, 2016 · 7 revisions

Modify your build environment to use the gnu compiler

 module unload PrgEnv-XXX (if XXX is not gnu)
 module load PrgEnv-gnu
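
To confirm the switch took effect, the loaded programming environment can be read back from the module list. The following is a minimal sketch, not part of OVIS: the function names loaded_prgenv and prgenv_is_gnu are hypothetical, and it assumes the usual "N) name/version" layout of module list output.

```shell
# Print the first PrgEnv-* module named in `module list` style text on stdin.
# Prints nothing if no PrgEnv module appears in the input.
loaded_prgenv() {
    grep -o 'PrgEnv-[A-Za-z0-9_]*' | head -n 1
}

# Succeed only if the programming environment found on stdin is PrgEnv-gnu.
prgenv_is_gnu() {
    [ "$(loaded_prgenv)" = "PrgEnv-gnu" ]
}
```

Because module list typically writes to stderr, a call would look like: module list 2>&1 | prgenv_is_gnu && echo "gnu environment loaded"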

Building LDMS and support libraries and headers

  • Download ovis and gpcd from github.com/ovis-hpc
  • Download libevent-2.0.x-stable from libevent.org
  • In this example source files are assumed to be in ~/Source and builds in ~/Build

Building libevent-2.0

  • cp <path>/libevent-2.0.x-stable.tar.gz ~/Source
  • cd ~/Source
  • tar -xzf libevent-2.0.x-stable.tar.gz
  • cd libevent-2.0.x-stable
  • Run autogen.sh
 ./autogen.sh
  • Create a configure script (configure.sh in this example) with the following contents:
 #!/bin/bash
 ../configure --prefix=<absolute path>/Build/libevent-2.0_build
  • Make configure.sh executable
 chmod +x configure.sh
  • Create a build directory
 mkdir build
  • cd to build directory and run configure.sh
 cd build
 ../configure.sh
  • Build and install libevent-2.0 libs and includes
 make && make install

Building LDMS

  • cp <path>/ovis-master.zip ~/Source
  • cd ~/Source
  • unzip ovis-master.zip
  • cd ovis-master
  • Run autogen.sh
 ./autogen.sh
  • Create a configure script (configure.sh in this example) with the following contents:
 #!/bin/bash
 #
 # SYNOPSIS: Run configure for an out-of-tree OVIS/LDMS build with the
 #           Cray-specific options set below.
 #
 # REMARK: This script only runs configure. Build and install afterwards with
 # "make && make install" from the build directory. To uninstall, run
 # "make uninstall" from the same build directory.
 #
 #
 BUILD_PATH=<absolute path to builds>
 PREFIX=$BUILD_PATH/OVIS-3.3
 #
 # add --enable-FEATURE here
 ENABLE="--enable-ugni  \
        --enable-ldms-python \
        --enable-kgnilnd \
        --enable-lustre \
        --enable-tsampler \
        --enable-cray_power_sampler \
        --enable-cray_system_sampler \
        --enable-aries-gpcdr \
        --enable-aries_mmr"
 #
 # add --disable-FEATURE here
 DISABLE="--disable-rpath \
         --disable-readline \
         --disable-baler \
         --disable-sos \
         --disable-mmap "
 #
 # libevent2 prefix
 LIBEVENT_PREFIX=$BUILD_PATH/libevent-2.0_build
 #
 #
 WITH="--with-rca=/opt/cray/rca/default --with-krca=/opt/cray/krca/default --with-cray-hss-devel=/opt/cray-hss-devel/default --with-aries-libgpcd=$BUILD_PATH/gpcd_build/lib/,$BUILD_PATH/gpcd_build/include/"
 #
 if [ -n "$LIBEVENT_PREFIX" ]; then
     WITH="$WITH --with-libevent=$LIBEVENT_PREFIX"
 fi
 #
 # 
 CFLAGS='-g -O3 -Wl,-z,defs'
 #
 # Exit immediately if a command failed
 set -e
 set -x
 #
 ../configure --prefix="$PREFIX" $ENABLE $DISABLE $WITH CFLAGS="$CFLAGS" LDFLAGS="$LDFLAGS" CPPFLAGS="$CPPFLAGS"
  • Make configure.sh executable
 chmod +x configure.sh
  • Create a build directory
 mkdir build
  • cd to build directory and run configure.sh
 cd build
 ../configure.sh
  • Build and install ldms libs, executables, and includes
 make && make install
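
After both installs finish, it can help to confirm that the expected files landed under each prefix before moving on. The helper below is a sketch under assumptions: check_install is a hypothetical name, and the paths passed to it are whatever you expect from your own configure options.

```shell
# Report every expected file that is missing under an install prefix.
# Usage: check_install PREFIX RELATIVE_PATH...
# Returns 0 if every path exists, 1 otherwise.
check_install() {
    prefix=$1
    shift
    status=0
    for rel in "$@"; do
        if [ ! -e "$prefix/$rel" ]; then
            echo "missing: $prefix/$rel" >&2
            status=1
        fi
    done
    return $status
}
```

With the layout used in this example, a check might be: check_install ~/Build/OVIS-3.3 sbin/ldmsd sbin/ldms_ls, and for libevent: check_install ~/Build/libevent-2.0_build include/event2/event.h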

Running LDMS

  • README files and sample scripts for the XC40 running Rhine Redwood reside in the ovis-master/util/sample_init_scripts/XC40_RR directory.
    • Note1: The scripts must be modified, as described in the README file, to fit a particular deployment configuration.
    • Note2: The "export LD_LIBRARY_PATH=" line in the ldmsd.conf file has an error. It should look the same as the one in the ldms_env file. In this case:
 export LD_LIBRARY_PATH=$TOP/OVIS-3.3/lib/:$TOP/OVIS-3.3/lib/ovis-ldms:$TOP/OVIS-3.3/lib/ovis-lib:$TOP/libevent-2.0_build/lib:$LD_LIBRARY_PATH
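
A mistyped entry in LD_LIBRARY_PATH is easy to miss, so one way to catch it is to list the entries that do not exist on disk. This is a small sketch; missing_path_dirs is a hypothetical helper, not something shipped with OVIS.

```shell
# Print each non-empty entry of a colon-separated path list that is not an
# existing directory. Returns 1 if anything was printed, 0 otherwise.
missing_path_dirs() {
    list=$1
    missing=0
    old_ifs=$IFS
    IFS=:
    for dir in $list; do
        [ -z "$dir" ] && continue
        if [ ! -d "$dir" ]; then
            echo "$dir"
            missing=1
        fi
    done
    IFS=$old_ifs
    return $missing
}
```

After the variable has been exported, run: missing_path_dirs "$LD_LIBRARY_PATH"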

Simple Configuration and Testing

  • Write a script to set up the environment (in this example the script is named ldms_env). Note: To use the ugni transport for RDMA data transfers, you must configure a protection domain and assign the corresponding cookie value to the ZAP_UGNI_COOKIE environment variable.
 #!/bin/bash
 #
 TOP=<absolute path to OVIS build>
 export LD_LIBRARY_PATH=$TOP/OVIS-3.3/lib/:$TOP/OVIS-3.3/lib/ovis-ldms:$TOP/OVIS-3.3/lib/ovis-lib:$TOP/libevent-2.0_build/lib:$LD_LIBRARY_PATH
 export LDMSD_PLUGIN_LIBPATH=$TOP/OVIS-3.3/lib/ovis-ldms/
 export ZAP_LIBPATH=$TOP/OVIS-3.3/lib/ovis-lib/
 export PATH=$TOP/OVIS-3.3/sbin:$PATH
 #
 # Use this if using shared secret authentication
 export LDMS_AUTH_FILE=<absolute path to ovis/etc>/shared_secret
 #
 # Use the following if running on a Cray XC
 ############################
 # Will need to configure a protection domain cookie first
 export ZAP_UGNI_PTAG=0
 export ZAP_UGNI_COOKIE=<hex value of cookie e.g., 0x86bb0000>
 #
 # Set interval for periodically checking node state. Note that for ldms_ls this should still be defined
 export ZAP_UGNI_STATE_INTERVAL=1000000
 # Set offset relative to 0 seconds. Typically set to something negative so the refresh is just before an aggregation
 export ZAP_UGNI_STATE_OFFSET=-10000
 ########################
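
Since ZAP_UGNI_COOKIE must hold a hex value, a quick format check before starting the daemon can catch a placeholder left in ldms_env. The function below is a sketch with a hypothetical name (is_hex_cookie); it validates the format only and cannot tell whether the cookie matches a configured protection domain.

```shell
# Succeed if the argument looks like a 0x-prefixed hexadecimal value,
# for example 0x86bb0000. Format check only.
is_hex_cookie() {
    case $1 in
        0x|0X) return 1 ;;
        0[xX]*)
            case ${1#0[xX]} in
                *[!0-9a-fA-F]*) return 1 ;;
                *) return 0 ;;
            esac ;;
        *) return 1 ;;
    esac
}
```

For example, after sourcing ldms_env: is_hex_cookie "$ZAP_UGNI_COOKIE" || echo "ZAP_UGNI_COOKIE is not a hex value" >&2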
  • Write a configuration file to be used to start an LDMS daemon (ldmsd) with the "meminfo" sampler plugin (in this example this file will be called meminfo_configuration)
 load name=meminfo
 config name=meminfo producer=nid00012 component_id=12 instance=nid00012/meminfo 
 start name=meminfo interval=1000000 offset=0
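
When the same sampler configuration is needed on many nodes, the three lines above can be generated from the node name. This is a sketch under stated assumptions: gen_meminfo_conf is a hypothetical helper, and it assumes producer names of the form nidNNNNN with the numeric part used as component_id, as in this example.

```shell
# Print a meminfo sampler configuration for the given producer name.
# Usage: gen_meminfo_conf PRODUCER [INTERVAL]
# Assumes names of the form nidNNNNN; the numeric part becomes component_id.
gen_meminfo_conf() {
    producer=$1
    interval=${2:-1000000}
    # Strip the "nid" prefix and leading zeros; an all-zero suffix becomes 0.
    id=$(printf '%s\n' "$producer" | sed -e 's/^nid//' -e 's/^0*//')
    [ -n "$id" ] || id=0
    printf 'load name=meminfo\n'
    printf 'config name=meminfo producer=%s component_id=%s instance=%s/meminfo\n' \
        "$producer" "$id" "$producer"
    printf 'start name=meminfo interval=%s offset=0\n' "$interval"
}
```

For example: gen_meminfo_conf nid00012 > meminfo_configuration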
  • Start an ldmsd as a sampler using meminfo_configuration. Note: These examples use the "sock" transport. If using "ugni", replace every "sock" argument to the -x flag with "ugni", and replace "xprt=sock" with "xprt=ugni" in the agg_configuration examples below.
 source ldms_env
 ldmsd -x sock:60411 -S /tmp/ldmsd_sock -v CRITICAL -l /tmp/ldmsd_log -r /tmp/ldmsd.pid -c ./meminfo_configuration
  • Make sure ldmsd is running
 # ps auxw | grep ldmsd
 root     40662  0.0  0.0 383032  2120 ?        Ssl  11:08   0:00 ldmsd -x sock:60411 -S /tmp/ldmsd_sock -v CRITICAL -l /tmp/ldmsd_log -r /tmp/ldmsd.pid -c ./meminfo_configuration
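
In scripts, checking the PID file passed with -r is less fragile than grepping ps output. The following is a minimal sketch; pidfile_alive is a hypothetical helper, and it assumes the file contains just the daemon's PID.

```shell
# Succeed if the PID recorded in the given file belongs to a live process.
# Note: kill -0 also fails if the process belongs to another user and the
# caller lacks permission to signal it.
pidfile_alive() {
    pidfile=$1
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")
    case $pid in
        ''|*[!0-9]*) return 1 ;;
    esac
    kill -0 "$pid" 2>/dev/null
}
```

For the sampler started above: pidfile_alive /tmp/ldmsd.pid && echo "ldmsd is running"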
  • Now use "ldms_ls" utility to check metric sets
  • List sets being hosted by this ldmsd
 $ ldms_ls -h localhost -x sock -p 60411
 nid00012/meminfo
  • More verbose listing
 $ ldms_ls -h localhost -x sock -p 60411 -v
 nid00012/meminfo: consistent, last update: Fri Nov 25 11:16:49 2016 [1401us]
  METADATA --------
    Producer Name : nid00012
    Instance Name : nid00012/meminfo
      Schema Name : meminfo
             Size : 1856
     Metric Count : 44
               GN : 2
  DATA ------------
        Timestamp : Fri Nov 25 11:16:49 2016 [1401us]
         Duration : [0.000048s]
       Consistent : TRUE
             Size : 392
               GN : 9677
  -----------------
  • Long listing that includes metric names, data types, and current values (Note: "M" designates a value as meta-data and "D" designates data)
 $ ldms_ls -h localhost -x sock -p 60411 -l
 nid00012/meminfo: consistent, last update: Fri Nov 25 11:18:18 2016 [1599us]
 M u64        component_id                               12
 D u64        job_id                                     0
 D u64        MemTotal                                   132163924
 D u64        MemFree                                    129978224
 D u64        Buffers                                    0
 D u64        Cached                                     158704
 D u64        SwapCached                                 0
 D u64        Active                                     75064
 D u64        Inactive                                   142104
 D u64        Active(anon)                               67168
 D u64        Inactive(anon)                             123720
 D u64        Active(file)                               7896
 D u64        Inactive(file)                             18384
 D u64        Unevictable                                4080
 D u64        Mlocked                                    749688
 D u64        SwapTotal                                  0
 D u64        SwapFree                                   0
 D u64        Dirty                                      0
 D u64        Writeback                                  0
 D u64        AnonPages                                  62584
 D u64        Mapped                                     13348
 D u64        Shmem                                      131092
 D u64        Slab                                       209900
 D u64        SReclaimable                               14232
 D u64        SUnreclaim                                 195668
 D u64        KernelStack                                5256
 D u64        PageTables                                 1960
 D u64        NFS_Unstable                               0
 D u64        Bounce                                     0
 D u64        WritebackTmp                               0
 D u64        CommitLimit                                66081960
 D u64        Committed_AS                               289428
 D u64        VmallocTotal                               34359738367
 D u64        VmallocUsed                                2245696
 D u64        VmallocChunk                               34290198256
 D u64        HardwareCorrupted                          0
 D u64        HugePages_Total                            0
 D u64        HugePages_Free                             0
 D u64        HugePages_Rsvd                             0
 D u64        HugePages_Surp                             0
 D u64        Hugepagesize                               2048
 D u64        DirectMap4k                                7156
 D u64        DirectMap2M                                1955840
 D u64        DirectMap1G                                134217728
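
To pull a single value out of the long listing, for example in a health check, the output can be filtered by metric name. This sketch assumes the four-column layout shown above (flag, type, name, value); metric_value is a hypothetical helper name.

```shell
# Print the value of one metric from `ldms_ls -l` output read on stdin.
# Exits non-zero if the metric name is not present.
metric_value() {
    awk -v name="$1" '$3 == name { print $4; found = 1; exit } END { exit !found }'
}
```

For example: ldms_ls -h localhost -x sock -p 60411 -l | metric_value MemFree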
  • Write a configuration file for an aggregator (in this example called agg_configuration)
 prdcr_add name=grp1.nid00012.60411 host=nid00012 port=60411 xprt=sock type=active interval=30000000
 prdcr_start name=grp1.nid00012.60411
 updtr_add name=grp1 interval=1000000 offset=100000
 updtr_prdcr_add name=grp1 regex=grp1..*
 updtr_start name=grp1
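
An aggregator normally pulls from more than one sampler, which means one prdcr_add/prdcr_start pair per host. The generator below is a sketch that reuses the names and intervals from this example (grp1, 30000000, 1000000, 100000); gen_agg_conf itself is a hypothetical helper.

```shell
# Print an aggregator configuration that pulls from each host given.
# Usage: gen_agg_conf PORT XPRT HOST...
gen_agg_conf() {
    port=$1
    xprt=$2
    shift 2
    for host in "$@"; do
        printf 'prdcr_add name=grp1.%s.%s host=%s port=%s xprt=%s type=active interval=30000000\n' \
            "$host" "$port" "$host" "$port" "$xprt"
        printf 'prdcr_start name=grp1.%s.%s\n' "$host" "$port"
    done
    # One updater collects from every producer whose name starts with "grp1.".
    printf 'updtr_add name=grp1 interval=1000000 offset=100000\n'
    printf 'updtr_prdcr_add name=grp1 regex=grp1..*\n'
    printf 'updtr_start name=grp1\n'
}
```

For example: gen_agg_conf 60411 sock nid00012 nid00013 > agg_configuration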
  • Start an ldmsd as an aggregator using agg_configuration (Note: make sure to use a different port if running on the same host as your sampler)
 $ ldmsd -x sock:60412 -S /tmp/ldmsd_sock_agg -m 2GB -P 16 -v CRITICAL -l /tmp/ldmsd_log_agg -r /tmp/ldmsd_agg.pid -c ./agg_configuration
    • Note1: If aggregating from a substantial number of hosts, specify how much memory to allocate to this daemon with the "-m" flag and how many threads to use with the "-P" flag. You will also want to increase the number of threads available to the transport and the queue depth using the ZAP_EVENT_WORKERS and ZAP_EVENT_QDEPTH environment variables. These can be added to the ldms_env script above like:
 # Set the number of ZAP (transport) worker threads to 16
 export ZAP_EVENT_WORKERS=16
 # Set the queue depth to 65536 entries
 export ZAP_EVENT_QDEPTH=65536
    • Note2: Depending on the number of hosts being aggregated from, you may also need to increase the number of file descriptors that can be open concurrently. This can be added to the ldms_env script like:
 # Raise the open file descriptor limit to 100 thousand
 ulimit -n 100000
  • Check to make sure the daemon is running and contains the expected sets as above for the sampler daemon
  • Add storage configuration by editing your agg_configuration file to look like this:
 prdcr_add name=grp1.nid00012.60411 host=nid00012 port=60411 xprt=sock type=active interval=30000000
 prdcr_start name=grp1.nid00012.60411
 updtr_add name=grp1 interval=1000000 offset=100000
 updtr_prdcr_add name=grp1 regex=grp1..*
 updtr_start name=grp1
 load name=store_csv
 config name=store_csv path=/tmp/LDMS_CSV action=init altheader=1 buffer=0
 strgp_add name=meminfo-csv_store plugin=store_csv container=csv schema=meminfo
 strgp_start name=meminfo-csv_store
  • Kill the aggregator ldmsd (if currently running)
 kill <pid of aggregator ldmsd>
  • Re-run using updated configuration file
 $ ldmsd -x sock:60412 -S /tmp/ldmsd_sock_agg -m 2GB -P 16 -v CRITICAL -l /tmp/ldmsd_log_agg -r /tmp/ldmsd_agg.pid -c ./agg_configuration
  • You should now see the directory /tmp/LDMS_CSV/csv with files "meminfo" and "meminfo.HEADER" in it
  • For more aggregator and store configuration options, please refer to the man pages "<build>/share/man/man8/ldmsd.8" and "<build>/share/man/man7/plugin_store_csv.7", respectively.
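
Once the store has written a few rows, a simple consistency check is to compare the number of columns in the header file with the number in the data file. This is a sketch only: csv_fields_match is a hypothetical helper, and it assumes the header file holds a single comma-separated line of column names and that no field contains an embedded comma.

```shell
# Compare the number of comma-separated fields in the first line of the
# header file with the number in the last row of the data file.
# Succeeds when both files are non-empty and the counts match.
csv_fields_match() {
    header_file=$1
    data_file=$2
    h=$(head -n 1 "$header_file" | awk -F, '{ print NF }')
    d=$(tail -n 1 "$data_file" | awk -F, '{ print NF }')
    [ -n "$h" ] && [ "$h" = "$d" ]
}
```

For the store configured above: csv_fields_match /tmp/LDMS_CSV/csv/meminfo.HEADER /tmp/LDMS_CSV/csv/meminfo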
