Building for Cray with CLE5

valleydlr edited this page Apr 26, 2017 · 7 revisions

Modify your build environment to use the gnu compiler

 module unload PrgEnv-pgi
 module unload PrgEnv-cray
 module unload PrgEnv-intel
 # module unload PrgEnv-XXX (if other XXX is not gnu)
 module load PrgEnv-gnu

Checking your environment for RPM prerequisites

 rpm -q openssl-devel gcc libevent python-base python-devel gettext-tools libevent-devel
  • Additional packages are needed to enable extra features (the features themselves are not covered in this example):
 rpm -q libyaml-0-2 libyaml-devel swig

If you do not have access to query the rpm database on your platform, you can check for the presence of some required files:

 /usr/share/gettext/config.rpath
 /usr/include/openssl/md5.h
 /usr/include/python2.6/Python.h

and some files required for the extra features:

 /usr/include/yaml.h
 /usr/lib64/libyaml.so
 /usr/share/swig/2.0.12/python/cstring.i
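
If you prefer to script this check, a small loop like the following (a convenience sketch, not part of the official build steps) reports which of the paths above are present. The paths are the CLE5 defaults listed above; adjust them for your system image.

```shell
#!/bin/bash
# Report which prerequisite files are present on this system.
check_files() {
  local missing=0 f
  for f in "$@"; do
    if [ -e "$f" ]; then
      echo "found:   $f"
    else
      echo "MISSING: $f"
      missing=1
    fi
  done
  return $missing
}

check_files \
  /usr/share/gettext/config.rpath \
  /usr/include/openssl/md5.h \
  /usr/include/python2.6/Python.h || echo "one or more prerequisites missing"
```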

Building LDMS and support libraries and headers

 mkdir ~/Source && cd ~/Source && git clone https://github.com/ovis-hpc/ovis.git ovis
  • In this example the ovis clone is assumed to be named ~/Source/ovis and builds in ~/Build.
  • The release branch to be compiled is OVIS-3.3.0, not master.
  • You may need to set an https_proxy environment variable before the git clone will work if you are behind a firewall.
  • Download libevent-2.0.22-stable from libevent.org. You may need to download to another host and then transfer the archive file to the CLE5 environment; wget and curl clients in some CLE5 environments are not current with more recent web server SSL standards.
  • In this example source files are assumed to be in ~/Source and installations in ~/Build
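
If your build host sits behind a firewall, the clone and download steps may need proxy settings first. A sketch, with a hypothetical proxy host and port (substitute your site's values):

```shell
# Hypothetical proxy endpoint -- replace with your site's actual proxy host:port.
export https_proxy=http://proxy.example.com:8080
export http_proxy=http://proxy.example.com:8080
# git, wget, and curl all consult these environment variables automatically.
```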

Building libevent-2.0

 cp <path>/libevent-2.0.22-stable.tar.gz ~/Source 
 cd ~/Source
 tar xzf libevent-2.0.22-stable.tar.gz
 cd libevent-2.0.22-stable
 ./autogen.sh
  • Create a configure.sh script with the following content:
 #!/bin/bash
 ../configure --prefix=$HOME/Build/libevent-2.0_build && make -j 16 && make install
  • make configure.sh executable
 chmod +x configure.sh
  • Create a build directory
 mkdir build
  • cd to build directory and run configure.sh
 cd build
 ../configure.sh

The result should leave libevent.so in $HOME/Build/libevent-2.0_build/lib

Building LDMS

  • Get the sources and generate the build system like so:
 cd ~/Source/ovis
 git checkout OVIS-3.X.Y
 git submodule init gpcd-support
 git submodule update gpcd-support
 git submodule init sos
 git submodule update sos
 ./autogen.sh
  • Create a build script (named configure.sh in this example) with the following content:
 #!/bin/bash
 #
 # SYNOPSIS: Remove existing build directories, do the automake routine, rebuild,
 #           and install everything.
 #
 # REMARK: This script doesn't do uninstall. If you wish to uninstall (e.g. make
 # uninstall), please go into each build directory ( */build-$HOSTNAME ) and
 # call make uninstall there, or simply do the following:
 #     for D in */build-$HOSTNAME; do pushd $D; make uninstall; popd; done;
 #
 #
 BUILD_PATH=$HOME/Build
 PREFIX=$BUILD_PATH/OVIS-3.X
 cd $HOME/Source/ovis/build
 mkdir -p $PREFIX
 #
 # add --enable-FEATURE here
 ENABLE="--enable-ugni  \
        --enable-ldms-python \
        --enable-kgnilnd \
        --enable-lustre \
        --enable-tsampler \
        --enable-cray_power_sampler \
        --enable-cray_system_sampler \
        --enable-aries-gpcdr \
        --enable-aries_mmr"
 #
 # add --disable-FEATURE here
 DISABLE="--disable-rpath \
         --disable-readline \
         --disable-baler \
         --disable-sos \
         --disable-mmap "
 #
 # libevent2 prefix
 LIBEVENT_PREFIX=$BUILD_PATH/libevent-2.0_build
 #
 #
 WITH_CRAY="--with-rca=/opt/cray/rca/default --with-krca=/opt/cray/krca/default --with-cray-hss-devel=/opt/cray-hss-devel/default --enable-gpcdlocal"
 #
 WITH_CRAY="$WITH_CRAY --with-libevent=$LIBEVENT_PREFIX"
 #
 # 
 CFLAGS='-g -O3 -Wl,-z,defs'
 #
 # Exit immediately if a command failed
 set -e
 set -x
 #
 $HOME/Source/ovis/configure --prefix=$PREFIX --with-pkglibdir=ovis-lib $ENABLE $DISABLE $WITH_CRAY CFLAGS="$CFLAGS" LDFLAGS=$LDFLAGS CPPFLAGS=$CPPFLAGS
  • make configure.sh executable
 chmod +x configure.sh
  • Create a build directory
 mkdir build
  • cd to the build directory and run the build under script(1) to capture the output in install.log like so:
 cd build
 script -c '../configure.sh && make -j 16 && make install && echo success' install.log
  • Review the output in install.log if the final output on the screen does not end with 'success'
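
If the run did not end with 'success', a one-liner like the following (a convenience sketch; install.log is the file captured by the script(1) step above) surfaces the first few error lines:

```shell
# Show the first few lines mentioning errors; report cleanly if there are none
# (or if install.log is not present in the current directory).
grep -n -i -m 5 'error' install.log || echo "no error lines found in install.log"
```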

Running LDMS

  • A simple test can be run immediately (it does not use Cray-specific plugins):
 ~/Build/OVIS-3.3/bin/ldms_local_usertest.sh

This exercises the generic socket transport, sampler, aggregator, and store plugins. The screen output should end with a message like:

 logs and data stored under /tmp/username/ldmstest/142819
 done

Simple Cray Configuration and Testing

  • README files and sample scripts for the XC40 running Rhine Redwood reside in the ~/Source/ovis/util/sample_init_scripts/XC40_RR directory.
    • Note1: The scripts must be modified, as described in the README file, to fit a particular deployment configuration.
  • Note2: The "export LD_LIBRARY_PATH=" line in the ldmsd.conf file has an error. It should look the same as the one in the ldms_env file. In this case:
 export LD_LIBRARY_PATH=$TOP/OVIS-3.3/lib/:$TOP/OVIS-3.3/lib/ovis-ldms:$TOP/OVIS-3.3/lib/ovis-lib:$TOP/libevent-2.0_build/lib:$LD_LIBRARY_PATH
  • Note3: The "export LDMSD_PLUGIN_PATH=" line in the ldmsd.conf and ldms_env file has an error. It should look the same in both. In this case:
 export LDMSD_PLUGIN_LIBPATH=$TOP/OVIS-3.3/lib/ovis-lib/
  • Write a script to set up environment (in this example the script will be named ldms_env). Note: If you want to use the ugni transport for RDMA data transfers you will need to configure a protection domain and obtain the corresponding cookie value for assignment to the ZAP_UGNI_COOKIE environment variable.
 #!/bin/bash
 #
 TOP=$HOME/Build
 export LD_LIBRARY_PATH=$TOP/OVIS-3.3/lib/:$TOP/OVIS-3.3/lib/ovis-ldms:$TOP/OVIS-3.3/lib/ovis-lib:$TOP/libevent-2.0_build/lib:$LD_LIBRARY_PATH
 export LDMSD_PLUGIN_LIBPATH=$TOP/OVIS-3.3/lib/ovis-lib/
 export ZAP_LIBPATH=$TOP/OVIS-3.3/lib/ovis-lib/
 export PATH=$TOP/OVIS-3.3/sbin:$PATH
 #
 # Use this if using shared secret authentication
 export LDMS_AUTH_FILE=<absolute path to ovis/etc>/shared_secret
 #
 # Use the following if running on a Cray XC
 ############################
 # Will need to configure a protection domain cookie first
 export ZAP_UGNI_PTAG=0
 export ZAP_UGNI_COOKIE=<hex value of cookie e.g., 0x86bb0000>
 #
 # Set interval for periodically checking node state. Note that for ldms_ls this should still be defined
 export ZAP_UGNI_STATE_INTERVAL=1000000
 # Set offset relative to 0 seconds. Typically set to something negative so the refresh is just before an aggregation
 export ZAP_UGNI_STATE_OFFSET=-10000
 ########################
  • Write a configuration file to be used to start a LDMS daemon (ldmsd) with the "meminfo" sampler plugin (in this example this file will be called meminfo_configuration)
 load name=meminfo
 config name=meminfo producer=nid00012 component_id=12 instance=nid00012/meminfo 
 start name=meminfo interval=1000000 offset=0
  • Start a ldmsd as a sampler using meminfo_configuration. Note: This set of examples uses the "sock" transport. If using "ugni", replace "sock" with "ugni" in all arguments to the -x flag in these examples, and replace "xprt=sock" with "xprt=ugni" in the agg_configuration examples below.
 source ldms_env
 ldmsd -x sock:60411 -S /tmp/ldmsd_sock -v CRITICAL -l /tmp/ldmsd_log -r /tmp/ldmsd.pid -c ./meminfo_configuration
  • make sure ldmsd is running
 # ps auxw | grep ldmsd
 root     40662  0.0  0.0 383032  2120 ?        Ssl  11:08   0:00 ldmsd -x sock:60411 -S /tmp/ldmsd_sock -v CRITICAL -l /tmp/ldmsd_log -r /tmp/ldmsd.pid -c ./meminfo_configuration
  • Now use "ldms_ls" utility to check metric sets
  • List sets being hosted by this ldmsd
 $ ldms_ls -h localhost -x sock -p 60411
 nid00012/meminfo
  • More verbose listing
 $ ldms_ls -h localhost -x sock -p 60411 -v
 nid00012/meminfo: consistent, last update: Fri Nov 25 11:16:49 2016 [1401us]
  METADATA --------
    Producer Name : nid00012
    Instance Name : nid00012/meminfo
      Schema Name : meminfo
             Size : 1856
     Metric Count : 44
               GN : 2
  DATA ------------
        Timestamp : Fri Nov 25 11:16:49 2016 [1401us]
         Duration : [0.000048s]
       Consistent : TRUE
             Size : 392
               GN : 9677
  -----------------
  • Long listing that includes metric names, data types, and current values (Note: "M" designates a value as meta-data and "D" designates data)
 $ ldms_ls -h localhost -x sock -p 60411 -l
 nid00012/meminfo: consistent, last update: Fri Nov 25 11:18:18 2016 [1599us]
 M u64        component_id                               12
 D u64        job_id                                     0
 D u64        MemTotal                                   132163924
 D u64        MemFree                                    129978224
 D u64        Buffers                                    0
 D u64        Cached                                     158704
 D u64        SwapCached                                 0
 D u64        Active                                     75064
 D u64        Inactive                                   142104
 D u64        Active(anon)                               67168
 D u64        Inactive(anon)                             123720
 D u64        Active(file)                               7896
 D u64        Inactive(file)                             18384
 D u64        Unevictable                                4080
 D u64        Mlocked                                    749688
 D u64        SwapTotal                                  0
 D u64        SwapFree                                   0
 D u64        Dirty                                      0
 D u64        Writeback                                  0
 D u64        AnonPages                                  62584
 D u64        Mapped                                     13348
 D u64        Shmem                                      131092
 D u64        Slab                                       209900
 D u64        SReclaimable                               14232
 D u64        SUnreclaim                                 195668
 D u64        KernelStack                                5256
 D u64        PageTables                                 1960
 D u64        NFS_Unstable                               0
 D u64        Bounce                                     0
 D u64        WritebackTmp                               0
 D u64        CommitLimit                                66081960
 D u64        Committed_AS                               289428
 D u64        VmallocTotal                               34359738367
 D u64        VmallocUsed                                2245696
 D u64        VmallocChunk                               34290198256
 D u64        HardwareCorrupted                          0
 D u64        HugePages_Total                            0
 D u64        HugePages_Free                             0
 D u64        HugePages_Rsvd                             0
 D u64        HugePages_Surp                             0
 D u64        Hugepagesize                               2048
 D u64        DirectMap4k                                7156
 D u64        DirectMap2M                                1955840
 D u64        DirectMap1G                                134217728
  • Write a configuration file for an aggregator (in this example called agg_configuration)
 prdcr_add name=grp1.nid00012.60411 host=nid00012 port=60411 xprt=sock type=active interval=30000000
 prdcr_start name=grp1.nid00012.60411
 updtr_add name=grp1 interval=1000000 offset=100000
 updtr_prdcr_add name=grp1 regex=grp1..*
 updtr_start name=grp1
  • Start a ldmsd as an aggregator using agg_configuration (Note: make sure to use a different port if running on the same host as your sampler)
 $ ldmsd -x sock:60412 -S /tmp/ldmsd_sock_agg -m 2GB -P 16 -v CRITICAL -l /tmp/ldmsd_log_agg -r /tmp/ldmsd_agg.pid -c ./agg_configuration
  • Note1: If aggregating from a substantial number of hosts you will want to specify how much memory should be allocated to this daemon using the "-m" flag and how many threads using the "-P" flag. Additionally you will want to increase the number of threads available to the transport and the queue depth using the ZAP_EVENT_WORKERS and ZAP_EVENT_QDEPTH environment variables. These can be added to the ldms_env script above like:
 export ZAP_EVENT_WORKERS=16   # sets the number of ZAP (transport) worker threads to 16
 export ZAP_EVENT_QDEPTH=65536 # sets the queue depth buffer to 65536 entries
    • Note2: You may also need to increase the number of file descriptors that can be concurrently open depending on the number of hosts being aggregated from. This can be added to the ldms_env script like:
 ulimit -n 100000 # sets the limit to 100 thousand open file descriptors
  • Check to make sure the daemon is running and contains the expected sets as above for the sampler daemon
  • Add storage configuration by editing your agg_configuration file to look like this:
 prdcr_add name=grp1.nid00012.60411 host=nid00012 port=60411 xprt=sock type=active interval=30000000
 prdcr_start name=grp1.nid00012.60411
 updtr_add name=grp1 interval=1000000 offset=100000
 updtr_prdcr_add name=grp1 regex=grp1..*
 updtr_start name=grp1
 load name=store_csv
 config name=store_csv path=/tmp/LDMS_CSV action=init altheader=1 buffer=0
 strgp_add name=meminfo-csv_store plugin=store_csv container=csv schema=meminfo
 strgp_start name=meminfo-csv_store
  • kill aggregator ldmsd (if currently running)
 kill <pid of aggregator ldmsd>
  • Re-run using updated configuration file
 $ ldmsd -x sock:60412 -S /tmp/ldmsd_sock_agg -m 2GB -P 16 -v CRITICAL -l /tmp/ldmsd_log_agg -r /tmp/ldmsd_agg.pid -c ./agg_configuration
  • You should now see the directory /tmp/LDMS_CSV/csv with files "meminfo" and "meminfo.HEADER" in it
  • For more aggregator and store configuration options please refer to the man pages in:
 <build dir>/share/man/man8/ldmsd.8
 <build dir>/share/man/man7/Plugin_store_csv.7
