2019.04.15


Directory and Submodule changes between V3 and V4

  • Project changes between OVIS v3 and v4
    • OVIS v3 exposed LDMS, Baler, and SOS as a single project
      • Lightweight Distributed Metric Service (LDMS) – data collection, transport, and storage
      • Baler – log file pattern tagging and analysis
      • Scalable Object Store (SOS) – object store targeting HPC ingest, read, and analysis needs
    • OVIS v4 exposes LDMS, Baler, and SOS as independent projects
      • Baler is dependent on SOS
      • LDMS store-sos plugin depends on SOS
      • SOS is independent of LDMS and Baler
  • Submodule confusion
    • The submodules in OVIS are no longer maintained
    • How to build SOS without using the submodules
      • Check out the desired version of SOS
      • Configuration
        • Prerequisites for Python support
          • Python 2.7+
          • Numpy
          • Cython 0.29+
          • You can use pip to install all of these (see the sketch below)
          • If you can’t or don’t want to support these dependencies, use --disable-python on the configure line
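A minimal sketch of installing the Python prerequisites with pip (assumes the pip for your Python 2.7+ interpreter is on PATH):

```sh
# Install the SOS Python-support prerequisites
pip install numpy "Cython>=0.29"
```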
        • cd sos-latest-stable
        • ./autogen.sh
        • mkdir build
        • cd build
        • ../configure --prefix=__install-dir__ \
          • --libdir=__install-dir__/lib64 \
          • --libexecdir=__install-dir__/lib64 \
          • [--disable-python]
      • Building SOS
        • make && sudo make install
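Put together, the steps above form the following shell sequence (a sketch; __install-dir__ stands for your chosen install prefix):

```sh
# Out-of-tree SOS build and install (sketch)
cd sos-latest-stable
./autogen.sh                  # generate the configure script
mkdir build && cd build
../configure --prefix=__install-dir__ \
             --libdir=__install-dir__/lib64 \
             --libexecdir=__install-dir__/lib64
# append --disable-python above to skip the Python bindings
make && sudo make install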
    • How to build zap_ugni without the gpcd submodule (XC/Aries only; XE/Gemini uses the system gpcdr)
      • Configuration that points to pre-built gpcd libs and headers (see the sketch below)
        • --enable-ugni
        • --with-aries-libgpcd=/opt/cray/gni/default/lib64/,/opt/cray/gni/default/include/gpcd/
      • gpcd libs and headers location
        • Cray CLE 6.? UP?:
          • /opt/cray/gni/default/include/gpcd/
          • /opt/cray/gni/default/lib64/
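For example, an LDMS configure invocation against the Cray-provided gpcd might look like this (a sketch; only the ugni-related flags are shown, and your other configure options go alongside them):

```sh
# Sketch: build zap_ugni against the system gpcd on an XC (Aries) system
../configure --enable-ugni \
             --with-aries-libgpcd=/opt/cray/gni/default/lib64/,/opt/cray/gni/default/include/gpcd/
```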
      • Getting the gpcd code
      • Building gpcd
        • Set up your environment to use the gnu compiler
        • cd gpcd
        • ./autogen.sh
        • mkdir build
        • cd build
        • ../configure --prefix=/gpcd
        • make && make install
          • This will install libs in gpcd/lib/
          • This will install headers in gpcd/include/gpcdlocal/
        • NOTE: The local build names the libs with a “local” label, so you must set up symbolic links for the actual lib names:
        • cd /gpcd/lib (the install lib directory from above)
        • ln -s libgpcdlocal.so.0.0.0 libgpcd.so
        • ln -s libgpcdlocal.so.0.0.0 libgpcd.so.0
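End to end, building gpcd locally and pointing LDMS at it might look like the following (a sketch; the /gpcd prefix matches the configure line above, and the final configure line shows only the ugni-related flags, with paths following the install layout above):

```sh
# Sketch: build gpcd from source and fix up the lib names
# (assumes your environment is already set up to use the gnu compiler)
cd gpcd
./autogen.sh
mkdir build && cd build
../configure --prefix=/gpcd
make && make install          # libs land in /gpcd/lib, headers in /gpcd/include/gpcdlocal

# the local build labels the libs "local"; link the names LDMS expects
cd /gpcd/lib
ln -s libgpcdlocal.so.0.0.0 libgpcd.so
ln -s libgpcdlocal.so.0.0.0 libgpcd.so.0

# then, from your LDMS build directory, point configure at the local install
../configure --enable-ugni \
             --with-aries-libgpcd=/gpcd/lib,/gpcd/include/gpcdlocal
```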

Configuration Considerations

  • LDMSD Transport
    • Determining what value to set for the Completion Queue (CQ) depth for aggregators:
      • Recommendation: Set ZAP_UGNI_CQ_DEPTH=65536
        • The default CQ depth is 2K, which is not enough for aggregators.
      • The CQ contains slots that are consumed when RDMA requests are completed.
      • RDMA is requested by
        • lookup
          • Happens right after connect
        • update
          • Happens every time the updater schedules a set update
        • push
          • Happens when a set registered for push closes a transaction boundary (i.e. sample completes)
      • If the low-level completion thread cannot keep up, the CQ may overflow, resulting in GNI_RESOURCE_ERRORs
    • Event queue depth
      • Recommendation: export ZAP_EVENT_QDEPTH=65536
      • I/O events are delivered to threads for handling by the application
        • Recommendation: export ZAP_EVENT_WORKERS=8
      • To maintain ordering, an endpoint is assigned to one and only one of the I/O worker threads
      • Because handling an event can take a long time (e.g. storing the updated data), these queues may need to be deeper than expected and the number of threads larger
      • If the I/O still cannot keep up due to your system limitations, you may need to split up your store (e.g., use multiple containers, or multiple aggregators writing to different locations), so that the data does not go to a single sink.
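Taken together, the transport recommendations above amount to exporting a few variables in the environment of the aggregator ldmsd before it starts (a sketch; the values are the recommendations from this list):

```sh
# Transport tuning for an aggregator ldmsd (values recommended above)
export ZAP_UGNI_CQ_DEPTH=65536   # completion queue depth; the 2K default is too small
export ZAP_EVENT_QDEPTH=65536    # zap event queue depth
export ZAP_EVENT_WORKERS=8       # I/O worker threads delivering events
```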
  • LDMSD Set memory
    • Sets occupy memory that is mapped and locked for the purpose of exchanging with a peer using RDMA
    • Typically, these transports have limited remote memory access resources. To minimize LDMS utilization of these, we map all set memory a priori and use it as needed to contain instances of sets
    • This is the -m option on the ldmsd command line. Sets may require anywhere from 2K to 64K each, depending on the set
    • Default values are currently set based on sampler estimates, and will be sufficient even for the aggregator for test cases. For large-scale systems and many sets, you should increase the aggregator set memory size.
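For example, an aggregator handling many sets might be started with a larger set memory pool via -m (a sketch; the port, config file name, and 2G value are illustrative placeholders, not recommendations):

```sh
# Sketch: give an aggregator a larger pre-mapped set memory pool
ldmsd -x ugni:411 -c aggregator.conf -m 2G
```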
