Public comment from ScienceLogic #126

krusynth · 2019-01-02T18:30:34Z

Comments on Update to Data Center Optimization Initiative

December 21, 2018

Executive Summary

ScienceLogic is pleased to provide the following comments regarding the most recent update to
the Federal Data Center Optimization Initiative (DCOI), specifically regarding the new metric of
Availability. Availability has long been the most critical underpinning of any IT operation. There
is a long-standing gap between facilities operations, IT Ops and SecOps, and it must be closed
to support accurate and holistic decision-making about data center
optimization. We propose a single availability metric, supplemented by
additional context from a variety of related sub-metrics, as a best practice for Data Center
Optimization.

Data Center Availability Metric

This proposed Availability metric is an aggregation of the proposed sub-metrics listed below,
any of which could be the cause of data center unavailability to any individual organization,
while the remaining sub-metrics could simultaneously remain relatively healthy. ScienceLogic
currently uses the construct of a "Business Service" availability metric to differentiate from
individual ITSM, Device, Network or Security availability. These individual metrics combine to
form a composite Availability metric for a Facility, under the umbrella of overall
Data Center Availability, which can be considered the key “Business Service” being measured.

An Overview of Business Services

In the context of Data Center Availability, it is important to arrive at a single metric that is
unencumbered by the low-level detail of individual data center components such as servers,
switches, load balancers, etc. Hence the concept of a Business Service: a
collection of IT devices that combine to deliver a service that is important to the business,
either as a significant business function (such as email or eCommerce) or as an important IT
function such as Active Directory, which could be considered an IT Service. IT Services
typically combine to form a Business Service.

An IT Service or Business Service typically comprises a collection of devices such as routers,
switches, servers, hosts, operating systems, virtual machines, storage units, etc. Each of these
elements is monitored for its individual status, performance or health, using metrics such as
CPU utilization, memory usage, fan or power supply state, etc. These metrics, while vital at the
device level, are too detailed at the service level. At the Business Service level, we ‘roll up’ the
individual device state into composite metrics that represent the overall condition of the
Business Service. Such metrics are typically Health, Availability, and Risk.

To illustrate these concepts of Health, Availability and Risk, let us take a simple example. A
cluster of four redundant servers may be arranged so that any one server could carry the entire
workload of user demand, with four servers deployed for redundancy in case of failure to ensure
high availability. These four servers might represent the core of a typical Business Service. If
one server fails, the service is still 100% available, but the health degrades to 75% and the risk
rises to 25%. The Health, Availability and Risk metrics are important to the Executive oversight
of the service – much more than a single alert that shows an individual server CPU utilization
level has risen above a specific threshold, or that a backup power supply on one server may
have overheated.
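
The four-server example can be sketched in a few lines of Python. This is only an illustrative sketch, not ScienceLogic's actual calculation: the `roll_up` function, the `ServiceStatus` type, and the assumption that Health and Risk scale linearly with the number of working servers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ServiceStatus:
    availability: float  # percent: can users still reach the service?
    health: float        # percent of member devices that are up
    risk: float          # percent of redundancy already lost


def roll_up(total: int, up: int, required: int = 1) -> ServiceStatus:
    """Roll device state up into service-level metrics.

    Assumption (not from the source): the service counts as fully
    available while at least `required` members are up, and Health and
    Risk are simple linear functions of the number of members still up.
    """
    availability = 100.0 if up >= required else 0.0
    health = 100.0 * up / total
    risk = 100.0 - health
    return ServiceStatus(availability, health, risk)


# One of four redundant servers has failed.
print(roll_up(total=4, up=3))
```

With one server down, the sketch reports the service as fully available while Health and Risk move to 75% and 25%, matching the figures in the example.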

These detailed metrics are still important and must continue to be collected. However, the
intent is to enable an executive or management-level view of the overall service, while
operationally it will still be necessary to collect detailed performance and status metrics on all
of the underlying infrastructure components.

Facility Availability Sub-Metrics

Business Service availability should encompass total availability from any of the technology or
practice areas mentioned above. In turn, each technology area should encompass availability
from other critical sub-components – and should understand the dependencies between them.
ScienceLogic's best practice is to construct Business Services from the
ground up: availability should be computed first at the device level, and then for all IT
Services, of which devices are a subset that is potentially critical to the availability of the whole
Business Service. In any data center facility, for example, the failure (i.e. lack of availability) of
a core switch or router could cause a complete service outage for all child devices attached to
that router. That device outage could in turn create an outage for any service or
organization relying on services or applications delivered by the devices that depend
on the core switch. This would create a relative score of unavailability for the impacted
organization(s), but not for the entire facility. Hence the Business Service can be further subdivided
into sub-Business Services, or into IT Services, each of which could be built around a series of
core switches related to an organization, application or service, for example.
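
The core-switch scenario can be sketched as a walk over a parent-to-child device map. The device names, the organization names, and both helper functions below are hypothetical and serve only to show how one failure maps to impacted organizations rather than to the whole facility.

```python
from collections import deque


def impacted_devices(children: dict[str, list[str]], failed: str) -> set[str]:
    """Return the failed device plus everything downstream of it."""
    impacted = {failed}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted


def impacted_orgs(owner: dict[str, str], impacted: set[str]) -> set[str]:
    """Organizations that own at least one impacted device."""
    return {owner[d] for d in impacted if d in owner}


# Hypothetical facility: two core switches, each serving one organization.
children = {
    "core-sw-a": ["srv-1", "srv-2"],
    "core-sw-b": ["srv-3"],
}
owner = {"srv-1": "org-x", "srv-2": "org-x", "srv-3": "org-y"}

down = impacted_devices(children, "core-sw-a")
print(sorted(down))
print(impacted_orgs(owner, down))
```

In this made-up layout only the organization behind the failed switch is marked unavailable, while the other organization in the same facility is untouched.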

Each sub-metric would contain timestamps related specifically to the availability and
unavailability of the service (whether IT Service or Business Service). Similarly, the components
making up the IT or Business Service could be classified into device groups or
organizations. Alternatively, these components could carry attributes or tags in their
metadata that identify availability metrics tied to a particular facility, a cage
within a facility, one or more racks, an organization, or even an application or service common
to a cluster of organizations collocated within the facility. In addition, an organization or
device group that is set into maintenance mode does not necessarily have to incur an SLA
penalty for unavailability: maintenance windows can be scheduled, and Runbook actions
(automation scripts typically triggered by events or alerts) can be suppressed during a
scheduled maintenance period as needed, in order to avoid biasing the availability metric.
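
One way to keep scheduled maintenance from biasing the metric is to remove maintenance windows from both the downtime and the measured period. The sketch below assumes that approach. The function names and timestamps are invented for illustration, and it further assumes that outages do not overlap one another and that maintenance windows do not overlap one another.

```python
from datetime import datetime, timedelta

Interval = tuple[datetime, datetime]  # (start, end)


def overlap(*intervals: Interval) -> timedelta:
    """Length of time common to all given intervals (zero if none)."""
    start = max(i[0] for i in intervals)
    end = min(i[1] for i in intervals)
    return max(end - start, timedelta(0))


def availability_pct(period: Interval, outages: list[Interval],
                     maintenance: list[Interval]) -> float:
    """Percent availability over `period`, ignoring scheduled maintenance."""
    excluded = sum((overlap(period, m) for m in maintenance), timedelta(0))
    measured = (period[1] - period[0]) - excluded
    if measured <= timedelta(0):
        raise ValueError("the whole period is covered by maintenance")
    downtime = timedelta(0)
    for outage in outages:
        # Downtime inside a maintenance window carries no penalty.
        excused = sum((overlap(period, outage, m) for m in maintenance),
                      timedelta(0))
        downtime += overlap(period, outage) - excused
    return 100.0 * (1 - downtime / measured)


# Made-up day: maintenance 02:00-04:00, outage 03:00-05:00.
day = (datetime(2019, 1, 2, 0), datetime(2019, 1, 3, 0))
window = [(datetime(2019, 1, 2, 2), datetime(2019, 1, 2, 4))]
outage = [(datetime(2019, 1, 2, 3), datetime(2019, 1, 2, 5))]
print(f"{availability_pct(day, outage, window):.2f}%")
```

Here only the hour of outage that falls outside the maintenance window counts against the service, measured over the 22 hours that were not under maintenance.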

The Risk of a Non-Business Service Approach

Although facility availability can be measured as a singular metric, the inclination to measure
availability based on any single point of failure, or any single metric, could prove considerably
inaccurate for a facility. Take, for example, the leading causes of business downtime (Source:
Contingency Planning Research, a division of Eagle Rock Alliance), listed as: power outage, water
damage, hardware failure, earthquake, fire, hurricane, building outage, corrupt data. So, while
the power source entering the data center facility is critical, it might be seen as available and
perfectly healthy while another disaster, in the form of corrupted data or lost bandwidth,
proves equally devastating. Hence visibility that includes environmental systems, and that ranges
from network devices to a variety of internal IT components, is still necessary to understand the
total, holistic Data Center Availability. An understanding of causality (facility vs IT vs Act of God,
for example) is a critical input to any Data Center Availability calculation.

Health, Availability and Risk

In addition to the individual technology/practice areas whose metrics combine into this
singular Business Service, each of these technology/practice areas can be further
deconstructed into Health, Availability and Risk metrics. At its simplest, Health can be considered
the aggregation of several fault and performance parameters that together indicate whether an
established KPI, threshold or SLA is being held at a state of relatively good health. That Health
score can be a fixed metric, or a relative score that reflects deviation from normal behavior, based
on learned normal operation. Any anomaly detected more than 1, 2, or 3 standard deviations away
from normal behavior could reduce the Health score of a Business Service, and in practice would
raise the Risk score of a Business Service, creating a relative metric that indicates a level of
exposure, as well as an increased probability of overall service failure, before a complete service
outage occurs. Availability SLAs can remain fully intact while individual device components
begin elevating the risk to availability of services. Hence Health and Risk metrics can be viewed as
discrete but helpful elements in the broader optimization program.
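
The standard-deviation idea can be sketched as follows. The penalty tiers (5, 15 and 30 points) and the function names are assumptions made purely for illustration; the comment itself does not specify how a deviation should translate into a Health or Risk score.

```python
from statistics import mean, stdev


def deviation_in_sigmas(baseline: list[float], observed: float) -> float:
    """How many standard deviations `observed` lies from the learned mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation
    if sigma == 0:
        return 0.0 if observed == mu else float("inf")
    return abs(observed - mu) / sigma


def health_penalty(sigmas: float) -> int:
    """Hypothetical penalty tiers for 1, 2 and 3 standard deviations."""
    if sigmas >= 3:
        return 30
    if sigmas >= 2:
        return 15
    if sigmas >= 1:
        return 5
    return 0


# Made-up learned baseline for one metric, e.g. CPU utilization in percent.
baseline = [40, 42, 38, 41, 39]
for observed in (40, 43, 45):
    penalty = health_penalty(deviation_in_sigmas(baseline, observed))
    print(observed, "health:", 100 - penalty, "risk:", penalty)
```

In this sketch the service stays available throughout; only the Health and Risk scores move as the observed value drifts further from its learned normal.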

Dependencies and Relationships Inside the Data Center

To enable the calculation and real-time observation of Data Center availability, it is important
to understand the relationships between the various Data Center components that make up
the overall service. This requires an understanding of the topological and logical relationships
between those components within an IT service - and the Business Services that each IT service
serves. This is a complex problem to solve and one which ScienceLogic has studied extensively.
Automated dependency mapping shows the relationships between elements inside a service in
real time, which enables a picture of the Business Service to be derived automatically. Also,
from an operational perspective, it enables automated, topology-based event correlation and
suppression, such that only root cause alerts are displayed to an operator, instead of flooding
an event console with hundreds or more downstream device-specific alerts caused by an
upstream device failure.
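
Topology-based suppression can be sketched as a filter that keeps an alert only when no upstream device is also alerting. This is a simplified illustration, not ScienceLogic's correlation engine: it assumes each device has at most one parent, and the device names and the `root_cause_alerts` function are hypothetical.

```python
from typing import Optional


def root_cause_alerts(parent_of: dict[str, Optional[str]],
                      alerting: set[str]) -> set[str]:
    """Keep only alerts with no alerting device anywhere upstream."""
    roots = set()
    for device in alerting:
        seen = {device}  # guards against accidental cycles in the map
        ancestor = parent_of.get(device)
        suppressed = False
        while ancestor is not None and ancestor not in seen:
            if ancestor in alerting:
                suppressed = True  # an upstream failure explains this alert
                break
            seen.add(ancestor)
            ancestor = parent_of.get(ancestor)
        if not suppressed:
            roots.add(device)
    return roots


# Hypothetical topology: core switch -> rack switches -> servers.
parent_of = {
    "core-sw": None,
    "rack-sw-1": "core-sw",
    "rack-sw-2": "core-sw",
    "web-1": "rack-sw-1",
    "web-2": "rack-sw-1",
    "db-1": "rack-sw-2",
}

alerts = {"core-sw", "rack-sw-1", "web-1", "web-2", "db-1"}
print(root_cause_alerts(parent_of, alerts))
```

When the core switch and its downstream devices all alert at once, only the core switch is shown to the operator. Two unrelated server alerts with healthy upstream switches would both be kept.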

ScienceLogic, Reston, VA 20190 USA|www.sciencelogic.com
