Merge pull request #35 from yonch/collection-strategy
add discussion of collection strategy
yonch authored Jan 15, 2025
2 parents 6f46792 + 3c4d063 commit e133efb
Showing 1 changed file with 64 additions and 0 deletions.
64 changes: 64 additions & 0 deletions docs/collection.md
@@ -0,0 +1,64 @@
# Memory Collector Telemetry Strategy

## Overview

Our telemetry strategy prioritizes high-resolution, low-level data collection to build a foundation for understanding memory subsystem interference. By focusing on simplicity and data quality in the initial collector, we can enable rapid iteration and validation of detection algorithms.

**The key aspects of our approach are:**

- Collect per-process, per-core metrics at 1 millisecond granularity to capture interference at a meaningful timescale
- Collect per-process cache occupancy metrics at 1 millisecond granularity
- Generate synchronized datasets for joint analysis
- Implement in stages to manage complexity

**This "firehose" telemetry will enable us to build a dataset for offline analysis, allowing us to identify patterns and develop algorithms for real-time interference detection.**

## Telemetry Collection

The collector will record the following fields for each process on each core at 1 millisecond granularity:

- Process ID
- Core ID
- Core frequency during the measured interval
- Cycles
- Instructions
- Last level cache misses
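As a rough illustration of the field list above, one measurement could be modeled as a small record type. This is only a sketch: the class name `CoreSample`, the `timestamp_ns` field (added because the datasets are later aligned by timestamp), and the units are assumptions, not part of the collector's design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreSample:
    """One 1 ms measurement for one process on one core (hypothetical layout)."""
    timestamp_ns: int   # assumed: start of the interval, in nanoseconds
    pid: int            # process ID
    core_id: int        # core the process ran on during the interval
    frequency_khz: int  # assumed unit for the core frequency in the interval
    cycles: int
    instructions: int
    llc_misses: int     # last level cache misses


# Example values are made up for illustration.
sample = CoreSample(
    timestamp_ns=1_000_000, pid=4242, core_id=3,
    frequency_khz=2_400_000, cycles=1_900_000,
    instructions=2_850_000, llc_misses=1_200,
)
```

A frozen dataclass keeps each sample immutable once recorded, which makes accidental modification during later processing less likely.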

Modern cloud environments routinely run dozens or even hundreds of applications on a single server, each with its own dynamic memory usage patterns. In an extreme case, 100 applications that each change phase once per second on average produce 100 phase changes per second in aggregate, or one every 10 milliseconds.

**The 1 millisecond telemetry granularity enables us to detect this behavior and characterize interference at a meaningful timescale.**
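The arithmetic behind this estimate can be written out as a short calculation. The function name and parameters below are illustrative only; the inputs are the figures from the extreme case described above.

```python
def aggregate_phase_interval_ms(num_apps: int, changes_per_app_per_sec: float) -> float:
    """Mean time between phase changes across all applications, in milliseconds."""
    aggregate_rate_per_sec = num_apps * changes_per_app_per_sec
    return 1000.0 / aggregate_rate_per_sec


SAMPLING_PERIOD_MS = 1.0  # telemetry granularity from this document

interval_ms = aggregate_phase_interval_ms(num_apps=100, changes_per_app_per_sec=1.0)
samples_per_interval = interval_ms / SAMPLING_PERIOD_MS
print(interval_ms, samples_per_interval)  # prints "10.0 10.0"
```

Under these assumptions, a 1 millisecond sampling period yields about ten samples between consecutive phase changes.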

In addition to these per-process metrics, we will also collect cache occupancy measurements using Intel RDT's Cache Monitoring Technology (CMT) or an equivalent mechanism. This data will be collected per process at the same 1 millisecond granularity.

**Monitoring cache usage per process is necessary because cache contents persist across context switches, so a process's footprint remains after it is descheduled, and because all threads of a process share that footprint.**
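On Linux, one way to read CMT occupancy is the kernel's resctrl filesystem, which exposes an `llc_occupancy` file per L3 domain under each monitoring group's `mon_data` directory. The sketch below assumes that interface; the document does not commit to it, the helper name `read_llc_occupancy` is hypothetical, and whether this path can be polled at 1 millisecond is not established here.

```python
from pathlib import Path


def read_llc_occupancy(group_dir: Path) -> dict[str, int]:
    """Return LLC occupancy in bytes per L3 domain for one monitoring group."""
    occupancy = {}
    for domain_dir in sorted((group_dir / "mon_data").glob("mon_L3_*")):
        value = (domain_dir / "llc_occupancy").read_text().strip()
        # Assumption: a non-numeric placeholder means no reading is available,
        # so that domain is skipped rather than treated as an error.
        if value.isdigit():
            occupancy[domain_dir.name] = int(value)
    return occupancy
```

A caller would pass the directory of a monitoring group it created, for example `Path("/sys/fs/resctrl/mon_groups/demo_group")`, where `demo_group` is a made-up name.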

## Data Format

For the initial version, telemetry will be written to CSV files to simplify data collection and analysis. Each row will represent a single measurement interval for a specific process.
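A minimal sketch of the row-per-interval CSV layout, using Python's standard `csv` module. The column names, the `timestamp_ns` column, and the units are assumptions made for illustration; the document does not fix a schema.

```python
import csv
import io

# Hypothetical column names for the per-process, per-core dataset.
FIELDS = ["timestamp_ns", "pid", "core_id", "frequency_khz",
          "cycles", "instructions", "llc_misses"]


def write_samples(out, rows):
    """Write a header line followed by one CSV row per measurement interval."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


buf = io.StringIO()
write_samples(buf, [{
    "timestamp_ns": 1_000_000, "pid": 4242, "core_id": 3,
    "frequency_khz": 2_400_000, "cycles": 1_900_000,
    "instructions": 2_850_000, "llc_misses": 1_200,
}])
csv_text = buf.getvalue()
```

Writing to any file-like object keeps the same function usable for an in-memory buffer, as here, or a file opened with `newline=""`.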

**We will generate two datasets:**

1. Per-process, per-core measurements (process ID, core ID, frequency, cycles, instructions, LLC misses)
2. Per-process cache occupancy measurements

**While these datasets will be separate, they will be synchronized and aligned by timestamp to enable joint analysis.**
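One way the joint analysis could work is to index the occupancy dataset by timestamp and process ID, then attach the matching value to each per-core row. This sketch assumes both datasets carry identical interval timestamps; the key names (`timestamp_ns`, `pid`, `llc_occupancy_bytes`) and the sample values are hypothetical.

```python
def join_by_timestamp(core_rows, occupancy_rows):
    """Attach cache occupancy to each per-core row sharing its (timestamp, pid)."""
    occupancy_index = {
        (row["timestamp_ns"], row["pid"]): row["llc_occupancy_bytes"]
        for row in occupancy_rows
    }
    joined = []
    for row in core_rows:
        key = (row["timestamp_ns"], row["pid"])
        # None marks intervals with no matching occupancy sample.
        joined.append({**row, "llc_occupancy_bytes": occupancy_index.get(key)})
    return joined


core_rows = [
    {"timestamp_ns": 1_000_000, "pid": 4242, "core_id": 3, "llc_misses": 1200},
    {"timestamp_ns": 1_000_000, "pid": 4242, "core_id": 5, "llc_misses": 300},
    {"timestamp_ns": 2_000_000, "pid": 4242, "core_id": 3, "llc_misses": 900},
]
occupancy_rows = [
    {"timestamp_ns": 1_000_000, "pid": 4242, "llc_occupancy_bytes": 2_097_152},
]
joined = join_by_timestamp(core_rows, occupancy_rows)
```

Because occupancy is recorded per process while the other dataset is per process and per core, a single occupancy value can attach to several rows in the same interval.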

## Implementation Stages

To manage complexity, we will implement telemetry collection in two stages:

1. Collect per-process, per-core measurements (process ID, core ID, frequency, cycles, instructions, LLC misses)
2. Add per-process cache occupancy measurements using Intel RDT or an equivalent mechanism

**This staged approach allows us to validate the core telemetry pipeline before adding the complexity of cache monitoring.**

For the cache monitoring stage, we will need to assign each process a unique hardware monitoring identifier (e.g., a Resource Monitoring ID, or RMID, for Intel RDT) to track its cache usage. This will require additional system-level coordination and metadata management.
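The metadata management mentioned above could start as a simple mapping from process IDs to a finite pool of identifiers, since hardware monitoring identifiers are a limited resource. The class below is a hypothetical sketch of that bookkeeping only; it does not program any hardware, and the pool size is an arbitrary example value.

```python
class MonitoringIdAllocator:
    """Map process IDs to identifiers drawn from a fixed-size pool (sketch)."""

    def __init__(self, num_ids: int):
        # Stored in reverse so pop() hands out the lowest free ID first.
        self._free = list(range(num_ids - 1, -1, -1))
        self._by_pid = {}

    def assign(self, pid: int) -> int:
        """Return the ID already held by pid, or claim a free one."""
        if pid in self._by_pid:
            return self._by_pid[pid]
        if not self._free:
            raise RuntimeError("no monitoring identifiers left")
        mon_id = self._free.pop()
        self._by_pid[pid] = mon_id
        return mon_id

    def release(self, pid: int) -> None:
        """Return pid's ID to the pool, for example when the process exits."""
        mon_id = self._by_pid.pop(pid, None)
        if mon_id is not None:
            self._free.append(mon_id)


allocator = MonitoringIdAllocator(num_ids=2)
first_id = allocator.assign(100)
second_id = allocator.assign(200)
```

Raising on exhaustion is one possible policy; a real collector might instead recycle identifiers from idle processes.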

## Analysis and Algorithm Development

By collecting high-resolution telemetry from multiple clusters, both real-world deployments and benchmark environments, we aim to build a representative dataset capturing a wide range of interference scenarios.

**Analyzing this data offline using big data techniques will help us identify common interference patterns, resource usage signatures, and relevant metrics for detecting contention.**

These insights will inform the development of algorithms for real-time interference detection in future collector versions. Starting with a thorough understanding of low-level behavior is key to building effective higher-level detection and mitigation strategies.
