data versioning, model versioning, and model monitoring. MLOps also requires stakeholder collaboration, including data scientists, engineers, and IT operations.
-While DevOps and MLOps share similarities in their goals and principles, they differ in their focus and challenges. DevOps focuses on improving the collaboration between development and operations teams and automating software delivery. In contrast, MLOps focuses on streamlining and automating the ML lifecycle and facilitating collaboration between data scientists, data engineers, and IT operations.
-Table 13.1 compares and summarizes them side by side.
+While DevOps and MLOps share similarities in their goals and principles, they differ in their focus and challenges. DevOps focuses on improving the collaboration between development and operations teams and automating software delivery. In contrast, MLOps focuses on streamlining and automating the ML lifecycle and facilitating collaboration between data scientists, data engineers, and IT operations. Table 13.1 compares and summarizes them side by side.
diff --git a/contents/privacy_security/privacy_security.html b/contents/privacy_security/privacy_security.html
index 4e2c6440..71994ff7 100644
--- a/contents/privacy_security/privacy_security.html
+++ b/contents/privacy_security/privacy_security.html
@@ -1087,7 +1087,7 @@
Threat Type
Description
-Relevance to Embedded ML Hardware Security
+Relevance to ML Hardware Security
diff --git a/search.json b/search.json
index 73e749e8..466c9bd9 100644
--- a/search.json
+++ b/search.json
@@ -1038,7 +1038,7 @@
"href": "contents/hw_acceleration/hw_acceleration.html#sec-aihw",
"title": "10 AI Acceleration",
"section": "10.3 Accelerator Types",
- "text": "10.3 Accelerator Types\nHardware accelerators can take on many forms. They can exist as a widget (like the Neural Engine in the Apple M1 chip) or as entire chips specially designed to perform certain tasks very well. This section will examine processors for machine learning workloads along the spectrum from highly specialized ASICs to more general-purpose CPUs. We first focus on custom hardware purpose-built for AI to understand the most extreme optimizations possible when design constraints are removed. This establishes a ceiling for performance and efficiency.\nWe then progressively consider more programmable and adaptable architectures, discussing GPUs and FPGAs. These make tradeoffs in customization to maintain flexibility. Finally, we cover general-purpose CPUs that sacrifice optimizations for a particular workload in exchange for versatile programmability across applications.\nBy structuring the analysis along this spectrum, we aim to illustrate the fundamental tradeoffs between utilization, efficiency, programmability, and flexibility in accelerator design. The optimal balance point depends on the constraints and requirements of the target application. This spectrum perspective provides a framework for reasoning about hardware choices for machine learning and the capabilities required at each level of specialization.\nFigure 10.2 illustrates the complex interplay between flexibility, performance, functional diversity, and area of architecture design. Notice how the ASIC is on the bottom-right corner, with minimal area, flexibility, and power consumption and maximal performance, due to its highly specialized application-specific nature. 
A key tradeoff is functional diversity vs. performance: general-purpose architectures can serve diverse applications, but their per-application performance is degraded compared to more customized architectures.\nThe progression begins with the most specialized option, ASICs purpose-built for AI, to ground our understanding in the maximum possible optimizations before expanding to more generalizable architectures. This structured approach aims to elucidate the accelerator design space.\n\n\n\n\n\n\nFigure 10.2: Design tradeoffs. Source: El-Rayis (2014).\n\n\nEl-Rayis, A. O. 2014. “Reconfigurable Architectures for the Next Generation of Mobile Device Telecommunications Systems.” : https://www.researchgate.net/publication/292608967.\n\n\n\n10.3.1 Application-Specific Integrated Circuits (ASICs)\nAn Application-Specific Integrated Circuit (ASIC) is a type of integrated circuit (IC) that is custom-designed for a specific application or workload rather than for general-purpose use. Unlike CPUs and GPUs, ASICs do not support multiple applications or workloads. Rather, they are optimized to perform a single task extremely efficiently. The Google TPU is an example of an ASIC.\nASICs achieve this efficiency by tailoring every aspect of the chip design - the underlying logic gates, electronic components, architecture, memory, I/O, and manufacturing process - specifically for the target application. This level of customization allows removing any logic or functionality needed only for general-purpose computation. The result is an IC that maximizes performance and power efficiency on the desired workload. 
The efficiency gains from application-specific hardware are so substantial that software-centric firms dedicate enormous engineering resources to designing customized ASICs.\nThe rise of more complex machine learning algorithms has made the performance advantages enabled by tailored hardware acceleration a key competitive differentiator, even for companies traditionally concentrated on software engineering. ASICs have become a high-priority investment for major cloud providers aiming to offer faster AI computation.\n\nAdvantages\nDue to their customized nature, ASICs provide significant benefits over general-purpose processors like CPUs and GPUs. The key advantages include the following.\n\nMaximized Performance and Efficiency\nThe most fundamental advantage of ASICs is maximizing performance and power efficiency by customizing the hardware architecture specifically for the target application. Every transistor and design aspect is optimized for the desired workload - no unnecessary logic or overhead is needed to support generic computation.\nFor example, Google’s Tensor Processing Units (TPUs) contain architectures tailored exactly for the matrix multiplication operations used in neural networks. To design the TPU ASICs, Google’s engineering teams need to define the chip specifications clearly, write the architecture description using Hardware Description Languages like Verilog, synthesize the design to map it to hardware components, and carefully place-and-route transistors and wires based on the fabrication process design rules. 
This complex design process, known as very-large-scale integration (VLSI), allows them to build an optimized IC for machine learning workloads.\nAs a result, TPU ASICs achieve over an order of magnitude higher efficiency in operations per watt than general-purpose GPUs on ML workloads by maximizing performance and minimizing power consumption through a full-stack custom hardware design.\n\n\nSpecialized On-Chip Memory\nASICs incorporate on-chip SRAM and caches specifically optimized to feed data to the computational units. For example, Apple’s M1 system-on-a-chip contains special low-latency SRAM to accelerate the performance of its Neural Engine machine learning hardware. Large local memory with high bandwidth enables data to be kept close to the processing elements. This provides tremendous speed advantages compared to off-chip DRAM access, which can be up to 100x slower.\nData locality and optimizing memory hierarchy are crucial for high throughput and low power. Table 10.1 shows “Numbers Everyone Should Know,” from Jeff Dean.\n\n\n\nTable 10.1: Latency comparison of operations in computing and networking.\n\n\n\n\n\n\n\n\n\n\nOperation\nLatency\nNotes\n\n\n\n\nL1 cache reference\n0.5 ns\n\n\n\nBranch mispredict\n5 ns\n\n\n\nL2 cache reference\n7 ns\n\n\n\nMutex lock/unlock\n25 ns\n\n\n\nMain memory reference\n100 ns\n\n\n\nCompress 1K bytes with Zippy\n3,000 ns\n3 us\n\n\nSend 1 KB bytes over 1 Gbps network\n10,000 ns\n10 us\n\n\nRead 4 KB randomly from SSD\n150,000 ns\n150 us\n\n\nRead 1 MB sequentially from memory\n250,000 ns\n250 us\n\n\nRound trip within same datacenter\n500,000 ns\n0.5 ms\n\n\nRead 1 MB sequentially from SSD\n1,000,000 ns\n1 ms\n\n\nDisk seek\n10,000,000 ns\n10 ms\n\n\nRead 1 MB sequentially from disk\n20,000,000 ns\n20 ms\n\n\nSend packet CA → Netherlands → CA\n150,000,000 ns\n150 ms\n\n\n\n\n\n\n\n\nCustom Datatypes and Operations\nUnlike general-purpose processors, ASICs can be designed to natively support custom datatypes like INT4 or 
bfloat16, which are widely used in ML models. For instance, Nvidia’s Ampere GPU architecture has dedicated bfloat16 Tensor Cores to accelerate AI workloads. Low-precision datatypes enable higher arithmetic density and performance. ASICs can also directly incorporate non-standard operations common in ML algorithms as primitive operations - for example, natively supporting activation functions like ReLU makes execution more efficient. Please refer to the Efficient Numeric Representations chapter for additional details.\n\n\nHigh Parallelism\nASIC architectures can leverage higher parallelism tuned for the target workload versus general-purpose CPUs or GPUs. More computational units tailored for the application mean more operations execute simultaneously. Highly parallel ASICs achieve tremendous throughput for data parallel workloads like neural network inference.\n\n\nAdvanced Process Nodes\nCutting-edge manufacturing processes allow more transistors to be packed into smaller die areas, increasing density. ASICs designed specifically for high-volume applications can better amortize the costs of cutting-edge process nodes.\n\n\n\nDisadvantages\n\nLong Design Timelines\nThe engineering process of designing and validating an ASIC can take 2-3 years. Synthesizing the architecture using hardware description languages, taping out the chip layout, and fabricating the silicon on advanced process nodes involve long development cycles. For example, to tape out a 7nm chip, teams need to define specifications carefully, write the architecture in HDL, synthesize the logic gates, place components, route all interconnections, and finalize the layout to send for fabrication. 
This very large-scale integration (VLSI) flow means ASIC design and manufacturing can traditionally take 2-5 years.\nThere are a few key reasons why the long design timelines of ASICs, often 2-3 years, can be challenging for machine learning workloads:\n\nML algorithms evolve rapidly: New model architectures, training techniques, and network optimizations are constantly emerging. For example, Transformers became hugely popular in NLP in the last few years. When an ASIC finishes tapeout, the optimal architecture for a workload may have changed.\nDatasets grow quickly: ASICs designed for certain model sizes or datatypes can become undersized relative to demand. For instance, natural language models are scaling exponentially with more data and parameters. A chip designed for BERT might not accommodate GPT-3.\nML applications change frequently: The industry focus shifts between computer vision, speech, NLP, recommender systems, etc. An ASIC optimized for image classification may have less relevance in a few years.\nFaster design cycles with GPUs/FPGAs: Programmable accelerators like GPUs can adapt much more quickly by upgrading software libraries and frameworks. New algorithms can be deployed without hardware changes.\nTime-to-market needs: Getting a competitive edge in ML requires rapidly experimenting with and deploying new ideas. Waiting several years for an ASIC is incompatible with fast iteration.\n\nThe pace of innovation in ML is poorly matched to the multi-year timescale of ASIC development. Significant engineering efforts are required to extend ASIC lifespan through modular architectures, process scaling, model compression, and other techniques. However, the rapid evolution of ML makes fixed-function hardware challenging.\n\n\nHigh Non-Recurring Engineering Costs\nThe fixed costs of taking an ASIC from design to high-volume manufacturing can be very capital-intensive, often tens of millions of dollars. 
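This NRE-versus-volume tradeoff can be made concrete with a back-of-envelope amortization sketch (all dollar figures below are illustrative assumptions, not vendor quotes):

```python
# Back-of-envelope ASIC NRE amortization (illustrative numbers only).
# Effective per-chip cost = NRE / volume + marginal unit cost.

def per_chip_cost(nre_dollars, unit_cost, volume):
    """Amortized cost of one chip at a given production volume."""
    return nre_dollars / volume + unit_cost

NRE = 50_000_000   # assumed one-time design/tape-out cost ($)
UNIT = 100         # assumed marginal manufacturing cost per chip ($)

for volume in (10_000, 100_000, 1_000_000, 10_000_000):
    print(f"{volume:>10,} chips -> ${per_chip_cost(NRE, UNIT, volume):,.2f} each")
```

At ten thousand units the NRE dominates ($5,100 per chip under these assumptions), while at ten million units it nearly vanishes ($105 per chip), which is why ASICs only make economic sense for high-volume workloads.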
Photomask fabrication for taping out chips in advanced process nodes, packaging, and one-time engineering efforts are expensive. For instance, a 7nm chip tape-out alone could cost millions. The high non-recurring engineering (NRE) investment narrows ASIC viability to high-volume production use cases where the upfront cost can be amortized.\n\n\nComplex Integration and Programming\nASICs require extensive software integration work, including drivers, compilers, OS support, and debugging tools. They also need expertise in electrical and thermal packaging. Additionally, efficiently programming ASIC architectures can involve challenges like workload partitioning and scheduling across many parallel units. The customized nature necessitates significant integration efforts to turn raw hardware into fully operational accelerators.\nWhile ASICs provide massive efficiency gains on target applications by tailoring every aspect of the hardware design to one specific task, their fixed nature results in tradeoffs in flexibility and development costs compared to programmable accelerators, which must be weighed based on the application.\n\n\n\n\n10.3.2 Field-Programmable Gate Arrays (FPGAs)\nFPGAs are programmable integrated circuits that can be reconfigured for different applications. Their customizable nature provides advantages for accelerating AI algorithms compared to fixed ASICs or inflexible GPUs. While Google, Meta, and NVIDIA are considering putting ASICs in data centers, Microsoft deployed FPGAs in its data centers (Putnam et al. 2014) in 2011 to efficiently serve diverse data center workloads.\n\nAdvantages\nFPGAs provide several benefits over GPUs and ASICs for accelerating machine learning workloads.\n\nFlexibility Through Reconfigurable Fabric\nThe key advantage of FPGAs is the ability to reconfigure the underlying fabric to implement custom architectures optimized for different models, unlike fixed-function ASICs. 
For example, quant trading firms use FPGAs to accelerate their algorithms because those algorithms change frequently, and the low NRE cost of FPGAs is more viable than taping out new ASICs. Figure 10.3 contains a table comparing three different FPGAs.\n\n\n\n\n\n\nFigure 10.3: Comparison of FPGAs. Source: Gwennap (n.d.).\n\n\nGwennap, Linley. n.d. “Certus-NX Innovates General-Purpose FPGAs.”\n\n\nFPGAs comprise basic building blocks - configurable logic blocks, RAM blocks, and interconnects. Vendors provide a base amount of these resources, and engineers program the chips by compiling HDL code into bitstreams that rearrange the fabric into different configurations. This makes FPGAs adaptable as algorithms evolve.\nWhile FPGAs may not achieve the utmost performance and efficiency of workload-specific ASICs, their programmability provides more flexibility as algorithms change. This adaptability makes FPGAs a compelling choice for accelerating evolving machine learning applications. Microsoft has deployed FPGAs in its Azure data centers for machine learning workloads to serve diverse applications instead of ASICs. The programmability enables optimization across changing ML models.\n\n\nCustomized Parallelism and Pipelining\nFPGA architectures can leverage spatial parallelism and pipelining by tailoring the hardware design to mirror the parallelism in ML models. For example, Intel’s HARPv2 FPGA platform splits the layers of an MNIST convolutional network across separate processing elements to maximize throughput. Unique parallel patterns like tree ensemble evaluations are also possible on FPGAs. Deep pipelines with optimized buffering and dataflow can be customized to each model’s structure and datatypes. This level of tailored parallelism and pipelining is not feasible on GPUs.\n\n\nLow Latency On-Chip Memory\nLarge amounts of high-bandwidth on-chip memory enable localized storage for weights and activations. 
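The benefit of the deep, customized pipelines described above can be approximated with a simple analytical model (the stage count and input count here are hypothetical):

```python
# Simple pipeline timing model: S stages, each taking one cycle.
# Unpipelined: every input occupies the whole datapath for S cycles.
# Pipelined: after an S-cycle fill, one result completes per cycle.

def cycles_unpipelined(n_inputs, n_stages):
    return n_inputs * n_stages

def cycles_pipelined(n_inputs, n_stages):
    return n_stages + (n_inputs - 1)

N, S = 1_000_000, 8   # hypothetical: 1M activations through an 8-stage pipeline
speedup = cycles_unpipelined(N, S) / cycles_pipelined(N, S)
print(f"speedup ~ {speedup:.2f}x")   # approaches S for large N
```

The model shows why tailoring pipeline depth to a model's structure pays off: throughput approaches one result per cycle regardless of how deep the pipeline is, so deeper (more specialized) pipelines cost almost nothing in steady state.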
For instance, Xilinx Versal FPGAs contain 32MB of low-latency RAM blocks and dual-channel DDR4 interfaces for external memory. Bringing memory physically closer to the compute units reduces access latency. This provides significant speed advantages over GPUs that traverse PCIe or other system buses to reach off-chip GDDR6 memory.\n\n\nNative Support for Low Precision\nA key advantage of FPGAs is the ability to natively implement any bit width for arithmetic units, such as INT4 or bfloat16, used in quantized ML models. For example, Intel’s Stratix 10 NX FPGAs have dedicated INT8 cores that can achieve up to 143 INT8 TOPS at ~1 TOPS/W (Intel Stratix 10 NX FPGA). Lower bit widths increase arithmetic density and performance. FPGAs can even support mixed precision or dynamic precision tuning at runtime.\n\n\n\nDisadvantages\n\nLower Peak Throughput than ASICs\nFPGAs cannot match the raw throughput numbers of ASICs customized for a specific model and precision. The overheads of the reconfigurable fabric compared to fixed-function hardware result in lower peak performance. For example, the TPU v5e pods allow up to 256 chips to be connected with more than 100 petaOps of INT8 performance, while FPGAs can offer up to 143 INT8 TOPS or 286 INT4 TOPS (Intel Stratix 10 NX FPGA).\nThis is because FPGAs comprise basic building blocks - configurable logic blocks, RAM blocks, and interconnects. Vendors provide a set amount of these resources. To program FPGAs, engineers write HDL code and compile it into bitstreams that rearrange the fabric, which has inherent overheads versus an ASIC purpose-built for one computation.\n\n\nProgramming Complexity\nTo optimize FPGA performance, engineers must program the architectures in low-level hardware description languages like Verilog or VHDL. This requires hardware design expertise and longer development cycles than higher-level software frameworks like TensorFlow. 
Maximizing utilization can be challenging despite advances in high-level synthesis from C/C++.\n\n\nReconfiguration Overheads\nChanging FPGA configurations requires reloading a new bitstream, which has considerable latency and storage size costs. For example, partial reconfiguration on Xilinx FPGAs can take 100s of milliseconds. This makes dynamically swapping architectures in real-time infeasible. The bitstream storage also consumes on-chip memory.\n\n\nDiminishing Gains on Advanced Nodes\nWhile smaller process nodes greatly benefit ASICs, they provide fewer advantages for FPGAs. At 7nm and below, effects like process variation, thermal constraints, and aging disproportionately impact FPGA performance. The overheads of the configurable fabric also diminish gains compared to fixed-function ASICs.\n\n\nCase Study\nFPGAs have found widespread application in various fields, including medical imaging, robotics, and finance, where they excel in handling computationally intensive machine learning tasks. In medical imaging, an illustrative example is the application of FPGAs for brain tumor segmentation, a traditionally time-consuming and error-prone process. For instance, Xiong et al. developed a quantized segmentation accelerator, which they retrained using the BraTS19 and BraTS20 datasets. Their work yielded remarkable results, achieving over 5x and 44x performance improvements and 11x and 82x energy efficiency gains compared to GPU and CPU implementations, respectively (Xiong et al. 2021).\n\nXiong, Siyu, Guoqing Wu, Xitian Fan, Xuan Feng, Zhongcheng Huang, Wei Cao, Xuegong Zhou, et al. 2021. “MRI-Based Brain Tumor Segmentation Using FPGA-Accelerated Neural Network.” BMC Bioinf. 22 (1): 421. https://doi.org/10.1186/s12859-021-04347-6.\n\n\n\n\n10.3.3 Digital Signal Processors (DSPs)\nThe first digital signal processor core was built in 1948 by Texas Instruments (The Evolution of Audio DSPs). 
Traditionally, DSPs would have logic to directly access digital/audio data in memory, perform an arithmetic operation (multiply-accumulate, or MAC, was one of the most common operations), and then write the result back to memory. The DSP would include specialized analog components to retrieve digital/audio data.\nOnce we entered the smartphone era, DSPs started encompassing more sophisticated tasks. Smartphones required Bluetooth, Wi-Fi, and cellular connectivity. Media also became much more complex. Today, it’s rare to have entire chips dedicated to just DSP, but a System on Chip would include DSPs and general-purpose CPUs. For example, Qualcomm’s Hexagon Digital Signal Processor claims to be a “world-class processor with both CPU and DSP functionality to support deeply embedded processing needs of the mobile platform for both multimedia and modem functions.” Google Tensor, the chip in the Google Pixel phones, also includes CPUs and specialized DSP engines.\n\nAdvantages\nDSPs architecturally provide advantages in vector math throughput, low latency memory access, power efficiency, and support for diverse datatypes - making them well-suited for embedded ML acceleration.\n\nOptimized Architecture for Vector Math\nDSPs contain specialized data paths, register files, and instructions optimized specifically for vector math operations commonly used in machine learning models. This includes dot product engines, MAC units, and SIMD capabilities tailored for vector/matrix calculations. For example, the CEVA-XM6 DSP (“Ceva SensPro Fuses AI and Vector DSP”) has 512-bit vector units to accelerate convolutions. This efficiency on vector math workloads is far beyond general CPUs.\n\n\nLow Latency On-Chip Memory\nDSPs integrate large amounts of fast on-chip SRAM memory to hold data locally for processing. Bringing memory physically closer to the computation units reduces access latency. For example, Analog Devices’ SHARC+ DSP contains 10MB of on-chip SRAM. 
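The multiply-accumulate (MAC) pattern introduced at the start of this section remains the workhorse of DSP-style ML kernels; a minimal sketch of an integer dot product built from MACs (pure Python, for illustration only):

```python
# A dot product is just a chain of multiply-accumulate (MAC) operations:
# acc <- acc + (w * x), repeated once per element pair.

def mac_dot(weights, activations):
    acc = 0
    for w, x in zip(weights, activations):
        acc += w * x          # one MAC per element pair
    return acc

# INT8-style operands, as used in quantized models
w = [12, -3, 7, 127]
x = [-5, 20, 4, 1]
print(mac_dot(w, x))   # -> 35
```

A hardware MAC unit performs each `acc += w * x` step in a single cycle, and SIMD DSPs issue many such steps per cycle across vector lanes, which is where their vector-math advantage comes from.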
This high-bandwidth local memory provides speed advantages for real-time applications.\n\n\nPower Efficiency\nDSPs are engineered to provide high performance per watt on digital signal workloads. Efficient data paths, parallelism, and memory architectures enable trillions of math operations per second within tight mobile power budgets. For example, Qualcomm’s Hexagon DSP can deliver 4 trillion operations per second (TOPS) while consuming minimal watts.\n\n\nSupport for Integer and Floating Point Math\nUnlike GPUs that excel at single or half precision, DSPs can natively support 8/16-bit integer and 32-bit floating point datatypes used across ML models. Some DSPs support dot product acceleration at INT8 precision for quantized neural networks.\n\n\n\nDisadvantages\nDSPs make architectural tradeoffs that limit peak throughput, precision, and model capacity compared to other AI accelerators. However, their advantages in power efficiency and integer math make them a strong edge computing option. So, while DSPs provide some benefits over CPUs, they also come with limitations for machine learning workloads:\n\nLower Peak Throughput than ASICs/GPUs\nDSPs cannot match the raw computational throughput of GPUs or customized ASICs designed specifically for machine learning. For example, Qualcomm’s Cloud AI 100 ASIC delivers 480 TOPS on INT8, while their Hexagon DSP provides 10 TOPS. DSPs lack the massive parallelism of GPU SM units.\n\n\nSlower Double Precision Performance\nMost DSPs are not optimized for the higher-precision floating point needed in some ML models. Their dot product engines focus on INT8/16 and FP32, which provide better power efficiency. However, 64-bit floating point throughput is much lower, which can limit usage in models requiring high precision.\n\n\nConstrained Model Capacity\nThe limited on-chip memory of DSPs constrains the model sizes that can be run. 
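Whether a model fits in on-chip memory can be estimated directly from its parameter count and datatype width; a quick sketch using rough, commonly cited parameter counts (figures are approximate):

```python
# Parameter memory footprint = parameter count * bytes per parameter.

def footprint_mb(n_params, bytes_per_param):
    return n_params * bytes_per_param / 1e6

SRAM_MB = 10  # e.g., the 10MB of on-chip SRAM mentioned above

for name, n_params in [("MobileNetV2", 3.4e6), ("ResNet-50", 25.6e6), ("BERT-base", 110e6)]:
    for dtype, nbytes in [("FP32", 4), ("INT8", 1)]:
        mb = footprint_mb(n_params, nbytes)
        fits = "fits in" if mb <= SRAM_MB else "exceeds"
        print(f"{name} @ {dtype}: {mb:8.1f} MB ({fits} {SRAM_MB} MB SRAM)")
```

Under this estimate, only small quantized models (e.g., MobileNetV2 at INT8, about 3.4 MB) stay within a 10 MB SRAM budget, which is why DSPs target small to mid-sized edge models.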
Large deep learning models with hundreds of megabytes of parameters would exceed on-chip SRAM capacity. DSPs are best suited for small to mid-sized models targeted for edge devices.\n\n\nProgramming Complexity\nEfficient programming of DSP architectures requires expertise in parallel programming and optimizing data access patterns. Their specialized microarchitectures have a steeper learning curve than high-level software frameworks, making development more complex.\n\n\n\n\n10.3.4 Graphics Processing Units (GPUs)\nThe term graphics processing unit has existed since at least the 1980s. There had always been a demand for graphics hardware in video game consoles (high demand, needed to be relatively low cost) and scientific simulations (lower demand, but higher resolution, could be at a high price point).\nThe term was popularized, however, in 1999 when NVIDIA launched the GeForce 256, mainly targeting the PC games market sector (Lindholm et al. 2008). As PC games became more sophisticated, NVIDIA GPUs became more programmable. Soon, users realized they could take advantage of this programmability, run various non-graphics-related workloads on GPUs, and benefit from the underlying architecture. And so, in the late 2000s, GPUs became general-purpose graphics processing units or GP-GPUs.\n\nLindholm, Erik, John Nickolls, Stuart Oberman, and John Montrym. 2008. “NVIDIA Tesla: A Unified Graphics and Computing Architecture.” IEEE Micro 28 (2): 39–55. https://doi.org/10.1109/mm.2008.31.\nIntel (with Arc Graphics) and AMD (with Radeon RX) have also developed their own GPU lines over time.\n\nAdvantages\n\nHigh Computational Throughput\nThe key advantage of GPUs is their ability to perform massively parallel floating-point calculations optimized for computer graphics and linear algebra (Raina, Madhavan, and Ng 2009). 
Modern GPUs like Nvidia’s A100 offer up to 19.5 teraflops of FP32 performance with 6912 CUDA cores and 40GB of graphics memory tightly coupled with 1.6TB/s of graphics memory bandwidth.\n\nRaina, Rajat, Anand Madhavan, and Andrew Y. Ng. 2009. “Large-Scale Deep Unsupervised Learning Using Graphics Processors.” In Proceedings of the 26th Annual International Conference on Machine Learning, edited by Andrea Pohoreckyj Danyluk, Léon Bottou, and Michael L. Littman, 382:873–80. ACM International Conference Proceeding Series. ACM. https://doi.org/10.1145/1553374.1553486.\nThis raw throughput stems from the highly parallel streaming multiprocessor (SM) architecture tailored for data-parallel workloads (Zhihao Jia, Zaharia, and Aiken 2019). Each SM contains hundreds of scalar cores optimized for float32/64 math. With over a hundred SMs on a chip, GPUs are purpose-built for matrix multiplication and vector operations used throughout neural networks.\nFor example, Nvidia’s latest H100 GPU provides 4000 TFLOPs of FP8, 2000 TFLOPs of FP16, 1000 TFLOPs of TF32, 67 TFLOPs of FP32 and 34 TFLOPs of FP64 compute performance, which can dramatically accelerate large batch training on models like BERT, GPT-3, and other transformer architectures. The scalable parallelism of GPUs is key to speeding up computationally intensive deep learning.\n\n\nMature Software Ecosystem\nNvidia provides extensive runtime libraries like cuDNN and cuBLAS that are highly optimized for deep learning primitives. Frameworks like TensorFlow and PyTorch integrate with these libraries to enable GPU acceleration without direct programming. CUDA provides lower-level control for custom computations.\nThis ecosystem enables quick leveraging of GPUs via high-level Python without GPU programming expertise. Known workflows and abstractions provide a convenient on-ramp for scaling up deep learning experiments. 
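This "on-ramp" is visible in how little code GPU acceleration requires in practice; a minimal PyTorch sketch (standard framework usage only — cuDNN/cuBLAS are invoked transparently underneath, and the same code falls back to CPU if no GPU is present):

```python
import torch

# Select a GPU if one is present; the same code runs on CPU otherwise.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A toy "layer": matrix multiply plus bias, the core GPU-friendly operation.
x = torch.randn(64, 512, device=device)   # batch of 64 activation vectors
w = torch.randn(512, 256, device=device)  # weight matrix
b = torch.randn(256, device=device)       # bias

y = torch.relu(x @ w + b)                 # dispatched to cuBLAS/cuDNN on GPU
print(y.shape, y.device)
```

No kernel programming is required; moving tensors to a device is the only GPU-specific step, which is exactly the convenience the ecosystem argument above describes.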
The software maturity supplements the throughput advantages.\n\n\nBroad Availability\nThe economies of scale of graphics processing make GPUs broadly accessible in data centers, cloud platforms like AWS and GCP, and desktop workstations. Their availability in research environments has provided a convenient ML experimentation and innovation platform. For example, nearly every state-of-the-art deep learning result has involved GPU acceleration because of this ubiquity. The broad access supplements the software maturity to make GPUs the standard ML accelerator.\n\n\nProgrammable Architecture\nWhile not as flexible as FPGAs, GPUs provide programmability via CUDA and shader languages to customize computations. Developers can optimize data access patterns, create new ops, and tune precisions for evolving models and algorithms.\n\n\n\nDisadvantages\nWhile GPUs have become the standard accelerator for deep learning, their architecture has some key downsides.\n\nLess Efficient than Custom ASICs\nThe statement “GPUs are less efficient than ASICs” could spark intense debate within the ML/AI field and cause this book to explode.\nTypically, GPUs are perceived as less efficient than ASICs because the latter are custom-built for specific tasks and thus can operate more efficiently by design. With their general-purpose architecture, GPUs are inherently more versatile and programmable, catering to a broad spectrum of computational tasks beyond ML/AI.\nHowever, modern GPUs have evolved to include specialized hardware support for essential AI operations, such as generalized matrix multiplication (GEMM) and other matrix operations, native support for quantization, and native support for pruning, which are critical for running ML models effectively. 
These enhancements have significantly improved the efficiency of GPUs for AI tasks to the point where they can rival the performance of ASICs for certain applications.\nConsequently, contemporary GPUs are convergent, incorporating specialized ASIC-like capabilities within a flexible, general-purpose processing framework. This adaptability has blurred the lines between the two types of hardware. GPUs offer a strong balance of specialization and programmability that is well-suited to the dynamic needs of ML/AI research and development.\n\n\nHigh Memory Bandwidth Needs\nThe massively parallel architecture requires tremendous memory bandwidth to supply thousands of cores, as shown in Figure 1. For example, the Nvidia A100 GPU requires 1.6TB/sec to fully saturate its compute units. GPUs rely on wide 384-bit memory buses to high-bandwidth GDDR6 RAM, but even the fastest GDDR6 tops out at around 1 TB/sec. This dependence on external DRAM incurs latency and power overheads.\n\n\nProgramming Complexity\nWhile tools like CUDA help, optimally mapping and partitioning ML workloads across the massively parallel GPU architecture remains challenging; achieving both high utilization and memory locality requires low-level tuning (Zhe Jia et al. 2018). Abstractions like TensorFlow can leave performance on the table.\n\nJia, Zhe, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. 2018. “Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking.” ArXiv Preprint. https://arxiv.org/abs/1804.06826.\n\n\nLimited On-Chip Memory\nGPUs have relatively small on-chip memory caches compared to ML models’ large working set requirements during training. They rely on high bandwidth access to external DRAM, which ASICs minimize with large on-chip SRAM.\n\n\nFixed Architecture\nUnlike FPGAs, the fundamental GPU architecture cannot be altered post-manufacture. This constraint limits adapting to novel ML workloads or layers. 
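The bandwidth pressure described under "High Memory Bandwidth Needs" can be quantified with a roofline-style ratio using the A100's public specs (19.5 TFLOP/s FP32, 1.6 TB/s memory bandwidth); the kernel example is illustrative:

```python
# Roofline-style check: does memory bandwidth or compute limit a kernel?
PEAK_FLOPS = 19.5e12   # A100 FP32 peak, FLOP/s
PEAK_BW    = 1.6e12    # A100 memory bandwidth, bytes/s

# Machine balance: FLOPs that must be done per byte moved to stay compute-bound.
balance = PEAK_FLOPS / PEAK_BW
print(f"machine balance: {balance:.1f} FLOPs/byte")

# Example kernel: elementwise add of two FP32 vectors.
# Per output element: 1 FLOP, 12 bytes moved (read 4+4, write 4).
intensity = 1 / 12
attainable = min(PEAK_FLOPS, intensity * PEAK_BW)
print(f"elementwise add attainable: {attainable/1e12:.3f} TFLOP/s (memory-bound)")
```

A kernel must perform roughly 12 FLOPs per byte moved before the A100's compute units, rather than DRAM bandwidth, become the bottleneck; low-intensity operations like elementwise adds reach only a tiny fraction of peak, illustrating why bandwidth is such a pressing constraint.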
The CPU-GPU boundary also creates data movement overheads.\n\n\n\nCase Study\nConsider the recent groundbreaking research conducted by OpenAI (Brown et al. 2020) with their GPT-3 model. GPT-3, a language model with 175 billion parameters, demonstrated unprecedented language understanding and generation capabilities. Its training, which would have taken months on conventional CPUs, was accomplished in a matter of days using powerful GPUs, thus pushing the boundaries of natural language processing (NLP) capabilities.\n\n\n\n10.3.5 Central Processing Units (CPUs)\nThe term CPU has a long history that dates back to 1955 (Weik 1955) while the first microprocessor CPU-the Intel 4004-was invented in 1971 (Who Invented the Microprocessor?). Compilers compile high-level programming languages like Python, Java, or C to assembly instructions (x86, ARM, RISC-V, etc.) for CPUs to process. The set of instructions a CPU understands is called the “instruction set.” It must be agreed upon by both the hardware and software running atop it (See section 5 for a more in-depth description of instruction set architectures-ISAs).\n\nWeik, Martin H. 1955. A Survey of Domestic Electronic Digital Computing Systems. Ballistic Research Laboratories.\nAn overview of significant developments in CPUs:\n\nSingle-core Era (1950s-2000): This era is known for aggressive microarchitectural improvements. Techniques like speculative execution (executing an instruction before the previous one was done), out-of-order execution (re-ordering instructions to be more effective), and wider issue widths (executing multiple instructions at once) were implemented to increase instruction throughput. 
The term “System on Chip” also originated in this era as different analog components (components designed with transistors) and digital components (components designed with hardware description languages that are mapped to transistors) were put on the same platform to achieve some task.\nMulticore Era (2000s): Driven by the slowing of Moore’s Law, this era is marked by scaling the number of cores within a CPU. Now, tasks can be split across many different cores, each with its own datapath and control unit. Many of the issues in this era pertained to how to share certain resources, which resources to share, and how to maintain coherency and consistency across all the cores.\nSea of accelerators (2010s): Again, driven by the slowing of Moore’s Law, this era is marked by offloading more complicated tasks to accelerators (widgets) attached to the main datapath in CPUs. It’s common to see accelerators dedicated to various AI workloads, as well as image/digital processing, and cryptography. In these designs, CPUs are often described more as judges, deciding which tasks should be processed rather than doing the processing themselves. Any task could still be run on the CPU rather than the accelerators, but the CPU would generally be slower. However, the cost of designing and programming the accelerator became a non-trivial hurdle that sparked interest in domain-specific languages (DSLs).\nPresence in data centers: Although we often hear that GPUs dominate the data center market, CPUs are still well suited for tasks that don’t inherently possess a large amount of parallelism. CPUs often handle serial and small tasks and coordinate the data center.\nOn the edge: Given the tighter resource constraints on the edge, edge CPUs often only implement a subset of the techniques developed in the single-core era because these optimizations tend to be heavy on power and area consumption. 
Edge CPUs still maintain a relatively simple datapath with limited memory capacities.\n\nTraditionally, CPUs have been synonymous with general-purpose computing, a term whose meaning has also changed as the “average” workload a consumer runs has changed over time. For example, floating point components were once considered reserved for “scientific computing”; they were usually implemented as a co-processor (a modular component that worked with the datapath) and seldom deployed to average consumers. Compare this attitude to today, where FPUs are built into every datapath.\n\nAdvantages\nWhile raw throughput is limited, general-purpose CPUs provide practical AI acceleration benefits.\n\nGeneral Programmability\nCPUs support diverse workloads beyond ML, providing flexible general-purpose programmability. This versatility comes from their standardized instruction sets and mature compiler ecosystems, which allow running any application, from databases and web servers to analytics pipelines (Hennessy and Patterson 2019).\n\nHennessy, John L., and David A. Patterson. 2019. “A New Golden Age for Computer Architecture.” Commun. ACM 62 (2): 48–60. https://doi.org/10.1145/3282307.\nThis avoids the need for dedicated ML accelerators and enables leveraging existing CPU-based infrastructure for basic ML deployment. For example, x86 servers from vendors like Intel and AMD can run common ML frameworks using Python and TensorFlow packages alongside other enterprise workloads.\n\n\nMature Software Ecosystem\nFor decades, highly optimized math libraries like BLAS, LAPACK, and FFTW have leveraged vectorized instructions and multithreading on CPUs (Dongarra 2009). Major ML frameworks like PyTorch, TensorFlow, and SciKit-Learn are designed to integrate seamlessly with these CPU math kernels.\n\nDongarra, Jack J. 2009. “The Evolution of High Performance Computing on System z.” IBM J. Res. Dev. 
53: 3–4.\nHardware vendors like Intel and AMD also provide low-level libraries to fully optimize performance for deep learning primitives (AI Inference Acceleration on CPUs). This robust, mature software ecosystem allows quickly deploying ML on existing CPU infrastructure.\n\n\nWide Availability\nThe economies of scale of CPU manufacturing, driven by demand across many markets like PCs, servers, and mobile, make them ubiquitously available. Intel CPUs, for example, have powered most servers for decades (Ranganathan 2011). This wide availability in data centers reduces hardware costs for basic ML deployment.\n\nRanganathan, Parthasarathy. 2011. “From Microprocessors to Nanostores: Rethinking Data-Centric Systems.” Computer 44 (1): 39–48. https://doi.org/10.1109/mc.2011.18.\nEven small embedded devices typically integrate some CPU, enabling edge inference. The ubiquity reduces the need to purchase specialized ML accelerators in many situations.\n\n\nLow Power for Inference\nOptimizations like ARM Neon and Intel AVX vector extensions provide power-efficient integer and floating point throughput optimized for “bursty” workloads such as inference (Ignatov et al. 2018). While slower than GPUs, CPU inference can be deployed in power-constrained environments. For example, ARM’s Cortex-M CPUs now deliver over 1 TOPS of INT8 performance under 1W, enabling keyword spotting and vision applications on edge devices (ARM).\n\n\n\nDisadvantages\nWhile providing some advantages, general-purpose CPUs also have limitations for AI workloads.\n\nLower Throughput than Accelerators\nCPUs lack the specialized architectures for massively parallel processing that GPUs and other accelerators provide. Their general-purpose design reduces computational throughput for the highly parallelizable math operations common in ML models (N. P. Jouppi et al. 2017a).\n\nJouppi, Norman P., Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, et al. 2017a. 
“In-Datacenter Performance Analysis of a Tensor Processing Unit.” In Proceedings of the 44th Annual International Symposium on Computer Architecture, 1–12. ISCA ’17. New York, NY, USA: ACM. https://doi.org/10.1145/3079856.3080246.\n\n\nNot Optimized for Data Parallelism\nThe architectures of CPUs are not specifically optimized for data parallel workloads inherent to AI (Sze et al. 2017). They allocate substantial silicon area to instruction decoding, speculative execution, caching, and flow control that provides little benefit for the array operations used in neural networks (AI Inference Acceleration on CPUs). However, modern CPUs are equipped with vector instructions like AVX-512 specifically to accelerate certain key operations like matrix multiplication.\nGPU streaming multiprocessors, for example, devote most transistors to floating point units instead of complex branch prediction logic. This specialization allows much higher utilization for ML math.\n\n\nHigher Memory Latency\nCPUs suffer from higher latency accessing main memory relative to GPUs and other accelerators (DDR). Techniques like tiling and caching can help, but the physical separation from off-chip RAM bottlenecks data-intensive ML workloads. This emphasizes the need for specialized memory architectures in ML hardware.\n\n\nPower Inefficiency Under Heavy Workloads\nWhile suitable for intermittent inference, sustaining near-peak throughput for training results in inefficient power consumption on CPUs, especially mobile CPUs (Ignatov et al. 2018). Accelerators explicitly optimize the data flow, memory, and computation for sustained ML workloads. 
CPUs are energy-inefficient for training large models.\n\n\n\n\n10.3.6 Comparison\nTable 10.2 compares the features of the different hardware types.\n\n\n\nTable 10.2: Comparison of different hardware accelerators for AI workloads.\n\n\n\n\n\n\n\n\n\n\n\nAccelerator\nDescription\nKey Advantages\nKey Disadvantages\n\n\n\n\nASICs\nCustom ICs designed for target workloads like AI inference\n\nMaximizes perf/watt.\nOptimized for tensor ops\nLow latency on-chip memory\n\n\nFixed architecture lacks flexibility\nHigh NRE cost\nLong design cycles\n\n\n\nFPGAs\nReconfigurable fabric with programmable logic and routing\n\nFlexible architecture\nLow latency memory access\n\n\nLower perf/watt than ASICs\nComplex programming\n\n\n\nGPUs\nOriginally for graphics, now used for neural network acceleration\n\nHigh throughput\nParallel scalability\nSoftware ecosystem with CUDA\n\n\nNot as power efficient as ASICs\nRequire high memory bandwidth\n\n\n\nCPUs\nGeneral purpose processors\n\nProgrammability\nUbiquitous availability\n\n\nLower performance for AI workloads\n\n\n\n\n\n\n\nIn general, CPUs provide a readily available baseline, GPUs deliver broadly accessible acceleration, FPGAs offer programmability, and ASICs maximize efficiency for fixed functions. The optimal choice depends on the target application’s scale, cost, flexibility, and other requirements.\nAlthough TPUs were first developed for data center deployment, Google has also put considerable effort into developing Edge TPUs. These Edge TPUs maintain the inspiration from systolic arrays but are tailored to the limited resources accessible at the edge.",
+ "text": "10.3 Accelerator Types\nHardware accelerators can take on many forms. They can exist as a widget (like the Neural Engine in the Apple M1 chip) or as entire chips specially designed to perform certain tasks very well. This section will examine processors for machine learning workloads along the spectrum from highly specialized ASICs to more general-purpose CPUs. We first focus on custom hardware purpose-built for AI to understand the most extreme optimizations possible when design constraints are removed. This establishes a ceiling for performance and efficiency.\nWe then progressively consider more programmable and adaptable architectures, discussing GPUs and FPGAs. These make tradeoffs in customization to maintain flexibility. Finally, we cover general-purpose CPUs that sacrifice optimizations for a particular workload in exchange for versatile programmability across applications.\nBy structuring the analysis along this spectrum, we aim to illustrate the fundamental tradeoffs between utilization, efficiency, programmability, and flexibility in accelerator design. The optimal balance point depends on the constraints and requirements of the target application. This spectrum perspective provides a framework for reasoning about hardware choices for machine learning and the capabilities required at each level of specialization.\nFigure 10.2 illustrates the complex interplay between flexibility, performance, functional diversity, and area of architecture design. Notice how the ASIC is on the bottom-right corner, with minimal area, flexibility, and power consumption and maximal performance, due to its highly specialized application-specific nature. 
A key tradeoff is functional diversity vs. performance: general-purpose architectures can serve diverse applications, but their application performance is degraded compared to more customized architectures.\nThe progression begins with the most specialized option, ASICs purpose-built for AI, to ground our understanding in the maximum possible optimizations before expanding to more generalizable architectures. This structured approach aims to elucidate the accelerator design space.\n\n\n\n\n\n\nFigure 10.2: Design tradeoffs. Source: El-Rayis (2014).\n\n\nEl-Rayis, A. O. 2014. “Reconfigurable Architectures for the Next Generation of Mobile Device Telecommunications Systems.” https://www.researchgate.net/publication/292608967.\n\n\n\n10.3.1 Application-Specific Integrated Circuits (ASICs)\nAn Application-Specific Integrated Circuit (ASIC) is a type of integrated circuit (IC) that is custom-designed for a specific application or workload rather than for general-purpose use. Unlike CPUs and GPUs, ASICs do not support multiple applications or workloads. Rather, they are optimized to perform a single task extremely efficiently. The Google TPU is an example of an ASIC.\nASICs achieve this efficiency by tailoring every aspect of the chip design - the underlying logic gates, electronic components, architecture, memory, I/O, and manufacturing process - specifically for the target application. This level of customization allows removing any logic or functionality required only for general computation. The result is an IC that maximizes performance and power efficiency on the desired workload. 
The efficiency gains from application-specific hardware are so substantial that even software-centric firms dedicate enormous engineering resources to designing customized ASICs.\nThe rise of more complex machine learning algorithms has made the performance advantages enabled by tailored hardware acceleration a key competitive differentiator, even for companies traditionally concentrated on software engineering. ASICs have become a high-priority investment for major cloud providers aiming to offer faster AI computation.\n\nAdvantages\nDue to their customized nature, ASICs provide significant benefits over general-purpose processors like CPUs and GPUs. The key advantages include the following.\n\nMaximized Performance and Efficiency\nThe most fundamental advantage of ASICs is maximizing performance and power efficiency by customizing the hardware architecture specifically for the target application. Every transistor and design aspect is optimized for the desired workload - no unnecessary logic or overhead is needed to support generic computation.\nFor example, Google’s Tensor Processing Units (TPUs) contain architectures tailored exactly for the matrix multiplication operations used in neural networks. To design the TPU ASICs, Google’s engineering teams need to define the chip specifications clearly, write the architecture description using Hardware Description Languages like Verilog, synthesize the design to map it to hardware components, and carefully place-and-route transistors and wires based on the fabrication process design rules. 
This complex design process, known as very-large-scale integration (VLSI), allows them to build an optimized IC for machine learning workloads.\nAs a result, TPU ASICs achieve over an order of magnitude higher efficiency in operations per watt than general-purpose GPUs on ML workloads by maximizing performance and minimizing power consumption through a full-stack custom hardware design.\n\n\nSpecialized On-Chip Memory\nASICs incorporate on-chip SRAM and caches specifically optimized to feed data to the computational units. For example, Apple’s M1 system-on-a-chip contains special low-latency SRAM to accelerate the performance of its Neural Engine machine learning hardware. Large local memory with high bandwidth enables data to be kept close to the processing elements. This provides tremendous speed advantages compared to off-chip DRAM access, which can be up to 100x slower.\nData locality and optimizing memory hierarchy are crucial for high throughput and low power. Table 10.1 shows “Numbers Everyone Should Know,” from Jeff Dean.\n\n\n\nTable 10.1: Latency comparison of operations in computing and networking.\n\n\n\n\n\n\n\n\n\n\nOperation\nLatency\nNotes\n\n\n\n\nL1 cache reference\n0.5 ns\n\n\n\nBranch mispredict\n5 ns\n\n\n\nL2 cache reference\n7 ns\n\n\n\nMutex lock/unlock\n25 ns\n\n\n\nMain memory reference\n100 ns\n\n\n\nCompress 1K bytes with Zippy\n3,000 ns\n3 us\n\n\nSend 1 KB over 1 Gbps network\n10,000 ns\n10 us\n\n\nRead 4 KB randomly from SSD\n150,000 ns\n150 us\n\n\nRead 1 MB sequentially from memory\n250,000 ns\n250 us\n\n\nRound trip within same datacenter\n500,000 ns\n0.5 ms\n\n\nRead 1 MB sequentially from SSD\n1,000,000 ns\n1 ms\n\n\nDisk seek\n10,000,000 ns\n10 ms\n\n\nRead 1 MB sequentially from disk\n20,000,000 ns\n20 ms\n\n\nSend packet CA → Netherlands → CA\n150,000,000 ns\n150 ms\n\n\n\n\n\n\n\n\nCustom Datatypes and Operations\nUnlike general-purpose processors, ASICs can be designed to natively support custom datatypes like INT4 or 
bfloat16, which are widely used in ML models. For instance, Nvidia’s Ampere GPU architecture has dedicated bfloat16 Tensor Cores to accelerate AI workloads. Low-precision datatypes enable higher arithmetic density and performance. ASICs can also directly incorporate non-standard operations common in ML algorithms as primitive operations - for example, natively supporting activation functions like ReLU makes execution more efficient. Please refer to the Efficient Numeric Representations chapter for additional details.\n\n\nHigh Parallelism\nASIC architectures can leverage higher parallelism tuned for the target workload versus general-purpose CPUs or GPUs. More computational units tailored for the application mean more operations execute simultaneously. Highly parallel ASICs achieve tremendous throughput for data parallel workloads like neural network inference.\n\n\nAdvanced Process Nodes\nCutting-edge manufacturing processes allow more transistors to be packed into smaller die areas, increasing density. ASICs designed specifically for high-volume applications can better amortize the costs of cutting-edge process nodes.\n\n\n\nDisadvantages\n\nLong Design Timelines\nThe engineering process of designing and validating an ASIC can take 2-3 years. Synthesizing the architecture using hardware description languages, taping out the chip layout, and fabricating the silicon on advanced process nodes involve long development cycles. For example, to tape out a 7nm chip, teams need to define specifications carefully, write the architecture in HDL, synthesize the logic gates, place components, route all interconnections, and finalize the layout to send for fabrication. 
This very large-scale integration (VLSI) flow means ASIC design and manufacturing can traditionally take 2-5 years.\nThere are a few key reasons why the long design timelines of ASICs, often 2-3 years, can be challenging for machine learning workloads:\n\nML algorithms evolve rapidly: New model architectures, training techniques, and network optimizations are constantly emerging. For example, Transformers became hugely popular in NLP in the last few years. When an ASIC finishes tapeout, the optimal architecture for a workload may have changed.\nDatasets grow quickly: ASICs designed for certain model sizes or datatypes can become undersized relative to demand. For instance, natural language models are scaling exponentially with more data and parameters. A chip designed for BERT might not accommodate GPT-3.\nML applications change frequently: The industry focus shifts between computer vision, speech, NLP, recommender systems, etc. An ASIC optimized for image classification may have less relevance in a few years.\nFaster design cycles with GPUs/FPGAs: Programmable accelerators like GPUs can adapt much more quickly by upgrading software libraries and frameworks. New algorithms can be deployed without hardware changes.\nTime-to-market needs: Getting a competitive edge in ML requires rapidly experimenting with and deploying new ideas. Waiting several years for an ASIC is incompatible with such fast iteration.\n\nThe multi-year timescale of ASIC development is poorly matched to the pace of innovation in ML. Significant engineering efforts are required to extend ASIC lifespan through modular architectures, process scaling, model compression, and other techniques. However, the rapid evolution of ML makes fixed-function hardware challenging.\n\n\nHigh Non-Recurring Engineering Costs\nThe fixed costs of taking an ASIC from design to high-volume manufacturing can be very capital-intensive, often tens of millions of dollars. 
Photomask fabrication for taping out chips in advanced process nodes, packaging, and one-time engineering efforts are expensive. For instance, a 7nm chip tape-out alone could cost millions. The high non-recurring engineering (NRE) investment narrows ASIC viability to high-volume production use cases where the upfront cost can be amortized.\n\n\nComplex Integration and Programming\nASICs require extensive software integration work, including drivers, compilers, OS support, and debugging tools. They also need expertise in electrical and thermal packaging. Additionally, efficiently programming ASIC architectures can involve challenges like workload partitioning and scheduling across many parallel units. The customized nature necessitates significant integration efforts to turn raw hardware into fully operational accelerators.\nWhile ASICs provide massive efficiency gains on target applications by tailoring every aspect of the hardware design to one specific task, their fixed nature results in tradeoffs in flexibility and development costs compared to programmable accelerators, which must be weighed based on the application.\n\n\n\n\n10.3.2 Field-Programmable Gate Arrays (FPGAs)\nFPGAs are programmable integrated circuits that can be reconfigured for different applications. Their customizable nature provides advantages for accelerating AI algorithms compared to fixed ASICs or inflexible GPUs. While Google, Meta, and NVIDIA are considering putting ASICs in data centers, Microsoft deployed FPGAs in its data centers (Putnam et al. 2014) in 2011 to efficiently serve diverse data center workloads.\n\nAdvantages\nFPGAs provide several benefits over GPUs and ASICs for accelerating machine learning workloads.\n\nFlexibility Through Reconfigurable Fabric\nThe key advantage of FPGAs is the ability to reconfigure the underlying fabric to implement custom architectures optimized for different models, unlike fixed-function ASICs. 
For example, quant trading firms use FPGAs to accelerate their algorithms because their trading algorithms change frequently, and the low NRE cost of FPGAs makes this more viable than taping out new ASICs. Figure 10.3 contains a table comparing three different FPGAs.\n\n\n\n\n\n\nFigure 10.3: Comparison of FPGAs. Source: Gwennap (n.d.).\n\n\nGwennap, Linley. n.d. “Certus-NX Innovates General-Purpose FPGAs.”\n\n\nFPGAs comprise basic building blocks - configurable logic blocks, RAM blocks, and interconnects. Vendors provide a base amount of these resources, and engineers program the chips by compiling HDL code into bitstreams that rearrange the fabric into different configurations. This makes FPGAs adaptable as algorithms evolve.\nWhile FPGAs may not achieve the utmost performance and efficiency of workload-specific ASICs, their programmability provides more flexibility as algorithms change. This adaptability makes FPGAs a compelling choice for accelerating evolving machine learning applications. Microsoft has deployed FPGAs in its Azure data centers for machine learning workloads to serve diverse applications instead of ASICs. The programmability enables optimization across changing ML models.\n\n\nCustomized Parallelism and Pipelining\nFPGA architectures can leverage spatial parallelism and pipelining by tailoring the hardware design to mirror the parallelism in ML models. For example, Intel’s HARPv2 FPGA platform splits the layers of an MNIST convolutional network across separate processing elements to maximize throughput. Unique parallel patterns like tree ensemble evaluations are also possible on FPGAs. Deep pipelines with optimized buffering and dataflow can be customized to each model’s structure and datatypes. This level of tailored parallelism and pipelining is not feasible on GPUs.\n\n\nLow Latency On-Chip Memory\nLarge amounts of high-bandwidth on-chip memory enable localized storage for weights and activations. 
For instance, Xilinx Versal FPGAs contain 32MB of low-latency RAM blocks and dual-channel DDR4 interfaces for external memory. Bringing memory physically closer to the compute units reduces access latency. This provides significant speed advantages over GPUs that traverse PCIe or other system buses to reach off-chip GDDR6 memory.\n\n\nNative Support for Low Precision\nA key advantage of FPGAs is the ability to natively implement any bit width for arithmetic units, such as INT4 or bfloat16, used in quantized ML models. For example, Intel’s Stratix 10 NX FPGAs have dedicated INT8 cores that can achieve up to 143 INT8 TOPS at ~1 TOPS/W Intel Stratix 10 NX FPGA. Lower bit widths increase arithmetic density and performance. FPGAs can even support mixed precision or dynamic precision tuning at runtime.\n\n\n\nDisadvantages\n\nLower Peak Throughput than ASICs\nFPGAs cannot match the raw throughput numbers of ASICs customized for a specific model and precision. The overheads of the reconfigurable fabric compared to fixed function hardware result in lower peak performance. For example, the TPU v5e pods allow up to 256 chips to be connected with more than 100 petaOps of INT8 performance, while FPGAs can offer up to 143 INT8 TOPS or 286 INT4 TOPS Intel Stratix 10 NX FPGA.\nThis is because FPGAs comprise basic building blocks—configurable logic blocks, RAM blocks, and interconnects. Vendors provide a set amount of these resources. To program FPGAs, engineers write HDL code and compile it into bitstreams that rearrange the fabric, which has inherent overheads versus an ASIC purpose-built for one computation.\n\n\nProgramming Complexity\nTo optimize FPGA performance, engineers must program the architectures in low-level hardware description languages like Verilog or VHDL. This requires hardware design expertise and longer development cycles than higher-level software frameworks like TensorFlow. 
Maximizing utilization can be challenging despite advances in high-level synthesis from C/C++.\n\n\nReconfiguration Overheads\nChanging FPGA configurations requires reloading a new bitstream, which has considerable latency and storage size costs. For example, partial reconfiguration on Xilinx FPGAs can take 100s of milliseconds. This makes dynamically swapping architectures in real-time infeasible. The bitstream storage also consumes on-chip memory.\n\n\nDiminishing Gains on Advanced Nodes\nWhile smaller process nodes greatly benefit ASICs, they provide fewer advantages for FPGAs. At 7nm and below, effects like process variation, thermal constraints, and aging disproportionately impact FPGA performance. The overheads of the configurable fabric also diminish gains compared to fixed-function ASICs.\n\n\nCase Study\nFPGAs have found widespread application in various fields, including medical imaging, robotics, and finance, where they excel in handling computationally intensive machine learning tasks. In medical imaging, an illustrative example is the application of FPGAs for brain tumor segmentation, a traditionally time-consuming and error-prone process. For instance, Xiong et al. developed a quantized segmentation accelerator, which they retrained using the BraTS19 and BraTS20 datasets. Their work yielded remarkable results, achieving over 5x and 44x performance improvements and 11x and 82x energy efficiency gains compared to GPU and CPU implementations, respectively (Xiong et al. 2021).\n\nXiong, Siyu, Guoqing Wu, Xitian Fan, Xuan Feng, Zhongcheng Huang, Wei Cao, Xuegong Zhou, et al. 2021. “MRI-Based Brain Tumor Segmentation Using FPGA-Accelerated Neural Network.” BMC Bioinf. 22 (1): 421. https://doi.org/10.1186/s12859-021-04347-6.\n\n\n\n\n10.3.3 Digital Signal Processors (DSPs)\nThe first digital signal processor core was built in 1948 by Texas Instruments (The Evolution of Audio DSPs). 
Traditionally, DSPs would have logic to directly access digital/audio data in memory, perform an arithmetic operation (multiply-accumulate, or MAC, was one of the most common operations), and then write the result back to memory. The DSP would include specialized analog components to retrieve digital/audio data.\n\nOnce we entered the smartphone era, DSPs started encompassing more sophisticated tasks. They required Bluetooth, Wi-Fi, and cellular connectivity. Media also became much more complex. Today, it’s rare to have entire chips dedicated to just DSP, but a System on Chip would include DSPs and general-purpose CPUs. For example, Qualcomm’s Hexagon Digital Signal Processor claims to be a “world-class processor with both CPU and DSP functionality to support deeply embedded processing needs of the mobile platform for both multimedia and modem functions.” Google Tensor, the chip in the Google Pixel phones, also includes CPUs and specialized DSP engines.\n\nAdvantages\nDSPs architecturally provide advantages in vector math throughput, low latency memory access, power efficiency, and support for diverse datatypes - making them well-suited for embedded ML acceleration.\n\nOptimized Architecture for Vector Math\nDSPs contain specialized data paths, register files, and instructions optimized specifically for vector math operations commonly used in machine learning models. This includes dot product engines, MAC units, and SIMD capabilities tailored for vector/matrix calculations. For example, the CEVA-XM6 DSP (“Ceva SensPro Fuses AI and Vector DSP”) has 512-bit vector units to accelerate convolutions. This efficiency on vector math workloads is far beyond general CPUs.\n\n\nLow Latency On-Chip Memory\nDSPs integrate large amounts of fast on-chip SRAM memory to hold data locally for processing. Bringing memory physically closer to the computation units reduces access latency. For example, Analog Devices’ SHARC+ DSP contains 10MB of on-chip SRAM. 
This high-bandwidth local memory provides speed advantages for real-time applications.\n\n\nPower Efficiency\nDSPs are engineered to provide high performance per watt on digital signal workloads. Efficient data paths, parallelism, and memory architectures enable trillions of math operations per second within tight mobile power budgets. For example, Qualcomm’s Hexagon DSP can deliver 4 trillion operations per second (TOPS) while consuming minimal watts.\n\n\nSupport for Integer and Floating Point Math\nUnlike GPUs that excel at single or half precision, DSPs can natively support 8/16-bit integer and 32-bit floating point datatypes used across ML models. Some DSPs support dot product acceleration at INT8 precision for quantized neural networks.\n\n\n\nDisadvantages\nDSPs make architectural tradeoffs that limit peak throughput, precision, and model capacity compared to other AI accelerators. However, their advantages in power efficiency and integer math make them a strong edge computing option. So, while DSPs provide some benefits over CPUs, they also come with limitations for machine learning workloads:\n\nLower Peak Throughput than ASICs/GPUs\nDSPs cannot match the raw computational throughput of GPUs or customized ASICs designed specifically for machine learning. For example, Qualcomm’s Cloud AI 100 ASIC delivers 480 TOPS on INT8, while their Hexagon DSP provides 10 TOPS. DSPs lack the massive parallelism of GPU SM units.\n\n\nSlower Double Precision Performance\nMost DSPs are not optimized for the higher precision floating point needed in some ML models. Their dot product engines focus on INT8/16 and FP32, which provide better power efficiency. However, 64-bit floating point throughput is much lower, which can limit usage in models requiring high precision.\n\n\nConstrained Model Capacity\nThe limited on-chip memory of DSPs constrains the model sizes that can be run. 
Large deep learning models with hundreds of megabytes of parameters would exceed on-chip SRAM capacity. DSPs are best suited for small to mid-sized models targeted for edge devices.\n\n\nProgramming Complexity\nEfficient programming of DSP architectures requires expertise in parallel programming and optimizing data access patterns. Their specialized microarchitectures have a steeper learning curve than high-level software frameworks, making development more complex.\n\n\n\n\n10.3.4 Graphics Processing Units (GPUs)\nThe term graphics processing unit has existed since at least the 1980s. There had always been a demand for graphics hardware in video game consoles (high demand, needed to be relatively lower cost) and scientific simulations (lower demand, but higher resolution, could be at a high price point).\nThe term was popularized, however, in 1999 when NVIDIA launched the GeForce 256, mainly targeting the PC games market sector (Lindholm et al. 2008). As PC games became more sophisticated, NVIDIA GPUs became more programmable. Soon, users realized they could take advantage of this programmability, run various non-graphics-related workloads on GPUs, and benefit from the underlying architecture. And so, in the late 2000s, GPUs became general-purpose graphics processing units or GP-GPUs.\n\nLindholm, Erik, John Nickolls, Stuart Oberman, and John Montrym. 2008. “NVIDIA Tesla: A Unified Graphics and Computing Architecture.” IEEE Micro 28 (2): 39–55. https://doi.org/10.1109/mm.2008.31.\nIntel (with Arc Graphics) and AMD (with Radeon RX) have also developed their own GPU lines over time.\n\nAdvantages\n\nHigh Computational Throughput\nThe key advantage of GPUs is their ability to perform massively parallel floating-point calculations optimized for computer graphics and linear algebra (Raina, Madhavan, and Ng 2009). 
Modern GPUs like Nvidia’s A100 offer up to 19.5 teraflops of FP32 performance with 6912 CUDA cores and 40GB of graphics memory tightly coupled with 1.6TB/s of graphics memory bandwidth.\n\nRaina, Rajat, Anand Madhavan, and Andrew Y. Ng. 2009. “Large-Scale Deep Unsupervised Learning Using Graphics Processors.” In Proceedings of the 26th Annual International Conference on Machine Learning, edited by Andrea Pohoreckyj Danyluk, Léon Bottou, and Michael L. Littman, 382:873–80. ACM International Conference Proceeding Series. ACM. https://doi.org/10.1145/1553374.1553486.\nThis raw throughput stems from the highly parallel streaming multiprocessor (SM) architecture tailored for data-parallel workloads (Zhihao Jia, Zaharia, and Aiken 2019). Each SM contains hundreds of scalar cores optimized for float32/64 math. With over a hundred SMs on a chip, totaling thousands of cores, GPUs are purpose-built for matrix multiplication and vector operations used throughout neural networks.\nFor example, Nvidia’s latest H100 GPU provides 4000 TFLOPs of FP8, 2000 TFLOPs of FP16, 1000 TFLOPs of TF32, 67 TFLOPs of FP32 and 34 TFLOPs of FP64 compute performance, which can dramatically accelerate large batch training on models like BERT, GPT-3, and other transformer architectures. The scalable parallelism of GPUs is key to speeding up computationally intensive deep learning.\n\n\nMature Software Ecosystem\nNvidia provides extensive runtime libraries like cuDNN and cuBLAS that are highly optimized for deep learning primitives. Frameworks like TensorFlow and PyTorch integrate with these libraries to enable GPU acceleration without direct programming. CUDA provides lower-level control for custom computations.\nThis ecosystem enables quick leveraging of GPUs via high-level Python without GPU programming expertise. Known workflows and abstractions provide a convenient on-ramp for scaling up deep learning experiments. 
The software maturity supplements the throughput advantages.\n\n\nBroad Availability\nThe economies of scale of graphics processing make GPUs broadly accessible in data centers, cloud platforms like AWS and GCP, and desktop workstations. Their availability in research environments has provided a convenient ML experimentation and innovation platform. For example, nearly every state-of-the-art deep learning result has involved GPU acceleration because of this ubiquity. The broad access supplements the software maturity to make GPUs the standard ML accelerator.\n\n\nProgrammable Architecture\nWhile not as flexible as FPGAs, GPUs provide programmability via CUDA and shader languages to customize computations. Developers can optimize data access patterns, create new ops, and tune precisions for evolving models and algorithms.\n\n\n\nDisadvantages\nWhile GPUs have become the standard accelerator for deep learning, their architecture has some key downsides.\n\nLess Efficient than Custom ASICs\nThe statement “GPUs are less efficient than ASICs” could spark intense debate within the ML/AI field and cause this book to explode.\nTypically, GPUs are perceived as less efficient than ASICs because the latter are custom-built for specific tasks and thus can operate more efficiently by design. With their general-purpose architecture, GPUs are inherently more versatile and programmable, catering to a broad spectrum of computational tasks beyond ML/AI.\nHowever, modern GPUs have evolved to include specialized hardware support for essential AI operations, such as generalized matrix multiplication (GEMM) and other matrix operations, native support for quantization, and native support for pruning, which are critical for running ML models effectively. 
These enhancements have significantly improved the efficiency of GPUs for AI tasks to the point where they can rival the performance of ASICs for certain applications.\nConsequently, contemporary GPUs are convergent, incorporating specialized ASIC-like capabilities within a flexible, general-purpose processing framework. This adaptability has blurred the lines between the two types of hardware. GPUs offer a strong balance of specialization and programmability that is well-suited to the dynamic needs of ML/AI research and development.\n\n\nHigh Memory Bandwidth Needs\nThe massively parallel architecture requires tremendous memory bandwidth to supply thousands of cores, as shown in Figure 1. For example, the Nvidia A100 GPU requires 1.6TB/sec to fully saturate its compute units. Consumer GPUs rely on wide 384-bit memory buses to high-bandwidth GDDR6 RAM, but even the fastest GDDR6 tops out at around 1 TB/sec, which is why data center GPUs like the A100 use stacked HBM2 instead. This dependence on external DRAM incurs latency and power overheads.\n\n\nProgramming Complexity\nWhile tools like CUDA help, optimally mapping and partitioning ML workloads across the massively parallel GPU architecture remains challenging; achieving both high utilization and memory locality requires low-level tuning (Zhe Jia et al. 2018). Abstractions like TensorFlow can leave performance on the table.\n\nJia, Zhe, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. 2018. “Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking.” ArXiv Preprint. https://arxiv.org/abs/1804.06826.\n\n\nLimited On-Chip Memory\nGPUs have relatively small on-chip memory caches compared to ML models’ large working set requirements during training. They rely on high bandwidth access to external DRAM, which ASICs minimize with large on-chip SRAM.\n\n\nFixed Architecture\nUnlike FPGAs, the fundamental GPU architecture cannot be altered post-manufacture. This constraint limits adapting to novel ML workloads or layers. 
The CPU-GPU boundary also creates data movement overheads.\n\n\n\nCase Study\nConsider the groundbreaking research conducted by OpenAI (Brown et al. 2020) with their GPT-3 model. GPT-3, a language model with 175 billion parameters, demonstrated unprecedented language understanding and generation capabilities. Its training, which would have taken months on conventional CPUs, was accomplished in a matter of days using powerful GPUs, thus pushing the boundaries of natural language processing (NLP) capabilities.\n\n\n\n10.3.5 Central Processing Units (CPUs)\nThe term CPU has a long history that dates back to 1955 (Weik 1955), while the first microprocessor CPU, the Intel 4004, was invented in 1971 (Who Invented the Microprocessor?). Compilers compile high-level programming languages like Python, Java, or C to assembly instructions (x86, ARM, RISC-V, etc.) for CPUs to process. The set of instructions a CPU understands is called the “instruction set.” It must be agreed upon by both the hardware and software running atop it (see Section 5 for a more in-depth description of instruction set architectures, or ISAs).\n\nWeik, Martin H. 1955. A Survey of Domestic Electronic Digital Computing Systems. Ballistic Research Laboratories.\nAn overview of significant developments in CPUs:\n\nSingle-core Era (1950s-2000): This era is known for aggressive microarchitectural improvements. Techniques like speculative execution (executing an instruction before the previous one was done), out-of-order execution (re-ordering instructions to be more effective), and wider issue widths (executing multiple instructions at once) were implemented to increase instruction throughput. 
The term “System on Chip” also originated in this era as different analog components (components designed with transistors) and digital components (components designed with hardware description languages that are mapped to transistors) were put on the same platform to achieve some task.\nMulticore Era (2000s): Driven by the slowing of Moore’s Law, this era is marked by scaling the number of cores within a CPU. Now, tasks can be split across many different cores, each with its own datapath and control unit. Many of the issues in this era pertained to how to share certain resources, which resources to share, and how to maintain coherency and consistency across all the cores.\nSea of accelerators (2010s): Again, driven by the slowing of Moore’s Law, this era is marked by offloading more complicated tasks to accelerators (widgets) attached to the main datapath in CPUs. It’s common to see accelerators dedicated to various AI workloads, as well as image/digital processing, and cryptography. In these designs, CPUs are often described more as judges, deciding which tasks should be processed rather than doing the processing themselves. Any task could still be run on the CPU rather than the accelerators, but the CPU would generally be slower. However, the cost of designing and programming the accelerator became a non-trivial hurdle that sparked interest in domain-specific languages (DSLs).\nPresence in data centers: Although we often hear that GPUs dominate the data center market, CPUs are still well suited for tasks that don’t inherently possess a large amount of parallelism. CPUs often handle serial and small tasks and coordinate the data center.\nOn the edge: Given the tighter resource constraints on the edge, edge CPUs often only implement a subset of the techniques developed in the single-core era because these optimizations tend to be heavy on power and area consumption. 
Edge CPUs still maintain a relatively simple datapath with limited memory capacities.\n\nTraditionally, CPUs have been synonymous with general-purpose computing, a term whose meaning has also changed as the “average” workload a consumer would run has changed over time. For example, floating point components were once considered reserved for “scientific computing”; they were usually implemented as a co-processor (a modular component that worked with the datapath) and seldom deployed to average consumers. Compare this attitude to today, where FPUs are built into every datapath.\n\nAdvantages\nWhile raw throughput is limited, general-purpose CPUs provide practical AI acceleration benefits.\n\nGeneral Programmability\nCPUs support diverse workloads beyond ML, providing flexible general-purpose programmability. This versatility comes from their standardized instruction sets and mature compiler ecosystems, which allow running any application, from databases and web servers to analytics pipelines (Hennessy and Patterson 2019).\n\nHennessy, John L., and David A. Patterson. 2019. “A New Golden Age for Computer Architecture.” Commun. ACM 62 (2): 48–60. https://doi.org/10.1145/3282307.\nThis avoids the need for dedicated ML accelerators and enables leveraging existing CPU-based infrastructure for basic ML deployment. For example, X86 servers from vendors like Intel and AMD can run common ML frameworks using Python and TensorFlow packages alongside other enterprise workloads.\n\n\nMature Software Ecosystem\nFor decades, highly optimized math libraries like BLAS, LAPACK, and FFTW have leveraged vectorized instructions and multithreading on CPUs (Dongarra 2009). Major ML frameworks like PyTorch, TensorFlow, and SciKit-Learn are designed to integrate seamlessly with these CPU math kernels.\n\nDongarra, Jack J. 2009. “The Evolution of High Performance Computing on System z.” IBM J. Res. Dev. 
53: 3–4.\nHardware vendors like Intel and AMD also provide low-level libraries to optimize performance for deep learning primitives fully (AI Inference Acceleration on CPUs). This robust, mature software ecosystem allows quickly deploying ML on existing CPU infrastructure.\n\n\nWide Availability\nThe economies of scale of CPU manufacturing, driven by demand across many markets like PCs, servers, and mobile, make them ubiquitously available. Intel CPUs, for example, have powered most servers for decades (Ranganathan 2011). This wide availability in data centers reduces hardware costs for basic ML deployment.\n\nRanganathan, Parthasarathy. 2011. “From Microprocessors to Nanostores: Rethinking Data-Centric Systems.” Computer 44 (1): 39–48. https://doi.org/10.1109/mc.2011.18.\nEven small embedded devices typically integrate some CPU, enabling edge inference. The ubiquity reduces the need to purchase specialized ML accelerators in many situations.\n\n\nLow Power for Inference\nOptimizations like ARM Neon and Intel AVX vector extensions provide power-efficient integer and floating point throughput optimized for “bursty” workloads such as inference (Ignatov et al. 2018). While slower than GPUs, CPU inference can be deployed in power-constrained environments. For example, ARM’s Cortex-M CPUs now deliver over 1 TOPS of INT8 performance under 1W, enabling keyword spotting and vision applications on edge devices (ARM).\n\n\n\nDisadvantages\nWhile providing some advantages, general-purpose CPUs also have limitations for AI workloads.\n\nLower Throughput than Accelerators\nCPUs lack the specialized architectures for massively parallel processing that GPUs and other accelerators provide. Their general-purpose design reduces computational throughput for the highly parallelizable math operations common in ML models (N. P. Jouppi et al. 2017a).\n\nJouppi, Norman P., Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, et al. 2017a. 
“In-Datacenter Performance Analysis of a Tensor Processing Unit.” In Proceedings of the 44th Annual International Symposium on Computer Architecture, 1–12. ISCA ’17. New York, NY, USA: ACM. https://doi.org/10.1145/3079856.3080246.\n\n\nNot Optimized for Data Parallelism\nThe architectures of CPUs are not specifically optimized for data parallel workloads inherent to AI (Sze et al. 2017). They allocate substantial silicon area to instruction decoding, speculative execution, caching, and flow control that provides little benefit for the array operations used in neural networks (AI Inference Acceleration on CPUs). However, modern CPUs are equipped with vector instructions like AVX-512 specifically to accelerate certain key operations like matrix multiplication.\nGPU streaming multiprocessors, for example, devote most transistors to floating point units instead of complex branch prediction logic. This specialization allows much higher utilization for ML math.\n\n\nHigher Memory Latency\nCPUs suffer from higher latency accessing main memory relative to GPUs and other accelerators (DDR). Techniques like tiling and caching can help, but the physical separation from off-chip RAM bottlenecks data-intensive ML workloads. This emphasizes the need for specialized memory architectures in ML hardware.\n\n\nPower Inefficiency Under Heavy Workloads\nWhile suitable for intermittent inference, sustaining near-peak throughput for training results in inefficient power consumption on CPUs, especially mobile CPUs (Ignatov et al. 2018). Accelerators explicitly optimize the data flow, memory, and computation for sustained ML workloads. 
CPUs are energy-inefficient for training large models.\n\n\n\n\n10.3.6 Comparison\nTable 10.2 compares the different types of hardware features.\n\n\n\nTable 10.2: Comparison of different hardware accelerators for AI workloads.\n\n\n\n\n\n\n\n\n\n\n\nAccelerator\nDescription\nKey Advantages\nKey Disadvantages\n\n\n\n\nASICs\nCustom ICs designed for target workloads like AI inference\n\nMaximizes perf/watt\nOptimized for tensor ops\nLow latency on-chip memory\n\n\nFixed architecture lacks flexibility\nHigh NRE cost\nLong design cycles\n\n\n\nFPGAs\nReconfigurable fabric with programmable logic and routing\n\nFlexible architecture\nLow latency memory access\n\n\nLower perf/watt than ASICs\nComplex programming\n\n\n\nGPUs\nOriginally for graphics, now used for neural network acceleration\n\nHigh throughput\nParallel scalability\nSoftware ecosystem with CUDA\n\n\nNot as power efficient as ASICs\nRequire high memory bandwidth\n\n\n\nCPUs\nGeneral purpose processors\n\nProgrammability\nUbiquitous availability\n\n\nLower performance for AI workloads\n\n\n\n\n\n\n\nIn general, CPUs provide a readily available baseline, GPUs deliver broadly accessible acceleration, FPGAs offer programmability, and ASICs maximize efficiency for fixed functions. The optimal choice depends on the target application’s scale, cost, flexibility, and other requirements.\nAlthough TPUs were first developed for data center deployment, Google has also put considerable effort into developing Edge TPUs. These Edge TPUs maintain the inspiration from systolic arrays but are tailored to the limited resources accessible at the edge.",
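The bandwidth and throughput figures in this section can be tied together with a simple roofline-style estimate. The sketch below is illustrative only and not from the text: it plugs in the A100 peak numbers quoted earlier (19.5 TFLOPS FP32, 1.6 TB/s) to classify a matrix multiplication as compute-bound or memory-bound; the function names are our own.

```python
# Roofline-style back-of-envelope sketch (illustrative, not a vendor tool),
# using the A100-class peak figures quoted in the text.

PEAK_FLOPS = 19.5e12   # peak FP32 throughput quoted for the Nvidia A100, FLOPs/sec
PEAK_BW = 1.6e12       # peak memory bandwidth quoted for the A100, bytes/sec


def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int = 4) -> float:
    """Arithmetic intensity (FLOPs per byte) of an (m x k) @ (k x n) matmul,
    counting one read of each operand and one write of the result."""
    flops = 2 * m * n * k
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic


def is_compute_bound(m: int, n: int, k: int) -> bool:
    # Ridge point of the roofline: intensities above PEAK_FLOPS / PEAK_BW
    # (about 12 FLOPs/byte here) are limited by compute, below by memory.
    return matmul_intensity(m, n, k) > PEAK_FLOPS / PEAK_BW


if __name__ == "__main__":
    print(is_compute_bound(4096, 4096, 4096))  # large square matmul: compute-bound
    print(is_compute_bound(1, 4096, 4096))     # matrix-vector product: memory-bound
```

A large square matmul sits far above the ridge point and saturates the arithmetic units, while the matrix-vector products typical of single-example inference fall below it, which is one concrete way to see why memory bandwidth, not raw TOPS, often dominates in that regime.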
"crumbs": [
"Training",
"10 AI Acceleration "
@@ -1379,7 +1379,7 @@
"href": "contents/ops/ops.html#historical-context",
"title": "13 ML Operations",
"section": "13.2 Historical Context",
- "text": "13.2 Historical Context\nMLOps has its roots in DevOps, a set of practices combining software development (Dev) and IT operations (Ops) to shorten the development lifecycle and provide continuous delivery of high-quality software. The parallels between MLOps and DevOps are evident in their focus on automation, collaboration, and continuous improvement. In both cases, the goal is to break down silos between different teams (developers, operations, and, in the case of MLOps, data scientists and ML engineers) and to create a more streamlined and efficient process. It is useful to understand the history of this evolution better to understand MLOps in the context of traditional systems.\n\n13.2.1 DevOps\nThe term “DevOps” was first coined in 2009 by Patrick Debois, a consultant and Agile practitioner. Debois organized the first DevOpsDays conference in Ghent, Belgium, in 2009. The conference brought together development and operations professionals to discuss ways to improve collaboration and automate processes.\nDevOps has its roots in the Agile movement, which began in the early 2000s. Agile provided the foundation for a more collaborative approach to software development and emphasized small, iterative releases. However, Agile primarily focuses on collaboration between development teams. As Agile methodologies became more popular, organizations realized the need to extend this collaboration to operations teams.\nThe siloed nature of development and operations teams often led to inefficiencies, conflicts, and delays in software delivery. This need for better collaboration and integration between these teams led to the DevOps movement. DevOps can be seen as an extension of the Agile principles, including operations teams.\nThe key principles of DevOps include collaboration, automation, continuous integration, delivery, and feedback. DevOps focuses on automating the entire software delivery pipeline, from development to deployment. 
It aims to improve the collaboration between development and operations teams, utilizing tools like Jenkins, Docker, and Kubernetes to streamline the development lifecycle.\nWhile Agile and DevOps share common principles around collaboration and feedback, DevOps specifically targets integrating development and IT operations - expanding Agile beyond just development teams. It introduces practices and tools to automate software delivery and improve the speed and quality of software releases.\n\n\n13.2.2 MLOps\nMLOps, on the other hand, stands for MLOps, and it extends the principles of DevOps to the ML lifecycle. MLOps aims to automate and streamline the end-to-end ML lifecycle, from data preparation and model development to deployment and monitoring. The main focus of MLOps is to facilitate collaboration between data scientists, data engineers, and IT operations and to automate the deployment, monitoring, and management of ML models. Some key factors led to the rise of MLOps.\n\nData drift: Data drift degrades model performance over time, motivating the need for rigorous monitoring and automated retraining procedures provided by MLOps.\nReproducibility: The lack of reproducibility in machine learning experiments motivated MLOps systems to track code, data, and environment variables to enable reproducible ML workflows.\nExplainability: The black box nature and lack of explainability of complex models motivated the need for MLOps capabilities to increase model transparency and explainability.\nMonitoring: The inability to reliably monitor model performance post-deployment highlighted the need for MLOps solutions with robust model performance instrumentation and alerting.\nFriction: The friction in manually retraining and deploying models motivated the need for MLOps systems that automate machine learning deployment pipelines.\nOptimization: The complexity of configuring machine learning infrastructure motivated the need for MLOps platforms with optimized, ready-made 
ML infrastructure.\n\nWhile DevOps and MLOps share the common goal of automating and streamlining processes, their focus and challenges differ. DevOps primarily deals with the challenges of software development and IT operations. In contrast, MLOps deals with the additional complexities of managing ML models, such as data versioning, model versioning, and model monitoring. MLOps also requires stakeholder collaboration, including data scientists, engineers, and IT operations.\nWhile DevOps and MLOps share similarities in their goals and principles, they differ in their focus and challenges. DevOps focuses on improving the collaboration between development and operations teams and automating software delivery. In contrast, MLOps focuses on streamlining and automating the ML lifecycle and facilitating collaboration between data scientists, data engineers, and IT operations.\nTable 13.1 compares and summarizes them side by side.\n\n\n\nTable 13.1: Comparison of DevOps and MLOps.\n\n\n\n\n\n\n\n\n\n\nAspect\nDevOps\nMLOps\n\n\n\n\nObjective\nStreamlining software development and operations processes\nOptimizing the lifecycle of machine learning models\n\n\nMethodology\nContinuous Integration and Continuous Delivery (CI/CD) for software development\nSimilar to CI/CD but focuses on machine learning workflows\n\n\nPrimary Tools\nVersion control (Git), CI/CD tools (Jenkins, Travis CI), Configuration management (Ansible, Puppet)\nData versioning tools, Model training and deployment tools, CI/CD pipelines tailored for ML\n\n\nPrimary Concerns\nCode integration, Testing, Release management, Automation, Infrastructure as code\nData management, Model versioning, Experiment tracking, Model deployment, Scalability of ML workflows\n\n\nTypical Outcomes\nFaster and more reliable software releases, Improved collaboration between development and operations teams\nEfficient management and deployment of machine learning models, Enhanced collaboration between data scientists and 
engineers\n\n\n\n\n\n\nLearn more about ML Lifecycles through a case study featuring speech recognition in Video 13.1.\n\n\n\n\n\n\nVideo 13.1: MLOps",
+ "text": "13.2 Historical Context\nMLOps has its roots in DevOps, a set of practices combining software development (Dev) and IT operations (Ops) to shorten the development lifecycle and provide continuous delivery of high-quality software. The parallels between MLOps and DevOps are evident in their focus on automation, collaboration, and continuous improvement. In both cases, the goal is to break down silos between different teams (developers, operations, and, in the case of MLOps, data scientists and ML engineers) and to create a more streamlined and efficient process. It is useful to review the history of this evolution to better understand MLOps in the context of traditional systems.\n\n13.2.1 DevOps\nThe term “DevOps” was first coined in 2009 by Patrick Debois, a consultant and Agile practitioner. Debois organized the first DevOpsDays conference in Ghent, Belgium, in 2009. The conference brought together development and operations professionals to discuss ways to improve collaboration and automate processes.\nDevOps has its roots in the Agile movement, which began in the early 2000s. Agile provided the foundation for a more collaborative approach to software development and emphasized small, iterative releases. However, Agile primarily focuses on collaboration between development teams. As Agile methodologies became more popular, organizations realized the need to extend this collaboration to operations teams.\nThe siloed nature of development and operations teams often led to inefficiencies, conflicts, and delays in software delivery. This need for better collaboration and integration between these teams led to the DevOps movement. DevOps can be seen as an extension of the Agile principles to include operations teams.\nThe key principles of DevOps include collaboration, automation, continuous integration, delivery, and feedback. DevOps focuses on automating the entire software delivery pipeline, from development to deployment. 
It aims to improve the collaboration between development and operations teams, utilizing tools like Jenkins, Docker, and Kubernetes to streamline the development lifecycle.\nWhile Agile and DevOps share common principles around collaboration and feedback, DevOps specifically targets integrating development and IT operations, expanding Agile beyond just development teams. It introduces practices and tools to automate software delivery and improve the speed and quality of software releases.\n\n\n13.2.2 MLOps\nMLOps, on the other hand, stands for machine learning operations, and it extends the principles of DevOps to the ML lifecycle. MLOps aims to automate and streamline the end-to-end ML lifecycle, from data preparation and model development to deployment and monitoring. The main focus of MLOps is to facilitate collaboration between data scientists, data engineers, and IT operations and to automate the deployment, monitoring, and management of ML models. Some key factors led to the rise of MLOps.\n\nData drift: Data drift degrades model performance over time, motivating the need for rigorous monitoring and automated retraining procedures provided by MLOps.\nReproducibility: The lack of reproducibility in machine learning experiments motivated MLOps systems to track code, data, and environment variables to enable reproducible ML workflows.\nExplainability: The black box nature and lack of explainability of complex models motivated the need for MLOps capabilities to increase model transparency and explainability.\nMonitoring: The inability to reliably monitor model performance post-deployment highlighted the need for MLOps solutions with robust model performance instrumentation and alerting.\nFriction: The friction in manually retraining and deploying models motivated the need for MLOps systems that automate machine learning deployment pipelines.\nOptimization: The complexity of configuring machine learning infrastructure motivated the need for MLOps platforms with optimized, ready-made 
ML infrastructure.\n\nWhile DevOps and MLOps share the common goal of automating and streamlining processes, their focus and challenges differ. DevOps primarily deals with the challenges of software development and IT operations. In contrast, MLOps deals with the additional complexities of managing ML models, such as data versioning, model versioning, and model monitoring. MLOps also requires stakeholder collaboration, including data scientists, engineers, and IT operations.\nWhile DevOps and MLOps share similarities in their goals and principles, they differ in their focus and challenges. DevOps focuses on improving the collaboration between development and operations teams and automating software delivery. In contrast, MLOps focuses on streamlining and automating the ML lifecycle and facilitating collaboration between data scientists, data engineers, and IT operations. Table 13.1 compares and summarizes them side by side.\n\n\n\nTable 13.1: Comparison of DevOps and MLOps.\n\n\n\n\n\n\n\n\n\n\nAspect\nDevOps\nMLOps\n\n\n\n\nObjective\nStreamlining software development and operations processes\nOptimizing the lifecycle of machine learning models\n\n\nMethodology\nContinuous Integration and Continuous Delivery (CI/CD) for software development\nSimilar to CI/CD but focuses on machine learning workflows\n\n\nPrimary Tools\nVersion control (Git), CI/CD tools (Jenkins, Travis CI), Configuration management (Ansible, Puppet)\nData versioning tools, Model training and deployment tools, CI/CD pipelines tailored for ML\n\n\nPrimary Concerns\nCode integration, Testing, Release management, Automation, Infrastructure as code\nData management, Model versioning, Experiment tracking, Model deployment, Scalability of ML workflows\n\n\nTypical Outcomes\nFaster and more reliable software releases, Improved collaboration between development and operations teams\nEfficient management and deployment of machine learning models, Enhanced collaboration between data scientists and 
engineers\n\n\n\n\n\n\nLearn more about ML Lifecycles through a case study featuring speech recognition in Video 13.1.\n\n\n\n\n\n\nVideo 13.1: MLOps",
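The reproducibility factor discussed in this section lends itself to a concrete sketch. The snippet below is a hypothetical illustration, not a real MLOps tool or API: it fingerprints code, data, and hyperparameters together so that any change produces a new experiment identifier, which is the core idea behind the experiment tracking that enables reproducible ML workflows.

```python
# Minimal sketch of experiment fingerprinting (hypothetical helper,
# not a real MLOps library): hash code, data, and config together so
# that two runs share an ID only if every input is identical.

import hashlib
import json


def experiment_id(code: str, data: bytes, params: dict) -> str:
    """Deterministic ID derived from training code, dataset bytes, and config."""
    h = hashlib.sha256()
    h.update(code.encode())
    h.update(data)
    # sort_keys makes the hash independent of dict insertion order
    h.update(json.dumps(params, sort_keys=True).encode())
    return h.hexdigest()[:12]


run_a = experiment_id("def train(): ...", b"rows...", {"lr": 0.01, "epochs": 5})
run_b = experiment_id("def train(): ...", b"rows...", {"epochs": 5, "lr": 0.01})
run_c = experiment_id("def train(): ...", b"rows...", {"lr": 0.02, "epochs": 5})

print(run_a == run_b)  # same inputs (order-independent) -> same ID
print(run_a == run_c)  # changed hyperparameter -> new ID
```

Real MLOps platforms track far more (environment, library versions, hardware), but the same principle applies: version everything that can change the model, not just the code.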
"crumbs": [
"Deployment",
"13 ML Operations "
@@ -1544,7 +1544,7 @@
"href": "contents/privacy_security/privacy_security.html#security-threats-to-ml-hardware",
"title": "14 Security & Privacy",
"section": "14.5 Security Threats to ML Hardware",
- "text": "14.5 Security Threats to ML Hardware\nA systematic examination of security threats to embedded machine learning hardware is essential to comprehensively understanding potential vulnerabilities in ML systems. Initially, hardware vulnerabilities arising from intrinsic design flaws that can be exploited will be explored. This foundational knowledge is crucial for recognizing the origins of hardware weaknesses. Following this, physical attacks will be examined, representing the most direct and overt methods of compromising hardware integrity. Building on this, fault injection attacks will be analyzed, demonstrating how deliberate manipulations can induce system failures.\nAdvancing to side-channel attacks next will show the increasing complexity, as these rely on exploiting indirect information leakages, requiring a nuanced understanding of hardware operations and environmental interactions. Leaky interfaces will show how external communication channels can become vulnerable, leading to accidental data exposures. Counterfeit hardware discussions benefit from prior explorations of hardware integrity and exploitation techniques, as they often compound these issues with additional risks due to their questionable provenance. 
Finally, supply chain risks encompass all concerns above and frame them within the context of the hardware’s journey from production to deployment, highlighting the multifaceted nature of hardware security and the need for vigilance at every stage.\nTable 14.1 overview table summarizing the topics:\n\n\n\nTable 14.1: Threat types on hardware security.\n\n\n\n\n\n\n\n\n\n\nThreat Type\nDescription\nRelevance to Embedded ML Hardware Security\n\n\n\n\nHardware Bugs\nIntrinsic flaws in hardware designs that can compromise system integrity.\nFoundation of hardware vulnerability.\n\n\nPhysical Attacks\nDirect exploitation of hardware through physical access or manipulation.\nBasic and overt threat model.\n\n\nFault-injection Attacks\nInduction of faults to cause errors in hardware operation, leading to potential system crashes.\nSystematic manipulation leading to failure.\n\n\nSide-Channel Attacks\nExploitation of leaked information from hardware operation to extract sensitive data.\nIndirect attack via environmental observation.\n\n\nLeaky Interfaces\nVulnerabilities arising from interfaces that expose data unintentionally.\nData exposure through communication channels.\n\n\nCounterfeit Hardware\nUse of unauthorized hardware components that may have security flaws.\nCompounded vulnerability issues.\n\n\nSupply Chain Risks\nRisks introduced through the hardware lifecycle, from production to deployment.\nCumulative & multifaceted security challenges.\n\n\n\n\n\n\n\n14.5.1 Hardware Bugs\nHardware is not immune to the pervasive issue of design flaws or bugs. Attackers can exploit these vulnerabilities to access, manipulate, or extract sensitive data, breaching the confidentiality and integrity that users and services depend on. An example of such vulnerabilities came to light with the discovery of Meltdown and Spectre—two hardware vulnerabilities that exploit critical vulnerabilities in modern processors. 
These bugs allow attackers to bypass the hardware barrier that separates applications, allowing a malicious program to read the memory of other programs and the operating system.\nMeltdown (Lipp et al. 2018) and Spectre (Kocher et al. 2019) work by taking advantage of optimizations in modern CPUs that allow them to speculatively execute instructions out of order before validity checks have been completed. This reveals data that should be inaccessible, which the attack captures through side channels like caches. The technical complexity demonstrates the difficulty of eliminating vulnerabilities even with extensive validation.\n\nLipp, Moritz, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Anders Fogh, Jann Horn, et al. 2018. “Meltdown: Reading Kernel Memory from User Space.” In 27th USENIX Security Symposium (USENIX Security 18). USENIX Association.\n\nKocher, Paul, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, et al. 2019. “Spectre Attacks: Exploiting Speculative Execution.” In 2019 IEEE Symposium on Security and Privacy (SP). IEEE. https://doi.org/10.1109/sp.2019.00002.\nIf an ML system is processing sensitive data, such as personal user information or proprietary business analytics, Meltdown and Spectre represent a real and present danger to data security. Consider the case of an ML accelerator card designed to speed up machine learning processes, such as the ones we discussed in the AI Hardware chapter. These accelerators work with the CPU to handle complex calculations, often related to data analytics, image recognition, and natural language processing. If such an accelerator card has a vulnerability akin to Meltdown or Spectre, it could leak the data it processes. 
An attacker could exploit this flaw not just to siphon off data but also to gain insights into the ML model’s workings, including potentially reverse-engineering the model itself (thus returning to the issue of model theft).\nA real-world scenario where this could be devastating would be in the healthcare industry. ML systems routinely process highly sensitive patient data to help diagnose, plan treatment, and forecast outcomes. A bug in the system’s hardware could lead to the unauthorized disclosure of personal health information, violating patient privacy and contravening strict regulatory standards like the Health Insurance Portability and Accountability Act (HIPAA).\nThe Meltdown and Spectre vulnerabilities are stark reminders that hardware security is not just about preventing unauthorized physical access but also about ensuring that the hardware’s architecture does not become a conduit for data exposure. Similar hardware design flaws regularly emerge in CPUs, accelerators, memory, buses, and other components. This necessitates ongoing retroactive mitigations and performance trade-offs in deployed systems. Proactive solutions like confidential computing architectures could mitigate entire classes of vulnerabilities through fundamentally more secure hardware design. Thwarting hardware bugs requires rigor at every stage: design, validation, and deployment.\n\n\n14.5.2 Physical Attacks\nPhysical tampering refers to the direct, unauthorized manipulation of physical computing resources to undermine the integrity of machine learning systems. 
It’s a particularly insidious attack because it circumvents traditional cybersecurity measures, which often focus more on software vulnerabilities than hardware threats.\nPhysical tampering can take many forms, from the relatively simple, such as someone inserting a USB device loaded with malicious software into a server, to the highly sophisticated, such as embedding a hardware Trojan during the manufacturing process of a microchip (discussed later in greater detail in the Supply Chain section). ML systems are susceptible to this attack because they rely on the accuracy and integrity of their hardware to process and analyze vast amounts of data correctly.\nConsider an ML-powered drone used for geographical mapping. The drone’s operation relies on a series of onboard systems, including a navigation module that processes inputs from various sensors to determine its path. If an attacker gains physical access to this drone, they could replace the genuine navigation module with a compromised one that includes a backdoor. This manipulated module could then alter the drone’s flight path to conduct surveillance over restricted areas or even smuggle contraband by flying undetected routes.\nAnother example is the physical tampering of biometric scanners used for access control in secure facilities. By introducing a modified sensor that transmits biometric data to an unauthorized receiver, an attacker can access personal identification data to authenticate individuals.\nThere are several ways that physical tampering can occur in ML hardware:\n\nManipulating sensors: Consider an autonomous vehicle equipped with cameras and LiDAR for environmental perception. A malicious actor could deliberately manipulate the physical alignment of these sensors to create occlusion zones or distort distance measurements. 
This could compromise object detection capabilities and potentially endanger vehicle occupants.\nHardware trojans: Malicious circuit modifications can introduce trojans designed to activate upon specific input conditions. For instance, an ML accelerator chip might operate as intended until encountering a predetermined trigger, at which point it behaves erratically.\nTampering with memory: Physically exposing and manipulating memory chips could allow the extraction of encrypted ML model parameters. Fault injection techniques can also corrupt model data to degrade accuracy.\nIntroducing backdoors: Gaining physical access to servers, an adversary could use hardware keyloggers to capture passwords and create backdoor accounts for persistent access. These could then be used to exfiltrate ML training data over time.\nSupply chain attacks: Manipulating third-party hardware components or compromising manufacturing and shipping channels creates systemic vulnerabilities that are difficult to detect and remediate.\n\n\n\n14.5.3 Fault-injection Attacks\nBy intentionally introducing faults into ML hardware, attackers can induce errors in the computational process, leading to incorrect outputs. This manipulation compromises the integrity of ML operations and can serve as a vector for further exploitation, such as system reverse engineering or security protocol bypass. Fault injection involves deliberately disrupting standard computational operations in a system through external interference (Joye and Tunstall 2012). By precisely triggering computational errors, adversaries can alter program execution in ways that degrade reliability or leak sensitive information.\n\nJoye, Marc, and Michael Tunstall. 2012. Fault Analysis in Cryptography. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-29656-7.\n\nBarenghi, Alessandro, Guido M. Bertoni, Luca Breveglieri, Mauro Pellicioli, and Gerardo Pelosi. 2010. 
“Low Voltage Fault Attacks to AES.” In 2010 IEEE International Symposium on Hardware-Oriented Security and Trust (HOST), 7–12. IEEE. https://doi.org/10.1109/hst.2010.5513121.\n\nHutter, Michael, Jorn-Marc Schmidt, and Thomas Plos. 2009. “Contact-Based Fault Injections and Power Analysis on RFID Tags.” In 2009 European Conference on Circuit Theory and Design, 409–12. IEEE. https://doi.org/10.1109/ecctd.2009.5275012.\n\nAmiel, Frederic, Christophe Clavier, and Michael Tunstall. 2006. “Fault Analysis of DPA-Resistant Algorithms.” In International Workshop on Fault Diagnosis and Tolerance in Cryptography, 223–36. Springer.\n\nAgrawal, Dakshi, Selcuk Baktir, Deniz Karakoyunlu, Pankaj Rohatgi, and Berk Sunar. 2007. “Trojan Detection Using IC Fingerprinting.” In 2007 IEEE Symposium on Security and Privacy (SP ’07), 29–45. IEEE. https://doi.org/10.1109/sp.2007.36.\n\nSkorobogatov, Sergei. 2009. “Local Heating Attacks on Flash Memory Devices.” In 2009 IEEE International Workshop on Hardware-Oriented Security and Trust, 1–6. IEEE. https://doi.org/10.1109/hst.2009.5225028.\n\nSkorobogatov, Sergei P., and Ross J. Anderson. 2003. “Optical Fault Induction Attacks.” In Cryptographic Hardware and Embedded Systems-CHES 2002: 4th International Workshop, Redwood Shores, CA, USA, August 13–15, 2002, Revised Papers 4, 2–12. Springer.\nVarious physical tampering techniques can be used for fault injection. Low voltage (Barenghi et al. 2010), power spikes (Hutter, Schmidt, and Plos 2009), clock glitches (Amiel, Clavier, and Tunstall 2006), electromagnetic pulses (Agrawal et al. 2007), temperature increase (S. Skorobogatov 2009), and laser strikes (S. P. Skorobogatov and Anderson 2003) are common hardware attack vectors. 
They are precisely timed to induce faults like flipped bits or skipped instructions during critical operations.\nFor ML systems, consequences include impaired model accuracy, denial of service, extraction of private training data or model parameters, and reverse engineering of model architectures. Attackers could use fault injection to force misclassifications, disrupt autonomous systems, or steal intellectual property.\nFor example, Breier et al. (2018) successfully demonstrated a fault attack on a deep neural network deployed on a microcontroller. They used a laser to heat specific transistors, forcing them to switch states. In one instance, they used this method to attack a ReLU activation function, resulting in the function always outputting a value of 0, regardless of the input. In the assembly code in Figure 14.2, the attack caused the executing program always to skip the jmp end instruction on line 6. This means that HiddenLayerOutput[i] is always set to 0, overwriting any values written to it on lines 4 and 5. As a result, the targeted neurons are rendered inactive, leading to misclassifications.\n\n\n\n\n\n\nFigure 14.2: Fault-injection demonstrated with assembly code. Source: Breier et al. (2018).\n\n\nBreier, Jakub, Xiaolu Hou, Dirmanto Jap, Lei Ma, Shivam Bhasin, and Yang Liu. 2018. “DeepLaser: Practical Fault Attack on Deep Neural Networks.” ArXiv Preprint abs/1806.05859. https://arxiv.org/abs/1806.05859.\n\n\nAn attacker’s strategy could be to infer information about the activation functions using side-channel attacks (discussed next). Then, the attacker could attempt to target multiple activation function computations by randomly injecting faults into the layers as close to the output layer as possible, increasing the likelihood and impact of the attack.\nEmbedded devices are particularly vulnerable due to limited physical hardening and resource constraints that restrict robust runtime defenses. 
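The behavioral effect of this ReLU-skipping fault can be illustrated with a small software simulation. This is only a sketch: the network sizes and weights below are invented, and zeroing the hidden layer merely models the overwritten HiddenLayerOutput values, not the laser injection itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network; the dimensions and weights are arbitrary.
W1 = rng.normal(size=(8, 4))   # hidden-layer weights
W2 = rng.normal(size=(3, 8))   # output-layer weights
x = rng.normal(size=4)         # one input sample

def forward(x, relu_faulted=False):
    pre_activation = W1 @ x
    if relu_faulted:
        # Models the skipped 'jmp end': HiddenLayerOutput[i] is
        # always overwritten with 0, whatever the input was.
        hidden = np.zeros_like(pre_activation)
    else:
        hidden = np.maximum(0.0, pre_activation)  # normal ReLU
    return W2 @ hidden

healthy = forward(x)
faulted = forward(x, relu_faulted=True)

print("healthy logits:", healthy)
print("faulted logits:", faulted)   # all zeros: the neurons are inactive
```

With the hidden layer silenced, every input collapses to the same all-zero output, so any input whose true class differs from that fixed prediction is misclassified.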
Without tamper-resistant packaging, attacker access to system buses and memory enables precise fault strikes. Lightweight embedded ML models also lack redundancy to overcome errors.\nThese attacks can be particularly insidious because they bypass traditional software-based security measures, which often do not account for physical disruptions. Furthermore, because ML systems rely heavily on the accuracy and reliability of their hardware for tasks like pattern recognition, decision-making, and automated responses, any compromise in their operation due to fault injection can have severe and wide-ranging consequences.\nMitigating fault injection risks necessitates a multilayer approach. Physical hardening through tamper-proof enclosures and design obfuscation helps reduce access. Lightweight anomaly detection can identify unusual sensor inputs or erroneous model outputs (Hsiao et al. 2023). Error-correcting memories minimize disruption, while data encryption safeguards information. Emerging model watermarking techniques trace stolen parameters.\n\nHsiao, Yu-Shun, Zishen Wan, Tianyu Jia, Radhika Ghosal, Abdulrahman Mahmoud, Arijit Raychowdhury, David Brooks, Gu-Yeon Wei, and Vijay Janapa Reddi. 2023. “MAVFI: An End-to-End Fault Analysis Framework with Anomaly Detection and Recovery for Micro Aerial Vehicles.” In 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE), 1–6. IEEE. https://doi.org/10.23919/date56975.2023.10137246.\nHowever, balancing robust protections with embedded systems’ tight size and power limits remains challenging. Cryptographic limitations and the lack of secure co-processors on cost-sensitive embedded hardware restrict options. 
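To make the lightweight anomaly detection idea concrete, one simple approach is to baseline a scalar statistic of the model’s output on known-good runs and flag large deviations at runtime. The sketch below is an illustration with invented calibration values and an assumed z-score threshold, not the MAVFI method of Hsiao et al. (2023):

```python
import statistics

def build_baseline(calibration_values):
    # Mean and standard deviation of a scalar output statistic
    # (e.g., the max softmax score) collected on known-good runs.
    return statistics.fmean(calibration_values), statistics.stdev(calibration_values)

def is_anomalous(value, baseline, z_threshold=4.0):
    mean, stdev = baseline
    # A faulted run (e.g., a zeroed activation block) drifts far
    # from the calibration distribution.
    return abs(value - mean) / stdev > z_threshold

baseline = build_baseline([0.91, 0.88, 0.93, 0.90, 0.89])
print(is_anomalous(0.92, baseline))  # False: within the normal range
print(is_anomalous(0.0, baseline))   # True: flag and fall back to a safe state
```

Such a check costs a few arithmetic operations per inference, which is why this style of monitoring fits even tightly constrained embedded hardware.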
Ultimately, fault injection resilience demands a cross-layer perspective spanning electrical, firmware, software, and physical design layers.\n\n\n14.5.4 Side-Channel Attacks\nSide-channel attacks are a category of security breach that depends on information gained from a computer system’s physical implementation. Unlike direct attacks on software or network vulnerabilities, side-channel attacks exploit a system’s hardware characteristics. These attacks can be particularly effective against complex machine learning systems, where large amounts of data are processed, and a high level of security is expected.\nThe fundamental premise of a side-channel attack is that a device’s operation can inadvertently leak information. Such leaks can come from various sources, including the electrical power a device consumes (Kocher, Jaffe, and Jun 1999), the electromagnetic fields it emits (Gandolfi, Mourtel, and Olivier 2001), the time it takes to process certain operations, or even the sounds it produces. Each channel offers an indirect glimpse into the system’s internal processes, revealing information that can compromise security.\n\nKocher, Paul, Joshua Jaffe, and Benjamin Jun. 1999. “Differential Power Analysis.” In Advances in Cryptology-CRYPTO ’99: 19th Annual International Cryptology Conference, Santa Barbara, California, USA, August 15–19, 1999, Proceedings 19, 388–97. Springer.\n\nGandolfi, Karine, Christophe Mourtel, and Francis Olivier. 2001. “Electromagnetic Analysis: Concrete Results.” In Cryptographic Hardware and Embedded Systems-CHES 2001: Third International Workshop, Paris, France, May 14–16, 2001, Proceedings 3, 251–61. Springer.\n\nKocher, Paul, Joshua Jaffe, Benjamin Jun, and Pankaj Rohatgi. 2011. “Introduction to Differential Power Analysis.” Journal of Cryptographic Engineering 1 (1): 5–27. https://doi.org/10.1007/s13389-011-0006-y.\nFor instance, consider a machine learning system performing encrypted transactions. 
Encryption algorithms are supposed to secure data but require computational work to encrypt and decrypt information. An attacker can analyze the power consumption patterns of the device performing encryption to figure out the cryptographic key. With sophisticated statistical methods, small variations in power usage during the encryption process can be correlated with the data being processed, eventually revealing the key. Some differential analysis attack techniques are Differential Power Analysis (DPA) (Kocher et al. 2011), Differential Electromagnetic Analysis (DEMA), and Correlation Power Analysis (CPA).\nFor example, consider an attacker trying to break the AES encryption algorithm using a differential analysis attack. The attacker would first need to collect many power or electromagnetic traces (a trace is a record of consumptions or emissions) of the device while performing AES encryption.\nOnce the attacker has collected sufficient traces, they would then use a statistical technique to identify correlations between the traces and the different values of the plaintext (original, unencrypted text) and ciphertext (encrypted text). These correlations would then be used to infer the value of a bit in the AES key and, eventually, the entire key. Differential analysis attacks are dangerous because they are low-cost, effective, and non-intrusive, allowing attackers to bypass algorithmic and hardware-level security measures. Compromises by these attacks are also hard to detect because they do not physically modify the device or break the encryption algorithm.\nBelow is a simplified visualization of how analyzing the power consumption patterns of the encryption device can help us extract information about the algorithm’s operations and, in turn, the secret data. Say we have a device that takes a 5-byte password as input. 
We will analyze and compare the different voltage patterns that are measured while the encryption device performs operations on the input to authenticate the password.\nFirst, consider the power analysis of the device’s operations after entering a correct password in the first picture in Figure 14.3. The dense blue graph shows the encryption device’s voltage measurements. What matters here is the comparison between the different analysis charts rather than the specific details of what is going on in each scenario.\n\n\n\n\n\n\nFigure 14.3: Power analysis of an encryption device with a correct password. Source: Colin O’Flynn.\n\n\n\nLet’s look at the power analysis chart when we enter an incorrect password in Figure 14.4. The first three bytes of the password are correct. As a result, we can see that the voltage patterns are very similar or identical between the two charts, up to and including the fourth byte. After the device processes the fourth byte, it determines a mismatch between the secret key and the attempted input. We notice a change in the pattern at the transition point between the fourth and fifth bytes: the voltage has gone up (the current has gone down) because the device has stopped processing the rest of the input.\n\n\n\n\n\n\nFigure 14.4: Power analysis of an encryption device with a (partially) wrong password. Source: Colin O’Flynn.\n\n\n\nFigure 14.5 shows the chart for a completely wrong password. After the device finishes processing the first byte, it determines that it is incorrect and stops further processing: the voltage goes up and the current goes down.\n\n\n\n\n\n\nFigure 14.5: Power analysis of an encryption device with a wrong password. Source: Colin O’Flynn.\n\n\n\nThe example above shows how we can infer information about the encryption process and the secret key by analyzing different inputs and trying to ‘eavesdrop’ on the device’s operations on each input byte. 
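The traces above reflect an early-exit comparison: the device stops working as soon as a byte mismatches. The toy model below makes that leak explicit; the 5-byte password is invented, and a count of byte comparisons stands in for the measured voltage:

```python
SECRET = b"abcde"  # hypothetical 5-byte password

def check_password(attempt: bytes) -> tuple[bool, int]:
    """Early-exit check; returns (match, number of bytes processed).

    The work counter models what the power trace reveals: the
    device stops drawing 'processing' current at the first mismatch.
    """
    work = 0
    for a, s in zip(attempt, SECRET):
        work += 1
        if a != s:
            return False, work   # the trace changes here
    return len(attempt) == len(SECRET), work

print(check_password(b"zzzzz"))  # (False, 1): wrong first byte
print(check_password(b"abzzz"))  # (False, 3): first two bytes correct
print(check_password(b"abcde"))  # (True, 5): full match
```

Because the counter, like the trace, reveals how many leading bytes matched, an attacker can recover the password one byte at a time in at most 256 x 5 guesses instead of 256^5. Constant-time comparisons (e.g., Python’s hmac.compare_digest) close this channel by always processing every byte.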
For a more detailed explanation, watch Video 14.3 below.\n\n\n\n\n\n\nVideo 14.3: Power Attack\n\n\n\n\n\n\nAnother example is an ML system for speech recognition, which processes voice commands to perform actions. By measuring the time it takes for the system to respond to commands or the power used during processing, an attacker could infer what commands are being processed and thus learn about the system’s operational patterns. Even more subtle, the sound emitted by a computer’s fan or hard drive could change in response to the workload, which a sensitive microphone could pick up and analyze to determine what kind of operations are being performed.\nIn real-world scenarios, side-channel attacks have been used to extract encryption keys and compromise secure communications. One of the earliest recorded side-channel attacks dates back to the 1960s when British intelligence agency MI5 faced the challenge of deciphering encrypted communications from the Egyptian Embassy in London. Their cipher-breaking attempts were thwarted by the computational limitations of the time until an ingenious observation changed the game.\nMI5 agent Peter Wright proposed using a microphone to capture the subtle acoustic signatures emitted from the embassy’s rotor cipher machine during encryption (Burnet and Thomas 1989). The distinct mechanical clicks of the rotors as operators configured them daily leaked critical information about the initial settings. This simple side channel of sound enabled MI5 to dramatically reduce the complexity of deciphering messages. This early acoustic leak attack highlights that side-channel attacks are not merely a digital age novelty but a continuation of age-old cryptanalytic principles. The notion that where there is a signal, there is an opportunity for interception remains foundational. 
From mechanical clicks to electrical fluctuations and beyond, side channels enable adversaries to extract secrets indirectly through careful signal analysis.\n\nBurnet, David, and Richard Thomas. 1989. “Spycatcher: The Commodification of Truth.” J. Law Soc. 16 (2): 210. https://doi.org/10.2307/1410360.\n\nAsonov, D., and R. Agrawal. 2004. “Keyboard Acoustic Emanations.” In IEEE Symposium on Security and Privacy, 2004. Proceedings, 3–11. IEEE. https://doi.org/10.1109/secpri.2004.1301311.\n\nGnad, Dennis R. E., Fabian Oboril, and Mehdi B. Tahoori. 2017. “Voltage Drop-Based Fault Attacks on FPGAs Using Valid Bitstreams.” In 2017 27th International Conference on Field Programmable Logic and Applications (FPL), 1–7. IEEE. https://doi.org/10.23919/fpl.2017.8056840.\n\nZhao, Mark, and G. Edward Suh. 2018. “FPGA-Based Remote Power Side-Channel Attacks.” In 2018 IEEE Symposium on Security and Privacy (SP), 229–44. IEEE. https://doi.org/10.1109/sp.2018.00049.\nToday, acoustic cryptanalysis has evolved into attacks like keyboard eavesdropping (Asonov and Agrawal 2004). Electrical side channels range from power analysis on cryptographic hardware (Gnad, Oboril, and Tahoori 2017) to voltage fluctuations (Zhao and Suh 2018) on machine learning accelerators. Timing, electromagnetic emission, and even heat footprints can likewise be exploited. New and unexpected side channels often emerge as computing becomes more interconnected and miniaturized.\nJust as MI5’s analog acoustic leak transformed their codebreaking, modern side-channel attacks circumvent traditional boundaries of cyber defense. 
Understanding the creative spirit and historical persistence of side channel exploits is key knowledge for developers and defenders seeking to secure modern machine learning systems comprehensively against digital and physical threats.\n\n\n14.5.5 Leaky Interfaces\nLeaky interfaces in embedded systems are often overlooked backdoors that can become significant security vulnerabilities. While designed for legitimate purposes such as communication, maintenance, or debugging, these interfaces may inadvertently provide attackers with a window through which they can extract sensitive information or inject malicious data.\nAn interface becomes “leaky” when it exposes more information than it should, often due to a lack of stringent access controls or inadequate shielding of the transmitted data. Here are some real-world examples of leaky interface issues causing security problems in IoT and embedded devices:\n\nBaby Monitors: Many WiFi-enabled baby monitors have been found to have unsecured interfaces for remote access. This allowed attackers to access live audio and video feeds from people’s homes, representing a major privacy violation.\nPacemakers: Interface vulnerabilities were discovered in some pacemakers that could allow attackers to manipulate cardiac functions if exploited. This presents a potentially life-threatening scenario.\nSmart Lightbulbs: A researcher found he could access unencrypted data from smart lightbulbs via a debug interface, including WiFi credentials, allowing him to gain access to the connected network (Greengard 2015).\nSmart Cars: If left unsecured, the OBD-II diagnostic port has been shown to provide an attack vector into automotive systems. Researchers could use it to control brakes and other components (Miller and Valasek 2015).\n\n\nGreengard, Samuel. 2015. The Internet of Things. The MIT Press. https://doi.org/10.7551/mitpress/10277.001.0001.\n\nMiller, Charlie, and Chris Valasek. 2015. 
“Remote Exploitation of an Unaltered Passenger Vehicle.” Black Hat USA 2015 (S 91): 1–91.\nWhile the above are not directly connected with ML, consider the example of a smart home system with an embedded ML component that controls home security based on behavior patterns it learns over time. The system includes a maintenance interface accessible via the local network for software updates and system checks. If this interface does not require strong authentication or the data transmitted through it is not encrypted, an attacker on the same network could gain access. They could then eavesdrop on the homeowner’s daily routines or reprogram the security settings by manipulating the firmware.\nSuch leaks are a privacy issue and a potential entry point for more damaging exploits. The exposure of training data, model parameters, or ML outputs from a leak could help adversaries construct adversarial examples or reverse-engineer models. Access through a leaky interface could also be used to alter an embedded device’s firmware, loading it with malicious code that could turn off the device, intercept data, or use it in botnet attacks.\nTo mitigate these risks, a multi-layered approach is necessary, spanning technical controls like authentication, encryption, anomaly detection, policies and processes like interface inventories, access controls, auditing, and secure development practices. Turning off unnecessary interfaces and compartmentalizing risks via a zero-trust model provide additional protection.\nAs designers of embedded ML systems, we should assess interfaces early in development and continually monitor them post-deployment as part of an end-to-end security lifecycle. Understanding and securing interfaces is crucial for ensuring the overall security of embedded ML.\n\n\n14.5.6 Counterfeit Hardware\nML systems are only as reliable as the underlying hardware. 
In an era where hardware components are global commodities, the rise of counterfeit or cloned hardware presents a significant challenge. Counterfeit hardware encompasses any components that are unauthorized reproductions of original parts. Counterfeit components infiltrate ML systems through complex supply chains that stretch across borders and involve numerous stages from manufacture to delivery.\nA single lapse in the supply chain’s integrity can result in the insertion of counterfeit parts designed to closely imitate the functions and appearance of genuine hardware. For instance, a facial recognition system for high-security access control may be compromised if equipped with counterfeit processors. These processors could fail to accurately process and verify biometric data, potentially allowing unauthorized individuals to access restricted areas.\nThe challenge with counterfeit hardware is multifaceted. It undermines the quality and reliability of ML systems, as these components may degrade faster or perform unpredictably due to substandard manufacturing. The security risks are also profound; counterfeit hardware can contain vulnerabilities ripe for exploitation by malicious actors. For example, a cloned network router in an ML data center might include a hidden backdoor, enabling data interception or network intrusion without detection.\nFurthermore, counterfeit hardware poses legal and compliance risks. Companies inadvertently utilizing counterfeit parts in their ML systems may face serious legal repercussions, including fines and sanctions for failing to comply with industry regulations and standards. This is particularly true for sectors where compliance with specific safety and privacy regulations is mandatory, such as healthcare and finance.\nThe issue of counterfeit hardware is exacerbated by economic pressures to reduce costs, which can compel businesses to source from lower-cost suppliers without stringent verification processes. 
This economizing can inadvertently introduce counterfeit parts into otherwise secure systems. Additionally, detecting these counterfeits is inherently difficult since they are created to pass as the original components, often requiring sophisticated equipment and expertise to identify.\nIn ML, where decisions are made in real time and based on complex computations, the consequences of hardware failure are inconvenient and potentially dangerous. Stakeholders in the field of ML need to understand these risks thoroughly. The issues presented by counterfeit hardware necessitate a deep dive into the current challenges facing ML system integrity and emphasize the importance of vigilant, informed management of the hardware life cycle within these advanced systems.\n\n\n14.5.7 Supply Chain Risks\nThe threat of counterfeit hardware is closely tied to broader supply chain vulnerabilities. Globalized, interconnected supply chains create multiple opportunities for compromised components to infiltrate a product’s lifecycle. Supply chains involve numerous entities, from design to manufacturing, assembly, distribution, and integration. A lack of transparency and oversight of each partner makes verifying integrity at every step challenging. Lapses anywhere along the chain can allow the insertion of counterfeit parts.\nFor example, a contracted manufacturer may unknowingly receive and incorporate recycled electronic waste containing dangerous counterfeits. An untrustworthy distributor could smuggle in cloned components. Insider threats at any vendor might deliberately mix counterfeits into legitimate shipments.\nOnce counterfeits enter the supply stream, they move quickly through multiple hands before ending up in ML systems where detection is difficult. 
Advanced counterfeits like refurbished parts or clones with repackaged externals can masquerade as authentic components, passing visual inspection.\nTo identify fakes, thorough technical profiling using micrography, X-ray screening, component forensics, and functional testing is often required. However, such costly analysis is impractical for large-volume procurement.\nStrategies like supply chain audits, screening suppliers, validating component provenance, and adding tamper-evident protections can help mitigate risks. However, given global supply chain security challenges, a zero-trust approach is prudent. Designing ML systems to use redundant checking, fail-safes, and continuous runtime monitoring provides resilience against component compromises.\nRigorous validation of hardware sources coupled with fault-tolerant system architectures offers the most robust defense against the pervasive risks of convoluted, opaque global supply chains.\n\n\n14.5.8 Case Study\nIn 2018, Bloomberg Businessweek published an alarming story that drew much attention in the tech world. The article claimed that Supermicro had secretly planted tiny spy chips on server hardware. Reporters claimed that Chinese state hackers had infiltrated Supermicro’s supply chain to sneak these chips onto motherboards during manufacturing. The chips allegedly gave the hackers backdoor access to servers used by over 30 major companies, including Apple and Amazon.\nIf true, this would allow hackers to spy on private data or even tamper with systems. However, after investigating, Apple and Amazon found no proof that such hacked Supermicro hardware existed. Other experts questioned whether the Bloomberg article was accurate reporting.\nFrom a pedagogical viewpoint, whether the story is completely true is not the main concern. This incident drew attention to the risks of global supply chains for hardware, especially hardware manufactured in China. 
When companies outsource and buy hardware components from vendors worldwide, visibility into the process is limited. In this complex global pipeline, there are concerns that counterfeits or tampered hardware could be slipped in somewhere along the way without tech companies realizing it. Relying too heavily on single manufacturers or distributors also creates risk. For instance, due to the over-reliance on TSMC for semiconductor manufacturing, the U.S. has committed 50 billion dollars to domestic chip manufacturing through the CHIPS Act.\nAs ML moves into more critical systems, verifying hardware integrity from design through production and delivery is crucial. The reported Supermicro backdoor demonstrated that for ML security, we cannot take global supply chains and manufacturing for granted. We must inspect and validate hardware at every link in the chain.",
+ "text": "14.5 Security Threats to ML Hardware\nA systematic examination of security threats to embedded machine learning hardware is essential to comprehensively understanding potential vulnerabilities in ML systems. Initially, hardware vulnerabilities arising from intrinsic design flaws that can be exploited will be explored. This foundational knowledge is crucial for recognizing the origins of hardware weaknesses. Following this, physical attacks will be examined, representing the most direct and overt methods of compromising hardware integrity. Building on this, fault injection attacks will be analyzed, demonstrating how deliberate manipulations can induce system failures.\nAdvancing to side-channel attacks next will show the increasing complexity, as these rely on exploiting indirect information leakages, requiring a nuanced understanding of hardware operations and environmental interactions. Leaky interfaces will show how external communication channels can become vulnerable, leading to accidental data exposures. Counterfeit hardware discussions benefit from prior explorations of hardware integrity and exploitation techniques, as they often compound these issues with additional risks due to their questionable provenance. 
Finally, supply chain risks encompass all concerns above and frame them within the context of the hardware’s journey from production to deployment, highlighting the multifaceted nature of hardware security and the need for vigilance at every stage.\nTable 14.1 provides an overview of these topics:\n\n\n\nTable 14.1: Threat types on hardware security.\n\n\n\n\n\n\n\n\n\n\nThreat Type\nDescription\nRelevance to ML Hardware Security\n\n\n\n\nHardware Bugs\nIntrinsic flaws in hardware designs that can compromise system integrity.\nFoundation of hardware vulnerability.\n\n\nPhysical Attacks\nDirect exploitation of hardware through physical access or manipulation.\nBasic and overt threat model.\n\n\nFault-injection Attacks\nInduction of faults to cause errors in hardware operation, leading to potential system crashes.\nSystematic manipulation leading to failure.\n\n\nSide-Channel Attacks\nExploitation of leaked information from hardware operation to extract sensitive data.\nIndirect attack via environmental observation.\n\n\nLeaky Interfaces\nVulnerabilities arising from interfaces that expose data unintentionally.\nData exposure through communication channels.\n\n\nCounterfeit Hardware\nUse of unauthorized hardware components that may have security flaws.\nCompounded vulnerability issues.\n\n\nSupply Chain Risks\nRisks introduced through the hardware lifecycle, from production to deployment.\nCumulative & multifaceted security challenges.\n\n\n\n\n\n\n\n14.5.1 Hardware Bugs\nHardware is not immune to the pervasive issue of design flaws or bugs. Attackers can exploit these vulnerabilities to access, manipulate, or extract sensitive data, breaching the confidentiality and integrity that users and services depend on. An example of such vulnerabilities came to light with the discovery of Meltdown and Spectre—two hardware vulnerabilities that exploit critical flaws in modern processors. 
These bugs allow attackers to bypass the hardware barrier that separates applications, allowing a malicious program to read the memory of other programs and the operating system.\nMeltdown (Kocher et al. 2019a) and Spectre (Kocher et al. 2019b) work by taking advantage of optimizations in modern CPUs that allow them to speculatively execute instructions out of order before validity checks have been completed. This reveals data that should be inaccessible, which the attack captures through side channels like caches. The technical complexity demonstrates the difficulty of eliminating vulnerabilities even with extensive validation.\n\n———, et al. 2019a. “Spectre Attacks: Exploiting Speculative Execution.” In 2019 IEEE Symposium on Security and Privacy (SP). IEEE. https://doi.org/10.1109/sp.2019.00002.\n\nKocher, Paul, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, et al. 2019b. “Spectre Attacks: Exploiting Speculative Execution.” In 2019 IEEE Symposium on Security and Privacy (SP). IEEE. https://doi.org/10.1109/sp.2019.00002.\nIf an ML system is processing sensitive data, such as personal user information or proprietary business analytics, Meltdown and Spectre represent a real and present danger to data security. Consider the case of an ML accelerator card designed to speed up machine learning processes, such as the ones we discussed in the A.I. Hardware chapter. These accelerators work with the CPU to handle complex calculations, often related to data analytics, image recognition, and natural language processing. If such an accelerator card has a vulnerability akin to Meltdown or Spectre, it could leak the data it processes. 
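The core mechanism behind these bugs, speculative execution that leaves observable traces even when the architectural result is discarded, can be sketched with a toy Python simulation. This is purely illustrative: Python cannot trigger real CPU speculation, and the `cache` set merely stands in for the cache-timing measurements a real attack would use; all names and values here are hypothetical.

```python
# Toy simulation of a Spectre-style leak (illustrative only; real attacks
# exploit CPU speculation and cache timing, which Python cannot express).
secret = b"KEY"           # data the victim should never reveal
public = b"hello"         # in-bounds data the attacker may legally read
memory = public + secret  # the secret sits just past the public buffer

cache = set()  # stands in for which cache lines are currently "hot"

def victim_read(i):
    """Bounds-checked read -- but the 'speculative' body runs anyway,
    modeling a CPU that executes past a mispredicted branch."""
    speculative_value = memory[i]   # executes before the check settles
    cache.add(speculative_value)    # side effect survives in the cache
    if i < len(public):
        return memory[i]            # architectural result
    return None                     # out-of-bounds read is "blocked"

# Attacker: trigger out-of-bounds speculative reads, then probe the cache.
recovered = []
for i in range(len(public), len(memory)):
    cache.clear()                        # "flush" before each attempt
    victim_read(i)                       # returns None, yet leaves a trace
    recovered.append(next(iter(cache)))  # "reload": see which line is hot

print(bytes(recovered))  # b'KEY'
```

In a real Spectre attack, the attacker times accesses to a probe array to discover which cache line the speculative read touched; the set-membership trick above plays that role.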
An attacker could exploit this flaw not just to siphon off data but also to gain insights into the ML model’s workings, including potentially reverse-engineering the model itself (thus going back to the issue of model theft).\nA real-world scenario where this could be devastating would be in the healthcare industry. ML systems routinely process highly sensitive patient data to help diagnose, plan treatment, and forecast outcomes. A bug in the system’s hardware could lead to the unauthorized disclosure of personal health information, violating patient privacy and contravening strict regulatory standards like the Health Insurance Portability and Accountability Act (HIPAA).\nThe Meltdown and Spectre vulnerabilities are stark reminders that hardware security is not just about preventing unauthorized physical access but also about ensuring that the hardware’s architecture does not become a conduit for data exposure. Similar hardware design flaws regularly emerge in CPUs, accelerators, memory, buses, and other components. This necessitates ongoing retroactive mitigations and performance trade-offs in deployed systems. Proactive solutions like confidential computing architectures could mitigate entire classes of vulnerabilities through fundamentally more secure hardware design. Thwarting hardware bugs requires rigor at every stage of design, validation, and deployment.\n\n\n14.5.2 Physical Attacks\nPhysical tampering refers to the direct, unauthorized manipulation of physical computing resources to undermine the integrity of machine learning systems. 
It’s a particularly insidious attack because it circumvents traditional cybersecurity measures, which often focus more on software vulnerabilities than hardware threats.\nPhysical tampering can take many forms, from the relatively simple, such as someone inserting a USB device loaded with malicious software into a server, to the highly sophisticated, such as embedding a hardware Trojan during the manufacturing process of a microchip (discussed later in greater detail in the Supply Chain section). ML systems are susceptible to this attack because they rely on the accuracy and integrity of their hardware to process and analyze vast amounts of data correctly.\nConsider an ML-powered drone used for geographical mapping. The drone’s operation relies on a series of onboard systems, including a navigation module that processes inputs from various sensors to determine its path. If an attacker gains physical access to this drone, they could replace the genuine navigation module with a compromised one that includes a backdoor. This manipulated module could then alter the drone’s flight path to conduct surveillance over restricted areas or even smuggle contraband by flying undetected routes.\nAnother example is the physical tampering of biometric scanners used for access control in secure facilities. By introducing a modified sensor that transmits biometric data to an unauthorized receiver, an attacker can harvest the personal identification data used to authenticate individuals.\nThere are several ways that physical tampering can occur in ML hardware:\n\nManipulating sensors: Consider an autonomous vehicle equipped with cameras and LiDAR for environmental perception. A malicious actor could deliberately manipulate the physical alignment of these sensors to create occlusion zones or distort distance measurements. 
This could compromise object detection capabilities and potentially endanger vehicle occupants.\nHardware trojans: Malicious circuit modifications can introduce trojans designed to activate upon specific input conditions. For instance, an ML accelerator chip might operate as intended until encountering a predetermined trigger, at which point it behaves erratically.\nTampering with memory: Physically exposing and manipulating memory chips could allow the extraction of encrypted ML model parameters. Fault injection techniques can also corrupt model data to degrade accuracy.\nIntroducing backdoors: Gaining physical access to servers, an adversary could use hardware keyloggers to capture passwords and create backdoor accounts for persistent access. These could then be used to exfiltrate ML training data over time.\nSupply chain attacks: Manipulating third-party hardware components or compromising manufacturing and shipping channels creates systemic vulnerabilities that are difficult to detect and remediate.\n\n\n\n14.5.3 Fault-injection Attacks\nBy intentionally introducing faults into ML hardware, attackers can induce errors in the computational process, leading to incorrect outputs. This manipulation compromises the integrity of ML operations and can serve as a vector for further exploitation, such as system reverse engineering or security protocol bypass. Fault injection involves deliberately disrupting standard computational operations in a system through external interference (Joye and Tunstall 2012). By precisely triggering computational errors, adversaries can alter program execution in ways that degrade reliability or leak sensitive information.\n\nJoye, Marc, and Michael Tunstall. 2012. Fault Analysis in Cryptography. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-29656-7.\n\nBarenghi, Alessandro, Guido M. Bertoni, Luca Breveglieri, Mauro Pellicioli, and Gerardo Pelosi. 2010. 
“Low Voltage Fault Attacks to AES.” In 2010 IEEE International Symposium on Hardware-Oriented Security and Trust (HOST), 7–12. IEEE; IEEE. https://doi.org/10.1109/hst.2010.5513121.\n\nHutter, Michael, Jorn-Marc Schmidt, and Thomas Plos. 2009. “Contact-Based Fault Injections and Power Analysis on RFID Tags.” In 2009 European Conference on Circuit Theory and Design, 409–12. IEEE; IEEE. https://doi.org/10.1109/ecctd.2009.5275012.\n\nAmiel, Frederic, Christophe Clavier, and Michael Tunstall. 2006. “Fault Analysis of DPA-Resistant Algorithms.” In International Workshop on Fault Diagnosis and Tolerance in Cryptography, 223–36. Springer.\n\nAgrawal, Dakshi, Selcuk Baktir, Deniz Karakoyunlu, Pankaj Rohatgi, and Berk Sunar. 2007. “Trojan Detection Using IC Fingerprinting.” In 2007 IEEE Symposium on Security and Privacy (SP ’07), 29–45. Springer; IEEE. https://doi.org/10.1109/sp.2007.36.\n\nSkorobogatov, Sergei. 2009. “Local Heating Attacks on Flash Memory Devices.” In 2009 IEEE International Workshop on Hardware-Oriented Security and Trust, 1–6. IEEE; IEEE. https://doi.org/10.1109/hst.2009.5225028.\n\nSkorobogatov, Sergei P, and Ross J Anderson. 2003. “Optical Fault Induction Attacks.” In Cryptographic Hardware and Embedded Systems – CHES 2002: 4th International Workshop, Redwood Shores, CA, USA, August 13–15, 2002, Revised Papers 4, 2–12. Springer.\nVarious physical tampering techniques can be used for fault injection. Low voltage (Barenghi et al. 2010), power spikes (Hutter, Schmidt, and Plos 2009), clock glitches (Amiel, Clavier, and Tunstall 2006), electromagnetic pulses (Agrawal et al. 2007), temperature increase (S. Skorobogatov 2009) and laser strikes (S. P. Skorobogatov and Anderson 2003) are common hardware attack vectors. 
They are precisely timed to induce faults like flipped bits or skipped instructions during critical operations.\nFor ML systems, consequences include impaired model accuracy, denial of service, extraction of private training data or model parameters, and reverse engineering of model architectures. Attackers could use fault injection to force misclassifications, disrupt autonomous systems, or steal intellectual property.\nFor example, in (Breier et al. 2018), the authors successfully injected a fault attack into a deep neural network deployed on a microcontroller. They used a laser to heat specific transistors, forcing them to switch states. In one instance, they used this method to attack a ReLU activation function, resulting in the function always outputting a value of 0, regardless of the input. In the assembly code in Figure 14.2, the attack caused the executing program always to skip the jmp end instruction on line 6. This means that HiddenLayerOutput[i] is always set to 0, overwriting any values written to it on lines 4 and 5. As a result, the targeted neurons are rendered inactive, resulting in misclassifications.\n\n\n\n\n\n\nFigure 14.2: Fault-injection demonstrated with assembly code. Source: Breier et al. (2018).\n\n\nBreier, Jakub, Xiaolu Hou, Dirmanto Jap, Lei Ma, Shivam Bhasin, and Yang Liu. 2018. “Deeplaser: Practical Fault Attack on Deep Neural Networks.” ArXiv Preprint abs/1806.05859. https://arxiv.org/abs/1806.05859.\n\n\nAn attacker’s strategy could be to infer information about the activation functions using side-channel attacks (discussed next). Then, the attacker could attempt to target multiple activation function computations by randomly injecting faults into the layers as close to the output layer as possible, increasing the likelihood and impact of the attack.\nEmbedded devices are particularly vulnerable due to limited physical hardening and resource constraints that restrict robust runtime defenses. 
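The effect of the skipped-jump fault described by Breier et al., a ReLU whose output is stuck at zero, can be modeled in a few lines of Python. The weights and layer sizes below are arbitrary stand-ins, not values from the paper; only the stuck-at-zero behavior mirrors the published attack.

```python
# Software-only sketch of the Breier et al. ReLU fault: the skipped
# "jmp end" leaves HiddenLayerOutput[i] at 0 for every neuron, so the
# layer's inputs no longer influence the result.

W_hidden = [[0.5, -1.2], [0.8, 0.3], [-0.6, 0.9]]  # 3 neurons, 2 inputs
W_out = [[1.0, -0.5, 0.7], [-0.3, 0.6, 0.2]]       # 2 classes, 3 hidden

def forward(x, relu_faulted=False):
    hidden = []
    for w in W_hidden:
        pre = sum(wi * xi for wi, xi in zip(w, x))
        # Healthy ReLU: max(pre, 0). Faulted ReLU: stuck at 0, as if the
        # store of the activation result were always skipped.
        hidden.append(0.0 if relu_faulted else max(pre, 0.0))
    return [sum(wi * hi for wi, hi in zip(w, hidden)) for w in W_out]

x = [1.0, 2.0]
print(forward(x))                     # normal logits depend on the input
print(forward(x, relu_faulted=True))  # [0.0, 0.0]: the input is ignored
```

With the fault active, every input maps to the same output, which is exactly the kind of systematic misclassification the attack produces.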
Without tamper-resistant packaging, attacker access to system buses and memory enables precise fault strikes. Lightweight embedded ML models also lack redundancy to overcome errors.\nThese attacks can be particularly insidious because they bypass traditional software-based security measures, often not accounting for physical disruptions. Furthermore, because ML systems rely heavily on the accuracy and reliability of their hardware for tasks like pattern recognition, decision-making, and automated responses, any compromise in their operation due to fault injection can have severe and wide-ranging consequences.\nMitigating fault injection risks necessitates a multilayer approach. Physical hardening through tamper-proof enclosures and design obfuscation helps reduce access. Lightweight anomaly detection can identify unusual sensor inputs or erroneous model outputs (Hsiao et al. 2023). Error-correcting memories minimize disruption, while data encryption safeguards information. Emerging model watermarking techniques trace stolen parameters.\n\nHsiao, Yu-Shun, Zishen Wan, Tianyu Jia, Radhika Ghosal, Abdulrahman Mahmoud, Arijit Raychowdhury, David Brooks, Gu-Yeon Wei, and Vijay Janapa Reddi. 2023. “MAVFI: An End-to-End Fault Analysis Framework with Anomaly Detection and Recovery for Micro Aerial Vehicles.” In 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE), 1–6. IEEE; IEEE. https://doi.org/10.23919/date56975.2023.10137246.\nHowever, balancing robust protections with embedded systems’ tight size and power limits remains challenging. Limited cryptographic capability and the lack of secure co-processors on cost-sensitive embedded hardware further restrict the options. 
Ultimately, fault injection resilience demands a cross-layer perspective spanning electrical, firmware, software, and physical design layers.\n\n\n14.5.4 Side-Channel Attacks\nSide-channel attacks are a category of security breach that depends on information gained from a computer system’s physical implementation. Unlike direct attacks on software or network vulnerabilities, side-channel attacks exploit a system’s hardware characteristics. These attacks can be particularly effective against complex machine learning systems, where large amounts of data are processed, and a high level of security is expected.\nThe fundamental premise of a side-channel attack is that a device’s operation can inadvertently leak information. Such leaks can come from various sources, including the electrical power a device consumes (Kocher, Jaffe, and Jun 1999), the electromagnetic fields it emits (Gandolfi, Mourtel, and Olivier 2001), the time it takes to process certain operations, or even the sounds it produces. Each channel can indirectly glimpse the system’s internal processes, revealing information that can compromise security.\n\nKocher, Paul, Joshua Jaffe, and Benjamin Jun. 1999. “Differential Power Analysis.” In Advances in Cryptology – CRYPTO ’99: 19th Annual International Cryptology Conference, Santa Barbara, California, USA, August 15–19, 1999, Proceedings 19, 388–97. Springer.\n\nGandolfi, Karine, Christophe Mourtel, and Francis Olivier. 2001. “Electromagnetic Analysis: Concrete Results.” In Cryptographic Hardware and Embedded Systems – CHES 2001: Third International Workshop, Paris, France, May 14–16, 2001, Proceedings 3, 251–61. Springer.\n\nKocher, Paul, Joshua Jaffe, Benjamin Jun, and Pankaj Rohatgi. 2011. “Introduction to Differential Power Analysis.” Journal of Cryptographic Engineering 1 (1): 5–27. https://doi.org/10.1007/s13389-011-0006-y.\nFor instance, consider a machine learning system performing encrypted transactions. 
Encryption algorithms are supposed to secure data but require computational work to encrypt and decrypt information. An attacker can analyze the power consumption patterns of the device performing encryption to figure out the cryptographic key. With sophisticated statistical methods, small variations in power usage during the encryption process can be correlated with the data being processed, eventually revealing the key. Some differential analysis attack techniques are Differential Power Analysis (DPA) (Kocher et al. 2011), Differential Electromagnetic Analysis (DEMA), and Correlation Power Analysis (CPA).\nFor example, consider an attacker trying to break the AES encryption algorithm using a differential analysis attack. The attacker would first need to collect many power or electromagnetic traces (a trace is a record of a device’s power consumption or electromagnetic emissions over time) of the device while performing AES encryption.\nOnce the attacker has collected sufficient traces, they would then use a statistical technique to identify correlations between the traces and the different values of the plaintext (original, unencrypted text) and ciphertext (encrypted text). These correlations would then be used to infer the value of a bit in the AES key and, eventually, the entire key. Differential analysis attacks are dangerous because they are low-cost, effective, and non-intrusive, allowing attackers to bypass algorithmic and hardware-level security measures. Compromises by these attacks are also hard to detect because they do not physically modify the device or break the encryption algorithm.\nBelow is a simplified visualization of how analyzing the power consumption patterns of the encryption device can help us extract information about the algorithm’s operations and, in turn, the secret data. Say we have a device that takes a 5-byte password as input. 
We will analyze and compare the different voltage patterns that are measured while the encryption device performs operations on the input to authenticate the password.\nFirst, consider the power analysis of the device’s operations after entering a correct password in the first picture in Figure 14.3. The dense blue graph outputs the encryption device’s voltage measurement. What matters here is the comparison between the different analysis charts rather than the specific details of what is going on in each scenario.\n\n\n\n\n\n\nFigure 14.3: Power analysis of an encryption device with a correct password. Source: Colin O’Flynn.\n\n\n\nLet’s look at the power analysis chart when we enter an incorrect password in Figure 14.4. The first three bytes of the password are correct. As a result, we can see that the voltage patterns are very similar or identical between the two charts, up to and including the fourth byte. After the device processes the fourth byte, it determines a mismatch between the secret key and the attempted input. We notice a change in the pattern at the transition point between the fourth and fifth bytes: the voltage has gone up (the current has gone down) because the device has stopped processing the rest of the input.\n\n\n\n\n\n\nFigure 14.4: Power analysis of an encryption device with a (partially) wrong password. Source: Colin O’Flynn.\n\n\n\nFigure 14.5 describes another chart of a completely wrong password. After the device finishes processing the first byte, it determines that it is incorrect and stops further processing - the voltage goes up and the current down.\n\n\n\n\n\n\nFigure 14.5: Power analysis of an encryption device with a wrong password. Source: Colin O’Flynn.\n\n\n\nThe example above shows how we can infer information about the encryption process and the secret key by analyzing different inputs and trying to ‘eavesdrop’ on the device’s operations on each input byte. 
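The same byte-by-byte leak can be modeled in software by replacing the power trace with the length of the matching password prefix, which is exactly what the diverging voltage patterns reveal. The 5-byte password and the `leak` function below are hypothetical illustrations, not a real device interface.

```python
import hmac

SECRET = b"APPLE"  # hypothetical 5-byte password held by the device

def leak(guess):
    """Length of the matching prefix -- the point where the power trace
    diverges, which is what the attacker actually observes."""
    n = 0
    for g, s in zip(guess, SECRET):
        if g != s:
            break
        n += 1
    return n

# Recover the password one byte at a time by maximizing the leak:
recovered = b""
for _ in range(len(SECRET)):
    best = max(range(256), key=lambda c: leak(recovered + bytes([c])))
    recovered += bytes([best])
print(recovered)  # b'APPLE'

# Mitigation: a constant-time comparison examines every byte regardless of
# mismatches, so the observable no longer depends on how much is correct.
print(hmac.compare_digest(b"APPLX", SECRET))  # False
```

Note how the attack needs only 256 guesses per byte instead of 256^5 overall; this collapse of the search space is what makes early-exit comparisons so dangerous, and why constant-time primitives like `hmac.compare_digest` exist.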
For a more detailed explanation, watch Video 14.3 below.\n\n\n\n\n\n\nVideo 14.3: Power Attack\n\n\n\n\n\n\nAnother example is an ML system for speech recognition, which processes voice commands to perform actions. By measuring the time it takes for the system to respond to commands or the power used during processing, an attacker could infer what commands are being processed and thus learn about the system’s operational patterns. Even more subtle, the sound emitted by a computer’s fan or hard drive could change in response to the workload, which a sensitive microphone could pick up and analyze to determine what kind of operations are being performed.\nIn real-world scenarios, side-channel attacks have been used to extract encryption keys and compromise secure communications. One of the earliest recorded side-channel attacks dates back to the 1960s when British intelligence agency MI5 faced the challenge of deciphering encrypted communications from the Egyptian Embassy in London. Their cipher-breaking attempts were thwarted by the computational limitations of the time until an ingenious observation changed the game.\nMI5 agent Peter Wright proposed using a microphone to capture the subtle acoustic signatures emitted from the embassy’s rotor cipher machine during encryption (Burnet and Thomas 1989). The distinct mechanical clicks of the rotors as operators configured them daily leaked critical information about the initial settings. This simple side channel of sound enabled MI5 to dramatically reduce the complexity of deciphering messages. This early acoustic leak attack highlights that side-channel attacks are not merely a digital age novelty but a continuation of age-old cryptanalytic principles. The notion that where there is a signal, there is an opportunity for interception remains foundational. 
From mechanical clicks to electrical fluctuations and beyond, side channels enable adversaries to extract secrets indirectly through careful signal analysis.\n\nBurnet, David, and Richard Thomas. 1989. “Spycatcher: The Commodification of Truth.” J. Law Soc. 16 (2): 210. https://doi.org/10.2307/1410360.\n\nAsonov, D., and R. Agrawal. 2004. “Keyboard Acoustic Emanations.” In IEEE Symposium on Security and Privacy, 2004. Proceedings. 2004, 3–11. IEEE; IEEE. https://doi.org/10.1109/secpri.2004.1301311.\n\nGnad, Dennis R. E., Fabian Oboril, and Mehdi B. Tahoori. 2017. “Voltage Drop-Based Fault Attacks on FPGAs Using Valid Bitstreams.” In 2017 27th International Conference on Field Programmable Logic and Applications (FPL), 1–7. IEEE; IEEE. https://doi.org/10.23919/fpl.2017.8056840.\n\nZhao, Mark, and G. Edward Suh. 2018. “FPGA-Based Remote Power Side-Channel Attacks.” In 2018 IEEE Symposium on Security and Privacy (SP), 229–44. IEEE; IEEE. https://doi.org/10.1109/sp.2018.00049.\nToday, acoustic cryptanalysis has evolved into attacks like keyboard eavesdropping (Asonov and Agrawal 2004). Electrical side channels range from power analysis on cryptographic hardware (Gnad, Oboril, and Tahoori 2017) to voltage fluctuations (Zhao and Suh 2018) on machine learning accelerators. Timing, electromagnetic emission, and even heat footprints can likewise be exploited. New and unexpected side channels often emerge as computing becomes more interconnected and miniaturized.\nJust as MI5’s analog acoustic leak transformed their codebreaking, modern side-channel attacks circumvent traditional boundaries of cyber defense. 
Understanding the creative spirit and historical persistence of side channel exploits is key knowledge for developers and defenders seeking to secure modern machine learning systems comprehensively against digital and physical threats.\n\n\n14.5.5 Leaky Interfaces\nLeaky interfaces in embedded systems are often overlooked backdoors that can become significant security vulnerabilities. While designed for legitimate purposes such as communication, maintenance, or debugging, these interfaces may inadvertently provide attackers with a window through which they can extract sensitive information or inject malicious data.\nAn interface becomes “leaky” when it exposes more information than it should, often due to a lack of stringent access controls or inadequate shielding of the transmitted data. Here are some real-world examples of leaky interface issues causing security problems in IoT and embedded devices:\n\nBaby Monitors: Many WiFi-enabled baby monitors have been found to have unsecured interfaces for remote access. This allowed attackers to gain live audio and video feeds from people’s homes, representing a major privacy violation.\nPacemakers: Interface vulnerabilities were discovered in some pacemakers that could allow attackers to manipulate cardiac functions if exploited. This presents a potentially life-threatening scenario.\nSmart Lightbulbs: A researcher found he could access unencrypted data from smart lightbulbs via a debug interface, including WiFi credentials, allowing him to gain access to the connected network (Greengard 2015).\nSmart Cars: If left unsecured, the OBD-II diagnostic port has been shown to provide an attack vector into automotive systems. Researchers could use it to control brakes and other components (Miller and Valasek 2015).\n\n\nGreengard, Samuel. 2015. The Internet of Things. The MIT Press. https://doi.org/10.7551/mitpress/10277.001.0001.\n\nMiller, Charlie, and Chris Valasek. 2015. 
“Remote Exploitation of an Unaltered Passenger Vehicle.” Black Hat USA 2015 (S 91): 1–91.\nWhile the above are not directly connected with ML, consider the example of a smart home system with an embedded ML component that controls home security based on behavior patterns it learns over time. The system includes a maintenance interface accessible via the local network for software updates and system checks. If this interface does not require strong authentication or the data transmitted through it is not encrypted, an attacker on the same network could gain access. They could then eavesdrop on the homeowner’s daily routines or reprogram the security settings by manipulating the firmware.\nSuch leaks are a privacy issue and a potential entry point for more damaging exploits. The exposure of training data, model parameters, or ML outputs from a leak could help adversaries construct adversarial examples or reverse-engineer models. Access through a leaky interface could also be used to alter an embedded device’s firmware, loading it with malicious code that could turn off the device, intercept data, or use it in botnet attacks.\nTo mitigate these risks, a multi-layered approach is necessary, spanning technical controls like authentication, encryption, anomaly detection, policies and processes like interface inventories, access controls, auditing, and secure development practices. Turning off unnecessary interfaces and compartmentalizing risks via a zero-trust model provide additional protection.\nAs designers of embedded ML systems, we should assess interfaces early in development and continually monitor them post-deployment as part of an end-to-end security lifecycle. Understanding and securing interfaces is crucial for ensuring the overall security of embedded ML.\n\n\n14.5.6 Counterfeit Hardware\nML systems are only as reliable as the underlying hardware. 
In an era where hardware components are global commodities, the rise of counterfeit or cloned hardware presents a significant challenge. Counterfeit hardware encompasses any components that are unauthorized reproductions of original parts. Counterfeit components infiltrate ML systems through complex supply chains that stretch across borders and involve numerous stages from manufacture to delivery.\nA single lapse in the supply chain’s integrity can result in the insertion of counterfeit parts designed to closely imitate the functions and appearance of genuine hardware. For instance, a facial recognition system for high-security access control may be compromised if equipped with counterfeit processors. These processors could fail to accurately process and verify biometric data, potentially allowing unauthorized individuals to access restricted areas.\nThe challenge with counterfeit hardware is multifaceted. It undermines the quality and reliability of ML systems, as these components may degrade faster or perform unpredictably due to substandard manufacturing. The security risks are also profound; counterfeit hardware can contain vulnerabilities ripe for exploitation by malicious actors. For example, a cloned network router in an ML data center might include a hidden backdoor, enabling data interception or network intrusion without detection.\nFurthermore, counterfeit hardware poses legal and compliance risks. Companies inadvertently utilizing counterfeit parts in their ML systems may face serious legal repercussions, including fines and sanctions for failing to comply with industry regulations and standards. This is particularly true for sectors where compliance with specific safety and privacy regulations is mandatory, such as healthcare and finance.\nThe issue of counterfeit hardware is exacerbated by economic pressures to reduce costs, which can compel businesses to source from lower-cost suppliers without stringent verification processes. 
This economizing can inadvertently introduce counterfeit parts into otherwise secure systems. Additionally, detecting these counterfeits is inherently difficult since they are created to pass as the original components, often requiring sophisticated equipment and expertise to identify.\nIn ML, where decisions are made in real time and based on complex computations, the consequences of hardware failure are inconvenient and potentially dangerous. Stakeholders in the field of ML need to understand these risks thoroughly. The issues presented by counterfeit hardware necessitate a deep dive into the current challenges facing ML system integrity and emphasize the importance of vigilant, informed management of the hardware life cycle within these advanced systems.\n\n\n14.5.7 Supply Chain Risks\nThe threat of counterfeit hardware is closely tied to broader supply chain vulnerabilities. Globalized, interconnected supply chains create multiple opportunities for compromised components to infiltrate a product’s lifecycle. Supply chains involve numerous entities, from design to manufacturing, assembly, distribution, and integration. A lack of transparency and oversight of each partner makes verifying integrity at every step challenging. Lapses anywhere along the chain can allow the insertion of counterfeit parts.\nFor example, a contracted manufacturer may unknowingly receive and incorporate recycled electronic waste containing dangerous counterfeits. An untrustworthy distributor could smuggle in cloned components. Insider threats at any vendor might deliberately mix counterfeits into legitimate shipments.\nOnce counterfeits enter the supply stream, they move quickly through multiple hands before ending up in ML systems where detection is difficult. 
Advanced counterfeits like refurbished parts or clones with repackaged externals can masquerade as authentic components, passing visual inspection.\nTo identify fakes, thorough technical profiling using micrography, X-ray screening, component forensics, and functional testing is often required. However, such costly analysis is impractical for large-volume procurement.\nStrategies like supply chain audits, screening suppliers, validating component provenance, and adding tamper-evident protections can help mitigate risks. However, given global supply chain security challenges, a zero-trust approach is prudent. Designing ML systems to use redundant checking, fail-safes, and continuous runtime monitoring provides resilience against component compromises.\nRigorous validation of hardware sources coupled with fault-tolerant system architectures offers the most robust defense against the pervasive risks of convoluted, opaque global supply chains.\n\n\n14.5.8 Case Study\nIn 2018, Bloomberg Businessweek published an alarming story that attracted much attention in the tech world. The article claimed that Supermicro server hardware had been secretly implanted with tiny spy chips. Reporters alleged that Chinese state hackers had infiltrated Supermicro’s manufacturing supply chain to sneak these chips onto motherboards during production. The chips allegedly gave the hackers backdoor access to servers used by over 30 major companies, including Apple and Amazon.\nIf true, this would allow hackers to spy on private data or even tamper with systems. However, after investigating, Apple and Amazon found no proof that such hacked Supermicro hardware existed. Other experts questioned whether the Bloomberg article was accurate reporting.\nWhether the story is completely true or not is not our concern from a pedagogical viewpoint. However, this incident drew attention to the risks of global supply chains for hardware, especially hardware manufactured in China. 
When companies outsource and buy hardware components from vendors worldwide, they often have little visibility into the process. In this complex global pipeline, counterfeit or tampered hardware could be slipped in somewhere along the way without tech companies realizing it. Over-reliance on a single manufacturer or distributor also creates risk. For instance, concern over dependence on TSMC for semiconductor manufacturing helped motivate the U.S. CHIPS Act, which allocates over 50 billion dollars to strengthen domestic chip production.\nAs ML moves into more critical systems, verifying hardware integrity from design through production and delivery is crucial. The reported Supermicro backdoor demonstrated that for ML security, we cannot take global supply chains and manufacturing for granted. We must inspect and validate hardware at every link in the chain.",
"crumbs": [
"Advanced Topics",
"14 Security & Privacy "