
Chapter 7: Large Language Models


Background


Large language models have gained increasing prevalence in many aspects of society today, with interactive systems such as OpenAI’s ChatGPT and Google’s Bard allowing users to query and receive answers for a diverse set of general downstream tasks. Although they are highly developed and tailored, the underlying architecture of the models for systems such as GPT and LaMDA is based on a transformer model. In such a model, each word in a given input is represented by one or more tokens, and each input is represented with an embedding, a high-dimensional representation of the input that captures its semantics. Self-attention is then used to assign a weight to each token in an input based on its importance to its surrounding tokens. This mechanism allows transformer models to capture fine-grained relationships in input semantics that are not captured by other machine learning models, including conventional neural networks.


The use of large language models in computer networks is rapidly expanding. One area where these models show promise is network security and performance troubleshooting. In this chapter, we will explore some of these early examples in more detail, as well as discuss some of the practical hurdles to deploying LLMs in production networks.


Large language models typically operate on a vocabulary of words. Since this book is about applications of machine learning to networking, ultimately the models we work with will operate on network data (e.g., packets, elements from network traffic), not words or text in a language. Nonetheless, before we talk about applications of LLMs to networking, it helps to understand the basic design of LLMs and how they operate on text data. We provide this overview through background on two key concepts in LLMs: vectors and transformers.


Vectors


Language models represent each word as a long array of numbers called a word vector. Each word has a corresponding word vector, and each word thus represents a point in a high-dimensional space. This representation allows models to reason about spatial relationships between words. For example, the word vector for “cat” might be close to the word vector for “dog”, since these words are semantically similar. In contrast, the word vector for “cat” might be far from the word vector for “computer”, since these words are semantically different. In the mid-2010s, Google’s word2vec project led to significant advances in the quality of word vectors; specifically, these vectors allowed various semantic relationships, such as analogies, to be captured as spatial relationships.
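To make the analogy idea concrete, the following sketch works through the classic "king - man + woman ≈ queen" arithmetic. The vectors here are tiny, hand-invented values chosen purely for illustration; real embeddings such as word2vec's are learned from large corpora and have hundreds of dimensions.

```python
import math

# Toy word vectors, invented purely for illustration; real embeddings
# (e.g., word2vec's 300-dimensional vectors) are learned from large corpora.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.8, 0.1],
    "woman": [0.1, 0.1, 0.8],
}

def cosine(a, b):
    # Cosine similarity: dot product divided by the vectors' lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

# The classic analogy: king - man + woman should land nearest to queen.
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]
nearest = max((word for word in vectors if word != "king"),
              key=lambda word: cosine(target, vectors[word]))
print(nearest)  # queen
```

The analogy emerges because the "gender" direction (the difference between "man" and "woman") is roughly the same as the difference between "king" and "queen"; vector arithmetic exploits that shared offset.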


While word vectors, and simple arithmetic operations on these vectors, have turned out to be useful for capturing these relationships, they miss another important characteristic: words can change meaning depending on context (e.g., the word “sound” might mean very different things depending on whether we are talking about a proof or a musical performance). Fortunately, word vectors have also been useful as input to more complex large language models that are capable of reasoning about the meaning of words from context. These models can capture the meaning of sentences and paragraphs, and are the basis for many modern machine learning applications. LLMs comprise many layers of transformers, a concept we will discuss next.


Transformers


The fundamental building block of a large language model is the transformer. In large language models, each token is represented as a high-dimensional vector. In GPT-3, for example, each token is represented by a vector of nearly 13,000 dimensions. The model first applies what is referred to as an attention layer to assign weights to each token in the input based on its relationships to the tokens in the rest of the input. In the attention layer, so-called attention heads retrieve information from earlier words in the prompt.


Second, the feed-forward portion of the model then uses the results from the attention layer to predict the next token in a sequence given the previous tokens. This process uses the weights calculated by the self-attention mechanism to compute a weighted average of the token vectors in the input, which is then used to predict the next token in the sequence. The feed-forward layers in some sense represent a database of information that the model has learned from the training data; feed-forward layers effectively encode relationships between tokens as seen elsewhere in the training data.
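The weighted-average step described above can be sketched as a minimal single-head self-attention computation. The projection matrices below are random stand-ins for learned weights, and the dimensions are tiny so the arithmetic is easy to follow; this is an illustrative simplification, not GPT's exact implementation.

```python
import numpy as np

# Minimal single-head self-attention sketch. Dimensions are tiny for
# readability (GPT-3's token vectors have 12,288 dimensions), and the
# weight matrices are random stand-ins for learned parameters.
np.random.seed(0)
d = 4
X = np.random.randn(3, d)          # three token vectors, one per row

Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv   # queries, keys, values

# Attention weights: softmax over scaled query-key dot products,
# so each row of `weights` sums to 1.
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)

# Each output row is a weighted average of the value vectors.
out = weights @ V
print(weights.shape, out.shape)    # (3, 3) (3, 4)
```

The `weights` matrix is exactly the per-token importance assignment described in the attention layer, and `out` is the weighted average that downstream layers use to predict the next token.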


Large language models tend to have many sets of attention and feed-forward layers, resulting in the ability to make fairly complex predictions on text. Of course, network traffic does not have the same form or structure as text, but if packets are treated as tokens, and the sequence of packets is treated as a sequence of tokens, then the same mechanism can be used to predict the next packet in a sequence given the previous packets. This is the basic idea behind the use of large language models in network traffic analysis.
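One way to treat packets as tokens is sketched below: each packet is reduced to a coarse symbol (protocol, direction, bucketed size), and the set of symbols plays the role that a word vocabulary plays in a text model. The field names and bucketing scheme here are hypothetical, chosen only to illustrate the idea.

```python
# Sketch: mapping packets to "tokens" for a sequence model. The packet
# fields and the size-bucketing scheme are invented for illustration.
packets = [
    {"proto": "tcp", "dir": "out", "size": 74},
    {"proto": "tcp", "dir": "in",  "size": 74},
    {"proto": "tls", "dir": "out", "size": 517},
]

def packet_token(pkt, bucket=64):
    # Bucket packet sizes so near-identical packets map to the same token,
    # keeping the vocabulary small.
    size = (pkt["size"] // bucket) * bucket
    return f'{pkt["proto"]}:{pkt["dir"]}:{size}'

# Build a vocabulary and the token-id sequence, just as a text tokenizer
# would for words.
vocab = {}
ids = []
for pkt in packets:
    tok = packet_token(pkt)
    ids.append(vocab.setdefault(tok, len(vocab)))
print(ids)  # [0, 1, 2]
```

Once traffic is in this form, a transformer can be trained to predict the next packet token given the previous ones, exactly as it would predict the next word.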


A key distinction of large language models from other types of machine learning approaches that we have read about in previous chapters is that training them does not rely on explicitly labeled data. Instead, the model is trained on a large corpus of text and learns to predict the next word in a sequence given the previous words. This is, in some sense, another form of unsupervised learning.
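A toy illustration of this self-supervised objective: the "labels" are simply the next words already present in the raw text, so no manual annotation is needed. Real LLMs pursue the same objective at vastly larger scale with neural networks rather than the simple counts used here.

```python
from collections import Counter, defaultdict

# Next-word prediction from raw text: the training signal comes from the
# text itself, with no hand-assigned labels.
corpus = "the cat sat on the mat and the cat ran".split()

# Count which word follows each word in the corpus.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    # Predict the continuation seen most often during "training".
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # cat
```

Even this bigram counter captures the essential point: the supervision is extracted automatically from the sequence structure of the data.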


Transformers tend to work well on problems that (1) can be represented with sequences of structured input; and (2) have input spaces so large that any one feature set cannot sufficiently represent them. In computer networking, several areas, including protocol analysis and traffic analysis, bear these characteristics. In both cases, manual analysis of network traffic can be cumbersome, yet some of the other machine learning models and approaches we have covered in previous chapters can also be ill-suited to certain types of problems. For example, mapping the byte offsets, header fields, and data types for all protocols, and considering all values a field may take, may yield prohibitively large feature spaces. Detecting and mitigating protocol misconfiguration, by contrast, can be well-suited to transformer models: small nuances, interactions, or misinterpretations of protocol settings can lead to complicated corner cases and unexpected behavior that may be challenging to encode in either static rule sets or formal methods approaches.


BERT is a popular transformer-based model that has been successfully extended to a number of domains, with modifications to the underlying vocabulary used during training. At a high level, BERT operates in two phases: pre-training and fine-tuning. In the pre-training phase, BERT is trained on unlabeled input and is evaluated on two downstream tasks to verify its understanding of the input. After pre-training, BERT models may then be fine-tuned with labeled data to perform tasks such as classification (or, in other domains, text generation) that have the same input format.


In recent years, transformer-based models have been applied to large text corpora to perform a variety of tasks, including question answering, text generation, and translation. On the other hand, their utility outside of the context of text—and especially in the context of data that does not constitute English words—remains an active area of exploration.


Large Language Models in Networking


The utility of large language models for practical network management applications is an active area of research. In this section, we explore a particular early-stage example of applying large language models to network traffic: the analysis of network protocols.


Network Protocol Analysis


We will explore a recent example from Chu et al., who explored the use of large language models to detect vulnerable or misconfigured versions of the TLS protocol. In this work, BERT was trained using a dataset of TLS handshakes.


A significant challenge in applying large language models to network data is building a vocabulary and corresponding training set that allow the model to understand TLS handshakes. This step is necessary because existing LLMs are typically trained on text data, with vocabularies based on the English language. In this case, the input to the model is a concatenation of values in the headers of the server_hello and server_hello_done messages, as well as any optional server steps in the TLS handshake. The resulting input was normalized (i.e., to lowercase ASCII characters) and tokenized.
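The preprocessing pipeline described above (concatenate header values, normalize to lowercase, tokenize) can be sketched roughly as follows. The field values here are invented examples; the original work's exact field selection and tokenizer may differ.

```python
# Hedged sketch of the preprocessing step: concatenate header values from
# the server's handshake messages, lowercase them, and split into tokens.
# The values below are illustrative, not taken from a real capture.
server_hello = ["TLSv1.2", "ECDHE-RSA-AES128-GCM-SHA256", "0x0303"]
server_hello_done = ["handshake_done"]

# Normalize (lowercase ASCII) and tokenize on whitespace.
text = " ".join(server_hello + server_hello_done).lower()
tokens = text.split()
print(tokens)
# ['tlsv1.2', 'ecdhe-rsa-aes128-gcm-sha256', '0x0303', 'handshake_done']
```

The resulting token sequences define the model's vocabulary, playing the same role for TLS handshakes that English words play when BERT is trained on text.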


The resulting trained model was evaluated against a set of labeled TLS handshakes, with examples of known misconfigurations coming from the Qualys SSL Server Test website. The model was able to correctly identify TLS misconfigurations with near-perfect accuracy.
