Source: Brex's Prompt Engineering Guide

## What is a Large Language Model (LLM)?

A large language model is a prediction engine: it takes a sequence of words and tries to predict the most likely sequence to come after it¹. It does this by assigning a probability to each likely next sequence and then sampling from those to choose one². The process repeats until some stopping criterion is met.
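To make that loop concrete, here is a toy sketch in Python. The probability table is invented for illustration; a real LLM derives next-token probabilities from billions of learned parameters, but the predict-sample-repeat structure is the same.

```python
import random

# Hypothetical next-token probabilities. A real LLM computes these with a
# trained neural network; this tiny table is invented for illustration.
NEXT_TOKEN_PROBS = {
    "the": {"cat": 0.5, "dog": 0.4, "<end>": 0.1},
    "cat": {"sat": 0.7, "ran": 0.2, "<end>": 0.1},
    "dog": {"ran": 0.6, "sat": 0.3, "<end>": 0.1},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}

def generate(prompt: str, max_tokens: int = 10) -> str:
    tokens = prompt.split()
    for _ in range(max_tokens):
        probs = NEXT_TOKEN_PROBS.get(tokens[-1], {"<end>": 1.0})
        # Sample the next token in proportion to its assigned probability.
        next_token = random.choices(list(probs), weights=list(probs.values()))[0]
        if next_token == "<end>":  # stopping criterion reached
            break
        tokens.append(next_token)
    return " ".join(tokens)

print(generate("the"))  # e.g. "the cat sat"
```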

Large language models learn these probabilities by training on large corpora of text. A consequence of this is that a model will cater to some use cases better than others (e.g. one trained on GitHub data will understand the probabilities of sequences in source code really well). Another consequence is that the model may generate statements that seem plausible but are not actually grounded in reality.

As language models become more accurate at predicting sequences, many surprising abilities emerge.

## A Brief, Incomplete, and Somewhat Incorrect History of Language Models

📌 Skip [to here](https://github.com/brexhq/prompt-engineering#what-is-a-prompt) if you'd like to jump past the history of language models. This section is for the curious-minded, though it may also help you understand the reasoning behind the advice that follows.

### Pre-2000s

Language models have existed for decades, though traditional language models (e.g. n-gram models) suffer from two major deficiencies: an explosion of state space as context grows (the curse of dimensionality) and an inability to handle novel phrases they've never seen (sparsity). Plainly, older language models can generate text that vaguely resembles the statistics of human-generated text, but there is no consistency within the output, and a reader will quickly realize it's all gibberish. N-gram models also don't scale to large values of N, so they are inherently limited.
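To make the sparsity problem concrete, here is a minimal bigram (n = 2) model in Python over an invented toy corpus. Any word pair absent from training gets probability zero, however plausible it may be.

```python
from collections import Counter, defaultdict

# A toy training corpus, invented for illustration.
corpus = "we ate pastries in france . we drank wine in france .".split()

# Count how often each word follows each other word.
counts = defaultdict(Counter)
for prev, curr in zip(corpus, corpus[1:]):
    counts[prev][curr] += 1

def bigram_prob(prev: str, curr: str) -> float:
    """P(curr | prev) estimated purely from observed counts."""
    total = sum(counts[prev].values())
    return counts[prev][curr] / total if total else 0.0

print(bigram_prob("in", "france"))  # 1.0: seen in training
print(bigram_prob("in", "italy"))   # 0.0: never seen, so "impossible"
```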

### Mid-2000s

In 2007, Geoffrey Hinton, famous for popularizing backpropagation in the 1980s, published an important advancement in training neural networks that unlocked much deeper networks. Applying these deeper neural networks to language modeling helped alleviate some of the problems: they represented arbitrary, nuanced concepts in a finite, continuous space, gracefully handling sequences not seen in the training corpus. These simple neural networks learned the probabilities of their training corpus well, but the output would only statistically match the training data and was generally not coherent relative to the input sequence.

### Early 2010s

Although they were first introduced in 1995, Long Short-Term Memory (LSTM) networks found their time to shine in the 2010s. LSTMs allowed models to process arbitrary-length sequences and, importantly, to alter their internal state dynamically as they processed the input, remembering things they had seen earlier. This minor tweak led to remarkable improvements. In 2015, Andrej Karpathy famously wrote about creating a character-level LSTM that performed far better than it had any right to.

LSTMs have seemingly magical abilities but struggle with long-term dependencies. Asked to complete the sentence "In France, we traveled around, ate many pastries, drank lots of wine, ... lots more text ... , but never learned how to speak _______", the model might struggle to predict "French". They also process input one token at a time, so they are inherently sequential and slow to train, and the Nth token only knows about the N − 1 tokens before it.

### Late 2010s

In 2017, Google published a paper, Attention Is All You Need, that introduced Transformer networks and kicked off a massive revolution in natural language processing. Overnight, machines could suddenly do tasks like translating between languages nearly as well as (and sometimes better than) humans. Transformers are highly parallelizable and introduce a mechanism, called "attention", that lets the model efficiently place emphasis on specific parts of the input. Transformers analyze the entire input all at once, in parallel, choosing which parts are most important and influential, so every output token is influenced by every input token.

This makes Transformers efficient to train, and they produce astounding results. A downside is that they have a fixed input and output size, the context window, and computation increases quadratically with the size of this window (in some cases, memory does as well!)³.
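As a rough sketch of the mechanism (not the full architecture, which adds learned projections, multiple heads, and many stacked layers), here is scaled dot-product attention in NumPy. The n × n score matrix is where the quadratic cost in the context window comes from; all sizes below are toy values chosen for illustration.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention over n tokens of dimension d_k."""
    d_k = Q.shape[-1]
    # Every query attends to every key: an (n x n) score matrix, which is
    # why compute (and sometimes memory) grows quadratically with length n.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V  # each output token is a weighted mix of all inputs

n, d = 8, 16  # sequence length and embedding dimension (toy values)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(attention(Q, K, V).shape)  # (8, 16)
```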

Transformers are not the end of the road, but the vast majority of recent improvements in natural language processing have involved them. There is still abundant active research on various ways of implementing and applying them, such as Amazon’s AlexaTM 20B which outperforms GPT-3 in a number of tasks and is an order of magnitude smaller in its number of parameters.

### 2020s

While technically starting in 2018, the theme of the 2020s has been Generative Pre-trained Transformers, more famously known as GPT. One year after the "Attention Is All You Need" paper, OpenAI released Improving Language Understanding by Generative Pre-Training. This paper established that you can train a large language model on a massive set of data without any specific agenda, and then, once the model has learned the general aspects of language, fine-tune it for specific tasks and quickly get state-of-the-art results.

In 2020, OpenAI followed up with their GPT-3 paper, Language Models are Few-Shot Learners, showing that if you scale up GPT-like models by another factor of ~10x, in terms of number of parameters and quantity of training data, you no longer have to fine-tune them for many tasks. The capabilities emerge naturally, and you get state-of-the-art results via text interaction with the model alone.
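As an illustration of what "text interaction" means in practice, here is a hypothetical few-shot prompt: the task is demonstrated with a couple of examples and the model is left to complete the pattern, with no fine-tuning involved.

```python
# A hypothetical few-shot prompt; the examples are invented for illustration.
# The task is shown rather than trained: a GPT-3-style model simply
# continues the pattern established by the examples.
prompt = """Translate English to French.

English: cheese
French: fromage

English: bread
French: pain

English: wine
French:"""

# Sending this text to a sufficiently large GPT-style model would
# typically yield "vin" as the completion.
print(prompt)
```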

In 2022, OpenAI followed up on their GPT-3 accomplishments by releasing InstructGPT. The intent here was to tweak the model to follow instructions while also being less toxic and biased in its outputs. The key ingredient was Reinforcement Learning from Human Feedback (RLHF), a concept co-authored by Google and OpenAI in 2017⁴, which allows humans to be in the training loop to fine-tune the model's output to be more in line with human preferences. InstructGPT is the predecessor to the now-famous ChatGPT.

OpenAI has been a major contributor to large language models over the last few years, including the recent introduction of GPT-4, but they are not alone. Meta has introduced many open-source large language models like OPT, OPT-IML (instruction-tuned), and LLaMA. Google has released models like FLAN-T5 and BERT. And a huge open-source research community is releasing models like BLOOM and StableLM.

Progress is now moving so swiftly that every few weeks the state of the art changes, or models that previously required clusters to run now run on Raspberry Pis.