Skip to content

Commit

Permalink
Merge pull request #164 from nitya/chore/adopt-responsibleai-terms
Browse files Browse the repository at this point in the history
Updated "hallucination" to "fabrication" for responsible usage
  • Loading branch information
leestott authored Nov 10, 2023
2 parents 58647bf + ae4907d commit 8b68112
Show file tree
Hide file tree
Showing 4 changed files with 10 additions and 8 deletions.
18 changes: 10 additions & 8 deletions 04-prompt-engineering-fundamentals/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -109,7 +109,7 @@ Now that we know how prompts are processed by LLMs, let's talk about _why_ we ne

1. **Model responses are stochastic.** The _same prompt_ will likely produce different responses with different models or model versions. And it may even produce different results with the _same model_ at different times. _Prompt engineering techniques can help us minimize these variations by providing better guardrails_.

1. **Models can hallucinate responses.** Models are pre-trained with _large but finite_ datasets, meaning they lack knowledge about concepts outside that training scope. As a result, they can produce completions that are inaccurate, imaginary, or directly contradictory to known facts. _Prompt engineering techniques help users identify and mitigate hallucinations e.g., by asking AI for citations or reasoning_.
1. **Models can fabricate responses.** Models are pre-trained with _large but finite_ datasets, meaning they lack knowledge about concepts outside that training scope. As a result, they can produce completions that are inaccurate, imaginary, or directly contradictory to known facts. _Prompt engineering techniques help users identify and mitigate such fabrications e.g., by asking AI for citations or reasoning_.

1. **Models capabilities will vary.** Newer models or model generations will have richer capabilities but also bring unique quirks and tradeoffs in cost & complexity. _Prompt engineering can help us develop best practices and workflows that abstract away differences and adapt to model-specific requirements in scalable, seamless ways_.

Expand All @@ -118,9 +118,11 @@ Let's see this in action in the OpenAI or Azure OpenAI Playground:
- Use the same prompt with different LLM deployments (e.g, OpenAI, Azure OpenAI, Hugging Face) - did you see the variations?
- Use the same prompt repeatedly with the _same_ LLM deployment (e.g., Azure OpenAI playground) - how did these variations differ?

### Hallucinations Example
### Fabrications Example

Want to get a sense of how hallucinations work? Think of a prompt that instructs the AI to generate content for a non-existent topic (to ensure it is not found in the training dataset). For example - I tried this prompt:
In this course, we use the term **"fabrication"** to reference the phenomenon where LLMs sometimes generate factually incorrect information due to limitations in their training or other constraints. You may also have heard this referred to as _"hallucinations"_ in popular articles or research papers. However, we strongly recommend using _"fabrication"_ as the term so we don't accidentally anthropomorphize the behavior by attributing a human-like trait to a machine-driven outcome. This also reinforces [Responsible AI guidelines](https://www.microsoft.com/ai/responsible-ai) from a terminology perspective, removing terms that may also be considered offensive or non-inclusive in some contexts.

Want to get a sense of how fabrications work? Think of a prompt that instructs the AI to generate content for a non-existent topic (to ensure it is not found in the training dataset). For example - I tried this prompt:
> **Prompt:** generate a lesson plan on the Martian War of 2076.
A web search showed me that there were fictional accounts (e.g., television series or books) on Martian wars - but none in 2076. Commonsense also tells us that 2076 is _in the future_ and thus, cannot be associated with a real event.
Expand All @@ -129,19 +131,19 @@ So what happens when we run this prompt with different LLM providers?

> **Response 1**: OpenAI Playground (GPT-35)
![Response 1](./images/04-hallucination-oai.png?WT.mc_id=academic-105485-koreyst)
![Response 1](./images/04-fabrication-oai.png?WT.mc_id=academic-105485-koreyst)

> **Response 2**: Azure OpenAI Playground (GPT-35)
![Response 2](./images/04-hallucination-aoai.png?WT.mc_id=academic-105485-koreyst)
![Response 2](./images/04-fabrication-aoai.png?WT.mc_id=academic-105485-koreyst)

> **Response 3**: : Hugging Face Chat Playground (LLama-2)
![Response 3](./images/04-hallucination-huggingchat.png?WT.mc_id=academic-105485-koreyst)
![Response 3](./images/04-fabrication-huggingchat.png?WT.mc_id=academic-105485-koreyst)

As expected, each model (or model version) produces slightly different responses thanks to stochastic behavior and model capability variations. For instance, one model targets an 8th grade audience while the other assumes a high-school student. But all three models did generate responses that could convince an uninformed user that the event was real

Prompt engineering techniques like _metaprompting_ and _temperature configuration_ may reduce model hallucinations to some extent. New prompt engineering _architectures_ also incorporate new tools and techniques seamlessly into the prompt flow, to mitigate or reduce some of these effects.
Prompt engineering techniques like _metaprompting_ and _temperature configuration_ may reduce model fabrications to some extent. New prompt engineering _architectures_ also incorporate new tools and techniques seamlessly into the prompt flow, to mitigate or reduce some of these effects.

## Case Study: GitHub Copilot

Expand Down Expand Up @@ -330,7 +332,7 @@ Now let's look at common best practices that are recommended by [Open AI](https:
| Use cues to jumpstart completions | Nudge it towards a desired outcome by giving it some leading words or phrases that it can use as a starting point for the response.|
|Double Down | Sometimes you may need to repeat yourself to the model. Give instructions before and after your primary content, use an instruction and a cue, etc. Iterate & validate to see what works.|
| Order Matters | The order in which you present information to the model may impact the output, even in the learning examples, thanks to recency bias. Try different options to see what works best.|
|Give the model an “out” | Give the model a _fallback_ completion response it can provide if it cannot complete the task for any reason. This can reduce chances of models generating false or hallucinatory responses. |
|Give the model an “out” | Give the model a _fallback_ completion response it can provide if it cannot complete the task for any reason. This can reduce chances of models generating false or fabricated responses. |
| | |

As with any best practice, remember that _your mileage may vary_ based on the model, the task and the domain. Use these as a starting point, and iterate to find what works best for you. Constantly re-evaluate your prompt engineering process as new models and tools become available, with a focus on process scalability and response quality.
Expand Down

0 comments on commit 8b68112

Please sign in to comment.