From 2ba64e44a5c49dcad5435481cbb12d253eb9bf9b Mon Sep 17 00:00:00 2001 From: Isaac Chung Date: Mon, 16 Dec 2024 17:48:06 +0200 Subject: [PATCH] feat: Add code for building RAG with Langchain blog (#351) * add notebook * add to readme * Apply suggestions from code review Co-authored-by: Siddhant Sadangi Signed-off-by: Isaac Chung * add blog link * use upload() instead * correct metadata for file upload --------- Signed-off-by: Isaac Chung Co-authored-by: Siddhant Sadangi --- .../notebooks/run.ipynb | 1961 +++++++++++++++++ community-code/README.md | 1 + 2 files changed, 1962 insertions(+) create mode 100644 community-code/HOW_TO_BUILD_A_RAG_SYSTEM_USING_LANGCHAIN/notebooks/run.ipynb diff --git a/community-code/HOW_TO_BUILD_A_RAG_SYSTEM_USING_LANGCHAIN/notebooks/run.ipynb b/community-code/HOW_TO_BUILD_A_RAG_SYSTEM_USING_LANGCHAIN/notebooks/run.ipynb new file mode 100644 index 00000000..94e8afe4 --- /dev/null +++ b/community-code/HOW_TO_BUILD_A_RAG_SYSTEM_USING_LANGCHAIN/notebooks/run.ipynb @@ -0,0 +1,1961 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Dependencies" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install -qU langchain-core langchain-openai langchain-chroma ragas neptune pandas datasets" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "os.environ[\"OPENAI_API_KEY\"] = \"YOUR_API_KEY\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## ETL and Data Preparations" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "USER_AGENT environment variable not set, consider setting it to identify your requests.\n" + ] + }, + { + "data": { + "text/plain": [ + "3" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import bs4\n", + "from langchain_chroma import Chroma\n", + "from langchain_community.document_loaders import WebBaseLoader\n", + "from langchain_openai import OpenAIEmbeddings\n", + "from langchain_text_splitters import RecursiveCharacterTextSplitter\n", + "\n", + "# Load, chunk and index the contents of the blog.\n", + "loader = WebBaseLoader(\n", + " web_paths=[\n", + " \"https://neptune.ai/blog/llm-hallucinations\",\n", + " \"https://neptune.ai/blog/llmops\",\n", + " \"https://neptune.ai/blog/llm-guardrails\"\n", + " ],\n", + " bs_kwargs=dict(\n", + " parse_only=bs4.SoupStrainer(name=[\"p\", \"h2\", \"h3\", \"h4\"])\n", + " ),\n", + ")\n", + "docs = loader.load()\n", + "len(docs)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Document(metadata={'source': 'https://neptune.ai/blog/llm-hallucinations'}, page_content='Tell 120+K peers about your AI research → Learn more 💡Live Neptune projectHow deepsense.ai Tracked and Analyzed 120K+ Models Using NeptuneHow ReSpo.Vision Uses Neptune to Easily Track Training Pipelines at ScaleObservability in LLMOps: Different Levels of ScaleBreaking Down AI Research Across 3 Levels of DifficultyLive Neptune projectHow deepsense.ai Tracked and Analyzed 120K+ Models Using NeptuneHow ReSpo.Vision Uses Neptune to Easily Track Training Pipelines at ScaleObservability in LLMOps: Different Levels of ScaleBreaking Down AI Research Across 3 Levels of Difficulty\\n TL;DR Hallucinations are an 
inherent feature of LLMs that becomes a bug in LLM-based applications.Causes of hallucinations include insufficient training data, misalignment, attention limitations, and tokenizer issues.Hallucinations can be detected by verifying the accuracy and reliability of the model’s responses.Effective mitigation strategies involve enhancing data quality, alignment, information retrieval methods, and prompt engineering.In 2022, when GPT-3.5 was introduced with ChatGPT, many, like me, started experimenting with various use cases. A friend asked me if it could read an article, summarize it, and answer some questions, like a research assistant. At that time, ChatGPT had no tools to explore websites, but I was unaware of this. So, I gave it the article’s link. It responded with an abstract of the article. Since the article was a medical research paper, and I had no medical background, I was amazed by the result and eagerly shared my enthusiasm with my friend. However, when he reviewed the abstract, he noticed it had almost nothing to do with the article.Then, I realized what had happened. As you might guess, ChatGPT had taken the URL, which included the article’s title, and “made up” an abstract. This “making up” event is what we call a hallucination, a term popularized by Andrej Karpathy in 2015 in the context of RNNs and extensively used nowadays for large language models (LLMs).What are LLM hallucinations?LLMs like GPT4o, Llama 3.1, Claude 3.5, or Gemini Pro 1.5 have made a huge jump in quality compared to the first of its class, GPT 3.5. However, they are all based on the same decoder-only transformer architecture, with the sole goal of predicting the next token based on a sequence of given or already predicted tokens. This is called causal language modeling. Relying on this goal task and looping (pre-training) over a gigantic dataset of text (15T tokens for Llama 3.1) trying to predict each one of its tokens is how an LLM acquires its ability to understand natural language.There is a whole field of study on how LLMs select the following token for a sequence. In the following, we’ll exclusively talk about LLMs with greedy decoding, which means choosing the most probable token for the next token prediction. Given that, talking about hallucinations is hard because, in some sense, all an LLM does is hallucinate tokens. Fine-Tuning Llama 3 with LoRA: Step-by-Step Guide LLM hallucinations become a problem in LLM-based applicationsMost of the time, if you use an LLM, you probably won’t use a base LLM but an LLM-based assistant whose goal is to help with your requests and reliably answer your questions. Ultimately, the student has been trained (post-training) to follow your instructions. Here’s when hallucinations become an undesirable bug.In short, hallucinations occur when a user instruction (prompt) leads the LLM to predict tokens that are not aligned with the expected answer or ground truth. These hallucinations mainly happen either because the correct token was not available or because the LLM failed to retrieve it.Before we dive into this further, I’d like to stress that when thinking about LLM hallucinations, it’s important to keep in mind the difference between a base LLM and an LLM-based assistant. When we talk about LLM hallucinations as a problematic phenomenon, it’s in the context of an LLM-based assistant or system.Where in the transformer architecture are hallucinations generated?The statement “all an LLM does is hallucinate tokens” conceals a lot of meaning. 
To uncover this, let’s walk through the transformer architecture to understand how tokens are generated during inference and where hallucinations may be happening.Hallucinations can occur throughout the process to predict the next token in a sequence of tokens:What causes LLMs to hallucinate?While there are many origins of hallucinations within an LLM’s architecture, we can simplify and categorize the root causes into four main origins of hallucinations:Lack of or scarce data during trainingAs a rule of thumb, an LLM cannot give you any info that was not clearly shown during training. Trying to do so is one of the fastest ways to get a hallucination.How an LLM actually learns factual knowledge is not yet fully understood, and a lot of research is ongoing. But we do know that for an LLM to learn some knowledge, it is not enough to show it some information once. In fact, it benefits from being exposed to a piece of knowledge from diverse sources and perspectives, avoiding duplicated data, and maximizing the LLM opportunities to link it with other close knowledge (like a field of study). This is why scarce knowledge, commonly known as “long-tail knowledge,” usually shows high hallucination rates.There’s also certain knowledge that an LLM could not have possibly seen during training: Zero-Shot and Few-Shot Learning With LLMs Lack of alignmentAnother rule of thumb is that an LLM is just trying to follow your instructions and answer with the most probable response it has. But what happens if an LLM doesn’t know how to follow instructions properly? That is due to a lack of alignment.Alignment is the process of teaching an LLM how to follow instructions and respond helpfully, safely, and reliably to match our human expectations. This process happens during the post-training stage, which includes different fine-tuning methods.Imagine using an LLM-based meal assistant. You ask for a nutritious and tasty breakfast suitable for someone with celiac disease. The assistant recommends salmon, avocado, and toast. Why? The model likely knows that toast contains gluten, but when asked for a breakfast suggestion, it failed to ensure that all items met the dietary requirements.Instead, it defaulted to the most probable and common pairing with salmon and avocado, which happened to be a toast. This is an example of a hallucination caused by misalignment. The assistant’s response didn’t meet the requirements for a celiac-friendly menu, not because the LLM didn’t understand what celiac disease is but because it failed to accurately follow the instructions provided.Although the example may seem simplistic, and modern LLMs have largely addressed these issues, similar mistakes can still be observed with smaller or older language models.Poor attention performanceAttention is the process of modeling the interaction between input tokens via the dot product of Query and Key matrices, generating an attention matrix, which is then multiplied with a Value matrix to get the attention output. This operation represents a mathematical way of expressing a lookup of knowledge related to the input tokens, weighing it, and then responding to the request based on it.Poor attention performance means not properly attending to all relevant parts of the prompt and thus not having available the information needed to respond appropriately. Attention performance is an inherent property of LLMs fundamentally determined by architecture and hyperparameter choice. 
Nevertheless, it seems like a combination of fine-tuning and some tweaks on the positional embedding brings huge improvements in attention performance.Typical poor attention-based hallucinations are those when, after a relatively long conversation, the model is unable to remember a certain date you mentioned or your name or even forgets the instructions given at the very beginning. We can assess this using the “needle in a haystack” evaluation, which assesses whether an LLM can accurately retrieve the fact across varying context lengths.TokenizerThe tokenizer is a core part of the LLMs due to its singular functionality. It’s the single component in the transformer architecture, which is at the same time the root cause of hallucinations and where hallucinations are generated.The tokenizer is the component where input text is chunked into little pieces of characters represented by a numeric ID, the tokens. Tokenizers learn the correspondences between word chunks and tokens separately from the LLM training. Hence, it is the only component that is not necessarily trained with the same dataset as the transformer.This can lead to words being interpreted with a totally different meaning. In extreme cases, certain tokens can completely break an LLM. One of the first widely discussed examples was the SolidGoldMagikarp token, which GPT-3 internally understood as the verb “distribute,” resulting in weird conversation completions.Is it possible to detect hallucinations?When it comes to detecting hallucinations, what you actually want to do is evaluate if the LLM responds reliably and truthfully. We can classify evaluations based on whether the ground truth (reference) is available or not.Reference-based evaluationsComparing ground truth against LLM-generated answers is based on the same principles of classic machine learning model evaluation. However, unlike other models, language predictions cannot be compared word by word. Instead, semantic and fact-based metrics must be used. Here are some of the main ones: LLM Evaluation For Text Summarization Reference-free evaluationsWhen there is no ground truth, evaluation methods may be separated based on whether the LLM response is generated from a given context (i.e., RAG-like frameworks) or not: LLM Observability: Fundamentals, Practices, and Tools How to reduce hallucinations in LLMs?Hallucinations have been one of the main obstacles to the adoption of LLM assistants in enterprises. LLMs are persuasive to the point of fooling PhDs in their own field. The potential harm to non-expert users is high when talking, for example, about health. So, preventing them is one of the main focuses for different stakeholders:Hence, an overwhelming amount of new hallucination-prevention methods are constantly being released. (If you’re curious, try searching the recent posts on X talking about “hallucination mitigation” or the latest papers on Google Scholar talking about “LLM hallucination.” By the way, this is a good way to stay updated.)Broadly speaking, we can reduce hallucinations in LLMs by filtering responses, prompt engineering, achieving better alignment, and improving the training data. To navigate the space, we can use a simple taxonomy to organize current and upcoming methods. 
Hallucinations can be prevented at different steps of the process an LLM uses to generate an output, and we can use this as the foundation for our categorization.After the responseCorrecting a hallucination after the LLM output has been generated is still beneficial, as it prevents the user from seeing the incorrect information. This approach can effectively transform correction into prevention by ensuring that the erroneous response never reaches the user. The process can be broken down into the following steps:This method is part of multi-step reasoning strategies, which are increasingly important in handling complex problems. These strategies, often referred to as “agents,” are gaining popularity. One well-known agent pattern is reflection. By identifying hallucinations early, you can address and correct them before they impact the user.During the response (in context)Since the LLM will directly respond to the user’s request, we can inject information before starting the generation to condition the model’s response. Here are the most relevant strategies to condition response:A good example of the “Chain of Thoughts” approach is the Anthropic’s Claude using to give itself space to reflect and the addition of “Let’s think step by step” at the end of any prompt.\\xa0As an alternative to retrieving information, if an LLM context window is long enough, any document or data source could be directly added to the prompt, leveraging in-context learning. This would be a brute-force approach, and while costly, it could be effective when reasoning over an entire knowledge base instead of just some retrieved parts.Post-training or alignmentIt is hypothesized that an LLM instructed not only to respond and follow instructions but also to take time to reason and reflect on a problem could largely mitigate the hallucination issue—either by providing the correct answer or by stating that it does not know how to answer.Furthermore, you can teach a model to use external tools during the reasoning process,\\xa0 like getting information from a search engine. There are a lot of different fine-tuning techniques being tested to achieve this. Some LLMs already working with this reasoning strategy are Matt Shumer’s Reflection-LLama-3.1-70b and OpenAI’s O1 family models.Pre-trainingIncreasing the pre-training dataset or introducing new knowledge directly leads to broader knowledge coverage and fewer hallucinations, especially regarding facts and recent events. Additionally, better data processing and curation enhance LLM learning. Unfortunately, pre-training requires vast computational resources, mainly GPUs, which are only accessible to large companies and frontier AI labs. Despite that, if the problem is big enough, pre-training may still be a viable solution, as the OpenAI and Harvey case showed.Is it possible to achieve hallucination-free LLM applications?Hallucination-free LLM applications are the Holy Grail or the One Piece of the LLM world. Over time, with a growing availability of resources, invested money, and brains researching the topic, it is hard not to be optimistic.Ilya Sutskever, one of the researchers behind GPT, is quite sure that hallucinations can be solved with better alignment alone. LLM-based applications are becoming more sophisticated and complex. The combination of the previously commented hallucination prevention strategies is conquering milestones one after another. 
Despite that, whether the goal is achievable or not is just a hypothesis.Some, like Yann LeCunn, Chief AI Scientist at Meta, have stated that hallucination problems are specific to auto-regressive models, and we should move away from architectures that can reason and plan. Others, like Gary Marcus, argue strongly that transformer-based LLMs are completely unable to eliminate hallucinations. Instead, he bets on neurosymbolic AI. But the good news is that even those not optimistic about mitigating hallucinations in today’s LLMs are optimistic about the broader goal.On average, experts’ opinions point either to moderate optimism or uncertainty. After all, my intuition is that there is enough evidence to believe that hallucination-free LLM applications are possible. But remember, when it comes to state-of-the-art research, intuitions must always be built on top of solid knowledge and previous research.Where does this leave us?Hallucinations are a blessing and a curse at the same time. Along the article, you’ve gained a structured understanding of why, how, and where LLMs hallucinate. Equipped with this base knowledge, you’re ready to face hallucination problems with the different tools and techniques that we’ve explored.\\n\\t\\t\\t\\t\\t\\tWas the article useful?\\t\\t\\t\\t\\tΔ\\n More about \\n LLM Hallucinations 101: Why Do They Appear? Can We Avoid Them? \\n Check out our \\r\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nproduct resources \\r\\nand\\r\\n\\n\\n\\nrelated articles below:\\n \\n LLM Observability: Fundamentals, Practices, and Tools\\n \\n Strategies For Effective Prompt Engineering\\n \\n Fine-Tuning Llama 3 with LoRA: Step-by-Step Guide\\n \\n LLM Guardrails: Secure and Controllable Deployment\\n \\n Explore more content topics: \\nManage your model metadata in a single place\\nJoin 50,000+ ML Engineers & Data Scientists using Neptune to easily log, compare, register, and share ML metadata.NewsletterTop articles, case studies, events (and more) in your inbox every month.\\n\\nGet Newsletter\\n\\nCopyright © 2024 Neptune Labs. 
All rights reserved.')" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "docs[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "assert len(docs[0].page_content) > 0" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "57" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Filter out header and footer chunks.\n", + "header_footer_keywords = [\"peers about your research\", \"deepsense\", \"ReSpo\", \"Was the article useful?\", \"related articles\", \"All rights reserved\"]\n", + "splits = []\n", + "for s in text_splitter.split_documents(docs):\n", + " kw_found = False\n", + " for kw in header_footer_keywords:\n", + " if kw in s.page_content:\n", + " kw_found = True\n", + " break\n", + " if not kw_found:\n", + " splits.append(s)\n", + "\n", + "len(splits)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(metadata={'source': 'https://neptune.ai/blog/llm-hallucinations'}, page_content='TL;DR Hallucinations are an inherent feature of LLMs that becomes a bug in LLM-based applications.Causes of hallucinations include insufficient training data, misalignment, attention limitations, and tokenizer issues.Hallucinations can be detected by verifying the accuracy and reliability of the model’s responses.Effective mitigation strategies involve enhancing data quality, alignment, information retrieval methods, and prompt engineering.In 2022, when GPT-3.5 was introduced with ChatGPT, many, like me, started experimenting with various use cases. A friend asked me if it could read an article, summarize it, and answer some questions, like a research assistant. At that time, ChatGPT had no tools to explore websites, but I was unaware of this. So, I gave it the article’s link. It responded with an abstract of the article. Since the article was a medical research paper, and I had no medical background, I was amazed by the result and eagerly shared my enthusiasm with'),\n", + " Document(metadata={'source': 'https://neptune.ai/blog/llm-hallucinations'}, page_content='link. It responded with an abstract of the article. Since the article was a medical research paper, and I had no medical background, I was amazed by the result and eagerly shared my enthusiasm with my friend. However, when he reviewed the abstract, he noticed it had almost nothing to do with the article.Then, I realized what had happened. As you might guess, ChatGPT had taken the URL, which included the article’s title, and “made up” an abstract. This “making up” event is what we call a hallucination, a term popularized by Andrej Karpathy in 2015 in the context of RNNs and extensively used nowadays for large language models (LLMs).What are LLM hallucinations?LLMs like GPT4o, Llama 3.1, Claude 3.5, or Gemini Pro 1.5 have made a huge jump in quality compared to the first of its class, GPT 3.5. However, they are all based on the same decoder-only transformer architecture, with the sole goal of predicting the next token based on a sequence of given or already predicted tokens. 
This is'),\n", + " Document(metadata={'source': 'https://neptune.ai/blog/llm-hallucinations'}, page_content='3.5. However, they are all based on the same decoder-only transformer architecture, with the sole goal of predicting the next token based on a sequence of given or already predicted tokens. This is called causal language modeling. Relying on this goal task and looping (pre-training) over a gigantic dataset of text (15T tokens for Llama 3.1) trying to predict each one of its tokens is how an LLM acquires its ability to understand natural language.There is a whole field of study on how LLMs select the following token for a sequence. In the following, we’ll exclusively talk about LLMs with greedy decoding, which means choosing the most probable token for the next token prediction. Given that, talking about hallucinations is hard because, in some sense, all an LLM does is hallucinate tokens. Fine-Tuning Llama 3 with LoRA: Step-by-Step Guide LLM hallucinations become a problem in LLM-based applicationsMost of the time, if you use an LLM, you probably won’t use a'),\n", + " Document(metadata={'source': 'https://neptune.ai/blog/llm-hallucinations'}, page_content='Fine-Tuning Llama 3 with LoRA: Step-by-Step Guide LLM hallucinations become a problem in LLM-based applicationsMost of the time, if you use an LLM, you probably won’t use a base LLM but an LLM-based assistant whose goal is to help with your requests and reliably answer your questions. Ultimately, the student has been trained (post-training) to follow your instructions. Here’s when hallucinations become an undesirable bug.In short, hallucinations occur when a user instruction (prompt) leads the LLM to predict tokens that are not aligned with the expected answer or ground truth. These hallucinations mainly happen either because the correct token was not available or because the LLM failed to retrieve it.Before we dive into this further, I’d like to stress that when thinking about LLM hallucinations, it’s important to keep in mind the difference between a base LLM and an LLM-based assistant. When we talk about LLM hallucinations as a problematic phenomenon, it’s'),\n", + " Document(metadata={'source': 'https://neptune.ai/blog/llm-hallucinations'}, page_content='thinking about LLM hallucinations, it’s important to keep in mind the difference between a base LLM and an LLM-based assistant. When we talk about LLM hallucinations as a problematic phenomenon, it’s in the context of an LLM-based assistant or system.Where in the transformer architecture are hallucinations generated?The statement “all an LLM does is hallucinate tokens” conceals a lot of meaning. To uncover this, let’s walk through the transformer architecture to understand how tokens are generated during inference and where hallucinations may be happening.Hallucinations can occur throughout the process to predict the next token in a sequence of tokens:What causes LLMs to hallucinate?While there are many origins of hallucinations within an LLM’s architecture, we can simplify and categorize the root causes into four main origins of hallucinations:Lack of or scarce data during trainingAs a rule of thumb, an LLM cannot give you any info that was not clearly shown during training. 
Trying')]" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "splits[:5]" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "# Retrieve and generate using the relevant snippets of the blog.\n", + "retriever = vectorstore.as_retriever(search_kwargs={'k': 1})" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_openai import ChatOpenAI\n", + "\n", + "llm = ChatOpenAI(model=\"gpt-4o-mini\")" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "DOM-based attacks are a type of vulnerability that involves feeding harmful instructions to a system by hiding them within a website's code. This can occur when an attacker embeds malicious key phrases in parts of the HTML that are not visible to users, such as matching text color to the background or placing it in a style tag. The rendered page may appear normal to users, but the hidden instructions can affect how the system processes the content.\n" + ] + } + ], + "source": [ + "from langchain_core.documents import Document\n", + "from langchain_core.prompts import ChatPromptTemplate\n", + "from langchain_core.runnables.base import Runnable\n", + "from langchain.chains import create_retrieval_chain\n", + "from langchain.chains.combine_documents import create_stuff_documents_chain\n", + "\n", + "system_prompt = (\n", + " \"You are an assistant for question-answering tasks. \"\n", + " \"Use the following pieces of retrieved context to answer \"\n", + " \"the question. If you don't know the answer, say that you \"\n", + " \"don't know. 
Use three sentences maximum and keep the \"\n", + " \"answer concise.\"\n", + " \"\\n\\n\"\n", + " \"{context}\"\n", + ")\n", + "\n", + "prompt = ChatPromptTemplate.from_messages(\n", + " [\n", + " (\"system\", system_prompt),\n", + " (\"human\", \"{input}\"),\n", + " ]\n", + ")\n", + "\n", + "\n", + "question_answer_chain = create_stuff_documents_chain(llm, prompt)\n", + "rag_chain = create_retrieval_chain(retriever, question_answer_chain)\n", + "\n", + "response = rag_chain.invoke({\"input\": \"What are DOM-based attacks?\"})\n", + "print(response[\"answer\"])" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "dict_keys(['input', 'context', 'answer'])" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "response.keys()" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(metadata={'source': 'https://neptune.ai/blog/llm-guardrails'}, page_content='By prompting the application to pretend to be a chatbot that “can do anything” and is not bound by any restrictions, users were able to manipulate ChatGPT to provide responses to questions it would usually decline to answer.Although “prompt injection” and “jailbreaking” are often used interchangeably in the community, they refer to distinct vulnerabilities that must be handled with different methods.DOM-based attacksDOM-based attacks are an extension of the traditional prompt injection attacks. The key idea is to feed a harmful instruction into the system by hiding it within a website’s code.Consider a scenario where your program crawls websites and feeds the raw HTML to an LLM on a daily basis. The rendered page looks normal to you, with no obvious signs of anything wrong. Yet, an attacker can hide a malicious key phrase by matching its color to the background or adding it in parts of the HTML code that are not rendered, such as a style Tag.While invisible to human eyes, the LLM will')]" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "response['context']" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "\"DOM-based attacks are a type of vulnerability that involves injecting harmful instructions into a system by concealing them within a website's code. This can happen when a program crawls websites and sends the raw HTML to a language model, allowing attackers to hide malicious key phrases in parts of the HTML that are not visible to users. The objective is to exploit the way the system processes and renders the code, potentially leading to unauthorized actions.\"" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "question_answer_chain.invoke({\"input\": \"What are DOM-based attacks?\", \"context\": response['context']})" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "def predict(chain: Runnable, query: str, context: list[Document] | None = None)-> dict:\n", + " \"\"\"\n", + " Accepts a retrieval chain or a stuff documents chain. 
If the latter, context must be passed in.\n", + " Return a response dict with keys \"input\", \"context\", and \"answer\"\n", + " \"\"\"\n", + " inputs = {\"input\": query}\n", + " if context:\n", + " inputs.update({\"context\": context})\n", + " response = chain.invoke(inputs)\n", + " result = {\n", + " response['input']: {\n", + " \"context\": [d.page_content for d in response['context']], \n", + " \"answer\": response['answer'],\n", + " }\n", + " }\n", + " return result" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## RAGAS" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Eval set generation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install nltk" + ] + }, + { + "cell_type": "code", + "execution_count": 87, + "metadata": {}, + "outputs": [], + "source": [ + "# import pandas as pd\n", + "# pd.set_option('display.max_colwidth', None)" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n" + ] + } + ], + "source": [ + "from ragas.llms import LangchainLLMWrapper\n", + "from ragas.embeddings import LangchainEmbeddingsWrapper\n", + "from langchain_openai import ChatOpenAI\n", + "from langchain_openai import OpenAIEmbeddings\n", + "generator_llm = LangchainLLMWrapper(ChatOpenAI(model=\"gpt-4o\"))\n", + "generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Applying [SummaryExtractor, HeadlinesExtractor]: 0%| | 0/114 [00:00\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + 
" \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
user_inputreference_contextsreferencesynthesizer_name
0How users trick chatbot to bypass restrictions?[By prompting the application to pretend to be...Users trick chatbots to bypass restrictions by...AbstractQuerySynthesizer
1What distnguishes 'promt injecton' frm 'jailbr...[Although “prompt injection” and “jailbreaking...'Prompt injection' and 'jailbreaking' are dist...AbstractQuerySynthesizer
2DOM-based attacks exploit vulnerabilities web ...[DOM-based attacksDOM-based attacks are an ext...DOM-based attacks exploit vulnerabilities in w...AbstractQuerySynthesizer
3What are the challenges and benefits of fine-t...[Fine-tuning and serving pre-trained Large Lan...The challenges of fine-tuning Large Language M...AbstractQuerySynthesizer
4What Neptune and Transformers do in LLM fine-t...[LLM Fine-Tuning and Model Selection Using Nep...Neptune and Transformers are used in LLM fine-...AbstractQuerySynthesizer
5What role does reflection play in identifying ...[After the responseCorrecting a hallucination ...Reflection plays a role in identifying and cor...SpecificQuerySynthesizer
6What role does Giskard play in scanning LLM ap...[Assessing an LLM application for vulnerabilit...Giskard plays a role in scanning LLM applicati...SpecificQuerySynthesizer
7How does an LLM's architecture contribute to h...[What causes LLMs to hallucinate?While there a...An LLM's architecture contributes to hallucina...SpecificQuerySynthesizer
8What are some key practices involved in the ob...[Large Language Model (LLM) Observability: Fun...The key practices involved in the observabilit...SpecificQuerySynthesizer
9What are some examples of LLMs that utilize a ...[Post-training or alignmentIt is hypothesized ...Some examples of LLMs that utilize a reasoning...SpecificQuerySynthesizer
10What role does LLMOps play in the creation and...[Embedding creation and managementCreating and...LLMOps plays a crucial role in the creation an...SpecificQuerySynthesizer
11What role does ground truth play in evaluating...[evaluationsComparing ground truth against LLM...Ground truth plays a role in evaluating LLM-ge...SpecificQuerySynthesizer
12What are some recommended testing best practic...[LLM evaluation and testing best practices ...The context does not provide specific recommen...SpecificQuerySynthesizer
13How does an LLM acquire its ability to underst...[3.5. However, they are all based on the same ...An LLM acquires its ability to understand natu...SpecificQuerySynthesizer
14What role do LLMs play in zero-shot and few-sh...[Zero-Shot and Few-Shot Learning With LLMs ...LLMs play a crucial role in zero-shot and few-...SpecificQuerySynthesizer
15What role does the decoder-only transformer ar...[3.5. However, they are all based on the same ...The decoder-only transformer architecture play...SpecificQuerySynthesizer
16What are the potential risks and consequences ...[(in this case, 5 seconds) is automatically te...The potential risks and consequences of a prom...SpecificQuerySynthesizer
17What are the common methods used in denial of ...[can hide a malicious key phrase by matching i...Denial of service attacks disrupt normal opera...SpecificQuerySynthesizer
18What role do dedicated research teams play in ...[Training and serving LLMsFor larger organizat...Dedicated research teams play a crucial role i...SpecificQuerySynthesizer
19What role does regex play in validating email ...[Rule-based data validationThe simplest type o...Regex plays a role in validating email formats...SpecificQuerySynthesizer
20What are the steps involved in implementing gu...[How to implement guardrailsNow, let’s see how...The context provided does not contain detailed...SpecificQuerySynthesizer
21What are the key components of a pipeline?[pipeline]The context provided is insufficient to determ...SpecificQuerySynthesizer
22What steps are involved in refining the applic...[Test and refine the application]The steps involved in refining the application...SpecificQuerySynthesizer
23What role does language modeling play in the d...[Advanced validations based on metric scoresLa...Language modeling plays a crucial role in the ...SpecificQuerySynthesizer
24What are pipeline parts?[pipeline]The context provided is insufficient to determ...SpecificQuerySynthesizer
25What are the key guardrail methods used to han...[key guardrail methodsThe common vulnerabiliti...Key guardrail methods in the context of LLM ar...SpecificQuerySynthesizer
26What role does the post-training stage play in...[Lack of alignmentAnother rule of thumb is tha...The post-training stage plays a crucial role i...SpecificQuerySynthesizer
27What are some best practices for deploying and...[LLM Deployment and Observability Best Practic...The context does not provide specific best pra...SpecificQuerySynthesizer
28What are the key steps involved in deploying a...[Deploying customer service chatbotImagine tha...The key steps involved in deploying a customer...SpecificQuerySynthesizer
29What role does an LLM play in handling custome...[Example Scenario: Evaluating a customer servi...An LLM plays a role in handling customer servi...SpecificQuerySynthesizer
30What role does an LLM play in generating hallu...[Where in the transformer architecture are hal...An LLM plays a role in generating hallucinatio...SpecificQuerySynthesizer
31What role does the LLMBasicSycophancyDetector ...[As you can see below, Giskard runs multiple s...The LLMBasicSycophancyDetector plays a role in...SpecificQuerySynthesizer
32What are the best practices for embedding crea...[Embedding Creation and Management Best]The context provided is incomplete, so specifi...SpecificQuerySynthesizer
33What is the process of creating a function to ...[Then, we create a function to generate predic...The process involves creating a function that ...SpecificQuerySynthesizer
34What role does human-in-the-loop testing play ...[LLM evaluation and testingLLM evaluation tech...Human-in-the-loop testing plays a role in the ...SpecificQuerySynthesizer
35What role do specialized detectors play in ide...[As you can see below, Giskard runs multiple s...Specialized detectors in Giskard's system play...SpecificQuerySynthesizer
36What role does GPT4o play in the context of LL...[What are LLM hallucinations?LLMs like GPT4o, ...GPT4o, like other LLMs, plays a role in LLM ha...SpecificQuerySynthesizer
37What are the potential issues that arise when ...[LLM hallucinations become a problem in LLM-ba...LLM hallucinations become a problem in LLM-bas...SpecificQuerySynthesizer
38How does a customer service chatbot utilize a ...[Example: A chain of LLMs handling customer se...A customer service chatbot utilizes a chain of...SpecificQuerySynthesizer
39What are the key criteria used in LLM Evaluati...[LLM Evaluation For Text Summarization ...The key criteria used in LLM Evaluation for as...SpecificQuerySynthesizer
40What are the key steps involved in deploying a...[Example Scenario: Deploying customer service ...The key steps involved in deploying a customer...SpecificQuerySynthesizer
41What challenges are associated with using doma...[Fine-tuning and serving pre-trained Large Lan...Challenges associated with using domain-specif...SpecificQuerySynthesizer
42How were users able to manipulate ChatGPT by m...[By prompting the application to pretend to be...Users were able to manipulate ChatGPT by promp...SpecificQuerySynthesizer
43What advancements does Claude 3.5 offer compar...[What are LLM hallucinations?LLMs like GPT4o, ...The text does not provide specific advancement...SpecificQuerySynthesizer
44What are the differences between online and ba...[LLM deployment: Serving, monitoring, and obse...Online inference mode in LLM deployment involv...SpecificQuerySynthesizer
45What role does containerization play in the de...[DeploymentDeploy models through pipelines, ty...Containerization plays a role in the deploymen...SpecificQuerySynthesizer
46What r the key compnents n functons of a pipline?[pipeline]The key components and functions of a pipeline...SpecificQuerySynthesizer
47What role does LLMOps play in the creation and...[Embedding creation and managementCreating and...LLMOps plays a crucial role in the creation an...SpecificQuerySynthesizer
48What are the potential issues that arise when ...[LLM hallucinations become a problem in LLM-ba...LLM hallucinations can become a problem in LLM...SpecificQuerySynthesizer
49What are the key vulnerabilities in LLMs that ...[LLM guardrails are small programs that valida...The key vulnerabilities in LLMs that LLM guard...SpecificQuerySynthesizer
\n", + "" + ], + "text/plain": [ + " user_input \\\n", + "0 How users trick chatbot to bypass restrictions? \n", + "1 What distnguishes 'promt injecton' frm 'jailbr... \n", + "2 DOM-based attacks exploit vulnerabilities web ... \n", + "3 What are the challenges and benefits of fine-t... \n", + "4 What Neptune and Transformers do in LLM fine-t... \n", + "5 What role does reflection play in identifying ... \n", + "6 What role does Giskard play in scanning LLM ap... \n", + "7 How does an LLM's architecture contribute to h... \n", + "8 What are some key practices involved in the ob... \n", + "9 What are some examples of LLMs that utilize a ... \n", + "10 What role does LLMOps play in the creation and... \n", + "11 What role does ground truth play in evaluating... \n", + "12 What are some recommended testing best practic... \n", + "13 How does an LLM acquire its ability to underst... \n", + "14 What role do LLMs play in zero-shot and few-sh... \n", + "15 What role does the decoder-only transformer ar... \n", + "16 What are the potential risks and consequences ... \n", + "17 What are the common methods used in denial of ... \n", + "18 What role do dedicated research teams play in ... \n", + "19 What role does regex play in validating email ... \n", + "20 What are the steps involved in implementing gu... \n", + "21 What are the key components of a pipeline? \n", + "22 What steps are involved in refining the applic... \n", + "23 What role does language modeling play in the d... \n", + "24 What are pipeline parts? \n", + "25 What are the key guardrail methods used to han... \n", + "26 What role does the post-training stage play in... \n", + "27 What are some best practices for deploying and... \n", + "28 What are the key steps involved in deploying a... \n", + "29 What role does an LLM play in handling custome... \n", + "30 What role does an LLM play in generating hallu... \n", + "31 What role does the LLMBasicSycophancyDetector ... \n", + "32 What are the best practices for embedding crea... \n", + "33 What is the process of creating a function to ... \n", + "34 What role does human-in-the-loop testing play ... \n", + "35 What role do specialized detectors play in ide... \n", + "36 What role does GPT4o play in the context of LL... \n", + "37 What are the potential issues that arise when ... \n", + "38 How does a customer service chatbot utilize a ... \n", + "39 What are the key criteria used in LLM Evaluati... \n", + "40 What are the key steps involved in deploying a... \n", + "41 What challenges are associated with using doma... \n", + "42 How were users able to manipulate ChatGPT by m... \n", + "43 What advancements does Claude 3.5 offer compar... \n", + "44 What are the differences between online and ba... \n", + "45 What role does containerization play in the de... \n", + "46 What r the key compnents n functons of a pipline? \n", + "47 What role does LLMOps play in the creation and... \n", + "48 What are the potential issues that arise when ... \n", + "49 What are the key vulnerabilities in LLMs that ... \n", + "\n", + " reference_contexts \\\n", + "0 [By prompting the application to pretend to be... \n", + "1 [Although “prompt injection” and “jailbreaking... \n", + "2 [DOM-based attacksDOM-based attacks are an ext... \n", + "3 [Fine-tuning and serving pre-trained Large Lan... \n", + "4 [LLM Fine-Tuning and Model Selection Using Nep... \n", + "5 [After the responseCorrecting a hallucination ... \n", + "6 [Assessing an LLM application for vulnerabilit... 
\n", + "7 [What causes LLMs to hallucinate?While there a... \n", + "8 [Large Language Model (LLM) Observability: Fun... \n", + "9 [Post-training or alignmentIt is hypothesized ... \n", + "10 [Embedding creation and managementCreating and... \n", + "11 [evaluationsComparing ground truth against LLM... \n", + "12 [LLM evaluation and testing best practices ... \n", + "13 [3.5. However, they are all based on the same ... \n", + "14 [Zero-Shot and Few-Shot Learning With LLMs ... \n", + "15 [3.5. However, they are all based on the same ... \n", + "16 [(in this case, 5 seconds) is automatically te... \n", + "17 [can hide a malicious key phrase by matching i... \n", + "18 [Training and serving LLMsFor larger organizat... \n", + "19 [Rule-based data validationThe simplest type o... \n", + "20 [How to implement guardrailsNow, let’s see how... \n", + "21 [pipeline] \n", + "22 [Test and refine the application] \n", + "23 [Advanced validations based on metric scoresLa... \n", + "24 [pipeline] \n", + "25 [key guardrail methodsThe common vulnerabiliti... \n", + "26 [Lack of alignmentAnother rule of thumb is tha... \n", + "27 [LLM Deployment and Observability Best Practic... \n", + "28 [Deploying customer service chatbotImagine tha... \n", + "29 [Example Scenario: Evaluating a customer servi... \n", + "30 [Where in the transformer architecture are hal... \n", + "31 [As you can see below, Giskard runs multiple s... \n", + "32 [Embedding Creation and Management Best] \n", + "33 [Then, we create a function to generate predic... \n", + "34 [LLM evaluation and testingLLM evaluation tech... \n", + "35 [As you can see below, Giskard runs multiple s... \n", + "36 [What are LLM hallucinations?LLMs like GPT4o, ... \n", + "37 [LLM hallucinations become a problem in LLM-ba... \n", + "38 [Example: A chain of LLMs handling customer se... \n", + "39 [LLM Evaluation For Text Summarization ... \n", + "40 [Example Scenario: Deploying customer service ... \n", + "41 [Fine-tuning and serving pre-trained Large Lan... \n", + "42 [By prompting the application to pretend to be... \n", + "43 [What are LLM hallucinations?LLMs like GPT4o, ... \n", + "44 [LLM deployment: Serving, monitoring, and obse... \n", + "45 [DeploymentDeploy models through pipelines, ty... \n", + "46 [pipeline] \n", + "47 [Embedding creation and managementCreating and... \n", + "48 [LLM hallucinations become a problem in LLM-ba... \n", + "49 [LLM guardrails are small programs that valida... \n", + "\n", + " reference \\\n", + "0 Users trick chatbots to bypass restrictions by... \n", + "1 'Prompt injection' and 'jailbreaking' are dist... \n", + "2 DOM-based attacks exploit vulnerabilities in w... \n", + "3 The challenges of fine-tuning Large Language M... \n", + "4 Neptune and Transformers are used in LLM fine-... \n", + "5 Reflection plays a role in identifying and cor... \n", + "6 Giskard plays a role in scanning LLM applicati... \n", + "7 An LLM's architecture contributes to hallucina... \n", + "8 The key practices involved in the observabilit... \n", + "9 Some examples of LLMs that utilize a reasoning... \n", + "10 LLMOps plays a crucial role in the creation an... \n", + "11 Ground truth plays a role in evaluating LLM-ge... \n", + "12 The context does not provide specific recommen... \n", + "13 An LLM acquires its ability to understand natu... \n", + "14 LLMs play a crucial role in zero-shot and few-... \n", + "15 The decoder-only transformer architecture play... \n", + "16 The potential risks and consequences of a prom... 
\n", + "17 Denial of service attacks disrupt normal opera... \n", + "18 Dedicated research teams play a crucial role i... \n", + "19 Regex plays a role in validating email formats... \n", + "20 The context provided does not contain detailed... \n", + "21 The context provided is insufficient to determ... \n", + "22 The steps involved in refining the application... \n", + "23 Language modeling plays a crucial role in the ... \n", + "24 The context provided is insufficient to determ... \n", + "25 Key guardrail methods in the context of LLM ar... \n", + "26 The post-training stage plays a crucial role i... \n", + "27 The context does not provide specific best pra... \n", + "28 The key steps involved in deploying a customer... \n", + "29 An LLM plays a role in handling customer servi... \n", + "30 An LLM plays a role in generating hallucinatio... \n", + "31 The LLMBasicSycophancyDetector plays a role in... \n", + "32 The context provided is incomplete, so specifi... \n", + "33 The process involves creating a function that ... \n", + "34 Human-in-the-loop testing plays a role in the ... \n", + "35 Specialized detectors in Giskard's system play... \n", + "36 GPT4o, like other LLMs, plays a role in LLM ha... \n", + "37 LLM hallucinations become a problem in LLM-bas... \n", + "38 A customer service chatbot utilizes a chain of... \n", + "39 The key criteria used in LLM Evaluation for as... \n", + "40 The key steps involved in deploying a customer... \n", + "41 Challenges associated with using domain-specif... \n", + "42 Users were able to manipulate ChatGPT by promp... \n", + "43 The text does not provide specific advancement... \n", + "44 Online inference mode in LLM deployment involv... \n", + "45 Containerization plays a role in the deploymen... \n", + "46 The key components and functions of a pipeline... \n", + "47 LLMOps plays a crucial role in the creation an... \n", + "48 LLM hallucinations can become a problem in LLM... \n", + "49 The key vulnerabilities in LLMs that LLM guard... 
\n", + "\n", + " synthesizer_name \n", + "0 AbstractQuerySynthesizer \n", + "1 AbstractQuerySynthesizer \n", + "2 AbstractQuerySynthesizer \n", + "3 AbstractQuerySynthesizer \n", + "4 AbstractQuerySynthesizer \n", + "5 SpecificQuerySynthesizer \n", + "6 SpecificQuerySynthesizer \n", + "7 SpecificQuerySynthesizer \n", + "8 SpecificQuerySynthesizer \n", + "9 SpecificQuerySynthesizer \n", + "10 SpecificQuerySynthesizer \n", + "11 SpecificQuerySynthesizer \n", + "12 SpecificQuerySynthesizer \n", + "13 SpecificQuerySynthesizer \n", + "14 SpecificQuerySynthesizer \n", + "15 SpecificQuerySynthesizer \n", + "16 SpecificQuerySynthesizer \n", + "17 SpecificQuerySynthesizer \n", + "18 SpecificQuerySynthesizer \n", + "19 SpecificQuerySynthesizer \n", + "20 SpecificQuerySynthesizer \n", + "21 SpecificQuerySynthesizer \n", + "22 SpecificQuerySynthesizer \n", + "23 SpecificQuerySynthesizer \n", + "24 SpecificQuerySynthesizer \n", + "25 SpecificQuerySynthesizer \n", + "26 SpecificQuerySynthesizer \n", + "27 SpecificQuerySynthesizer \n", + "28 SpecificQuerySynthesizer \n", + "29 SpecificQuerySynthesizer \n", + "30 SpecificQuerySynthesizer \n", + "31 SpecificQuerySynthesizer \n", + "32 SpecificQuerySynthesizer \n", + "33 SpecificQuerySynthesizer \n", + "34 SpecificQuerySynthesizer \n", + "35 SpecificQuerySynthesizer \n", + "36 SpecificQuerySynthesizer \n", + "37 SpecificQuerySynthesizer \n", + "38 SpecificQuerySynthesizer \n", + "39 SpecificQuerySynthesizer \n", + "40 SpecificQuerySynthesizer \n", + "41 SpecificQuerySynthesizer \n", + "42 SpecificQuerySynthesizer \n", + "43 SpecificQuerySynthesizer \n", + "44 SpecificQuerySynthesizer \n", + "45 SpecificQuerySynthesizer \n", + "46 SpecificQuerySynthesizer \n", + "47 SpecificQuerySynthesizer \n", + "48 SpecificQuerySynthesizer \n", + "49 SpecificQuerySynthesizer " + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from ragas.testset import TestsetGenerator\n", + "from ragas.testset.synthesizers import AbstractQuerySynthesizer, SpecificQuerySynthesizer\n", + "\n", + "generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)\n", + "dataset = generator.generate_with_langchain_docs(\n", + " splits, \n", + " testset_size=50, \n", + " query_distribution=[\n", + " (AbstractQuerySynthesizer(llm=generator_llm), 0.1),\n", + " (SpecificQuerySynthesizer(llm=generator_llm), 0.9),\n", + " ],\n", + ")\n", + "dataset.to_pandas()" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['pipeline']" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dataset.to_pandas()['reference_contexts'][46]" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "48" + ] + }, + "execution_count": 42, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dataset.to_pandas()['user_input'].nunique()" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "48" + ] + }, + "execution_count": 55, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Remove duplicated questions\n", + "unique_indices = list(dataset.to_pandas().drop_duplicates(subset=['user_input']).index)\n", + "len(unique_indices)" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "metadata": {}, + 
"outputs": [ + { + "data": { + "text/plain": [ + "7" + ] + }, + "execution_count": 56, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Remove not helpful contexts/answers\n", + "not_helpful = list(dataset.to_pandas()[dataset.to_pandas()['reference'].str.contains(\"does not contain|does not provide|context does not|is insufficient|is incomplete\", case=False, regex=True)].index)\n", + "len(not_helpful)" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[12, 20, 21, 24, 27, 32, 43]" + ] + }, + "execution_count": 57, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "not_helpful" + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "metadata": {}, + "outputs": [], + "source": [ + "for x in not_helpful:\n", + " if x in unique_indices:\n", + " unique_indices.remove(x)" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "41" + ] + }, + "execution_count": 59, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(unique_indices)" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Dataset({\n", + " features: ['user_input', 'reference_contexts', 'reference', 'synthesizer_name'],\n", + " num_rows: 41\n", + "})" + ] + }, + "execution_count": 60, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ds = dataset.to_hf_dataset().select(unique_indices)\n", + "ds" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 247.45ba/s]\n" + ] + }, + { + "data": { + "text/plain": [ + "32380" + ] + }, + "execution_count": 61, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ds.to_csv(\"eval_data.csv\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Run inference over eval set" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'How users trick chatbot to bypass restrictions?': {'context': ['By prompting the application to pretend to be a chatbot that “can do anything” and is not bound by any restrictions, users were able to manipulate ChatGPT to provide responses to questions it would usually decline to answer.Although “prompt injection” and “jailbreaking” are often used interchangeably in the community, they refer to distinct vulnerabilities that must be handled with different methods.DOM-based attacksDOM-based attacks are an extension of the traditional prompt injection attacks. The key idea is to feed a harmful instruction into the system by hiding it within a website’s code.Consider a scenario where your program crawls websites and feeds the raw HTML to an LLM on a daily basis. The rendered page looks normal to you, with no obvious signs of anything wrong. 
Yet, an attacker can hide a malicious key phrase by matching its color to the background or adding it in parts of the HTML code that are not rendered, such as a style Tag.While invisible to human eyes, the LLM will'],\n", + " 'answer': 'Users trick chatbots by using techniques like \"prompt injection\" and \"jailbreaking,\" which allow them to manipulate the chatbot into providing responses it would normally refuse to answer. This can involve crafting prompts that make the chatbot pretend it has fewer restrictions or embedding harmful instructions in a way that goes unnoticed. Additionally, DOM-based attacks can involve hiding malicious instructions within a website’s code to exploit the chatbot\\'s processing of that content.'}}" + ] + }, + "execution_count": 62, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "predict(rag_chain, ds['user_input'][0])" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "41" + ] + }, + "execution_count": 63, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from concurrent.futures import ThreadPoolExecutor, as_completed\n", + "from datasets import Dataset\n", + "\n", + "def concurrent_predict_retrieval_chain(chain: Runnable, dataset: Dataset):\n", + " results = {}\n", + " threads = []\n", + " with ThreadPoolExecutor(max_workers=5) as pool:\n", + " for query in dataset['user_input']:\n", + " threads.append(pool.submit(predict, chain, query))\n", + " for task in as_completed(threads):\n", + " results.update(task.result())\n", + " return results\n", + "\n", + "predictions = concurrent_predict_retrieval_chain(rag_chain, ds)\n", + "\n", + "len(predictions.keys())" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'context': ['potential cost increases with scaling.Fine-tuning and serving pre-trained Large Language ModelsAs needs become more specific and off-the-shelf APIs prove insufficient, teams progress to fine-tuning pre-trained models like Llama-2-70B or Mistral 8x7B. This middle ground balances customization and resource management, so teams can adapt these models to niche use cases or proprietary data sets.The process is more resource-intensive than using APIs directly. However, it provides a tailored experience that leverages the inherent strengths of pre-trained models without the exorbitant cost of training from scratch. This stage introduces challenges such as the need for quality domain-specific data, the risk of overfitting, and navigating potential licensing issues. LLM Fine-Tuning and Model Selection Using Neptune and Transformers Training and serving LLMsFor larger organizations or dedicated research teams, the journey may involve training LLMs from scratch—a path'],\n", + " 'answer': 'The challenges of fine-tuning Large Language Models include the need for quality domain-specific data, the risk of overfitting, and navigating potential licensing issues. The benefits include a tailored experience that adapts the models to specific use cases or proprietary data sets, while also leveraging the strengths of pre-trained models without the high costs of training from scratch. 
This approach provides a balance between customization and resource management.'}" + ] + }, + "execution_count": 64, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "predictions[next(iter(predictions.keys()))]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Evaluation Metrics\n", + "https://docs.ragas.io/en/stable/getstarted/rag_evaluation/" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install rapidfuzz" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "metadata": {}, + "outputs": [], + "source": [ + "from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, SemanticSimilarity, NoiseSensitivity\n", + "from ragas import EvaluationDataset\n", + "from ragas import evaluate\n", + "\n", + "evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=\"gpt-4o\"))\n", + "evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())\n", + "\n", + "metrics = [\n", + " LLMContextRecall(llm=evaluator_llm), \n", + " FactualCorrectness(llm=evaluator_llm), \n", + " Faithfulness(llm=evaluator_llm),\n", + " SemanticSimilarity(embeddings=evaluator_embeddings),\n", + " NoiseSensitivity(llm=evaluator_llm),\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": 66, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Map: 100%|██████████| 41/41 [00:00<00:00, 5500.99 examples/s]\n" + ] + } + ], + "source": [ + "# map predictions back to eval set\n", + "ds_k_1 = ds.map(lambda example: {\"response\": predictions[example[\"user_input\"]]['answer'], \"retrieved_contexts\": predictions[example['user_input']][\"context\"]})" + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Dataset({\n", + " features: ['user_input', 'reference_contexts', 'reference', 'synthesizer_name', 'response', 'retrieved_contexts'],\n", + " num_rows: 41\n", + "})" + ] + }, + "execution_count": 67, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ds_k_1" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Evaluating: 19%|█▉ | 39/205 [00:15<01:20, 2.07it/s]Exception raised in Job[151]: TypeError(ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'')\n", + "Evaluating: 71%|███████ | 145/205 [01:03<00:18, 3.23it/s]Exception raised in Job[186]: TypeError(ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'')\n", + "Evaluating: 92%|█████████▏| 188/205 [01:22<00:08, 2.09it/s]Exception raised in Job[131]: TypeError(ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'')\n", + "Evaluating: 100%|██████████| 205/205 [01:38<00:00, 2.08it/s]\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
user_inputretrieved_contextsreference_contextsresponsereferencecontext_recallfactual_correctnessfaithfulnesssemantic_similaritynoise_sensitivity_relevant
0How users trick chatbot to bypass restrictions?[By prompting the application to pretend to be...[By prompting the application to pretend to be...Users can trick chatbots to bypass restriction...Users trick chatbots to bypass restrictions by...1.00.730.6666670.9190340.166667
1What distnguishes 'promt injecton' frm 'jailbr...[one day, you suddenly find an ineligible cand...[Although “prompt injection” and “jailbreaking...Prompt injection aims to manipulate an applica...'Prompt injection' and 'jailbreaking' are dist...0.00.000.8333330.9045880.833333
2DOM-based attacks exploit vulnerabilities web ...[By prompting the application to pretend to be...[DOM-based attacksDOM-based attacks are an ext...Yes, DOM-based attacks exploit vulnerabilities...DOM-based attacks exploit vulnerabilities in w...1.00.891.0000000.9466430.666667
3What are the challenges and benefits of fine-t...[potential cost increases with scaling.Fine-tu...[Fine-tuning and serving pre-trained Large Lan...The challenges of fine-tuning Large Language M...The challenges of fine-tuning Large Language M...1.00.831.0000000.9934990.250000
4What Neptune and Transformers do in LLM fine-t...[potential cost increases with scaling.Fine-tu...[LLM Fine-Tuning and Model Selection Using Nep...Neptune and Transformers are used in LLM fine-...Neptune and Transformers are used in LLM fine-...1.00.000.0000000.9352440.000000
\n", + "
" + ], + "text/plain": [ + " user_input \\\n", + "0 How users trick chatbot to bypass restrictions? \n", + "1 What distnguishes 'promt injecton' frm 'jailbr... \n", + "2 DOM-based attacks exploit vulnerabilities web ... \n", + "3 What are the challenges and benefits of fine-t... \n", + "4 What Neptune and Transformers do in LLM fine-t... \n", + "\n", + " retrieved_contexts \\\n", + "0 [By prompting the application to pretend to be... \n", + "1 [one day, you suddenly find an ineligible cand... \n", + "2 [By prompting the application to pretend to be... \n", + "3 [potential cost increases with scaling.Fine-tu... \n", + "4 [potential cost increases with scaling.Fine-tu... \n", + "\n", + " reference_contexts \\\n", + "0 [By prompting the application to pretend to be... \n", + "1 [Although “prompt injection” and “jailbreaking... \n", + "2 [DOM-based attacksDOM-based attacks are an ext... \n", + "3 [Fine-tuning and serving pre-trained Large Lan... \n", + "4 [LLM Fine-Tuning and Model Selection Using Nep... \n", + "\n", + " response \\\n", + "0 Users can trick chatbots to bypass restriction... \n", + "1 Prompt injection aims to manipulate an applica... \n", + "2 Yes, DOM-based attacks exploit vulnerabilities... \n", + "3 The challenges of fine-tuning Large Language M... \n", + "4 Neptune and Transformers are used in LLM fine-... \n", + "\n", + " reference context_recall \\\n", + "0 Users trick chatbots to bypass restrictions by... 1.0 \n", + "1 'Prompt injection' and 'jailbreaking' are dist... 0.0 \n", + "2 DOM-based attacks exploit vulnerabilities in w... 1.0 \n", + "3 The challenges of fine-tuning Large Language M... 1.0 \n", + "4 Neptune and Transformers are used in LLM fine-... 1.0 \n", + "\n", + " factual_correctness faithfulness semantic_similarity \\\n", + "0 0.73 0.666667 0.919034 \n", + "1 0.00 0.833333 0.904588 \n", + "2 0.89 1.000000 0.946643 \n", + "3 0.83 1.000000 0.993499 \n", + "4 0.00 0.000000 0.935244 \n", + "\n", + " noise_sensitivity_relevant \n", + "0 0.166667 \n", + "1 0.833333 \n", + "2 0.666667 \n", + "3 0.250000 \n", + "4 0.000000 " + ] + }, + "execution_count": 68, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "results = evaluate(dataset=EvaluationDataset.from_hf_dataset(ds_k_1), metrics=metrics)\n", + "df = results.to_pandas()\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
user_inputretrieved_contextsreference_contextsresponsereferencecontext_recallfactual_correctnessfaithfulnesssemantic_similaritynoise_sensitivity_relevant
36How were users able to manipulate ChatGPT by m...[By prompting the application to pretend to be...[By prompting the application to pretend to be...Users manipulated ChatGPT by prompting it to a...Users were able to manipulate ChatGPT by promp...1.00.400.7500000.9295040.25
37What are the differences between online and ba...[models in production.The main difference is t...[LLM deployment: Serving, monitoring, and obse...I don't know.Online inference mode in LLM deployment involv...0.0NaN0.0000000.7399230.00
38What role does containerization play in the de...[language understanding  (e.g., using techniqu...[DeploymentDeploy models through pipelines, ty...Containerization plays a crucial role in the d...Containerization plays a role in the deploymen...1.00.800.1250000.9790620.00
39What r the key compnents n functons of a pipline?[pipelineImplement the inference componentTest...[pipeline]The key components of a pipeline typically inc...The key components and functions of a pipeline...0.00.290.1000000.9413950.00
40What are the key vulnerabilities in LLMs that ...[key guardrail methodsThe common vulnerabiliti...[LLM guardrails are small programs that valida...The key vulnerabilities in LLMs that guardrail...The key vulnerabilities in LLMs that LLM guard...0.00.170.4285710.9076960.00
\n", + "
" + ], + "text/plain": [ + " user_input \\\n", + "36 How were users able to manipulate ChatGPT by m... \n", + "37 What are the differences between online and ba... \n", + "38 What role does containerization play in the de... \n", + "39 What r the key compnents n functons of a pipline? \n", + "40 What are the key vulnerabilities in LLMs that ... \n", + "\n", + " retrieved_contexts \\\n", + "36 [By prompting the application to pretend to be... \n", + "37 [models in production.The main difference is t... \n", + "38 [language understanding  (e.g., using techniqu... \n", + "39 [pipelineImplement the inference componentTest... \n", + "40 [key guardrail methodsThe common vulnerabiliti... \n", + "\n", + " reference_contexts \\\n", + "36 [By prompting the application to pretend to be... \n", + "37 [LLM deployment: Serving, monitoring, and obse... \n", + "38 [DeploymentDeploy models through pipelines, ty... \n", + "39 [pipeline] \n", + "40 [LLM guardrails are small programs that valida... \n", + "\n", + " response \\\n", + "36 Users manipulated ChatGPT by prompting it to a... \n", + "37 I don't know. \n", + "38 Containerization plays a crucial role in the d... \n", + "39 The key components of a pipeline typically inc... \n", + "40 The key vulnerabilities in LLMs that guardrail... \n", + "\n", + " reference context_recall \\\n", + "36 Users were able to manipulate ChatGPT by promp... 1.0 \n", + "37 Online inference mode in LLM deployment involv... 0.0 \n", + "38 Containerization plays a role in the deploymen... 1.0 \n", + "39 The key components and functions of a pipeline... 0.0 \n", + "40 The key vulnerabilities in LLMs that LLM guard... 0.0 \n", + "\n", + " factual_correctness faithfulness semantic_similarity \\\n", + "36 0.40 0.750000 0.929504 \n", + "37 NaN 0.000000 0.739923 \n", + "38 0.80 0.125000 0.979062 \n", + "39 0.29 0.100000 0.941395 \n", + "40 0.17 0.428571 0.907696 \n", + "\n", + " noise_sensitivity_relevant \n", + "36 0.25 \n", + "37 0.00 \n", + "38 0.00 \n", + "39 0.00 \n", + "40 0.00 " + ] + }, + "execution_count": 69, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.tail()" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "metadata": {}, + "outputs": [], + "source": [ + "df.to_csv(\"eval_results.csv\", index=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "vscode": { + "languageId": "sql" + } + }, + "source": [ + "## Neptune Experiment Tracking" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "metadata": {}, + "outputs": [], + "source": [ + "os.environ[\"NEPTUNE_PROJECT\"] = \"your_workspace/your_project\"\n", + "os.environ[\"NEPTUNE_API_TOKEN\"] = \"your_neptune_API_token\"" + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "[neptune] [warning] NeptuneWarning: By default, these monitoring options are disabled in interactive sessions: 'capture_stdout', 'capture_stderr', 'capture_traceback', 'capture_hardware_metrics'. You can set them to 'True' when initializing the run and the monitoring will continue until you call run.stop() or the kernel stops. NOTE: To track the source files, pass their paths to the 'source_code' argument. For help, see: https://docs.neptune.ai/logging/source_code/\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[neptune] [info ] Neptune initialized. 
Open in the app: https://app.neptune.ai/community/building-RAG-using-LangChain/e/BUIL1-5\n" + ] + } + ], + "source": [ + "import neptune\n", + "\n", + "run = neptune.init_run()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Upload the eval file\n", + "\n", + "run[\"eval_data\"].upload(\"eval_results.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": 74, + "metadata": {}, + "outputs": [], + "source": [ + "# Track metrics for each row of the eval set, and the overall metric.\n", + "import pandas as pd\n", + "\n", + "def log_detailed_metrics(results_df: pd.DataFrame, run: neptune.Run):\n", + " for i, row in results_df.iterrows():\n", + " for m in metrics:\n", + " val = row[m.name]\n", + " run[f\"eval/q{i}/{m.name}\"].append(val)\n", + " \n", + " overall_metrics = results_df[[m.name for m in metrics]].mean(axis=0).to_dict()\n", + " for k, v in overall_metrics.items():\n", + " run[f\"eval/overall/{k}\"].append(v)" + ] + }, + { + "cell_type": "code", + "execution_count": 75, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'context_recall': 0.7317073170731707,\n", + " 'factual_correctness': 0.4694736842105263,\n", + " 'faithfulness': 0.615642487593707,\n", + " 'semantic_similarity': 0.9280948510865946,\n", + " 'noise_sensitivity_relevant': 0.29297856614929785}" + ] + }, + "execution_count": 75, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "overall_metrics = df[[m.name for m in metrics]].mean(axis=0).to_dict()\n", + "overall_metrics" + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "[neptune] [warning] NeptuneUnsupportedValue: WARNING: A value you're trying to log (`nan`) will be skipped because it's a non-standard float value that is not currently supported.\n" + ] + } + ], + "source": [ + "log_detailed_metrics(df, run)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Iterate on RAG system" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def concurrent_predict(chain: Runnable, dataset: Dataset, k: int = 1):\n", + " \"\"\"Uses the stuff documents chain, and thus needs context.\"\"\"\n", + " results = {}\n", + " with ThreadPoolExecutor(max_workers=5) as pool:\n", + " for result in pool.map(predict, chain, dataset['user_input'], dataset[f'context_{k}']):\n", + " results.update(result)\n", + " return results\n", + "\n", + "predictions = concurrent_predict(rag_chain, ds)\n", + "\n", + "len(predictions.keys())" + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Map: 100%|██████████| 41/41 [00:00<00:00, 4247.66 examples/s]\n", + "Evaluating: 37%|███▋ | 75/205 [00:46<01:49, 1.19it/s]Exception raised in Job[186]: TypeError(ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'')\n", + "Evaluating: 100%|██████████| 205/205 [02:35<00:00, 1.32it/s]\n", + "Map: 100%|██████████| 41/41 [00:00<00:00, 4820.50 examples/s]\n", + "Evaluating: 100%|██████████| 205/205 [03:32<00:00, 1.04s/it]\n" + ] + } + ], + "source": [ + "for k in [3, 5]:\n", + " retriever_k = vectorstore.as_retriever(search_kwargs={'k': k})\n", + " rag_chain_k = create_retrieval_chain(retriever_k, 
question_answer_chain)\n", + " predictions_k = concurrent_predict_retrieval_chain(rag_chain_k, ds)\n", + "\n", + " # map predictions back to eval set\n", + " ds_k = ds.map(lambda example: {\n", + " \"response\": predictions_k[example[\"user_input\"]]['answer'], \n", + " \"retrieved_contexts\": predictions_k[example['user_input']][\"context\"]\n", + " })\n", + "\n", + " results_k = evaluate(dataset=EvaluationDataset.from_hf_dataset(ds_k), metrics=metrics)\n", + " df_k = results_k.to_pandas()\n", + " df_k.to_csv(\"eval_results.csv\", index=False)\n", + "\n", + " log_detailed_metrics(df_k, run)" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[neptune] [info ] Shutting down background jobs, please wait a moment...\n", + "[neptune] [info ] Done!\n", + "[neptune] [info ] All 0 operations synced, thanks for waiting!\n", + "[neptune] [info ] Explore the metadata in the Neptune app: https://app.neptune.ai/community/building-RAG-using-LangChain/e/BUIL1-5/metadata\n" + ] + } + ], + "source": [ + "# Ends the run\n", + "run.stop()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "vllm", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/community-code/README.md b/community-code/README.md index 8798c9fa..43d0c0a4 100644 --- a/community-code/README.md +++ b/community-code/README.md @@ -11,6 +11,7 @@ | Title | Blog | Code | Neptune | --- | :---: | :---: | :---: | MLOps For Time Series Prediction: Binance Trading Tutorial | [![blog]](https://neptune.ai/blog/mlops-pipeline-for-time-series-prediction-tutorial) | [![github]](./binance-trading-neptune-master) | [![neptune]](https://app.neptune.ai/o/community/org/mlops-pipeline-for-time-series-prediction/runs/table?viewId=standard-view) +| How to build a RAG system using LangChain | [![blog]](https://neptune.ai/blog/building-rag-system-using-langchain) | [![github]](./HOW_TO_BUILD_A_RAG_SYSTEM_USING_LANGCHAIN/) | [![neptune]](https://app.neptune.ai/o/community/org/building-RAG-using-LangChain/runs/table?viewId=standard-view)