Update LangChain Support #2187

Open
Skar0 opened this issue Oct 16, 2024 · 2 comments · May be fixed by #2188
Skar0 commented Oct 16, 2024

Feature request

The provided examples that leverage LangChain to create a representation all make use of langchain.chains.question_answering.load_qa_chain, and the implementation is not very transparent to the user, leading to inconsistencies and difficulties in understanding how to provide custom chains.

Motivation

Some of the issues in detail:

  • langchain.chains.question_answering.load_qa_chain is now deprecated and will be removed at some point.
  • The current LangChain integration is not very clear because
    • a prompt can be specified in the constructor of the LangChain class. However, this is not a prompt but rather a custom instruction that is passed to the provided chain through the question key.
    • in the case of langchain.chains.question_answering.load_qa_chain (which is the provided example), this question key is inserted into a larger, hard-coded (and not transparent to a casual user) prompt.
    • if a user wants to fully customize the instructions used to create the representation, it is best to avoid the langchain.chains.question_answering.load_qa_chain chain so as to sidestep this hard-coded prompt (this is currently not clearly documented). In addition, if that specific chain is not used, the question key can be confusing.
    • the approach to adding keywords to the prompt (inserting "[KEYWORDS]" into self.prompt and then performing some string manipulation) is confusing.
  • Some imports from LangChain are outdated (e.g. Documents, OpenAI).

Example of a workaround in the current implementation

With the current implementation, a user wanting to use a custom LangChain prompt in a custom LCEL chain, with keywords added to that prompt, would have to do something like the following (ignoring the fact that documents are passed as Document objects rather than formatted into a str):

from bertopic.representation import LangChain
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

custom_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "Custom instructions."),
        ("human", "Documents: {input_documents}, Keywords: {question}"),
    ]
)

# Any LCEL chain built around the prompt above, e.g. (model is illustrative):
chain = custom_prompt | ChatOpenAI(model="gpt-4o-mini")

# "[KEYWORDS]" is replaced by the topic keywords and passed via the `question` key
representation_model = LangChain(chain, prompt="[KEYWORDS]")

Related issues:

Your contribution

I propose several changes, which I have started working on in a branch (I opened a PR to make the diff easy to see).

  • Update the examples so that langchain.chains.question_answering.load_qa_chain is replaced by langchain.chains.combine_documents.stuff.create_stuff_documents_chain as recommended in the migration guide.
  • This new approach still takes care of formatting the Document objects into the prompt, but the prompt must now be specified explicitly (instead of the implicit, hard-coded prompt of langchain.chains.question_answering.load_qa_chain).
  • Remove the ability to provide a prompt in the constructor of LangChain as the prompt must now be explicitly created with the chain object.
  • Rename the keys for consistency to documents, keywords, and representation (note that langchain.chains.combine_documents.stuff.create_stuff_documents_chain does not have an output_text output key, so the representation key must be added explicitly).
  • Make it so that the keywords key is always provided to the chain (but it's up to the user to include a placeholder for it in their prompt).

Questions:

  • Should we provide a new example prompt to replace DEFAULT_PROMPT? For example
    EXAMPLE_PROMPT = "What are these documents about? {documents}. Here are some keywords about them: {keywords}. Please give a single label."
    However, it could only be used directly with langchain.chains.combine_documents.stuff.create_stuff_documents_chain, which takes care of formatting the documents (see the sketch below).
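
For illustration, here is a minimal sketch of how such an EXAMPLE_PROMPT could be wired into create_stuff_documents_chain (the model name is illustrative, not part of the proposal):

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain.chains.combine_documents import create_stuff_documents_chain

EXAMPLE_PROMPT = "What are these documents about? {documents}. Here are some keywords about them: {keywords}. Please give a single label."

# create_stuff_documents_chain formats the Document objects into {documents};
# a plain LCEL chain would receive the raw Document objects instead.
chain = create_stuff_documents_chain(
    ChatOpenAI(model="gpt-4o-mini"),
    ChatPromptTemplate.from_template(EXAMPLE_PROMPT),
    document_variable_name="documents",
)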
Skar0 linked a pull request Oct 16, 2024 that will close this issue
MaartenGr (Owner) commented

Awesome, thank you for the extensive description! I had hoped that LangChain would be stable for a little while longer but unfortunately that does not seem to be the case.

That said, if it's deprecated we indeed should be replacing this functionality. Let me address some things here before we continue in the PR:

the approach to add keywords in the prompt (by adding "[KEYWORDS]" in self.prompt and then performing some string manipulation) is confusing.

This behavior is used throughout all LLMs integrated in BERTopic, so if we change it here it should be changed everywhere. That said, I'm actually a big fan of using tags like "[KEYWORDS]" and "[DOCUMENTS]" to indicate where in the prompt certain aspects should go. This is part of a nice user experience and I have no intention of changing that at the moment.
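
For reference, this is roughly what the tag-based experience looks like in the other LLM representations (a sketch based on BERTopic's OpenAI representation; the exact signature may differ between versions):

import openai
from bertopic.representation import OpenAI

client = openai.OpenAI(api_key="sk-...")

# [KEYWORDS] and [DOCUMENTS] mark where the topic keywords and documents go;
# BERTopic fills them in via string replacement before calling the model.
prompt = """I have a topic described by the following keywords: [KEYWORDS]
The topic contains the following documents: [DOCUMENTS]
Based on this, give a short topic label."""

representation_model = OpenAI(client, model="gpt-4o-mini", prompt=prompt, chat=True)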

Other than that (and looking at the PR), I'm wondering whether the changes make the usability for most users more complex. Take a look at this piece of the documentation you shared:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain.chains.combine_documents import create_stuff_documents_chain

chat_model = ChatOpenAI(model=..., api_key=...)

prompt = ChatPromptTemplate.from_template("What are these documents about? {documents}. Please give a single label.")

chain = RunnablePassthrough.assign(representation=create_stuff_documents_chain(chat_model, prompt, document_variable_name="documents"))

That's quite a bit more involved than what it originally was:

from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
chain = load_qa_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key), chain_type="stuff")

Now that the original approach needs some changes on the backend (as you nicely laid out in this issue), I'm wondering whether we can simplify accessing LangChain within BERTopic a bit more to make it easier for users. I generally prefer additional representations to take about 4 lines of code to run a basic LLM and nothing more.

Skar0 (Author) commented Oct 22, 2024

Hi,

Thanks for taking the time to reply 😊

That said, I'm actually a big fan of using tags like "[KEYWORDS]" and "[DOCUMENTS]" to indicate where in the prompt certain aspects should go. This is part of a nice user experience and I have no intention of changing that at the moment.

I understand this, and I agree that it is a nice approach to formatting prompts when using an LLM (e.g. with OpenAI). However, in the case of LangChain, there is already a standard built-in way of formatting prompts using prompt templates.

# Example: prompt with a `topic` placeholder replaced at runtime through the input of the chain
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate

chat_model = ChatOpenAI(model=..., api_key=...)

prompt_template = PromptTemplate.from_template("Tell me a joke about {topic}")

chain = prompt_template | chat_model

chain.invoke({"topic": "cats"})

The current implementation uses a hybrid approach to formatting the prompt, using both LangChain prompt templates and string manipulation. The sequence looks like this (I'll assume that langchain.chains.question_answering.load_qa_chain is used as it's the documented approach).

  1. The prompt (which is hard-coded) contains two placeholders: one for the documents (named context) and one for the instruction provided to the LangChain representation object (named question). Here it is for convenience:

    from langchain_core.prompts.chat import (
        ChatPromptTemplate,
        HumanMessagePromptTemplate,
        SystemMessagePromptTemplate,
    )
    
    system_template = """Use the following pieces of context to answer the user's question. 
    If you don't know the answer, just say that you don't know, don't try to make up an answer.
    ----------------
    {context}"""
    messages = [
        SystemMessagePromptTemplate.from_template(system_template),
        HumanMessagePromptTemplate.from_template("{question}"),
    ]
    CHAT_PROMPT = ChatPromptTemplate.from_messages(messages)
  2. Even though the placeholder in the prompt is named context (because the document_variable_name is set to context), the chain expects the document objects to be passed through a key named input_documents. This explains why input_documents and question are used as input keys to the provided chain in the LangChain representation object.

  3. Given the above, the placement of documents into the prompt is performed using a LangChain prompt template placeholder (namely context). However, if keywords need to be added to the prompt, they can currently only be provided through the question prompt template placeholder. To do so, one has to provide a prompt to the LangChain representation object, for example

    prompt="[KEYWORDS]"

    in which [KEYWORDS] is replaced by the actual keywords through string manipulation, and that formatted string is then passed as an input to the chain through the question key.

  4. The output of the chain is contained in a key named output_text due to hard-coding in the chain object used (see the sketch after this list).
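
Putting these four steps together, here is a minimal sketch of how the current integration effectively invokes a load_qa_chain-based chain (using the modern langchain_openai import in place of the deprecated langchain.llms one):

from langchain.chains.question_answering import load_qa_chain
from langchain_core.documents import Document
from langchain_openai import OpenAI

chain = load_qa_chain(OpenAI(temperature=0), chain_type="stuff")

docs = [Document(page_content="A text about cats.")]
# "[KEYWORDS]" in the user-supplied prompt is replaced by the topic keywords...
question = "[KEYWORDS]".replace("[KEYWORDS]", "cats, felines")

# ...documents are passed through `input_documents` (rendered into the
# hard-coded {context} placeholder) and the instruction through `question`:
result = chain.invoke({"input_documents": docs, "question": question})
label = result["output_text"]  # the output sits under the hard-coded key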

I think these steps illustrate how the complex internal workings of that specific deprecated LangChain approach, combined with the mix of LangChain prompt templates and string manipulation, make things very confusing for a user wanting to dig deeper into what is feasible in BERTopic using LangChain (and make it hard to work with custom chains without reading the source code of the LangChain representation object to understand the expected input and output keys).

I'm wondering whether the changes make the usability for most users more complex.

Now what it originally was needs some changes on the backend (as you nicely shared in this issue), I'm wondering whether we can simplify the accessing LangChain within BERTopic a bit more to make it simpler for users.

To your point, I can modify the approach to make it simpler in general:

  1. Remove the output_text key (which I had renamed representation), which removes the need for RunnablePassthrough to create an output key (which create_stuff_documents_chain doesn't have by default).
  2. Work with a LangChain prompt template, but name the keys so that they are similar to what is used in other representations (meaning the only difference between a LangChain representation prompt and an LLM representation prompt will be the brackets used: curly vs. square).
from bertopic.representation import LangChain
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

chain = load_qa_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key), chain_type="stuff")
representation_model = LangChain(chain)

becomes

from bertopic.representation import LangChain
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_openai import OpenAI

prompt = ChatPromptTemplate.from_template("What are these documents about? {DOCUMENTS} Here are keywords related to them: {KEYWORDS}.")

chain = create_stuff_documents_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key), prompt, document_variable_name="DOCUMENTS")
representation_model = LangChain(chain)

Note that we can define a prompt in the representation, as was done before (but this time as a LangChain prompt template), and the code would become

from bertopic.representation import LangChain, DEFAULT_PROMPT
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_openai import OpenAI

chain = create_stuff_documents_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key), DEFAULT_PROMPT, document_variable_name="DOCUMENTS")
representation_model = LangChain(chain)

I made the necessary changes in the PR; let me know what you think! (I'll still need to tinker a bit to provide a good default prompt and to make sure that this allows fancier chains to work, but at least the basic example seems to work.)
