= Unstructured Data Extraction with LangChain4j: A Camel Quarkus example
:cq-example-description: An example that shows how to convert unstructured text data to structured Java objects with the help of a Large Language Model and LangChain4j

{cq-description}

TIP: Check the https://camel.apache.org/camel-quarkus/latest/first-steps.html[Camel Quarkus User guide] for prerequisites
and other general information.

Suppose the volume of https://en.wikipedia.org/wiki/Unstructured_data[unstructured data] grows at a fast pace in a given organization.
How could one transform those scattered gold particles into bullion that a bank would accept?
For instance, let's imagine an insurance company that records the transcripts of conversations between customers and its hotline.
There is probably a lot of valuable information that could be extracted from those conversation transcripts.
In this example, we'll convert those text conversations into Java objects that can then be used in the rest of the Camel route.

In order to achieve this extraction, we'll need a https://en.wikipedia.org/wiki/Large_language_model[Large Language Model (LLM)] that natively supports JSON output.
Here, we arbitrarily choose https://ollama.com/library/codellama[codellama] served through https://ollama.com/[Ollama].
In order to invoke the served model, we'll use the high-level LangChain4j APIs like https://docs.langchain4j.dev/tutorials/ai-services[AiServices].
As we are using the Quarkus runtime, we can leverage all the advantages of the https://docs.quarkiverse.io/quarkus-langchain4j/dev/index.html[Quarkus LangChain4j extension].
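
To give a first idea of what this looks like, here is a minimal sketch of such an AiService interface, assuming a hypothetical `CustomPojo` return type whose fields match the output shown later in this guide; the actual code lives in `src/main/java/org/acme/extraction/CustomPojoExtractionService.java` and may differ.

[source,java]
----
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;

// A minimal sketch: the Quarkus LangChain4j extension generates the
// implementation, renders the prompt template and maps the model's JSON
// answer onto the declared return type.
@RegisterAiService
public interface CustomPojoExtractionService {

    // {transcript} is a template placeholder filled with the method argument
    @UserMessage("Extract information about the customer from the transcript below: {transcript}")
    CustomPojo extractFromTranscript(String transcript);
}
----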

== Start the Large Language Model

Let's start a container to serve the LLM with Ollama:

[source,shell]
----
docker run -p 11434:11434 langchain4j/ollama-codellama:latest
----

After a moment, a log line like the one below should be output:

[source,shell]
----
time=2024-09-03T08:03:15.532Z level=INFO source=types.go:98 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="62.5 GiB" available="54.4 GiB"
----

That's it, the LLM is now ready to serve our data extraction requests.
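
The application also needs to know where the model is served. With the Quarkus LangChain4j Ollama extension, this is typically configured in `application.properties`; below is a hedged sketch, and the example's actual configuration may differ:

[source,properties]
----
# Point the extension at the container started above (assumed defaults)
quarkus.langchain4j.ollama.base-url=http://localhost:11434
quarkus.langchain4j.ollama.chat-model.model-id=codellama
# Inference on CPU can be slow, so allow a generous timeout
quarkus.langchain4j.ollama.timeout=60s
----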

== Package and run the application

You are now ready to package and run the application.

TIP: Find more details about the JVM mode and Native mode in the Package and run section of
https://camel.apache.org/camel-quarkus/latest/first-steps.html#_package_and_run_the_application[Camel Quarkus User guide]

=== JVM mode

[source,shell]
----
mvn clean package -DskipTests
java -jar target/quarkus-app/quarkus-run.jar
----

=== Extracting data from unstructured conversation

Let's copy the transcript files to the input folder named `target/transcripts/`. Copying to a temporary folder first and then moving ensures each file appears atomically in the input folder, for instance:

[source,shell]
----
cp -rf src/test/resources/transcripts/ target/transcripts-tmp
mv target/transcripts-tmp/*.json target/transcripts/
----

The Camel route should output a log as below:

[source,shell]
----
2024-09-03 10:14:34,757 INFO [route1] (Camel (camel-1) thread #1 - file://target/transcripts) A document has been received by the camel-quarkus-file extension: {
"id": 1,
"content": "Operator: Hello, how may I help you ?\nCustomer: Hello, I'm calling because I need to declare an accident on my main vehicle.\nOperator: Ok, can you please give me your name ?\nCustomer: My name is Sarah London.\nOperator: Could you please give me your birth date ?\nCustomer: 1986, July the 10th.\nOperator: Ok, I've got your contract and I'm happy to share with you that we'll be able to reimburse all expenses linked to this accident.\nCustomer: Oh great, many thanks."
}
----

In the log above, we can see that a JSON file containing transcript-related information has been consumed.
The conversation is present in the JSON field named `content`.
This content will be injected into the LLM prompt.
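
To illustrate the wiring, here is a minimal sketch of what such a Camel route could look like; the endpoint and bean names are assumptions, and the example's actual route may differ:

[source,java]
----
import org.apache.camel.builder.RouteBuilder;

// A sketch, not the example's exact route: consume the JSON files, keep only
// the conversation text and hand it over to the LangChain4j AiService.
public class DataExtractionRoute extends RouteBuilder {

    @Override
    public void configure() {
        from("file:target/transcripts")
            .log("A document has been received by the camel-quarkus-file extension: ${body}")
            // extract the conversation held in the JSON field `content`
            // (requires the camel-quarkus-jsonpath extension)
            .transform().jsonpath("$.content")
            // the AiService implementation is resolved from the registry
            .bean(CustomPojoExtractionService.class, "extractFromTranscript");
    }
}
----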

After a few seconds or minutes, depending on your hardware setup, the LLM provides an answer strictly conforming to the expected JSON schema.
It's now easy for LangChain4j to convert the returned JSON into a Java object.
Finally, we are provided with a Plain Old Java Object (POJO) holding the extracted data, like below:

[source,shell]
----
2024-09-03 10:14:51,284 INFO [org.acm.ext.CustomPojoStore] (Camel (camel-1) thread #1 - file://target/transcripts) An extracted POJO has been added to the store:
{
"customerSatisfied": "true",
"customerName": "Sarah London",
"customerBirthday": "10 July 1986",
"summary": "Declare an accident on main vehicle and receive reimbursement for expenses."
}
----
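
The structured result is held in a plain Java class; a hypothetical sketch matching the fields of the log above could look like the following, where the field types are assumptions:

[source,java]
----
// A hypothetical sketch; the real class in the example may declare
// different types or use accessors.
public class CustomPojo {
    public boolean customerSatisfied;
    public String customerName;
    public String customerBirthday;
    public String summary;
}
----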

See how the LLM demonstrates its capacity to:

* Extract a human-friendly sentiment, like `customerSatisfied`
* Perform https://nlp.stanford.edu/projects/coref.shtml[coreference resolution], as with `customerName`, which is deduced from information spread across the whole conversation
* Manage issues related to date formats, like the field `customerBirthday`
* Mix structured and unstructured data (semi-structured data), as with the field `summary`

As the cherry on the cake, all those pieces of information are computed during a single LLM inference.

At the end, the application should have extracted 3 POJOs.
For each of them, it could be interesting to compare the unstructured input text and the corresponding structured POJO.

More details can be found in the `src/main/java/org/acme/extraction/CustomPojoExtractionService.java` class.

=== Native mode

IMPORTANT: Native mode requires having GraalVM and other tools installed. Please check the Prerequisites section
of https://camel.apache.org/camel-quarkus/latest/first-steps.html#_prerequisites[Camel Quarkus User guide].

If the application is still running in JVM mode, please kill it, for instance with `CTRL+C`.

Now, to prepare a native executable using GraalVM, run the following commands:

[source,shell]
----
mvn clean package -DskipTests -Dnative
./target/*-runner
----

The compilation takes a bit longer. Beyond that, notice that the application behaves the same way:
you should be able to send the JSON files and see the extracted data exactly as in JVM mode.
The only difference compared to JVM mode is that the application is now packaged as a native executable.

== Feedback

Please report bugs and propose improvements via https://github.com/apache/camel-quarkus/issues[GitHub issues of Camel Quarkus] project.