= Unstructured Data Extraction with LangChain4j: A Camel Quarkus example
:cq-example-description: An example that shows how to convert unstructured text data to structured Java objects with the help of a Large Language Model and LangChain4j

{cq-description}

TIP: Check the https://camel.apache.org/camel-quarkus/latest/first-steps.html[Camel Quarkus User guide] for prerequisites
and other general information.

Suppose the volume of https://en.wikipedia.org/wiki/Unstructured_data[unstructured data] grows at a fast pace in a given organization.
How could one transform those scattered gold particles into bullion that a bank would accept?
For instance, let's imagine an insurance company that records the transcripts of conversations between customers and its hotline.
There is probably a lot of valuable information that could be extracted from those conversation transcripts.
In this example, we'll convert those text conversations into Java objects that can then be used in the rest of the Camel route.

In order to achieve this extraction, we'll need a https://en.wikipedia.org/wiki/Large_language_model[Large Language Model (LLM)] that natively supports JSON output.
Here, we arbitrarily choose https://ollama.com/library/codellama[codellama] served through https://ollama.com/[Ollama].
In order to invoke the served model, we'll use the high-level LangChain4j APIs like https://docs.langchain4j.dev/tutorials/ai-services[AiServices].
As we are using the Quarkus runtime, we can leverage all the advantages of the https://docs.quarkiverse.io/quarkus-langchain4j/dev/index.html[Quarkus LangChain4j extension].
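
To give a first idea of what this looks like, here is a minimal sketch of such an AiService interface, assuming a hypothetical `CustomPojo` return type whose fields match the output shown later in this guide; the actual code lives in `src/main/java/org/acme/extraction/CustomPojoExtractionService.java` and may differ.

[source,java]
----
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;

// A minimal sketch: the Quarkus LangChain4j extension generates the
// implementation, renders the prompt template and maps the model's JSON
// answer onto the declared return type.
@RegisterAiService
public interface CustomPojoExtractionService {

    // {transcript} is a template placeholder filled with the method argument
    @UserMessage("Extract information about the customer from the transcript below: {transcript}")
    CustomPojo extractFromTranscript(String transcript);
}
----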

== Start the Large Language Model

Let's start a container to serve the LLM with Ollama:

[source,shell]
----
docker run -p 11434:11434 langchain4j/ollama-codellama:latest
----

After a moment, a log line like the one below should be output:

[source,shell]
----
time=2024-09-03T08:03:15.532Z level=INFO source=types.go:98 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="62.5 GiB" available="54.4 GiB"
----

That's it, the LLM is now ready to serve our data extraction requests.
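
The application also needs to know where the model is served. With the Quarkus LangChain4j Ollama extension, this is typically configured in `application.properties`; below is a hedged sketch, and the example's actual configuration may differ:

[source,properties]
----
# Point the extension at the container started above (assumed defaults)
quarkus.langchain4j.ollama.base-url=http://localhost:11434
quarkus.langchain4j.ollama.chat-model.model-id=codellama
# Inference on CPU can be slow, so allow a generous timeout
quarkus.langchain4j.ollama.timeout=60s
----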

== Package and run the application

You are now ready to package and run the application.

TIP: Find more details about the JVM mode and Native mode in the Package and run section of
https://camel.apache.org/camel-quarkus/latest/first-steps.html#_package_and_run_the_application[Camel Quarkus User guide]

=== JVM mode

[source,shell]
----
mvn clean package -DskipTests
java -jar target/quarkus-app/quarkus-run.jar
----

=== Extracting data from unstructured conversation

Let's copy the transcript files to the input folder named `target/transcripts/`. Copying to a temporary folder first and then moving ensures each file appears atomically in the input folder, for instance:

[source,shell]
----
cp -rf src/test/resources/transcripts/ target/transcripts-tmp
mv target/transcripts-tmp/*.json target/transcripts/
----

The Camel route should output a log as below:

[source,shell]
----
2024-09-03 10:14:34,757 INFO [route1] (Camel (camel-1) thread #1 - file://target/transcripts) A document has been received by the camel-quarkus-file extension: {
"id": 1,
"content": "Operator: Hello, how may I help you ?\nCustomer: Hello, I'm calling because I need to declare an accident on my main vehicle.\nOperator: Ok, can you please give me your name ?\nCustomer: My name is Sarah London.\nOperator: Could you please give me your birth date ?\nCustomer: 1986, July the 10th.\nOperator: Ok, I've got your contract and I'm happy to share with you that we'll be able to reimburse all expenses linked to this accident.\nCustomer: Oh great, many thanks."
}
----

In the log above, we can see that a JSON file containing transcript-related information has been consumed.
The conversation is present in the JSON field named `content`.
This content will be injected into the LLM prompt.
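
To illustrate the wiring, here is a minimal sketch of what such a Camel route could look like; the endpoint and bean names are assumptions, and the example's actual route may differ:

[source,java]
----
import org.apache.camel.builder.RouteBuilder;

// A sketch, not the example's exact route: consume the JSON files, keep only
// the conversation text and hand it over to the LangChain4j AiService.
public class DataExtractionRoute extends RouteBuilder {

    @Override
    public void configure() {
        from("file:target/transcripts")
            .log("A document has been received by the camel-quarkus-file extension: ${body}")
            // extract the conversation held in the JSON field `content`
            // (requires the camel-quarkus-jsonpath extension)
            .transform().jsonpath("$.content")
            // the AiService implementation is resolved from the registry
            .bean(CustomPojoExtractionService.class, "extractFromTranscript");
    }
}
----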

After a few seconds or minutes, depending on your hardware setup, the LLM provides an answer strictly conforming to the expected JSON schema.
It's now easy for LangChain4j to convert the returned JSON into a Java object.
Finally, we are provided with a Plain Old Java Object (POJO) holding the extracted data, like below:

[source,shell]
----
2024-09-03 10:14:51,284 INFO [org.acm.ext.CustomPojoStore] (Camel (camel-1) thread #1 - file://target/transcripts) An extracted POJO has been added to the store:
{
"customerSatisfied": "true",
"customerName": "Sarah London",
"customerBirthday": "10 July 1986",
"summary": "Declare an accident on main vehicle and receive reimbursement for expenses."
}
----
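
The structured result is held in a plain Java class; a hypothetical sketch matching the fields of the log above could look like the following, where the field types are assumptions:

[source,java]
----
// A hypothetical sketch; the real class in the example may declare
// different types or use accessors.
public class CustomPojo {
    public boolean customerSatisfied;
    public String customerName;
    public String customerBirthday;
    public String summary;
}
----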

See how the LLM demonstrates its capacity to:

* Extract a human-friendly sentiment, like `customerSatisfied`
* Perform https://nlp.stanford.edu/projects/coref.shtml[coreference resolution], as with `customerName`, which is deduced from information spread across the whole conversation
* Manage issues related to date formats, like the field `customerBirthday`
* Mix structured and unstructured data (semi-structured data), as with the field `summary`

As the cherry on the cake, all those pieces of information are computed during a single LLM inference.

At the end, the application should have extracted 3 POJOs.
For each of them, it could be interesting to compare the unstructured input text and the corresponding structured POJO.

More details can be found in the `src/main/java/org/acme/extraction/CustomPojoExtractionService.java` class.

=== Native mode

IMPORTANT: Native mode requires having GraalVM and other tools installed. Please check the Prerequisites section
of https://camel.apache.org/camel-quarkus/latest/first-steps.html#_prerequisites[Camel Quarkus User guide].

If the application is still running in JVM mode, please kill it, for instance with `CTRL+C`.

Now, to prepare a native executable using GraalVM, run the following commands:

[source,shell]
----
mvn clean package -DskipTests -Dnative
./target/*-runner
----

The compilation takes a bit longer. Beyond that, notice that the application behaves the same way:
you should be able to send the JSON files and see the extracted data exactly as in JVM mode.
The only difference compared to JVM mode is that the application is now packaged as a native executable.

== Feedback

Please report bugs and propose improvements via https://github.com/apache/camel-quarkus/issues[GitHub issues of Camel Quarkus] project.