refactoring, docu
Issue #214
rsoika committed Sep 29, 2024
1 parent 7e4c9f6 commit 2f96ae8
Showing 14 changed files with 7,815 additions and 56 deletions.
74 changes: 73 additions & 1 deletion imixs-archive-documents/README.md
@@ -189,9 +189,81 @@ For more details about the OCR configuration see the [Imixs-Archive-OCR project]

All extracted textual information from attached documents is also searchable by the Imixs search index. The class *org.imixs.workflow.documents.DocumentIndexer* adds the ocr content for each file attachment into the search index.



## The e-Invoice Adapter

The Adapter class `org.imixs.workflow.documents.EInvoiceAdapter` can detect and extract content from e-invoice documents in different formats.

The adapter stores its detection result in a new item named `einvoice.type`, which holds the detected e-invoice format, e.g.:

- Factur-X/ZUGFeRD 2.0

The Adapter can be configured in a BPMN event to extract e-invoice data fields using the following entity definition:

```xml
<e-invoice name="ENTITY">
<name>Item Name</name>
<type>Item Type (text|date|double)</type>
<xpath>xPath expression</xpath>
</e-invoice>
```

**Example e-invoice configuration:**

```xml
<e-invoice name="ENTITY">
<name>invoice.number</name>
<xpath>//rsm:CrossIndustryInvoice/rsm:ExchangedDocument/ram:ID</xpath>
</e-invoice>
<e-invoice name="ENTITY">
<name>invoice.date</name>
<type>date</type>
<xpath>//rsm:ExchangedDocument/ram:IssueDateTime/udt:DateTimeString/text()</xpath>
</e-invoice>
<e-invoice name="ENTITY">
<name>invoice.total</name>
<type>double</type>
<xpath>//ram:SpecifiedTradeSettlementHeaderMonetarySummation/ram:GrandTotalAmount</xpath>
</e-invoice>
<e-invoice name="ENTITY">
<name>cdtr.name</name>
<xpath>//ram:ApplicableHeaderTradeAgreement/ram:SellerTradeParty/ram:Name/text()</xpath>
</e-invoice>
```
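
The XPath expressions above only resolve if the prefixes `rsm`, `ram` and `udt` are bound to the UN/CEFACT namespaces of the invoice XML. The following is a minimal, standalone sketch of that lookup using only the JDK XML APIs. The class name `EInvoiceXPathSketch`, the method `evaluate` and the shortened sample invoice are assumptions made for illustration and are not part of the Imixs API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Illustrative sketch only; not the EInvoiceAdapter implementation.
class EInvoiceXPathSketch {

    static final String RSM = "urn:un:unece:uncefact:data:standard:CrossIndustryInvoice:100";
    static final String RAM = "urn:un:unece:uncefact:data:standard:ReusableAggregateBusinessInformationEntity:100";
    static final String UDT = "urn:un:unece:uncefact:data:standard:UnqualifiedDataType:100";

    // Prefix-to-URI bindings used by the XPath expressions.
    static final Map<String, String> NAMESPACES = new HashMap<>();
    static {
        NAMESPACES.put("rsm", RSM);
        NAMESPACES.put("ram", RAM);
        NAMESPACES.put("udt", UDT);
    }

    // Hypothetical, heavily shortened invoice used only for this example.
    static final String SAMPLE_XML = "<rsm:CrossIndustryInvoice xmlns:rsm=\"" + RSM + "\" xmlns:ram=\"" + RAM
            + "\" xmlns:udt=\"" + UDT + "\">"
            + "<rsm:ExchangedDocument>"
            + "<ram:ID>R-2024-001</ram:ID>"
            + "<ram:IssueDateTime><udt:DateTimeString format=\"102\">20240929</udt:DateTimeString></ram:IssueDateTime>"
            + "</rsm:ExchangedDocument>"
            + "<rsm:SupplyChainTradeTransaction>"
            + "<ram:ApplicableHeaderTradeAgreement><ram:SellerTradeParty><ram:Name>ACME GmbH</ram:Name>"
            + "</ram:SellerTradeParty></ram:ApplicableHeaderTradeAgreement>"
            + "<ram:ApplicableHeaderTradeSettlement><ram:SpecifiedTradeSettlementHeaderMonetarySummation>"
            + "<ram:GrandTotalAmount>119.00</ram:GrandTotalAmount>"
            + "</ram:SpecifiedTradeSettlementHeaderMonetarySummation></ram:ApplicableHeaderTradeSettlement>"
            + "</rsm:SupplyChainTradeTransaction>"
            + "</rsm:CrossIndustryInvoice>";

    // Returns the text content of the first node matching the expression, or null if nothing matches.
    static String evaluate(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required, otherwise prefixed steps never match
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return NAMESPACES.getOrDefault(prefix, XMLConstants.NULL_NS_URI);
                }

                public String getPrefix(String uri) {
                    return null; // not needed for evaluation
                }

                public Iterator<String> getPrefixes(String uri) {
                    return null; // not needed for evaluation
                }
            });
            Node node = (Node) xpath.evaluate(expression, doc, XPathConstants.NODE);
            return node != null ? node.getTextContent() : null;
        } catch (Exception e) {
            throw new IllegalStateException("Failed to evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate(SAMPLE_XML, "//rsm:CrossIndustryInvoice/rsm:ExchangedDocument/ram:ID"));
        System.out.println(evaluate(SAMPLE_XML,
                "//ram:SpecifiedTradeSettlementHeaderMonetarySummation/ram:GrandTotalAmount"));
    }
}
```

An expression ending in `/text()` selects the text node instead of the element. For simple leaf elements such as the ones above, both forms yield the same string.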








If the type is not set, the item value is treated as a String. Other possible types are 'double' and 'date'.
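
The type handling can be pictured as a small conversion step applied to the extracted text. The sketch below assumes the `yyyyMMdd` date pattern used by the adapter code in this commit. The class `ItemTypeSketch`, the method `convert`, and the choice to return `null` for missing or unparsable values are assumptions made for illustration and may differ from the adapter's actual behavior.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Illustrative sketch only; not the EInvoiceAdapter implementation.
class ItemTypeSketch {

    // Converts the raw XPath result into a Date, Double or String depending on the item type.
    static Object convert(String rawValue, String itemType) {
        if (rawValue == null) {
            return null; // nothing was extracted
        }
        if ("date".equalsIgnoreCase(itemType)) {
            SimpleDateFormat formatter = new SimpleDateFormat("yyyyMMdd");
            formatter.setLenient(false); // reject values such as 20241399
            try {
                return formatter.parse(rawValue.trim());
            } catch (ParseException e) {
                return null; // assumption: unparsable dates are skipped
            }
        }
        if ("double".equalsIgnoreCase(itemType)) {
            try {
                return Double.valueOf(rawValue.trim());
            } catch (NumberFormatException e) {
                return null; // assumption: unparsable numbers are skipped
            }
        }
        // default: no type or 'text' keeps the value as a String
        return rawValue;
    }

    public static void main(String[] args) {
        Object date = convert("20240929", "date");
        System.out.println(date instanceof Date);
        System.out.println(convert("119.00", "double"));
        System.out.println(convert("R-2024-001", ""));
    }
}
```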

If the document is not an e-invoice, no items are set and the `einvoice.type` field is not set either.



## The e-Invoice AutoAdapter


The Adapter class `org.imixs.workflow.documents.EInvoiceAutoAdapter` is an extension of the EInvoiceAdapter and can be used to resolve all relevant e-invoice fields automatically. The following fields are supported:

| Item | Type | Description | XPath
| ----------------- | --------- | ------------------------- | -------------------------------------------------------
| invoice.number | text | Invoice number | //rsm:CrossIndustryInvoice/rsm:ExchangedDocument/ram:ID
| invoice.date | date | Invoice date | //rsm:ExchangedDocument/ram:IssueDateTime/udt:DateTimeString/text()
| invoice.total | double | Invoice grand total amount| //ram:SpecifiedTradeSettlementHeaderMonetarySummation/ram:GrandTotalAmount
| cdtr.name | text | Creditor name | //ram:ApplicableHeaderTradeAgreement/ram:SellerTradeParty/ram:Name/text()
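
The table corresponds to the entity definitions that would otherwise be written by hand for the `EInvoiceAdapter`. As a rough sketch of that relationship, the field table can be held as data and rendered into equivalent `<e-invoice>` definitions. The class `AutoFieldTableSketch` and its members are hypothetical and say nothing about how `EInvoiceAutoAdapter` is implemented internally.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the EInvoiceAutoAdapter implementation.
class AutoFieldTableSketch {

    // One row of the field table: item name, item type and XPath expression.
    static final class Field {
        final String name;
        final String type;
        final String xpath;

        Field(String name, String type, String xpath) {
            this.name = name;
            this.type = type;
            this.xpath = xpath;
        }
    }

    // The four fields listed in the table above.
    static List<Field> defaultFields() {
        List<Field> fields = new ArrayList<>();
        fields.add(new Field("invoice.number", "text",
                "//rsm:CrossIndustryInvoice/rsm:ExchangedDocument/ram:ID"));
        fields.add(new Field("invoice.date", "date",
                "//rsm:ExchangedDocument/ram:IssueDateTime/udt:DateTimeString/text()"));
        fields.add(new Field("invoice.total", "double",
                "//ram:SpecifiedTradeSettlementHeaderMonetarySummation/ram:GrandTotalAmount"));
        fields.add(new Field("cdtr.name", "text",
                "//ram:ApplicableHeaderTradeAgreement/ram:SellerTradeParty/ram:Name/text()"));
        return fields;
    }

    // Renders the table as entity definitions; the type element is omitted for text items.
    static String toEntityDefinitions(List<Field> fields) {
        StringBuilder sb = new StringBuilder();
        for (Field field : fields) {
            sb.append("<e-invoice name=\"ENTITY\">\n");
            sb.append("<name>").append(field.name).append("</name>\n");
            if (!"text".equals(field.type)) {
                sb.append("<type>").append(field.type).append("</type>\n");
            }
            sb.append("<xpath>").append(field.xpath).append("</xpath>\n");
            sb.append("</e-invoice>\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(toEntityDefinitions(defaultFields()));
    }
}
```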





## The PDF XML Plugin

The plugin class "*org.imixs.workflow.documents.parser.PDFXMLExtractorPlugin*" can be used to extract embedded XML Data from a PDF document and convert the data into a Imixs Workitem. For example the _ZUGFeRD_ defines a standard XML document for invoices.
The plugin class `org.imixs.workflow.documents.EInvoiceAdapter` can be used to extract embedded XML data from a PDF document and convert the data into an Imixs Workitem. For example, the _ZUGFeRD_ standard defines an XML document format for invoices.

The plugin can be activated by the BPMN Model with the following result definition:

19 changes: 17 additions & 2 deletions imixs-archive-documents/pom.xml
@@ -1,4 +1,6 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.imixs.workflow</groupId>
@@ -7,6 +9,19 @@
</parent>
<artifactId>imixs-archive-documents</artifactId>

<build>
<testResources>
<!--
<testResource>
<directory>${basedir}/../reports</directory>
</testResource>
-->
<testResource>
<directory>${basedir}/src/test/resources</directory>
</testResource>
</testResources>

</build>
<dependencies>
<!-- Imixs-Workflow dependencies -->
<dependency>
@@ -17,7 +32,7 @@
<groupId>org.imixs.workflow</groupId>
<artifactId>imixs-workflow-engine</artifactId>
</dependency>

<dependency>
<groupId>org.imixs.workflow</groupId>
<artifactId>imixs-archive-api</artifactId>
@@ -22,9 +22,11 @@
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.apache.pdfbox.pdmodel.PDDocument;
@@ -44,6 +46,7 @@
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.SAXException;

import jakarta.inject.Inject;

@@ -64,11 +67,11 @@
* <pre>
* {@code
<e-invoice name="ENTITY">
<name>invoice.date</name>
<type>date</type>
<e-invoice name="ENTITY">
<name>invoice.date</name>
<type>date</type>
<xpath>//rsm:CrossInvoice/ram:ID</xpath>
</e-invoice>
</e-invoice>
* }
* </pre>
@@ -98,7 +101,7 @@ public class EInvoiceAdapter implements SignalAdapter {
private static final Pattern XML_PATTERN = Pattern.compile("\\.[xX][mM][lL]$");
private static final Pattern ZIP_PATTERN = Pattern.compile("\\.[zZ][iI][pP]$");

private static final Map<String, String> NAMESPACES = new HashMap<>();
public static final Map<String, String> NAMESPACES = new HashMap<>();

static {
NAMESPACES.put("rsm", "urn:un:unece:uncefact:data:standard:CrossIndustryInvoice:100");
@@ -110,6 +113,9 @@ public class EInvoiceAdapter implements SignalAdapter {
public static String PROCESSING_ERROR = "PROCESSING_ERROR";
public static final String CONFIG_ERROR = "CONFIG_ERROR";

private XPath xpath = null;
private Document xmlDoc = null;

@Inject
DocumentService documentService;

@@ -234,7 +240,7 @@ private void storeXMLContent(FileData fileData, byte[] xmlData) {
* @param xmlData
*/
@SuppressWarnings("unchecked")
private byte[] readXMLContent(FileData fileData) {
public byte[] readXMLContent(FileData fileData) {
// read the stored XML content....
List<Object> list = (List<Object>) fileData.getAttribute(FILE_ATTRIBUTE_XML);
if (list != null) {
@@ -439,13 +445,73 @@ private void readEInvoiceContent(FileData eInvoiceFileData, List<ItemCollection>
byte[] xmlData = readXMLContent(eInvoiceFileData);

try {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(new ByteArrayInputStream(xmlData));

createXMLDoc(xmlData);

// Map<String, XPathExpression> compiledExpressions = new HashMap<>();

// extract the itemName and the expression from each itemDefinition....
for (ItemCollection entityDef : entityDefinitions) {

if (entityDef.getItemValueString("name").isEmpty()
|| entityDef.getItemValueString("xpath").isEmpty()) {
logger.warning("Invalid entity definition: " + entityDef);
continue;
}
String itemName = entityDef.getItemValueString("name");
String xPathExpr = entityDef.getItemValueString("xpath");
String itemType = entityDef.getItemValueString("type");

readItem(workitem, xPathExpr, itemType, itemName);
// XPathExpression expr = compiledExpressions.computeIfAbsent(xPathExpr,
// k -> {
// try {
// return xpath.compile(k);
// } catch (Exception e) {
// logger.warning("Error compiling XPath expression: " + k + " - " +
// e.getMessage());
// return null;
// }
// });
// // extract the xpath value and update the workitem...
// if (expr != null) {
// Node node = (Node) expr.evaluate(xmlDoc, XPathConstants.NODE);
// String itemValue = node != null ? node.getTextContent() : null;
// // test if we have a type....

// if ("date".equalsIgnoreCase(itemType)) {
// SimpleDateFormat formatter = new SimpleDateFormat("yyyyMMdd");
// try {
// Date invoiceDate = formatter.parse(itemValue);
// workitem.setItemValue(itemName, invoiceDate);
// } catch (ParseException e) {
// e.printStackTrace();
// }
// } else if ("double".equalsIgnoreCase(itemType)) {
// workitem.setItemValue(itemName, Double.parseDouble(itemValue));
// } else {
// // default...
// workitem.setItemValue(itemName, itemValue);
// }
// }
}
} catch (Exception e) {
logger.warning("Error analyzing XML content: " + e.getMessage());
}
}

/**
* Lazily creates the XPath instance used to resolve XPath expressions.
*
* The instance is cached in the {@code xpath} field and reused on subsequent
* calls.
*/
private void createXPath() {
if (xpath == null) {

XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
xpath = xPathfactory.newXPath();

xpath.setNamespaceContext(new NamespaceContext() {
public String getNamespaceURI(String prefix) {
@@ -460,53 +526,74 @@ public Iterator<String> getPrefixes(String uri) {
return null;
}
});
}
}

Map<String, XPathExpression> compiledExpressions = new HashMap<>();
/**
* Creates the XML document instance from the given XML content
*
* @param xmlData
* @throws PluginException
*/
public void createXMLDoc(byte[] xmlData) throws PluginException {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
DocumentBuilder builder;
try {
builder = factory.newDocumentBuilder();
xmlDoc = builder.parse(new ByteArrayInputStream(xmlData));
} catch (ParserConfigurationException | SAXException | IOException e) {
throw new PluginException(EInvoiceAdapter.class.getSimpleName(), PARSING_EXCEPTION,
"Failed to parse XML Content: " + e.getMessage(), e);
}

// extract the itemName and the expression from each itemDefinition....
for (ItemCollection entityDef : entityDefinitions) {
}

if (entityDef.getItemValueString("name").isEmpty()
|| entityDef.getItemValueString("xpath").isEmpty()) {
logger.warning("Invalid entity definition: " + entityDef);
continue;
}
String itemName = entityDef.getItemValueString("name");
String xPathExpr = entityDef.getItemValueString("xpath");
String itemType = entityDef.getItemValueString("type");
XPathExpression expr = compiledExpressions.computeIfAbsent(xPathExpr,
k -> {
try {
return xpath.compile(k);
} catch (Exception e) {
logger.warning("Error compiling XPath expression: " + k + " - " + e.getMessage());
return null;
}
});
// extract the xpath value and update the workitem...
if (expr != null) {
Node node = (Node) expr.evaluate(doc, XPathConstants.NODE);
String itemValue = node != null ? node.getTextContent() : null;
// test if we have a type....

if ("date".equalsIgnoreCase(itemType)) {
SimpleDateFormat formatter = new SimpleDateFormat("yyyyMMdd");
try {
Date invoiceDate = formatter.parse(itemValue);
workitem.setItemValue(itemName, invoiceDate);
} catch (ParseException e) {
e.printStackTrace();
}
} else if ("double".equalsIgnoreCase(itemType)) {
workitem.setItemValue(itemName, Double.parseDouble(itemValue));
} else {
// default...
workitem.setItemValue(itemName, itemValue);
/**
* Reads a single item from an e-invoice document based on an XPath expression
*
* @param workitem
* @param xPathExpr
* @param itemType
* @param itemName
* @throws PluginException
*/
public void readItem(ItemCollection workitem, String xPathExpr, String itemType,
String itemName) throws PluginException {

if (xmlDoc == null) {
logger.warning("Missing XML Doc !");
return;
}
createXPath();
XPathExpression expr = null;

try {
expr = xpath.compile(xPathExpr);

// extract the xpath value and update the workitem...
if (expr != null) {
Node node = (Node) expr.evaluate(xmlDoc, XPathConstants.NODE);
String itemValue = node != null ? node.getTextContent() : null;
// test if we have a type....
if ("date".equalsIgnoreCase(itemType)) {
SimpleDateFormat formatter = new SimpleDateFormat("yyyyMMdd");
try {
Date invoiceDate = formatter.parse(itemValue);
workitem.setItemValue(itemName, invoiceDate);
} catch (ParseException e) {
logger.warning("Unable to parse date value '" + itemValue + "': " + e.getMessage());
}
} else if ("double".equalsIgnoreCase(itemType)) {
workitem.setItemValue(itemName, Double.parseDouble(itemValue));
} else {
// default...
workitem.setItemValue(itemName, itemValue);
}
}
} catch (Exception e) {
logger.warning("Error analyzing XML content: " + e.getMessage());
} catch (XPathExpressionException e) {
throw new PluginException(EInvoiceAdapter.class.getSimpleName(), PARSING_EXCEPTION,
"Error compiling XPath expression: " + xPathExpr + " - " + e.getMessage(), e);
}
}
