diff --git a/docs/administration-guide/aggregation/ingest-example.md b/docs/administration-guide/aggregation/ingest-example.md new file mode 100644 index 0000000000..ac55616187 --- /dev/null +++ b/docs/administration-guide/aggregation/ingest-example.md @@ -0,0 +1,608 @@ +# Ingest Aggregation Example + +## Simple Aggregation + +To demonstrate basic aggregation at ingest we can take the following graph as a +starting point and modify the schema so that the properties are summed together: + +```mermaid +graph LR + A(["Person + + ID: Dave"]) + -- + "Commit + added: 6 + removed: 8" + --> + B(["Repository + + ID: r1"]) + A + -- + "Commit + added: 35 + removed: 10" + --> + B +``` + +As you can see, we have two entity groups, `Person` and `Repository`, both without +any properties. We have one edge type, `Commit`, with two properties, `added` +and `removed`. Translating this into a basic Gaffer schema gives the following: + +!!! note + In Gaffer every property type defined in the schema must specify an + `"aggregateFunction"` unless you specify `"aggregate": "false"` on the type. 
+ +=== "elements.json" + + ```json + { + "edges": { + "Commit": { + "source": "id.person.string", + "destination": "id.repo.string", + "directed": "true", + "properties": { + "added": "property.integer", + "removed": "property.integer" + } + } + }, + "entities": { + "Person": { + "description": "Entity representing a person vertex", + "vertex": "id.person.string" + }, + "Repository": { + "description": "Entity representing a repository vertex", + "vertex": "id.repo.string" + } + } + } + ``` + +=== "types.json" + + ```json + { + "types": { + "id.person.string": { + "description": "A basic type to hold the string id of a person entity", + "class": "java.lang.String" + }, + "id.repo.string": { + "description": "A basic type to hold the string id of a repository entity", + "class": "java.lang.String" + }, + "property.integer": { + "description": "A basic type to hold integer properties of elements", + "class": "java.lang.Integer", + "aggregateFunction": { + "class": "uk.gov.gchq.koryphe.impl.binaryoperator.Sum" + } + }, + "true": { + "description": "A simple boolean that must always be true.", + "class": "java.lang.Boolean", + "validateFunctions": [ + { + "class": "uk.gov.gchq.koryphe.impl.predicate.IsTrue" + } + ] + } + } + } + ``` + +In the above schema you can see we have applied an aggregation function to the +`"property.integer"` type which will sum the property to give a total. For this +function we must specify a class that will do the aggregation. There are a +few default classes and some additional ones implemented by the Koryphe module +which you can read more about in the [reference guide](../../reference/binary-operators-guide/binary-operators.md). + +!!! tip + It is possible to create your own aggregation functions; however, they must + implement the [`java.util.function.BiFunction`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/function/BiFunction.html?is-external=true) + interface. 
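As a rough illustration of the tip above, here is a minimal sketch of a custom aggregation function written against the Java standard library only. `SaturatingSum` is a hypothetical class, not part of Gaffer or Koryphe. It implements `java.util.function.BinaryOperator`, which is a sub-interface of `BiFunction`. How such a class is packaged and made available to a Gaffer deployment is not covered here and is an assumption left to the reader.

```java
import java.util.function.BinaryOperator;

// Hypothetical custom aggregation function (not part of Gaffer or Koryphe).
// It sums two integers but clamps the result instead of overflowing.
class SaturatingSum implements BinaryOperator<Integer> {

    @Override
    public Integer apply(final Integer a, final Integer b) {
        // Add as longs so the intermediate result cannot overflow.
        final long total = (long) a + (long) b;
        // Clamp the result back into the int range before narrowing.
        return (int) Math.max(Integer.MIN_VALUE, Math.min(Integer.MAX_VALUE, total));
    }

    public static void main(final String[] args) {
        final SaturatingSum sum = new SaturatingSum();
        // Same values as the "added" property in the example edges.
        System.out.println(sum.apply(6, 35));
        // Would overflow with a plain int addition.
        System.out.println(sum.apply(Integer.MAX_VALUE, 1));
    }
}
```

The assumption is that a class like this, once on the classpath, would be referenced from a type's `aggregateFunction` by its fully qualified class name, in the same way as `Sum` in the schema above.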
+ +Loading the data into a Graph built from the example schema lets us +see the aggregation in action. First load the data via the REST API +using the `AddElements` operation as shown below: + +```json +{ + "class": "AddElements", + "input": [ + { + "class": "Edge", + "group": "Commit", + "source": "Dave", + "destination": "r1", + "directed": true, + "properties": { + "added": 6, + "removed": 8 + } + }, + { + "class": "Edge", + "group": "Commit", + "source": "Dave", + "destination": "r1", + "directed": true, + "properties": { + "added": 35, + "removed": 10 + } + }, + { + "class": "Entity", + "group": "Person", + "vertex": "Dave" + }, + { + "class": "Entity", + "group": "Repository", + "vertex": "r1" + } + ] +} +``` + +Now running a query on these elements with the seed as `"Dave"` you can see +that all the commit edges have been aggregated together to give a total +for the `added` and `removed` properties. + +=== "JSON Query" + + ```json + { + "class": "GetElements", + "input": [ + { + "class": "EntitySeed", + "vertex": "Dave" + } + ] + } + ``` + +=== "Result" + + ```json + [ + { + "class": "uk.gov.gchq.gaffer.data.element.Entity", + "group": "Person", + "vertex": "Dave", + "properties": {} + }, + { + "class": "uk.gov.gchq.gaffer.data.element.Edge", + "group": "Commit", + "source": "Dave", + "destination": "r1", + "directed": true, + "matchedVertex": "SOURCE", + "properties": { + "removed": 18, + "added": 41 + } + } + ] + ``` + +### Using the groupBy field + +It is also possible to have fine control over exactly when aggregation is +applied by using the `groupBy` parameter. This parameter can be added to +the schema so that aggregation is applied only when a specific property is the +same between elements. + +To demonstrate this functionality we can expand the example from the previous +section to add a new property to the `Commit` edge called `issue` which +hypothetically represents the issue number the commit relates to. 
+ +Now we can add the `groupBy` parameter to the schema so that all `Commit` edges +with the same `issue` value will be aggregated like before to sum the +`removed` and `added` properties: + +```json +"edges": { + "Commit": { + "source": "id.person.string", + "destination": "id.repo.string", + "directed": "true", + "properties": { + "added": "property.integer", + "removed": "property.integer", + "issue": "property.integer" + }, + "groupBy": [ + "issue" + ] + } +} +``` + +Now say we add the following elements to the graph and run a query to get the +edges like before: + +=== "AddElements" + + ```json + { + "class": "AddElements", + "input": [ + { + "class": "Edge", + "group": "Commit", + "source": "Dave", + "destination": "r1", + "directed": true, + "properties": { + "added": 20, + "removed": 5, + "issue": 1 + } + }, + { + "class": "Edge", + "group": "Commit", + "source": "Dave", + "destination": "r1", + "directed": true, + "properties": { + "added": 6, + "removed": 8, + "issue": 1 + } + }, + { + "class": "Edge", + "group": "Commit", + "source": "Dave", + "destination": "r1", + "directed": true, + "properties": { + "added": 60, + "removed": 4, + "issue": 2 + } + }, + { + "class": "Edge", + "group": "Commit", + "source": "Dave", + "destination": "r1", + "directed": true, + "properties": { + "added": 35, + "removed": 10, + "issue": 2 + } + }, + { + "class": "Entity", + "group": "Person", + "vertex": "Dave" + }, + { + "class": "Entity", + "group": "Repository", + "vertex": "r1" + } + ] + } + ``` + +=== "Result" + + ```json + [ + { + "class": "uk.gov.gchq.gaffer.data.element.Entity", + "group": "Person", + "vertex": "Dave", + "properties": {} + }, + { + "class": "uk.gov.gchq.gaffer.data.element.Edge", + "group": "Commit", + "source": "Dave", + "destination": "r1", + "directed": true, + "matchedVertex": "SOURCE", + "properties": { + "issue": 1, + "removed": 13, + "added": 26 + } + }, + { + "class": "uk.gov.gchq.gaffer.data.element.Edge", + "group": "Commit", 
"source": "Dave", + "destination": "r1", + "directed": true, + "matchedVertex": "SOURCE", + "properties": { + "issue": 2, + "removed": 14, + "added": 95 + } + } + ] + ``` + +As you can see we end up with two `Commit` edges relating to each `issue` with +all other properties aggregated together. + +## Expanded Example + +The example from the first section is a good demonstration of how aggregation +works, but just having the total number of some properties may not be the most +useful. To demonstrate a more complex use case we will modify the example to add +some new properties to the edges, so that after aggregation we'll have this graph: + +```mermaid +graph LR + A(["Person + + ID: Dave"]) + -- + "Commit + first: 2015-12-25 + latest: 2023-01-01 + count: 3" + --> + B(["Repository + + ID: r1"]) +``` + +What we are doing with this graph is aggregating any new `Commit` edges so that +the `first` and `latest` commit dates are kept updated as new edges are added to +the Graph whilst incrementing a `count` property to indicate overall how many +`Commit` edges are between two vertexes. + +We will modify the schema from the basic example add the different properties +and set up the aggregation functions: + +!!! tip + For good practice we have also added some `validateFunctions` to give + minimum confidence in the values of the types. Please see the + [predicates reference guide](../../reference/predicates-guide/gaffer-predicates.md) + for more information. 
+ +=== "elements.json" + + ```json + { + "edges": { + "Commit": { + "source": "id.person.string", + "destination": "id.repo.string", + "directed": "true", + "properties": { + "first": "property.date.first", + "latest": "property.date.latest", + "count": "property.integer.count" + } + } + }, + "entities": { + "Person": { + "description": "Entity representing a person vertex", + "vertex": "id.person.string" + }, + "Repository": { + "description": "Entity representing a repository vertex", + "vertex": "id.repo.string" + } + } + } + ``` + +=== "types.json" + + ```json + { + "types": { + "id.person.string": { + "description": "A basic type to hold the string id of a person entity", + "class": "java.lang.String" + }, + "id.repo.string": { + "description": "A basic type to hold the string id of a repository entity", + "class": "java.lang.String" + }, + "property.integer.count": { + "description": "A basic type to hold a count property that must be greater than or equal to 0", + "class": "java.lang.Integer", + "aggregateFunction": { + "class": "uk.gov.gchq.koryphe.impl.binaryoperator.Sum" + }, + "validateFunctions": [ + { + "class": "uk.gov.gchq.koryphe.impl.predicate.IsMoreThan", + "orEqualTo": true, + "value": 0 + } + ] + }, + "property.date.first": { + "description": "A Date type to hold first date property after aggregation", + "class": "java.util.Date", + "aggregateFunction": { + "class": "uk.gov.gchq.koryphe.impl.binaryoperator.Min" + }, + "validateFunctions": [ + { + "class": "uk.gov.gchq.koryphe.impl.predicate.Exists" + } + ] + }, + "property.date.latest": { + "description": "A Date type to hold latest date property after aggregation", + "class": "java.util.Date", + "aggregateFunction": { + "class": "uk.gov.gchq.koryphe.impl.binaryoperator.Max" + }, + "validateFunctions": [ + { + "class": "uk.gov.gchq.koryphe.impl.predicate.Exists" + } + ] + }, + "true": { + "description": "A simple boolean that must always be true.", + "class": "java.lang.Boolean", + "validateFunctions": [ 
{ + "class": "uk.gov.gchq.koryphe.impl.predicate.IsTrue" + } + ] + } + } + } + ``` + +As you can see in the types schema we have applied the `Min` function to the +`property.date.first` type so that will always be aggregated to be the +earliest date property. Similarly we apply the `Max` function to the +`property.date.latest` to always give us the latest date property. The +`property.integer.count` property keeps the simple `Sum` function to keep a +total of the number of edges. + +Applying these schemas to a Graph we can then add the following elements to +demonstrate the aggregation in practice: + +!!! note + The dates in the JSON are in milliseconds since Unix Epoch instead of a + typical format like `dd/mm/yyyy` due to how Jackson serialises + `java.util.Date` types. + +```json +{ + "class": "AddElements", + "input": [ + { + "class": "Edge", + "group": "Commit", + "source": "Dave", + "destination": "r1", + "directed": true, + "properties": { + "first": { + "java.util.Date": 1451044800146 + }, + "latest": { + "java.util.Date": 1451044800146 + }, + "count": 1 + } + }, + { + "class": "Edge", + "group": "Commit", + "source": "Dave", + "destination": "r1", + "directed": true, + "properties": { + "first": { + "java.util.Date": 1514808000146 + }, + "latest": { + "java.util.Date": 1514808000146 + }, + "count": 1 + } + }, + { + "class": "Edge", + "group": "Commit", + "source": "Dave", + "destination": "r1", + "directed": true, + "properties": { + "first": { + "java.util.Date": 1672574400146 + }, + "latest": { + "java.util.Date": 1672574400146 + }, + "count": 1 + } + }, + { + "class": "Entity", + "group": "Person", + "vertex": "Dave" + }, + { + "class": "Entity", + "group": "Repository", + "vertex": "r1" + } + ] +} +``` + +!!! tip + Loading the elements like this is just for demonstration purposes, it can + look a little unintuitive as we have the same data for `first` and `latest` + properties. 
In production you may want to create a custom + `ElementsGenerator` so that the elements are created correctly from your raw + data based on the graph schema. + +Now running a query on these elements with the seed as `"Dave"` we can see that +we get back one edge with aggregated properties holding the `first` and `latest` +commit times as well as a `count` with the current number of edges. + +=== "JSON Query" + + ```json + { + "class": "GetElements", + "input": [ + { + "class": "EntitySeed", + "vertex": "Dave" + } + ] + } + ``` + +=== "Result" + + ```json + [ + { + "class": "uk.gov.gchq.gaffer.data.element.Entity", + "group": "Person", + "vertex": "Dave", + "properties": {} + }, + { + "class": "uk.gov.gchq.gaffer.data.element.Edge", + "group": "Commit", + "source": "Dave", + "destination": "r1", + "directed": true, + "matchedVertex": "SOURCE", + "properties": { + "count": 3, + "first": { + "java.util.Date": 1451044800146 + }, + "latest": { + "java.util.Date": 1672574400146 + } + } + } + ] + ``` + diff --git a/docs/administration-guide/aggregation/overview.md b/docs/administration-guide/aggregation/overview.md new file mode 100644 index 0000000000..a133e0d57f --- /dev/null +++ b/docs/administration-guide/aggregation/overview.md @@ -0,0 +1,51 @@ +# Aggregation Guide + +A basic introduction to the concept of Aggregation in Gaffer can be found in the +[User Guide](../../user-guide/gaffer-basics/what-is-aggregation.md). This guide is +an extension of the introduction to demonstrate more advanced usage of +Aggregation and how it can be applied. + +Aggregation is applied in Gaffer through an aggregation function. These can take +a number of forms but the common factor between them is that they use the +underlying [koryphe library](https://github.com/gchq/koryphe) to provide the +[`ElementAggregator`](https://gchq.github.io/Gaffer/uk/gov/gchq/gaffer/data/element/function/ElementAggregator.html). 
+ +## Ingest Aggregation + +Ingest aggregation permanently aggregates similar elements together in the Graph +as they are loaded. Ingest aggregation is configured via the Graph +schema, which will apply the aggregation if one of the following conditions is +met: + +- An entity has the same `group`, `vertex` (i.e. ID), `visibility` and all `groupBy` + property values are the same. +- An edge has the same `group`, `source`, `destination`, and all `groupBy` + property values are the same. + +There are a few different use cases for applying ingest aggregation but it is +largely driven by the data you have and the analysis you wish to perform. As an +example, say you were expecting multiple instances of the same edge between +two nodes, each with differing values on its properties. +This could be a place to apply aggregation to sum the values. + +Please see the [ingest aggregation example](ingest-example.md) for some common +use cases on how this can be applied. + +## Query-time Aggregation + +Query-time aggregation, as the name suggests, applies aggregation to +elements from within the graph query. This differs from ingest aggregation +as only the results of the query will have been aggregated; the data stored +in the graph remains unchanged. + +Generally, to apply aggregation at query-time you must override the `groupBy` +property to prevent the default grouping taking place. It is then possible +to create your own aggregator in the query which can force the use of a +different aggregation function on a property. + +A simple example demonstrating query-time aggregation can be found in the +[user guide on filtering](../../user-guide/query/gaffer-syntax/filtering.md#query-time-aggregation). + +!!! tip + Most of the time you will want to couple query-time aggregation with a `View` + to allow more targeted queries on the data in your graph. 
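The matching rule described under Ingest Aggregation can be pictured with a short, plain-Java sketch. It does not use the Gaffer API: `Commit` is a hypothetical stand-in for an edge, the grouping key is built from the group, source, destination and a single `groupBy` property (`issue`), and values sharing a key are merged with a sum. The values are illustrative only.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; none of these types come from Gaffer.
class IngestAggregationSketch {

    // Hypothetical, simplified stand-in for a "Commit" edge.
    static final class Commit {
        final String source;
        final String destination;
        final int issue; // plays the role of a groupBy property
        final int added; // plays the role of a summed property

        Commit(final String source, final String destination, final int issue, final int added) {
            this.source = source;
            this.destination = destination;
            this.issue = issue;
            this.added = added;
        }
    }

    // Merge commits that share group, source, destination and groupBy value.
    static Map<List<Object>, Integer> aggregate(final List<Commit> commits) {
        final Map<List<Object>, Integer> totals = new LinkedHashMap<>();
        for (final Commit c : commits) {
            // Elements with an equal key are treated as "the same" element.
            final List<Object> key = Arrays.<Object>asList("Commit", c.source, c.destination, c.issue);
            // Sum the property for elements that share the key.
            totals.merge(key, c.added, Integer::sum);
        }
        return totals;
    }

    public static void main(final String[] args) {
        final List<Commit> commits = Arrays.asList(
                new Commit("Dave", "r1", 1, 20),
                new Commit("Dave", "r1", 1, 6),
                new Commit("Dave", "r1", 2, 60),
                new Commit("Dave", "r1", 2, 35));
        aggregate(commits).forEach((key, added) -> System.out.println(key + " -> added: " + added));
    }
}
```

Four input edges collapse into two entries, one per `issue` value, with `added` totals of 26 and 95. Dropping `c.issue` from the key would merge all four into a single entry, which is the effect of having no `groupBy` properties.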
diff --git a/docs/user-guide/gaffer-basics/what-is-aggregation.md b/docs/user-guide/gaffer-basics/what-is-aggregation.md index e69de29bb2..06fbe18dd2 100644 --- a/docs/user-guide/gaffer-basics/what-is-aggregation.md +++ b/docs/user-guide/gaffer-basics/what-is-aggregation.md @@ -0,0 +1,78 @@ +# What is Aggregation? + +**Aggregation** - noun + +*"The process of combining things or amounts into a single group or total"* + +--- + +In a software context Aggregation can have a variety of interpretations; in +Gaffer this specifically refers to ingest or query-time aggregation. + +## Why Aggregate? + +Aggregation allows us to take a set of elements and group them and their +properties together to form a new result. This allows us to get quick insights +into our data and generate valuable outputs from our graph queries. + +There are also some key benefits when using aggregation with sketches as we +can store aggregated data in a compact format to reduce storage requirements +and improve throughput. Further reading on sketches is available in the +[reference guide](../../reference/properties-guide/advanced.md). + +## How is Aggregation Applied? + +Aggregation can be applied either at data ingest time or +as part of a specific query. + +For ingest time the configuration is specified via the [graph schema](../schema.md) +so that as data is loaded into a graph it is aggregated and stored in that +state. To demonstrate what this would look like, take the simple graph below +where we can apply aggregation to merge the multiple edges together by summing the +properties. + +=== "Before Aggregation" + If we don't apply ingest aggregation all elements will be stored separately in + the graph. + + ```mermaid + flowchart LR + A(A) + -- + "Edge + prop: 1" + --> B(B) + A(A) + -- + "Edge + prop: 1" + --> B(B) + ``` + +=== "After Aggregation" + Now if we apply aggregation to sum the same properties on the `Edge` + they are stored together as one element. 
+ + ```mermaid + flowchart LR + A(A) + -- + "Edge + prop: 2" + --> B(B) + ``` + +!!! tip + Other aggregation functions are available. For an in-depth guide to ingest + aggregation please see the [administration guide](../../administration-guide/aggregation/overview.md). + +The other place you can apply aggregation is at query time. As a user this is +the most common use case; typically an administrator will have set up the ingest +time aggregation to summarise the input data into a more manageable size, +saving disk space in the process. Users can then use query-time aggregation to +further summarise the data without having to edit the graph schema. + +The main difference between query-time aggregation and ingest aggregation is +that aggregation applied in a query only affects the results of that +query, so the stored graph data is left intact. There is more information on how +to apply query-time aggregation in the [querying guide on filtering](../query/gaffer-syntax/filtering.md#query-time-aggregation). 
diff --git a/mkdocs.yml b/mkdocs.yml index 9b6446255d..3c8cda503c 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -152,6 +152,9 @@ nav: - 'Changing Accumulo Passwords': 'administration-guide/gaffer-config/change-accumulo-passwords.md' - 'Proxy': 'administration-guide/gaffer-config/proxy.md' - 'URL': 'administration-guide/gaffer-config/url.md' + - Aggregation Guide: + - 'Overview': 'administration-guide/aggregation/overview.md' + - 'Ingest Example': 'administration-guide/aggregation/ingest-example.md' - 'Named Operations': 'administration-guide/named-operations.md' - 'Operation Score': 'administration-guide/operation-score.md' - Security: @@ -168,7 +171,7 @@ nav: - 'Accumulo Kerberos': 'change-notes/migrating-from-v1-to-v2/accumulo-kerberos.md' - 'Log4j in Gaffer': 'change-notes/migrating-from-v1-to-v2/log4j.md' - Reference: - - 'Reference Guide': 'reference/intro.md' + - 'Introduction': 'reference/intro.md' - 'Glossary': 'reference/glossary.md' - 'Javadoc': 'reference/javadoc.md' - Properties: