Commit 9d44519
CDSTP-110-Investigate-potential-solutions-for-storing-tree-structures (#98)
* CDSTP-110 Create docs file to store results of investigation
* CDSTP-110 Update filename
* CDSTP-110 Update file
* CDSTP-110 Add basic algorithm functionality
* CDSTP-110 Add script to generate dummy json OTel data
* CDSTP-110 Add function to convert json data into memgraph required format
* CDSTP-110 Add id to relationship node
* CDSTP-110 Add functionality to find unique graphs within memgraph database
* CDSTP-110 Tidy up typing
* CDSTP-110 Update algorithm to use custom procedure within memgraph
* CDSTP-110 Add docstrings to functions and module
* CDSTP-110 Fix mypy errors
* CDSTP-110 Fix formatting
* CDSTP-110 Add SQLite database to compare efficiency
* CDSTP-110 Update SQLite method for finding unique graphs
* CDSTP-110 Update findings and use in memory storage for SQLite computations
* CDSTP-110 Add batched requests to sqlite method
* CDSTP-110 Update investigation findings
* CDSTP-110 Remove notes from design note
* CDSTP-110 Fix somy mypy errors
* CDSTP-110 Fix linting errors
* CDSTP-110 Add docstrings
* CDSTP-110 Create a POC folder
* CDSTP-110 Update mypy script to exclude graph_solutions_poc folder
* CDSTP-110 Fix linting
* CDSTP-110 Update scripts file to .md
* CDSTP-110 Add system architecture diagram
* CDSTP-110 Adding required functions and classes to design note
* CDSTP-110 Fix linting error
* CDSTP-110 Fix linting and add extra functions to design note
* CDSTP-110 Fix import error
* CDSTP-Remove DN3 - to be moved into new branch
* CDSTP-110 Create generic POC folder
* CDSTP-110 Update mypy script to ignore poc folder
1 parent 56c95f8 commit 9d44519

File tree

13 files changed: +879 -6 lines changed

.gitignore (+1, -3)

```diff
@@ -167,6 +167,4 @@ outputs
 data/
 
 # vscode
-.vscode
-
-/tel2puml/OtelSpan_output
+.vscode
```

.mypy.ini (+1, -1)

```diff
@@ -1,3 +1,3 @@
 [mypy]
 ignore_errors = True
-exclude = tests
+exclude = tests
```
New file (+197 lines):
# Investigation into Storage and Evaluation Methods to Find Unique Graphs within OpenTelemetry Data

## 1. Current Implementation

### 1.1 Data Storage

- OpenTelemetry (OTel) data is parsed and stored in a SQLite database.
- Relations between data points are formed within the database structure.

### 1.2 Graph Evaluation Method

- Unique graph structures are identified through a comparison process.
- The current method sorts graph data into alphanumerically ordered lists before comparison.
- This approach is not fully accurate due to limitations of the sorting-based comparison.

### 1.3 Limitations

- Potential for false positives or negatives when determining graph isomorphism (graph uniqueness).

## 2. Investigation Objectives

The primary goal is to explore alternative methods for storing and evaluating graph data derived from OTel traces. Specifically, we aim to:

1. Identify more accurate methods for determining graph isomorphism.
2. Explore storage solutions optimised for graph data.
3. Evaluate the performance and scalability of different approaches.
4. Assess the ease of implementation and maintenance of new solutions.
## 3. Proposed Solutions for Investigation

### 3.1 Graph Databases

#### 3.1.1 Neo4j

- Native graph storage and querying capabilities.
- Cypher query language for complex graph operations.
- Built-in visualisation tools.
- Disk-based.
- Written in Java.
- The most popular graph database solution.

#### 3.1.2 ArangoDB

- Supports JSON documents, graphs and key/values.
- Uses AQL (ArangoDB Query Language).
- Disk-based.
- Written in C++.

#### 3.1.3 Memgraph

- Open-source graph database built for streaming, compatible with Neo4j.
- Supports both property graph and RDF models.
- Written in C++.
- In-memory based (disk-based mode available) for faster querying.
- Uses Cypher as its query language.
- Compatible for use with NetworkX.
- Allows custom procedures to be written in Python.
- NetworkX has built-in methods for detecting graph isomorphism.
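For illustration, NetworkX's built-in isomorphism check (which uses the VF2 algorithm) can be called directly on small directed graphs. A minimal sketch with toy graphs:

```python
import networkx as nx

# Two DiGraphs with the same shape but different node ids
g1 = nx.DiGraph([("a", "b"), ("a", "c")])
g2 = nx.DiGraph([("x", "y"), ("x", "z")])

# VF2-based isomorphism check built into NetworkX
print(nx.is_isomorphic(g1, g2))  # True
```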
#### 3.1.4 SQLite

- Lightweight, serverless, self-contained relational database engine.
- Written in C.
- File-based: the entire database is stored in a single file on disk.
- ACID-compliant.
- Supports standard SQL syntax with some extensions.
- No native graph capabilities, but can be used to store graph-like structures.
- Requires custom implementation for graph operations and traversals.
- Efficient for smaller datasets and embedded applications.
- Limited concurrency support.
- No built-in support for graph algorithms or isomorphism detection.
### 3.2 Advanced Graph Algorithms

#### 3.2.1 Graph Isomorphism Algorithms

- VF2 algorithm for graph and subgraph isomorphism detection.
- Ullmann's algorithm for subgraph isomorphism.

#### 3.2.2 Graph Hashing Techniques

- Weisfeiler-Lehman (WL) graph kernels for structure comparison.
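A sketch of the WL hashing idea using NetworkX's `weisfeiler_lehman_graph_hash` (toy graphs for illustration): isomorphic graphs always produce identical hashes, so hash equality can be used as a fast uniqueness test.

```python
import networkx as nx

# Isomorphic DiGraphs (same structure, relabelled nodes)
g1 = nx.DiGraph([("root", "a"), ("root", "b"), ("a", "c")])
g2 = nx.DiGraph([("r", "x"), ("r", "y"), ("x", "z")])

h1 = nx.weisfeiler_lehman_graph_hash(g1)
h2 = nx.weisfeiler_lehman_graph_hash(g2)
print(h1 == h2)  # True: isomorphic graphs produce identical WL hashes
```

Note that WL hashing is a heuristic: non-isomorphic graphs can in principle collide, though in practice collisions are rare for tree-like trace structures.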
## 4. Evaluation Criteria

For each proposed solution, we will evaluate:

1. Accuracy in identifying unique graph structures.
2. Query performance for common graph operations.
3. Scalability with increasing data volume.
4. Ease of integration with the existing OTel data processing pipeline.
5. Maintenance overhead and long-term viability.
### 4.1 Custom Solution with Memgraph and NetworkX

#### 4.1.1 Memgraph with NetworkX Integration

- Utilise Memgraph's ability to run custom procedures written in Python.
- Implement a procedure to convert Memgraph graphs to NetworkX DiGraph objects.
- Leverage NetworkX's implementation of the Weisfeiler-Lehman graph hashing algorithm.
- Use the generated hash for graph isomorphism comparisons.

#### 4.1.2 Findings

- Memgraph's lack of native graph isomorphism capabilities necessitated a custom solution.
- To load JSON data into Memgraph, the data had to be transformed into a specific format containing data for a "node" and for a "relationship", which created additional overhead.
- Integration with NetworkX provides access to a wide range of graph algorithms.
- The Weisfeiler-Lehman algorithm implemented in NetworkX offers an efficient and precise method for graph hashing.
- This approach allows flexibility in implementing custom graph analysis procedures.
- Initial tests show promising results for identifying unique graph structures in OTel data.
- Performance is relatively slow: 10,000 graphs of depth 3 with 1-3 branches per node took around 3-6 minutes to process, even with optimisations such as batched queries.
#### 4.1.3 Algorithm Overview

1. Convert OTel data to JSON consisting of nodes and relationships. The data can then be loaded into Memgraph using the procedure `import_util.json()`.
2. Query the database for root nodes.
3. Extract `job_name` and `trace_id` from root nodes, mapping job names to trace ids.
4. Loop over the trace ids within each job name, querying the database with a batch of trace ids.
5. Iterate over the trace ids within the Cypher query using the `UNWIND` statement. Call the custom procedure within the query that converts a Memgraph graph to a NetworkX DiGraph and returns its Weisfeiler-Lehman hash value.
6. Store hashes within a `defaultdict(set)`, mapping job names to unique graph hashes.
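The conversion-and-hash step (step 5) can be sketched outside Memgraph as follows. The node/relationship dict shapes mirror the import format described above (and the TypedDicts included in this commit); the helper name `to_digraph` is hypothetical:

```python
import networkx as nx

# Hypothetical sample of the node/relationship import format
nodes = [
    {"id": "1", "labels": ["Span"], "properties": {"event_type": "start"}, "type": "node"},
    {"id": "2", "labels": ["Span"], "properties": {"event_type": "A"}, "type": "node"},
]
relationships = [
    {"id": "r1", "start": "1", "end": "2", "label": "CHILD", "properties": {}, "type": "relationship"},
]


def to_digraph(nodes: list[dict], relationships: list[dict]) -> nx.DiGraph:
    """Convert node/relationship dicts into a NetworkX DiGraph."""
    graph = nx.DiGraph()
    for node in nodes:
        # Carry the event_type across as a node label for hashing
        graph.add_node(node["id"], label=node["properties"].get("event_type", ""))
    for rel in relationships:
        graph.add_edge(rel["start"], rel["end"])
    return graph


graph = to_digraph(nodes, relationships)
# Hash on the event_type label so structure AND event types are compared
print(nx.weisfeiler_lehman_graph_hash(graph, node_attr="label"))
```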
### 4.2 Custom Solution with SQLite

#### 4.2.1 Performance Findings

- Initial implementation: the basic SQLite implementation demonstrated a significant performance advantage, computing graphs at least 5 times faster than the Memgraph solution.
- Custom hashing algorithm: we developed a tailored hashing solution that computes a node's hash by combining:
  * the hash of the node's `event_type`;
  * the sorted hashes of its children's `event_type`s.
  * This approach proved both efficient and accurate, matching the Memgraph/NetworkX solution in identifying unique graphs when tested on a dataset of 10,000 graphs.
- In-memory processing: loading the dataset into memory achieved a 40% speed improvement for processing 10,000 graphs of depth 3 with 1-3 branches per node.
- Query optimisation: implementing batch processing with a batch size of 500 nodes gave remarkable results:
  * SQLite solution: 1.6 seconds
  * Memgraph solution: 200 seconds
  * This represents a 125x speed improvement over the Memgraph approach.
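The hashing scheme described above can be sketched as follows (a minimal illustration; `hash_node` is a hypothetical helper name):

```python
import hashlib


def hash_node(event_type: str, child_hashes: list[str]) -> str:
    """Hash a node by combining its event_type with the sorted
    hashes of its children, so sibling order does not matter."""
    combined = event_type + "".join(sorted(child_hashes))
    return hashlib.sha256(combined.encode()).hexdigest()


# Leaf nodes hash only their event_type
leaf_b = hash_node("B", [])
leaf_c = hash_node("C", [])

# The parent hash is order-independent thanks to sorting
assert hash_node("A", [leaf_b, leaf_c]) == hash_node("A", [leaf_c, leaf_b])
```

Sorting the child hashes is what makes the scheme insensitive to the order in which children are stored, while still distinguishing different tree shapes and event types.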
#### 4.2.2 Architectural Advantages

- Flexible data modelling: the SQLite solution allows easy modifications to the Node model, accommodating changes in data structure without significant refactoring.
- Reduced data manipulation: unlike the Memgraph solution, which required post-processing of data, the SQLite approach eliminates the need for data transformation, resulting in less overhead and a simpler data pipeline.
- Scalability: the performance gains observed with the SQLite solution suggest better scalability for larger datasets, addressing one of the key objectives of this investigation.

#### 4.2.3 Comparative Analysis

When compared to the Memgraph solution, the SQLite approach offers:

- Substantially faster processing times (125x improvement in our tests)
- A simpler data pipeline with reduced manipulation requirements
- Greater flexibility in data modelling
- Potential for better scalability with larger datasets
#### 4.2.4 Algorithm Overview

Database Setup:

- Create an SQLite database (in-memory for this implementation).
- Define a Node model representing the structure of OTel data.

Data Loading:

- Load node data from a JSON file into the SQLite database.
- Each node contains `span_id`, `trace_id`, `event_type`, `job_name`, and `prev_span_id`.

Graph Processing:

- Retrieve distinct job names from the database.
- For each job name:
  * Query for root nodes (nodes with no `prev_span_id`) for that job.
  * Process root nodes in batches (batch size = 500).

Graph Hashing:

- For each batch of root nodes:
  * Retrieve all related nodes for the batch from the database.
  * Create a mapping of nodes to their children.
  * For each root node in the batch:
    - Recursively compute a hash for the graph starting from the root node.
    - The hash is based on the node's `event_type` and its children's hashes.

Unique Graph Identification:

- Maintain a hash set for each job name.
- Add computed graph hashes to the corresponding job name's hash set.
- Keep track of which `trace_id`s correspond to each unique graph hash.

Result Compilation:

- Count the total number of unique graph structures across all job names.

Performance Optimisation:

- Use an in-memory SQLite database for faster access.
- Process nodes in batches to reduce database query overhead.
- Use an efficient hashing algorithm (SHA-256) for graph structure comparison.
## 5. Conclusion

This investigation into storage and evaluation methods for OpenTelemetry data has yielded valuable insights, particularly in comparing SQLite and Memgraph solutions for identifying unique graph structures.

The custom SQLite solution demonstrated significant performance advantages over the Memgraph approach:

- Speed: with a basic implementation, SQLite computed graphs 5 times faster than the Memgraph solution. With optimisations such as in-memory processing and query batching, the SQLite approach processed 10,000 graphs in just 1.6 seconds, compared to Memgraph's 200 seconds.
- Efficiency: the custom hashing algorithm implemented for SQLite proved both fast and accurate, matching the Memgraph/NetworkX solution in identifying unique graphs.
- Flexibility: SQLite allowed easier modifications to the Node model and required less data manipulation, resulting in reduced overhead compared to Memgraph.
- Scalability: the SQLite solution showed better performance with larger datasets, addressing one of our key investigation objectives.

While Memgraph offered some advantages, such as native graph storage and compatibility with NetworkX for advanced algorithms, these benefits were outweighed by the performance gains and simplicity of the SQLite approach for our specific use case.

The investigation also highlighted the importance of custom implementations tailored to specific needs. The SQLite solution, despite lacking native graph capabilities, outperformed the specialised graph database when optimised for our particular requirements.

Moving forward, we recommend:

- Further optimisation and refinement of the SQLite-based solution.
- Conducting additional scalability tests with even larger datasets.
- Exploring ways to incorporate some of the beneficial features of graph databases (such as visualisation) into the SQLite-based system.

In conclusion, this investigation has provided a clear direction for improving our OTel trace analysis infrastructure, favouring a highly optimised SQLite-based approach over more complex graph database solutions for our current needs.
New file (+33 lines):

```python
"""TypedDicts for graph solutions"""

from typing import TypedDict, NotRequired


class NodeData(TypedDict):
    """TypedDict for NodeData"""

    id: str
    labels: list[str]
    properties: dict[str, str]
    type: str


class NodeRelationshipData(TypedDict):
    """TypedDict for NodeRelationshipData"""

    id: str
    end: str
    start: str
    label: str
    properties: dict[str, str]
    type: str


class OtelData(TypedDict):
    """TypedDict for OtelData"""

    span_id: str
    trace_id: str
    event_type: str
    prev_span_id: NotRequired[str]
    job_name: str
```
New file (+10 lines):

# Note

To be able to connect to the Memgraph Docker container, you must create a network that both the dev container and the Memgraph container use.

The following has to be added to `devcontainer.json`:

```json
"runArgs": [
    "--network=<custom_network>"
],
```
New file (+98 lines):

```python
"""Module to generate JSON data representing OTel data."""

import json
import random
import string


def generate_id() -> str:
    """Function to generate a random unique id."""
    return "".join(
        random.choices(string.ascii_lowercase + string.digits, k=16)
    )


def generate_dummy_data(
    num_traces: int = 1, max_depth: int = 2
) -> list[dict[str, str | None]]:
    """Function to generate dummy OTel data.

    :param num_traces: Number of traces to generate
    :type num_traces: `int`
    :param max_depth: The max depth of the tree to be generated
    :type max_depth: `int`
    :return: List of dictionaries, each representing an OTel event
    :rtype: `list`[`dict`[`str`, `str` | `None`]]
    """
    data: list[dict[str, str | None]] = []
    event_types = ["A", "B", "C", "D"]
    job_names = ["job_1", "job_2", "job_3"]

    def generate_spans(
        job_name: str,
        trace_id: str,
        prev_span_id: str | None,
        current_depth: int,
    ) -> None:
        """Recursive function to generate OTel spans.

        :param job_name: The job name
        :type job_name: `str`
        :param trace_id: The trace ID of the span
        :type trace_id: `str`
        :param prev_span_id: The span ID of the parent span
        :type prev_span_id: `str` | `None`
        :param current_depth: The current depth of the generated tree
        :type current_depth: `int`
        """
        if current_depth > max_depth:
            return

        num_branches = random.randint(1, 3)
        for _ in range(num_branches):
            span_id = generate_id()
            event_type = random.choice(event_types)

            span = {
                "span_id": span_id,
                "trace_id": trace_id,
                "event_type": event_type,
                "prev_span_id": prev_span_id,
                "job_name": job_name,
            }

            data.append(span)

            # Recursively generate child spans
            generate_spans(job_name, trace_id, span_id, current_depth + 1)

    for _ in range(num_traces):
        trace_id = generate_id()
        root_span_id = generate_id()
        job_name = random.choice(job_names)

        # Create the root span
        root_span = {
            "span_id": root_span_id,
            "trace_id": trace_id,
            "event_type": "start",
            "prev_span_id": None,
            "job_name": job_name,
        }
        data.append(root_span)

        # Generate the rest of the trace
        generate_spans(job_name, trace_id, root_span_id, 1)

    return data


if __name__ == "__main__":
    # Generate dummy data
    dummy_data = generate_dummy_data()

    # Save to a file
    with open("./graph_solution_poc/data/dummy_trace_data.json", "w") as f:
        json.dump(dummy_data, f, indent=2)
```
