diff --git a/site/content/3.11/components/tools/arango-datasets.md b/site/content/3.11/components/tools/arango-datasets.md index 78dfbe2950..a24903550f 100644 --- a/site/content/3.11/components/tools/arango-datasets.md +++ b/site/content/3.11/components/tools/arango-datasets.md @@ -3,9 +3,9 @@ title: ArangoDB Datasets menuTitle: ArangoDB Datasets weight: 60 description: >- - `arango_datasets` is a Python package for loading sample datasets into ArangoDB + `arango-datasets` is a Python package for loading sample datasets into ArangoDB --- -You can use the `arango_datasets` package in conjunction with the `python-arango` +You can use the `arango-datasets` package in conjunction with the `python-arango` driver to load example data into your ArangoDB deployments. The data is hosted on AWS S3. There are a number of existing datasets already available and you can view them by calling the `list_datasets()` method as shown below. @@ -24,7 +24,7 @@ You can find the source code repository of the module on GitHub: ## Usage -Once you have installed the `arango_datasets` package, you can use it to +Once you have installed the `arango-datasets` package, you can use it to download and import datasets into your deployment with `arango_datasets.Datasets`. The `Datasets` constructor requires a valid [python-arango](../../develop/drivers/python.md) diff --git a/site/content/3.11/data-science/arangographml/_index.md b/site/content/3.11/data-science/arangographml/_index.md index baa200deaa..2d5d3324de 100644 --- a/site/content/3.11/data-science/arangographml/_index.md +++ b/site/content/3.11/data-science/arangographml/_index.md @@ -7,33 +7,37 @@ description: >- aliases: - graphml --- -Traditional machine learning overlooks the connections and relationships +Traditional Machine Learning (ML) overlooks the connections and relationships between data points, which is where graph machine learning excels. However, accessibility to GraphML has been limited to sizable enterprises equipped with -specialized teams of data scientists. ArangoGraphML, on the other hand, -simplifies the utilization of GraphML, enabling a broader range of personas to -extract profound insights from their data. +specialized teams of data scientists. ArangoGraphML simplifies the utilization of GraphML, +enabling a broader range of personas to extract profound insights from their data. ## How GraphML works -GraphML focuses on the utilization of neural networks specifically for -graph-related tasks. It is well-suited for addressing vague or fuzzy problems -and facilitating their resolution. The process involves incorporating a graph's -topology (node and edge structure) and the node and edge characteristics and -features to create a numerical representation known as an embedding. +Graph machine learning leverages the inherent structure of graph data, where +entities (nodes) and their relationships (edges) form a network. Unlike +traditional ML, which primarily operates on tabular data, GraphML applies +specialized algorithms like Graph Neural Networks (GNNs), node embeddings, and +link prediction to uncover complex patterns and insights. + +1. **Graph Construction**: + Raw data is transformed into a graph structure, defining nodes and edges based + on real-world relationships. +2. **Featurization**: + Nodes and edges are enriched with features that help in training predictive models. +3. **Model Training**: + Machine learning techniques are applied on GNNs to identify patterns and make predictions. +4. 
**Inference & Insights**: + The trained model is used to classify nodes, detect anomalies, recommend items, + or predict future connections. + +ArangoGraphML streamlines these steps, providing an intuitive and scalable +framework to integrate GraphML into various applications, from fraud detection +to recommendation systems. ![GraphML Embeddings](../../../images/GraphML-Embeddings.webp) -Graph Neural Networks (GNNs) are explicitly designed to learn meaningful -numerical representations, or embeddings, for nodes and edges in a graph. - -By applying a series of steps, GNNs effectively create graph embeddings, -which are numerical representations that encode the essential information -about the nodes and edges in the graph. These embeddings can then be used -for various tasks, such as node classification, link prediction, and -graph-level classification, where the model can make predictions based on the -learned patterns and relationships within the graph. - ![GraphML Workflow](../../../images/GraphML-How-it-works.webp) It is no longer necessary to understand the complexities involved with graph @@ -45,71 +49,133 @@ The platform comes preloaded with all the tools needed to prepare your graph for machine learning, high-accuracy training, and persisting predictions back to the database for application use. -### Classification - -Node classification is a natural fit for graph databases as it can leverage -existing graph analytics insights during model training. For instance, if you -have performed some community detection, potentially using ArangoDB's built-in -Pregel support, you can use these insights as inputs for graph machine learning. - -#### What is Node Classification - -The goal of node classification is to categorize the nodes in a graph based on -their neighborhood connections and characteristics in the graph. Based on the -behaviors or patterns in the graph, the Graph Neural Network (GNN) will be able -to learn what makes a node belong to a category. - -Node classification can be used to solve complex problems such as: -- Entity Categorization - - Email - - Books - - WebPage - - Transaction -- Social Networks - - Events - - Friends - - Interests -- BioPharmaceutical - - Protein-protein interaction - - Drug Categorization - - Sequence grouping -- Behavior - - Fraud - - Purchase/decision making - - Anomaly - -Many use cases can be solved with node classification. With many challenges, -there are multiple ways to attempt to solve them, and that's why the -ArangoGraphML node classification is only the first of many techniques to be -introduced. You can sign up to get immediate access to our latest stable -features and also try out other features included in the pipeline, such as -embedding similarity or link prediction. - -For more information, [get in touch](https://www.arangodb.com/contact/) -with the ArangoDB team. - -### Metrics and Compliance - -#### Training Performance - -Before using a model to provide predictions to your application, there needs -to be a way to determine its level of accuracy. Additionally, a mechanism must -be in place to ensure the experiments comply with auditor requirements. - -ArangoGraphML supports these objectives by storing all relevant training data -and metrics in a metadata graph, which is only available to you and is never -viewable by ArangoDB. This metagraph contains valuable training metrics such as -average accuracy (the general metric for determining model performance), F1, -Recall, Precision, and confusion matrix data. 
This graph links all experiments +## Supported Tasks + +### Node Classification + +Node classification is a **supervised learning** task where the goal is to +predict the label of a node based on both its own features and its relationships +within the graph. It requires a set of labeled nodes to train a model, which then +classifies unlabeled nodes based on learned patterns. + +**How it works in ArangoGraphML** + +- A portion of the nodes in a graph is labeled for training. +- The model learns patterns from both **node features** and + **structural relationships** (neighboring nodes and connections). +- It predicts labels for unlabeled nodes based on these learned patterns. + +**Example Use Cases** + +- **Fraud Detection in Financial Networks** + - **Problem:** Fraudsters often create multiple accounts or interact within + suspicious clusters to evade detection. + - **Solution:** A transaction graph is built where nodes represent users and + edges represent transactions. The model learns patterns from labeled + fraudulent and legitimate users, detecting hidden fraud rings based on + **both user attributes and transaction relationships**. + +- **Customer Segmentation in E-Commerce & Social Media** + - **Problem:** Businesses need to categorize customers based on purchasing + behavior and engagement. + - **Solution:** A graph is built where nodes represent customers and edges + represent interactions (purchases, reviews, social connections). The model + predicts the category of each user based on how similar they are to other users + **not just by their personal data, but also by how they are connected to others**. + +- **Disease Classification in Biomedical Networks** + - **Problem:** Identifying proteins or genes associated with a disease. + - **Solution:** A protein interaction graph is built where nodes are proteins + and edges represent biochemical interactions. The model classifies unknown + proteins based on their interactions with known disease-related proteins, + rather than just their individual properties. + +### Node Embedding Generation + +Node embedding is an **unsupervised learning** technique that converts nodes +into numerical vector representations, preserving their **structural relationships** +within the graph. Unlike simple feature aggregation, node embeddings +**capture the influence of neighboring nodes and graph topology**, making +them powerful for downstream tasks like clustering, anomaly detection, +and link prediction. These combinations can provide valuable insights. +Consider using [ArangoDB's Vector Search](https://arangodb.com/2024/11/vector-search-in-arangodb-practical-insights-and-hands-on-examples/) +capabilities to find similar nodes based on their embeddings. + +**Feature Embeddings versus Node Embeddings** + +**Feature Embeddings** are vector representations derived from the attributes or +features associated with nodes. These embeddings aim to capture the inherent +characteristics of the data. For example, in a social network, a +feature embedding might encode user attributes like age, location, and +interests. Techniques like **Word2Vec**, **TF-IDF**, or **autoencoders** are +commonly used to generate such embeddings. + +In the context of graphs, **Node Embeddings** are a +**combination of a node's feature embedding and the structural information from its connected edges**. +Essentially, they aggregate both the node's attributes and the connectivity patterns +within the graph. 
This fusion helps capture not only the individual properties of +a node but also its position and role within the network. + +**How it works in ArangoGraphML** + +- The model learns an embedding (a vector representation) for each node based on its + **position within the graph and its connections**. +- It **does not rely on labeled data** – instead, it captures structural patterns + through graph traversal and aggregation of neighbor information. +- These embeddings can be used for similarity searches, clustering, and predictive tasks. + +**Example Use Cases** + +- **Recommendation Systems (E-commerce & Streaming Platforms)** + - **Problem:** Platforms like Amazon, Netflix, and Spotify need to recommend products, + movies, or songs. + - **Solution:** A user-item interaction graph is built where nodes are users + and products, and edges represent interactions (purchases, ratings, listens). + **Embeddings encode relationships**, allowing the system to recommend similar + items based on user behavior and network influence rather than just individual + preferences. + +- **Anomaly Detection in Cybersecurity & Finance** + - **Problem:** Detecting unusual activity (e.g., cyber attacks, money laundering) + in complex networks. + - **Solution:** A network of IP addresses, users, and transactions is represented as + a graph. Nodes with embeddings that significantly deviate from normal patterns + are flagged as potential threats. The key advantage here is that anomalies are + detected based on **network structure, not just individual activity logs**. + +- **Link Prediction (Social & Knowledge Graphs)** + - **Problem:** Predicting new relationships, such as suggesting friends on + social media or forecasting research paper citations. + - **Solution:** A social network graph is created where nodes are users, and + edges represent friendships. **Embeddings capture the likelihood of + connections forming based on shared neighborhoods and structural + similarities, even if users have never interacted before**. + +### Key Differences + +| Feature | Node Classification | Node Embedding Generation | +|-----------------------|---------------------|----------------------------| +| **Learning Type** | Supervised | Unsupervised | +| **Input Data** | Labeled nodes | Graph structure & features | +| **Output** | Predicted labels | Node embeddings (vectors) | +| **Key Advantage** | Learns labels based on node connections and attributes | Learns structural patterns and node relationships | +| **Use Cases** | Fraud detection, customer segmentation, disease classification | Recommendations, anomaly detection, link prediction | + +ArangoGraphML provides the infrastructure to efficiently train and apply these +models, helping users extract meaningful insights from complex graph data. + +## Metrics and Compliance + +ArangoGraphML supports tracking your ML pipeline by storing all relevant metadata +and metrics in a Graph called ArangoPipe. This is only available to you and is never +viewable by ArangoDB. This metadata graph links all experiments to the source data, feature generation activities, training runs, and prediction -jobs. Having everything linked across the entire pipeline ensures that, at any -time, anything done that could be considered associated with sensitive user data, -it is logged and easily accessible. +jobs, allowing you to track the entire ML pipeline without having to leave ArangoDB. 
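For a rough idea of what this looks like in practice, the following sketch uses `python-arango` to inspect that metadata database from your own code. The host, the credentials, and the collection layout inside `arangopipe` are assumptions here; the sketch only lists whatever collections your deployment actually contains.

```py
from arango import ArangoClient

# Host and credentials are placeholders for your ArangoGraph deployment.
client = ArangoClient(hosts="https://<your-deployment>:8529")
db = client.db("arangopipe", username="<user>", password="<password>")

# List the metadata collections and how many documents each one holds.
for col in db.collections():
    if not col["name"].startswith("_"):  # skip system collections
        print(col["name"], db.collection(col["name"]).count())
```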
### Security Each deployment that uses ArangoGraphML has an `arangopipe` database created, -which houses all this information. Since the data lives with the deployment, +which houses all ML Metadata information. Since this data lives within the deployment, it benefits from the ArangoGraph SOC 2 compliance and Enterprise security features. All ArangoGraphML services live alongside the ArangoGraph deployment and are only -accessible within that organization. \ No newline at end of file +accessible within that organization. diff --git a/site/content/3.11/data-science/arangographml/getting-started.md b/site/content/3.11/data-science/arangographml/getting-started.md index 8a485a254d..6bd614167e 100644 --- a/site/content/3.11/data-science/arangographml/getting-started.md +++ b/site/content/3.11/data-science/arangographml/getting-started.md @@ -59,7 +59,7 @@ ArangoGraphML comes with other ArangoDB Magic Commands! See the full list [here] **API Documentation: [arangoml.ArangoML](https://arangoml.github.io/arangoml/client.html#arangoml.main.ArangoML)** The `ArangoML` class is the main entry point for the `arangoml` package. -It requires the following parameters: +It has the following parameters: - `client`: An instance of arango.client.ArangoClient. Defaults to `None`. If not provided, the **hosts** argument must be provided. - `hosts`: The ArangoDB host(s) to connect to. This can be a single host, or a list of hosts. @@ -67,12 +67,10 @@ It requires the following parameters: - `password`: The ArangoDB password to use for authentication. - `user_token`: The ArangoDB user token to use for authentication. This is an alternative to username/password authentication. -- `ca_cert_file`: (Optional) The path to the CA certificate file to use for TLS - verification. -- `user_token`: (Optional) The ArangoDB user token to use for authentication. - This is an alternative to username/password authentication. +- `ca_cert_file`: The path to the CA certificate file to use for TLS + verification. Defaults to `None`. - `api_endpoint`: The URL to the ArangoGraphML API Service. -- `settings`: (Optional) A list of secrets files to be loaded as settings. Parameters provided as arguments will override those in the settings files (e.g `settings.toml`). +- `settings_files`: A list of secrets files to be loaded as settings. Parameters provided as arguments will override those in the settings files (e.g `settings.toml`). - `version`: The ArangoML API date version. Defaults to the latest version. It is possible to instantiate an ArangoML object in multiple ways: @@ -188,7 +186,7 @@ Let's get started! {{< tab "ArangoGraphML" >}} -The [`arango_datasets` Python package](../../components/tools/arango-datasets.md) +The [`arango-datasets`](../../components/tools/arango-datasets.md) Python package allows you to load pre-defined datasets into ArangoDB. It comes pre-installed in the ArangoGraphML notebook environment. @@ -205,7 +203,7 @@ DATASET_NAME = "OPEN_INTELLIGENCE_ANGOLA" {{< tab "Self-managed" >}} -The [`arango_datasets` Python package](../../components/tools/arango-datasets.md) +The [`arango-datasets`](../../components/tools/arango-datasets.md) Python package allows you to load pre-defined datasets into ArangoDB. It can be installed with the following command: @@ -273,7 +271,8 @@ arangoml.projects.list_projects() - `outputName`: Adjust the default feature name. This can be any valid ArangoDB attribute name. Defaults to `x`. - `dimensionalityReduction`: Object configuring dimensionality reduction. 
- - `disabled`: Boolean for enabling or disabling dimensionality reduction. Default is `false`.
+ - `disabled`: Whether to disable dimensionality reduction. Default is `false`,
+ therefore dimensionality reduction is applied after Featurization by default.
- `size`: The number of dimensions to reduce the feature length to. Default is `512`.
- `defaultsPerFeatureType`: A dictionary mapping each feature to how missing or mismatched values should be handled. The keys of this dictionary are the features, and the values are sub-dictionaries with the following keys:
@@ -286,11 +285,11 @@ arangoml.projects.list_projects()
- `jobConfiguration` Optional: A set of configurations that are applied to the job.
- `batchSize`: The number of documents to process in a single batch. Default is `32`.
- - `runAnalysisChecks`: Boolean for enabling or disabling analysis checks. Default is `true`.
- - `skipLabels`: Boolean for enabling or disabling label skipping. Default is `false`.
- - `overwriteFSGraph`: Boolean for enabling or disabling overwriting the feature store graph. Default is `false`.
- - `writeToSourceGraph`: Boolean for enabling or disabling writing features to the source graph. Default is `true`.
- - `useFeatureStore`: Boolean for enabling or disabling the use of the feature store. Default is `false`.
+ - `runAnalysisChecks`: Whether to run analysis checks, used to perform a high-level analysis of the data quality before proceeding. Default is `true`.
+ - `skipLabels`: Skips the featurization process for attributes marked as `label`. Default is `false`.
+ - `useFeatureStore`: Enables the use of the Feature Store database, which allows you to store features separately from your Source Database. Default is `false`, therefore features are written to the source graph.
+ - `overwriteFSGraph`: Whether to overwrite the Feature Store Graph if features were previously generated. Default is `false`, therefore features are written to an existing Feature Store Graph.
+ - `writeToSourceGraph`: Whether to store the generated features on the Source Graph. Default is `true`.
- `metagraph`: Metadata to represent the vertex & edge collections of the graph.
- `vertexCollections`: A dictionary mapping the vertex collection names to the following values:
@@ -299,8 +298,8 @@ arangoml.projects.list_projects()
- `config`: Collection-level configuration settings.
- `featurePrefix`: Identical to global `featurePrefix` but for this collection.
- `dimensionalityReduction`: Identical to global `dimensionalityReduction` but for this collection.
- - `outputName`: Identical to global `outputName` but for this collection.
- - `defaultsPerFeatureType`: Identical to global `defaultsPerFeatureType` but for this collection.
+ - `outputName`: Identical to global `outputName`, but specifically for this collection.
+ - `defaultsPerFeatureType`: Identical to global `defaultsPerFeatureType`, but specifically for this collection.
- `edgeCollections`: A dictionary mapping the edge collection names to an empty dictionary, as edge attributes are not currently supported.
The Featurization Specification example is used for the GDELT dataset:
@@ -517,7 +516,7 @@ arangoml.jobs.cancel_job(prediction_job.job_id)
**API Documentation: [ArangoML.jobs.train](https://arangoml.github.io/arangoml/api.html#agml_api.jobs.v1.api.jobs_api.JobsApi.train)**
-Training Graph Machine Learning Models with ArangoGraphML only requires two steps:
+Training Graph Machine Learning Models with ArangoGraphML requires two steps:
1. 
Describe which data points should be included in the Training Job. 2. Pass the Training Specification to the Training Service. @@ -536,7 +535,12 @@ Training Graph Machine Learning Models with ArangoGraphML only requires two step - `targetCollection`: The ArangoDB collection name that contains the prediction label. - `inputFeatures`: The name of the feature to be used as input. - `labelField`: The name of the attribute to be predicted. - - `batchSize`: The number of documents to process in a single batch. Default is `64`. + - `batchSize`: The number of documents to process in a single training batch. Default is `64`. + - `graphEmbeddings`: Dictionary to describe the Graph Embedding Task Specification. + - `targetCollection`: The ArangoDB collection used to generate the embeddings. + - `embeddingSize`: The size of the embedding vector. Default is `128`. + - `batchSize`: The number of documents to process in a single training batch. Default is `64`. + - `generateEmbeddings`: Whether to generate embeddings on the training dataset. Default is `false`. - `metagraph`: Metadata to represent the vertex & edge collections of the graph. If `featureSetID` is provided, this can be omitted. - `graph`: The ArangoDB graph name. @@ -549,7 +553,6 @@ A Training Specification allows for concisely defining your training task in a single object and then passing that object to the training service using the Python API client, as shown below. - The ArangoGraphML Training Service is responsible for training a series of Graph Machine Learning Models using the data provided in the Training Specification. It assumes that the data has been featurized and is ready to be @@ -560,6 +563,8 @@ Given that we have run a Featurization Job, we can create the Training Specifica ```py # 1. 
Define the Training Specification +# Node Classification example + training_spec = { "featureSetID": featurization_job_result.result.feature_set_id, "mlSpec": { @@ -570,6 +575,20 @@ training_spec = { } }, } + +# Node Embedding example +# NOTE: Full Graph Embeddings support is coming soon + +training_spec = { + "featureSetID": featurization_job_result.result.feature_set_id, + "mlSpec": { + "graphEmbeddings": { + "targetCollection": "Event", + "embeddingSize": 128, + "generateEmbeddings": True, + } + }, +} ``` Once the specification has been defined, a Training Job can be triggered using the `arangoml.jobs.train` method: @@ -588,7 +607,7 @@ Once a Training Job has been submitted, you can wait for it to complete using th training_job_result = arangoml.wait_for_training(training_job.job_id) ``` -**Example Output:** +**Example Output (Node Classification):** ```py { "job_id": "691ceb2f-1931-492a-b4eb-0536925a4697", @@ -649,6 +668,65 @@ training_job_result = arangoml.wait_for_training(training_job.job_id) } ``` +**Example Output (Node Embeddings):** +```py +{ + "job_id": "6047e53a-f1dd-4725-83e8-74ac44629c11", + "job_status": "COMPLETED", + "project_name": "OPEN_INTELLIGENCE_ANGOLA_GraphML_Node_Embeddings", + "project_id": "647025872", + "database_name": "OPEN_INTELLIGENCE_ANGOLA", + "ml_spec": { + "graphEmbeddings": { + "targetCollection": "Event", + "embeddingLevel": "NODE_EMBEDDINGS", + "embeddingSize": 128, + "embeddingTrainingType": "UNSUPERVISED", + "batchSize": 64, + "generateEmbeddings": true, + "bestModelSelection": "BEST_LOSS", + "persistModels": "ALL_MODELS", + "modelConfigurations": {} + } + }, + "metagraph": { + "graph": "OPEN_INTELLIGENCE_ANGOLA", + "vertexCollections": { + "Actor": { + "x": "OPEN_INTELLIGENCE_ANGOLA_x" + }, + "Country": { + "x": "OPEN_INTELLIGENCE_ANGOLA_x" + }, + "Event": { + "x": "OPEN_INTELLIGENCE_ANGOLA_x", + "y": "OPEN_INTELLIGENCE_ANGOLA_y" + }, + "Source": { + "x": "OPEN_INTELLIGENCE_ANGOLA_x" + }, + "Location": { + "x": "OPEN_INTELLIGENCE_ANGOLA_x" + }, + "Region": { + "x": "OPEN_INTELLIGENCE_ANGOLA_x" + } + }, + "edgeCollections": { + "eventActor": {}, + "hasSource": {}, + "hasLocation": {}, + "inCountry": {}, + "inRegion": {} + } + }, + "time_submitted": "2025-03-27T02:55:15.099680", + "time_started": "2025-03-27T02:57:25.143948", + "time_ended": "2025-03-27T03:01:24.619737", + "training_type": "Training" +} +``` + You can also cancel a Training Job using the `arangoml.jobs.cancel_job` method: ```py @@ -674,10 +752,15 @@ models = arangoml.list_models( print(len(models)) ``` - The cell below selects the model with the highest **test accuracy** using [ArangoML.get_best_model](https://arangoml.github.io/arangoml/client.html#arangoml.main.ArangoML.get_best_model), but there may be other factors that motivate you to choose another model. See the `model_statistics` in the output field below for more information on the full list of available metrics. ```py + +# 2. 
Select the best Model + +# Get best Node Classification Model +# Sort by highest test accuracy + best_model = arangoml.get_best_model( project.name, training_job.job_id, @@ -685,10 +768,21 @@ best_model = arangoml.get_best_model( sort_child_key="accuracy", ) +# Get best Graph Embedding Model +# Sort by lowest loss + +best_model = arangoml.get_best_model( + project.name, + training_job.job_id, + sort_parent_key="loss", + sort_child_key=None, + reverse=False +) + print(best_model) ``` -**Example Output:** +**Example Output (Node Classification):** ```py { "job_id": "691ceb2f-1931-492a-b4eb-0536925a4697", @@ -722,6 +816,22 @@ print(best_model) } ``` +**Example Output (Node Embeddings):** +```py +{ + "job_id": "6047e53a-f1dd-4725-83e8-74ac44629c11", + "model_id": "55ae93c2-3497-4405-9c63-0fa0e4a5b5bd", + "model_display_name": "graphsageencdec Model", + "model_name": "graphsageencdec Model 55ae93c2-3497-4405-9c63-0fa0e4a5b5bd", + "model_statistics": { + "loss": 0.13700408464796796, + "val_acc": 0.5795393939393939, + "test_acc": 0.5809545454545455 + }, + "model_tasks": [ "GRAPH_EMBEDDINGS" ] +} +``` + ## Prediction **API Documentation: [ArangoML.jobs.predict](https://arangoml.github.io/arangoml/api.html#agml_api.jobs.v1.api.jobs_api.JobsApi.predict)** @@ -739,15 +849,24 @@ collection, or within the source documents. - `featurizeNewDocuments`: Boolean for enabling or disabling the featurization of new documents. Useful if you don't want to re-train the model upon new data. Default is `false`. - `featurizeOutdatedDocuments`: Boolean for enabling or disabling the featurization of outdated documents. Outdated documents are those whose features have changed since the last featurization. Default is `false`. - `schedule`: A cron expression to schedule the prediction job (e.g `0 0 * * *` for daily predictions). Default is `None`. - +- `embeddingsField`: The name of the field to store the generated embeddings. This is only used for Graph Embedding tasks. Default is `None`. ```py # 1. Define the Prediction Specification +# Node Classification Example +prediction_spec = { + "projectName": project.name, + "databaseName": dataset_db.name, + "modelID": best_model.model_id, +} + +# Node Embedding Example prediction_spec = { "projectName": project.name, "databaseName": dataset_db.name, "modelID": best_model.model_id, + "embeddingsField": "embeddings" } ``` @@ -756,7 +875,12 @@ Once the specification has been defined, a Prediction Job can be triggered using ```py # 2. 
Submit a Prediction Job + +# For Node Classification prediction_job = arangoml.jobs.predict(prediction_spec) + +# For Graph Embeddings +prediction_job = arangoml.jobs.generate(prediction_spec) ``` Similar to the Training Service, we can wait for a Prediction Job to complete with the `arangoml.wait_for_prediction` method: @@ -767,7 +891,7 @@ Similar to the Training Service, we can wait for a Prediction Job to complete wi prediction_job_result = arangoml.wait_for_prediction(prediction_job.job_id) ``` -**Example Output:** +**Example Output (Node Classification):** ```py { "job_id": "b2a422bb-5650-4fbc-ba6b-0578af0049d9", @@ -789,15 +913,37 @@ prediction_job_result = arangoml.wait_for_prediction(prediction_job.job_id) } ``` +**Example Output (Node Embeddings):** +```py +{ + "job_id": "25260362-9764-47d0-abb4-247cbdce6c7b", + "job_status": "COMPLETED", + "project_name": "OPEN_INTELLIGENCE_ANGOLA_GraphML_Node_Embeddings", + "project_id": "647025872", + "database_name": "OPEN_INTELLIGENCE_ANGOLA", + "model_id": "55ae93c2-3497-4405-9c63-0fa0e4a5b5bd", + "job_state_information": { + "outputGraphName": "OPEN_INTELLIGENCE_ANGOLA", + "outputCollectionName": "Event", + "outputAttribute": "embeddings", + "numberOfPredictedDocuments": 0, # 0 All documents already have up-to-date embeddings + }, + "time_submitted": "2025-03-27T14:02:33.094191", + "time_started": "2025-03-27T14:09:34.206659", + "time_ended": "2025-03-27T14:09:35.791630", + "prediction_type": "Prediction" +} +``` + You can also cancel a Prediction Job using the `arangoml.jobs.cancel_job` method: ```py arangoml.jobs.cancel_job(prediction_job.job_id) ``` -### Viewing Predictions +### Viewing Inference Results -We can now access our predictions via AQL: +We can now access our results via AQL: ```py import json @@ -814,4 +960,8 @@ query = f""" docs = list(dataset_db.aql.execute(query)) print(json.dumps(docs, indent=2)) -``` \ No newline at end of file +``` + +## What's next + +With the generated Feature (and optionally Node) Embeddings, you can now use them for downstream tasks like clustering, anomaly detection, and link prediction. Consider using [ArangoDB's Vector Search](https://arangodb.com/2024/11/vector-search-in-arangodb-practical-insights-and-hands-on-examples/) capabilities to find similar nodes based on their embeddings. diff --git a/site/content/3.12/components/tools/arango-datasets.md b/site/content/3.12/components/tools/arango-datasets.md index 78dfbe2950..a24903550f 100644 --- a/site/content/3.12/components/tools/arango-datasets.md +++ b/site/content/3.12/components/tools/arango-datasets.md @@ -3,9 +3,9 @@ title: ArangoDB Datasets menuTitle: ArangoDB Datasets weight: 60 description: >- - `arango_datasets` is a Python package for loading sample datasets into ArangoDB + `arango-datasets` is a Python package for loading sample datasets into ArangoDB --- -You can use the `arango_datasets` package in conjunction with the `python-arango` +You can use the `arango-datasets` package in conjunction with the `python-arango` driver to load example data into your ArangoDB deployments. The data is hosted on AWS S3. There are a number of existing datasets already available and you can view them by calling the `list_datasets()` method as shown below. 
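For reference, a minimal sketch of that call, assuming a locally reachable deployment and placeholder credentials. As described in the Usage section, the `Datasets` constructor takes a `python-arango` database object.

```py
from arango import ArangoClient
from arango_datasets import Datasets

# Connect to your deployment (host and credentials are placeholders).
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("_system", username="root", password="")

# Print the names of the available sample datasets.
print(Datasets(db).list_datasets())
```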
@@ -24,7 +24,7 @@ You can find the source code repository of the module on GitHub: ## Usage -Once you have installed the `arango_datasets` package, you can use it to +Once you have installed the `arango-datasets` package, you can use it to download and import datasets into your deployment with `arango_datasets.Datasets`. The `Datasets` constructor requires a valid [python-arango](../../develop/drivers/python.md) diff --git a/site/content/3.12/data-science/arangographml/_index.md b/site/content/3.12/data-science/arangographml/_index.md index 9b1780f3c6..2d5d3324de 100644 --- a/site/content/3.12/data-science/arangographml/_index.md +++ b/site/content/3.12/data-science/arangographml/_index.md @@ -7,33 +7,37 @@ description: >- aliases: - graphml --- -Traditional machine learning overlooks the connections and relationships +Traditional Machine Learning (ML) overlooks the connections and relationships between data points, which is where graph machine learning excels. However, accessibility to GraphML has been limited to sizable enterprises equipped with -specialized teams of data scientists. ArangoGraphML, on the other hand, -simplifies the utilization of GraphML, enabling a broader range of personas to -extract profound insights from their data. +specialized teams of data scientists. ArangoGraphML simplifies the utilization of GraphML, +enabling a broader range of personas to extract profound insights from their data. ## How GraphML works -GraphML focuses on the utilization of neural networks specifically for -graph-related tasks. It is well-suited for addressing vague or fuzzy problems -and facilitating their resolution. The process involves incorporating a graph's -topology (node and edge structure) and the node and edge characteristics and -features to create a numerical representation known as an embedding. +Graph machine learning leverages the inherent structure of graph data, where +entities (nodes) and their relationships (edges) form a network. Unlike +traditional ML, which primarily operates on tabular data, GraphML applies +specialized algorithms like Graph Neural Networks (GNNs), node embeddings, and +link prediction to uncover complex patterns and insights. + +1. **Graph Construction**: + Raw data is transformed into a graph structure, defining nodes and edges based + on real-world relationships. +2. **Featurization**: + Nodes and edges are enriched with features that help in training predictive models. +3. **Model Training**: + Machine learning techniques are applied on GNNs to identify patterns and make predictions. +4. **Inference & Insights**: + The trained model is used to classify nodes, detect anomalies, recommend items, + or predict future connections. + +ArangoGraphML streamlines these steps, providing an intuitive and scalable +framework to integrate GraphML into various applications, from fraud detection +to recommendation systems. ![GraphML Embeddings](../../../images/GraphML-Embeddings.webp) -Graph Neural Networks (GNNs) are explicitly designed to learn meaningful -numerical representations, or embeddings, for nodes and edges in a graph. - -By applying a series of steps, GNNs effectively create graph embeddings, -which are numerical representations that encode the essential information -about the nodes and edges in the graph. These embeddings can then be used -for various tasks, such as node classification, link prediction, and -graph-level classification, where the model can make predictions based on the -learned patterns and relationships within the graph. 
- ![GraphML Workflow](../../../images/GraphML-How-it-works.webp) It is no longer necessary to understand the complexities involved with graph @@ -45,71 +49,133 @@ The platform comes preloaded with all the tools needed to prepare your graph for machine learning, high-accuracy training, and persisting predictions back to the database for application use. -### Classification - -Node classification is a natural fit for graph databases as it can leverage -existing graph analytics insights during model training. For instance, if you -have performed some community detection, you can use these insights as inputs -for graph machine learning. - -#### What is Node Classification - -The goal of node classification is to categorize the nodes in a graph based on -their neighborhood connections and characteristics in the graph. Based on the -behaviors or patterns in the graph, the Graph Neural Network (GNN) will be able -to learn what makes a node belong to a category. - -Node classification can be used to solve complex problems such as: -- Entity Categorization - - Email - - Books - - WebPage - - Transaction -- Social Networks - - Events - - Friends - - Interests -- BioPharmaceutical - - Protein-protein interaction - - Drug Categorization - - Sequence grouping -- Behavior - - Fraud - - Purchase/decision making - - Anomaly - -Many use cases can be solved with node classification. With many challenges, -there are multiple ways to attempt to solve them, and that's why the -ArangoGraphML node classification is only the first of many techniques to be -introduced. You can sign up to get immediate access to our latest stable -features and also try out other features included in the pipeline, such as -embedding similarity or link prediction. - -For more information, [get in touch](https://www.arangodb.com/contact/) -with the ArangoDB team. - -### Metrics and Compliance - -#### Training Performance - -Before using a model to provide predictions to your application, there needs -to be a way to determine its level of accuracy. Additionally, a mechanism must -be in place to ensure the experiments comply with auditor requirements. - -ArangoGraphML supports these objectives by storing all relevant training data -and metrics in a metadata graph, which is only available to you and is never -viewable by ArangoDB. This metagraph contains valuable training metrics such as -average accuracy (the general metric for determining model performance), F1, -Recall, Precision, and confusion matrix data. This graph links all experiments +## Supported Tasks + +### Node Classification + +Node classification is a **supervised learning** task where the goal is to +predict the label of a node based on both its own features and its relationships +within the graph. It requires a set of labeled nodes to train a model, which then +classifies unlabeled nodes based on learned patterns. + +**How it works in ArangoGraphML** + +- A portion of the nodes in a graph is labeled for training. +- The model learns patterns from both **node features** and + **structural relationships** (neighboring nodes and connections). +- It predicts labels for unlabeled nodes based on these learned patterns. + +**Example Use Cases** + +- **Fraud Detection in Financial Networks** + - **Problem:** Fraudsters often create multiple accounts or interact within + suspicious clusters to evade detection. + - **Solution:** A transaction graph is built where nodes represent users and + edges represent transactions. 
The model learns patterns from labeled + fraudulent and legitimate users, detecting hidden fraud rings based on + **both user attributes and transaction relationships**. + +- **Customer Segmentation in E-Commerce & Social Media** + - **Problem:** Businesses need to categorize customers based on purchasing + behavior and engagement. + - **Solution:** A graph is built where nodes represent customers and edges + represent interactions (purchases, reviews, social connections). The model + predicts the category of each user based on how similar they are to other users + **not just by their personal data, but also by how they are connected to others**. + +- **Disease Classification in Biomedical Networks** + - **Problem:** Identifying proteins or genes associated with a disease. + - **Solution:** A protein interaction graph is built where nodes are proteins + and edges represent biochemical interactions. The model classifies unknown + proteins based on their interactions with known disease-related proteins, + rather than just their individual properties. + +### Node Embedding Generation + +Node embedding is an **unsupervised learning** technique that converts nodes +into numerical vector representations, preserving their **structural relationships** +within the graph. Unlike simple feature aggregation, node embeddings +**capture the influence of neighboring nodes and graph topology**, making +them powerful for downstream tasks like clustering, anomaly detection, +and link prediction. These combinations can provide valuable insights. +Consider using [ArangoDB's Vector Search](https://arangodb.com/2024/11/vector-search-in-arangodb-practical-insights-and-hands-on-examples/) +capabilities to find similar nodes based on their embeddings. + +**Feature Embeddings versus Node Embeddings** + +**Feature Embeddings** are vector representations derived from the attributes or +features associated with nodes. These embeddings aim to capture the inherent +characteristics of the data. For example, in a social network, a +feature embedding might encode user attributes like age, location, and +interests. Techniques like **Word2Vec**, **TF-IDF**, or **autoencoders** are +commonly used to generate such embeddings. + +In the context of graphs, **Node Embeddings** are a +**combination of a node's feature embedding and the structural information from its connected edges**. +Essentially, they aggregate both the node's attributes and the connectivity patterns +within the graph. This fusion helps capture not only the individual properties of +a node but also its position and role within the network. + +**How it works in ArangoGraphML** + +- The model learns an embedding (a vector representation) for each node based on its + **position within the graph and its connections**. +- It **does not rely on labeled data** – instead, it captures structural patterns + through graph traversal and aggregation of neighbor information. +- These embeddings can be used for similarity searches, clustering, and predictive tasks. + +**Example Use Cases** + +- **Recommendation Systems (E-commerce & Streaming Platforms)** + - **Problem:** Platforms like Amazon, Netflix, and Spotify need to recommend products, + movies, or songs. + - **Solution:** A user-item interaction graph is built where nodes are users + and products, and edges represent interactions (purchases, ratings, listens). 
+ **Embeddings encode relationships**, allowing the system to recommend similar + items based on user behavior and network influence rather than just individual + preferences. + +- **Anomaly Detection in Cybersecurity & Finance** + - **Problem:** Detecting unusual activity (e.g., cyber attacks, money laundering) + in complex networks. + - **Solution:** A network of IP addresses, users, and transactions is represented as + a graph. Nodes with embeddings that significantly deviate from normal patterns + are flagged as potential threats. The key advantage here is that anomalies are + detected based on **network structure, not just individual activity logs**. + +- **Link Prediction (Social & Knowledge Graphs)** + - **Problem:** Predicting new relationships, such as suggesting friends on + social media or forecasting research paper citations. + - **Solution:** A social network graph is created where nodes are users, and + edges represent friendships. **Embeddings capture the likelihood of + connections forming based on shared neighborhoods and structural + similarities, even if users have never interacted before**. + +### Key Differences + +| Feature | Node Classification | Node Embedding Generation | +|-----------------------|---------------------|----------------------------| +| **Learning Type** | Supervised | Unsupervised | +| **Input Data** | Labeled nodes | Graph structure & features | +| **Output** | Predicted labels | Node embeddings (vectors) | +| **Key Advantage** | Learns labels based on node connections and attributes | Learns structural patterns and node relationships | +| **Use Cases** | Fraud detection, customer segmentation, disease classification | Recommendations, anomaly detection, link prediction | + +ArangoGraphML provides the infrastructure to efficiently train and apply these +models, helping users extract meaningful insights from complex graph data. + +## Metrics and Compliance + +ArangoGraphML supports tracking your ML pipeline by storing all relevant metadata +and metrics in a Graph called ArangoPipe. This is only available to you and is never +viewable by ArangoDB. This metadata graph links all experiments to the source data, feature generation activities, training runs, and prediction -jobs. Having everything linked across the entire pipeline ensures that, at any -time, anything done that could be considered associated with sensitive user data, -it is logged and easily accessible. +jobs, allowing you to track the entire ML pipeline without having to leave ArangoDB. ### Security Each deployment that uses ArangoGraphML has an `arangopipe` database created, -which houses all this information. Since the data lives with the deployment, +which houses all ML Metadata information. Since this data lives within the deployment, it benefits from the ArangoGraph SOC 2 compliance and Enterprise security features. All ArangoGraphML services live alongside the ArangoGraph deployment and are only -accessible within that organization. \ No newline at end of file +accessible within that organization. diff --git a/site/content/3.12/data-science/arangographml/getting-started.md b/site/content/3.12/data-science/arangographml/getting-started.md index 8a485a254d..6bd614167e 100644 --- a/site/content/3.12/data-science/arangographml/getting-started.md +++ b/site/content/3.12/data-science/arangographml/getting-started.md @@ -59,7 +59,7 @@ ArangoGraphML comes with other ArangoDB Magic Commands! 
See the full list [here] **API Documentation: [arangoml.ArangoML](https://arangoml.github.io/arangoml/client.html#arangoml.main.ArangoML)** The `ArangoML` class is the main entry point for the `arangoml` package. -It requires the following parameters: +It has the following parameters: - `client`: An instance of arango.client.ArangoClient. Defaults to `None`. If not provided, the **hosts** argument must be provided. - `hosts`: The ArangoDB host(s) to connect to. This can be a single host, or a list of hosts. @@ -67,12 +67,10 @@ It requires the following parameters: - `password`: The ArangoDB password to use for authentication. - `user_token`: The ArangoDB user token to use for authentication. This is an alternative to username/password authentication. -- `ca_cert_file`: (Optional) The path to the CA certificate file to use for TLS - verification. -- `user_token`: (Optional) The ArangoDB user token to use for authentication. - This is an alternative to username/password authentication. +- `ca_cert_file`: The path to the CA certificate file to use for TLS + verification. Defaults to `None`. - `api_endpoint`: The URL to the ArangoGraphML API Service. -- `settings`: (Optional) A list of secrets files to be loaded as settings. Parameters provided as arguments will override those in the settings files (e.g `settings.toml`). +- `settings_files`: A list of secrets files to be loaded as settings. Parameters provided as arguments will override those in the settings files (e.g `settings.toml`). - `version`: The ArangoML API date version. Defaults to the latest version. It is possible to instantiate an ArangoML object in multiple ways: @@ -188,7 +186,7 @@ Let's get started! {{< tab "ArangoGraphML" >}} -The [`arango_datasets` Python package](../../components/tools/arango-datasets.md) +The [`arango-datasets`](../../components/tools/arango-datasets.md) Python package allows you to load pre-defined datasets into ArangoDB. It comes pre-installed in the ArangoGraphML notebook environment. @@ -205,7 +203,7 @@ DATASET_NAME = "OPEN_INTELLIGENCE_ANGOLA" {{< tab "Self-managed" >}} -The [`arango_datasets` Python package](../../components/tools/arango-datasets.md) +The [`arango-datasets`](../../components/tools/arango-datasets.md) Python package allows you to load pre-defined datasets into ArangoDB. It can be installed with the following command: @@ -273,7 +271,8 @@ arangoml.projects.list_projects() - `outputName`: Adjust the default feature name. This can be any valid ArangoDB attribute name. Defaults to `x`. - `dimensionalityReduction`: Object configuring dimensionality reduction. - - `disabled`: Boolean for enabling or disabling dimensionality reduction. Default is `false`. + - `disabled`: Whether to disable dimensionality reduction. Default is `false`, + therefore dimensionality reduction is applied after Featurization by default. - `size`: The number of dimensions to reduce the feature length to. Default is `512`. - `defaultsPerFeatureType`: A dictionary mapping each feature to how missing or mismatched values should be handled. The keys of this dictionary are the features, and the values are sub-dictionaries with the following keys: @@ -286,11 +285,11 @@ arangoml.projects.list_projects() - `jobConfiguration` Optional: A set of configurations that are applied to the job. - `batchSize`: The number of documents to process in a single batch. Default is `32`. - - `runAnalysisChecks`: Boolean for enabling or disabling analysis checks. Default is `true`. 
- - `skipLabels`: Boolean for enabling or disabling label skipping. Default is `false`.
- - `overwriteFSGraph`: Boolean for enabling or disabling overwriting the feature store graph. Default is `false`.
- - `writeToSourceGraph`: Boolean for enabling or disabling writing features to the source graph. Default is `true`.
- - `useFeatureStore`: Boolean for enabling or disabling the use of the feature store. Default is `false`.
+ - `runAnalysisChecks`: Whether to run analysis checks, used to perform a high-level analysis of the data quality before proceeding. Default is `true`.
+ - `skipLabels`: Skips the featurization process for attributes marked as `label`. Default is `false`.
+ - `useFeatureStore`: Enables the use of the Feature Store database, which allows you to store features separately from your Source Database. Default is `false`, therefore features are written to the source graph.
+ - `overwriteFSGraph`: Whether to overwrite the Feature Store Graph if features were previously generated. Default is `false`, therefore features are written to an existing Feature Store Graph.
+ - `writeToSourceGraph`: Whether to store the generated features on the Source Graph. Default is `true`.
- `metagraph`: Metadata to represent the vertex & edge collections of the graph.
- `vertexCollections`: A dictionary mapping the vertex collection names to the following values:
@@ -299,8 +298,8 @@ arangoml.projects.list_projects()
- `config`: Collection-level configuration settings.
- `featurePrefix`: Identical to global `featurePrefix` but for this collection.
- `dimensionalityReduction`: Identical to global `dimensionalityReduction` but for this collection.
- - `outputName`: Identical to global `outputName` but for this collection.
- - `defaultsPerFeatureType`: Identical to global `defaultsPerFeatureType` but for this collection.
+ - `outputName`: Identical to global `outputName`, but specifically for this collection.
+ - `defaultsPerFeatureType`: Identical to global `defaultsPerFeatureType`, but specifically for this collection.
- `edgeCollections`: A dictionary mapping the edge collection names to an empty dictionary, as edge attributes are not currently supported.
The Featurization Specification example is used for the GDELT dataset:
@@ -517,7 +516,7 @@ arangoml.jobs.cancel_job(prediction_job.job_id)
**API Documentation: [ArangoML.jobs.train](https://arangoml.github.io/arangoml/api.html#agml_api.jobs.v1.api.jobs_api.JobsApi.train)**
-Training Graph Machine Learning Models with ArangoGraphML only requires two steps:
+Training Graph Machine Learning Models with ArangoGraphML requires two steps:
1. Describe which data points should be included in the Training Job.
2. Pass the Training Specification to the Training Service.
@@ -536,7 +535,12 @@ Training Graph Machine Learning Models with ArangoGraphML only requires two step
- `targetCollection`: The ArangoDB collection name that contains the prediction label.
- `inputFeatures`: The name of the feature to be used as input.
- `labelField`: The name of the attribute to be predicted.
- - `batchSize`: The number of documents to process in a single batch. Default is `64`.
+ - `batchSize`: The number of documents to process in a single training batch. Default is `64`.
+ - `graphEmbeddings`: Dictionary to describe the Graph Embedding Task Specification.
+ - `targetCollection`: The ArangoDB collection used to generate the embeddings.
+ - `embeddingSize`: The size of the embedding vector. Default is `128`. 
+ - `batchSize`: The number of documents to process in a single training batch. Default is `64`. + - `generateEmbeddings`: Whether to generate embeddings on the training dataset. Default is `false`. - `metagraph`: Metadata to represent the vertex & edge collections of the graph. If `featureSetID` is provided, this can be omitted. - `graph`: The ArangoDB graph name. @@ -549,7 +553,6 @@ A Training Specification allows for concisely defining your training task in a single object and then passing that object to the training service using the Python API client, as shown below. - The ArangoGraphML Training Service is responsible for training a series of Graph Machine Learning Models using the data provided in the Training Specification. It assumes that the data has been featurized and is ready to be @@ -560,6 +563,8 @@ Given that we have run a Featurization Job, we can create the Training Specifica ```py # 1. Define the Training Specification +# Node Classification example + training_spec = { "featureSetID": featurization_job_result.result.feature_set_id, "mlSpec": { @@ -570,6 +575,20 @@ training_spec = { } }, } + +# Node Embedding example +# NOTE: Full Graph Embeddings support is coming soon + +training_spec = { + "featureSetID": featurization_job_result.result.feature_set_id, + "mlSpec": { + "graphEmbeddings": { + "targetCollection": "Event", + "embeddingSize": 128, + "generateEmbeddings": True, + } + }, +} ``` Once the specification has been defined, a Training Job can be triggered using the `arangoml.jobs.train` method: @@ -588,7 +607,7 @@ Once a Training Job has been submitted, you can wait for it to complete using th training_job_result = arangoml.wait_for_training(training_job.job_id) ``` -**Example Output:** +**Example Output (Node Classification):** ```py { "job_id": "691ceb2f-1931-492a-b4eb-0536925a4697", @@ -649,6 +668,65 @@ training_job_result = arangoml.wait_for_training(training_job.job_id) } ``` +**Example Output (Node Embeddings):** +```py +{ + "job_id": "6047e53a-f1dd-4725-83e8-74ac44629c11", + "job_status": "COMPLETED", + "project_name": "OPEN_INTELLIGENCE_ANGOLA_GraphML_Node_Embeddings", + "project_id": "647025872", + "database_name": "OPEN_INTELLIGENCE_ANGOLA", + "ml_spec": { + "graphEmbeddings": { + "targetCollection": "Event", + "embeddingLevel": "NODE_EMBEDDINGS", + "embeddingSize": 128, + "embeddingTrainingType": "UNSUPERVISED", + "batchSize": 64, + "generateEmbeddings": true, + "bestModelSelection": "BEST_LOSS", + "persistModels": "ALL_MODELS", + "modelConfigurations": {} + } + }, + "metagraph": { + "graph": "OPEN_INTELLIGENCE_ANGOLA", + "vertexCollections": { + "Actor": { + "x": "OPEN_INTELLIGENCE_ANGOLA_x" + }, + "Country": { + "x": "OPEN_INTELLIGENCE_ANGOLA_x" + }, + "Event": { + "x": "OPEN_INTELLIGENCE_ANGOLA_x", + "y": "OPEN_INTELLIGENCE_ANGOLA_y" + }, + "Source": { + "x": "OPEN_INTELLIGENCE_ANGOLA_x" + }, + "Location": { + "x": "OPEN_INTELLIGENCE_ANGOLA_x" + }, + "Region": { + "x": "OPEN_INTELLIGENCE_ANGOLA_x" + } + }, + "edgeCollections": { + "eventActor": {}, + "hasSource": {}, + "hasLocation": {}, + "inCountry": {}, + "inRegion": {} + } + }, + "time_submitted": "2025-03-27T02:55:15.099680", + "time_started": "2025-03-27T02:57:25.143948", + "time_ended": "2025-03-27T03:01:24.619737", + "training_type": "Training" +} +``` + You can also cancel a Training Job using the `arangoml.jobs.cancel_job` method: ```py @@ -674,10 +752,15 @@ models = arangoml.list_models( print(len(models)) ``` - The cell below selects the model with the highest **test accuracy** using 
[ArangoML.get_best_model](https://arangoml.github.io/arangoml/client.html#arangoml.main.ArangoML.get_best_model), but there may be other factors that motivate you to choose another model. See the `model_statistics` in the output field below for more information on the full list of available metrics. ```py + +# 2. Select the best Model + +# Get best Node Classification Model +# Sort by highest test accuracy + best_model = arangoml.get_best_model( project.name, training_job.job_id, @@ -685,10 +768,21 @@ best_model = arangoml.get_best_model( sort_child_key="accuracy", ) +# Get best Graph Embedding Model +# Sort by lowest loss + +best_model = arangoml.get_best_model( + project.name, + training_job.job_id, + sort_parent_key="loss", + sort_child_key=None, + reverse=False +) + print(best_model) ``` -**Example Output:** +**Example Output (Node Classification):** ```py { "job_id": "691ceb2f-1931-492a-b4eb-0536925a4697", @@ -722,6 +816,22 @@ print(best_model) } ``` +**Example Output (Node Embeddings):** +```py +{ + "job_id": "6047e53a-f1dd-4725-83e8-74ac44629c11", + "model_id": "55ae93c2-3497-4405-9c63-0fa0e4a5b5bd", + "model_display_name": "graphsageencdec Model", + "model_name": "graphsageencdec Model 55ae93c2-3497-4405-9c63-0fa0e4a5b5bd", + "model_statistics": { + "loss": 0.13700408464796796, + "val_acc": 0.5795393939393939, + "test_acc": 0.5809545454545455 + }, + "model_tasks": [ "GRAPH_EMBEDDINGS" ] +} +``` + ## Prediction **API Documentation: [ArangoML.jobs.predict](https://arangoml.github.io/arangoml/api.html#agml_api.jobs.v1.api.jobs_api.JobsApi.predict)** @@ -739,15 +849,24 @@ collection, or within the source documents. - `featurizeNewDocuments`: Boolean for enabling or disabling the featurization of new documents. Useful if you don't want to re-train the model upon new data. Default is `false`. - `featurizeOutdatedDocuments`: Boolean for enabling or disabling the featurization of outdated documents. Outdated documents are those whose features have changed since the last featurization. Default is `false`. - `schedule`: A cron expression to schedule the prediction job (e.g `0 0 * * *` for daily predictions). Default is `None`. - +- `embeddingsField`: The name of the field to store the generated embeddings. This is only used for Graph Embedding tasks. Default is `None`. ```py # 1. Define the Prediction Specification +# Node Classification Example +prediction_spec = { + "projectName": project.name, + "databaseName": dataset_db.name, + "modelID": best_model.model_id, +} + +# Node Embedding Example prediction_spec = { "projectName": project.name, "databaseName": dataset_db.name, "modelID": best_model.model_id, + "embeddingsField": "embeddings" } ``` @@ -756,7 +875,12 @@ Once the specification has been defined, a Prediction Job can be triggered using ```py # 2. 
Submit a Prediction Job + +# For Node Classification prediction_job = arangoml.jobs.predict(prediction_spec) + +# For Graph Embeddings +prediction_job = arangoml.jobs.generate(prediction_spec) ``` Similar to the Training Service, we can wait for a Prediction Job to complete with the `arangoml.wait_for_prediction` method: @@ -767,7 +891,7 @@ Similar to the Training Service, we can wait for a Prediction Job to complete wi prediction_job_result = arangoml.wait_for_prediction(prediction_job.job_id) ``` -**Example Output:** +**Example Output (Node Classification):** ```py { "job_id": "b2a422bb-5650-4fbc-ba6b-0578af0049d9", @@ -789,15 +913,37 @@ prediction_job_result = arangoml.wait_for_prediction(prediction_job.job_id) } ``` +**Example Output (Node Embeddings):** +```py +{ + "job_id": "25260362-9764-47d0-abb4-247cbdce6c7b", + "job_status": "COMPLETED", + "project_name": "OPEN_INTELLIGENCE_ANGOLA_GraphML_Node_Embeddings", + "project_id": "647025872", + "database_name": "OPEN_INTELLIGENCE_ANGOLA", + "model_id": "55ae93c2-3497-4405-9c63-0fa0e4a5b5bd", + "job_state_information": { + "outputGraphName": "OPEN_INTELLIGENCE_ANGOLA", + "outputCollectionName": "Event", + "outputAttribute": "embeddings", + "numberOfPredictedDocuments": 0, # 0 All documents already have up-to-date embeddings + }, + "time_submitted": "2025-03-27T14:02:33.094191", + "time_started": "2025-03-27T14:09:34.206659", + "time_ended": "2025-03-27T14:09:35.791630", + "prediction_type": "Prediction" +} +``` + You can also cancel a Prediction Job using the `arangoml.jobs.cancel_job` method: ```py arangoml.jobs.cancel_job(prediction_job.job_id) ``` -### Viewing Predictions +### Viewing Inference Results -We can now access our predictions via AQL: +We can now access our results via AQL: ```py import json @@ -814,4 +960,8 @@ query = f""" docs = list(dataset_db.aql.execute(query)) print(json.dumps(docs, indent=2)) -``` \ No newline at end of file +``` + +## What's next + +With the generated Feature (and optionally Node) Embeddings, you can now use them for downstream tasks like clustering, anomaly detection, and link prediction. Consider using [ArangoDB's Vector Search](https://arangodb.com/2024/11/vector-search-in-arangodb-practical-insights-and-hands-on-examples/) capabilities to find similar nodes based on their embeddings. diff --git a/site/content/3.13/components/tools/arango-datasets.md b/site/content/3.13/components/tools/arango-datasets.md index 78dfbe2950..a24903550f 100644 --- a/site/content/3.13/components/tools/arango-datasets.md +++ b/site/content/3.13/components/tools/arango-datasets.md @@ -3,9 +3,9 @@ title: ArangoDB Datasets menuTitle: ArangoDB Datasets weight: 60 description: >- - `arango_datasets` is a Python package for loading sample datasets into ArangoDB + `arango-datasets` is a Python package for loading sample datasets into ArangoDB --- -You can use the `arango_datasets` package in conjunction with the `python-arango` +You can use the `arango-datasets` package in conjunction with the `python-arango` driver to load example data into your ArangoDB deployments. The data is hosted on AWS S3. There are a number of existing datasets already available and you can view them by calling the `list_datasets()` method as shown below. 
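For reference, the `list_datasets()` flow described in the hunk above looks roughly like the following sketch. It assumes the `arango-datasets` package (imported as `arango_datasets`) is installed alongside `python-arango`; the connection details are placeholders, and the dataset name is the one used elsewhere in this patch.

```py
from arango import ArangoClient
from arango_datasets import Datasets

# Connect to the target database (host and credentials are placeholders).
db = ArangoClient(hosts="http://localhost:8529").db(
    "_system", username="root", password="", verify=True
)

datasets = Datasets(db)

# List the sample datasets hosted on AWS S3.
print(datasets.list_datasets())

# Import one of them, e.g. the dataset used later in this patch.
datasets.load("OPEN_INTELLIGENCE_ANGOLA")
```

Calling `load()` should import the dataset's collections (and graph definition, if any) into the connected database, after which it can be used as the source data for the Featurization and Training steps described below.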
@@ -24,7 +24,7 @@ You can find the source code repository of the module on GitHub: ## Usage -Once you have installed the `arango_datasets` package, you can use it to +Once you have installed the `arango-datasets` package, you can use it to download and import datasets into your deployment with `arango_datasets.Datasets`. The `Datasets` constructor requires a valid [python-arango](../../develop/drivers/python.md) diff --git a/site/content/3.13/data-science/arangographml/_index.md b/site/content/3.13/data-science/arangographml/_index.md index 9b1780f3c6..2d5d3324de 100644 --- a/site/content/3.13/data-science/arangographml/_index.md +++ b/site/content/3.13/data-science/arangographml/_index.md @@ -7,33 +7,37 @@ description: >- aliases: - graphml --- -Traditional machine learning overlooks the connections and relationships +Traditional Machine Learning (ML) overlooks the connections and relationships between data points, which is where graph machine learning excels. However, accessibility to GraphML has been limited to sizable enterprises equipped with -specialized teams of data scientists. ArangoGraphML, on the other hand, -simplifies the utilization of GraphML, enabling a broader range of personas to -extract profound insights from their data. +specialized teams of data scientists. ArangoGraphML simplifies the utilization of GraphML, +enabling a broader range of personas to extract profound insights from their data. ## How GraphML works -GraphML focuses on the utilization of neural networks specifically for -graph-related tasks. It is well-suited for addressing vague or fuzzy problems -and facilitating their resolution. The process involves incorporating a graph's -topology (node and edge structure) and the node and edge characteristics and -features to create a numerical representation known as an embedding. +Graph machine learning leverages the inherent structure of graph data, where +entities (nodes) and their relationships (edges) form a network. Unlike +traditional ML, which primarily operates on tabular data, GraphML applies +specialized algorithms like Graph Neural Networks (GNNs), node embeddings, and +link prediction to uncover complex patterns and insights. + +1. **Graph Construction**: + Raw data is transformed into a graph structure, defining nodes and edges based + on real-world relationships. +2. **Featurization**: + Nodes and edges are enriched with features that help in training predictive models. +3. **Model Training**: + Machine learning techniques are applied on GNNs to identify patterns and make predictions. +4. **Inference & Insights**: + The trained model is used to classify nodes, detect anomalies, recommend items, + or predict future connections. + +ArangoGraphML streamlines these steps, providing an intuitive and scalable +framework to integrate GraphML into various applications, from fraud detection +to recommendation systems. ![GraphML Embeddings](../../../images/GraphML-Embeddings.webp) -Graph Neural Networks (GNNs) are explicitly designed to learn meaningful -numerical representations, or embeddings, for nodes and edges in a graph. - -By applying a series of steps, GNNs effectively create graph embeddings, -which are numerical representations that encode the essential information -about the nodes and edges in the graph. These embeddings can then be used -for various tasks, such as node classification, link prediction, and -graph-level classification, where the model can make predictions based on the -learned patterns and relationships within the graph. 
- ![GraphML Workflow](../../../images/GraphML-How-it-works.webp) It is no longer necessary to understand the complexities involved with graph @@ -45,71 +49,133 @@ The platform comes preloaded with all the tools needed to prepare your graph for machine learning, high-accuracy training, and persisting predictions back to the database for application use. -### Classification - -Node classification is a natural fit for graph databases as it can leverage -existing graph analytics insights during model training. For instance, if you -have performed some community detection, you can use these insights as inputs -for graph machine learning. - -#### What is Node Classification - -The goal of node classification is to categorize the nodes in a graph based on -their neighborhood connections and characteristics in the graph. Based on the -behaviors or patterns in the graph, the Graph Neural Network (GNN) will be able -to learn what makes a node belong to a category. - -Node classification can be used to solve complex problems such as: -- Entity Categorization - - Email - - Books - - WebPage - - Transaction -- Social Networks - - Events - - Friends - - Interests -- BioPharmaceutical - - Protein-protein interaction - - Drug Categorization - - Sequence grouping -- Behavior - - Fraud - - Purchase/decision making - - Anomaly - -Many use cases can be solved with node classification. With many challenges, -there are multiple ways to attempt to solve them, and that's why the -ArangoGraphML node classification is only the first of many techniques to be -introduced. You can sign up to get immediate access to our latest stable -features and also try out other features included in the pipeline, such as -embedding similarity or link prediction. - -For more information, [get in touch](https://www.arangodb.com/contact/) -with the ArangoDB team. - -### Metrics and Compliance - -#### Training Performance - -Before using a model to provide predictions to your application, there needs -to be a way to determine its level of accuracy. Additionally, a mechanism must -be in place to ensure the experiments comply with auditor requirements. - -ArangoGraphML supports these objectives by storing all relevant training data -and metrics in a metadata graph, which is only available to you and is never -viewable by ArangoDB. This metagraph contains valuable training metrics such as -average accuracy (the general metric for determining model performance), F1, -Recall, Precision, and confusion matrix data. This graph links all experiments +## Supported Tasks + +### Node Classification + +Node classification is a **supervised learning** task where the goal is to +predict the label of a node based on both its own features and its relationships +within the graph. It requires a set of labeled nodes to train a model, which then +classifies unlabeled nodes based on learned patterns. + +**How it works in ArangoGraphML** + +- A portion of the nodes in a graph is labeled for training. +- The model learns patterns from both **node features** and + **structural relationships** (neighboring nodes and connections). +- It predicts labels for unlabeled nodes based on these learned patterns. + +**Example Use Cases** + +- **Fraud Detection in Financial Networks** + - **Problem:** Fraudsters often create multiple accounts or interact within + suspicious clusters to evade detection. + - **Solution:** A transaction graph is built where nodes represent users and + edges represent transactions. 
The model learns patterns from labeled + fraudulent and legitimate users, detecting hidden fraud rings based on + **both user attributes and transaction relationships**. + +- **Customer Segmentation in E-Commerce & Social Media** + - **Problem:** Businesses need to categorize customers based on purchasing + behavior and engagement. + - **Solution:** A graph is built where nodes represent customers and edges + represent interactions (purchases, reviews, social connections). The model + predicts the category of each user based on how similar they are to other users + **not just by their personal data, but also by how they are connected to others**. + +- **Disease Classification in Biomedical Networks** + - **Problem:** Identifying proteins or genes associated with a disease. + - **Solution:** A protein interaction graph is built where nodes are proteins + and edges represent biochemical interactions. The model classifies unknown + proteins based on their interactions with known disease-related proteins, + rather than just their individual properties. + +### Node Embedding Generation + +Node embedding is an **unsupervised learning** technique that converts nodes +into numerical vector representations, preserving their **structural relationships** +within the graph. Unlike simple feature aggregation, node embeddings +**capture the influence of neighboring nodes and graph topology**, making +them powerful for downstream tasks like clustering, anomaly detection, +and link prediction. These combinations can provide valuable insights. +Consider using [ArangoDB's Vector Search](https://arangodb.com/2024/11/vector-search-in-arangodb-practical-insights-and-hands-on-examples/) +capabilities to find similar nodes based on their embeddings. + +**Feature Embeddings versus Node Embeddings** + +**Feature Embeddings** are vector representations derived from the attributes or +features associated with nodes. These embeddings aim to capture the inherent +characteristics of the data. For example, in a social network, a +feature embedding might encode user attributes like age, location, and +interests. Techniques like **Word2Vec**, **TF-IDF**, or **autoencoders** are +commonly used to generate such embeddings. + +In the context of graphs, **Node Embeddings** are a +**combination of a node's feature embedding and the structural information from its connected edges**. +Essentially, they aggregate both the node's attributes and the connectivity patterns +within the graph. This fusion helps capture not only the individual properties of +a node but also its position and role within the network. + +**How it works in ArangoGraphML** + +- The model learns an embedding (a vector representation) for each node based on its + **position within the graph and its connections**. +- It **does not rely on labeled data** – instead, it captures structural patterns + through graph traversal and aggregation of neighbor information. +- These embeddings can be used for similarity searches, clustering, and predictive tasks. + +**Example Use Cases** + +- **Recommendation Systems (E-commerce & Streaming Platforms)** + - **Problem:** Platforms like Amazon, Netflix, and Spotify need to recommend products, + movies, or songs. + - **Solution:** A user-item interaction graph is built where nodes are users + and products, and edges represent interactions (purchases, ratings, listens). 
+ **Embeddings encode relationships**, allowing the system to recommend similar + items based on user behavior and network influence rather than just individual + preferences. + +- **Anomaly Detection in Cybersecurity & Finance** + - **Problem:** Detecting unusual activity (e.g., cyber attacks, money laundering) + in complex networks. + - **Solution:** A network of IP addresses, users, and transactions is represented as + a graph. Nodes with embeddings that significantly deviate from normal patterns + are flagged as potential threats. The key advantage here is that anomalies are + detected based on **network structure, not just individual activity logs**. + +- **Link Prediction (Social & Knowledge Graphs)** + - **Problem:** Predicting new relationships, such as suggesting friends on + social media or forecasting research paper citations. + - **Solution:** A social network graph is created where nodes are users, and + edges represent friendships. **Embeddings capture the likelihood of + connections forming based on shared neighborhoods and structural + similarities, even if users have never interacted before**. + +### Key Differences + +| Feature | Node Classification | Node Embedding Generation | +|-----------------------|---------------------|----------------------------| +| **Learning Type** | Supervised | Unsupervised | +| **Input Data** | Labeled nodes | Graph structure & features | +| **Output** | Predicted labels | Node embeddings (vectors) | +| **Key Advantage** | Learns labels based on node connections and attributes | Learns structural patterns and node relationships | +| **Use Cases** | Fraud detection, customer segmentation, disease classification | Recommendations, anomaly detection, link prediction | + +ArangoGraphML provides the infrastructure to efficiently train and apply these +models, helping users extract meaningful insights from complex graph data. + +## Metrics and Compliance + +ArangoGraphML supports tracking your ML pipeline by storing all relevant metadata +and metrics in a Graph called ArangoPipe. This is only available to you and is never +viewable by ArangoDB. This metadata graph links all experiments to the source data, feature generation activities, training runs, and prediction -jobs. Having everything linked across the entire pipeline ensures that, at any -time, anything done that could be considered associated with sensitive user data, -it is logged and easily accessible. +jobs, allowing you to track the entire ML pipeline without having to leave ArangoDB. ### Security Each deployment that uses ArangoGraphML has an `arangopipe` database created, -which houses all this information. Since the data lives with the deployment, +which houses all ML Metadata information. Since this data lives within the deployment, it benefits from the ArangoGraph SOC 2 compliance and Enterprise security features. All ArangoGraphML services live alongside the ArangoGraph deployment and are only -accessible within that organization. \ No newline at end of file +accessible within that organization. diff --git a/site/content/3.13/data-science/arangographml/getting-started.md b/site/content/3.13/data-science/arangographml/getting-started.md index 8a485a254d..6bd614167e 100644 --- a/site/content/3.13/data-science/arangographml/getting-started.md +++ b/site/content/3.13/data-science/arangographml/getting-started.md @@ -59,7 +59,7 @@ ArangoGraphML comes with other ArangoDB Magic Commands! 
See the full list [here] **API Documentation: [arangoml.ArangoML](https://arangoml.github.io/arangoml/client.html#arangoml.main.ArangoML)** The `ArangoML` class is the main entry point for the `arangoml` package. -It requires the following parameters: +It has the following parameters: - `client`: An instance of arango.client.ArangoClient. Defaults to `None`. If not provided, the **hosts** argument must be provided. - `hosts`: The ArangoDB host(s) to connect to. This can be a single host, or a list of hosts. @@ -67,12 +67,10 @@ It requires the following parameters: - `password`: The ArangoDB password to use for authentication. - `user_token`: The ArangoDB user token to use for authentication. This is an alternative to username/password authentication. -- `ca_cert_file`: (Optional) The path to the CA certificate file to use for TLS - verification. -- `user_token`: (Optional) The ArangoDB user token to use for authentication. - This is an alternative to username/password authentication. +- `ca_cert_file`: The path to the CA certificate file to use for TLS + verification. Defaults to `None`. - `api_endpoint`: The URL to the ArangoGraphML API Service. -- `settings`: (Optional) A list of secrets files to be loaded as settings. Parameters provided as arguments will override those in the settings files (e.g `settings.toml`). +- `settings_files`: A list of secrets files to be loaded as settings. Parameters provided as arguments will override those in the settings files (e.g `settings.toml`). - `version`: The ArangoML API date version. Defaults to the latest version. It is possible to instantiate an ArangoML object in multiple ways: @@ -188,7 +186,7 @@ Let's get started! {{< tab "ArangoGraphML" >}} -The [`arango_datasets` Python package](../../components/tools/arango-datasets.md) +The [`arango-datasets`](../../components/tools/arango-datasets.md) Python package allows you to load pre-defined datasets into ArangoDB. It comes pre-installed in the ArangoGraphML notebook environment. @@ -205,7 +203,7 @@ DATASET_NAME = "OPEN_INTELLIGENCE_ANGOLA" {{< tab "Self-managed" >}} -The [`arango_datasets` Python package](../../components/tools/arango-datasets.md) +The [`arango-datasets`](../../components/tools/arango-datasets.md) Python package allows you to load pre-defined datasets into ArangoDB. It can be installed with the following command: @@ -273,7 +271,8 @@ arangoml.projects.list_projects() - `outputName`: Adjust the default feature name. This can be any valid ArangoDB attribute name. Defaults to `x`. - `dimensionalityReduction`: Object configuring dimensionality reduction. - - `disabled`: Boolean for enabling or disabling dimensionality reduction. Default is `false`. + - `disabled`: Whether to disable dimensionality reduction. Default is `false`, + therefore dimensionality reduction is applied after Featurization by default. - `size`: The number of dimensions to reduce the feature length to. Default is `512`. - `defaultsPerFeatureType`: A dictionary mapping each feature to how missing or mismatched values should be handled. The keys of this dictionary are the features, and the values are sub-dictionaries with the following keys: @@ -286,11 +285,11 @@ arangoml.projects.list_projects() - `jobConfiguration` Optional: A set of configurations that are applied to the job. - `batchSize`: The number of documents to process in a single batch. Default is `32`. - - `runAnalysisChecks`: Boolean for enabling or disabling analysis checks. Default is `true`. 
- - `skipLabels`: Boolean for enabling or disabling label skipping. Default is `false`. - - `overwriteFSGraph`: Boolean for enabling or disabling overwriting the feature store graph. Default is `false`. - - `writeToSourceGraph`: Boolean for enabling or disabling writing features to the source graph. Default is `true`. - - `useFeatureStore`: Boolean for enabling or disabling the use of the feature store. Default is `false`. + - `runAnalysisChecks`: Whether to run analysis checks, used to perform a high-level analysis of the data quality before proceeding. Default is `true`. + - `skipLabels`: Skips the featurization process for attributes marked as `label`. Default is `false`. + - `useFeatureStore`: Enables the use of the Feature Store database, which allows you to store features separately from your Source Database. Default is `false`, therefore features are written to the source graph. + - `overwriteFSGraph`: Whether to overwrite the Feature Store Graph if features were previously generated. Default is `false`, therefore features are written to an existing Feature Store Graph.s + - `writeToSourceGraph`: Whether to store the generated features on the Source Graph. Default is `true`. - `metagraph`: Metadata to represent the vertex & edge collections of the graph. - `vertexCollections`: A dictionary mapping the vertex collection names to the following values: @@ -299,8 +298,8 @@ arangoml.projects.list_projects() - `config`: Collection-level configuration settings. - `featurePrefix`: Identical to global `featurePrefix` but for this collection. - `dimensionalityReduction`: Identical to global `dimensionalityReduction` but for this collection. - - `outputName`: Identical to global `outputName` but for this collection. - - `defaultsPerFeatureType`: Identical to global `defaultsPerFeatureType` but for this collection. + - `outputName`: Identical to global `outputName`, but specifically for this collection. + - `defaultsPerFeatureType`: Identical to global `defaultsPerFeatureType`, but specifically for this collection. - `edgeCollections`: A dictionary mapping the edge collection names to an empty dictionary, as edge attributes are not currently supported. The Featurization Specification example is used for the GDELT dataset: @@ -517,7 +516,7 @@ arangoml.jobs.cancel_job(prediction_job.job_id) **API Documentation: [ArangoML.jobs.train](https://arangoml.github.io/arangoml/api.html#agml_api.jobs.v1.api.jobs_api.JobsApi.train)** -Training Graph Machine Learning Models with ArangoGraphML only requires two steps: +Training Graph Machine Learning Models with ArangoGraphML requires two steps: 1. Describe which data points should be included in the Training Job. 2. Pass the Training Specification to the Training Service. @@ -536,7 +535,12 @@ Training Graph Machine Learning Models with ArangoGraphML only requires two step - `targetCollection`: The ArangoDB collection name that contains the prediction label. - `inputFeatures`: The name of the feature to be used as input. - `labelField`: The name of the attribute to be predicted. - - `batchSize`: The number of documents to process in a single batch. Default is `64`. + - `batchSize`: The number of documents to process in a single training batch. Default is `64`. + - `graphEmbeddings`: Dictionary to describe the Graph Embedding Task Specification. + - `targetCollection`: The ArangoDB collection used to generate the embeddings. + - `embeddingSize`: The size of the embedding vector. Default is `128`. 
+ - `batchSize`: The number of documents to process in a single training batch. Default is `64`. + - `generateEmbeddings`: Whether to generate embeddings on the training dataset. Default is `false`. - `metagraph`: Metadata to represent the vertex & edge collections of the graph. If `featureSetID` is provided, this can be omitted. - `graph`: The ArangoDB graph name. @@ -549,7 +553,6 @@ A Training Specification allows for concisely defining your training task in a single object and then passing that object to the training service using the Python API client, as shown below. - The ArangoGraphML Training Service is responsible for training a series of Graph Machine Learning Models using the data provided in the Training Specification. It assumes that the data has been featurized and is ready to be @@ -560,6 +563,8 @@ Given that we have run a Featurization Job, we can create the Training Specifica ```py # 1. Define the Training Specification +# Node Classification example + training_spec = { "featureSetID": featurization_job_result.result.feature_set_id, "mlSpec": { @@ -570,6 +575,20 @@ training_spec = { } }, } + +# Node Embedding example +# NOTE: Full Graph Embeddings support is coming soon + +training_spec = { + "featureSetID": featurization_job_result.result.feature_set_id, + "mlSpec": { + "graphEmbeddings": { + "targetCollection": "Event", + "embeddingSize": 128, + "generateEmbeddings": True, + } + }, +} ``` Once the specification has been defined, a Training Job can be triggered using the `arangoml.jobs.train` method: @@ -588,7 +607,7 @@ Once a Training Job has been submitted, you can wait for it to complete using th training_job_result = arangoml.wait_for_training(training_job.job_id) ``` -**Example Output:** +**Example Output (Node Classification):** ```py { "job_id": "691ceb2f-1931-492a-b4eb-0536925a4697", @@ -649,6 +668,65 @@ training_job_result = arangoml.wait_for_training(training_job.job_id) } ``` +**Example Output (Node Embeddings):** +```py +{ + "job_id": "6047e53a-f1dd-4725-83e8-74ac44629c11", + "job_status": "COMPLETED", + "project_name": "OPEN_INTELLIGENCE_ANGOLA_GraphML_Node_Embeddings", + "project_id": "647025872", + "database_name": "OPEN_INTELLIGENCE_ANGOLA", + "ml_spec": { + "graphEmbeddings": { + "targetCollection": "Event", + "embeddingLevel": "NODE_EMBEDDINGS", + "embeddingSize": 128, + "embeddingTrainingType": "UNSUPERVISED", + "batchSize": 64, + "generateEmbeddings": true, + "bestModelSelection": "BEST_LOSS", + "persistModels": "ALL_MODELS", + "modelConfigurations": {} + } + }, + "metagraph": { + "graph": "OPEN_INTELLIGENCE_ANGOLA", + "vertexCollections": { + "Actor": { + "x": "OPEN_INTELLIGENCE_ANGOLA_x" + }, + "Country": { + "x": "OPEN_INTELLIGENCE_ANGOLA_x" + }, + "Event": { + "x": "OPEN_INTELLIGENCE_ANGOLA_x", + "y": "OPEN_INTELLIGENCE_ANGOLA_y" + }, + "Source": { + "x": "OPEN_INTELLIGENCE_ANGOLA_x" + }, + "Location": { + "x": "OPEN_INTELLIGENCE_ANGOLA_x" + }, + "Region": { + "x": "OPEN_INTELLIGENCE_ANGOLA_x" + } + }, + "edgeCollections": { + "eventActor": {}, + "hasSource": {}, + "hasLocation": {}, + "inCountry": {}, + "inRegion": {} + } + }, + "time_submitted": "2025-03-27T02:55:15.099680", + "time_started": "2025-03-27T02:57:25.143948", + "time_ended": "2025-03-27T03:01:24.619737", + "training_type": "Training" +} +``` + You can also cancel a Training Job using the `arangoml.jobs.cancel_job` method: ```py @@ -674,10 +752,15 @@ models = arangoml.list_models( print(len(models)) ``` - The cell below selects the model with the highest **test accuracy** using 
[ArangoML.get_best_model](https://arangoml.github.io/arangoml/client.html#arangoml.main.ArangoML.get_best_model), but there may be other factors that motivate you to choose another model. See the `model_statistics` in the output field below for more information on the full list of available metrics. ```py + +# 2. Select the best Model + +# Get best Node Classification Model +# Sort by highest test accuracy + best_model = arangoml.get_best_model( project.name, training_job.job_id, @@ -685,10 +768,21 @@ best_model = arangoml.get_best_model( sort_child_key="accuracy", ) +# Get best Graph Embedding Model +# Sort by lowest loss + +best_model = arangoml.get_best_model( + project.name, + training_job.job_id, + sort_parent_key="loss", + sort_child_key=None, + reverse=False +) + print(best_model) ``` -**Example Output:** +**Example Output (Node Classification):** ```py { "job_id": "691ceb2f-1931-492a-b4eb-0536925a4697", @@ -722,6 +816,22 @@ print(best_model) } ``` +**Example Output (Node Embeddings):** +```py +{ + "job_id": "6047e53a-f1dd-4725-83e8-74ac44629c11", + "model_id": "55ae93c2-3497-4405-9c63-0fa0e4a5b5bd", + "model_display_name": "graphsageencdec Model", + "model_name": "graphsageencdec Model 55ae93c2-3497-4405-9c63-0fa0e4a5b5bd", + "model_statistics": { + "loss": 0.13700408464796796, + "val_acc": 0.5795393939393939, + "test_acc": 0.5809545454545455 + }, + "model_tasks": [ "GRAPH_EMBEDDINGS" ] +} +``` + ## Prediction **API Documentation: [ArangoML.jobs.predict](https://arangoml.github.io/arangoml/api.html#agml_api.jobs.v1.api.jobs_api.JobsApi.predict)** @@ -739,15 +849,24 @@ collection, or within the source documents. - `featurizeNewDocuments`: Boolean for enabling or disabling the featurization of new documents. Useful if you don't want to re-train the model upon new data. Default is `false`. - `featurizeOutdatedDocuments`: Boolean for enabling or disabling the featurization of outdated documents. Outdated documents are those whose features have changed since the last featurization. Default is `false`. - `schedule`: A cron expression to schedule the prediction job (e.g `0 0 * * *` for daily predictions). Default is `None`. - +- `embeddingsField`: The name of the field to store the generated embeddings. This is only used for Graph Embedding tasks. Default is `None`. ```py # 1. Define the Prediction Specification +# Node Classification Example +prediction_spec = { + "projectName": project.name, + "databaseName": dataset_db.name, + "modelID": best_model.model_id, +} + +# Node Embedding Example prediction_spec = { "projectName": project.name, "databaseName": dataset_db.name, "modelID": best_model.model_id, + "embeddingsField": "embeddings" } ``` @@ -756,7 +875,12 @@ Once the specification has been defined, a Prediction Job can be triggered using ```py # 2. 
Submit a Prediction Job + +# For Node Classification prediction_job = arangoml.jobs.predict(prediction_spec) + +# For Graph Embeddings +prediction_job = arangoml.jobs.generate(prediction_spec) ``` Similar to the Training Service, we can wait for a Prediction Job to complete with the `arangoml.wait_for_prediction` method: @@ -767,7 +891,7 @@ Similar to the Training Service, we can wait for a Prediction Job to complete wi prediction_job_result = arangoml.wait_for_prediction(prediction_job.job_id) ``` -**Example Output:** +**Example Output (Node Classification):** ```py { "job_id": "b2a422bb-5650-4fbc-ba6b-0578af0049d9", @@ -789,15 +913,37 @@ prediction_job_result = arangoml.wait_for_prediction(prediction_job.job_id) } ``` +**Example Output (Node Embeddings):** +```py +{ + "job_id": "25260362-9764-47d0-abb4-247cbdce6c7b", + "job_status": "COMPLETED", + "project_name": "OPEN_INTELLIGENCE_ANGOLA_GraphML_Node_Embeddings", + "project_id": "647025872", + "database_name": "OPEN_INTELLIGENCE_ANGOLA", + "model_id": "55ae93c2-3497-4405-9c63-0fa0e4a5b5bd", + "job_state_information": { + "outputGraphName": "OPEN_INTELLIGENCE_ANGOLA", + "outputCollectionName": "Event", + "outputAttribute": "embeddings", + "numberOfPredictedDocuments": 0, # 0 All documents already have up-to-date embeddings + }, + "time_submitted": "2025-03-27T14:02:33.094191", + "time_started": "2025-03-27T14:09:34.206659", + "time_ended": "2025-03-27T14:09:35.791630", + "prediction_type": "Prediction" +} +``` + You can also cancel a Prediction Job using the `arangoml.jobs.cancel_job` method: ```py arangoml.jobs.cancel_job(prediction_job.job_id) ``` -### Viewing Predictions +### Viewing Inference Results -We can now access our predictions via AQL: +We can now access our results via AQL: ```py import json @@ -814,4 +960,8 @@ query = f""" docs = list(dataset_db.aql.execute(query)) print(json.dumps(docs, indent=2)) -``` \ No newline at end of file +``` + +## What's next + +With the generated Feature (and optionally Node) Embeddings, you can now use them for downstream tasks like clustering, anomaly detection, and link prediction. Consider using [ArangoDB's Vector Search](https://arangodb.com/2024/11/vector-search-in-arangodb-practical-insights-and-hands-on-examples/) capabilities to find similar nodes based on their embeddings.
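As a concrete illustration of one such downstream task, the sketch below reads the embeddings written by the Graph Embeddings Prediction Job back out of the database and ranks nearest neighbors by cosine similarity on the client side. The `Event` collection, the `embeddings` attribute, and the `dataset_db` handle follow the examples in this patch; for larger graphs, a vector index with ArangoDB's Vector Search would be the more scalable route.

```py
import numpy as np

# Fetch the embeddings generated by the Graph Embeddings Prediction Job.
# Collection and attribute names follow the examples above.
query = """
FOR doc IN Event
    FILTER HAS(doc, "embeddings")
    LIMIT 1000
    RETURN { key: doc._key, embedding: doc.embeddings }
"""
docs = list(dataset_db.aql.execute(query))

keys = [doc["key"] for doc in docs]
X = np.array([doc["embedding"] for doc in docs], dtype=float)
X = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalize for cosine similarity

# Rank the 5 nodes most similar to the first one (index 0 is the node itself).
similarities = X @ X[0]
top_5 = np.argsort(-similarities)[1:6]
print([(keys[i], round(float(similarities[i]), 4)) for i in top_5])
```

The same normalized embedding matrix can feed clustering (e.g. k-means) or anomaly scoring, since all of these operate on the vectors rather than on the graph itself.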