Graphusion is a pipeline that extracts Knowledge Graph (KG) triples from text.
- I have deep respect for the creators of this repo and authors of the related AGENTiGraph paper.
- The DiscoverAI YouTube channel also had an interesting video highlighting the various use cases and diverse backgrounds of the authors of the paper: https://www.youtube.com/watch?v=tB5s7Q-8DsE
@misc{zhao2024agentigraphinteractiveknowledgegraph,
title={AGENTiGraph: An Interactive Knowledge Graph Platform for LLM-based Chatbots Utilizing Private Data},
author={Xinjie Zhao and Moritz Blum and Rui Yang and Boming Yang and Luis Márquez Carpintero and Mónica Pina-Navarro and Tony Wang and Xin Li and Huitao Li and Yanran Fu and Rongrong Wang and Juntao Zhang and Irene Li},
year={2024},
eprint={2410.11531},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2410.11531},
}
Create a new conda environment and install the required packages:
conda create -n graphusion python=3.10
conda activate graphusion
pip install -r requirements.txt
- Text Data: Place raw text files in `data/[dataset_name]/raw/`
- Relation Definitions: JSON file defining relation types and their descriptions (see the example after this list)
- Optional Expert Data:
- Gold concepts (TSV): Expert-provided concept definitions
- Refined concepts (TSV): Curated concept list
- Annotated graph: Existing knowledge graph structure
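For illustration, a minimal relation definitions file might look like the following. The relation names and descriptions here are hypothetical; only the structure (relation types as keys, each with a `label` and a `description`) is prescribed by the pipeline (see `--relation_definitions_file` below).

```json
{
  "P1": {
    "label": "Is-a-Prerequisite-of",
    "description": "Concept A must be understood before concept B."
  },
  "P2": {
    "label": "Is-a-Part-of",
    "description": "Concept A is a component of concept B."
  }
}
```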
The pipeline generates three main outputs:
- `concept_abstracts.json`: Mapping between concepts and their source abstracts
- `step-02.jsonl`: Extracted knowledge graph triples
- `step-03.jsonl`: Final fused and refined knowledge graph
| Parameter | Description | Type | Required |
|---|---|---|---|
| `run_name` | Unique identifier for the run | string | No |
| `dataset` | Input dataset name | string | Yes |
| `relation_definitions_file` | Path to relations JSON | string | Yes |
| `model` | LLM model name | string | No |
| `max_resp_tok` | Maximum response tokens | int | No |
| `max_input_char` | Maximum input characters | int | No |
| `language` | Content language | string | No |
| `verbose` | Enable detailed logging | bool | No |
The pipeline processes text files from the `data/[dataset_name]/raw` directory (e.g., `data/test/raw`) as input.
Furthermore, the pipeline requires relation definitions as a JSON file. This file defines the relation types and provides a description for each relation (e.g., `data/test/relation_types.json`). In addition, optional information can be provided to improve the results (`--gold_concept_file`, `--refined_concepts_file`, `--annotated_graph_file`) or to skip pipeline steps (`--input_json_file`, `--input_triple_file`). See the parameters below.
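For example, to skip step 1 and reuse a concept-to-abstract mapping from an earlier run (per the `--input_json_file` description below, `data/test/concept_abstracts.json` is such a file), one would call:

```bash
python main.py --run_name "test" --dataset "test" \
    --relation_definitions_file "data/test/relation_types.json" \
    --input_json_file "data/test/concept_abstracts.json"
```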
The ACL data is originally in CSV format. Therefore, we provide the notebook `preprocess.ipynb` to convert the data into the required text files.
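For orientation, here is a minimal sketch of what such a conversion could look like with pandas. The CSV file name and the `abstract` column are assumptions for illustration; the authoritative logic lives in `preprocess.ipynb`.

```python
import pandas as pd
from pathlib import Path

# Assumed CSV layout with an "abstract" column; see preprocess.ipynb for the real logic.
df = pd.read_csv("data/nlp/acl.csv")  # hypothetical file name

out_dir = Path("data/nlp/raw")
out_dir.mkdir(parents=True, exist_ok=True)

# Write one plain-text file per abstract, as the pipeline expects.
for i, abstract in enumerate(df["abstract"].dropna()):
    (out_dir / f"{i:05d}.txt").write_text(str(abstract), encoding="utf-8")
```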
The pipeline can be run using the following command:
usage: main.py [-h] [--run_name RUN_NAME] --dataset DATASET --relation_definitions_file RELATION_DEFINITIONS_FILE [--input_json_file INPUT_JSON_FILE]
[--input_triple_file INPUT_TRIPLE_FILE] [--model MODEL] [--max_resp_tok MAX_RESP_TOK] [--max_input_char MAX_INPUT_CHAR]
[--prompt_tpextraction PROMPT_TPEXTRACTION] [--prompt_fusion PROMPT_FUSION] [--gold_concept_file GOLD_CONCEPT_FILE]
[--refined_concepts_file REFINED_CONCEPTS_FILE] [--annotated_graph_file ANNOTATED_GRAPH_FILE] [--language LANGUAGE] [--verbose]
options:
-h, --help show this help message and exit
--run_name RUN_NAME Assign a name to this run. The name will be used to, e.g., determine the output directory. We recommend using unique and descriptive names to
distinguish the results of different models.
--dataset DATASET Name of the dataset. Is used to, e.g., determine the input directory.
--relation_definitions_file RELATION_DEFINITIONS_FILE
Path to the relation definitions file. The file should be a JSON file, where the keys are the relation types and the values are dictionaries with the
following keys: 'label', 'description'.
--input_json_file INPUT_JSON_FILE
Path to the input file. Step 1 will be skipped if this argument is provided. The input file should be a JSON file with the following structure:
{'concept1': {'abstract': ['abstract1', ...], 'label': 0}, ...}. E.g., data/test/concept_abstracts.json is the associated file created during step 1 in the
test run.
--input_triple_file INPUT_TRIPLE_FILE
Path to the input file storing the triples in the format as outputted by the candidate triple extraction model. Step 1 and step 2 will be skipped if
this argument is provided.
--model MODEL Name of the LLM that should be used for the KG construction.
--max_resp_tok MAX_RESP_TOK
Maximum number of tokens in the response of the candidate triple extraction model.
--max_input_char MAX_INPUT_CHAR
Maximum number of characters in the input of the candidate triple extraction model.
--prompt_tpextraction PROMPT_TPEXTRACTION
Path to the prompt template for step 1.
--prompt_fusion PROMPT_FUSION
Path to the prompt template for fusion.
--gold_concept_file GOLD_CONCEPT_FILE
Path to a file with concepts that are provided by experts. The file should be a tsv file, each row should look like: 'concept id | concept name'
--refined_concepts_file REFINED_CONCEPTS_FILE
In step 2 (candidate triple extraction) many new concepts might be added. Instead of using these, concepts can be provided through this parameter.
Specify the path to a file with refined concepts of the graph. The file should be a tsv file, each row should look like: "concept id | concept name"
--annotated_graph_file ANNOTATED_GRAPH_FILE
Path to the annotated graph.
--language LANGUAGE Language of the abstracts.
--verbose Print additional information to the console.
The pipeline outputs the following files:
- `concept_abstracts.json`: The JSON file mapping the extracted concepts to their abstracts.
- `step-02.jsonl`: The extracted triples in line-wise JSON format.
- `step-03.jsonl`: The fused triples in line-wise JSON format.
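Since both step files are line-wise JSON (JSONL), they can be inspected with a few lines of Python, e.g. for the `test` run:

```python
import json

# Each line of a step file is one JSON object (one triple).
with open("output/test/step-03.jsonl", encoding="utf-8") as f:
    triples = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(triples)} fused triples")
print(triples[0])  # inspect the field names before further processing
```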
To run the full pipeline on a small sample (`test`) dataset, call:
python main.py --run_name "test" --dataset "test" --relation_definitions_file "data/test/relation_types.json" --gold_concept_file "data/test/gold_concepts.tsv" --refined_concepts_file "data/test/refined_concepts.tsv"
To reproduce the Graphusion results on the ACL (nlp) dataset, call:
python main.py --run_name "acl" --dataset "nlp" --relation_definitions_file "data/nlp/relation_types.json" --gold_concept_file "data/nlp/gold_concepts.tsv" --refined_concepts_file "data/nlp/refined_concepts.tsv"`
AGENTiGraph is a novel multi-agent framework that bridges Large Language Models (LLMs) and Knowledge Graphs (KGs), aiming to overcome limitations in AI models for complex, domain-specific tasks. The platform allows multiple agents, each leveraging LLMs, to interpret user queries, extract concepts, plan tasks, interact with the KG, reason, generate responses, and dynamically add new knowledge.
The framework includes seven specialized agents designed to manage specific tasks, which enhances consistency, reasoning, and adaptability when handling complex queries.
- Paper Title: AGENTiGraph: An Interactive Knowledge Graph Platform for LLM-based Chatbots Utilizing Private Data
- Authors: Researchers from eight institutions, including University of Tokyo, Duke Medical School, and Yale University
- Publication Date: October 15, 2024
- Link: arxiv.org/pdf/2410.11531
- Multi-Agent System: Integrates multiple LLM-powered agents with a KG to perform complex tasks in a structured, decomposed manner.
- Semantic Mapping: Uses BERT-derived embeddings to map extracted entities and relations from user queries onto the KG, aligning similar terms for accurate information retrieval.
- Task Decomposition: Queries are broken down into manageable tasks, allowing agents to interact with the KG and perform reasoning in a stepwise fashion.
- Agent Functions:
- User Intent Interpretation: Classifies the user’s query type (e.g., relationship judgment, prerequisite prediction).
- Key Concept Extraction: Extracts entities and relations from the query.
- Task Planning: Decomposes the query into logical, executable tasks.
- Knowledge Graph Interaction: Uses Cypher or SPARQL queries to retrieve information from the KG.
- Reasoning: Analyzes retrieved data to synthesize relevant information.
- Response Generation: Crafts user responses.
- Dynamic Knowledge Integration: Updates the KG with newly discovered relationships.
Query: "What foundational concepts should I learn to understand the relationship between quantum entanglement and teleportation?"
- User Intent Interpretation Agent: Identifies the query as a request for prerequisite knowledge.
- Key Concept Extraction Agent: Extracts "quantum entanglement" and "teleportation" as key concepts.
- Task Planning Agent: Plans tasks to map out foundational concepts.
- Knowledge Graph Interaction Agent: Executes SPARQL or Cypher queries to retrieve relevant nodes and relations (see the sketch after this list).
- Reasoning Agent: Synthesizes information from the KG, identifying foundational topics.
- Response Generation Agent: Formulates a structured response.
- Dynamic Knowledge Integration Agent: Updates the KG if any new relevant relationships are found.
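To make the Knowledge Graph Interaction step concrete, here is a minimal sketch of issuing a Cypher query from Python with the official `neo4j` driver. The connection details, `:Concept` node label, and `:PREREQUISITE_OF` relationship type are hypothetical; the paper only states that the agent executes Cypher or SPARQL queries.

```python
from neo4j import GraphDatabase

# Hypothetical connection details; adjust to your Neo4j deployment.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Find concepts that are prerequisites of "quantum teleportation".
# The :Concept label and :PREREQUISITE_OF type are illustrative assumptions.
query = """
MATCH (pre:Concept)-[:PREREQUISITE_OF]->(c:Concept {name: $name})
RETURN pre.name AS prerequisite
"""

with driver.session() as session:
    for record in session.run(query, name="quantum teleportation"):
        print(record["prerequisite"])

driver.close()
```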
- Semantic Mapping: Uses BERT embeddings to match user query entities with KG nodes even when terminology differs (see the sketch after this list).
- Knowledge Graph Querying: Executes structured queries to retrieve knowledge, although the framework currently relies on greedy query processing rather than advanced methods like beam search.
- Performance Metrics: AGENTiGraph demonstrates 95% accuracy in task classification and a 90% success rate in task execution.
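As a sketch of the semantic-mapping idea above, the snippet below uses `sentence-transformers` as a stand-in for the BERT-derived embeddings the paper describes; the model choice and KG node names are assumptions.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical KG node names; in practice these come from the graph.
kg_nodes = ["quantum entanglement", "quantum teleportation", "Bell states"]

model = SentenceTransformer("all-MiniLM-L6-v2")  # example BERT-family encoder
node_emb = model.encode(kg_nodes, convert_to_tensor=True)

# Map a query entity to its closest KG node by cosine similarity.
entity = "entangled particles"
entity_emb = model.encode(entity, convert_to_tensor=True)
scores = util.cos_sim(entity_emb, node_emb)[0]
best = int(scores.argmax())
print(f"{entity!r} -> {kg_nodes[best]} (score={float(scores[best]):.2f})")
```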
Each agent in AGENTiGraph is configured with a tailored prompt to optimize task performance:
- Example Prompt (User Intent Interpretation): "You are an expert NLP task classifier specializing in knowledge graph interactions. Analyze the given query and classify it into categories like relationship judgment or prerequisite prediction."
- Example Prompt (Key Concept Extraction): "Identify and extract key concepts and relations, mapping them to the KG schema using BERT-derived embeddings."
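A minimal sketch of wiring such a prompt to an LLM call might look as follows, using the OpenAI Python SDK (the paper does not specify a client library, and the model name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

INTENT_PROMPT = (
    "You are an expert NLP task classifier specializing in knowledge graph "
    "interactions. Analyze the given query and classify it into categories "
    "like relationship judgment or prerequisite prediction."
)

def classify_intent(query: str) -> str:
    """Send the user query to the LLM together with the intent-classification prompt."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": INTENT_PROMPT},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content

print(classify_intent("What should I learn before quantum teleportation?"))
```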
While AGENTiGraph's current setup performs well, further enhancements are possible:
- Beam Search: Could enable exploring multiple reasoning paths simultaneously.
- Custom BERT Models: Domain-specific BERT systems can improve semantic accuracy, especially in specialized fields.
AGENTiGraph represents a significant advancement in AI research, showcasing how multi-agent systems and knowledge graphs can be combined with LLMs for improved factual consistency, reasoning, and adaptability. The platform opens new possibilities for complex, structured information retrieval and reasoning in domain-specific AI applications.
This repository contains the Graphusion pipeline described in the AGENTiGraph paper's appendix D.2. While the paper describes a full system with a dual-mode interface, this implementation focuses on the knowledge graph construction pipeline.
- Knowledge Graph Construction Pipeline:
  - Step 1: Concept extraction (`step_01_concept_extraction.py`)
  - Step 2: Triple extraction (`step_02_triple_extraction.py`)
  - Step 3: Knowledge fusion (`step_03_fusion.py`)
- Key Input Files:
  - Relation definitions: `data/test/relation_types.json`
  - Gold concepts: `data/test/gold_concepts.tsv`
  - Refined concepts: `data/test/refined_concepts.tsv`
- Create and activate a Python virtual environment:
python -m venv .venv
source .venv/bin/activate # On Windows use `.venv\Scripts\activate`
- Install requirements:
pip install -r requirements.txt
- Set OpenAI API key:
export OPENAI_API_KEY="your-key-here"
- Basic pipeline execution:
python main.py --run_name "test" --dataset "test" \
--relation_definitions_file "data/test/relation_types.json" \
--gold_concept_file "data/test/gold_concepts.tsv" \
--refined_concepts_file "data/test/refined_concepts.tsv"
- Pipeline outputs (in output/[run_name]/):
- concept_abstracts.json: Concept to abstract mappings
- concepts.tsv: Extracted concepts
- step-02.jsonl: Initial triple extractions
- step-03.jsonl: Refined/fused triples
The step-03.jsonl file contains triples that can be loaded into Neo4j to create the knowledge graph. The paper describes using this graph with a dual-mode interface (chatbot and exploration views), but that interface implementation is not included in this repository.
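As a starting point, here is a hedged sketch of loading the JSONL triples into Neo4j with the official Python driver. The field names `head`, `relation`, and `tail` are assumptions; inspect your `step-03.jsonl` for the actual keys before running.

```python
import json
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def load_triples(path: str) -> None:
    with open(path, encoding="utf-8") as f, driver.session() as session:
        for line in f:
            triple = json.loads(line)
            # Field names are assumed; adapt to the real keys in your file.
            session.run(
                "MERGE (h:Concept {name: $head}) "
                "MERGE (t:Concept {name: $tail}) "
                "MERGE (h)-[:RELATION {type: $rel}]->(t)",
                head=triple["head"], tail=triple["tail"], rel=triple["relation"],
            )

load_triples("output/test/step-03.jsonl")
driver.close()
```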
- Concept Extraction (Step 1):
  - Uses BERTopic for concept discovery (see the sketch after this list)
  - Integrates with predefined gold concepts
  - Outputs concepts and their associated abstracts
- Triple Extraction (Step 2):
  - Uses an LLM to extract relationships between concepts
  - Follows predefined relation types from the relation definitions file
  - Generates candidate triples
- Knowledge Fusion (Step 3):
  - Refines and integrates triples
  - Produces the final knowledge graph structure
  - Outputs `step-03.jsonl` for Neo4j import
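For intuition, here is a minimal BERTopic usage sketch. `step_01_concept_extraction.py` likely configures the model differently; this only shows the basic fit-and-inspect flow on the test dataset's raw files.

```python
from pathlib import Path
from bertopic import BERTopic

# Read the raw abstracts the same way the pipeline consumes its input directory.
docs = [p.read_text(encoding="utf-8") for p in sorted(Path("data/test/raw").glob("*"))]

topic_model = BERTopic()  # default settings; the pipeline's configuration may differ
topics, probs = topic_model.fit_transform(docs)

# The top words of each topic serve as candidate concepts.
print(topic_model.get_topic_info())
```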
While the paper describes a complete system with:
- 7 specialized agents
- Dual-mode user interface
- Chat and exploration capabilities
this repository implements only the underlying knowledge graph construction pipeline (Graphusion) that would support such a system. The UI and agent implementations are not included.
The repository includes Jupyter notebooks for testing different components:
- test_fusion.ipynb: Tests knowledge fusion
- test_graph.ipynb: Tests graph operations
- test_topic_extraction.ipynb: Tests concept extraction
- test_tpextraction.ipynb: Tests triple extraction
These can be useful for understanding and debugging the pipeline components.
If you want to build the full AGENTiGraph system:
- Use this pipeline to construct your knowledge graph
- Implement the 7 agents described in the paper
- Build the dual-mode interface following the paper's specifications
- Connect everything using the Neo4j database as the central knowledge store