diff --git a/README.md b/README.md
index a4ed011..1a2c345 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,6 @@
 ![Knowledge Graph Banner](./assets/KG_banner.png)
 *A knowledge graph generated using this code*
-
 ghpages link of this graph: https://rahulnyk.github.io/knowledge_graph/
@@ -19,7 +18,7 @@ Source: https://www.ibm.com/topics/knowledge-graph
 5. Populate nodes (concepts) and edges (relations).
 6. Visualise and Query.
-Step 6 is purely optional, but it has certain artistic gratification associated with it. Network graphs are beautiful objects (just look at the banner image above, isnt it beautiful?). Fortunately there are good number of python libraries available for generating graph visualisations.
+Step 6 is purely optional, but it has certain artistic gratification associated with it. Network graphs are beautiful objects (just look at the banner image above, isn't it beautiful?). Fortunately, there are a good number of Python libraries available for generating graph visualisations.
 ## Why Graph?
 Once the Knowledge Graph (KG) is build, we can use it for many purposes. We can run graph algorithms and calculate centralities of any node, to understand how important a concept (node) is to this body of work. We can calculate communities to bunch the concepts together to better analyse the text. We can understand the connectedness between seemingly disconnected concepts.
@@ -29,22 +28,42 @@ The best of all, we can achieve **Graph Retrieval Augmented Generation (GRAG)**
 ---
 ## This project
-Here I have created a simple knowledge graph from a pdf document. All the components I used here are set up locally, so this project can be run very easily on a personal machine.
-I have adopted a no-GPT approach here to keep things economical. I am using the fantastic *Mistral 7B openorca instruct* as LLM which crushes this use cases wonderfully. The model can be set up locally using Ollama so generating the KG is basically free (No calls to GPT).
+Here I have created a simple knowledge graph from a PDF document. The process I follow here is very similar to what is outlined in the above sections, with some simplifications.
+
+First I split the entire text into chunks. Then I extract concepts mentioned within each chunk using an LLM. Note that I am not extracting entities using an NER model here. There is a difference between concepts and entities. For example 'Bangalore' is an entity, and 'Pleasant weather in Bangalore' is a concept. In my experience, concepts make more meaningful KG than entities.
+
+I assume that the concepts that are mentioned in the vicinity of each other are related. So every edge in the KG is a text chunk in which the two connected concepts are mentioned.
-Here is a list of libraries I am using in this project
+Once the nodes (concepts) and the edges (text chunks) are calculated, It is easy to create a graph out of them using the libraries mentioned here.
+All the components I used here are set up locally, so this project can be run very easily on a personal machine. I have adopted a no-GPT approach here to keep things economical. I am using the fantastic Mistral 7B openorca instruct, which crushes this use case wonderfully. The model can be set up locally using Ollama so generating the KG is basically free (No calls to GPT).
+To generate a graph there are two notebooks you need to tweak.
-### Mistral 7B with Ollama.
-The Amazing Mistral 7b model for extracting concepts out of text chunks.
+- [extract_concepts.ipynb](https://github.com/rahulnyk/knowledge_graph/blob/main/extract_concepts.ipynb): This notebook loads the documents, splits them up into chunks of text, and extracts concepts from each chunk. It outputs two CSV files in the data_output directory.
+
+- [concept_graph.ipynb](https://github.com/rahulnyk/knowledge_graph/blob/main/concept_graph.ipynb): This notebook reads the csv files, and creates a graph out of them. I am also calculating the graph communities here for colouring the nodes community-wise. That's how the graph in the banner image is so colourful. The notebook also generates the pyvis graph visualisation.
+
+---
+## Tech Stack
-### Python Pandas
+### Mistral 7B
+
+
+I am using the [Mistral 7B Openorca](https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca) for extracting concepts out of text chunks. It can follow the system prompt instructions very well.
+
+### Ollama
+
+
+Ollama makes it easy to host any model locally. Mistral 7B OpenOrca version is already available with Ollama to use out of the box.
+
+### Pandas
 dataframes for graph schema (can use a graphdb at a later stage).
 ### NetworkX
+
+
 This is a python library that makes dealing with graphs super easy
 ### Pyvis
-Pyvis python library for visualisation. It generates amazing web visualisatins using VueJS, so the final graphs can be hosted on the web like github pages.
+[Pyvis python library](https://github.com/WestHealth/pyvis/tree/master) for visualisation. Pyvis generates Javascript Graph visualisations using python, so the final graphs can be hosted on the web. For example the [github link of this repo](https://rahulnyk.github.io/knowledge_graph/) is a graph generated by pyvis
-// Still to complete this README //
\ No newline at end of file
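
The README text added in this diff says that concepts mentioned in the same chunk are assumed to be related, so each edge corresponds to a chunk in which both concepts appear. Below is a minimal sketch of that idea using only the Python standard library. It is not the code from the notebooks: the `chunk_concepts` input, the `cooccurrence_edges` helper, and the choice of "number of shared chunks" as an edge weight are all hypothetical and for illustration only.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input: chunk id -> concepts an LLM extracted from that chunk.
chunk_concepts = {
    "chunk_1": ["pleasant weather in bangalore", "urban growth"],
    "chunk_2": ["urban growth", "traffic congestion", "pleasant weather in bangalore"],
    "chunk_3": ["traffic congestion"],
}


def cooccurrence_edges(chunk_concepts):
    """Map each unordered concept pair to the chunk ids where both appear."""
    edges = defaultdict(list)
    for chunk_id, concepts in chunk_concepts.items():
        # Sort so that (a, b) and (b, a) collapse into a single key.
        for a, b in combinations(sorted(set(concepts)), 2):
            edges[(a, b)].append(chunk_id)
    return dict(edges)


edges = cooccurrence_edges(chunk_concepts)
for (a, b), chunks in sorted(edges.items()):
    # Assumption: the edge weight is simply the number of shared chunks.
    print(f"{a} -- {b}: weight={len(chunks)} chunks={chunks}")
```

Each resulting pair could then be loaded into a NetworkX graph, for example with `Graph.add_edge(a, b, weight=len(chunks))`, before running community detection or rendering with Pyvis.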