Virtual Havruta is a groundbreaking project born of a collaboration between TUM Venture Labs, Sefaria, and appliedAI Initiative GmbH. The initiative harnesses Large Language Model (LLM)-based Retrieval-Augmented Generation (RAG) techniques to create an innovative study companion. Designed for individuals seeking a deeper understanding of Judaism's scriptures, Virtual Havruta stands as a beacon of knowledge and inspiration in the domain of religious study.
The primary goal of Virtual Havruta is to offer trustworthy and factually correct responses to users interested in exploring various aspects of Judaism. By showcasing how different branches of Judaism would approach specific questions and providing reliable references, our tool aims not only to educate but also to inspire. Moreover, the underlying technology has versatile applications, extending to fields like code generation, customer service, internal knowledge retrieval, and engineering support.
- Domain-Specific Application: Utilizes LLM-based RAG techniques tailored to the study of Judaic scriptures.
- Addressing LLM Challenges: Aligns with the broader industry effort to mitigate hallucination in Language Models.
- Comprehensive Study Companion: Offers insightful analysis into different interpretations within Judaism, coupled with dependable references.
- Collaborative Effort: A product of the joint efforts of Sefaria, TUM Venture Labs, and appliedAI Initiative GmbH, symbolizing a unique blend of religious scholarship and cutting-edge technology.
Virtual Havruta integrates advanced retrieval-augmented generation models to analyze and respond to user queries. By delving into a vast repository of religious texts and interpretations, it provides nuanced perspectives on various Judaic topics. This approach ensures that users receive not just answers, but also contextually rich and theologically sound insights.
The application of Virtual Havruta is vast, ranging from individual study sessions to group discussions and academic research. Its ability to provide diverse viewpoints and references makes it an invaluable tool for anyone seeking to explore the depths of Judaism's rich textual tradition.
This document outlines the core functions of the `VirtualHavruta` class. The functions are grouped by purpose; each table details the inputs, outputs, and role of the functions within the system.
- Initialization and setup functions

| Function Name | Purpose | Input Parameters | Output |
|---|---|---|---|
| `__init__(self, prompts_file: str, config_file: str, logger)` | Initializes the instance with prompts, configurations, and reference information from YAML files. | `prompts_file: str`: Path to prompts YAML file<br>`config_file: str`: Path to configuration YAML file<br>`logger`: Logger instance | None |
| `initialize_prompt_templates(self)` | Initializes prompt templates for various chat interactions and updates class attributes. | None | None |
| `create_prompt_template(self, category: str, template: str, ref_mode: bool = False) -> ChatPromptTemplate` | Creates a prompt template based on a given category and template, optionally including reference data. | `category: str`: Category of the prompt<br>`template: str`: Template within the category<br>`ref_mode: bool = False`: Include reference data if True | `ChatPromptTemplate` object |
| `initialize_llm_instances(self)` | Initializes language model instances based on configuration parameters. | None | None |
| `initialize_llm_chains(self, model, suffixes)` | Initializes language model chains, each with a specific prompt template and suffix. | `model`: Language model instance<br>`suffixes: list[str]`: List of suffix identifiers | None |
| `create_llm_chain(self, llm, prompt_template)` | Creates a language model chain configured with a specified language model and prompt template. | `llm`: Language model instance<br>`prompt_template`: Prompt template for the chain | `LLMChain` instance |
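For orientation, here is a minimal instantiation sketch. The import path and file names are illustrative placeholders, not the project's actual layout:

```python
import logging

from virtual_havruta import VirtualHavruta  # hypothetical import path

# Point these at your own prompts and configuration YAML files.
logging.basicConfig(level=logging.INFO)
vh = VirtualHavruta(
    prompts_file="prompts.yaml",
    config_file="config.yaml",
    logger=logging.getLogger("Virtual-Havruta"),
)
```

Per the table above, the constructor reads both YAML files, after which prompt templates, LLM instances, and chains are set up via the `initialize_*` methods.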
- Prediction and query-processing functions

| Function Name | Purpose | Input Parameters | Output |
|---|---|---|---|
| `make_prediction(self, chain, query: str, action: str, msg_id: str = '', ref_data: str = '')` | Executes a prediction using a specified language model chain, providing logging and token tracking. | `chain`: Language model chain<br>`query: str`: Input query<br>`action: str`: Action type for logging<br>`msg_id: str = ''`: Message ID for logging<br>`ref_data: str = ''`: Reference data (optional) | Tuple `(result: str, tokens_used: int)` |
| `anti_attack(self, query: str, msg_id: str = '')` | Analyzes a query for potential attacks using an anti-attack language model chain. | `query: str`: Query to analyze<br>`msg_id: str = ''`: Message ID for logging | Tuple `(detection: str, explanation: str, tokens_used: int)` |
| `adaptor(self, query: str, msg_id: str = '')` | Adapts a query using an adaptation-specific language model chain. | `query: str`: Query to adapt<br>`msg_id: str = ''`: Message ID for logging | Tuple `(adapted_text: str, tokens_used: int)` |
| `editor(self, query: str, msg_id: str = '')` | Edits a query using an editing-optimized language model chain. | `query: str`: Query to edit<br>`msg_id: str = ''`: Message ID for logging | Tuple `(edited_text: str, tokens_used: int)` |
| `optimizer(self, query: str, msg_id: str = '')` | Optimizes a query, extracting various components from the optimization results. | `query: str`: Query to optimize<br>`msg_id: str = ''`: Message ID for logging | Tuple `(translation: str, extraction: str, elaboration: str, quotation: str, challenge: str, proposal: str, tokens_used: int)` |
| `qa(self, query: str, ref_data: str, msg_id: str = '')` | Executes a question-answering task using a language model chain. | `query: str`: Question query<br>`ref_data: str`: Reference data<br>`msg_id: str = ''`: Message ID for logging | Tuple `(response: str, tokens_used: int)` |
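The sketch below strings several of these helpers into one plausible query flow, reusing the `vh` instance from the earlier sketch. The orchestration order and the `'attack'` detection label are assumptions for illustration, not the project's verbatim pipeline:

```python
query = "What does the Mishnah say about charity?"
msg_id = "demo-001"

# Screen the query, adapt and polish it, then answer against reference data.
detection, explanation, t0 = vh.anti_attack(query, msg_id)
if detection.strip().lower() != "attack":  # assumed detection label
    adapted, t1 = vh.adaptor(query, msg_id)
    edited, t2 = vh.editor(adapted, msg_id)
    response, t3 = vh.qa(edited, ref_data="<retrieved reference text>", msg_id=msg_id)
    print(response, "tokens used:", t0 + t1 + t2 + t3)
```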
- Retrieval functions

| Function Name | Purpose | Input Parameters | Output |
|---|---|---|---|
| `retrieve_docs(self, query: str, msg_id: str = '', filter_mode: str = 'primary')` | Retrieves documents matching a query, filtered as primary or secondary sources. | `query: str`: Query string<br>`msg_id: str = ''`: Message ID for logging<br>`filter_mode: str = 'primary'`: `'primary'` or `'secondary'` | List of documents |
| `retrieve_docs_metadata_filtering(self, query: str, msg_id: str = '', metadata_filter=None)` | Retrieves documents matching a query, filtered based on metadata. | `query: str`: Query string<br>`msg_id: str = ''`: Message ID for logging<br>`metadata_filter=None`: Metadata filter to apply | List of documents that meet the criteria of the specified metadata filter |
| `retrieve_nodes_matching_linker_results(self, linker_results: list[dict], msg_id: str = '', filter_mode: str = 'primary', url_prefix: str = "https://www.sefaria.org/")` | Retrieves nodes corresponding to linker results from the graph database. | `linker_results: list[dict]`: Results from the linker API<br>`msg_id: str = ''`: Message ID for logging<br>`filter_mode: str = 'primary'`: `'primary'` or `'secondary'`<br>`url_prefix: str`: URL prefix | List of Document objects |
| `get_retrieval_results_knowledge_graph(self, url: str, direction: str, order: int, score_central_node: float, filter_mode_nodes: str, msg_id: str = '')` | Retrieves neighbor nodes of a given URL from the knowledge graph. | `url: str`: Central node URL<br>`direction: str`: Edge direction (`'incoming'`, `'outgoing'`, `'both_ways'`)<br>`order: int`: Number of hops<br>`score_central_node: float`: Central node score<br>`filter_mode_nodes: str`: Node filter mode<br>`msg_id: str = ''`: Message ID for logging | List of tuples (Document, score) |
| `query_graph_db_by_url(self, urls: list[str])` | Queries the graph database for nodes with given URLs. | `urls: list[str]`: List of URLs | List of Document objects |
| `query_sefaria_linker(self, text_title="", text_body="", with_text=1, debug=0, max_segments=0, msg_id: str = '')` | Queries the Sefaria Linker API and returns the JSON response. | `text_title: str = ""`: Text title<br>`text_body: str = ""`: Text body<br>`with_text: int = 1`: Include text in response<br>`debug: int = 0`: Debug flag<br>`max_segments: int = 0`: Max segments<br>`msg_id: str = ''`: Message ID for logging | JSON response (dict or str) |
| `retrieve_docs_linker(self, screen_res: str, enriched_query: str, msg_id: str = '', filter_mode: str = 'primary')` | Retrieves documents from the Sefaria Linker API based on a query. | `screen_res: str`: Screen result query<br>`enriched_query: str`: Enriched query<br>`msg_id: str = ''`: Message ID for logging<br>`filter_mode: str = 'primary'`: `'primary'` or `'secondary'` | List of document dictionaries |
| `retrieve_situational_info(self, msg_id: str = '')` | Retrieves the current date and time as a formatted string. | `msg_id: str = ''`: Message ID for logging | Formatted date and time string |
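A short sketch of how vector retrieval and the Sefaria Linker can be combined; the parameter values are illustrative:

```python
# Vector-store retrieval, split into primary and secondary sources.
primary_docs = vh.retrieve_docs(query, msg_id, filter_mode="primary")
secondary_docs = vh.retrieve_docs(query, msg_id, filter_mode="secondary")

# Linker-based retrieval: raw JSON response plus documents parsed from it.
linker_json = vh.query_sefaria_linker(text_body=query, with_text=1, msg_id=msg_id)
linker_docs = vh.retrieve_docs_linker(
    screen_res=query, enriched_query=query, msg_id=msg_id, filter_mode="primary"
)
print(len(primary_docs), len(secondary_docs), len(linker_docs))
```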
- Reference selection and merging functions

| Function Name | Purpose | Input Parameters | Output |
|---|---|---|---|
| `select_reference(self, query: str, retrieval_res, msg_id: str = '')` | Selects useful references from retrieval results using a language model. | `query: str`: Query string<br>`retrieval_res`: Retrieved documents<br>`msg_id: str = ''`: Message ID for logging | Tuple `(selected_retrieval_res: list, tokens_used: int)` |
| `sort_reference(self, scripture_query: str, enriched_query: str, retrieval_res, filter_mode: str = 'primary', msg_id: str = '')` | Sorts retrieval results based on relevance to the query. | `scripture_query: str`: Scripture query<br>`enriched_query: str`: Enriched query<br>`retrieval_res`: Retrieval results<br>`filter_mode: str = 'primary'`: Filter mode<br>`msg_id: str = ''`: Message ID for logging | Tuple `(sorted_src_rel_dict: dict, src_data_dict: dict, src_ref_dict: dict, total_tokens: int)` |
| `merge_references_by_url(self, retrieval_res: list[tuple[Document, float]], msg_id: str = '')` | Merges chunks with the same URL to consolidate content and sources. | `retrieval_res: list[tuple[Document, float]]`: Documents and scores<br>`msg_id: str = ''`: Message ID for logging | Tuple `(sorted_src_rel_dict: dict, src_data_dict: dict, src_ref_dict: dict)` |
| `merge_linker_refs(self, retrieved_docs: list, p_sorted_src_rel_dict: dict, p_src_data_dict: dict, p_src_ref_dict: dict, msg_id: str = '')` | Merges new linker references into existing reference dictionaries. | `retrieved_docs: list`: New documents<br>`p_sorted_src_rel_dict: dict`: Existing relevance dict<br>`p_src_data_dict: dict`: Existing data dict<br>`p_src_ref_dict: dict`: Existing ref dict<br>`msg_id: str = ''`: Message ID for logging | Tuple of updated dictionaries |
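To make the merge step concrete, here is a simplified stand-in for `merge_references_by_url`: chunks sharing a URL are concatenated and the best score per URL is kept. The `url` metadata key is an assumption:

```python
from collections import defaultdict

def merge_chunks_by_url(retrieval_res):
    """Merge (document, score) pairs that share a source URL."""
    texts, scores = defaultdict(list), {}
    for doc, score in retrieval_res:
        url = doc.metadata["url"]  # assumed metadata key for the source URL
        texts[url].append(doc.page_content)
        scores[url] = max(score, scores.get(url, float("-inf")))
    merged_data = {url: "\n".join(parts) for url, parts in texts.items()}
    return scores, merged_data
```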
- Scoring and ranking functions

| Function Name | Purpose | Input Parameters | Output |
|---|---|---|---|
| `score_document_by_graph_distance(self, n_hops: int, start_score: float, score_decrease_per_hop: float) -> float` | Scores a document based on its distance from the central node in the graph. | `n_hops: int`: Number of hops<br>`start_score: float`: Starting score<br>`score_decrease_per_hop: float`: Score decrease per hop | `float` score |
| `rank_documents(self, chunks: list[Document], enriched_query: str, scripture_query: str = None, semantic_similarity_scores: list[float] = None, filter_mode: str = None, msg_id: str = '')` | Ranks documents based on relevance to the query. | `chunks: list[Document]`: Documents to rank<br>`enriched_query: str`: Enriched query<br>`scripture_query: str = None`: Scripture query<br>`semantic_similarity_scores: list[float] = None`: Precomputed scores<br>`filter_mode: str = None`: Filter mode<br>`msg_id: str = ''`: Message ID for logging | Tuple `(sorted_chunks: list[Document], ranking_scores: list[float], total_token_count: int)` |
| `compute_semantic_similarity_documents_query(self, documents: list[Document], query: str, msg_id: str = '')` | Computes semantic similarity between documents and a query. | `documents: list[Document]`: Documents<br>`query: str`: Query string<br>`msg_id: str = ''`: Message ID for logging | `np.array` of similarity scores |
| `get_reference_class(self, documents: list[Document], scripture_query: str, enriched_query: str, msg_id: str = '')` | Determines the reference class for each document based on the query. | `documents: list[Document]`: Documents<br>`scripture_query: str`: Scripture query<br>`enriched_query: str`: Enriched query<br>`msg_id: str = ''`: Message ID for logging | Tuple `(reference_classes: np.array, total_token_count: int)` |
| `get_page_rank_scores(self, documents: list[Document], msg_id: str = '')` | Retrieves PageRank scores for documents. | `documents: list[Document]`: Documents<br>`msg_id: str = ''`: Message ID for logging | `np.array` of PageRank scores |
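The graph-distance score is the simplest of these: a plausible reading of `score_document_by_graph_distance` is a linear decay per hop, sketched below (the actual implementation may clamp or weight differently):

```python
def score_by_graph_distance(n_hops: int, start_score: float,
                            score_decrease_per_hop: float) -> float:
    """Linear decay: each hop away from the central node costs a fixed amount."""
    return start_score - n_hops * score_decrease_per_hop

score_by_graph_distance(2, 1.0, 0.1)  # -> 0.8
```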
- Graph utility functions

| Function Name | Purpose | Input Parameters | Output |
|---|---|---|---|
| `get_graph_neighbors_by_url(self, url: str, relationship: str, depth: int, filter_mode_nodes: str = None, msg_id: str = '')` | Retrieves neighbor nodes from the graph database based on a URL. | `url: str`: Central node URL<br>`relationship: str`: Edge relationship<br>`depth: int`: Neighbor depth<br>`filter_mode_nodes: str = None`: Node filter mode<br>`msg_id: str = ''`: Message ID for logging | List of tuples (Node, distance) |
| `get_chunks_corresponding_to_nodes(self, nodes: list[Document], batch_size: int = 20, max_nodes: int = None, unique_url: bool = True, msg_id: str = '')` | Retrieves chunks corresponding to given nodes. | `nodes: list[Document]`: Nodes<br>`batch_size: int = 20`: Batch size<br>`max_nodes: int = None`: Max nodes<br>`unique_url: bool = True`: Ensure unique URLs<br>`msg_id: str = ''`: Message ID for logging | List of Document objects |
| `get_node_corresponding_to_chunk(self, chunk: Document, msg_id: str = '')` | Retrieves the node corresponding to a given chunk. | `chunk: Document`: Chunk document<br>`msg_id: str = ''`: Message ID for logging | Document object representing the node |
| `is_primary_document(self, doc: Document) -> bool` | Checks if a document is a primary document. | `doc: Document`: Document to check | `bool` |
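As an illustration of the primary/secondary distinction, `is_primary_document` presumably checks a document's metadata against the configured `primary_source_filter`; the metadata field name below is a guess:

```python
def is_primary(doc, primary_source_filter: list[str]) -> bool:
    """A document counts as primary when its source class matches the filter."""
    return doc.metadata.get("source_class") in primary_source_filter  # field name assumed
```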
- Topic ontology

| Function Name | Purpose | Input Parameters | Output |
|---|---|---|---|
| `topic_ontology(self, extraction: str = '', msgid: str = '', slugs_mode: bool = False)` | Processes topic names to find slugs and retrieves topic descriptions. | `extraction: str = ''`: Topic names<br>`msgid: str = ''`: Message ID for logging<br>`slugs_mode: bool = False`: Return slugs if True | Dict of descriptions or list of slugs |
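Both modes of `topic_ontology` in one sketch; the topic names are examples:

```python
# slugs_mode=True returns only the matched topic slugs ...
slugs = vh.topic_ontology(extraction="charity, repentance", slugs_mode=True)

# ... while the default mode returns a dict of topic descriptions.
descriptions = vh.topic_ontology(extraction="charity, repentance")
```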
- Citation and reference-string generation

| Function Name | Purpose | Input Parameters | Output |
|---|---|---|---|
| `generate_ref_str(self, sorted_src_rel_dict, src_data_dict, src_ref_dict, msg_id: str = '', ref_mode: str = 'primary', n_citation_base: int = 0, is_linker_search: bool = False)` | Constructs formatted reference strings and citation lists. | `sorted_src_rel_dict`: Sorted relevance dict<br>`src_data_dict`: Source data dict<br>`src_ref_dict`: Source ref dict<br>`msg_id: str = ''`: Message ID for logging<br>`ref_mode: str = 'primary'`: Reference mode<br>`n_citation_base: int = 0`: Starting citation index<br>`is_linker_search: bool = False`: Linker search flag | Tuple `(conc_ref_data: str, citations: str, deeplinks: list, n_citation: int)` |
| `generate_kg_deeplink(self, deeplinks, msg_id: str = '')` | Generates a Knowledge Graph deep link URL. | `deeplinks`: List of deep links<br>`msg_id: str = ''`: Message ID for logging | `str` deep link URL |
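The numbering logic of `generate_ref_str` can be pictured with a much-reduced sketch; the real method also handles reference modes, deep links, and linker-specific formatting:

```python
def build_citations(sorted_src_rel_dict: dict, src_ref_dict: dict,
                    n_citation_base: int = 0):
    """Assign consecutive citation indices, continuing from n_citation_base."""
    lines, n = [], n_citation_base
    for url in sorted_src_rel_dict:  # assumed to iterate in relevance order
        n += 1
        lines.append(f"[{n}] {src_ref_dict[url]} ({url})")
    return "\n".join(lines), n
```

Threading `n_citation_base` through successive calls is what would let primary and secondary citation lists share a single numbering sequence.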
- Graph traversal retrieval

| Function Name | Purpose | Input Parameters | Output |
|---|---|---|---|
| `graph_traversal_retriever(self, screen_res: str, scripture_query: str, enriched_query: str, filter_mode_nodes: str = None, linker_results: list[dict] = None, semantic_search_results: list[tuple[Document, float]] = None, msg_id: str = '')` | Retrieves related chunks by traversing the graph starting from seed chunks. | `screen_res: str`: Screen result query<br>`scripture_query: str`: Scripture query<br>`enriched_query: str`: Enriched query<br>`filter_mode_nodes: str = None`: Node filter mode<br>`linker_results: list[dict] = None`: Linker results<br>`semantic_search_results: list[tuple[Document, float]] = None`: Semantic search results<br>`msg_id: str = ''`: Message ID for logging | Tuple `(retrieval_res_kg: list[tuple[Document, float]], total_token_count: int)` |
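Putting the pieces together, one plausible end-to-end flow seeds the traversal with semantic search, sorts the merged results, builds the reference string, and answers. This mirrors the tables above but is not the project's verbatim pipeline (the seed score of 1.0 is an assumption):

```python
seeds = [(doc, 1.0) for doc in vh.retrieve_docs(query, msg_id)]
kg_res, t_kg = vh.graph_traversal_retriever(
    screen_res=query, scripture_query=query, enriched_query=query,
    semantic_search_results=seeds, msg_id=msg_id,
)
rel, data, refs, t_sort = vh.sort_reference(query, query, kg_res, msg_id=msg_id)
ref_str, citations, deeplinks, n_cit = vh.generate_ref_str(rel, data, refs, msg_id)
answer, t_qa = vh.qa(query, ref_data=ref_str, msg_id=msg_id)
```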
This guide explains how to modify the config.yaml file for the Virtual Havruta project. The configuration file controls the environment, database connections, Slack integration, model API setups, and various other settings.
- Environment-related parameters
These parameters control the application's behavior, logging, and thought process visibility.
```yaml
environment:
  use_app_mention: false
  show_thought_process: true
  show_kg_link: true
  log_name: Virtual-Havruta
```
- `use_app_mention`: Set to `true` to respond only when mentioned in Slack, or `false` to respond to all messages.
- `show_thought_process`: Set to `true` to display the intermediate thought process in Slack responses, or `false` to hide it.
- `show_kg_link`: Set to `true` to include Knowledge Graph (KG) visualization links in responses, or `false` to hide the KG link.
- `log_name`: Name used for logging. Useful for identifying logs from different runs or environments.
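As a quick check that the file parses and the keys match the documentation above, a minimal loading sketch:

```python
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

env = config["environment"]
if env["show_thought_process"]:
    print(f"[{env['log_name']}] thought-process output enabled")
```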
- Database-related parameters
These settings define the database connections for embedding-based and KG-based queries.
```yaml
database:
  embed:
    url: bolt://publicip:7687
    username: user
    password: password@dev
    top_k: 15
    metadata_fields: ['metadata_field_name1', 'metadata_field_name2']
    topic_fields: ['topic_field_name1', 'topic_field_name2']
  kg:
    url: bolt://publicip_kg:7687
    username: user
    password: password@dev
    order: 1
    direction: both_ways
    k_seeds: 5
    max_depth: 2
    name: db_name
    neo4j_deeplink: http://neodash.graphapp.io/xyz
```
Embed settings:
- `url`: The Neo4j database connection URL.
- `username` / `password`: Database credentials for Neo4j.
- `top_k`: Number of top search results to retrieve.
- `metadata_fields`: Metadata fields used for query filtering.
- `topic_fields`: Topic fields used for expanding queries.
KG settings:
- `url`: Connection URL for the Knowledge Graph database.
- `order`: Specifies the search order (number of hops from the central node).
- `direction`: Determines the direction of edges between nodes. Options are:
  - `incoming`: Search for newer references.
  - `outgoing`: Search for older references.
  - `both_ways`: Search in both directions.
- `k_seeds`: Number of starting seeds for the KG search.
- `max_depth`: Maximum depth for KG traversal, which limits the path length.
- `neo4j_deeplink`: A direct link to the Neo4j visualizer.
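To illustrate how `direction` and `max_depth` might translate into a graph query, here is a sketch using the official `neo4j` driver and a variable-length Cypher pattern; the project's actual queries are not shown here, and the node property and example URL are assumptions:

```python
from neo4j import GraphDatabase

kg = config["database"]["kg"]
driver = GraphDatabase.driver(kg["url"], auth=(kg["username"], kg["password"]))

pattern = {
    "incoming": "<-[*1..%d]-",
    "outgoing": "-[*1..%d]->",
    "both_ways": "-[*1..%d]-",
}[kg["direction"]] % kg["max_depth"]

with driver.session() as session:
    records = session.run(
        f"MATCH (n {{url: $url}}){pattern}(m) RETURN DISTINCT m.url AS url",
        url="https://www.sefaria.org/Genesis.1.1",  # example central node
    )
    print([r["url"] for r in records])
```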
- Slack-related parameters
These parameters configure the Slack bot's authentication.
```yaml
slack:
  slack_bot_token: slack_bot_token
  slack_app_token: slack_app_token
```
- `slack_bot_token`: The token for the Slack bot's authentication.
- `slack_app_token`: The application token used for real-time WebSocket communication with Slack.
- Model API parameters
Settings to configure which models the application uses, including main, support, and embedding models.
```yaml
openai_model_api:
  api_key: openai_model_api_key
  main_model: main_model_name
  main_model_temperature: 0
  support_model: support_model_name
  support_model_temperature: 0
  embedding_model: embedding_model_name
```
- `api_key`: The OpenAI API key for accessing models.
- `main_model`: The main model used to generate responses.
- `main_model_temperature`: Controls the randomness of the main model's output (0 = deterministic, 1 = more random).
- `support_model`: A secondary model for additional tasks.
- `support_model_temperature`: Same as `main_model_temperature`, but for the support model.
- `embedding_model`: Model used for generating embeddings.
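These settings map naturally onto LangChain's OpenAI wrappers; the following sketch is in the spirit of `initialize_llm_instances`, though the exact wiring in the class may differ:

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

api = config["openai_model_api"]
main_llm = ChatOpenAI(model=api["main_model"],
                      temperature=api["main_model_temperature"],
                      api_key=api["api_key"])
support_llm = ChatOpenAI(model=api["support_model"],
                         temperature=api["support_model_temperature"],
                         api_key=api["api_key"])
embeddings = OpenAIEmbeddings(model=api["embedding_model"], api_key=api["api_key"])
```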
- LLM Chain Setups
This section defines the sequence of chains used for different tasks handled by the main model and the support model.
```yaml
llm_chain_setups:
  main_model: ['chain1', 'chain2']
  main_model_json: ['chain3']
  support_model: ['chain4', 'chain5', 'chain6']
  support_model_json: []
```
- `main_model`: Chains used by the main model for text responses.
- `main_model_json`: Chains used for JSON-related tasks by the main model.
- `support_model`: Chains used by the support model for auxiliary tasks.
- `support_model_json`: Chains used for JSON-related tasks by the support model.
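Conceptually, each listed suffix pairs one model with one prompt template, as in this sketch (the `"chat"` category and attribute naming are illustrative; compare `initialize_llm_chains` above):

```python
for suffix in config["llm_chain_setups"]["main_model"]:  # e.g. 'chain1', 'chain2'
    prompt = vh.create_prompt_template("chat", suffix)   # category name assumed
    setattr(vh, f"chain_{suffix}", vh.create_llm_chain(main_llm, prompt))
```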
- Reference Settings
Settings related to how primary and secondary references are filtered and cited.
```yaml
references:
  primary_source_filter: ['filter1', 'filter2', 'filter3']
  num_primary_citations: 1
  num_secondary_citations: 1
```
- `primary_source_filter`: Filters applied to primary references during search.
- `num_primary_citations`: Number of primary source citations to include.
- `num_secondary_citations`: Number of secondary source citations to include.
- Linker References
Settings for linking references from the database.
```yaml
linker_references:
  primary_source_filter: ['filter1', 'filter2', 'filter3', 'filter4', 'filter5']
  num_primary_citations: -1
  num_secondary_citations: -1
```
- `primary_source_filter`: Additional filters applied to primary sources.
- `num_primary_citations`: Number of primary citations to include from linked references.
- `num_secondary_citations`: Number of secondary citations to include from linked references.
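The negative citation counts here differ from the positive ones in the `references` block; a reasonable guess, shown purely as an assumption, is that `-1` means "include all":

```python
def take_citations(citations: list[str], n: int) -> list[str]:
    """Assumed semantics: n == -1 keeps every citation, n >= 0 truncates."""
    return list(citations) if n < 0 else list(citations[:n])
```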
While currently focused on Judaic scriptures, the underlying technology of Virtual Havruta has potential for broader applications. Its adaptability to other domains highlights the project's versatility and the promise of RAG technology in various fields.
This project is a testament to the power of collaboration, bringing together expertise from TUM Venture Labs, Sefaria, and appliedAI Initiative GmbH. We extend our gratitude to all contributors for their dedication and innovative spirit.