Convert CLI to Typer app (#1305)
jgbradley1 authored Oct 24, 2024
1 parent 94f1e62 commit d6e6f5c
Showing 29 changed files with 541 additions and 515 deletions.
4 changes: 4 additions & 0 deletions .semversioner/next-release/patch-20241017135754184606.json
@@ -0,0 +1,4 @@
{
"type": "patch",
"description": "reorganize cli layer"
}
4 changes: 2 additions & 2 deletions .vscode/launch.json
@@ -21,7 +21,7 @@
"poe", "query",
"--root", "<path_to_ragtest_root_demo>",
"--method", "global",
"What are the top themes in this story",
"--query", "What are the top themes in this story",
]
},
{
@@ -30,7 +30,7 @@
"request": "launch",
"module": "poetry",
"args": [
"poe", "prompt_tune",
"poe", "prompt-tune",
"--config",
"<path_to_ragtest_root_demo>/settings.yaml",
]
9 changes: 4 additions & 5 deletions docs/config/init.md
@@ -1,22 +1,21 @@
# Configuring GraphRAG Indexing

To start using GraphRAG, you need to configure the system. The `init` command is the easiest way to get started. It will create a `.env` and `settings.yaml` files in the specified directory with the necessary configuration settings. It will also output the default LLM prompts used by GraphRAG.
To start using GraphRAG, you must generate a configuration file. The `init` command is the easiest way to get started. It will create `.env` and `settings.yaml` files in the specified directory with the necessary configuration settings. It will also output the default LLM prompts used by GraphRAG.

## Usage

```sh
python -m graphrag.index [--init] [--root PATH]
graphrag init [--root PATH]
```

## Options

- `--init` - Initialize the directory with the necessary configuration files.
- `--root PATH` - The root directory to initialize. Default is the current directory.
- `--root PATH` - The project root directory in which to initialize graphrag. Default is the current directory.

## Example

```sh
python -m graphrag.index --init --root ./ragtest
graphrag init --root ./ragtest
```

## Output
22 changes: 11 additions & 11 deletions docs/get_started.md
@@ -52,23 +52,23 @@ Next we'll inject some required config variables:

First let's make sure to setup the required environment variables. For details on these environment variables, and what environment variables are available, see the [variables documentation](config/overview.md).

To initialize your workspace, let's first run the `graphrag.index --init` command.
Since we have already configured a directory named \.ragtest` in the previous step, we can run the following command:
To initialize your workspace, first run the `graphrag init` command.
Since we have already configured a directory named `./ragtest` in the previous step, run the following command:

```sh
python -m graphrag.index --init --root ./ragtest
graphrag init --root ./ragtest
```

This will create two files: `.env` and `settings.yaml` in the `./ragtest` directory.

- `.env` contains the environment variables required to run the GraphRAG pipeline. If you inspect the file, you'll see a single environment variable defined,
`GRAPHRAG_API_KEY=<API_KEY>`. This is the API key for the OpenAI API or Azure OpenAI endpoint. You can replace this with your own API key.
`GRAPHRAG_API_KEY=<API_KEY>`. This is the API key for the OpenAI API or Azure OpenAI endpoint. You can replace this with your own API key. If you are using another form of authentication (e.g. managed identity), please delete this file.
- `settings.yaml` contains the settings for the pipeline. You can modify this file to change the settings for the pipeline.
<br/>

#### <ins>OpenAI and Azure OpenAI</ins>

To run in OpenAI mode, just make sure to update the value of `GRAPHRAG_API_KEY` in the `.env` file with your OpenAI API key.
If running in OpenAI mode, update the value of `GRAPHRAG_API_KEY` in the `.env` file with your OpenAI API key.

#### <ins>Azure OpenAI</ins>

@@ -90,13 +90,13 @@ deployment_name: <azure_model_deployment_name>
Finally we'll run the pipeline!
```sh
python -m graphrag.index --root ./ragtest
graphrag index --root ./ragtest
```

![pipeline executing from the CLI](img/pipeline-running.png)

This process will take some time to run. This depends on the size of your input data, what model you're using, and the text chunk size being used (these can be configured in your `settings.yml` file).
Once the pipeline is complete, you should see a new folder called `./ragtest/output/<timestamp>/artifacts` with a series of parquet files.
Once the pipeline is complete, you should see a new folder called `./ragtest/output` with a series of parquet files.

# Using the Query Engine

@@ -107,19 +107,19 @@ Now let's ask some questions using this dataset.
Here is an example using Global search to ask a high-level question:

```sh
python -m graphrag.query \
graphrag query \
--root ./ragtest \
--method global \
"What are the top themes in this story?"
--query "What are the top themes in this story?"
```

Here is an example using Local search to ask a more specific question about a particular character:

```sh
python -m graphrag.query \
graphrag query \
--root ./ragtest \
--method local \
"Who is Scrooge, and what are his main relationships?"
--query "Who is Scrooge and what are his main relationships?"
```

Please refer to [Query Engine](query/overview.md) docs for detailed information about how to leverage our Local and Global search mechanisms for extracting meaningful insights from data after the Indexer has wrapped up execution.
8 changes: 4 additions & 4 deletions docs/index/cli.md
@@ -3,21 +3,21 @@
The GraphRAG indexer CLI allows for no-code usage of the GraphRAG Indexer.

```bash
python -m graphrag.index --verbose --root </workspace/project/root> \
graphrag index --verbose --root </workspace/project/root> \
--config <custom_config.yml> --resume <timestamp> \
--reporter <rich|print|none> --emit json,csv,parquet \
--nocache
--no-cache
```

## CLI Arguments

- `--verbose` - Adds extra logging information during the run.
- `--root <data-project-dir>` - the data root directory. This should contain an `input` directory with the input data, and an `.env` file with environment variables. These are described below.
- `--init` - This will initialize the data project directory at the specified `root` with bootstrap configuration and prompt-overrides.
- `--resume <output-timestamp>` - if specified, the pipeline will attempt to resume a prior run. The parquet files from the prior run will be loaded into the system as inputs, and the workflows that generated those files will be skipped. The input value should be the timestamped output folder, e.g. "20240105-143721".
- `--config <config_file.yml>` - This will opt-out of the Default Configuration mode and execute a custom configuration. If this is used, then none of the environment-variables below will apply.
- `--reporter <reporter>` - This will specify the progress reporter to use. The default is `rich`. Valid values are `rich`, `print`, and `none`.
- `--dry-run` - Runs the indexing pipeline without executing any steps in order to inspect and validate the configuration file.
- `--emit <types>` - This specifies the table output formats the pipeline should emit. The default is `parquet`. Valid values are `parquet`, `csv`, and `json`, comma-separated.
- `--nocache` - This will disable the caching mechanism. This is useful for debugging and development, but should not be used in production.
- `--no-cache` - This will disable the caching mechanism. This is useful for debugging and development, but should not be used in production.
- `--output <directory>` - Specify the output directory for pipeline artifacts.
- `--reports <directory>` - Specify the output directory for reporting.
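The `--nocache` → `--no-cache` rename above mirrors Typer's convention of deriving a paired negation switch from a boolean parameter (visible further down in this diff, where the `index_cli` parameter `nocache: bool` becomes `cache: bool`). A minimal stdlib sketch of the same flag behavior — illustrative only; the real CLI uses Typer, not argparse:

```python
import argparse

# Boolean option that auto-generates a --no-cache negation, mirroring how
# Typer turns `cache: bool = True` into paired --cache / --no-cache flags.
parser = argparse.ArgumentParser(prog="graphrag")
parser.add_argument(
    "--cache",
    action=argparse.BooleanOptionalAction,  # Python 3.9+
    default=True,
    help="Enable the LLM cache (disable with --no-cache).",
)

args = parser.parse_args(["--no-cache"])
print(args.cache)  # → False
```

With this shape, `--cache` and `--no-cache` both parse into a single positive boolean, which is why the Python parameter can drop the negative `nocache` name.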
8 changes: 4 additions & 4 deletions docs/prompt_tuning/auto_prompt_tuning.md
@@ -13,14 +13,14 @@ Figure 1: Auto Tuning Conceptual Diagram.

## Prerequisites

Before running auto tuning make sure you have already initialized your workspace with the `graphrag.index --init` command. This will create the necessary configuration files and the default prompts. Refer to the [Init Documentation](../config/init.md) for more information about the initialization process.
Before running auto tuning, ensure you have already initialized your workspace with the `graphrag init` command. This will create the necessary configuration files and the default prompts. Refer to the [Init Documentation](../config/init.md) for more information about the initialization process.

## Usage

You can run the main script from the command line with various options:

```bash
python -m graphrag.prompt_tune [--root ROOT] [--domain DOMAIN] [--method METHOD] [--limit LIMIT] [--language LANGUAGE] \
graphrag prompt-tune [--root ROOT] [--domain DOMAIN] [--method METHOD] [--limit LIMIT] [--language LANGUAGE] \
[--max-tokens MAX_TOKENS] [--chunk-size CHUNK_SIZE] [--n-subset-max N_SUBSET_MAX] [--k K] \
[--min-examples-required MIN_EXAMPLES_REQUIRED] [--no-entity-types] [--output OUTPUT]
```
@@ -56,15 +56,15 @@ python -m graphrag.prompt_tune [--root ROOT] [--domain DOMAIN] [--method METHOD
## Example Usage

```bash
python -m graphrag.prompt_tune --root /path/to/project --config /path/to/settings.yaml --domain "environmental news" \
python -m graphrag prompt-tune --root /path/to/project --config /path/to/settings.yaml --domain "environmental news" \
--method random --limit 10 --language English --max-tokens 2048 --chunk-size 256 --min-examples-required 3 \
--no-entity-types --output /path/to/output
```

or, with minimal configuration (suggested):

```bash
python -m graphrag.prompt_tune --root /path/to/project --config /path/to/settings.yaml --no-entity-types
python -m graphrag prompt-tune --root /path/to/project --config /path/to/settings.yaml --no-entity-types
```

## Document Selection Methods
8 changes: 4 additions & 4 deletions docs/query/cli.md
@@ -3,15 +3,15 @@
The GraphRAG query CLI allows for no-code usage of the GraphRAG Query engine.

```bash
python -m graphrag.query --config <config_file.yml> --data <path-to-data> --community_level <comunit-level> --response_type <response-type> --method <"local"|"global"> <query>
graphrag query --config <config_file.yml> --data <path-to-data> --community-level <community-level> --response-type <response-type> --method <"local"|"global"> <query>
```

## CLI Arguments

- `--config <config_file.yml>` - The configuration yaml file to use when running the query. If this is used, then none of the environment-variables below will apply.
- `--data <path-to-data>` - Folder containing the `.parquet` output files from running the Indexer.
- `--community_level <community-level>` - Community level in the Leiden community hierarchy from which we will load the community reports higher value means we use reports on smaller communities. Default: 2
- `--response_type <response-type>` - Free form text describing the response type and format, can be anything, e.g. `Multiple Paragraphs`, `Single Paragraph`, `Single Sentence`, `List of 3-7 Points`, `Single Page`, `Multi-Page Report`. Default: `Multiple Paragraphs`.
- `--community-level <community-level>` - Community level in the Leiden community hierarchy from which we will load the community reports; a higher value means we use reports on smaller communities. Default: 2
- `--response-type <response-type>` - Free-form text describing the response type and format; can be anything, e.g. `Multiple Paragraphs`, `Single Paragraph`, `Single Sentence`, `List of 3-7 Points`, `Single Page`, `Multi-Page Report`. Default: `Multiple Paragraphs`.
- `--method <"local"|"global">` - Method to use to answer the query, one of local or global. For more information check [Overview](overview.md)
- `--streaming` - Stream back the LLM response

@@ -41,4 +41,4 @@ You can further customize the execution by providing these environment variables
- `GRAPHRAG_GLOBAL_SEARCH_DATA_MAX_TOKENS` - Change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000). Default: `12000`
- `GRAPHRAG_GLOBAL_SEARCH_MAP_MAX_TOKENS` - Default: `500`
- `GRAPHRAG_GLOBAL_SEARCH_REDUCE_MAX_TOKENS` - Change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 1000-1500). Default: `2000`
- `GRAPHRAG_GLOBAL_SEARCH_CONCURRENCY` - Default: `32`
- `GRAPHRAG_GLOBAL_SEARCH_CONCURRENCY` - Default: `32`
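These overrides follow a simple read-from-environment-with-default pattern; a hedged sketch of that pattern (the helper name is hypothetical, not GraphRAG's actual implementation):

```python
import os

def env_int(name: str, default: int) -> int:
    """Return an integer override from the environment, or the default.

    Hypothetical helper for illustration -- not GraphRAG's implementation.
    """
    raw = os.environ.get(name)
    return int(raw) if raw is not None else default

# Defaults match the documented values above.
data_max_tokens = env_int("GRAPHRAG_GLOBAL_SEARCH_DATA_MAX_TOKENS", 12000)
concurrency = env_int("GRAPHRAG_GLOBAL_SEARCH_CONCURRENCY", 32)
print(data_max_tokens, concurrency)
```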
8 changes: 8 additions & 0 deletions graphrag/__main__.py
@@ -0,0 +1,8 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""The GraphRAG package."""

from .cli.main import app

app(prog_name="graphrag")
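This new `__main__.py` makes `python -m graphrag` equivalent to the `graphrag` console command, with the former module entrypoints (`graphrag.index`, `graphrag.query`, `graphrag.prompt_tune`) folded into subcommands of one app. A stdlib-only sketch of that flag-to-subcommand shape — the real app uses Typer; argparse is used here only to keep the sketch dependency-free:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # One top-level command with init/index/query/prompt-tune subcommands,
    # replacing the old `python -m graphrag.<module> --flag` invocations.
    parser = argparse.ArgumentParser(prog="graphrag")
    subcommands = parser.add_subparsers(dest="command", required=True)
    for name in ("init", "index", "query", "prompt-tune"):
        sub = subcommands.add_parser(name)
        sub.add_argument("--root", default=".")  # shared project-root option
    return parser

args = build_parser().parse_args(["init", "--root", "./ragtest"])
print(args.command, args.root)  # → init ./ragtest
```

Typer builds the same structure from decorated functions, and `app(prog_name="graphrag")` simply invokes that dispatcher when the package is run as a module.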
4 changes: 2 additions & 2 deletions graphrag/api/prompt_tune.py
@@ -47,7 +47,7 @@ async def generate_indexing_prompts(
domain: str | None = None,
language: str | None = None,
max_tokens: int = MAX_TOKEN_COUNT,
skip_entity_types: bool = False,
discover_entity_types: bool = True,
min_examples_required: PositiveInt = 2,
n_subset_max: PositiveInt = 300,
k: PositiveInt = 15,
@@ -114,7 +114,7 @@ async def generate_indexing_prompts(
)

entity_types = None
if not skip_entity_types:
if discover_entity_types:
reporter.info("Generating entity types...")
entity_types = await generate_entity_types(
llm,
4 changes: 4 additions & 0 deletions graphrag/cli/__init__.py
@@ -0,0 +1,4 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""CLI for GraphRAG."""
100 changes: 19 additions & 81 deletions graphrag/index/cli.py → graphrag/cli/index.py
@@ -1,7 +1,7 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""Main definition."""
"""CLI implementation of index subcommand."""

import asyncio
import logging
@@ -17,17 +17,11 @@
load_config,
resolve_paths,
)
from graphrag.index.emit.types import TableEmitterType
from graphrag.index.validate_config import validate_config_names
from graphrag.logging import ProgressReporter, ReporterType, create_progress_reporter
from graphrag.utils.cli import redact

from .emit.types import TableEmitterType
from .graph.extractors.claims.prompts import CLAIM_EXTRACTION_PROMPT
from .graph.extractors.community_reports.prompts import COMMUNITY_REPORT_PROMPT
from .graph.extractors.graph.prompts import GRAPH_EXTRACTION_PROMPT
from .graph.extractors.summarize.prompts import SUMMARIZE_PROMPT
from .init_content import INIT_DOTENV, INIT_YAML
from .validate_config import validate_config_names

# Ignore warnings from numba
warnings.filterwarnings("ignore", message=".*NumbaDeprecationWarning.*")

@@ -72,37 +66,32 @@ def handle_signal(signum, _):


def index_cli(
root_dir: str,
init: bool,
root_dir: Path,
verbose: bool,
resume: str,
resume: str | None,
update_index_id: str | None,
memprofile: bool,
nocache: bool,
cache: bool,
reporter: ReporterType,
config_filepath: str | None,
config_filepath: Path | None,
emit: list[TableEmitterType],
dryrun: bool,
skip_validations: bool,
output_dir: str | None,
dry_run: bool,
skip_validation: bool,
output_dir: Path | None,
):
"""Run the pipeline with the given config."""
progress_reporter = create_progress_reporter(reporter)
info, error, success = _logger(progress_reporter)
run_id = resume or update_index_id or time.strftime("%Y%m%d-%H%M%S")

if init:
_initialize_project_at(root_dir, progress_reporter)
sys.exit(0)

root = Path(root_dir).resolve()
config = load_config(root, config_filepath)

config.storage.base_dir = output_dir or config.storage.base_dir
config.reporting.base_dir = output_dir or config.reporting.base_dir
config = load_config(root_dir, config_filepath)
config.storage.base_dir = str(output_dir) if output_dir else config.storage.base_dir
config.reporting.base_dir = (
str(output_dir) if output_dir else config.reporting.base_dir
)
resolve_paths(config, run_id)

if nocache:
if not cache:
config.cache.type = CacheType.none

enabled_logging, log_path = enable_logging_with_config(config, verbose)
@@ -114,16 +103,16 @@ def index_cli(
True,
)

if skip_validations:
if skip_validation:
validate_config_names(progress_reporter, config)

info(f"Starting pipeline run for: {run_id}, {dryrun=}", verbose)
info(f"Starting pipeline run for: {run_id}, {dry_run=}", verbose)
info(
f"Using default configuration: {redact(config.model_dump())}",
verbose,
)

if dryrun:
if dry_run:
info("Dry run complete, exiting...", True)
sys.exit(0)

@@ -153,54 +142,3 @@ def index_cli(
success("All workflows completed successfully.", True)

sys.exit(1 if encountered_errors else 0)


def _initialize_project_at(path: str, reporter: ProgressReporter) -> None:
"""Initialize the project at the given path."""
reporter.info(f"Initializing project at {path}")
root = Path(path)
if not root.exists():
root.mkdir(parents=True, exist_ok=True)

settings_yaml = root / "settings.yaml"
if settings_yaml.exists():
msg = f"Project already initialized at {root}"
raise ValueError(msg)

with settings_yaml.open("wb") as file:
file.write(INIT_YAML.encode(encoding="utf-8", errors="strict"))

dotenv = root / ".env"
if not dotenv.exists():
with dotenv.open("wb") as file:
file.write(INIT_DOTENV.encode(encoding="utf-8", errors="strict"))

prompts_dir = root / "prompts"
if not prompts_dir.exists():
prompts_dir.mkdir(parents=True, exist_ok=True)

entity_extraction = prompts_dir / "entity_extraction.txt"
if not entity_extraction.exists():
with entity_extraction.open("wb") as file:
file.write(
GRAPH_EXTRACTION_PROMPT.encode(encoding="utf-8", errors="strict")
)

summarize_descriptions = prompts_dir / "summarize_descriptions.txt"
if not summarize_descriptions.exists():
with summarize_descriptions.open("wb") as file:
file.write(SUMMARIZE_PROMPT.encode(encoding="utf-8", errors="strict"))

claim_extraction = prompts_dir / "claim_extraction.txt"
if not claim_extraction.exists():
with claim_extraction.open("wb") as file:
file.write(
CLAIM_EXTRACTION_PROMPT.encode(encoding="utf-8", errors="strict")
)

community_report = prompts_dir / "community_report.txt"
if not community_report.exists():
with community_report.open("wb") as file:
file.write(
COMMUNITY_REPORT_PROMPT.encode(encoding="utf-8", errors="strict")
)
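The deleted `_initialize_project_at` helper (relocated by this PR rather than dropped — `init` is now its own subcommand) follows a write-if-missing scaffolding pattern: fail fast if `settings.yaml` already exists, otherwise create each file only when absent. A condensed sketch with placeholder file contents (the real templates come from `init_content` and the prompt modules):

```python
from pathlib import Path

PROMPT_NAMES = (
    "entity_extraction",
    "summarize_descriptions",
    "claim_extraction",
    "community_report",
)

def initialize_project_at(root: Path) -> None:
    """Scaffold a project directory; placeholder contents, not real templates."""
    root.mkdir(parents=True, exist_ok=True)
    settings_yaml = root / "settings.yaml"
    if settings_yaml.exists():
        # settings.yaml is the sentinel: refuse to re-initialize over it.
        msg = f"Project already initialized at {root}"
        raise ValueError(msg)
    settings_yaml.write_text("# settings placeholder\n", encoding="utf-8")
    dotenv = root / ".env"
    if not dotenv.exists():  # never clobber an existing API key
        dotenv.write_text("GRAPHRAG_API_KEY=<API_KEY>\n", encoding="utf-8")
    prompts_dir = root / "prompts"
    prompts_dir.mkdir(parents=True, exist_ok=True)
    for name in PROMPT_NAMES:
        prompt_file = prompts_dir / f"{name}.txt"
        if not prompt_file.exists():
            prompt_file.write_text(f"{name} placeholder\n", encoding="utf-8")
```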