graphrag

Default Configuration Mode (using YAML/JSON)

The default configuration mode may be configured by using a settings.yml or settings.json file in the data project root. If a .env file is present alongside this config file, it will be loaded, and the environment variables defined there will be available for token replacement in your configuration document using ${ENV_VAR} syntax. graphrag init writes a YAML file by default, but you may use the equivalent JSON form if preferred.

Many of these config values have defaults. Rather than replicate them here, please refer to the constants in the code directly.

For example:

# .env
GRAPHRAG_API_KEY=some_api_key

# settings.yml
llm: 
  api_key: ${GRAPHRAG_API_KEY}

Config Sections

Language Model Setup

models

This is a dict of model configurations. The dict key is used to reference a configuration elsewhere whenever a model instance is needed. In this way, you can specify as many different models as you need and reference them individually in the workflow steps.

For example:

models:
  default_chat_model:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_chat
    model: gpt-4o
    model_supports_json: true
  default_embedding_model:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding
    model: text-embedding-ada-002

Fields

Input Files and Chunking

input

Our pipeline can ingest .csv, .txt, or .json data from an input location. See the inputs page for more details and examples.
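For example, a minimal file-based input configuration might look like the following (an illustrative sketch; exact field names and defaults may vary by version):

```yaml
input:
  type: file        # or blob for Azure Blob Storage
  file_type: text   # text, csv, or json
  base_dir: "input" # directory to scan for input documents
```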

Fields

chunks

These settings configure how we parse documents into text chunks. Chunking is necessary because very large documents may not fit into a single context window, and chunk size also affects graph extraction accuracy. Also note the metadata setting in the input document config, which replicates document metadata into each chunk.
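As an illustration, a chunking configuration might look like this (the values shown are common defaults, not a recommendation):

```yaml
chunks:
  size: 1200    # maximum chunk size, in tokens
  overlap: 100  # tokens shared between consecutive chunks
```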

Fields

Outputs and Storage

output

This section controls the storage mechanism the pipeline uses to export output tables.
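A minimal sketch of a file-based output configuration (field names assumed to mirror the other storage sections):

```yaml
output:
  type: file          # or blob for Azure Blob Storage
  base_dir: "output"  # directory for exported tables
```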

Fields

update_index_output

This section defines a secondary storage location used during incremental indexing, preserving your original outputs.
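For example, incremental runs can be pointed at a separate directory (the directory name here is illustrative):

```yaml
update_index_output:
  type: file
  base_dir: "update_output"  # keeps incremental results apart from the originals
```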

Fields

cache

This section controls the cache mechanism used by the pipeline. This is used to cache LLM invocation results for faster performance when re-running the indexing process.
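A sketch of a file-based cache configuration (assuming the same type/base_dir shape as the other storage sections):

```yaml
cache:
  type: file
  base_dir: "cache"  # LLM responses are cached here and reused on re-runs
```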

Fields

reporting

This section controls the reporting mechanism used by the pipeline, for common events and error messages. The default is to write reports to a file in the output directory. However, you can also choose to write reports to an Azure Blob Storage container.
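For example, to write reports to Azure Blob Storage instead of the default file output (field names are assumptions based on the other storage sections; the container name is illustrative):

```yaml
reporting:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: "graphrag-reports"
```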

Fields

vector_store

Defines where all vectors for the system are stored. Configured for lancedb by default. This is a dict; the key identifies individual store parameters (e.g., for text embedding).
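A sketch of the default lancedb setup (the dict key and db_uri value are illustrative):

```yaml
vector_store:
  default_vector_store:
    type: lancedb
    db_uri: "output/lancedb"  # local path for the LanceDB store
```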

Fields

Workflow Configurations

These settings control each individual workflow as it executes.

workflows

list[str] - This is a list of workflow names to run, in order. GraphRAG has built-in pipelines to configure this, but you can run exactly and only what you want by specifying the list here. Useful if you have done part of the processing yourself.
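For example, to run only a subset of the pipeline (the workflow names below are illustrative placeholders; use the names from your GraphRAG version's built-in pipeline):

```yaml
workflows:
  - create_base_text_units
  - extract_graph
  - create_communities
```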

embed_text

By default, the GraphRAG indexer exports only the embeddings required by our query methods. However, the model defines embeddings for all plaintext fields, and these can be customized by setting the target and names fields.
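For instance, to embed every plaintext field rather than only the required ones (a sketch; the target values and field names shown are assumptions):

```yaml
embed_text:
  model_id: default_embedding_model  # key from the models dict
  target: all                        # embed all plaintext fields
```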

The supported embedding names are:

Fields

extract_graph

Tune the language model-based graph extraction process.
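A sketch of a tuned extraction config (the entity types and gleaning count are illustrative; model_id references a key in the models dict):

```yaml
extract_graph:
  model_id: default_chat_model
  entity_types: [organization, person, geo, event]
  max_gleanings: 1  # extra passes asking the model for missed entities
```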

Fields

summarize_descriptions

Tune the language model-based summarization of entity and relationship descriptions into a single description per item.

Fields

extract_graph_nlp

Defines settings for NLP-based graph extraction methods.

Fields

prune_graph

Parameters for manual graph pruning. Pruning can be used to optimize the modularity of your graph clusters by removing overly connected or rare nodes.
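As an illustration, a pruning configuration might remove very rare nodes and the weakest edges (all field names here are assumptions; consult the constants in the code for the real parameter set):

```yaml
prune_graph:
  min_node_freq: 2         # drop nodes mentioned fewer than twice
  min_edge_weight_pct: 40  # drop the weakest edges by weight percentile
```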

Fields

cluster_graph

These are the settings used for Leiden hierarchical clustering of the graph to create communities.
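For example, to cap community size (the value shown is a common default, not a recommendation):

```yaml
cluster_graph:
  max_cluster_size: 10  # maximum number of nodes per Leiden community
```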

Fields

extract_claims

Tune the optional language model-based claim extraction process.

Fields

community_reports

Tune the language model-based generation of community reports.

Fields

embed_graph

We use node2vec to embed the graph. This is primarily used for visualization, so it is not turned on by default.

Fields

umap

Indicates whether we should run UMAP dimensionality reduction. This is used to provide an x/y coordinate to each graph node, suitable for visualization. If this is not enabled, nodes will receive a 0/0 x/y coordinate. If this is enabled, you must enable graph embedding as well.
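Since UMAP requires graph embeddings, the two settings are typically enabled together, e.g.:

```yaml
embed_graph:
  enabled: true  # node2vec embeddings, required by UMAP
umap:
  enabled: true  # compute an x/y coordinate for each node
```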

Fields

snapshots

Controls optional snapshots of intermediate pipeline artifacts, useful for debugging.

Fields

Query

local_search

Fields

global_search

Fields

drift_search

Fields

basic_search

Fields