API Overview

In [1]

# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License.

API Overview

This notebook demonstrates how to interact with GraphRAG as a library through its API rather than through the CLI. Note that GraphRAG's CLI itself goes through this API for all of its operations.

In [2]

from pathlib import Path
from pprint import pprint

import pandas as pd

import graphrag.api as api
from graphrag.config.load_config import load_config
from graphrag.index.typing.pipeline_run_result import PipelineRunResult

In [3]

PROJECT_DIRECTORY = "<your project directory>"

Prerequisite

All API operations require a GraphRagConfig object. It is the primary means of controlling GraphRAG's behavior and can be instantiated from a settings.yaml configuration file.

Please refer to the CLI docs for more detailed information on how to generate the settings.yaml file.

Generate a `GraphRagConfig` object

In [4]

# note that we expect this to fail on the deployed docs because the PROJECT_DIRECTORY is not set to a real location.
# if you run this notebook locally, make sure to point at a location containing your settings.yaml
graphrag_config = load_config(Path(PROJECT_DIRECTORY))

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[4], line 3
      1 # note that we expect this to fail on the deployed docs because the PROJECT_DIRECTORY is not set to a real location.
      2 # if you run this notebook locally, make sure to point at a location containing your settings.yaml
----> 3 graphrag_config = load_config(Path(PROJECT_DIRECTORY))

File ~/work/graphrag/graphrag/graphrag/config/load_config.py:183, in load_config(root_dir, config_filepath, cli_overrides)
    151 """Load configuration from a file.
    152 
    153 Parameters
   (...)    180     If there are pydantic validation errors when instantiating the config.
    181 """
    182 root = root_dir.resolve()
--> 183 config_path = _get_config_path(root, config_filepath)
    184 _load_dotenv(config_path)
    185 config_extension = config_path.suffix

File ~/work/graphrag/graphrag/graphrag/config/load_config.py:106, in _get_config_path(root_dir, config_filepath)
    104         raise FileNotFoundError(msg)
    105 else:
--> 106     config_path = _search_for_config_in_root_dir(root_dir)
    108 if not config_path:
    109     msg = f"Config file not found in root directory: {root_dir}"

File ~/work/graphrag/graphrag/graphrag/config/load_config.py:40, in _search_for_config_in_root_dir(root)
     38 if not root.is_dir():
     39     msg = f"Invalid config path: {root} is not a directory"
---> 40     raise FileNotFoundError(msg)
     42 for file in _default_config_files:
     43     if (root / file).is_file():

FileNotFoundError: Invalid config path: /home/runner/work/graphrag/graphrag/docs/examples_notebooks/<your project directory> is not a directory
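The failure above comes from the placeholder value of PROJECT_DIRECTORY. A minimal sketch of a check you could run before calling load_config is shown below. The helper name `check_project_directory` is hypothetical and not part of GraphRAG, and it only looks for settings.yaml, the filename this notebook mentions; the library may accept other configuration filenames as well.

```python
from pathlib import Path


def check_project_directory(project_directory: str) -> list[str]:
    """Return a list of problems that would prevent loading a config.

    Hypothetical helper (not part of GraphRAG). An empty list means the
    directory exists and contains a settings.yaml file.
    """
    problems = []
    root = Path(project_directory)
    if not root.is_dir():
        problems.append(f"{root} is not a directory")
    elif not (root / "settings.yaml").is_file():
        problems.append(f"no settings.yaml found in {root}")
    return problems


if __name__ == "__main__":
    # The placeholder value from the cell above is reported as a problem.
    for problem in check_project_directory("<your project directory>"):
        print(problem)
```

If the returned list is empty, the directory is at least a plausible argument for `load_config(Path(PROJECT_DIRECTORY))`; validation of the file's contents is still left to GraphRAG.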

Indexing API

Indexing is the process of ingesting raw text data and constructing a knowledge graph. GraphRAG currently supports plain text (.txt) and .csv file formats.

Build an index

In [5]

index_result: list[PipelineRunResult] = await api.build_index(config=graphrag_config)

# index_result is a list of workflows that make up the indexing pipeline that was run
for workflow_result in index_result:
    status = f"error\n{workflow_result.errors}" if workflow_result.errors else "success"
    print(f"Workflow Name: {workflow_result.workflow}\tStatus: {status}")

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[5], line 1
----> 1 index_result: list[PipelineRunResult] = await api.build_index(config=graphrag_config)
      3 # index_result is a list of workflows that make up the indexing pipeline that was run
      4 for workflow_result in index_result:

NameError: name 'graphrag_config' is not defined
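When a run does succeed, the loop above prints one line per workflow. As a small sketch, the same two attributes it reads (`workflow` and `errors`) can be used to collect only the failed workflows. `StubResult` is a hypothetical stand-in used here so the example runs without GraphRAG installed; it is not the library's PipelineRunResult, and `failed_workflows` is likewise an illustrative name.

```python
from dataclasses import dataclass, field


@dataclass
class StubResult:
    """Hypothetical stand-in exposing only the attributes read in the loop above."""

    workflow: str
    errors: list = field(default_factory=list)


def failed_workflows(results) -> list[str]:
    """Return the names of workflows that reported at least one error."""
    # A falsy `errors` value (None or an empty list) counts as success.
    return [result.workflow for result in results if result.errors]


if __name__ == "__main__":
    demo = [
        StubResult("workflow_a"),
        StubResult("workflow_b", errors=[ValueError("example failure")]),
    ]
    failed = failed_workflows(demo)
    if failed:
        print("Failed workflows:", ", ".join(failed))
```

Collecting the failures this way makes it easy to stop before querying an index that was only partially built.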

Query an index

To query an index, several index files must first be read into memory and passed to the query API.

In [6]


entities = pd.read_parquet(f"{PROJECT_DIRECTORY}/output/entities.parquet")
communities = pd.read_parquet(f"{PROJECT_DIRECTORY}/output/communities.parquet")
community_reports = pd.read_parquet(
    f"{PROJECT_DIRECTORY}/output/community_reports.parquet"
)

response, context = await api.global_search(
    config=graphrag_config,
    entities=entities,
    communities=communities,
    community_reports=community_reports,
    community_level=2,
    dynamic_community_selection=False,
    response_type="Multiple Paragraphs",
    query="Who is Scrooge and what are his main relationships?",
)

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[6], line 1
----> 1 entities = pd.read_parquet(f"{PROJECT_DIRECTORY}/output/entities.parquet")
      2 communities = pd.read_parquet(f"{PROJECT_DIRECTORY}/output/communities.parquet")
      3 community_reports = pd.read_parquet(
      4     f"{PROJECT_DIRECTORY}/output/community_reports.parquet"
      5 )

File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/pandas/io/parquet.py:669, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, dtype_backend, filesystem, filters, **kwargs)
    666     use_nullable_dtypes = False
    667 check_dtype_backend(dtype_backend)
--> 669 return impl.read(
    670     path,
    671     columns=columns,
    672     filters=filters,
    673     storage_options=storage_options,
    674     use_nullable_dtypes=use_nullable_dtypes,
    675     dtype_backend=dtype_backend,
    676     filesystem=filesystem,
    677     **kwargs,
    678 )

File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/pandas/io/parquet.py:258, in PyArrowImpl.read(self, path, columns, filters, use_nullable_dtypes, dtype_backend, storage_options, filesystem, **kwargs)
    256 if manager == "array":
    257     to_pandas_kwargs["split_blocks"] = True
--> 258 path_or_handle, handles, filesystem = _get_path_or_handle(
    259     path,
    260     filesystem,
    261     storage_options=storage_options,
    262     mode="rb",
    263 )
    264 try:
    265     pa_table = self.api.parquet.read_table(
    266         path_or_handle,
    267         columns=columns,
   (...)    270         **kwargs,
    271     )

File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/pandas/io/parquet.py:141, in _get_path_or_handle(path, fs, storage_options, mode, is_dir)
    131 handles = None
    132 if (
    133     not fs
    134     and not is_dir
   (...)    139     # fsspec resources can also point to directories
    140     # this branch is used for example when reading from non-fsspec URLs
--> 141     handles = get_handle(
    142         path_or_handle, mode, is_text=False, storage_options=storage_options
    143     )
    144     fs = None
    145     path_or_handle = handles.handle

File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/pandas/io/common.py:882, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    873         handle = open(
    874             handle,
    875             ioargs.mode,
   (...)    878             newline="",
    879         )
    880     else:
    881         # Binary mode
--> 882         handle = open(handle, ioargs.mode)
    883     handles.append(handle)
    885 # Convert BytesIO or file objects passed with an encoding

FileNotFoundError: [Errno 2] No such file or directory: '<your project directory>/output/entities.parquet'
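The query cell reads three parquet files from the output folder, and the traceback above shows what happens when the first one is missing. The sketch below checks for all three up front. The helper name `missing_index_tables` is hypothetical, and the file names are simply the ones used in the cell above; other query methods may need different tables.

```python
from pathlib import Path

# Table names taken from the query cell above.
REQUIRED_TABLES = ("entities", "communities", "community_reports")


def missing_index_tables(project_directory: str) -> list[str]:
    """Return the names of required tables with no parquet file in output/."""
    output_dir = Path(project_directory) / "output"
    return [
        name
        for name in REQUIRED_TABLES
        if not (output_dir / f"{name}.parquet").is_file()
    ]


if __name__ == "__main__":
    missing = missing_index_tables("<your project directory>")
    if missing:
        print("Run indexing first; missing tables:", ", ".join(missing))
```

An empty result only means the files exist; it does not confirm that they were produced by a successful indexing run.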

The response object is the official response from GraphRAG, while the context object holds various metadata about the querying process used to obtain the final response.

In [7]

print(response)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[7], line 1
----> 1 print(response)

NameError: name 'response' is not defined

Digging deeper into the context gives users extremely granular information, such as which sources of data (down to the level of text chunks) were ultimately retrieved and used as part of the context sent to the LLM.

In [8]

pprint(context)  # noqa: T203

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[8], line 1
----> 1 pprint(context)  # noqa: T203

NameError: name 'context' is not defined