[Hands-on notes, pitfalls included] Building a local knowledge base with LazyGraphRAG, using a local Ollama embedding-model service and Volcengine's DeepSeek API

Published: 2025-03-21

LazyGraphRAG

  • April 2024: to address traditional RAG's poor performance on global, corpus-wide summarization queries, multiple Microsoft teams jointly proposed Project GraphRAG (an LLM-driven knowledge graph).
  • July 2024: Microsoft officially open-sourced the GraphRAG project to great attention (23.2k stars to date), but real deployments hit a major cost pain point: the LLM is invoked for entity/relationship extraction and description, and again for community summarization.
  • November 2024: to address that pain point, Microsoft announced LazyGraphRAG, cutting data-indexing cost by roughly 1000x, to about 0.1% of GraphRAG's (it uses NLP noun-phrase extraction to identify concepts and their co-occurrences, then graph statistics to optimize the concept graph and extract a hierarchical community structure).
  • March 2025: the GraphRAG project reached version 2.0.0, officially shipping LazyGraphRAG as open source, i.e., the NLP graph extraction feature.
  • GraphRAG 2.1.0 then added support for JSON input files.


LazyGraphRAG code walkthrough

  • NLP graph extraction: graphrag/index/workflows/extract_graph_nlp.py

  • Building the noun-phrase graph: graphrag/index/operations/build_noun_graph/build_noun_graph.py

  1. Main function build_noun_graph

    • Purpose: builds the noun graph
    • Inputs:
      • text_unit_df: DataFrame containing the text units
      • text_analyzer: the noun-phrase extractor
      • normalize_edge_weights: whether to normalize edge weights
      • num_threads: number of threads (default 4)
      • cache: cache object (optional)
    • Output: two DataFrames, one of nodes and one of edges
  2. Node-extraction function _extract_nodes

    • Purpose: extracts the initial nodes from the text units
    • Main steps:
      • uses the cache to speed up noun-phrase extraction
      • processes text units in parallel
      • groups the extracted noun phrases and counts their frequencies
    • Output: DataFrame with title, frequency, and text-unit IDs
  3. Edge-extraction function _extract_edges

    • Purpose: derives edges from the nodes
    • Main steps:
      • connects nodes that appear in the same text unit
      • orders each pair so the source node always sorts before the target
      • computes edge weights
      • optionally normalizes edge weights with PMI (pointwise mutual information)
    • Output: DataFrame with source, target, weight, and text-unit IDs
  4. Helper functions

    • _create_relationships: builds the relationship pairs between noun phrases
    • _calculate_pmi_edge_weights: computes the PMI edge weights

In short, this module extracts noun phrases from text with NLP techniques, builds a noun graph from them, and scores the strength of the relationships between nodes. It uses parallel processing and caching for performance and offers optional edge-weight normalization; a condensed sketch of the whole pipeline follows below.
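To make the data flow concrete, here is a condensed, self-contained sketch of that pipeline. This is not graphrag's actual code: extract_noun_phrases is a toy stand-in for the configured text_analyzer, caching and multithreading are omitted, and the weighting is a simplified form of what _calculate_pmi_edge_weights computes.

# Condensed sketch of build_noun_graph (illustrative only, not graphrag's code).
import itertools
import math
from collections import Counter

import pandas as pd

def extract_noun_phrases(text: str) -> set[str]:
    """Toy stand-in for the text_analyzer: treat capitalized words as noun phrases."""
    return {w.strip(".,;:") for w in text.split() if w[:1].isupper()}

def build_noun_graph_sketch(text_unit_df: pd.DataFrame):
    # Node extraction: noun phrases per text unit, grouped with frequencies.
    rows = [
        {"title": phrase, "text_unit_id": row["id"]}
        for _, row in text_unit_df.iterrows()
        for phrase in extract_noun_phrases(row["text"])
    ]
    nodes = (
        pd.DataFrame(rows)
        .groupby("title")
        .agg(freq=("text_unit_id", "size"), text_unit_ids=("text_unit_id", list))
        .reset_index()
    )

    # Edge extraction: connect noun phrases co-occurring in the same text unit,
    # with each pair sorted so that source < target.
    pair_counts: Counter = Counter()
    pair_units: dict = {}
    for _, row in text_unit_df.iterrows():
        phrases = sorted(extract_noun_phrases(row["text"]))
        for source, target in itertools.combinations(phrases, 2):
            pair_counts[(source, target)] += 1
            pair_units.setdefault((source, target), []).append(row["id"])

    # PMI-style weight: log of observed co-occurrence vs. chance co-occurrence.
    freq = dict(zip(nodes["title"], nodes["freq"]))
    total = int(nodes["freq"].sum())
    edges = pd.DataFrame(
        {
            "source": s,
            "target": t,
            "weight": math.log((count / total) / ((freq[s] / total) * (freq[t] / total))),
            "text_unit_ids": pair_units[(s, t)],
        }
        for (s, t), count in pair_counts.items()
    )
    return nodes, edges

# Example:
df = pd.DataFrame({"id": [1, 2], "text": ["Alice met Bob in Paris.", "Bob left Paris."]})
nodes_df, edges_df = build_noun_graph_sketch(df)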

LazyGraphRAG technical principles

  • NLP noun-phrase extraction identifies concepts and their co-occurrences

  • Graph statistics optimize the concept graph and extract a hierarchical community structure
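For the second step, graphrag's cluster_graph stage is built on graspologic's hierarchical Leiden algorithm. Below is a minimal sketch of the same idea, assuming networkx and graspologic are installed; the toy edge list is made up for illustration.

# Sketch: hierarchical community extraction over a (toy) concept graph.
import networkx as nx
from graspologic.partition import hierarchical_leiden

graph = nx.Graph()
graph.add_weighted_edges_from([
    ("fuel", "oxidizer", 2.0), ("fuel", "thrust", 1.5), ("oxidizer", "thrust", 1.0),
    ("orbit", "apogee", 2.5), ("orbit", "perigee", 2.5), ("apogee", "perigee", 1.2),
])

# max_cluster_size mirrors cluster_graph.max_cluster_size in settings.yaml below.
for part in hierarchical_leiden(graph, max_cluster_size=10):
    print(part.level, part.cluster, part.node)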

Microsoft LazyGraphRAG: a new generation of ultra-low-cost RAG

https://github.com/microsoft/graphrag/blob/main/CHANGELOG.md

Testing LazyGraphRAG hands-on

Many online tutorials are outdated and claim you must patch the source to use Ollama. With the current release, no source changes are needed; just write the config file correctly.
I ran an example of about 180,000 words; it consumed roughly 500k tokens.
The LazyGraphRAG test results are as follows.

conda create -n graphrag python=3.10
conda activate graphrag
pip install graphrag
graphrag init --root ./ragtest

Data:

curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt -o ./ragtest/input/book.txt

I first tested LazyGraphRAG with a local Ollama embedding-model service plus Volcengine's DeepSeek API to build the local knowledge base.
It failed.
Error log below; the main issue appears to be that DeepSeek's JSON instruction-following is still a bit weak:

  File "/home/zli/miniconda3/envs/graphrag/lib/python3.10/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/home/zli/miniconda3/envs/graphrag/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type ModelMetaclass is not JSON serializable
22:53:43,292 graphrag.callbacks.file_workflow_callbacks INFO Community Report Extraction Error details=None
22:53:43,293 graphrag.index.operations.summarize_communities.strategies WARNING No report found for community: 8.0
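For context on that traceback: the TypeError is the generic error Python's json module raises when handed an unserializable object, and here the object is a Pydantic model class (whose metaclass is ModelMetaclass) rather than parsed report data. A minimal repro of the same error class (the schema name below is made up):

# Minimal repro: json.dumps on a Pydantic model *class* (not an instance)
# fails with "Object of type ModelMetaclass is not JSON serializable".
import json
from pydantic import BaseModel

class CommunityReport(BaseModel):  # hypothetical schema name
    title: str
    summary: str

json.dumps({"report": CommunityReport})  # raises TypeError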

The root cause is that deepseek-V3 just isn't very cooperative here, and I don't have the budget to fine-tune prompts for it.
Switching the chat model from DeepSeek to Doubao made it work!

The configuration is as follows:

vim settings.yaml

models:
  default_chat_model:
    type: openai_chat # or azure_openai_chat
    api_base: https://ark.cn-beijing.volces.com/api/v3/
    # api_version: 2024-05-01-preview
    auth_type: api_key # or azure_managed_identity
    api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file
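    # (note) `graphrag init` generates a .env file next to settings.yaml;
    # put GRAPHRAG_API_KEY=<your Volcengine Ark API key> there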
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    model: # set to your Ark model ID, e.g. deepseek-v3-241226 (I switched to a Doubao model after the failure above)
    # deployment_name: <azure_model_deployment_name>
    encoding_model: cl100k_base # automatically set by tiktoken if left undefined
    model_supports_json: true # recommended if this is available for your model.
    concurrent_requests: 25 # max number of simultaneous LLM requests allowed
    async_mode: threaded # or asyncio
    retry_strategy: native
    max_retries: -1                   # set to -1 for dynamic retry logic (most optimal setting based on server response)
    tokens_per_minute: 0              # set to 0 to disable rate limiting
    requests_per_minute: 0            # set to 0 to disable rate limiting
  default_embedding_model:
    type: openai_embedding # or azure_openai_embedding
    api_base: http://localhost:11434/v1/
    # api_version: 2024-05-01-preview
    #auth_type: api_key # or azure_managed_identity
    api_key: ollama
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    model: bge-m3
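    # (note) assumes the model is already pulled locally, e.g. `ollama pull bge-m3`,
    # and that the Ollama server is listening on its default port 11434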
    # deployment_name: <azure_model_deployment_name>
    encoding_model: cl100k_base # automatically set by tiktoken if left undefined
    model_supports_json: true # recommended if this is available for your model.
    concurrent_requests: 25 # max number of simultaneous LLM requests allowed
    async_mode: threaded # or asyncio
    retry_strategy: native
    max_retries: -1                   # set to -1 for dynamic retry logic (most optimal setting based on server response)
    tokens_per_minute: 0              # set to 0 to disable rate limiting
    requests_per_minute: 0            # set to 0 to disable rate limiting

vector_store:
  default_vector_store:
    type: lancedb
    db_uri: output/lancedb
    container_name: default
    overwrite: True

embed_text:
  model_id: default_embedding_model
  vector_store_id: default_vector_store

### Input settings ###

input:
  type: file # or blob
  file_type: text #[csv, text, json]
  base_dir: "input"

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]
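  # (note) size and overlap are measured in tokens (via the encoding_model), not characters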

### Output settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # [file, blob, cosmosdb]
  base_dir: "cache"

reporting:
  type: file # [file, blob, cosmosdb]
  base_dir: "logs"

output:
  type: file # [file, blob, cosmosdb]
  base_dir: "output"

### Workflow settings ###
# These settings drive standard GraphRAG; when using LazyGraphRAG, the block below must stay commented out
#extract_graph:
#  model_id: default_chat_model
#  prompt: "prompts/extract_graph.txt"
#  entity_types: [organization,person,geo,event]
#  max_gleanings: 1

summarize_descriptions:
  model_id: default_chat_model
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500
# These settings enable LazyGraphRAG (NLP graph extraction)
extract_graph_nlp:
  text_analyzer:
    extractor_type: regex_english # [regex_english, syntactic_parser, cfg]
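    # (note) regex_english is the cheapest option and needs no model downloads;
    # syntactic_parser and cfg rely on spaCy/NLTK resources being installed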

extract_claims:
  enabled: false
  model_id: default_chat_model
  prompt: "prompts/extract_claims.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  model_id: default_chat_model
  graph_prompt: "prompts/community_report_graph.txt"
  text_prompt: "prompts/community_report_text.txt"
  max_length: 8000
  max_input_length: 4000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes (embed_graph must also be enabled)

snapshots:
  graphml: false
  embeddings: false

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  chat_model_id: default_chat_model
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/drift_search_system_prompt.txt"
  reduce_prompt: "prompts/drift_search_reduce_prompt.txt"

basic_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/basic_search_system_prompt.txt"