Sharing my config to switch to a local LLM and embedding #374

Open
KylinMountain opened this issue Jul 5, 2024 · 22 comments
Comments

@KylinMountain
Contributor

settings.yaml

Configure the llm block to point at llama3 on Groq, or at any other model served behind an OpenAI-compatible API:

llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: llama3-8b-8192
  model_supports_json: false # recommended if this is available for your model.
  api_base: https://api.groq.com/openai/v1
  max_tokens: 8192
  concurrent_requests: 1 # the number of parallel inflight requests that may be made
  tokens_per_minute: 28000 # set a leaky bucket throttle
  requests_per_minute: 29 # set a leaky bucket throttle
  # request_timeout: 180.0
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  max_retries: 10
  max_retry_wait: 60.0
  sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
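
Before running a full index, it is worth a quick sanity check that the endpoint and model name respond. This is just a minimal curl against the OpenAI-compatible chat route implied by the api_base above; adjust the model name to whatever your provider actually serves:

curl "https://api.groq.com/openai/v1/chat/completions" \
  -H "Authorization: Bearer ${GRAPHRAG_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3-8b-8192", "messages": [{"role": "user", "content": "ping"}]}'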

I use llama.cpp to serve the embedding API; its server is compatible with the OpenAI API.
The start command:

./server -m ./models/mymodels/qwen1.5-chat-ggml-model-Q4_K_M.gguf -c 8192 -n -1 -t 7 --embeddings
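
To confirm the server is up, a minimal check against the embeddings route (assuming your llama.cpp build exposes the OpenAI-style /v1/embeddings endpoint):

curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "text-embedding-ada-002", "input": "hello world"}'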

So the embeddings section of the settings config is:

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-ada-002
    api_base: http://localhost:8080
    batch_size: 1 # the number of documents to send in a single request
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

But....

⠦ GraphRAG Indexer
├── Loading Input (text) - 1 files loaded (0 filtered) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
├── create_base_entity_graph
├── create_final_entities
├── create_final_nodes
├── create_final_communities
├── join_text_units_to_entity_ids
├── create_final_relationships
├── join_text_units_to_relationship_ids
└── create_final_community_reports
❌ Errors occurred during the pipeline run, see logs for more details.

@gdhua

gdhua commented Jul 5, 2024

Does that work? I would also like to switch to a local ollama-supported model.

@qwaszaq

qwaszaq commented Jul 5, 2024

I also switched to Mixtral 8x7B under LM Studio, and to Nomic for embeddings, using the connection details LM Studio reports.
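
For anyone wanting to replicate that, a rough sketch of the relevant settings.yaml fields; the api_base is LM Studio's default local server address, and the model identifiers are placeholders for whatever names LM Studio shows for the models you have loaded:

llm:
  api_key: ${GRAPHRAG_API_KEY}       # LM Studio's local server typically doesn't check the key
  type: openai_chat
  model: mixtral-8x7b-instruct       # placeholder: use the identifier LM Studio reports
  model_supports_json: false
  api_base: http://localhost:1234/v1 # LM Studio's default local server

embeddings:
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding
    model: nomic-embed-text-v1.5     # placeholder
    api_base: http://localhost:1234/v1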

@KylinMountain
Contributor Author

Finally, it works. There is a bug in create_final_community_reports: it uses the global llm block from settings.yaml instead of the llm override under community_reports.

Llama3's context window is only 8192 tokens, which is not enough for the summarization in create_final_community_reports, so you need a model with a larger context window, e.g. 32k (or shrink the report input budget; see the sketch after the log below).

⠙ GraphRAG Indexer 
├── Loading Input (text) - 1 files loaded (0 filtered) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
├── create_base_entity_graph
├── create_final_entities
├── create_final_nodes
├── create_final_communities
├── join_text_units_to_entity_ids
├── create_final_relationships
├── join_text_units_to_relationship_ids
├── create_final_community_reports
├── create_final_text_units
├── create_base_documents
└── create_final_documents
🚀 All workflows completed successfully.
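
If you are stuck on an 8k-context model, one possible workaround (untested here) is to shrink the community report budgets in settings.yaml so prompt plus context stays well under the window:

community_reports:
  max_length: 1500       # tokens the generated report may use
  max_input_length: 4000 # keep prompt + community context comfortably inside an 8k window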

@KylinMountain
Contributor Author

Does that work? I would also like to switch to a local ollama-supported model.

ollama doesn't support an OpenAI-compatible embedding API, so try using llama.cpp to serve the embedding model, for example as sketched below.
Otherwise you can modify the code to use a HuggingFace embedding model.
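
A rough sketch of such a server command; the GGUF path and model name are hypothetical, so point it at whichever embedding GGUF you actually have locally:

# hypothetical model file — substitute any locally converted embedding GGUF
./server -m ./models/nomic-embed-text-v1.5.Q4_K_M.gguf -c 2048 --embeddings --port 8080

Then set embeddings.llm.api_base to http://localhost:8080, as in the config above.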

@bernardmaltais

bernardmaltais commented Jul 5, 2024

Llama3's context window is only 8192 tokens, which is not enough for the summarization in create_final_community_reports, so you need a model with a larger context window, e.g. 32k.

What model do you recommend for the task?

@qwaszaq

qwaszaq commented Jul 5, 2024 via email

@KylinMountain
Contributor Author

I am using Moonshot and Qwen Max.
You can also try Mixtral 8x7B, which has a 32k context window. Note that on Groq's free tier the limit is only 5000 tokens per minute, which is not workable for indexing; if you stay on such an endpoint, throttle accordingly (see the sketch below).
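
The leaky-bucket keys shown in the config above are the relevant knobs; illustrative values for a 5000-TPM tier:

llm:
  tokens_per_minute: 5000  # match the provider's advertised TPM limit
  requests_per_minute: 25  # illustrative; check your tier's RPM limit
  concurrent_requests: 1
  max_retries: 10
  max_retry_wait: 60.0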

@emrgnt-cmplxty

Are people finding that OSS models are strong enough to actually do meaningful work with the graphRAG approach in this repository?

@bmaltais

bmaltais commented Jul 5, 2024

Hard to tell. Even commercial models like gpt-3.5-turbo are not providing mind-blowing results when compared to something like Google's NotebookLM.

A lot of the time GraphRAG fails to provide the correct answer where NotebookLM nails it.

Example GraphRAG global:

python -m graphrag.query --method global --root . "What is an example of a windows virtual machine name structure?"


INFO: Reading settings from settings.yaml
creating llm client with {'api_key': 'REDACTED,len=56', 'type': "openai_chat", 'model': 'gpt-3.5-turbo', 'max_tokens': 4000, 'request_timeout': 180.0, 'api_base': None, 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': True, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}

SUCCESS: Global Search Response: ### Windows Virtual Machine Name Structure

Analysts have highlighted that a typical example of a Windows virtual machine name structure in the Azure environment follows a specific convention. For instance, a virtual machine name could be structured as SCPC-CTO-VDC-CORE-VM01. In this example, the initial segment, SCPC-CTO-VDC-CORE, signifies the resource group to which the virtual machine belongs. The latter part, VM01, indicates the specific instance of the virtual machine [Data: Reports (71, 78, 77, 60, 46, +more)].

This naming convention showcases the importance of incorporating various identifiers within the virtual machine name to denote crucial information such as the purpose, ownership, and sequence of the virtual machine within the infrastructure. Such structured naming not only aids in easy identification but also plays a significant role in efficient resource management within the Cloud Computing domain.

Analysts also emphasize that the naming structure for Windows virtual machines may include elements like the resource group name, department abbreviation, environment designation, and sequential numbering for individual instances. These components collectively contribute to creating a standardized and organized naming system that facilitates effective resource allocation and management [Data: Reports (71, 78, 77, 60, 46, +more)].

In essence, the meticulous design of the Windows virtual machine name structure, as outlined by the analysts, serves as a fundamental aspect of maintaining clarity, consistency, and operational efficiency within the Azure environment. Understanding and adhering to such naming conventions shall play a vital role in streamlining processes and enhancing overall system coherence.

Example GraphRAG local:

python -m graphrag.query --method local --root . "What is an example of a windows virtual machine name structure?" 


INFO: Reading settings from settings.yaml
creating llm client with {'api_key': 'REDACTED,len=56', 'type': "openai_chat", 'model': 'gpt-3.5-turbo', 'max_tokens': 4000, 'request_timeout': 180.0, 'api_base': None, 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': True, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}
creating embedding llm client with {'api_key': 'REDACTED,len=56', 'type': "openai_embedding", 'model': 'text-embedding-3-small', 'max_tokens': 4000, 'request_timeout': 180.0, 'api_base': None, 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': None, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}

SUCCESS: Local Search Response: An example of a Windows virtual machine name structure within the Azure environment follows a specific naming convention outlined in the data. For instance, a Windows virtual machine name typically consists of various components that adhere to the standard naming patterns. In the context of the provided data, a Windows virtual machine name might include elements such as the department code, environment, CSP region, device type, and a user-defined string. An example of such a naming structure could be "ScPcSWA-MyApp01" [Data: Sources (38), Sources (36)].

This naming convention ensures consistency and clarity in identifying and managing Windows virtual machines within the Azure cloud computing environment. By incorporating specific elements in the name, such as the department code and user-defined string, administrators can easily track and categorize virtual machines based on their purpose, location, or ownership. Adhering to these naming standards is crucial for effective resource management and governance in cloud environments like Azure.

Same question with NotebookLM:

Windows Virtual Machine Name Structure Example

An example of a Windows virtual machine name structure is **ScPcSWA-MyApp01**. This structure uses the following fields: 

* **ScPc:** This is the four-character prefix representing the department code (Sc), environment (P for Production), and cloud service provider region (c for Azure Central). 
* **SWA:** This is the optional device type, representing Server, Windows, and Domain Controller. 
* **-:** This is the mandatory field delimiter.
* **MyApp01:** This is the user-defined string.

The device type field is optional and aligns with SSC's end-state data center naming standard and asset tracking systems. The user-defined string allows for flexibility and customization based on specific requirements.

GraphRAG global was totally wrong. Local was better... but NotebookLM provided a more relevant answer.

Maybe GPT-4o would do better... but I am not willing to pay to find out.

@jgbradley1
Contributor

jgbradley1 commented Jul 5, 2024

I can understand the cost argument for development/testing reasons. What we’ve found so far is that use of OSS models leads to more noise in the knowledge graph and therefore a degradation in the overall quality of the graph that is constructed. With a subpar graph, you’re likely to see a wide range of issues in the query response.


We encourage testing with other models, but we find that the gpt-4-turbo and gpt-4o LLMs provide the best quality in practice (at this time). Models that produce low-precision results can cause problems in the knowledge graph due to the noise they introduce. With the GPT-4 family, those models are strongly biased toward precision and the noise is minimal (even when compared to gpt-3.5-turbo).


For a better quality knowledge graph construction, also consider taking a closer look at the prompts generated from the auto-templating process. These prompts are a vital component of the graphrag approach. Our docs don’t currently cover this feature in detail but you can increase the quality of your knowledge graphs by manually reviewing the auto-generated prompts and editing/tuning them (if there are clear errors) to your own data before indexing.
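
For reference, a minimal invocation of the auto-templating step looks roughly like this; the exact flags vary by graphrag version, so check python -m graphrag.prompt_tune --help on your install:

# generate data-specific prompts (they land in the project's prompts/ folder),
# then hand-review and edit them before indexing
python -m graphrag.prompt_tune --root .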

@bmaltais

bmaltais commented Jul 5, 2024

@jgbradley1 Thank you for the info. I did create the prompts for the document using the auto-generation feature of graphrag. It still performed worse than expected, probably because I used gpt-3.5-turbo for the whole process.

@COPILOT-WDP

What dataset have you indexed? Would be curious to run the process using GPT-4o and compare to NotebookLM (running on Gemini 1.5 Pro I believe).

@ayushjadia

Does that work? I would also like to switch to a local ollama-supported model.

ollama doesn't support an OpenAI-compatible embedding API, so try using llama.cpp to serve the embedding model. Otherwise you can modify the code to use a HuggingFace embedding model.

How can we modify the code to support HuggingFace embeddings too?

@RicardoLeeV587

RicardoLeeV587 commented Jul 10, 2024

Hi everyone, I'd like to share my configuration for running GraphRAG with a local LLM and embedding model.

For the LLM, I use Mistral-7B-Instruct-v0.3. It has a 60K+ token input length, so it can handle create_community_reports easily.

For the embedding model, I use e5-mistral-7b-instruct, which is the best open-source sentence embedding model I found through a literature review.

Both models can be served through vLLM, so you can build your local RAG system with the speed boost vLLM provides.

Besides, there is a small issue in the query phase. Since GraphRAG calls the LLM server in the OpenAI style, the "system" role is not supported by the Mistral chat template. However, you can supply your own custom chat template to overcome this issue. Here is the template I use:

{%- for message in messages %}
    {%- if message['role'] == 'system' -%}
        {{- message['content'] -}}
    {%- else -%}
        {%- if message['role'] == 'user' -%}
            {{-'[INST] ' + message['content'].rstrip() + ' [/INST]'-}}
        {%- else -%}
            {{-'' + message['content'] + '</s>' -}}
        {%- endif -%}
    {%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
    {{-''-}}
{%- endif -%}

I have already run the whole process locally on the novel “A Christmas Carol”. I hope this helps everyone who wants to build their own local GraphRAG 🎉.

@menghongtao

Hi everyone, I'd like to share my configuration for running GraphRAG with a local LLM and embedding model. [...]

Thanks for sharing. I also want to use the Mistral model; could you please paste your settings.yaml file?

@ayushjadia

Hi everyone, I'd like to share my configuration for running GraphRAG with a local LLM and embedding model. [...]

Please share your settings.yaml file.

@RicardoLeeV587

Hi everyone. Here is the settings.yaml I used

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  # model: gpt-4-turbo-preview
  # model: "/data3/litian/Redemption/LLama-3/Meta-Llama-3-8B-Instruct"
  model: "/data3/litian/Redemption/generativeModel/Mistral-7B-Instruct-v0.3"
  # model: "/data3/litian/Redemption/generativeModel/Meta-Llama-3-8B-Instruct"
  model_supports_json: false # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_base: https://<instance>.openai.azure.com
  api_base: http://localhost:8000/v1
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    # model: text-embedding-3-small
    model: "/data3/litian/Redemption/embeddingModel/test/e5-mistral-7b-instruct"
    # api_base: https://<instance>.openai.azure.com
    api_base: http://localhost:8001/v1
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional
  


chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents
    
input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

@menghongtao

Thanks for sharing! I have another question: where do I set the template you pasted earlier?

@RicardoLeeV587

Thanks for sharing! I have another question: where do I set the template you pasted earlier?

With vLLM you can use --chat-template to specify your own template. The bash script is shown as follows:

base_model="/data3/litian/Redemption/generativeModel/Mistral-7B-Instruct-v0.3"

api_key="12345"
n_gpu=1

python -m vllm.entrypoints.openai.api_server \
  --model ${base_model} \
  --dtype float16 \
  --tensor-parallel-size ${n_gpu} \
  --api-key ${api_key} \
  --enforce-eager \
  --chat-template=./template/mistral.jinja

@1193700079

Thanks for sharing! I have another question: where do I set the template you pasted earlier?

With vLLM you can use --chat-template to specify your own template. The bash script is shown as follows:

base_model="/data3/litian/Redemption/generativeModel/Mistral-7B-Instruct-v0.3"

api_key="12345"
n_gpu=1

python -m vllm.entrypoints.openai.api_server \
  --model ${base_model} \
  --dtype float16 \
  --tensor-parallel-size ${n_gpu} \
  --api-key ${api_key} \
  --enforce-eager \
  --chat-template=./template/mistral.jinja

Hello, I would like to know how to use vLLM to start the embedding model.

I looked at your settings.yaml:

embeddings:
  llm:
    model: "/data3/litian/Redemption/embeddingModel/test/e5-mistral-7b-instruct"
    api_base: http://localhost:8001/v1
llm:
  model: "/data3/litian/Redemption/generativeModel/Mistral-7B-Instruct-v0.3"
  api_base: http://localhost:8000/v1

@RicardoLeeV587

Hello, I would like to know how to use vLLM to start the embedding model.

Hi, actually vLLM does support e5-mistral-7b-instruct. I think this is the only embedding model that vLLM supports officially (if I am wrong, please correct me 😊). You can start it with the following command:

base_model="/data3/litian/Redemption/embeddingModel/test/e5-mistral-7b-instruct"

api_key="12345"
n_gpu=1

python -m vllm.entrypoints.openai.api_server --port 8001 --model ${base_model} --dtype auto --tensor-parallel-size ${n_gpu} --api-key ${api_key}

@s106916
Contributor

s106916 commented Jul 13, 2024

This is a temporary solution for local ollama:
https://github.com/s106916/graphrag
