
[Ollama] GraphRAG Community Support for running Ollama #345

Open
dx111ge opened this issue Jul 3, 2024 · 20 comments
Labels
community_support (Issue handled by community members) · good first issue (Good for newcomers) · oss_llm (OSS LLM related issue)

Comments

@dx111ge

dx111ge commented Jul 3, 2024

Is there a working example of using Ollama, or is it not supposed to work? I did try, but without any success.

Thanks in advance

@bmaltais

bmaltais commented Jul 3, 2024

Embeddings are not working with Ollama... I was able to get things working with Ollama for the entities and OpenAI for the embeddings.
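For reference, a minimal .env sketch of that split setup (Ollama for the entity-extraction LLM, OpenAI for the embeddings). The GRAPHRAG_EMBEDDING_API_KEY variable name is my assumption; the other variables appear later in this thread:

# Entity/graph extraction via Ollama's OpenAI-compatible endpoint
GRAPHRAG_API_KEY=ollama
GRAPHRAG_API_BASE=http://localhost:11434/v1
GRAPHRAG_LLM_MODEL=llama3

# Embeddings stay on OpenAI (GRAPHRAG_EMBEDDING_API_KEY is an assumed variable name)
GRAPHRAG_EMBEDDING_API_BASE=https://api.openai.com/v1
GRAPHRAG_EMBEDDING_API_KEY=sk-...
GRAPHRAG_EMBEDDING_MODEL=text-embedding-3-small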

@bmaltais

bmaltais commented Jul 3, 2024

Working config can be found here: #339 (comment)

@av

av commented Jul 3, 2024

Ollama works as expected

GRAPHRAG_API_KEY=123
GRAPHRAG_API_BASE=http://172.17.0.1:11434/v1
# GRAPHRAG_LLM_MODEL=llama3:instruct
GRAPHRAG_LLM_MODEL=codestral
GRAPHRAG_LLM_THREAD_COUNT=4
GRAPHRAG_LLM_CONCURRENT_REQUESTS=8
GRAPHRAG_LLM_MAX_TOKENS=2048

GRAPHRAG_EMBEDDING_API_BASE=http://172.17.0.1:11435/v1
GRAPHRAG_EMBEDDING_MODEL=mxbai-embed-large

:11435 is a dead-simple proxy that converts HTTP requests from OAI to Ollama format

API shapes

OAI

JSON.stringify({
  object: "list",
  data: [
    ...results.map((r, i) => ({
      object: "embedding",
      index: i,
      embedding: r.embedding,
    })),
  ],
  model,
  usage: {
    prompt_tokens: 0,
    total_tokens: 0,
  },
})

Ollama

JSON.stringify({
  model,
  prompt: input,
})

@bmaltais

bmaltais commented Jul 3, 2024

[quoted @av's config and proxy/API-shape notes from above]

Sorry for what might be obvious... but how do you run this proxy? When I run ollama serve it only listens on the default port and not on 11435.

What do you use to run this proxy?

@av

av commented Jul 3, 2024

@bmaltais, no worries!

11435 is a proxy server written in JS/Node specifically to map requests/responses between the OAI and Ollama formats. I didn't list the whole code as it's pretty much straight from the Node docs.

@bmaltais

bmaltais commented Jul 3, 2024

[quoted @av's reply from above]

This is what I was afraid of ;-) I guess I will wait for someone to build something. I don't understand enough about Node.js to build this.

@vamshi-rvk

[quoted @av's config and proxy/API-shape notes from above]

Can you please explain how you did this for the embeddings API?

@SpaceLearner

SpaceLearner commented Jul 6, 2024

It works with Ollama embeddings by changing the file /opt/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/llm/openai/openai_embeddings_llm.py to:

from typing_extensions import Unpack

from graphrag.llm.base import BaseLLM
from graphrag.llm.types import (
    EmbeddingInput,
    EmbeddingOutput,
    LLMInput,
)

from .openai_configuration import OpenAIConfiguration
from .types import OpenAIClientTypes

import ollama


class OpenAIEmbeddingsLLM(BaseLLM[EmbeddingInput, EmbeddingOutput]):
    _client: OpenAIClientTypes
    _configuration: OpenAIConfiguration

    def __init__(self, client: OpenAIClientTypes, configuration: OpenAIConfiguration):
        self.client = client
        self.configuration = configuration

    async def _execute_llm(
        self, input: EmbeddingInput, **kwargs: Unpack[LLMInput]
    ) -> EmbeddingOutput | None:
        # args is kept for reference but unused in this hack
        args = {
            "model": self.configuration.model,
            **(kwargs.get("model_parameters") or {}),
        }
        # Original OpenAI call (self.client.embeddings.create) replaced with direct Ollama embeddings
        embedding_list = []
        for inp in input:
            embedding = ollama.embeddings(model="nomic-embed-text", prompt=inp)
            embedding_list.append(embedding["embedding"])
        return embedding_list

@vamshi-rvk

vamshi-rvk commented Jul 6, 2024

[quoted @SpaceLearner's modified openai_embeddings_llm.py from above]

Can you please provide the complete /opt/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/llm/openai/openai_embeddings_llm.py replacement code, and also the settings file?

@bmaltais

bmaltais commented Jul 6, 2024

@SpaceLearner Does it work when you try to query? I adapted your code to work with langchain; it creates the embeddings... but when I try to do a local query I get an error.

This is my embeddings version:

# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""The EmbeddingsLLM class."""

from typing_extensions import Unpack

from graphrag.llm.base import BaseLLM
from graphrag.llm.types import (
    EmbeddingInput,
    EmbeddingOutput,
    LLMInput,
)

from .openai_configuration import OpenAIConfiguration
from .types import OpenAIClientTypes

from langchain_community.embeddings import OllamaEmbeddings


class OpenAIEmbeddingsLLM(BaseLLM[EmbeddingInput, EmbeddingOutput]):
    """A text-embedding generator LLM."""

    _client: OpenAIClientTypes
    _configuration: OpenAIConfiguration

    def __init__(self, client: OpenAIClientTypes, configuration: OpenAIConfiguration):
        self.client = client
        self.configuration = configuration

    async def _execute_llm(
        self, input: EmbeddingInput, **kwargs: Unpack[LLMInput]
    ) -> EmbeddingOutput | None:
        args = {
            "model": self.configuration.model,
            **(kwargs.get("model_parameters") or {}),
        }
        # embedding = await self.client.embeddings.create(
        #     input=input,
        #     **args,
        # )
        # return [d.embedding for d in embedding.data]
    
        ollama_emb = OllamaEmbeddings(**args)
        embedding_list = []
        for inp in input:
            embedding = ollama_emb.embed_documents([inp])
            # embedding = ollama.embeddings(model="nomic-embed-text", prompt=inp)
            embedding_list.append(embedding[0])
        return embedding_list

This is the error:

Error embedding chunk {'OpenAIEmbedding': "'NoneType' object is not iterable"}
Traceback (most recent call last):
  File "C:\Users\berna\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\berna\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "H:\llm_stuff\graphrag\venv\lib\site-packages\graphrag\query\__main__.py", line 75, in <module>
    run_local_search(
  File "H:\llm_stuff\graphrag\venv\lib\site-packages\graphrag\query\cli.py", line 154, in run_local_search
    result = search_engine.search(query=query)
  File "H:\llm_stuff\graphrag\venv\lib\site-packages\graphrag\query\structured_search\local_search\search.py", line 118, in search
    context_text, context_records = self.context_builder.build_context(
  File "H:\llm_stuff\graphrag\venv\lib\site-packages\graphrag\query\structured_search\local_search\mixed_context.py", line 139, in build_context
    selected_entities = map_query_to_entities(
  File "H:\llm_stuff\graphrag\venv\lib\site-packages\graphrag\query\context_builder\entity_extraction.py", line 55, in map_query_to_entities
    search_results = text_embedding_vectorstore.similarity_search_by_text(
  File "H:\llm_stuff\graphrag\venv\lib\site-packages\graphrag\vector_stores\lancedb.py", line 118, in similarity_search_by_text
    query_embedding = text_embedder(text)
  File "H:\llm_stuff\graphrag\venv\lib\site-packages\graphrag\query\context_builder\entity_extraction.py", line 57, in <lambda>
    text_embedder=lambda t: text_embedder.embed(t),
  File "H:\llm_stuff\graphrag\venv\lib\site-packages\graphrag\query\llm\oai\embedding.py", line 96, in embed
    chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
  File "H:\llm_stuff\graphrag\venv\lib\site-packages\numpy\lib\function_base.py", line 550, in average
    raise ZeroDivisionError(
ZeroDivisionError: Weights sum to zero, can't be normalized

I suspect the query embeddings code also needs to be modified...

@xiaoquisme

xiaoquisme commented Jul 7, 2024

[quoted @bmaltais's langchain embeddings version and error traceback from above]

Hack the file C:\Users\user-name\miniconda3\Lib\site-packages\graphrag\query\llm\oai\embedding.py with the following contents (tip: this only fixes --method local; --method global still errors 😅):

# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""OpenAI Embedding model implementation."""

import asyncio
from collections.abc import Callable
from typing import Any

import numpy as np
import tiktoken
from tenacity import (
    AsyncRetrying,
    RetryError,
    Retrying,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential_jitter,
)

from graphrag.query.llm.base import BaseTextEmbedding
from graphrag.query.llm.oai.base import OpenAILLMImpl
from graphrag.query.llm.oai.typing import (
    OPENAI_RETRY_ERROR_TYPES,
    OpenaiApiType,
)
from graphrag.query.llm.text_utils import chunk_text
from graphrag.query.progress import StatusReporter

from langchain_community.embeddings import OllamaEmbeddings



class OpenAIEmbedding(BaseTextEmbedding, OpenAILLMImpl):
    """Wrapper for OpenAI Embedding models."""

    def __init__(
        self,
        api_key: str | None = None,
        azure_ad_token_provider: Callable | None = None,
        model: str = "text-embedding-3-small",
        deployment_name: str | None = None,
        api_base: str | None = None,
        api_version: str | None = None,
        api_type: OpenaiApiType = OpenaiApiType.OpenAI,
        organization: str | None = None,
        encoding_name: str = "cl100k_base",
        max_tokens: int = 8191,
        max_retries: int = 10,
        request_timeout: float = 180.0,
        retry_error_types: tuple[type[BaseException]] = OPENAI_RETRY_ERROR_TYPES,  # type: ignore
        reporter: StatusReporter | None = None,
    ):
        OpenAILLMImpl.__init__(
            self=self,
            api_key=api_key,
            azure_ad_token_provider=azure_ad_token_provider,
            deployment_name=deployment_name,
            api_base=api_base,
            api_version=api_version,
            api_type=api_type,  # type: ignore
            organization=organization,
            max_retries=max_retries,
            request_timeout=request_timeout,
            reporter=reporter,
        )

        self.model = model
        self.encoding_name = encoding_name
        self.max_tokens = max_tokens
        self.token_encoder = tiktoken.get_encoding(self.encoding_name)
        self.retry_error_types = retry_error_types

    def embed(self, text: str, **kwargs: Any) -> list[float]:
        """
        Embed text using OpenAI Embedding's sync function.

        For text longer than max_tokens, chunk texts into max_tokens, embed each chunk, then combine using weighted average.
        Please refer to: https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb
        """
        token_chunks = chunk_text(
            text=text, token_encoder=self.token_encoder, max_tokens=self.max_tokens
        )
        chunk_embeddings = []
        chunk_lens = []
        for chunk in token_chunks:
            try:
                embedding, chunk_len = self._embed_with_retry(chunk, **kwargs)
                chunk_embeddings.append(embedding)
                chunk_lens.append(chunk_len)
            # TODO: catch a more specific exception
            except Exception as e:  # noqa BLE001
                self._reporter.error(
                    message="Error embedding chunk",
                    details={self.__class__.__name__: str(e)},
                )

                continue
        chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
        chunk_embeddings = chunk_embeddings / np.linalg.norm(chunk_embeddings)
        return chunk_embeddings.tolist()

    async def aembed(self, text: str, **kwargs: Any) -> list[float]:
        """
        Embed text using OpenAI Embedding's async function.

        For text longer than max_tokens, chunk texts into max_tokens, embed each chunk, then combine using weighted average.
        """
        token_chunks = chunk_text(
            text=text, token_encoder=self.token_encoder, max_tokens=self.max_tokens
        )
        chunk_embeddings = []
        chunk_lens = []
        embedding_results = await asyncio.gather(*[
            self._aembed_with_retry(chunk, **kwargs) for chunk in token_chunks
        ])
        embedding_results = [result for result in embedding_results if result[0]]
        chunk_embeddings = [result[0] for result in embedding_results]
        chunk_lens = [result[1] for result in embedding_results]
        chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=chunk_lens)  # type: ignore
        chunk_embeddings = chunk_embeddings / np.linalg.norm(chunk_embeddings)
        return chunk_embeddings.tolist()

    def _embed_with_retry(
        self, text: str | tuple, **kwargs: Any
    ) -> tuple[list[float], int]:
        try:
            retryer = Retrying(
                stop=stop_after_attempt(self.max_retries),
                wait=wait_exponential_jitter(max=10),
                reraise=True,
                retry=retry_if_exception_type(self.retry_error_types),
            )
            for attempt in retryer:
                with attempt:
                    embedding = (
                        OllamaEmbeddings(
                            model=self.model,
                        ).embed_query(text)
                        or []
                    )
                    return (embedding, len(text))
        except RetryError as e:
            self._reporter.error(
                message="Error at embed_with_retry()",
                details={self.__class__.__name__: str(e)},
            )
            return ([], 0)
        else:
            # TODO: why not just throw in this case?
            return ([], 0)

    async def _aembed_with_retry(
        self, text: str | tuple, **kwargs: Any
    ) -> tuple[list[float], int]:
        try:
            retryer = AsyncRetrying(
                stop=stop_after_attempt(self.max_retries),
                wait=wait_exponential_jitter(max=10),
                reraise=True,
                retry=retry_if_exception_type(self.retry_error_types),
            )
            async for attempt in retryer:
                with attempt:
                    embedding = (
                        await OllamaEmbeddings(
                            model=self.model,
                        ).aembed_query(text)
                        or []
                    )
                    return (embedding, len(text))
        except RetryError as e:
            self._reporter.error(
                message="Error at embed_with_retry()",
                details={self.__class__.__name__: str(e)},
            )
            return ([], 0)
        else:
            # TODO: why not just throw in this case?
            return ([], 0)

@mavershang

mavershang commented Jul 7, 2024

It seems I have it working now. It returns nothing if I set the LLM to llama3, but works OK when switching to mistral.
Are text and csv the only supported formats? Does it support PDF?

@gdhua

gdhua commented Jul 8, 2024

To change the OpenAI request format to the one supported by Ollama, the settings only need the base_url parameter, for example api_base: http://localhost:8000/v1

from http.server import BaseHTTPRequestHandler, HTTPServer
import json
from socketserver import ThreadingMixIn
from urllib.parse import urlparse, parse_qs
from queue import Queue
import requests
import argparse
from ascii_colors import ASCIIColors

# Directly defining server configurations
servers = [
    ("server1", {'url': 'http://localhost:11434', 'queue': Queue()}),
    # Add more servers if needed
]

# Define the Ollama model to use
ollama_model = 'qwen2:7b'


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--port', type=int, default=8000, help='Port number for the server')
    args = parser.parse_args()
    ASCIIColors.red("Ollama Proxy server")

    class RequestHandler(BaseHTTPRequestHandler):
        def _send_response(self, response):
            self.send_response(response.status_code)
            for key, value in response.headers.items():
                if key.lower() not in ['content-length', 'transfer-encoding', 'content-encoding']:
                    self.send_header(key, value)
            self.send_header('Transfer-Encoding', 'chunked')
            self.end_headers()

            try:
                for chunk in response.iter_content(chunk_size=1024):
                    if chunk:
                        self.wfile.write(b"%X\r\n%s\r\n" % (len(chunk), chunk))
                        self.wfile.flush()
                self.wfile.write(b"0\r\n\r\n")
            except BrokenPipeError:
                pass

        def do_GET(self):
            self.log_request()
            self.proxy()

        def do_POST(self):
            self.log_request()
            self.proxy()

        def proxy(self):
            url = urlparse(self.path)
            path = url.path
            get_params = parse_qs(url.query) or {}

            post_data = b""
            if self.command == "POST":
                content_length = int(self.headers['Content-Length'])
                post_data = self.rfile.read(content_length)
                post_data_str = post_data.decode('utf-8')
                try:
                    post_params = json.loads(post_data_str)
                except json.JSONDecodeError:
                    post_params = {}

                # Force the configured Ollama model regardless of what the client sent.
                post_params['model'] = ollama_model
                post_params = json.dumps(post_params).encode('utf-8')
            else:
                post_params = {}

            # Find the server with the lowest number of queue entries.
            min_queued_server = servers[0]
            for server in servers:
                cs = server[1]
                if cs['queue'].qsize() < min_queued_server[1]['queue'].qsize():
                    min_queued_server = server

            if path == '/api/generate' or path == '/api/chat':
                que = min_queued_server[1]['queue']
                que.put_nowait(1)
                try:
                    post_data_dict = {}

                    if isinstance(post_data, bytes):
                        post_data_str = post_data.decode('utf-8')
                        post_data_dict = json.loads(post_data_str)

                    response = requests.request(self.command, min_queued_server[1]['url'] + path, params=get_params,
                                                data=post_params, stream=post_data_dict.get("stream", False))
                    self._send_response(response)
                except Exception:
                    pass
                finally:
                    que.get_nowait()
            else:
                # For other endpoints, just mirror the request.
                response = requests.request(self.command, min_queued_server[1]['url'] + path, params=get_params,
                                            data=post_params)
                self._send_response(response)

    class ThreadedHTTPServer(ThreadingMixIn, HTTPServer):
        pass

    print('Starting server')
    server = ThreadedHTTPServer(('', args.port), RequestHandler)  # Set the entry port here.
    print(f'Running server on port {args.port}')
    server.serve_forever()


if __name__ == "__main__":
    main()
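To try it, save the script above under any name (ollama_proxy.py here is a hypothetical filename), run it, and point GraphRAG's api_base at the proxy port:

python ollama_proxy.py --port 8000

# then, in settings.yaml / .env:
# api_base: http://localhost:8000/v1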

@av

av commented Jul 8, 2024

@gdhua, your prompt-fu failed you: this proxy server doesn't translate the embeddings API between the OAI and Ollama formats.

@bmaltais, here's the final version of the proxy I ended up using. There was another issue: GraphRAG sends raw token IDs to the embeddings API rather than raw, non-tokenised text.

Proxy server for OpenAI <-> Ollama embeddings

import os
import sys
import json
import logging

import asyncio
from aiohttp import web
import aiohttp
import tiktoken

logging.basicConfig(stream=sys.stdout, level=logging.INFO)

config = {
    "proxy_port": int(os.environ.get("PROXY_PORT", 11435)),
    "api_url": os.environ.get("OLLAMA_ENDPOINT"),
    "tiktoken_encoding": "cl100k_base"
}

encoding = tiktoken.get_encoding(config["tiktoken_encoding"])

async def handle_embeddings(request):
    try:
        body = await request.json()
        model = body["model"]
        input_data = body["input"]

        print(f"/v1/embeddings handler {str(input_data)[:100]}")

        if isinstance(input_data, str):
            input_data = [input_data]

        results = await asyncio.gather(*[fetch_embeddings(model, i) for i in input_data])

        response_data = {
            "object": "list",
            "data": [
                {
                    "object": "embedding",
                    "index": i,
                    "embedding": r["embedding"]
                } for i, r in enumerate(results)
            ],
            "model": model,
            "usage": {
                "prompt_tokens": 0,
                "total_tokens": 0
            }
        }

        return web.json_response(response_data)

    except Exception as e:
        print(f"Error: {str(e)}")
        return web.Response(status=500)

async def fetch_embeddings(model, input_text):
    if isinstance(input_text, int):
        input_text = encoding.decode([input_text])

    # If it's a list of ints, decode the token IDs back to text with tiktoken
    if isinstance(input_text, list):
        input_text = encoding.decode(input_text)

    if not isinstance(input_text, str):
        raise ValueError(f"Input is not a string: {input_text}")

    async with aiohttp.ClientSession() as session:
        async with session.post(
            f"{config['api_url']}/api/embeddings",
            headers={"Content-Type": "application/json"},
            json={"model": model, "prompt": input_text}
        ) as response:
            text = await response.text()
            json_data = json.loads(text)

    print(f"Embeddings: {input_text[:50]}... -> {text[:50]}...")
    return json_data

def main():
    print('Starting embeddings proxy...')

    if not config["api_url"]:
        raise ValueError("OLLAMA_ENDPOINT environment variable is required")

    app = web.Application()
    app.router.add_post("/v1/embeddings", handle_embeddings)

    web.run_app(app, port=config["proxy_port"], host="0.0.0.0")

if __name__ == "__main__":
    main()

A few caveats:

  • It seems that Ollama's embeddings are not working as expected in general, at least for the smaller models. I only had some luck with gemma2's own embeddings (which are of course an order of magnitude slower).
  • When running RAG, be acutely aware of the differences between Global and Local search, as Global search will fail some basic queries you'd think RAG should handle.

@zeyunie-vecml

zeyunie-vecml commented Jul 9, 2024

@xiaoquisme, errors when using --method global occur in my situation as well. My observation is that llama3's responses are not well aligned: even though the system prompt requires it to answer in JSON, it includes filler sentences at the beginning/end of its response. A fix could be to add this as the first line of the function, at line 233 of .../site-packages/graphrag/query/structured_search/global_search/search.py:
search_response = search_response[max(0, search_response.find("{")):min(len(search_response), search_response.rfind("}") + 1)]
which removes the filler sentences most of the time.
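As a standalone helper, the same trimming might look like this (a sketch; the function name is mine):

def strip_to_json(search_response: str) -> str:
    """Trim filler text before the first '{' and after the last '}'."""
    start = search_response.find("{")
    end = search_response.rfind("}")
    if start == -1 or end == -1:
        return search_response  # no JSON object found; leave the response unchanged
    return search_response[start:end + 1]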

However, a disclaimer: my llama3 sometimes forgets to answer in JSON at all (where GPT rarely does) for queries like "Can you give me a joke for people who read about this". I think this may only be fixed by improving the prompts or using a more "obedient" model.


@AlonsoGuevara
Contributor

I'm making this thread our official discussion place for Ollama setup and troubleshooting.
Thanks for the engagement and support, what an amazing community!

@s106916
Contributor

s106916 commented Jul 13, 2024

This is a temporary, hacked-together solution for Ollama:
https://github.com/s106916/graphrag

@MarkJGx

MarkJGx commented Jul 14, 2024

#339 (comment)

@homermeng

[quoted @xiaoquisme's embedding.py hack from above]

Thanks. For anyone who doesn't use langchain and just wants to use Ollama's embedding model, you can make these changes and it will work for global query answering (a sketch of the resulting helper follows below):

  1. Change "from langchain_community.embeddings import OllamaEmbeddings" to "import ollama".
  2. In the "_embed_with_retry" function, change the code block "embedding = (OllamaEmbeddings(model=self.model,).embed_query(text) or [])" to "embedding = (ollama.embeddings(model="nomic-embed-text", prompt=text) or [])".
  3. In the "_aembed_with_retry" function, change the code block "embedding = (await OllamaEmbeddings(model=self.model,).aembed_query(text) or [])" to "embedding = (ollama.embeddings(model="nomic-embed-text", prompt=text) or [])".

And yes, when doing a local query there will still be an error concerning another function in this same .py file.
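For reference, under the changes above, the sync retry helper inside OpenAIEmbedding might end up looking roughly like this (a sketch, not the official code; it assumes the ollama Python package is installed and nomic-embed-text is pulled, and extracting result["embedding"] rather than passing the whole response dict along is my adjustment):

# at module level, replacing the langchain import:
import ollama

# inside the OpenAIEmbedding class:
    def _embed_with_retry(
        self, text: str | tuple, **kwargs: Any
    ) -> tuple[list[float], int]:
        try:
            retryer = Retrying(
                stop=stop_after_attempt(self.max_retries),
                wait=wait_exponential_jitter(max=10),
                reraise=True,
                retry=retry_if_exception_type(self.retry_error_types),
            )
            for attempt in retryer:
                with attempt:
                    # Call Ollama directly instead of the OpenAI client.
                    result = ollama.embeddings(model="nomic-embed-text", prompt=text)
                    embedding = result.get("embedding") or []
                    return (embedding, len(text))
        except RetryError as e:
            self._reporter.error(
                message="Error at embed_with_retry()",
                details={self.__class__.__name__: str(e)},
            )
            return ([], 0)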
