Describe the bug
In a more complex pipeline, the run order breaks: the pipeline fails to identify the first node and instead sets the `documents` and `query` inputs to empty strings. Nodes are then executed multiple times, overwriting these wrong intermediary outputs again at run time.
The point of failure is the `_component_has_enough_inputs_to_run` method in `pipeline.py`: the expected inputs for `prompt_builder1` are `question`, `template`, and `template_variables`, but the only input provided is `question`, so the method returns `False`. A different component is then executed with "default" values, which are all `None` or empty strings. However, the template is already parsed when the prompt builder is instantiated, and `template_variables` only needs the `question` passed to the run method, so there should be no mismatch between expected and provided inputs.
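Roughly, the failing check behaves like the following sketch (simplified for illustration; this is not the actual `pipeline.py` code, and the socket handling is paraphrased):

```python
def _component_has_enough_inputs_to_run(name: str, pipeline_inputs: dict) -> bool:
    # Simplified sketch of the behavior described above: the check demands a
    # value for *every* declared input socket, including sockets that already
    # have usable values (template is set at init time, template_variables
    # only needs the run-time question).
    expected_sockets = {"question", "template", "template_variables"}  # prompt_builder1
    received = set(pipeline_inputs.get(name, {}))                      # {"question"}
    return expected_sockets <= received  # False -> prompt_builder1 never starts
```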
Passing `template` and `template_variables` in the run method resolves the issue (though this shouldn't be necessary).
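For reference, a minimal sketch of that workaround, using the names from the reproduction script below:

```python
# Workaround: pass template and template_variables explicitly at run time,
# even though the template was already supplied to PromptBuilder at init.
result = pipeline.run({
    "prompt_builder1": {
        "question": question,
        "template": prompt_template1,
        "template_variables": {"question": question},
    }
})
```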
Output of the sample pipeline shows nodes being executed multiple times, starting with the second LLM.
To Reproduce
Run this pipeline and check the execution order (test data in the attached test_data.zip):
```python
import logging
import os

from dotenv import load_dotenv

from haystack import Pipeline
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.converters import TextFileToDocument
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.components.generators import OpenAIGenerator
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.rankers import TransformersSimilarityRanker
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.utils import Secret

logging.basicConfig()
logging.getLogger("haystack.core.pipeline.pipeline").setLevel(logging.DEBUG)

# Document store, input files, and OpenAI credentials
doc_store = InMemoryDocumentStore()
path = "../data/test_folder/"
pathlist = [path+x for x in os.listdir(path)]
converter = TextFileToDocument()
print(f"Documents: {doc_store.count_documents()}")
load_dotenv("ENV_PATH")
openai_api_key = Secret.from_env_var("OPENAI_API_KEY")

# Prompt templates: spell-check, RAG answer, and reply formatting
prompt_template1 = """
You are a spellchecking system. Check the given query and fill in the corrected query.
Question: {{question}}
Corrected question:
"""
prompt_template2 = """
According to these documents:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}
Answer the given question: {{question}}
Answer:
"""
prompt_template3 = """
{% for ans in replies %}
{{ ans }}
{% endfor %}
"""

# Instantiate the components
prompt_builder1 = PromptBuilder(template=prompt_template1)
prompt_builder2 = PromptBuilder(template=prompt_template2)
prompt_builder3 = PromptBuilder(template=prompt_template3)
llm1 = OpenAIGenerator(api_key=openai_api_key)
llm2 = OpenAIGenerator(api_key=openai_api_key)
ranker = TransformersSimilarityRanker(top_k=5)
retriever = InMemoryEmbeddingRetriever(document_store=doc_store)
embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
splitter = DocumentSplitter(split_by="word", split_length=200, split_overlap=10)
writer = DocumentWriter(document_store=doc_store)

# Indexing pipeline: convert, split, embed, and store the documents
indexing_p = Pipeline()
indexing_p.add_component(name="converter", instance=converter)
indexing_p.add_component(name="splitter", instance=splitter)
indexing_p.add_component(name="DocEmbedder", instance=doc_embedder)
indexing_p.add_component(name="writer", instance=writer)
indexing_p.connect("converter.documents", "splitter")
indexing_p.connect("splitter.documents", "DocEmbedder.documents")
indexing_p.connect("DocEmbedder.documents", "writer.documents")
indexing_p.run({"converter": {"sources": pathlist}})
print(f"Documents: {doc_store.count_documents()}")

# Query pipeline: spell-check the question, then retrieve, rank, and answer
pipeline = Pipeline()
pipeline.add_component(name="TextEmbedder", instance=embedder)
pipeline.add_component(name="retriever", instance=retriever)
pipeline.add_component(name="ranker", instance=ranker)
pipeline.add_component(name="prompt_builder2", instance=prompt_builder2)
pipeline.add_component(name="prompt_builder1", instance=prompt_builder1)
pipeline.add_component(name="prompt_builder3", instance=prompt_builder3)
pipeline.add_component(name="llm", instance=llm1)
pipeline.add_component(name="spellchecker", instance=llm2)
pipeline.connect("prompt_builder1", "spellchecker")
pipeline.connect("spellchecker.replies", "prompt_builder3")
pipeline.connect("prompt_builder3", "TextEmbedder.text")
pipeline.connect("prompt_builder3", "ranker.query")
pipeline.connect("TextEmbedder", "retriever.query_embedding")
pipeline.connect("retriever", "ranker")
pipeline.connect("ranker", "prompt_builder2.documents")
pipeline.connect("prompt_builder3", "prompt_builder2.question")
pipeline.connect("prompt_builder2", "llm")

# Deliberately misspelled question; the spellchecker step should correct it
question = "Wha i Acromegaly?"
result = pipeline.run({"prompt_builder1": {"question": question}})
# print(result)
```
This problem is caused by a combination of factors: the way we decide which Component to run next, the fact that the order in which Components are added influences the run order, and how we treat Components whose inputs all have defaults.
Ideally the fix would change how we decide which Component to run next in a way that is independent of the other two factors, and that also doesn't break existing use cases.
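To make the interaction concrete, here is a toy model (not Haystack code; the names and data structures are invented for illustration) of how an insertion-ordered, all-inputs-required readiness check can skip the real entry point:

```python
from collections import OrderedDict

# Insertion order mirrors add_component order.
components = OrderedDict()
components["llm"] = {"expected": {"prompt"}, "received": set()}
components["prompt_builder1"] = {
    "expected": {"question", "template", "template_variables"},
    "received": {"question"},  # only question is passed to run()
}

def has_enough_inputs(comp: dict) -> bool:
    # Strict check: every expected input must be explicitly provided,
    # even inputs that already have usable defaults.
    return comp["expected"] <= comp["received"]

runnable = [name for name, comp in components.items() if has_enough_inputs(comp)]
print(runnable)  # [] -> nothing qualifies, so a scheduler that falls back to
                 # "default" (None/empty) inputs starts in insertion order,
                 # running a later component instead of prompt_builder1.
```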