
Pipeline run order wrong #7985

Closed
ju-gu opened this issue Jul 5, 2024 · 3 comments · Fixed by #8021
Labels
2.x Related to Haystack v2.0 P1 High priority, add to the next sprint type:bug Something isn't working

Comments

@ju-gu
Member

ju-gu commented Jul 5, 2024

Describe the bug
When running a more complex pipeline, the run order breaks: the first node is not identified correctly, and the "documents" and query inputs are set to empty strings. Nodes are then executed multiple times, overwriting these wrong intermediary outputs during run time.

The point of failure is the _component_has_enough_inputs_to_run method in pipeline.py: the expected inputs for prompt_builder1 are question, template and template_variables, but the provided input parameters are just question, so the function returns False. A different component is then executed first with "default" values, which are all None / empty strings. However, the template is already passed to the prompt builder at instantiation, and the template_variables only contain the question passed in the run method, so there should be no mismatch between expected and provided parameters.
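For illustration, a minimal sketch of the kind of check described above (simplified, not the actual Haystack source; the function and argument names are placeholders): a component is only considered runnable once every expected input has received a value, which is why prompt_builder1 fails the check even though its template was already set at instantiation.

def has_enough_inputs_to_run(expected_inputs: set, received_inputs: dict) -> bool:
    # Runnable only once every expected input name has a received value.
    return all(name in received_inputs for name in expected_inputs)

# prompt_builder1 expects question, template and template_variables,
# but Pipeline.run only provides question, so the check returns False.
print(has_enough_inputs_to_run(
    {"question", "template", "template_variables"},
    {"question": "Wha i Acromegaly?"},
))  # False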

Passing template and template_variables in the run method resolves this issue (it shouldn't be needed, though).
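A sketch of that run-time workaround, using the names from the reproduction script below (the template_variables value is my assumption, derived from prompt_template1):

result = pipeline.run({
    "prompt_builder1": {
        "question": question,
        # Passing these explicitly makes all expected inputs available,
        # even though PromptBuilder already received the template at init time.
        "template": prompt_template1,
        "template_variables": {"question": question},
    }
})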

Output of the sample pipeline (nodes are executed multiple times, starting with the second LLM):

[screenshot: debug log of the pipeline run showing the wrong execution order]

To Reproduce

Run this pipeline and check the execution order

from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.components.rankers import TransformersSimilarityRanker
from haystack.components.generators import OpenAIGenerator
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack import Pipeline
from dotenv import load_dotenv
import os
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
import logging
from haystack.utils import Secret


logging.basicConfig()
logging.getLogger("haystack.core.pipeline.pipeline").setLevel(logging.DEBUG)

doc_store = InMemoryDocumentStore()
path = "../data/test_folder/"
pathlist = [path+x for x in os.listdir(path)]
converter = TextFileToDocument()

print(f"Documents: {doc_store.count_documents()}")

load_dotenv("ENV_PATH")
openai_api_key = Secret.from_env_var("OPENAI_API_KEY")

prompt_template1 = """
You are a spellchecking system. Check the given query and fill in the corrected query.

Question: {{question}}
Corrected question: 
"""
prompt_template2 = """
According to these documents:

{% for doc in documents %}
  {{ doc.content }}
{% endfor %}

Answer the given question: {{question}}
Answer:
"""

prompt_template3 = """
{% for ans in replies %}
  {{ ans }}
{% endfor %}
"""

prompt_builder1 = PromptBuilder(template=prompt_template1)
prompt_builder2 = PromptBuilder(template=prompt_template2)
prompt_builder3 = PromptBuilder(template=prompt_template3)

llm1 = OpenAIGenerator(api_key=openai_api_key)
llm2 = OpenAIGenerator(api_key=openai_api_key)

ranker = TransformersSimilarityRanker(top_k=5)
retriever = InMemoryEmbeddingRetriever(document_store=doc_store)
embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
splitter = DocumentSplitter(split_by="word", split_length=200, split_overlap=10)
writer = DocumentWriter(document_store=doc_store)

indexing_p = Pipeline()
indexing_p.add_component(name="converter", instance=converter)
indexing_p.add_component(name="splitter", instance=splitter)
indexing_p.add_component(name="DocEmbedder", instance=doc_embedder)
indexing_p.add_component(name="writer", instance=writer)

indexing_p.connect("converter.documents", "splitter")
indexing_p.connect("splitter.documents", "DocEmbedder.documents")
indexing_p.connect("DocEmbedder.documents", "writer.documents")

indexing_p.run({"converter": {"sources": pathlist}})


print(f"Documents: {doc_store.count_documents()}")

pipeline = Pipeline()
pipeline.add_component(name="TextEmbedder", instance=embedder)
pipeline.add_component(name="retriever", instance=retriever)
pipeline.add_component(name="ranker", instance=ranker)
pipeline.add_component(name="prompt_builder2", instance=prompt_builder2)
pipeline.add_component(name="prompt_builder1", instance=prompt_builder1)
pipeline.add_component(name="prompt_builder3", instance=prompt_builder3)
pipeline.add_component(name="llm", instance=llm1)
pipeline.add_component(name="spellchecker", instance=llm2)


pipeline.connect("prompt_builder1", "spellchecker")
pipeline.connect("spellchecker.replies", "prompt_builder3")
pipeline.connect("prompt_builder3", "TextEmbedder.text")
pipeline.connect("prompt_builder3", "ranker.query")
pipeline.connect("TextEmbedder", "retriever.query_embedding")
pipeline.connect("retriever", "ranker")
pipeline.connect("ranker", "prompt_builder2.documents")
pipeline.connect("prompt_builder3", "prompt_builder2.question")
pipeline.connect("prompt_builder2", "llm")

question = "Wha i Acromegaly?"
result = pipeline.run({
    "prompt_builder1": {"question": question}})
# print(result)

test_data.zip


@ju-gu ju-gu added type:bug Something isn't working 2.x Related to Haystack v2.0 labels Jul 5, 2024
@julian-risch julian-risch added the P1 High priority, add to the next sprint label Jul 5, 2024
@silvanocerza
Contributor

I briefly investigated by bisecting. The last commit where this Pipeline works is badb05b; the bug seems to be introduced by the commit right after it, 83d3970.

Seems like the changes to PromptBuilder in #7655 surfaced this bug.

I'm still not sure what the actual cause is and will keep investigating.

@silvanocerza
Contributor

A temporary workaround is adding required_variables to the PromptBuilders as done below; this makes the Pipeline run as expected.

prompt_builder2 = PromptBuilder(template=prompt_template2, required_variables=["documents", "question"])
prompt_builder3 = PromptBuilder(template=prompt_template3, required_variables=["replies"])

Another solution could be changing the order in which the PromptBuilders are added to the Pipeline:

pipeline.add_component(name="prompt_builder1", instance=prompt_builder1)
pipeline.add_component(name="prompt_builder3", instance=prompt_builder3)
pipeline.add_component(name="prompt_builder2", instance=prompt_builder2)

This problem is caused by a combination of things: the way we decide which Component to run next, the fact that the order in which Components are added influences the run order, and how we treat Components whose inputs all have defaults.

Ideally the fix would change how we decide which Component to run next so that it is independent from the other two factors, and also doesn't break existing use cases.

Not sure how easy that will be. 😕

@wochinge
Contributor

Do we have a follow up issue for fixing complex / looping pipelines?
