
Pipeline run order wrong #7985

Closed
ju-gu opened this issue Jul 5, 2024 · 3 comments · Fixed by #8021
Labels
2.x Related to Haystack v2.0 P1 High priority, add to the next sprint type:bug Something isn't working

Comments

@ju-gu
Member

ju-gu commented Jul 5, 2024

Describe the bug
When running a more complex pipeline, the run order breaks: the first node is not identified correctly, and the "documents" and query inputs are set to empty strings. Nodes are then executed multiple times, overwriting these wrong intermediary outputs during run time.

The point of failure is the _component_has_enough_inputs_to_run method in pipeline.py: the expected inputs for prompt_builder1 are question, template and template_variables, but the provided input parameters are just question, so the function returns False. A different component is then executed first with "default" values, which are all None / empty strings. However, the template is already passed to the prompt builder at instantiation, and the template_variables only contain the question passed in the run method, so there should be no mismatch between expected and provided parameters.
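For illustration, a minimal sketch of the kind of check described above (simplified, not the actual Haystack source; the function and argument names are placeholders): a component is only considered runnable once every expected input has received a value, which is why prompt_builder1 fails the check even though its template was already set at instantiation.

def has_enough_inputs_to_run(expected_inputs: set, received_inputs: dict) -> bool:
    # Runnable only once every expected input name has a received value.
    return all(name in received_inputs for name in expected_inputs)

# prompt_builder1 expects question, template and template_variables,
# but Pipeline.run only provides question, so the check returns False.
print(has_enough_inputs_to_run(
    {"question", "template", "template_variables"},
    {"question": "Wha i Acromegaly?"},
))  # False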

Passing template and template_variables in the run method resolves this issue (it shouldn't be needed, though).
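A sketch of that run-time workaround, using the names from the reproduction script below (the template_variables value is my assumption, derived from prompt_template1):

result = pipeline.run({
    "prompt_builder1": {
        "question": question,
        # Passing these explicitly makes all expected inputs available,
        # even though PromptBuilder already received the template at init time.
        "template": prompt_template1,
        "template_variables": {"question": question},
    }
})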

Output of the sample pipeline (nodes are executed multiple times, starting with the second LLM):

[screenshot: debug log of the pipeline run showing the wrong execution order]

To Reproduce

Run this pipeline and check the execution order

from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.components.rankers import TransformersSimilarityRanker
from haystack.components.generators import OpenAIGenerator
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack import Pipeline
from dotenv import load_dotenv
import os
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
import logging
from haystack.utils import Secret


logging.basicConfig()
logging.getLogger("haystack.core.pipeline.pipeline").setLevel(logging.DEBUG)

doc_store = InMemoryDocumentStore()
path = "../data/test_folder/"
pathlist = [path+x for x in os.listdir(path)]
converter = TextFileToDocument()

print(f"Documents: {doc_store.count_documents()}")

load_dotenv("ENV_PATH")
openai_api_key = Secret.from_env_var("OPENAI_API_KEY")

prompt_template1 = """
You are a spellchecking system. Check the given query and fill in the corrected query.

Question: {{question}}
Corrected question: 
"""
prompt_template2 = """
According to these documents:

{% for doc in documents %}
  {{ doc.content }}
{% endfor %}

Answer the given question: {{question}}
Answer:
"""

prompt_template3 = """
{% for ans in replies %}
  {{ ans }}
{% endfor %}
"""

prompt_builder1 = PromptBuilder(template=prompt_template1)
prompt_builder2 = PromptBuilder(template=prompt_template2)
prompt_builder3 = PromptBuilder(template=prompt_template3)

llm1 = OpenAIGenerator(api_key=openai_api_key)
llm2 = OpenAIGenerator(api_key=openai_api_key)

ranker = TransformersSimilarityRanker(top_k=5)
retriever = InMemoryEmbeddingRetriever(document_store=doc_store)
embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
splitter = DocumentSplitter(split_by="word", split_length=200, split_overlap=10)
writer = DocumentWriter(document_store=doc_store)

indexing_p = Pipeline()
indexing_p.add_component(name="converter", instance=converter)
indexing_p.add_component(name="splitter", instance=splitter)
indexing_p.add_component(name="DocEmbedder", instance=doc_embedder)
indexing_p.add_component(name="writer", instance=writer)

indexing_p.connect("converter.documents", "splitter")
indexing_p.connect("splitter.documents", "DocEmbedder.documents")
indexing_p.connect("DocEmbedder.documents", "writer.documents")

indexing_p.run({"converter": {"sources": pathlist}})


print(f"Documents: {doc_store.count_documents()}")

pipeline = Pipeline()
pipeline.add_component(name="TextEmbedder", instance=embedder)
pipeline.add_component(name="retriever", instance=retriever)
pipeline.add_component(name="ranker", instance=ranker)
pipeline.add_component(name="prompt_builder2", instance=prompt_builder2)
pipeline.add_component(name="prompt_builder1", instance=prompt_builder1)
pipeline.add_component(name="prompt_builder3", instance=prompt_builder3)
pipeline.add_component(name="llm", instance=llm1)
pipeline.add_component(name="spellchecker", instance=llm2)


pipeline.connect("prompt_builder1", "spellchecker")
pipeline.connect("spellchecker.replies", "prompt_builder3")
pipeline.connect("prompt_builder3", "TextEmbedder.text")
pipeline.connect("prompt_builder3", "ranker.query")
pipeline.connect("TextEmbedder", "retriever.query_embedding")
pipeline.connect("retriever", "ranker")
pipeline.connect("ranker", "prompt_builder2.documents")
pipeline.connect("prompt_builder3", "prompt_builder2.question")
pipeline.connect("prompt_builder2", "llm")

question = "Wha i Acromegaly?"
result = pipeline.run({
    "prompt_builder1": {"question": question}})
# print(result)

test_data.zip


@ju-gu ju-gu added type:bug Something isn't working 2.x Related to Haystack v2.0 labels Jul 5, 2024
@julian-risch julian-risch added the P1 High priority, add to the next sprint label Jul 5, 2024
@silvanocerza
Contributor

I briefly investigated by bisecting. The last commit where this Pipeline works is badb05b; the bug seems to be introduced by the commit right after it, 83d3970.

Seems like the changes to PromptBuilder in #7655 surfaced this bug.

I'm still not sure what the actual cause is and will keep investigating.

@silvanocerza
Contributor

A temporary workaround is adding required_variables to the PromptBuilders as done below; this makes the Pipeline run as expected.

prompt_builder2 = PromptBuilder(template=prompt_template2, required_variables=["documents", "question"])
prompt_builder3 = PromptBuilder(template=prompt_template3, required_variables=["replies"])

Another solution could be changing the order in which the PromptBuilders are added to the Pipeline:

pipeline.add_component(name="prompt_builder1", instance=prompt_builder1)
pipeline.add_component(name="prompt_builder3", instance=prompt_builder3)
pipeline.add_component(name="prompt_builder2", instance=prompt_builder2)

This problem is caused by a combination of things: the way we decide which Component to run next, the fact that the order in which Components are added influences the run order, and how we treat Components whose inputs all have defaults.

Ideally the fix would change how we decide which Component to run next so that it is independent from the other two factors, and also doesn't break existing use cases.

Not sure how easy that will be. 😕

@wochinge
Contributor

Do we have a follow up issue for fixing complex / looping pipelines?
