-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add component SimilarDocumentsRetriever #7733
base: main
Are you sure you want to change the base?
Conversation
Pull Request Test Coverage Report for Build 9206339010Details
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should go in this direction. I see two red flags here:
- A component taking other components in its constructor suggests a design problem somewhere
- There's more code needed for serialisation than the actual
run
Let's take one step back and see what problems we're trying to solve:
- We want to use
document.content
as query - We want to pass more than one query to a retriever
The first problem can be solved with a tiny component taking a list of documents and returning a list of doc.content
. The second is a known issue that sometimes is wrongly referred to as "batching" - we attempted to solve this problem in a generic way but we never had the time to finish, maybe we should invest there?
Yes, it does feel off. Essentially, as you say, it comes about because of the "batching" problem. When trying to do it directly without this wrapper (Route 2 mentioned here) (after using an |
@masci On the path forward, as you say, looks like better to solve the "batching" problem first.
Where is it under discussion? just to know where we can follow and/or contribute. Can't seem to find the right keywords to search. |
Related Issues
Evolved from #5629 and #5666
Proposed Changes:
Addition of a new component
SimilarDocumentsRetriever
.This component retrieves similar documents for each of the given documents for each preset retrievers. So, in a way it's simply a wrapper around multiple retrievers to be run on multiple documents as queries.
Usage Example:
Background
It was conceptualized when considering the addition of
FileSimilarityRetriever
here. There, it's one of the components that come together to provide a file similarity functionality. Route 3 in the demonstrative Colab Notebook there.But this component by itself should possibly be useful for other use-cases too. E.g. finding similar sets of documents to the current output set from a DocSearch pipeline.
How did you test it?
Mostly unit tests, one integration test.
Notes for the reviewer
Open to feedback at all levels, including if there could be another way to have this functionality.
Some concrete big and small things I'm not sure of:
Should this be reformulated as a different component?
List[Document]
, also acceptList[str]
for added flexibility. And rename component to something likeBatchedRetriever
orGroupRetriever
orMultiRetriever
, although each name seems misleading in their own way.Related to the last point, I'm a bit unsure what's the policy for flexible input and output types right now? E.g. accepting
List[Document]
as well asList[str]
or output format (List[List[Document]]
vsList[List[List[Document]]]
or something else) based on a preset init argument.Minor: Unsure what the type hint for the init argument
retrievers
should be? As (afaik) there is no general Retriever interface. Right now it's justList
.Checklist
fix:
,feat:
,build:
,chore:
,ci:
,docs:
,style:
,refactor:
,perf:
,test:
.