feat: map_data support for Nomic Embed Vision #308

zanussbaum · 2024-06-13T22:11:17Z

🚀	This description was created by Ellipsis for commit `c8db85e`

Summary:

Enhanced Nomic Embed Vision with support for image data handling and updated relevant methods and files.

Key points:

Enhanced Nomic Embed Vision with support for image data handling.
Updated nomic/dataset.py to handle image uploads and set modality to image.
Refactored concurrent futures handling in _add_blobs method.
Updated create_index method in AtlasDataset to handle topic_label_field for image modality.
Updated examples/image/map_cifar10.py to use map_data with blobs.
Updated map_data function in nomic/atlas.py to accept blobs.
Updated convert_pyarrow_schema_for_atlas in nomic/data_inference.py to handle _blob_hash.
Bumped version in setup.py to 3.0.36.

Generated with ❤️ by ellipsis.dev

AndriyMulyar · 2024-06-16T21:53:17Z

@ellipsis-dev review this

ellipsis-dev

👍 Looks good to me! Reviewed everything up to ac9a55c in 1 minute and 52 seconds

More details

Looked at 291 lines of code in 4 files
Skipped 0 files when reviewing.
Skipped posting 3 drafted comments based on config settings.

1. nomic/atlas.py:61

Draft comment:
The map_data function now handles image blobs by setting the modality to 'image' and defaulting the embedding model to 'nomic-embed-vision-v1.5' if not specified. This is a significant change as it integrates image handling directly into the existing data mapping function, which previously focused on embeddings and text data. Ensure that the default embedding model and the handling of image data are thoroughly tested, especially since this modality change affects how data is processed and indexed.
Reason this comment was not posted:
Confidence of 0% on close inspection, compared to threshold of 50%.

2. nomic/atlas.py:63

Draft comment:
The introduction of '_blob_hash' as an indexed field for images is a critical change. Ensure that this new field is properly managed across the system, especially in functions that handle data indexing and retrieval. This change could impact how data is stored, retrieved, and indexed in the system.
Reason this comment was not posted:
Confidence of 0% on close inspection, compared to threshold of 50%.

3. nomic/atlas.py:133

Draft comment:
The handling of topic models for image data should be carefully reviewed to ensure that the 'community_description_target_field' is appropriately used to generate meaningful topic labels from image data. This is a new application of the topic modeling feature, and its effectiveness and accuracy in the context of image data should be validated.
Reason this comment was not posted:
Confidence of 0% on close inspection, compared to threshold of 50%.

Workflow ID: wflow_lx0rEhsESDyvCAax

You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

⌛ 6 days left in your free trial, upgrade for $20/seat/month or contact us.

ellipsis-dev

❌ Changes requested. Reviewed everything up to e707f94 in 1 minute and 9 seconds

More details

Looked at 312 lines of code in 4 files
Skipped 0 files when reviewing.
Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_mBvzWO0puHyHuefa

Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

⌛ 6 days left in your free trial, upgrade for $20/seat/month or contact us.

nomic/dataset.py

ellipsis-dev

👍 Looks good to me! Incremental review on 53bd566 in 3 minutes and 51 seconds

More details

Looked at 22 lines of code in 1 files
Skipped 0 files when reviewing.
Skipped posting 1 drafted comments based on config settings.

1. nomic/dataset.py:1494

Draft comment:
It's important to handle exceptions within the ThreadPoolExecutor in _add_blobs to ensure that all threads are properly managed and potential errors are gracefully handled. Consider adding a try-except block around the future.result() call to catch and log exceptions, ensuring that all resources are properly cleaned up.

            try:
                response = future.result()
            except Exception as e:
                logger.error(f"Failed to upload batch: {e}")
                continue

Reason this comment was not posted:
Confidence of 20% on close inspection, compared to threshold of 50%.

Workflow ID: wflow_bwUvy5f5hGNJK5oV

You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

⌛ 6 days left in your free trial, upgrade for $20/seat/month or contact us.

ellipsis-dev

👍 Looks good to me! Incremental review on ca3746f in 1 minute and 51 seconds

More details

Looked at 12 lines of code in 1 files
Skipped 0 files when reviewing.
Skipped posting 1 drafted comments based on config settings.

1. examples/image/map_cifar10.py:34

Draft comment:
It seems the map_data function call is missing the modality parameter which should be set to 'image' for correct handling of image data. Please add this parameter to ensure the data is processed correctly.

atlas.map_data(blobs=images,
               identifier='cifar-50k-image-upload-with-topic',
               data=datums,
               id_field='id',
               modality='image',  # Specify the modality as image
               topic_model={
                     "build_topic_model": True,
                     "community_description_target_field": "label"
                },
)

Reason this comment was not posted:
Based on historical feedback, this comment is too similar to comments previously marked by users as bad.

Workflow ID: wflow_9Td2T99khZS1gesN

You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

⌛ 3 days left in your free trial, upgrade for $20/seat/month or contact us.

ellipsis-dev

👍 Looks good to me! Incremental review on fa031db in 1 minute and 47 seconds

More details

Looked at 14 lines of code in 1 files
Skipped 0 files when reviewing.
Skipped posting 1 drafted comments based on config settings.

1. nomic/dataset.py:1376

Draft comment:
The conversion of IDs to strings is a crucial step for ensuring consistency in data handling. However, it's important to ensure that this conversion does not lead to any unintended consequences elsewhere in the system where IDs are expected in their original format. If there are specific requirements or constraints on the ID format in other parts of the system, consider documenting these requirements or implementing checks to ensure compatibility.
Reason this comment was not posted:
Confidence changes required: 33%
The code snippet provided from nomic/dataset.py includes a modification to the _add_blobs method where a check is added to ensure that the ids are strings. If they are not, they are converted to strings. This is a necessary step because UUIDs or other types of IDs might not be strings by default, and the system might expect string identifiers for consistency in handling data.

The conversion using a list comprehension is straightforward and should work as expected. However, it's important to ensure that this conversion does not lead to any unintended consequences elsewhere in the system where IDs are expected in their original format.

Workflow ID: wflow_HuQX1xVBe90Ah06e

You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

⌛ 2 days left in your free trial, upgrade for $20/seat/month or contact us.

ellipsis-dev

👍 Looks good to me! Incremental review on 2850c8a in 2 minutes and 29 seconds

More details

Looked at 38 lines of code in 2 files
Skipped 0 files when reviewing.
Skipped posting 1 drafted comments based on config settings.

1. nomic/dataset.py:1151

Draft comment:
The PR description mentions updates to the map_data function to handle blobs and set modality to image, but these changes are not visible in the diff. Please ensure that the necessary changes are included in the PR.
Reason this comment was not posted:
Confidence of 0% on close inspection, compared to threshold of 70%.

Workflow ID: wflow_nrrJNSBZIfy3muzi

You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

⌛ 2 days left in your free trial, upgrade for $20/seat/month or contact us.

ellipsis-dev

👍 Looks good to me! Incremental review on ca1b60f in 1 minute and 36 seconds

More details

Looked at 22 lines of code in 1 files
Skipped 0 files when reviewing.
Skipped posting 1 drafted comments based on config settings.

1. nomic/data_inference.py:88

Draft comment:
The PR description mentions updates to handle _blob_hash in convert_pyarrow_schema_for_atlas, but these changes are not visible in the diff. Please ensure that the intended changes are correctly implemented and included in the PR.
Reason this comment was not posted:
Confidence of 0% on close inspection, compared to threshold of 70%.

Workflow ID: wflow_WF3S6McrLoTDG4q3

You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

⌛ 2 days left in your free trial, upgrade for $20/seat/month or contact us.

ellipsis-dev

👍 Looks good to me! Incremental review on da0411f in 2 minutes and 10 seconds

More details

Looked at 39 lines of code in 1 files
Skipped 0 files when reviewing.
Skipped posting 1 drafted comments based on config settings.

1. nomic/dataset.py:1430

Draft comment:
The schema for the hash table is hardcoded with the field names as 'id' and '_blob_hash'. It's better to use the dynamic field name from 'self.id_field' to maintain consistency and avoid potential errors if the id field is named differently.

        hash_schema = pa.schema([(self.id_field, pa.string()), ("_blob_hash", pa.string())])

Reason this comment was not posted:
Confidence of 0% on close inspection, compared to threshold of 70%.

Workflow ID: wflow_89sI8rQv8n81o3n8

You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

⌛ 9 days left in your free trial, upgrade for $20/seat/month or contact us.

* fix topic label field alias bug * version bump

ellipsis-dev

👍 Looks good to me! Incremental review on 3b0715d in 1 minute and 9 seconds

More details

Looked at 27 lines of code in 2 files
Skipped 0 files when reviewing.
Skipped posting 1 drafted comments based on config settings.

1. nomic/data_inference.py:91

Draft comment:
The removal of validation_alias from topic_label_field in NomicTopicOptions is not explained in the PR description. Ensure this change does not affect other parts of the application where topic_label_field is used or expected to be validated in a specific way.
Reason this comment was not posted:
Confidence changes required: 66%
The PR description mentions that the topic_label_field in NomicTopicOptions class has been updated to remove the validation_alias parameter. This change is reflected in the code. However, the PR does not provide a rationale for this change. It's important to understand why this change was made, as it could affect how the field is used or validated elsewhere in the codebase. Since the PR description does not mention this, it's crucial to ensure that this change does not unintentionally break any functionality or expectations around the topic_label_field.

Workflow ID: wflow_STv01skh1QajWGhs

You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

⌛ 8 days left in your free trial, upgrade for $20/seat/month or contact us.

ellipsis-dev

👍 Looks good to me! Incremental review on 9978a8a in 47 seconds

More details

Looked at 13 lines of code in 1 files
Skipped 0 files when reviewing.
Skipped posting 1 drafted comments based on config settings.

1. setup.py:11

Draft comment:
The version bump to 3.0.35 is correctly implemented.
Reason this comment was not posted:
Confidence changes required: 0%
The version bump in the setup.py file is correctly updated from 3.0.34 to 3.0.35 as per the PR description. This is a standard practice to increment the version number for new features or significant updates to ensure proper package management and dependency resolution.

Workflow ID: wflow_1CnRjrOO6Hq3GcV2

You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

⌛ 8 days left in your free trial, upgrade for $20/seat/month or contact us.

ellipsis-dev

❌ Changes requested. Incremental review on 7501db3 in 1 minute and 36 seconds

More details

Looked at 18 lines of code in 1 files
Skipped 0 files when reviewing.
Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_hCRhMiUK2b3PDb5h

Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

⌛ 8 days left in your free trial, upgrade for $20/seat/month or contact us.

nomic/atlas.py

ellipsis-dev

👍 Looks good to me! Incremental review on 3e9e0f2 in 3 minutes and 6 seconds

More details

Looked at 17 lines of code in 1 files
Skipped 0 files when reviewing.
Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_Xlqsc6EaPTFt9mEG

You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

⌛ 8 days left in your free trial, upgrade for $20/seat/month or contact us.

ellipsis-dev

👍 Looks good to me! Incremental review on 5a134db in 2 minutes and 33 seconds

More details

Looked at 24 lines of code in 1 files
Skipped 0 files when reviewing.
Skipped posting 1 drafted comments based on config settings.

1. nomic/dataset.py:1374

Draft comment:
Consider adding more robust error handling in the send_request function within _add_blobs. Currently, it only checks for a status code of 200 and raises an exception otherwise. It would be beneficial to handle different types of HTTP errors and possibly implement a retry mechanism for recoverable errors.
Reason this comment was not posted:
Confidence of 0% on close inspection, compared to threshold of 70%.

Workflow ID: wflow_q3uzC1aU5cORAWmw

You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

⌛ 8 days left in your free trial, upgrade for $20/seat/month or contact us.

ellipsis-dev

👍 Looks good to me! Incremental review on 016b3e1 in 1 minute and 54 seconds

More details

Looked at 25 lines of code in 1 files
Skipped 0 files when reviewing.
Skipped posting 1 drafted comments based on config settings.

1. examples/image/map_cifar10.py:26

Draft comment:
Consider using slicing on the dataset to limit the number of iterations instead of using a break statement inside the loop. This would improve code readability and efficiency.
Reason this comment was not posted:
Confidence changes required: 66%
The code snippet provided from examples/image/map_cifar10.py shows the use of the map_data function from the atlas module, which is being updated in this PR to handle image data. The loop iterates over the dataset, appending images and metadata to lists, which are then passed to map_data. The map_data function is called with the blobs parameter to handle image uploads, and a topic model is specified with a label field.

The code is straightforward and seems to align with the PR's intent to enhance image data handling. However, the loop control using if idx >= max_embeddings: break is a bit unusual. It's more common and clearer to use slicing on the dataset directly to limit the number of iterations, which would be more efficient and readable.

Workflow ID: wflow_8DSDdGMKtIQ9IQv7

You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

⌛ 8 days left in your free trial, upgrade for $20/seat/month or contact us.

ellipsis-dev

👍 Looks good to me! Incremental review on e286831 in 1 minute and 6 seconds

More details

Looked at 13 lines of code in 1 files
Skipped 0 files when reviewing.
Skipped posting 1 drafted comments based on config settings.

1. nomic/dataset.py:1407

Draft comment:
Increasing the batch size from 20 to 40 in the _add_blobs method could potentially impact performance and error handling. Ensure that the server-side implementation can handle these larger batches efficiently and consider adjusting the number of workers or improving error handling mechanisms to maintain robustness.
Reason this comment was not posted:
Confidence changes required: 66%
The change in the PR increases the batch size from 20 to 40 in the _add_blobs method. This change could potentially impact the performance of the system, especially if the server or network cannot handle larger batch sizes efficiently. It's important to ensure that the server-side implementation can handle these larger batches without issues such as timeouts or memory constraints. Additionally, increasing the batch size without adjusting the number of workers or handling potential failures in batch processing might lead to less robust error handling and increased risk of data loss or corruption.

Workflow ID: wflow_s6SNIOCT3LGl6NgJ

You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

⌛ 8 days left in your free trial, upgrade for $20/seat/month or contact us.

ellipsis-dev

❌ Changes requested. Incremental review on 2b00dc3 in 1 minute and 21 seconds

More details

Looked at 42 lines of code in 1 files
Skipped 0 files when reviewing.
Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_zK8fFgHYyBc6MZqX

Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

nomic/dataset.py

zanussbaum added 2 commits June 13, 2024 17:44

feat: image upload

d019c39

feat: update cifar10 update

ac9a55c

ellipsis-dev bot reviewed Jun 16, 2024

View reviewed changes

zanussbaum added 2 commits June 17, 2024 10:26

style: black isort

8983981

fix: pyright errors

e707f94

zanussbaum marked this pull request as ready for review June 17, 2024 14:38

ellipsis-dev bot reviewed Jun 17, 2024

View reviewed changes

nomic/dataset.py Outdated Show resolved Hide resolved

zanussbaum added 2 commits June 17, 2024 10:51

fix: simplify

675642a

fix: black isort

53bd566

ellipsis-dev bot reviewed Jun 17, 2024

View reviewed changes

zanussbaum added 3 commits June 18, 2024 23:16

Merge branch 'main' into map_image

1e9b55a

style: isort

1af4d18

refactor: remove bytes upload

ca3746f

ellipsis-dev bot reviewed Jun 20, 2024

View reviewed changes

fix: id type casting to str when not str

fa031db

ellipsis-dev bot reviewed Jun 21, 2024

View reviewed changes

fix: communityh_description_target_field -> topic_label_field

2850c8a

ellipsis-dev bot reviewed Jun 21, 2024

View reviewed changes

fix: alias for pydantic 2x

ca1b60f

ellipsis-dev bot reviewed Jun 21, 2024

View reviewed changes

substituting id for self id_field and casting id_field to str

da0411f

ellipsis-dev bot reviewed Jun 24, 2024

View reviewed changes

Topic label field bug alias bug(#310)

3b0715d

* fix topic label field alias bug * version bump

ellipsis-dev bot reviewed Jun 25, 2024

View reviewed changes

chore: version bump

9978a8a

ellipsis-dev bot reviewed Jun 25, 2024

View reviewed changes

zanussbaum added 2 commits June 25, 2024 11:51

style: black isort

f205e5d

fix: pyright

7501db3

ellipsis-dev bot reviewed Jun 25, 2024

View reviewed changes

nomic/atlas.py Outdated Show resolved Hide resolved

zanussbaum added 2 commits June 25, 2024 11:58

fix: ellipsis suggestion

3e9e0f2

Merge branch 'main' into map_image

4977dd3

ellipsis-dev bot reviewed Jun 25, 2024

View reviewed changes

zanussbaum added 2 commits June 25, 2024 12:08

fix: pyright pydantic + pyarrow errors

d02344e

style: black isort

5a134db

ellipsis-dev bot reviewed Jun 25, 2024

View reviewed changes

fix: remove unused bytes

016b3e1

ellipsis-dev bot reviewed Jun 25, 2024

View reviewed changes

fix: increase batch size

e286831

ellipsis-dev bot reviewed Jun 25, 2024

View reviewed changes

zanussbaum added 5 commits July 2, 2024 15:21

Merge branch 'main' into map_image

4ad4af5

fix: allow only blobs to be passed with no metadata

af646e1

chore: better logging for when indexed field and embeddings present

d7a33a7

fix: add id field if not present for blobs

e2be038

fix: better logging for blob upload

2b00dc3

ellipsis-dev bot reviewed Jul 4, 2024

View reviewed changes

nomic/dataset.py Show resolved Hide resolved

fix: map cifar 10

c8db85e

AndriyMulyar approved these changes Jul 9, 2024

View reviewed changes

zanussbaum merged commit e4a1c73 into main Jul 9, 2024
0 of 2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: map_data support for Nomic Embed Vision #308

feat: map_data support for Nomic Embed Vision #308

zanussbaum commented Jun 13, 2024 •

edited by ellipsis-dev bot

Loading

AndriyMulyar commented Jun 16, 2024 •

edited

Loading

ellipsis-dev bot left a comment

ellipsis-dev bot left a comment

ellipsis-dev bot left a comment

ellipsis-dev bot left a comment

ellipsis-dev bot left a comment

ellipsis-dev bot left a comment

ellipsis-dev bot left a comment

ellipsis-dev bot left a comment

ellipsis-dev bot left a comment

ellipsis-dev bot left a comment

ellipsis-dev bot left a comment

ellipsis-dev bot left a comment

ellipsis-dev bot left a comment

ellipsis-dev bot left a comment

ellipsis-dev bot left a comment

ellipsis-dev bot left a comment

feat: map_data support for Nomic Embed Vision #308

feat: map_data support for Nomic Embed Vision #308

Conversation

zanussbaum commented Jun 13, 2024 • edited by ellipsis-dev bot Loading

Summary:

AndriyMulyar commented Jun 16, 2024 • edited Loading

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

zanussbaum commented Jun 13, 2024 •

edited by ellipsis-dev bot

Loading

AndriyMulyar commented Jun 16, 2024 •

edited

Loading