Unblock NumPy 2.0 #6991

Merged
merged 9 commits into from
Jul 12, 2024

Conversation

NeilGirdhar
Contributor

Fixes #6980

@NeilGirdhar NeilGirdhar mentioned this pull request Jun 22, 2024
2 tasks
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@NeilGirdhar
Contributor Author

@albertvillanova Any chance we could get this in before the next release? Everything depending on HuggingFace has their NumPy upgrade blocked.

Member

@albertvillanova albertvillanova left a comment

The CI integration tests for Python 3.10 are failing.

Member

@albertvillanova albertvillanova left a comment

Note that the CI tests for Python 3.8 are OK because they use numpy 1.24.4: numpy 2.0.0 requires Python >= 3.9

@albertvillanova
Member

The incompatible libraries are:

  • faiss-cpu 1.8.0.post1 requires numpy<2.0,>=1.0, but you have numpy 2.0.0 which is incompatible.
  • tensorflow 2.16.2 requires numpy<2.0.0,>=1.23.5; python_version <= "3.11", but you have numpy 2.0.0 which is incompatible.
  • transformers 4.42.3 requires numpy<2.0,>=1.17, but you have numpy 2.0.0 which is incompatible.
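To make the conflict concrete, here is a minimal sketch that checks the three requirement strings above against a candidate NumPy version. The helper `allows` is hypothetical and hand-rolled for illustration; it only understands plain numeric versions and the basic comparison operators, whereas real installers implement the full PEP 440 rules.

```python
import operator
import re

# Comparison operators supported by this simplified checker.
OPS = {
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
}


def parse_version(text):
    """Turn '2.0.0' into (2, 0, 0). Only plain numeric releases are handled."""
    return tuple(int(part) for part in text.strip().split("."))


def pad(a, b):
    """Right-pad the shorter tuple with zeros so that 2.0 compares equal to 2.0.0."""
    n = max(len(a), len(b))
    return a + (0,) * (n - len(a)), b + (0,) * (n - len(b))


def allows(specifier, version):
    """Return True if `version` satisfies every clause of `specifier`."""
    candidate = parse_version(version)
    for clause in specifier.split(","):
        match = re.fullmatch(r"\s*(<=|>=|==|!=|<|>)\s*([0-9.]+)\s*", clause)
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        op, bound = match.groups()
        left, right = pad(candidate, parse_version(bound))
        if not OPS[op](left, right):
            return False
    return True


# Requirement strings copied from the error messages above.
REQUIREMENTS = {
    "faiss-cpu 1.8.0.post1": "<2.0,>=1.0",
    "tensorflow 2.16.2": "<2.0.0,>=1.23.5",
    "transformers 4.42.3": "<2.0,>=1.17",
}

if __name__ == "__main__":
    for package, spec in REQUIREMENTS.items():
        print(package, "accepts numpy 2.0.0:", allows(spec, "2.0.0"))
```

All three specifiers exclude 2.0.0 through their upper bound, while a 1.x release such as 1.26.4 satisfies them.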

@NeilGirdhar
Contributor Author

Why is it installing numpy 2 if the dependencies don't support it?

@NeilGirdhar
Contributor Author

For me, I'm getting:

❯ uv pip install --system "datasets[tests] @ ."
Found existing alias for "uv pip install". You should use: "pipi"
Resolved 119 packages in 934ms
   Built datasets @ file:///Users/neil/src/datasets
Prepared 1 package in 1.28s
Uninstalled 1 package in 10ms
Installed 2 packages in 17ms
 - datasets==2.20.1.dev0 (from file:///Users/neil/src/datasets)
 + datasets==2.20.1.dev0 (from file:///Users/neil/src/datasets)
 + numpy==1.26.4

@albertvillanova
Member

Which version of Python do you have?

@NeilGirdhar
Contributor Author

3.12.4. I'll try on 3.10 now.

@albertvillanova
Member

Please note that I obtained the incompatible libraries listed above in my local environment, by forcing the update of numpy.

@albertvillanova
Member

albertvillanova commented Jul 11, 2024

In the Python 3.10 CI, the situation is different:

> uv pip install --system "datasets[tests] @ ."
...
 + faiss-cpu==1.8.0
...
 + numpy==2.0.0
...
 + tensorflow==2.14.0

See, CI installs:

  • faiss-cpu 1.8.0 instead of 1.8.0.post1
  • tensorflow 2.14.0 instead of 2.16.2
  • transformers 4.41.2 instead of 4.42.3
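When local and CI environments resolve to different versions like this, it can help to print what is actually installed on each side. A minimal sketch using only the standard library follows; the function name `report_versions` is hypothetical and the package list is simply the one discussed in this thread.

```python
from importlib import metadata


def report_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    packages = ["numpy", "faiss-cpu", "tensorflow", "transformers"]
    for name, version in report_versions(packages).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in both environments gives two short listings that can be compared line by line.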

@albertvillanova
Member

albertvillanova commented Jul 11, 2024

The main point is that we cannot support numpy 2.0 until tensorflow and faiss do.

Alternatively, we should ignore/select tests depending on the installed versions.
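As a rough illustration of selecting tests by installed version, the sketch below skips a test when NumPy 2 or later is installed. It uses the standard-library `unittest.skipIf`; the names `installed_major`, `requires_numpy_below_2`, and `ExampleTests` are hypothetical and are not taken from the `datasets` test suite.

```python
import unittest
from importlib import metadata


def installed_major(dist_name):
    """Return the major version of an installed distribution, or None if absent."""
    try:
        return int(metadata.version(dist_name).split(".")[0])
    except (metadata.PackageNotFoundError, ValueError):
        return None


NUMPY_MAJOR = installed_major("numpy")


def requires_numpy_below_2(test_item):
    """Skip a test whose optional dependency does not support NumPy 2 yet."""
    return unittest.skipIf(
        NUMPY_MAJOR is not None and NUMPY_MAJOR >= 2,
        "optional dependency does not support NumPy 2 yet",
    )(test_item)


class ExampleTests(unittest.TestCase):
    @requires_numpy_below_2
    def test_needs_numpy_1_only_dependency(self):
        # Placeholder body: a real test would exercise the optional library here.
        self.assertTrue(True)


if __name__ == "__main__":
    unittest.main()
```

With this approach the core tests still run against NumPy 2, and only the tests tied to lagging optional libraries are skipped.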

@NeilGirdhar
Contributor Author

NeilGirdhar commented Jul 11, 2024

Alternatively, we should ignore/select tests depending on the installed versions.

That works.

Alternatively, you could depend on tensorflow >= 2.16.2 (etc.) for the tests?

@albertvillanova
Member

Yes, I was thinking of a workaround solution.

The issue I see is that our CI will not test numpy 2.0 indeed.

@NeilGirdhar
Contributor Author

NeilGirdhar commented Jul 11, 2024

The issue I see is that our CI will not test numpy 2.0 indeed.

Right, that's the advantage of the test skipping you wanted, I see your point.

Thing is, it won't be long before tensorflow supports numpy 2.0, and then the situation is resolved and your tests test numpy 2.0. Do you really want to invest a lot of effort into testing numpy 2.0 for a few months' benefit?

@albertvillanova
Member

Without testing Numpy 2.0, we do not know if there are some other parts in the code broken.

@NeilGirdhar
Contributor Author

NeilGirdhar commented Jul 11, 2024

Without testing Numpy 2.0, we do not know if there are some other parts in the code broken.

Yes, you're right. I understand your point, but you could say this for anything that your test dependencies don't support.

I guess the solution is to write tests that don't depend on tensorflow, etc., but still use numpy. You could write some Jax tests for example.

That said, blocking numpy 2 isn't a good solution in my opinion. These dependencies are extremely late in supporting Numpy 2. They were supposed to be testing against preview releases over three months ago. I don't think the world should have to wait for them.

@albertvillanova
Member

albertvillanova commented Jul 12, 2024

I guess the solution is to write tests that don't depend on tensorflow, etc., but still use numpy.

That is my point. What we cannot do is just blindly support Numpy 2.0 without knowing its consequences. We need to test it:

  • to know if our core code works with it
  • to know what optional libraries are incompatible

For example, while testing locally, I have discovered that librosa is also incompatible with numpy-2.0, due to its dependency on soxr:

@albertvillanova
Member

albertvillanova commented Jul 12, 2024

While testing locally, I have also discovered that pytorch does not support Numpy 2.0 on Windows platforms:

@albertvillanova
Member

I am adding Numpy 2.0 tests to your PR if you don't mind, before merging this PR.

@NeilGirdhar
Contributor Author

Awesome, thank you! Please let me know if I need to do anything.

@albertvillanova
Member

Now we test numpy 2.0 in the test_py310_numpy2 CI tests: https://github.com/huggingface/datasets/actions/runs/9907254874/job/27370545495?pr=6991

 + numpy==2.0.0

Member

@albertvillanova albertvillanova left a comment

Thank you.

@albertvillanova albertvillanova merged commit dfc2b1b into huggingface:main Jul 12, 2024
14 checks passed

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.005709 / 0.011353 (-0.005643) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003947 / 0.011008 (-0.007061) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.064407 / 0.038508 (0.025899) |
| read_batch_unformated after write_array2d | 0.029903 / 0.023109 (0.006794) |
| read_batch_unformated after write_flattened_sequence | 0.244838 / 0.275898 (-0.031060) |
| read_batch_unformated after write_nested_sequence | 0.268894 / 0.323480 (-0.054586) |
| read_col_formatted_as_numpy after write_array2d | 0.003200 / 0.007986 (-0.004786) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.002867 / 0.004328 (-0.001461) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.050016 / 0.004250 (0.045765) |
| read_col_unformated after write_array2d | 0.047682 / 0.037052 (0.010629) |
| read_col_unformated after write_flattened_sequence | 0.252186 / 0.258489 (-0.006303) |
| read_col_unformated after write_nested_sequence | 0.292050 / 0.293841 (-0.001791) |
| read_formatted_as_numpy after write_array2d | 0.030277 / 0.128546 (-0.098270) |
| read_formatted_as_numpy after write_flattened_sequence | 0.012283 / 0.075646 (-0.063364) |
| read_formatted_as_numpy after write_nested_sequence | 0.205875 / 0.419271 (-0.213397) |
| read_unformated after write_array2d | 0.037202 / 0.043533 (-0.006331) |
| read_unformated after write_flattened_sequence | 0.246045 / 0.255139 (-0.009094) |
| read_unformated after write_nested_sequence | 0.272422 / 0.283200 (-0.010777) |
| write_array2d | 0.020572 / 0.141683 (-0.121111) |
| write_flattened_sequence | 1.114343 / 1.452155 (-0.337812) |
| write_nested_sequence | 1.169909 / 1.492716 (-0.322808) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.096612 / 0.018006 (0.078605) |
| get_batch_of_1024_rows | 0.303025 / 0.000490 (0.302535) |
| get_first_row | 0.000210 / 0.000200 (0.000010) |
| get_last_row | 0.000043 / 0.000054 (-0.000011) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.019292 / 0.037411 (-0.018119) |
| shard | 0.062548 / 0.014526 (0.048023) |
| shuffle | 0.076027 / 0.176557 (-0.100530) |
| sort | 0.121752 / 0.737135 (-0.615383) |
| train_test_split | 0.076608 / 0.296338 (-0.219730) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.283900 / 0.215209 (0.068691) |
| read 50000 | 2.829829 / 2.077655 (0.752174) |
| read_batch 50000 10 | 1.428934 / 1.504120 (-0.075186) |
| read_batch 50000 100 | 1.316796 / 1.541195 (-0.224399) |
| read_batch 50000 1000 | 1.330012 / 1.468490 (-0.138478) |
| read_formatted numpy 5000 | 0.702245 / 4.584777 (-3.882532) |
| read_formatted pandas 5000 | 2.380454 / 3.745712 (-1.365259) |
| read_formatted tensorflow 5000 | 2.882881 / 5.269862 (-2.386980) |
| read_formatted torch 5000 | 1.920345 / 4.565676 (-2.645332) |
| read_formatted_batch numpy 5000 10 | 0.077860 / 0.424275 (-0.346415) |
| read_formatted_batch numpy 5000 1000 | 0.005295 / 0.007607 (-0.002312) |
| shuffled read 5000 | 0.336968 / 0.226044 (0.110924) |
| shuffled read 50000 | 3.327808 / 2.268929 (1.058879) |
| shuffled read_batch 50000 10 | 1.781958 / 55.444624 (-53.662666) |
| shuffled read_batch 50000 100 | 1.489412 / 6.876477 (-5.387065) |
| shuffled read_batch 50000 1000 | 1.634829 / 2.142072 (-0.507243) |
| shuffled read_formatted numpy 5000 | 0.787985 / 4.805227 (-4.017243) |
| shuffled read_formatted_batch numpy 5000 10 | 0.134397 / 6.500664 (-6.366267) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.042906 / 0.075469 (-0.032563) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 0.967647 / 1.841788 (-0.874141) |
| map fast-tokenizer batched | 11.714541 / 8.074308 (3.640233) |
| map identity | 9.350228 / 10.191392 (-0.841164) |
| map identity batched | 0.142675 / 0.680424 (-0.537749) |
| map no-op batched | 0.014609 / 0.534201 (-0.519592) |
| map no-op batched numpy | 0.301970 / 0.579283 (-0.277314) |
| map no-op batched pandas | 0.262350 / 0.434364 (-0.172014) |
| map no-op batched pytorch | 0.342933 / 0.540337 (-0.197404) |
| map no-op batched tensorflow | 0.437321 / 1.386936 (-0.949615) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.005622 / 0.011353 (-0.005731) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003958 / 0.011008 (-0.007050) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.050667 / 0.038508 (0.012159) |
| read_batch_unformated after write_array2d | 0.032842 / 0.023109 (0.009733) |
| read_batch_unformated after write_flattened_sequence | 0.252292 / 0.275898 (-0.023606) |
| read_batch_unformated after write_nested_sequence | 0.280602 / 0.323480 (-0.042878) |
| read_col_formatted_as_numpy after write_array2d | 0.004313 / 0.007986 (-0.003673) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.002870 / 0.004328 (-0.001458) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.049549 / 0.004250 (0.045299) |
| read_col_unformated after write_array2d | 0.040448 / 0.037052 (0.003396) |
| read_col_unformated after write_flattened_sequence | 0.270264 / 0.258489 (0.011775) |
| read_col_unformated after write_nested_sequence | 0.302988 / 0.293841 (0.009147) |
| read_formatted_as_numpy after write_array2d | 0.030840 / 0.128546 (-0.097707) |
| read_formatted_as_numpy after write_flattened_sequence | 0.012131 / 0.075646 (-0.063515) |
| read_formatted_as_numpy after write_nested_sequence | 0.060061 / 0.419271 (-0.359211) |
| read_unformated after write_array2d | 0.033025 / 0.043533 (-0.010507) |
| read_unformated after write_flattened_sequence | 0.251909 / 0.255139 (-0.003230) |
| read_unformated after write_nested_sequence | 0.275511 / 0.283200 (-0.007689) |
| write_array2d | 0.018399 / 0.141683 (-0.123284) |
| write_flattened_sequence | 1.160744 / 1.452155 (-0.291411) |
| write_nested_sequence | 1.188265 / 1.492716 (-0.304452) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.097719 / 0.018006 (0.079712) |
| get_batch_of_1024_rows | 0.304389 / 0.000490 (0.303899) |
| get_first_row | 0.000217 / 0.000200 (0.000017) |
| get_last_row | 0.000045 / 0.000054 (-0.000010) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.022964 / 0.037411 (-0.014447) |
| shard | 0.076897 / 0.014526 (0.062372) |
| shuffle | 0.088930 / 0.176557 (-0.087626) |
| sort | 0.128926 / 0.737135 (-0.608209) |
| train_test_split | 0.091049 / 0.296338 (-0.205290) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.285670 / 0.215209 (0.070461) |
| read 50000 | 2.806071 / 2.077655 (0.728416) |
| read_batch 50000 10 | 1.527161 / 1.504120 (0.023041) |
| read_batch 50000 100 | 1.410291 / 1.541195 (-0.130903) |
| read_batch 50000 1000 | 1.427071 / 1.468490 (-0.041419) |
| read_formatted numpy 5000 | 0.705527 / 4.584777 (-3.879250) |
| read_formatted pandas 5000 | 0.926915 / 3.745712 (-2.818797) |
| read_formatted tensorflow 5000 | 2.893078 / 5.269862 (-2.376784) |
| read_formatted torch 5000 | 1.907113 / 4.565676 (-2.658564) |
| read_formatted_batch numpy 5000 10 | 0.077326 / 0.424275 (-0.346949) |
| read_formatted_batch numpy 5000 1000 | 0.005182 / 0.007607 (-0.002425) |
| shuffled read 5000 | 0.332282 / 0.226044 (0.106237) |
| shuffled read 50000 | 3.312889 / 2.268929 (1.043960) |
| shuffled read_batch 50000 10 | 1.853839 / 55.444624 (-53.590785) |
| shuffled read_batch 50000 100 | 1.592013 / 6.876477 (-5.284464) |
| shuffled read_batch 50000 1000 | 1.620234 / 2.142072 (-0.521838) |
| shuffled read_formatted numpy 5000 | 0.776894 / 4.805227 (-4.028333) |
| shuffled read_formatted_batch numpy 5000 10 | 0.132411 / 6.500664 (-6.368253) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.041430 / 0.075469 (-0.034039) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.003468 / 1.841788 (-0.838320) |
| map fast-tokenizer batched | 12.472251 / 8.074308 (4.397943) |
| map identity | 10.603243 / 10.191392 (0.411851) |
| map identity batched | 0.132561 / 0.680424 (-0.547863) |
| map no-op batched | 0.015790 / 0.534201 (-0.518411) |
| map no-op batched numpy | 0.306724 / 0.579283 (-0.272559) |
| map no-op batched pandas | 0.125812 / 0.434364 (-0.308552) |
| map no-op batched pytorch | 0.343782 / 0.540337 (-0.196555) |
| map no-op batched tensorflow | 0.445915 / 1.386936 (-0.941021) |

Development

Successfully merging this pull request may close these issues.

Support NumPy 2.0
3 participants