Unblock NumPy 2.0 #6991

NeilGirdhar · 2024-06-22T09:19:53Z

HuggingFaceDocBuilderDev · 2024-06-26T13:49:44Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

NeilGirdhar · 2024-07-10T17:57:06Z

@albertvillanova Any chance we could get this in before the next release? Everything that depends on HuggingFace has its NumPy upgrade blocked.

albertvillanova

The CI integration tests for Python 3.10 are failing.

albertvillanova

Note that the CI tests for Python 3.8 are OK because they use numpy 1.24.4: numpy 2.0.0 requires Python >= 3.9.

albertvillanova · 2024-07-11T10:08:26Z

The incompatible libraries are:

faiss-cpu 1.8.0.post1 requires numpy<2.0,>=1.0, but you have numpy 2.0.0 which is incompatible.
tensorflow 2.16.2 requires numpy<2.0.0,>=1.23.5; python_version <= "3.11", but you have numpy 2.0.0 which is incompatible.
transformers 4.42.3 requires numpy<2.0,>=1.17, but you have numpy 2.0.0 which is incompatible.
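
Each line above is pip reporting that the installed numpy version falls outside a package's declared range. As a rough illustration, that check can be sketched with the standard library alone. The helper names `parse_version` and `satisfies` are hypothetical, the parser handles only plain numeric versions and the six basic comparison operators, and it ignores environment markers such as `python_version <= "3.11"`. Real tooling would more likely rely on `packaging.specifiers.SpecifierSet`.

```python
import operator
import re

# Comparison operators supported by this simplified checker.
_OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
}


def parse_version(text):
    """Return the leading numeric parts of a version string as a tuple of ints."""
    parts = []
    for piece in text.strip().split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as 'post1' or 'dev0'
        parts.append(int(piece))
    return tuple(parts)


def _pad(left, right):
    """Right-pad the shorter tuple with zeros so that 2.0 compares equal to 2.0.0."""
    width = max(len(left), len(right))
    return left + (0,) * (width - len(left)), right + (0,) * (width - len(right))


def satisfies(version, specifier):
    """Check a version against a comma-separated specifier such as '<2.0,>=1.0'."""
    installed = parse_version(version)
    for clause in specifier.split(","):
        match = re.fullmatch(r"\s*(<=|>=|==|!=|<|>)\s*([\w.]+)\s*", clause)
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        op, bound = match.groups()
        left, right = _pad(installed, parse_version(bound))
        if not _OPS[op](left, right):
            return False
    return True


print(satisfies("2.0.0", "<2.0,>=1.0"))   # False: the faiss-cpu range excludes numpy 2.0.0
print(satisfies("1.26.4", "<2.0,>=1.0"))  # True
```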

NeilGirdhar · 2024-07-11T10:21:11Z

Why is it installing numpy 2 if the dependencies don't support it?

NeilGirdhar · 2024-07-11T10:42:34Z

For me, I'm getting:

❯ uv pip install --system "datasets[tests] @ ."
Found existing alias for "uv pip install". You should use: "pipi"
Resolved 119 packages in 934ms
   Built datasets @ file:///Users/neil/src/datasets
Prepared 1 package in 1.28s
Uninstalled 1 package in 10ms
Installed 2 packages in 17ms
 - datasets==2.20.1.dev0 (from file:///Users/neil/src/datasets)
 + datasets==2.20.1.dev0 (from file:///Users/neil/src/datasets)
 + numpy==1.26.4

albertvillanova · 2024-07-11T11:08:33Z

Which version of Python do you have?

NeilGirdhar · 2024-07-11T11:11:27Z

3.12.4. I'll try on 3.10 now.

albertvillanova · 2024-07-11T11:13:29Z

Please note that I obtained the list of incompatible libraries above in my local environment, by forcing the numpy update.

albertvillanova · 2024-07-11T11:17:25Z

In the Python 3.10 CI, the situation is different.

For example, the CI installs an older version of tensorflow (2.14.0), which probably predates the upper bound on numpy. See the details: https://github.com/huggingface/datasets/actions/runs/9879100332/job/27306903343?pr=6991

> uv pip install --system "datasets[tests] @ ."
...
 + faiss-cpu==1.8.0
...
 + numpy==2.0.0
...
 + tensorflow==2.14.0

Note that the CI installs:

faiss-cpu 1.8.0 instead of 1.8.0.post1
tensorflow 2.14.0 instead of 2.16.2
transformers 4.41.2 instead of 4.42.3
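
To compare a local environment with a CI log like the one above, it can help to print what is actually installed. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(dist_names):
    """Map each distribution name to its installed version, or None if it is absent."""
    found = {}
    for name in dist_names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(["numpy", "faiss-cpu", "tensorflow", "transformers"]).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

Running this both locally and in the CI job would show directly where the resolved versions differ.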

albertvillanova · 2024-07-11T11:19:52Z

~~The main point is that we cannot support numpy 2.0 until tensorflow and faiss do.~~

Alternatively, we should ignore/select tests depending on the installed versions.
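
One way to skip tests depending on the installed versions is sketched below with the standard library's `unittest.skipIf`. The names `installed_major`, `require_major_below`, and `FaissIndexTest` are hypothetical and are not taken from this PR; `pytest.mark.skipif` would be the pytest-native way to express the same condition.

```python
import unittest
from importlib.metadata import PackageNotFoundError, version


def installed_major(dist_name):
    """Return the major version of an installed distribution, or None if it is absent."""
    try:
        return int(version(dist_name).split(".")[0])
    except PackageNotFoundError:
        return None


def require_major_below(dist_name, limit):
    """Hypothetical decorator: skip unless dist_name is installed with major version < limit."""
    major = installed_major(dist_name)
    unsupported = major is None or major >= limit
    return unittest.skipIf(unsupported, f"test requires {dist_name} < {limit}")


class FaissIndexTest(unittest.TestCase):
    @require_major_below("numpy", 2)
    def test_needs_numpy1(self):
        # Placeholder body: a real test would exercise a numpy-1-only dependency here.
        self.assertTrue(True)
```

With numpy 2.0 installed the test is reported as skipped rather than failed, so the remaining tests still exercise numpy 2.0.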

NeilGirdhar · 2024-07-11T11:33:40Z

> Alternatively, we should ignore/select tests depending on the installed versions.

That works.

Alternatively, you could depend on tensorflow >= 2.16.2 (etc.) for the tests?
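
Pinning a minimum version for the test dependencies could look roughly like the following. This is a sketch only: the list name `TESTS_REQUIRE`, the split at Python 3.10, and the older lower bound are assumptions for illustration, not the project's actual `setup.py`. The `; python_version ...` suffix is standard environment-marker syntax (PEP 508).

```python
# Hypothetical test requirements; names and version bounds are illustrative only.
TESTS_REQUIRE = [
    # Newer interpreters: require the tensorflow release named in this thread.
    "tensorflow>=2.16.2; python_version >= '3.10'",
    # Older interpreters: fall back to an earlier minimum (assumed value).
    "tensorflow>=2.6.0; python_version < '3.10'",
]


def split_requirement(requirement):
    """Split 'name>=x; marker' into its version specifier and environment marker."""
    spec, _, marker = requirement.partition(";")
    return spec.strip(), marker.strip()


for requirement in TESTS_REQUIRE:
    print(split_requirement(requirement))
```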

albertvillanova · 2024-07-11T11:58:07Z

Yes, I was thinking of a workaround.

The issue I see is that our CI would then not actually test numpy 2.0.

NeilGirdhar · 2024-07-11T11:59:43Z

> The issue I see is that our CI will not test numpy 2.0 indeed.

Right, that's the advantage of the test skipping you wanted. I see your point.

Thing is, it won't be long before tensorflow supports numpy 2.0, at which point the situation resolves itself and your tests exercise numpy 2.0. Do you really want to invest a lot of effort into testing numpy 2.0 for a few months' benefit?

albertvillanova · 2024-07-11T12:24:10Z

Without testing NumPy 2.0, we do not know whether other parts of the code are broken.

NeilGirdhar · 2024-07-11T12:37:55Z

> Without testing Numpy 2.0, we do not know if there are some other parts in the code broken.

Yes, you're right. I understand your point, but you could say this for anything that your test dependencies don't support.

I guess the solution is to write tests that don't depend on tensorflow, etc., but still use numpy. You could write some JAX tests, for example.

That said, blocking NumPy 2 isn't a good solution in my opinion. These dependencies are extremely late in supporting NumPy 2. They were supposed to be testing against preview releases over three months ago. I don't think the world should have to wait for them.

albertvillanova · 2024-07-12T05:10:55Z

> I guess the solution is to write tests that don't depend on tensorflow, etc., but still use numpy.

That is my point. What we cannot do is blindly support NumPy 2.0 without knowing the consequences. We need to test it:

- to know whether our core code works with it
- to know which optional libraries are incompatible

For example, while testing locally, I have discovered that librosa is also incompatible with numpy-2.0, due to its dependency on soxr:

NumPy 2 support, Next release plan dofuuz/python-soxr#28
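
A quick local way to find out which optional libraries break is to try importing each one and record the failures. The sketch below uses only the standard library; the helper name `probe_imports` is hypothetical, and an import that succeeds does not guarantee the library works correctly at runtime.

```python
import importlib


def probe_imports(module_names):
    """Try to import each module; map its name to None on success or to the error text."""
    report = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception as error:  # compiled extensions may raise more than ImportError
            report[name] = f"{type(error).__name__}: {error}"
        else:
            report[name] = None
    return report
```

Calling `probe_imports(["librosa", "soxr"])` in an environment with numpy 2.0 installed would then list which of those imports fail, along with the error message.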

albertvillanova · 2024-07-12T06:32:44Z

While testing locally, I have also discovered that pytorch does not support Numpy 2.0 on Windows platforms:

Update PyTorch CI to numpy 2.0 pytorch/pytorch#128860

albertvillanova · 2024-07-12T10:06:17Z

If you don't mind, I am adding NumPy 2.0 tests to your PR before merging it.

NeilGirdhar · 2024-07-12T10:34:05Z

Awesome, thank you! Please let me know if I need to do anything.

albertvillanova · 2024-07-12T11:56:01Z

Now we test numpy 2.0 in the test_py310_numpy2 CI tests: https://github.com/huggingface/datasets/actions/runs/9907254874/job/27370545495?pr=6991

 + numpy==2.0.0

albertvillanova

Thank you.

github-actions · 2024-07-12T12:11:17Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005709 / 0.011353 (-0.005643)	0.003947 / 0.011008 (-0.007061)	0.064407 / 0.038508 (0.025899)	0.029903 / 0.023109 (0.006794)	0.244838 / 0.275898 (-0.031060)	0.268894 / 0.323480 (-0.054586)	0.003200 / 0.007986 (-0.004786)	0.002867 / 0.004328 (-0.001461)	0.050016 / 0.004250 (0.045765)	0.047682 / 0.037052 (0.010629)	0.252186 / 0.258489 (-0.006303)	0.292050 / 0.293841 (-0.001791)	0.030277 / 0.128546 (-0.098270)	0.012283 / 0.075646 (-0.063364)	0.205875 / 0.419271 (-0.213397)	0.037202 / 0.043533 (-0.006331)	0.246045 / 0.255139 (-0.009094)	0.272422 / 0.283200 (-0.010777)	0.020572 / 0.141683 (-0.121111)	1.114343 / 1.452155 (-0.337812)	1.169909 / 1.492716 (-0.322808)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.096612 / 0.018006 (0.078605)	0.303025 / 0.000490 (0.302535)	0.000210 / 0.000200 (0.000010)	0.000043 / 0.000054 (-0.000011)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.019292 / 0.037411 (-0.018119)	0.062548 / 0.014526 (0.048023)	0.076027 / 0.176557 (-0.100530)	0.121752 / 0.737135 (-0.615383)	0.076608 / 0.296338 (-0.219730)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.283900 / 0.215209 (0.068691)	2.829829 / 2.077655 (0.752174)	1.428934 / 1.504120 (-0.075186)	1.316796 / 1.541195 (-0.224399)	1.330012 / 1.468490 (-0.138478)	0.702245 / 4.584777 (-3.882532)	2.380454 / 3.745712 (-1.365259)	2.882881 / 5.269862 (-2.386980)	1.920345 / 4.565676 (-2.645332)	0.077860 / 0.424275 (-0.346415)	0.005295 / 0.007607 (-0.002312)	0.336968 / 0.226044 (0.110924)	3.327808 / 2.268929 (1.058879)	1.781958 / 55.444624 (-53.662666)	1.489412 / 6.876477 (-5.387065)	1.634829 / 2.142072 (-0.507243)	0.787985 / 4.805227 (-4.017243)	0.134397 / 6.500664 (-6.366267)	0.042906 / 0.075469 (-0.032563)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.967647 / 1.841788 (-0.874141)	11.714541 / 8.074308 (3.640233)	9.350228 / 10.191392 (-0.841164)	0.142675 / 0.680424 (-0.537749)	0.014609 / 0.534201 (-0.519592)	0.301970 / 0.579283 (-0.277314)	0.262350 / 0.434364 (-0.172014)	0.342933 / 0.540337 (-0.197404)	0.437321 / 1.386936 (-0.949615)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005622 / 0.011353 (-0.005731)	0.003958 / 0.011008 (-0.007050)	0.050667 / 0.038508 (0.012159)	0.032842 / 0.023109 (0.009733)	0.252292 / 0.275898 (-0.023606)	0.280602 / 0.323480 (-0.042878)	0.004313 / 0.007986 (-0.003673)	0.002870 / 0.004328 (-0.001458)	0.049549 / 0.004250 (0.045299)	0.040448 / 0.037052 (0.003396)	0.270264 / 0.258489 (0.011775)	0.302988 / 0.293841 (0.009147)	0.030840 / 0.128546 (-0.097707)	0.012131 / 0.075646 (-0.063515)	0.060061 / 0.419271 (-0.359211)	0.033025 / 0.043533 (-0.010507)	0.251909 / 0.255139 (-0.003230)	0.275511 / 0.283200 (-0.007689)	0.018399 / 0.141683 (-0.123284)	1.160744 / 1.452155 (-0.291411)	1.188265 / 1.492716 (-0.304452)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.097719 / 0.018006 (0.079712)	0.304389 / 0.000490 (0.303899)	0.000217 / 0.000200 (0.000017)	0.000045 / 0.000054 (-0.000010)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.022964 / 0.037411 (-0.014447)	0.076897 / 0.014526 (0.062372)	0.088930 / 0.176557 (-0.087626)	0.128926 / 0.737135 (-0.608209)	0.091049 / 0.296338 (-0.205290)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.285670 / 0.215209 (0.070461)	2.806071 / 2.077655 (0.728416)	1.527161 / 1.504120 (0.023041)	1.410291 / 1.541195 (-0.130903)	1.427071 / 1.468490 (-0.041419)	0.705527 / 4.584777 (-3.879250)	0.926915 / 3.745712 (-2.818797)	2.893078 / 5.269862 (-2.376784)	1.907113 / 4.565676 (-2.658564)	0.077326 / 0.424275 (-0.346949)	0.005182 / 0.007607 (-0.002425)	0.332282 / 0.226044 (0.106237)	3.312889 / 2.268929 (1.043960)	1.853839 / 55.444624 (-53.590785)	1.592013 / 6.876477 (-5.284464)	1.620234 / 2.142072 (-0.521838)	0.776894 / 4.805227 (-4.028333)	0.132411 / 6.500664 (-6.368253)	0.041430 / 0.075469 (-0.034039)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.003468 / 1.841788 (-0.838320)	12.472251 / 8.074308 (4.397943)	10.603243 / 10.191392 (0.411851)	0.132561 / 0.680424 (-0.547863)	0.015790 / 0.534201 (-0.518411)	0.306724 / 0.579283 (-0.272559)	0.125812 / 0.434364 (-0.308552)	0.343782 / 0.540337 (-0.196555)	0.445915 / 1.386936 (-0.941021)

NeilGirdhar mentioned this pull request Jun 22, 2024

Support NumPy 2.0 #6980

Closed

2 tasks

Unblock NumPy 2.0

50c756a

NeilGirdhar force-pushed the np2 branch from cd2663d to 50c756a on July 10, 2024 17:55

albertvillanova reviewed Jul 11, 2024

albertvillanova requested changes Jul 11, 2024

.

f20a805

NeilGirdhar force-pushed the np2 branch from 5ac9f27 to f20a805 on July 11, 2024 12:02

NeilGirdhar and others added 4 commits July 12, 2024 06:34

Merge branch 'main' into np2

7ac7365

Revert tensorflow min version

a0ae6d7

Add CI tests for numpy2

12e624e

Implement test require_numpy1_on_windows

ccb164a

albertvillanova added 3 commits July 12, 2024 13:38

Mark tests with require_numpy1_on_windows

0b63354

Fix test skip reason

6fb7b1a

Add clarifying comment

14306d7

albertvillanova approved these changes Jul 12, 2024

albertvillanova merged commit dfc2b1b into huggingface:main Jul 12, 2024
14 checks passed

albertvillanova mentioned this pull request Jul 12, 2024

Fix tensorflow min version depending on Python version #7045

Merged
