Update features.py to avoid bfloat16 unsupported error #6607

skaulintel · 2024-01-20T00:39:44Z

Let me know if there are any tests I need to clear.

lhoestq · 2024-03-01T16:11:29Z

I don't think all torch tensors should be converted to float. What if it's a tensor of integers, for example?
Maybe you can check the tensor dtype before converting.
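
A minimal sketch of what such a dtype check could look like. The helper name `tensor_to_numpy` is hypothetical and is not taken from this PR's diff; the sketch assumes the error comes from NumPy having no bfloat16 dtype, so only that dtype is upcast and integer tensors are left untouched:

```python
import numpy as np
import torch


def tensor_to_numpy(tensor: torch.Tensor) -> np.ndarray:
    """Hypothetical helper: convert a torch tensor to a NumPy array.

    Only bfloat16 is upcast, since NumPy has no native bfloat16 dtype.
    All other dtypes (integers, bool, float16/32/64) keep their dtype.
    """
    tensor = tensor.detach().cpu()
    if tensor.dtype == torch.bfloat16:
        # bfloat16 -> float32 is lossless: float32 has the same exponent
        # range and a wider mantissa.
        tensor = tensor.to(torch.float32)
    return tensor.numpy()


int_array = tensor_to_numpy(torch.tensor([1, 2, 3]))
bf16_array = tensor_to_numpy(torch.tensor([1.5, 2.5], dtype=torch.bfloat16))
print(int_array.dtype, bf16_array.dtype)
```

With a check like this, an integer tensor comes back as an integer array rather than being silently converted to float.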

stoical07 · 2024-05-16T14:23:01Z

@lhoestq Please could this be merged? 🙏

lhoestq

Yes! Sorry for the delay :)

github-actions · 2024-05-17T09:46:27Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005552 / 0.011353 (-0.005801)	0.003707 / 0.011008 (-0.007301)	0.063794 / 0.038508 (0.025286)	0.031897 / 0.023109 (0.008788)	0.263086 / 0.275898 (-0.012812)	0.281184 / 0.323480 (-0.042296)	0.003183 / 0.007986 (-0.004802)	0.002681 / 0.004328 (-0.001648)	0.050259 / 0.004250 (0.046009)	0.048395 / 0.037052 (0.011342)	0.266925 / 0.258489 (0.008436)	0.298146 / 0.293841 (0.004305)	0.027995 / 0.128546 (-0.100551)	0.010689 / 0.075646 (-0.064957)	0.204956 / 0.419271 (-0.214316)	0.036453 / 0.043533 (-0.007080)	0.255406 / 0.255139 (0.000267)	0.271388 / 0.283200 (-0.011811)	0.019748 / 0.141683 (-0.121935)	1.103926 / 1.452155 (-0.348228)	1.167250 / 1.492716 (-0.325466)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.100483 / 0.018006 (0.082477)	0.307331 / 0.000490 (0.306841)	0.000216 / 0.000200 (0.000016)	0.000043 / 0.000054 (-0.000011)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018918 / 0.037411 (-0.018493)	0.062569 / 0.014526 (0.048044)	0.074935 / 0.176557 (-0.101621)	0.122590 / 0.737135 (-0.614545)	0.076475 / 0.296338 (-0.219864)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.279001 / 0.215209 (0.063792)	2.771630 / 2.077655 (0.693975)	1.439666 / 1.504120 (-0.064454)	1.303422 / 1.541195 (-0.237773)	1.355670 / 1.468490 (-0.112820)	0.576264 / 4.584777 (-4.008513)	2.394868 / 3.745712 (-1.350844)	2.941487 / 5.269862 (-2.328375)	1.808733 / 4.565676 (-2.756943)	0.063691 / 0.424275 (-0.360584)	0.005399 / 0.007607 (-0.002208)	0.335610 / 0.226044 (0.109566)	3.295903 / 2.268929 (1.026974)	1.771836 / 55.444624 (-53.672788)	1.511246 / 6.876477 (-5.365231)	1.535926 / 2.142072 (-0.606147)	0.649020 / 4.805227 (-4.156207)	0.119754 / 6.500664 (-6.380910)	0.043319 / 0.075469 (-0.032150)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.967275 / 1.841788 (-0.874513)	12.358482 / 8.074308 (4.284174)	9.933324 / 10.191392 (-0.258068)	0.133565 / 0.680424 (-0.546859)	0.015650 / 0.534201 (-0.518551)	0.286978 / 0.579283 (-0.292305)	0.262912 / 0.434364 (-0.171451)	0.330335 / 0.540337 (-0.210002)	0.427671 / 1.386936 (-0.959265)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005660 / 0.011353 (-0.005693)	0.003908 / 0.011008 (-0.007101)	0.051874 / 0.038508 (0.013366)	0.033141 / 0.023109 (0.010032)	0.270512 / 0.275898 (-0.005386)	0.296790 / 0.323480 (-0.026690)	0.004335 / 0.007986 (-0.003651)	0.002842 / 0.004328 (-0.001487)	0.078264 / 0.004250 (0.074014)	0.044436 / 0.037052 (0.007384)	0.283230 / 0.258489 (0.024741)	0.318026 / 0.293841 (0.024185)	0.031459 / 0.128546 (-0.097087)	0.010710 / 0.075646 (-0.064937)	0.058152 / 0.419271 (-0.361119)	0.034021 / 0.043533 (-0.009512)	0.269956 / 0.255139 (0.014817)	0.288783 / 0.283200 (0.005583)	0.019246 / 0.141683 (-0.122436)	1.127264 / 1.452155 (-0.324891)	1.169777 / 1.492716 (-0.322939)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.101523 / 0.018006 (0.083516)	0.315120 / 0.000490 (0.314630)	0.000218 / 0.000200 (0.000018)	0.000053 / 0.000054 (-0.000001)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.023078 / 0.037411 (-0.014333)	0.080021 / 0.014526 (0.065495)	0.089574 / 0.176557 (-0.086982)	0.131258 / 0.737135 (-0.605877)	0.090604 / 0.296338 (-0.205734)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.302197 / 0.215209 (0.086988)	2.980071 / 2.077655 (0.902416)	1.585480 / 1.504120 (0.081360)	1.462904 / 1.541195 (-0.078291)	1.501102 / 1.468490 (0.032612)	0.580342 / 4.584777 (-4.004435)	0.972118 / 3.745712 (-2.773594)	2.930530 / 5.269862 (-2.339331)	1.824132 / 4.565676 (-2.741545)	0.064711 / 0.424275 (-0.359564)	0.005084 / 0.007607 (-0.002523)	0.352693 / 0.226044 (0.126649)	3.522775 / 2.268929 (1.253847)	1.965063 / 55.444624 (-53.479561)	1.679250 / 6.876477 (-5.197226)	1.711691 / 2.142072 (-0.430382)	0.663719 / 4.805227 (-4.141509)	0.119858 / 6.500664 (-6.380806)	0.041744 / 0.075469 (-0.033725)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.017970 / 1.841788 (-0.823817)	12.898917 / 8.074308 (4.824609)	10.244728 / 10.191392 (0.053336)	0.133860 / 0.680424 (-0.546564)	0.016044 / 0.534201 (-0.518157)	0.287543 / 0.579283 (-0.291740)	0.126418 / 0.434364 (-0.307946)	0.394970 / 0.540337 (-0.145368)	0.420455 / 1.386936 (-0.966481)

Update features.py to avoid bfloat16 unsupported error

f213b7a

Update features.py

9a8e267

lhoestq approved these changes May 17, 2024

lhoestq merged commit b7d71ff into huggingface:main May 17, 2024
