Can't Download Common Voice 17.0 hy-AM #6848

mheryerznkanyan · 2024-04-29T10:06:02Z

Describe the bug

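I want to download the hy-AM configuration of Common Voice 17.0, but loading it fails with KeyError: 'sentence_id' while generating the train split. The full output follows.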


The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_name='hfds_config', config_path=None)
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
/usr/local/lib/python3.10/dist-packages/datasets/load.py:1429: FutureWarning: The repository for mozilla-foundation/common_voice_17_0 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/mozilla-foundation/common_voice_17_0
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Reading metadata...: 6180it [00:00, 133224.37it/s]
Generating train split: 0 examples [00:00, ? examples/s]
HuggingFace datasets failed due to some reason (stack trace below).
For certain datasets (eg: MCV), it may be necessary to login to the huggingface-cli (via `huggingface-cli login`).
Once logged in, you need to set `use_auth_token=True` when calling this script.

Traceback error for reference :

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1743, in _prepare_split_single
    example = self.info.features.encode_example(record) if self.info.features is not None else record
  File "/usr/local/lib/python3.10/dist-packages/datasets/features/features.py", line 1878, in encode_example
    return encode_nested_example(self, example)
  File "/usr/local/lib/python3.10/dist-packages/datasets/features/features.py", line 1243, in encode_nested_example
    {
  File "/usr/local/lib/python3.10/dist-packages/datasets/features/features.py", line 1243, in <dictcomp>
    {
  File "/usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py", line 326, in zip_dict
    yield key, tuple(d[key] for d in dicts)
  File "/usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py", line 326, in <genexpr>
    yield key, tuple(d[key] for d in dicts)
KeyError: 'sentence_id'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/nemo/scripts/speech_recognition/convert_hf_dataset_to_nemo.py", line 358, in main
    dataset = load_dataset(
  File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 2549, in load_dataset
    builder_instance.download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1005, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1767, in _download_and_prepare
    super()._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1100, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1605, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1762, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
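
Reading the first traceback, the KeyError appears to be raised when each generated record is zipped against the declared features and the record has no 'sentence_id' entry. The snippet below is only a minimal sketch of that mechanism, not the library's actual code: `zip_dict` here is a simplified stand-in for the helper named in the traceback, and the `features` and `record` dictionaries (including every field name other than 'sentence_id') are hypothetical.

```python
def zip_dict(*dicts):
    """Simplified stand-in: for each key of the first dict, read that key from every dict."""
    for key in dicts[0]:
        yield key, tuple(d[key] for d in dicts)


# Hypothetical declared schema and one hypothetical record yielded by the loader.
# Only 'sentence_id' comes from the traceback; the other fields are illustrative.
features = {"client_id": "string", "sentence_id": "string", "sentence": "string"}
record = {"client_id": "abc", "sentence": "example text"}


def missing_keys(features, record):
    """Return schema keys that the record does not provide."""
    return sorted(set(features) - set(record))


def try_encode(features, record):
    """Pair each schema key with the record value, reporting a KeyError as text."""
    try:
        return {key: value for key, (_, value) in zip_dict(features, record)}
    except KeyError as err:
        return f"KeyError: {err}"


print(missing_keys(features, record))
print(try_encode(features, record))
```

If this reading is right, the mismatch is between the fields the dataset's loading code declares and the fields it actually yields for this configuration, rather than anything in the calling script.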

Steps to reproduce the bug

from datasets import load_dataset

cv_17 = load_dataset("mozilla-foundation/common_voice_17_0", "hy-AM")

Expected behavior

The dataset should load without errors, as it does with common_voice_16_1.

Environment info

datasets version: 2.18.0
Platform: Linux-5.15.0-1042-nvidia-x86_64-with-glibc2.35
Python version: 3.11.6
huggingface_hub version: 0.22.2
PyArrow version: 15.0.2
Pandas version: 2.2.2
fsspec version: 2024.2.0

SalomonKisters · 2024-05-13T06:09:29Z

Same issue here.

jerome-white mentioned this issue May 4, 2024

DataFilesNotFoundError for datasets in the open-llm-leaderboard #6866

Closed
