Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cant Downlaod Common Voice 17.0 hy-AM #6848

Open
mheryerznkanyan opened this issue Apr 29, 2024 · 1 comment
Open

Cant Downlaod Common Voice 17.0 hy-AM #6848

mheryerznkanyan opened this issue Apr 29, 2024 · 1 comment

Comments

@mheryerznkanyan
Copy link

mheryerznkanyan commented Apr 29, 2024

Describe the bug

I want to download Common Voice 17.0 hy-AM but it returns an error.


The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_name='hfds_config', config_path=None)
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
/usr/local/lib/python3.10/dist-packages/datasets/load.py:1429: FutureWarning: The repository for mozilla-foundation/common_voice_17_0 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/mozilla-foundation/common_voice_17_0
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Reading metadata...: 6180it [00:00, 133224.37it/s]les/s]
Generating train split: 0 examples [00:00, ? examples/s]
HuggingFace datasets failed due to some reason (stack trace below).
For certain datasets (eg: MCV), it may be necessary to login to the huggingface-cli (via `huggingface-cli login`).
Once logged in, you need to set `use_auth_token=True` when calling this script.

Traceback error for reference :

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1743, in _prepare_split_single
    example = self.info.features.encode_example(record) if self.info.features is not None else record
  File "/usr/local/lib/python3.10/dist-packages/datasets/features/features.py", line 1878, in encode_example
    return encode_nested_example(self, example)
  File "/usr/local/lib/python3.10/dist-packages/datasets/features/features.py", line 1243, in encode_nested_example
    {
  File "/usr/local/lib/python3.10/dist-packages/datasets/features/features.py", line 1243, in <dictcomp>
    {
  File "/usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py", line 326, in zip_dict
    yield key, tuple(d[key] for d in dicts)
  File "/usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py", line 326, in <genexpr>
    yield key, tuple(d[key] for d in dicts)
KeyError: 'sentence_id'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/nemo/scripts/speech_recognition/convert_hf_dataset_to_nemo.py", line 358, in main
    dataset = load_dataset(
  File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 2549, in load_dataset
    builder_instance.download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1005, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1767, in _download_and_prepare
    super()._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1100, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1605, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1762, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

Steps to reproduce the bug

from datasets import load_dataset

cv_17 = load_dataset("mozilla-foundation/common_voice_17_0", "hy-AM")

Expected behavior

It works fine with common_voice_16_1

Environment info

  • datasets version: 2.18.0
  • Platform: Linux-5.15.0-1042-nvidia-x86_64-with-glibc2.35
  • Python version: 3.11.6
  • huggingface_hub version: 0.22.2
  • PyArrow version: 15.0.2
  • Pandas version: 2.2.2
  • fsspec version: 2024.2.0
@SalomonKisters
Copy link

Same issue here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants