DataFilesNotFoundError for datasets in the open-llm-leaderboard #6866

Closed
jerome-white opened this issue May 4, 2024 · 3 comments
@jerome-white

Describe the bug

When trying to get config names or load any dataset within the open-llm-leaderboard ecosystem (open-llm-leaderboard/details_) I receive a DataFilesNotFoundError. For the last month or so I've been loading datasets from the leaderboard almost every day; yesterday was the first time I saw this.

Steps to reproduce the bug

This snippet has three cells:

  1. Loads the modules
  2. Tries to get config names
  3. Tries to load the dataset

I've chosen "davidkim205"'s Rhea-72b-v0.5 model because it is one of the best performers on the leaderboard and should therefore be unlikely to have dataset issues:

In [1]: from datasets import load_dataset, get_dataset_config_names

In [2]: get_dataset_config_names("open-llm-leaderboard/details_davidkim205__Rhea-72b-v0.5")
---------------------------------------------------------------------------
DataFilesNotFoundError                    Traceback (most recent call last)
Cell In[2], line 1
----> 1 get_dataset_config_names("open-llm-leaderboard/details_davidkim205__Rhea-72b-v0.5")

File ~/open-llm-bda/venv/lib/python3.11/site-packages/datasets/inspect.py:347, in get_dataset_config_names(path, revision, download_config, download_mode, dynamic_modules_path, data_files, **download_kwargs)
    291 def get_dataset_config_names(
    292     path: str,
    293     revision: Optional[Union[str, Version]] = None,
   (...)
    298     **download_kwargs,
    299 ):
    300     """Get the list of available config names for a particular dataset.
    301 
    302     Args:
   (...)
    345     ```
    346     """
--> 347     dataset_module = dataset_module_factory(
    348         path,
    349         revision=revision,
    350         download_config=download_config,
    351         download_mode=download_mode,
    352         dynamic_modules_path=dynamic_modules_path,
    353         data_files=data_files,
    354         **download_kwargs,
    355     )
    356     builder_cls = get_dataset_builder_class(dataset_module, dataset_name=os.path.basename(path))
    357     return list(builder_cls.builder_configs.keys()) or [
    358         dataset_module.builder_kwargs.get("config_name", builder_cls.DEFAULT_CONFIG_NAME or "default")
    359     ]

File ~/open-llm-bda/venv/lib/python3.11/site-packages/datasets/load.py:1821, in dataset_module_factory(path, revision, download_config, download_mode, dynamic_modules_path, data_dir, data_files, cache_dir, trust_remote_code, _require_default_config_name, _require_custom_configs, **download_kwargs)
   1812     return LocalDatasetModuleFactoryWithScript(
   1813         combined_path,
   1814         download_mode=download_mode,
   1815         dynamic_modules_path=dynamic_modules_path,
   1816         trust_remote_code=trust_remote_code,
   1817     ).get_module()
   1818 elif os.path.isdir(path):
   1819     return LocalDatasetModuleFactoryWithoutScript(
   1820         path, data_dir=data_dir, data_files=data_files, download_mode=download_mode
-> 1821     ).get_module()
   1822 # Try remotely
   1823 elif is_relative_path(path) and path.count("/") <= 1:

File ~/open-llm-bda/venv/lib/python3.11/site-packages/datasets/load.py:1039, in LocalDatasetModuleFactoryWithoutScript.get_module(self)
   1033     patterns = get_data_patterns(base_path)
   1034 data_files = DataFilesDict.from_patterns(
   1035     patterns,
   1036     base_path=base_path,
   1037     allowed_extensions=ALL_ALLOWED_EXTENSIONS,
   1038 )
-> 1039 module_name, default_builder_kwargs = infer_module_for_data_files(
   1040     data_files=data_files,
   1041     path=self.path,
   1042 )
   1043 data_files = data_files.filter_extensions(_MODULE_TO_EXTENSIONS[module_name])
   1044 # Collect metadata files if the module supports them

File ~/open-llm-bda/venv/lib/python3.11/site-packages/datasets/load.py:597, in infer_module_for_data_files(data_files, path, download_config)
    595     raise ValueError(f"Couldn't infer the same data file format for all splits. Got {split_modules}")
    596 if not module_name:
--> 597     raise DataFilesNotFoundError("No (supported) data files found" + (f" in {path}" if path else ""))
    598 return module_name, default_builder_kwargs

DataFilesNotFoundError: No (supported) data files found in open-llm-leaderboard/details_davidkim205__Rhea-72b-v0.5

In [3]: data = load_dataset("open-llm-leaderboard/details_davidkim205__Rhea-72b-v0.5", "harness_winogrande_5")
---------------------------------------------------------------------------
DataFilesNotFoundError                    Traceback (most recent call last)
Cell In[3], line 1
----> 1 data = load_dataset("open-llm-leaderboard/details_davidkim205__Rhea-72b-v0.5", "harness_winogrande_5")

File ~/open-llm-bda/venv/lib/python3.11/site-packages/datasets/load.py:2587, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
   2582 verification_mode = VerificationMode(
   2583     (verification_mode or VerificationMode.BASIC_CHECKS) if not save_infos else VerificationMode.ALL_CHECKS
   2584 )
   2586 # Create a dataset builder
-> 2587 builder_instance = load_dataset_builder(
   2588     path=path,
   2589     name=name,
   2590     data_dir=data_dir,
   2591     data_files=data_files,
   2592     cache_dir=cache_dir,
   2593     features=features,
   2594     download_config=download_config,
   2595     download_mode=download_mode,
   2596     revision=revision,
   2597     token=token,
   2598     storage_options=storage_options,
   2599     trust_remote_code=trust_remote_code,
   2600     _require_default_config_name=name is None,
   2601     **config_kwargs,
   2602 )
   2604 # Return iterable dataset in case of streaming
   2605 if streaming:

File ~/open-llm-bda/venv/lib/python3.11/site-packages/datasets/load.py:2259, in load_dataset_builder(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, token, use_auth_token, storage_options, trust_remote_code, _require_default_config_name, **config_kwargs)
   2257     download_config = download_config.copy() if download_config else DownloadConfig()
   2258     download_config.storage_options.update(storage_options)
-> 2259 dataset_module = dataset_module_factory(
   2260     path,
   2261     revision=revision,
   2262     download_config=download_config,
   2263     download_mode=download_mode,
   2264     data_dir=data_dir,
   2265     data_files=data_files,
   2266     cache_dir=cache_dir,
   2267     trust_remote_code=trust_remote_code,
   2268     _require_default_config_name=_require_default_config_name,
   2269     _require_custom_configs=bool(config_kwargs),
   2270 )
   2271 # Get dataset builder class from the processing script
   2272 builder_kwargs = dataset_module.builder_kwargs

File ~/open-llm-bda/venv/lib/python3.11/site-packages/datasets/load.py:1821, in dataset_module_factory(path, revision, download_config, download_mode, dynamic_modules_path, data_dir, data_files, cache_dir, trust_remote_code, _require_default_config_name, _require_custom_configs, **download_kwargs)
   1812     return LocalDatasetModuleFactoryWithScript(
   1813         combined_path,
   1814         download_mode=download_mode,
   1815         dynamic_modules_path=dynamic_modules_path,
   1816         trust_remote_code=trust_remote_code,
   1817     ).get_module()
   1818 elif os.path.isdir(path):
   1819     return LocalDatasetModuleFactoryWithoutScript(
   1820         path, data_dir=data_dir, data_files=data_files, download_mode=download_mode
-> 1821     ).get_module()
   1822 # Try remotely
   1823 elif is_relative_path(path) and path.count("/") <= 1:

File ~/open-llm-bda/venv/lib/python3.11/site-packages/datasets/load.py:1039, in LocalDatasetModuleFactoryWithoutScript.get_module(self)
   1033     patterns = get_data_patterns(base_path)
   1034 data_files = DataFilesDict.from_patterns(
   1035     patterns,
   1036     base_path=base_path,
   1037     allowed_extensions=ALL_ALLOWED_EXTENSIONS,
   1038 )
-> 1039 module_name, default_builder_kwargs = infer_module_for_data_files(
   1040     data_files=data_files,
   1041     path=self.path,
   1042 )
   1043 data_files = data_files.filter_extensions(_MODULE_TO_EXTENSIONS[module_name])
   1044 # Collect metadata files if the module supports them

File ~/open-llm-bda/venv/lib/python3.11/site-packages/datasets/load.py:597, in infer_module_for_data_files(data_files, path, download_config)
    595     raise ValueError(f"Couldn't infer the same data file format for all splits. Got {split_modules}")
    596 if not module_name:
--> 597     raise DataFilesNotFoundError("No (supported) data files found" + (f" in {path}" if path else ""))
    598 return module_name, default_builder_kwargs

DataFilesNotFoundError: No (supported) data files found in open-llm-leaderboard/details_davidkim205__Rhea-72b-v0.5
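
Both tracebacks pass through the `elif os.path.isdir(path)` branch of `dataset_module_factory` and then `LocalDatasetModuleFactoryWithoutScript`, before the "Try remotely" branch is reached. One possible reading, not confirmed anywhere in this thread, is that a local directory with the same relative path as the repo id exists in the working directory and is picked up instead of the Hub repo. The sketch below checks for that situation; `find_shadowing_dir` is a hypothetical helper, not part of `datasets`.

```python
import os
from typing import Optional


def find_shadowing_dir(repo_id: str, base_dir: Optional[str] = None) -> Optional[str]:
    """Return the local directory that matches `repo_id`, or None.

    Hypothetical helper (not part of `datasets`). It mirrors the
    `os.path.isdir(path)` check visible in the traceback: if a directory
    named like the repo id exists relative to `base_dir`, it is returned.
    """
    base_dir = base_dir if base_dir is not None else os.getcwd()
    candidate = os.path.join(base_dir, repo_id)
    return candidate if os.path.isdir(candidate) else None


if __name__ == "__main__":
    repo_id = "open-llm-leaderboard/details_davidkim205__Rhea-72b-v0.5"
    shadow = find_shadowing_dir(repo_id)
    if shadow:
        print(f"A local directory matches the repo id: {shadow}")
    else:
        print("No local directory with that name in the working directory.")
```

If the helper reports a match, running the snippet from a different working directory (or renaming the local directory) would be one way to test whether that directory is involved.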

Expected behavior

No exceptions from get_dataset_config_names or load_dataset

Environment info

  • datasets version: 2.19.0
  • Platform: Linux-6.5.0-1018-aws-aarch64-with-glibc2.35
  • Python version: 3.11.8
  • huggingface_hub version: 0.23.0
  • PyArrow version: 16.0.0
  • Pandas version: 2.2.2
  • fsspec version: 2024.3.1
@albertvillanova albertvillanova self-assigned this May 6, 2024
@albertvillanova
Member

Hi @jerome-white, thanks for reporting.

However, I cannot reproduce your issue:

>>> from datasets import get_dataset_config_names

>>> get_dataset_config_names("open-llm-leaderboard/details_davidkim205__Rhea-72b-v0.5")
['harness_arc_challenge_25',
 'harness_gsm8k_5',
 'harness_hellaswag_10',
 'harness_hendrycksTest_5',
 'harness_hendrycksTest_abstract_algebra_5',
 'harness_hendrycksTest_anatomy_5',
 'harness_hendrycksTest_astronomy_5',
 'harness_hendrycksTest_business_ethics_5',
 'harness_hendrycksTest_clinical_knowledge_5',
 'harness_hendrycksTest_college_biology_5',
 'harness_hendrycksTest_college_chemistry_5',
 'harness_hendrycksTest_college_computer_science_5',
 'harness_hendrycksTest_college_mathematics_5',
 'harness_hendrycksTest_college_medicine_5',
 'harness_hendrycksTest_college_physics_5',
 'harness_hendrycksTest_computer_security_5',
 'harness_hendrycksTest_conceptual_physics_5',
 'harness_hendrycksTest_econometrics_5',
 'harness_hendrycksTest_electrical_engineering_5',
 'harness_hendrycksTest_elementary_mathematics_5',
 'harness_hendrycksTest_formal_logic_5',
 'harness_hendrycksTest_global_facts_5',
 'harness_hendrycksTest_high_school_biology_5',
 'harness_hendrycksTest_high_school_chemistry_5',
 'harness_hendrycksTest_high_school_computer_science_5',
 'harness_hendrycksTest_high_school_european_history_5',
 'harness_hendrycksTest_high_school_geography_5',
 'harness_hendrycksTest_high_school_government_and_politics_5',
 'harness_hendrycksTest_high_school_macroeconomics_5',
 'harness_hendrycksTest_high_school_mathematics_5',
 'harness_hendrycksTest_high_school_microeconomics_5',
 'harness_hendrycksTest_high_school_physics_5',
 'harness_hendrycksTest_high_school_psychology_5',
 'harness_hendrycksTest_high_school_statistics_5',
 'harness_hendrycksTest_high_school_us_history_5',
 'harness_hendrycksTest_high_school_world_history_5',
 'harness_hendrycksTest_human_aging_5',
 'harness_hendrycksTest_human_sexuality_5',
 'harness_hendrycksTest_international_law_5',
 'harness_hendrycksTest_jurisprudence_5',
 'harness_hendrycksTest_logical_fallacies_5',
 'harness_hendrycksTest_machine_learning_5',
 'harness_hendrycksTest_management_5',
 'harness_hendrycksTest_marketing_5',
 'harness_hendrycksTest_medical_genetics_5',
 'harness_hendrycksTest_miscellaneous_5',
 'harness_hendrycksTest_moral_disputes_5',
 'harness_hendrycksTest_moral_scenarios_5',
 'harness_hendrycksTest_nutrition_5',
 'harness_hendrycksTest_philosophy_5',
 'harness_hendrycksTest_prehistory_5',
 'harness_hendrycksTest_professional_accounting_5',
 'harness_hendrycksTest_professional_law_5',
 'harness_hendrycksTest_professional_medicine_5',
 'harness_hendrycksTest_professional_psychology_5',
 'harness_hendrycksTest_public_relations_5',
 'harness_hendrycksTest_security_studies_5',
 'harness_hendrycksTest_sociology_5',
 'harness_hendrycksTest_us_foreign_policy_5',
 'harness_hendrycksTest_virology_5',
 'harness_hendrycksTest_world_religions_5',
 'harness_truthfulqa_mc_0',
 'harness_winogrande_5',
 'results']

Maybe it was just a temporary issue...

@jerome-white
Author

Maybe it was just a temporary issue...

Perhaps. I've changed my workflow to use the hub's HfFileSystem, so for now this is no longer a blocker for me. I'll reopen the issue if that changes.
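
The HfFileSystem-based workflow mentioned above is not shown in the thread. As a rough sketch of what such a workflow might look like, the helpers below list data files in a dataset repo through `huggingface_hub.HfFileSystem`. The helper names and the assumption that the files of interest end in `.parquet` are illustrative only; the `datasets/<repo_id>/...` path prefix is how `HfFileSystem` addresses dataset repos.

```python
from typing import Any, List


def dataset_glob_pattern(repo_id: str, suffix: str = "parquet") -> str:
    """Build a glob pattern for files with `suffix` in a dataset repo.

    HfFileSystem addresses dataset repos under the "datasets/" prefix.
    The default suffix is an assumption about the repo layout.
    """
    return f"datasets/{repo_id}/**/*.{suffix}"


def list_data_files(fs: Any, repo_id: str, suffix: str = "parquet") -> List[str]:
    """Return the sorted paths matching the pattern on the given filesystem."""
    return sorted(fs.glob(dataset_glob_pattern(repo_id, suffix)))


def main() -> None:
    # Imported here so the helpers above stay importable without huggingface_hub.
    from huggingface_hub import HfFileSystem

    fs = HfFileSystem()
    repo_id = "open-llm-leaderboard/details_davidkim205__Rhea-72b-v0.5"
    for path in list_data_files(fs, repo_id):
        print(path)
```

Calling `main()` requires network access and an installed `huggingface_hub`. The returned paths can then be opened with `fs.open(path)` and read with a Parquet reader of choice.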
