
Unexpected keyword argument 'hf' when downloading CSV dataset from S3 #6598

Open

dguenms opened this issue Jan 16, 2024 · 7 comments · May be fixed by #6862


dguenms commented Jan 16, 2024

Describe the bug

I receive this error message when using load_dataset with a "csv" path and data_files="s3://...":

TypeError: Session.__init__() got an unexpected keyword argument 'hf'

I found a similar issue here: https://stackoverflow.com/questions/77596258/aws-issue-load-dataset-from-s3-fails-with-unexpected-keyword-argument-error-in

Full stacktrace:

.../site-packages/datasets/load.py:2549: in load_dataset
    builder_instance.download_and_prepare(
.../site-packages/datasets/builder.py:1005: in download_and_prepare
    self._download_and_prepare(
.../site-packages/datasets/builder.py:1078: in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
.../site-packages/datasets/packaged_modules/csv/csv.py:147: in _split_generators
    data_files = dl_manager.download_and_extract(self.config.data_files)
.../site-packages/datasets/download/download_manager.py:562: in download_and_extract
    return self.extract(self.download(url_or_urls))
.../site-packages/datasets/download/download_manager.py:426: in download
    downloaded_path_or_paths = map_nested(
.../site-packages/datasets/utils/py_utils.py:466: in map_nested
    mapped = [
.../site-packages/datasets/utils/py_utils.py:467: in <listcomp>
    _single_map_nested((function, obj, types, None, True, None))
.../site-packages/datasets/utils/py_utils.py:387: in _single_map_nested
    mapped = [_single_map_nested((function, v, types, None, True, None)) for v in pbar]
.../site-packages/datasets/utils/py_utils.py:387: in <listcomp>
    mapped = [_single_map_nested((function, v, types, None, True, None)) for v in pbar]
.../site-packages/datasets/utils/py_utils.py:370: in _single_map_nested
    return function(data_struct)
.../site-packages/datasets/download/download_manager.py:451: in _download
    out = cached_path(url_or_filename, download_config=download_config)
.../site-packages/datasets/utils/file_utils.py:188: in cached_path
    output_path = get_from_cache(
...1/site-packages/datasets/utils/file_utils.py:511: in get_from_cache
    response = fsspec_head(url, storage_options=storage_options)
.../site-packages/datasets/utils/file_utils.py:316: in fsspec_head
    fs, _, paths = fsspec.get_fs_token_paths(url, storage_options=storage_options)
.../site-packages/fsspec/core.py:622: in get_fs_token_paths
    fs = filesystem(protocol, **inkwargs)
.../site-packages/fsspec/registry.py:290: in filesystem
    return cls(**storage_options)
.../site-packages/fsspec/spec.py:79: in __call__
    obj = super().__call__(*args, **kwargs)
.../site-packages/s3fs/core.py:187: in __init__
    self.s3 = self.connect()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <s3fs.core.S3FileSystem object at 0x1500a1310>, refresh = True

    def connect(self, refresh=True):
        """
        Establish S3 connection object.
    
        Parameters
        ----------
        refresh : bool
            Whether to create new session/client, even if a previous one with
            the same parameters already exists. If False (default), an
            existing one will be used if possible
        """
        if refresh is False:
            # back compat: we store whole FS instance now
            return self.s3
        anon, key, secret, kwargs, ckwargs, token, ssl = (
            self.anon, self.key, self.secret, self.kwargs,
            self.client_kwargs, self.token, self.use_ssl)
    
        if not self.passed_in_session:
>           self.session = botocore.session.Session(**self.kwargs)
E           TypeError: Session.__init__() got an unexpected keyword argument 'hf'

Steps to reproduce the bug

  1. Assuming a valid CSV file located at s3://bucket/data.csv
  2. Run the code below:

from datasets import load_dataset

storage_options = {
    "key": "...",
    "secret": "...",
    "client_kwargs": {
        "endpoint_url": "...",
    },
}

load_dataset("csv", data_files="s3://bucket/data.csv", storage_options=storage_options)

Encountered in version 2.16.1 but also reproduced in 2.16.0 and 2.15.0.

Note: I encountered this in a unit test using a moto mock for S3; however, since the error occurs before the session is instantiated, the mock should not be the cause.

Expected behavior

No exception is raised, the boto3 session is created successfully, and the CSV file is downloaded successfully and returned as a dataset.

===

After some research I found that DownloadConfig has a __post_init__ method that always forces an "hf" entry into its storage_options, even though for an S3 location the storage options are passed on to the S3 session, which does not accept this parameter. I assume this parameter is needed when reading from the Hugging Face Hub and should not be set in this context.
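A minimal sketch of the behavior described above (placeholder credentials; if the diagnosis is right, this prints True):

from datasets import DownloadConfig

config = DownloadConfig(storage_options={"key": "...", "secret": "..."})
# __post_init__ has already run at this point, so per the report an "hf"
# entry has been injected into storage_options regardless of the target filesystem
print("hf" in config.storage_options)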

Unfortunately there is nothing the user can do to work around it. Even if you manually do something like:

from datasets import DownloadConfig, load_dataset

download_config = DownloadConfig()
del download_config.storage_options["hf"]  # manually strip the offending key
load_dataset("csv", data_files="s3://bucket/data.csv", download_config=download_config)

the library still reinserts this parameter when download_config = self.download_config.copy() runs at line 418 of download_manager.py (DownloadManager.download).

Therefore load_dataset currently cannot be used to read a dataset in CSV format from an S3 location.

Environment info

  • datasets version: 2.16.1
  • Platform: macOS-14.2.1-arm64-arm-64bit
  • Python version: 3.11.7
  • huggingface_hub version: 0.20.2
  • PyArrow version: 14.0.2
  • Pandas version: 2.1.4
  • fsspec version: 2023.10.0
@deepbot86

I am facing a similar issue while reading a CSV file from S3. Wondering if somebody has found a workaround.

@ChenchaoZhao

The same thing happens with other formats like Parquet.

@SinaTavakoli

I am facing a similar issue while reading a Parquet file from S3. I tried every version between 2.14 and 2.16.1, but it doesn't work.

@pandaczm

pandaczm commented Feb 5, 2024

Redefining DownloadConfig might work:

import copy
import warnings
from datasets import DownloadConfig

class ReviseDownloadConfig(DownloadConfig):
    def __post_init__(self, use_auth_token):
        # Same as the parent's __post_init__, minus the step that injects
        # the "hf" entry into storage_options.
        if use_auth_token != "deprecated":
            warnings.warn(
                "'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.\n"
                f"You can remove this warning by passing 'token={use_auth_token}' instead.",
                FutureWarning,
            )
            self.token = use_auth_token

    def copy(self):
        # Reconstructing via __class__ runs the overridden __post_init__ above,
        # so "hf" is not reinserted on copy.
        return self.__class__(**{k: copy.deepcopy(v) for k, v in self.__dict__.items()})

downloadconfig = ReviseDownloadConfig()
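Presumably the config is then passed to load_dataset together with the S3 credentials; a sketch (bucket path and storage_options dict as in the report above):

from datasets import load_dataset

load_dataset(
    "csv",
    data_files="s3://bucket/data.csv",
    storage_options=storage_options,  # same dict as in the original report
    download_config=downloadconfig,
)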

@charlescearl

Redefining DownloadConfig as in the comment above seemed to work for me.

@ChenchaoZhao

Another workaround: use pandas to read the file from S3, then convert it to a Dataset.
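A minimal sketch of that approach (bucket path and credentials are placeholders); pandas hands storage_options straight to s3fs, so no "hf" key is injected:

import pandas as pd
from datasets import Dataset

storage_options = {
    "key": "...",
    "secret": "...",
    "client_kwargs": {"endpoint_url": "..."},
}

# pandas reads s3:// paths via s3fs and passes storage_options through untouched
df = pd.read_csv("s3://bucket/data.csv", storage_options=storage_options)
ds = Dataset.from_pandas(df)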

@psorianom

psorianom commented Apr 24, 2024

I am currently facing the same issue while using a custom loading script with files located in a remote S3 instance. I was using the download_custom functionality, but it is now deprecated, with a message saying I should use the native S3 loading instead, which is not working.

As stated before, the library forces the existence of an "hf" key in the storage_options variable, which is not accepted by s3fs:

.../site-packages/s3fs/core.py", line 516, in set_session
    self.session = aiobotocore.session.AioSession(**self.kwargs)
TypeError: __init__() got an unexpected keyword argument 'hf'.

Meanwhile, if my storage_options variable stays as:

{'key': '...',
 'secret': '...',
 'client_kwargs': {'endpoint_url': '...'}}

it works alright.
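For illustration, the same options passed directly to s3fs (outside datasets) are accepted without error; a sketch with placeholder credentials and a hypothetical bucket name:

import s3fs

fs = s3fs.S3FileSystem(
    key="...",
    secret="...",
    client_kwargs={"endpoint_url": "..."},
)
print(fs.ls("bucket"))  # listing the bucket just to confirm the connection works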
