[BUG] an error in get_dataset_size about args.shards #287

Open
ChrisZhangyu opened this issue Jan 11, 2024 · 0 comments
Labels
bug Something isn't working


In the get_dataset_size function, we want to get the directory path of the data files:

shards_list = list(braceexpand.braceexpand(shards))
dir_path = os.path.dirname(shards[0])

But shards comes from args, where it is defined as a string:

parser.add_argument(
    "--mmc4_shards",
    type=str,
    default=mmc4_data_path,
    help="path to c4 shards, this should be a glob pattern such as /path/to/shards/shard-{0000..0999}.tar",
)

Because shards is a string, shards[0] is only its first character, so os.path.dirname(shards[0]) cannot return the directory path (e.g. /path/to/shards).
Should it be changed to shards_list[0], like this:

shards_list = list(braceexpand.braceexpand(shards))
dir_path = os.path.dirname(shards_list[0])
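
To illustrate the difference, here is a minimal sketch using only the standard library. The pattern and the hand-written shards_list are hypothetical: the list stands in for what list(braceexpand.braceexpand(shards)) would produce for that pattern, so the snippet runs without braceexpand installed. It is not the project's actual code.

```python
import os

# Hypothetical pattern in the style of the --mmc4_shards help text.
shards = "/path/to/shards/shard-{0000..0002}.tar"

# Stand-in for list(braceexpand.braceexpand(shards)), written out by hand
# so the sketch runs without the braceexpand package installed.
shards_list = [
    "/path/to/shards/shard-0000.tar",
    "/path/to/shards/shard-0001.tar",
    "/path/to/shards/shard-0002.tar",
]

# Indexing the string gives its first character, not the first shard.
buggy_dir = os.path.dirname(shards[0])

# Indexing the expanded list gives the first shard path.
fixed_dir = os.path.dirname(shards_list[0])

print(repr(buggy_dir))  # '/'
print(repr(fixed_dir))  # '/path/to/shards'
```

With an absolute pattern the buggy version silently returns the filesystem root rather than raising an error, which may be why the problem is easy to miss.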