the error loading large parquet file #300

Open
peiliu0408 opened this issue Nov 10, 2023 · 13 comments

@peiliu0408

peiliu0408 commented Nov 10, 2023

As mentioned here, I failed to load the LA_RLHF.parquet file (about 22 GB), which was downloaded from the shared OneDrive.

Error msg: pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2368257792.

Is there a special way or Python package required to load this large (image base64) parquet file?
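
For reference (an editorial sketch, not part of the original report), one way to check whether the download is truncated or contains an oversized row group is to inspect the file's metadata with pyarrow; the path below is a placeholder. A binary column chunk approaching the ~2 GiB limit in the error above is one way to trigger pyarrow.lib.ArrowCapacityError.

import pyarrow.parquet as pq

# Open the file lazily and read only the footer metadata; no column data is loaded.
parquet_file = pq.ParquetFile("LA_RLHF.parquet")  # placeholder path
meta = parquet_file.metadata
print("row groups:", meta.num_row_groups, "total rows:", meta.num_rows)

# Print the size of each row group; a group near or above ~2 GiB is suspect.
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    print(f"row group {i}: {rg.num_rows} rows, {rg.total_byte_size / 1e9:.2f} GB")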

@Luodian
Owner

Luodian commented Nov 10, 2023

Ohh, I am not sure why this happens. My pandas version is 2.1.2, and I am using pandas.read_parquet to open the parquet files.
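
For reference, the single-shot load described here is presumably along these lines (the path is a placeholder); on a file whose base64 column exceeds roughly 2 GiB in a single chunk, this call can raise the ArrowCapacityError reported above.

import pandas as pd

# Load the whole parquet into one DataFrame in a single call.
df = pd.read_parquet("LA_RLHF.parquet", engine="pyarrow")  # placeholder path
print(df.shape)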

@peiliu0408
Author

The same error message shows up again. I have tried on Ubuntu 20.04 and get the same error.

[Screenshot 2023-11-10 16:27:31] I am wondering whether something went wrong during the download.

@peiliu0408
Author

peiliu0408 commented Nov 10, 2023

Or, can you give a LLaVA-RLHF dataset download link that is consistent with your OneDrive file? Then I could just create a new image parquet in the same format.

@Luodian
Owner

Luodian commented Nov 10, 2023

Can you try with LLAVAR or LRV? I uploaded them as well.

@Luodian
Owner

Luodian commented Nov 10, 2023

LLAVA-RLHF seems correct on my side. I will update it, but the previous version should be correct. It's weird.

@peiliu0408
Author

thanks a lot.

@peiliu0408
Author

Can you try with LLAVAR or LRV? I uploaded them as well.

These two parquet files can be opened correctly.

@peiliu0408
Author

LLAVA-RLHF seems correct at my side. I will update it but the previous version should be correct. It's weird.

I am sure the LLaVA-RLHF file shared on OneDrive is damaged, while all the rest can be loaded correctly.

@Luodian
Owner

Luodian commented Nov 13, 2023

I am uploading an updated LA_RLHF.parquet file from our server (this one is supposedly working correctly for our runs) to OneDrive. It may take a few hours, so stay tuned, maybe tomorrow. Thanks!

@peiliu0408
Author

I am uploading an updated LA_RLHF.parquet file from our server (this one is supposedly working correctly for our runs) to OneDrive. It may take a few hours, so stay tuned, maybe tomorrow. Thanks!

thanks a lot.

@311dada

311dada commented Nov 24, 2023

The same error message shows up again. I have tried on Ubuntu 20.04 and get the same error.

[Screenshot 2023-11-10 16:27:31] I am wondering whether something went wrong during the download.

Still not correct

@tensorboy

You need to use Dask:


import dask.dataframe as dd
import json
import pandas as pd

# Load the JSON data
json_file_path = "LA.json"
with open(json_file_path, "r") as f:
    data_dict = json.load(f)

# Convert the dictionary to a Dask DataFrame
ddf = dd.from_pandas(pd.DataFrame.from_dict(data_dict, orient="index", columns=["base64"]), npartitions=10)

# Convert to Parquet
parquet_file_path = 'LA.parquet'
ddf.to_parquet(parquet_file_path, engine="pyarrow")


ddf = dd.read_parquet(parquet_file_path, engine="pyarrow")
search_value = 'LA_IMG_000000377944'
filtered_ddf = ddf.loc[search_value].compute()

This solved the problem.
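
(Presumably this helps because Dask reads and writes the Parquet data in multiple partitions, so no single Arrow array ever has to hold the entire base64 column, keeping each piece under the ~2 GiB per-array limit from the original error.)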

@Luodian
Owner

Luodian commented Dec 10, 2023

As mentioned here, I failed to load the LA_RLHF.parquet file (about 22 GB), which was downloaded from the shared OneDrive.

Error msg: pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2368257792.

Is there a special way or Python package required to load this large (image base64) parquet file?

You can check the current code to see if it helps; we changed it to load the parquet iteratively.

# pq here is pyarrow.parquet (import pyarrow.parquet as pq)
parquet_file = pq.ParquetFile(cur_images_path)
dfs = []  # List to hold the DataFrames of each batch
for batch in parquet_file.iter_batches(batch_size=1000):  # Adjust batch_size as needed
    batch_df = batch.to_pandas()
    dfs.append(batch_df)
cur_df = pd.concat(dfs, ignore_index=True)  # Concatenate all DataFrames
self.images.append(cur_df)
loaded_images_path.add(cur_images_path)
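
(Reading in batches like this keeps each Arrow record batch far below the ~2 GiB per-array limit from the original error, which is presumably why the iterative version works where the single-shot load failed.)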

Previously, on both my 2×A100 and 8×A100 instances, I could directly load >100 GB parquet files. But it's weird that I can't do it on another 8×A100-40G instance...
