[Python] TypeError: Object of type int64 is not JSON serializable when converting pandas to arrow table #41625

Closed
djouallah opened this issue May 12, 2024 · 4 comments
Labels
Component: Python Type: usage Issue is a user question

Comments

@djouallah

Describe the bug, including details regarding any error messages, version, and platform.

I am getting an error when converting a pandas DataFrame to an Arrow table. I added a reproducible example here:

https://colab.research.google.com/drive/1uPOv8qyj5xW4XfrnkLtZtYIaaQfhupjG#scrollTo=7USg9dd-1ivc

Component(s)

Python

@jorisvandenbossche jorisvandenbossche changed the title TypeError: Object of type int64 is not JSON serializable when converting pandas to arrow table [Python] TypeError: Object of type int64 is not JSON serializable when converting pandas to arrow table May 14, 2024
@AlenkaF
Member

AlenkaF commented May 20, 2024

Hi, thank you for opening an issue @djouallah!

I was able to reproduce this in my dev environment. For next time, it will be much easier to help if you provide a simple reproducible example. The Google Colab notebook you linked has lots (lots!) of code unrelated to the issue, and I was very reluctant at first to download files and manipulate them; I only did so after taking the time to check the source and all the code.
Also, the chance of actually getting an answer on your issue will be higher with a simple example ;)

Here is one I created that shows the issue:

>>> import numpy as np
>>> import pandas as pd
>>> import pyarrow as pa

>>> data = {'UNIT': ["DUNIT", "DUNIT", "DUNIT", "DUNIT"],
...         'version': [1, 1, 3, 3]}
>>> df = pd.DataFrame(data)
>>> df.index = df['version']
>>> df.columns.name = np.int64(142564)  # the issue is here: a numpy int64 as the column index name
>>> df
142564    UNIT  version
version                
1        DUNIT        1
1        DUNIT        1
3        DUNIT        3
3        DUNIT        3

>>> pa.Table.from_pandas(df)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 4559, in pyarrow.lib.Table.from_pandas
    arrays, schema, n_rows = dataframe_to_arrays(
  File "/Users/alenkafrim/repos/arrow/python/pyarrow/pandas_compat.py", line 635, in dataframe_to_arrays
    pandas_metadata = construct_metadata(
                      ^^^^^^^^^^^^^^^^^^^
  File "/Users/alenkafrim/repos/arrow/python/pyarrow/pandas_compat.py", line 257, in construct_metadata
    b'pandas': json.dumps({
               ^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.7_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.7_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/json/encoder.py", line 200, in encode
    chunks = self.iterencode(o, _one_shot=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.7_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/json/encoder.py", line 258, in iterencode
    return _iterencode(o, 0)
           ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.7_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/json/encoder.py", line 180, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type int64 is not JSON serializable

The code works if I remove the column index name:

>>> df.columns.name = None
>>> pa.Table.from_pandas(df)
pyarrow.Table
UNIT: string
version: int64
__index_level_0__: int64
----
UNIT: [["DUNIT","DUNIT","DUNIT","DUNIT"]]
version: [[1,1,3,3]]
__index_level_0__: [[1,1,3,3]]

It would also have worked if a Python int had been used instead of numpy.int64.
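To illustrate why the type matters, here is a minimal sketch that only uses the standard json module and NumPy, with no pyarrow involved. The helper name is_json_serializable is made up for this example. It relies on the fact that the default JSON encoder accepts a Python int but has no handler for numpy.int64, which is the same TypeError that shows up at the bottom of the traceback above.

```python
import json

import numpy as np


def is_json_serializable(value):
    """Return True if the default json encoder accepts `value`.

    Hypothetical helper for illustration only; it is not part of pyarrow.
    """
    try:
        json.dumps(value)
    except TypeError:
        return False
    return True


# A plain Python int is encoded without problems.
print(is_json_serializable(142564))

# A NumPy scalar is not a subclass of int, so the default encoder rejects it.
print(is_json_serializable(np.int64(142564)))
```

The first call prints True and the second prints False, which matches the behaviour seen when the column index name is a Python int versus a numpy.int64.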

@AlenkaF
Member

AlenkaF commented May 20, 2024

The conclusion is that the issue can be fixed by changing the column index name or removing it completely.

I do not think checking for numpy types in the column index names and converting them to a Python type is something that would fit here. As I do not think this is a bug, I will change the label type to usage.
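Since the conversion would then have to happen on the user side, one possible approach is sketched below. The helper to_native_name is hypothetical (it is not a pyarrow or pandas function) and it assumes the name is either a NumPy scalar, which exposes .item(), or already a plain Python object.

```python
import json

import numpy as np


def to_native_name(name):
    """Convert a NumPy scalar to its native Python equivalent.

    Hypothetical user-side helper, not a pyarrow or pandas API.
    Anything that is not a NumPy scalar is returned unchanged.
    """
    if isinstance(name, np.generic):
        # .item() returns the matching built-in type, e.g. int for np.int64.
        return name.item()
    return name


native = to_native_name(np.int64(142564))
print(type(native).__name__)  # the built-in type name, not a NumPy one
print(json.dumps(native))     # now accepted by the default json encoder
```

With the example from my previous comment, this would be applied as df.columns.name = to_native_name(df.columns.name) before calling pa.Table.from_pandas(df). Setting df.columns.name = None remains the simpler option if the name is not needed.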

cc @jorisvandenbossche

@AlenkaF AlenkaF added Type: usage Issue is a user question and removed Type: bug labels May 20, 2024
@djouallah
Author

@AlenkaF thanks a lot

@AlenkaF
Member

AlenkaF commented May 20, 2024

Closing. Feel free to reopen if there are any further questions!

@AlenkaF AlenkaF closed this as completed May 20, 2024
2 participants