
Improve performance of JSON loader #6867

Closed
albertvillanova opened this issue May 4, 2024 · 5 comments · Fixed by #6874
@albertvillanova (Member)

As reported by @natolambert, loading regular JSON files with datasets shows poor performance.

The cause is that we use the json Python standard library instead of other faster libraries. See my old comment: #2638 (review)

There are benchmarks comparing different JSON packages, and the standard-library json is among the worst performers.

I remember a previous discussion about this, where it was decided not to add a dependency on a third-party library.

However:

  • We already depend on pandas, and pandas depends on ujson: so we already have an indirect dependency on ujson
  • Even if that were not the case, we could always include ujson as an optional extra dependency and check at runtime whether it is installed, to decide which library to use: json or ujson
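The optional-dependency approach in the second bullet could be sketched like this (a minimal illustration, not the actual datasets implementation; the names `json_impl` and `load_json` are hypothetical):

```python
# Prefer ujson when it is installed, fall back to the standard library.
try:
    import ujson as json_impl  # optional extra dependency
except ImportError:
    import json as json_impl


def load_json(path):
    """Parse a JSON file with the fastest available backend."""
    with open(path, "r", encoding="utf-8") as f:
        return json_impl.loads(f.read())
```

Both libraries expose a compatible `loads`, so callers never need to know which backend was picked.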
@albertvillanova albertvillanova added the enhancement New feature or request label May 4, 2024
@albertvillanova albertvillanova self-assigned this May 4, 2024
@natolambert

Thanks! Feel free to ping me for examples. I may not respond immediately because we're all busy, but I'd like to help.

@albertvillanova (Member, Author) commented May 10, 2024

Hi @natolambert, could you please give some examples of JSON files to benchmark?

Please note that this JSON file (https://huggingface.co/datasets/allenai/reward-bench-results/blob/main/eval-set-scores/Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback.json) is not in "records" orient; instead it has the following structure:

{
  "chat_template": "tulu",
  "id": [30, 34, 35,...],
  "model": "Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback",
  "model_type": "Seq. Classifier",
  "results": [1, 1, 1, ...],
  "scores_chosen": [4.421875, 1.8916015625, 3.8515625,...],
  "scores_rejected": [-2.416015625, -1.47265625, -0.9912109375,...],
  "subset": ["alpacaeval-easy", "alpacaeval-easy", "alpacaeval-easy",...],
  "text_chosen": ["<s>[INST] How do I detail a...",...],
  "text_rejected": ["<s>[INST] How do I detail a...",...]
}

Note that "records" orient should be a list (not a dict) with each row as one item of the list:

[
  {"chat_template": "tulu", "id": 30,... },
  {"chat_template": "tulu", "id": 34,... },
  ...
]
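A column-oriented dict like the one above (lists as columns, with scalar fields repeated on every row) can be turned into "records" orient with a few lines of plain Python. This is a hypothetical helper for illustration, not part of datasets:

```python
def columns_to_records(obj):
    """Convert a dict of columns (lists plus broadcast scalars)
    into a list of row dicts, i.e. "records" orient."""
    # Row count comes from the list-valued fields.
    n_rows = max(len(v) for v in obj.values() if isinstance(v, list))
    return [
        {k: (v[i] if isinstance(v, list) else v)  # broadcast scalars
         for k, v in obj.items()}
        for i in range(n_rows)
    ]


data = {"chat_template": "tulu", "id": [30, 34], "results": [1, 1]}
records = columns_to_records(data)
```

The resulting list can then be serialized with `json.dump` to get a proper records-orient file.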

@natolambert commented May 13, 2024

We use a mix (which is a mess); here's an example with the "records" orient:
https://huggingface.co/datasets/allenai/reward-bench-results/blob/main/best-of-n/alpaca_eval/tulu-13b/OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5.json

There are more in that folder, ~40mb maybe?

@natolambert

@albertvillanova here's a snippet so you don't need to click

{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        0
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 3.076171875
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        1
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 3.87890625
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        2
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 3.287109375
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        3
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 1.6337890625
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        4
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 5.27734375
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        5
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 3.0625
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        6
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 2.29296875
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        7
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 6.77734375
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        8
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 3.853515625
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        9
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 4.86328125
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        10
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 2.890625
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        11
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 4.70703125
}
{
    "config": "top_p=0.9;temp=1.0",
    "dataset_details": "helpful_base",
    "id": [
        0,
        12
    ],
    "model": "allenai/tulu-2-dpo-13b",
    "scores": 4.45703125
}

@albertvillanova (Member, Author) commented May 14, 2024

Thanks again for your feedback, @natolambert.

However, strictly speaking, the last file is not in JSON format but in a JSON-Lines-like format (though not strictly that either, because there are multiple newline characters within each object). Not even pandas can read that file format.

Anyway, for proper JSON Lines files, I would expect datasets and pandas to have the same performance, as both use pyarrow under the hood...

A proper JSON file in "records" orient should be a list (a JSON array): the first character should be "[".

Anyway, I am generating a JSON file from your JSON-Lines file to test performance.
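One way to do that conversion (a minimal sketch, not part of datasets; the helper name is hypothetical) is to decode the concatenated, pretty-printed objects one at a time with the standard library's `JSONDecoder.raw_decode`, which tolerates trailing text after each object:

```python
import json


def concatenated_json_to_array(text):
    """Parse a stream of concatenated (possibly pretty-printed) JSON
    objects and return them as one list, i.e. "records" orient."""
    decoder = json.JSONDecoder()
    records, idx = [], 0
    while idx < len(text):
        # Skip whitespace between objects.
        while idx < len(text) and text[idx].isspace():
            idx += 1
        if idx >= len(text):
            break
        obj, end = decoder.raw_decode(text, idx)
        records.append(obj)
        idx = end
    return records
```

Dumping the returned list with `json.dump` yields a valid records-orient JSON array whose first character is "[".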
