Initial implementation of DaftDataFrameEngine #3457

jaychia · 2023-07-06T23:11:29Z

Code Pull Requests

This PR introduces the DaftDataFrameEngine, which is a DataFrameEngine implementation that is backed by Daft

This has several advantages:

Daft is easy to use locally as well as remotely
Daft is fast and provides native kernels such as those for URLs, images and tensors
Daft runs natively on Ray
Daft has a simpler API than the current distributed dataframe abstraction (Dask) because it has no concept of indexing, which complicates many operations in Ludwig such as joins etc

arnavgarg1 · 2023-07-07T00:36:46Z

Nice! This looks like a great starting point - let me do the work we discussed on my end and (1) cleanup unused functions and (2) move pandas/python specific logic behind DF engine functions where possible

jaychia · 2023-07-07T00:48:55Z

Nice! This looks like a great starting point - let me do the work we discussed on my end and (1) cleanup unused functions and (2) move pandas/python specific logic behind DF engine functions where possible

I added a shim class on top of the Daft Dataframe here to ease the transition!

These shim classes are the "external facing" classes of the DaftDataEngine, rather than just a naked daft.DataFrame and daft.Expression object.

The shim classes:

DaftShimDataframe: primarily enables the __setitem__ method to support operations in Ludwig that make strong assumptions about the underlying Dataframe abstraction supporting the df["foo"] = series syntax
DaftShimSeries: the Daft "series" is actually just the daft.Expression, which is "anonymous" and only takes on an identity ("this is the Series I refer to") when it is applied on a Dataframe. Some places in the ludwig code run functions directly on Series objects and have strong assumptions about the Series object supporting a Pandas API (e.g. series.astype(np.int32), scalar = series.map_partitions(func)), so this could be a good layer to help us with that abstraction, or at least to help us eventually migrate off places where that happens.

github-actions · 2023-07-07T02:23:26Z

Unit Test Results

  4 files ±  0 4 suites ±0 17s ⏱️ - 55m 38s
  6 tests - 28 0 ✔️ - 29   5 💤 ±0 0 ❌ ±0 1 🔥 +1
18 runs - 50 0 ✔️ - 58 14 💤 +4 0 ❌ ±0 4 🔥 +4

For more details on these errors, see this check.

Results for commit 2a8e399. ± Comparison against base commit 60f1416.

This pull request removes 33 and adds 5 tests. Note that renamed tests count towards both.

tests.integration_tests.test_cli ‑ test_reproducible_cli_runs[horovod-experiment-1919-0]
tests.integration_tests.test_cli ‑ test_reproducible_cli_runs[horovod-experiment-1919-1]
tests.integration_tests.test_cli ‑ test_reproducible_cli_runs[horovod-experiment-31-0]
tests.integration_tests.test_cli ‑ test_reproducible_cli_runs[horovod-experiment-31-1]
tests.integration_tests.test_cli ‑ test_reproducible_cli_runs[horovod-train-1919-0]
tests.integration_tests.test_cli ‑ test_reproducible_cli_runs[horovod-train-1919-1]
tests.integration_tests.test_cli ‑ test_reproducible_cli_runs[horovod-train-31-0]
tests.integration_tests.test_cli ‑ test_reproducible_cli_runs[horovod-train-31-1]
tests.integration_tests.test_cli ‑ test_train_cli_horovod
tests.integration_tests.test_experiment ‑ test_experiment_model_resume_distributed[horovod]
…

tests.ludwig.automl.test_base_config
tests.ludwig.automl.test_utils
tests.ludwig.backend.test_ray
tests.ludwig.data.dataframe.test_daft
tests.ludwig.data.test_ray_data

This pull request removes 4 skipped tests and adds 4 skipped tests. Note that renamed tests count towards both.

tests.integration_tests.test_horovod ‑ test_horovod_gpu_memory_limit
tests.regression_tests.benchmark.test_model_performance ‑ test_performance[ames_housing.ecd.yaml]
tests.regression_tests.benchmark.test_model_performance ‑ test_performance[mercedes_benz_greener.ecd.yaml]
tests.regression_tests.benchmark.test_model_performance ‑ test_performance[sarcos.ecd.yaml]

tests.ludwig.automl.test_base_config
tests.ludwig.automl.test_utils
tests.ludwig.backend.test_ray
tests.ludwig.data.test_ray_data

♻️ This comment has been updated with latest results.

jaychia · 2023-07-07T04:18:56Z

Quick question @arnavgarg1 - do you know how .map_partitions and .map_batches (on the DataFrameEngines) are different?

Based on the function signature/naming it seems like the intention is for .map_partitions to run on Series, and .map_batches to run on Dataframes. However, it seems in practice they’ve been used fairly interchangeably, and surprisingly without any bugs?

Dask implementation of map_batches says it runs on a Series, but the Pandas and Modin ones say it runs on Dataframes
Naming of map_partitions on the Dask, Pandas and Modin implementations seem to agree that it takes in a Series, but in practice it is actually used on Dataframes sometimes.
- In prompt.py
- In split.py
- In ray.py

It appears they are both used fairly interchangeably. I wonder if y'all wanted to maybe consolidate them?

Ideally also if they don't run on arbitrary Pandas dictionaries, and instead users of the API must specify the names of the columns that they need to map over, it would make query optimization more effective! A black-box "map this function over the entire dataframe" leaves Daft with no choice but to think that your query requires every single column, even if in practice your Python function only uses one specific column.

Jay Chia added 3 commits July 6, 2023 16:08

Initial work without map functionality

d006033

Cleanup unneeded variables in file

9c7aea4

Add todos for implementation as NotImplementedError

3e8b97d

arnavgarg1 self-requested a review July 7, 2023 00:35

Add shim classes for Daft dataframes and series

3434256

Skip reduce_objects implementation

630e736

Jay Chia added 3 commits July 6, 2023 21:26

Cleanup code, left TODO for implementations of UDFs

e5f38a0

Rename to LudwigDaftDataframe/Series

5a28b40

Clean up typing of function signatures

36db991

jaychia force-pushed the jay/daft-dataframe-engine branch from 69810f9 to 36db991 Compare July 7, 2023 17:46

Add unit test

2a8e399

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Initial implementation of DaftDataFrameEngine #3457

Initial implementation of DaftDataFrameEngine #3457

jaychia commented Jul 6, 2023 •

edited

arnavgarg1 commented Jul 7, 2023

jaychia commented Jul 7, 2023 •

edited

github-actions bot commented Jul 7, 2023 •

edited

jaychia commented Jul 7, 2023

Initial implementation of DaftDataFrameEngine #3457

Are you sure you want to change the base?

Initial implementation of DaftDataFrameEngine #3457

Conversation

jaychia commented Jul 6, 2023 • edited

Code Pull Requests

arnavgarg1 commented Jul 7, 2023

jaychia commented Jul 7, 2023 • edited

github-actions bot commented Jul 7, 2023 • edited

Unit Test Results

jaychia commented Jul 7, 2023

jaychia commented Jul 6, 2023 •

edited

jaychia commented Jul 7, 2023 •

edited

github-actions bot commented Jul 7, 2023 •

edited