Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial implementation of DaftDataFrameEngine #3457

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

jaychia
Copy link
Contributor

@jaychia jaychia commented Jul 6, 2023

Code Pull Requests

This PR introduces the DaftDataFrameEngine, which is a DataFrameEngine implementation that is backed by Daft

This has several advantages:

  1. Daft is easy to use locally as well as remotely
  2. Daft is fast and provides native kernels such as those for URLs, images and tensors
  3. Daft runs natively on Ray
  4. Daft has a simpler API than the current distributed dataframe abstraction (Dask) because it has no concept of indexing, which complicates many operations in Ludwig such as joins etc

@arnavgarg1 arnavgarg1 self-requested a review July 7, 2023 00:35
@arnavgarg1
Copy link
Contributor

Nice! This looks like a great starting point - let me do the work we discussed on my end and (1) cleanup unused functions and (2) move pandas/python specific logic behind DF engine functions where possible

@jaychia
Copy link
Contributor Author

jaychia commented Jul 7, 2023

Nice! This looks like a great starting point - let me do the work we discussed on my end and (1) cleanup unused functions and (2) move pandas/python specific logic behind DF engine functions where possible

I added a shim class on top of the Daft Dataframe here to ease the transition!

These shim classes are the "external facing" classes of the DaftDataEngine, rather than just a naked daft.DataFrame and daft.Expression object.

The shim classes:

  1. DaftShimDataframe: primarily enables the __setitem__ method to support operations in Ludwig that make strong assumptions about the underlying Dataframe abstraction supporting the df["foo"] = series syntax
  2. DaftShimSeries: the Daft "series" is actually just the daft.Expression, which is "anonymous" and only takes on an identity ("this is the Series I refer to") when it is applied on a Dataframe. Some places in the ludwig code run functions directly on Series objects and have strong assumptions about the Series object supporting a Pandas API (e.g. series.astype(np.int32), scalar = series.map_partitions(func)), so this could be a good layer to help us with that abstraction, or at least to help us eventually migrate off places where that happens.

@github-actions
Copy link

github-actions bot commented Jul 7, 2023

Unit Test Results

  4 files  ±  0  4 suites  ±0   17s ⏱️ - 55m 38s
  6 tests  - 28  0 ✔️  - 29    5 💤 ±0  0 ±0  1 🔥 +1 
18 runs   - 50  0 ✔️  - 58  14 💤 +4  0 ±0  4 🔥 +4 

For more details on these errors, see this check.

Results for commit 2a8e399. ± Comparison against base commit 60f1416.

This pull request removes 33 and adds 5 tests. Note that renamed tests count towards both.
tests.integration_tests.test_cli ‑ test_reproducible_cli_runs[horovod-experiment-1919-0]
tests.integration_tests.test_cli ‑ test_reproducible_cli_runs[horovod-experiment-1919-1]
tests.integration_tests.test_cli ‑ test_reproducible_cli_runs[horovod-experiment-31-0]
tests.integration_tests.test_cli ‑ test_reproducible_cli_runs[horovod-experiment-31-1]
tests.integration_tests.test_cli ‑ test_reproducible_cli_runs[horovod-train-1919-0]
tests.integration_tests.test_cli ‑ test_reproducible_cli_runs[horovod-train-1919-1]
tests.integration_tests.test_cli ‑ test_reproducible_cli_runs[horovod-train-31-0]
tests.integration_tests.test_cli ‑ test_reproducible_cli_runs[horovod-train-31-1]
tests.integration_tests.test_cli ‑ test_train_cli_horovod
tests.integration_tests.test_experiment ‑ test_experiment_model_resume_distributed[horovod]
…
tests.ludwig.automl.test_base_config
tests.ludwig.automl.test_utils
tests.ludwig.backend.test_ray
tests.ludwig.data.dataframe.test_daft
tests.ludwig.data.test_ray_data
This pull request removes 4 skipped tests and adds 4 skipped tests. Note that renamed tests count towards both.
tests.integration_tests.test_horovod ‑ test_horovod_gpu_memory_limit
tests.regression_tests.benchmark.test_model_performance ‑ test_performance[ames_housing.ecd.yaml]
tests.regression_tests.benchmark.test_model_performance ‑ test_performance[mercedes_benz_greener.ecd.yaml]
tests.regression_tests.benchmark.test_model_performance ‑ test_performance[sarcos.ecd.yaml]
tests.ludwig.automl.test_base_config
tests.ludwig.automl.test_utils
tests.ludwig.backend.test_ray
tests.ludwig.data.test_ray_data

♻️ This comment has been updated with latest results.

@jaychia
Copy link
Contributor Author

jaychia commented Jul 7, 2023

Quick question @arnavgarg1 - do you know how .map_partitions and .map_batches (on the DataFrameEngines) are different?

Based on the function signature/naming it seems like the intention is for .map_partitions to run on Series, and .map_batches to run on Dataframes. However, it seems in practice they’ve been used fairly interchangeably, and surprisingly without any bugs?

  1. Dask implementation of map_batches says it runs on a Series, but the Pandas and Modin ones say it runs on Dataframes
  2. Naming of map_partitions on the Dask, Pandas and Modin implementations seem to agree that it takes in a Series, but in practice it is actually used on Dataframes sometimes.

It appears they are both used fairly interchangeably. I wonder if y'all wanted to maybe consolidate them?

Ideally also if they don't run on arbitrary Pandas dictionaries, and instead users of the API must specify the names of the columns that they need to map over, it would make query optimization more effective! A black-box "map this function over the entire dataframe" leaves Daft with no choice but to think that your query requires every single column, even if in practice your Python function only uses one specific column.

@jaychia jaychia force-pushed the jay/daft-dataframe-engine branch from 69810f9 to 36db991 Compare July 7, 2023 17:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants