
Distributed type inference #54

Open · wants to merge 2 commits into base: staging

Conversation

@pedrofluxa (Contributor) commented Oct 18, 2023

The current implementation of type_infer is not suitable for distributed compute environments (i.e. it is not scalable): it can only run on a single node and needs to load all of the data into memory. This makes type_infer unsuitable for analyzing large datasets that do not fit in RAM.

The internal workings of type_infer allow for a relatively straightforward distributed implementation: split the data into subsets (each subset loaded by a different worker), infer the data types on each subset independently, and then apply a voting mechanism to choose a single type per column.

The voting mechanism shall be aware of the data type hierarchy. For example, consider the case of 4 workers: worker 1 identifies its subset of a column as type text, while workers 2, 3, and 4 identify their subsets as type integer. Because text is a more general data type than integer (one level higher in the data type hierarchy), the entire column should be cast as text rather than integer, even though integer received more votes. It is worth mentioning that the current implementation does not handle this situation; it may look like an edge case, but it is likely very common in practice. A minimal sketch of such a hierarchy-aware vote is given below.
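
The TYPE_HIERARCHY ordering and the resolve_votes helper below are hypothetical illustrations, not part of the current type_infer API; the real ordering would come from type_infer's own dtype definitions.

```python
# Hypothetical ordering from most specific to most general.
TYPE_HIERARCHY = ["integer", "float", "date", "datetime", "categorical", "text"]

def resolve_votes(votes):
    """Merge per-worker type votes for a single column.

    The column is promoted to the most general type that any worker
    observed, so a single 'text' vote overrides three 'integer' votes.
    Plain majority voting could still be used as a tie-breaker among
    equally general candidates.
    """
    return max(votes, key=TYPE_HIERARCHY.index)

# Example from the description above: 1 vote for text, 3 for integer.
assert resolve_votes(["text", "integer", "integer", "integer"]) == "text"
```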

The proposed implementation shall use torch.distributed to distribute the work across nodes. Because torch is a heavy dependency, this capability shall only be available if the user installs type_infer by running

pip install type_infer[distributed]
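
For reference, this is one way such an optional extra could be declared, assuming the package is built with setuptools (the actual build configuration of type_infer may differ):

```python
# setup.py (sketch)
from setuptools import setup, find_packages

setup(
    name="type_infer",
    packages=find_packages(),
    install_requires=[
        # existing core dependencies stay here, unchanged
    ],
    extras_require={
        # torch is only pulled in by `pip install type_infer[distributed]`
        "distributed": ["torch"],
    },
)
```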

All of the distributed functionality should be encapsulated in a sub-module called distributed, so that the existing code-base keeps working unchanged.
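
A rough sketch of how that sub-module could combine per-worker results with torch.distributed is shown below. Here infer_subset is a hypothetical stand-in for running the existing single-node inference on one worker's chunk of the data, and resolve_votes is the hierarchy-aware vote sketched above; neither exists in type_infer today.

```python
import torch.distributed as dist

def infer_types_distributed(local_chunk):
    """Infer column types on this worker's chunk of rows, then merge
    all workers' verdicts into one decision per column.

    Assumes the process group has already been initialised, e.g. via
    dist.init_process_group(backend="gloo") when launched with torchrun.
    """
    # Hypothetical: reuse the existing single-node inference locally.
    local_types = infer_subset(local_chunk)  # {column_name: type_name}

    # Gather every worker's per-column verdicts on all ranks.
    all_votes = [None] * dist.get_world_size()
    dist.all_gather_object(all_votes, local_types)

    # Hierarchy-aware vote per column (see resolve_votes above).
    return {
        col: resolve_votes([votes[col] for votes in all_votes])
        for col in local_types
    }
```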

@pedrofluxa pedrofluxa self-assigned this Oct 18, 2023
@pedrofluxa pedrofluxa added the enhancement (New feature or request) label Oct 18, 2023
… + classifier. So far the system nails it for airline_delays dataset just fine, while being trained with only 3000 rows of data (used_car_price, airline_sentiment and individual_household_power_compsution).
@paxcema paxcema changed the title Make type_infer scalable Distributed type inference Oct 18, 2023
@paxcema paxcema mentioned this pull request Dec 21, 2023
Labels: enhancement (New feature or request)
Projects: Status: to review