
Distributed type inference #54

Open · wants to merge 2 commits into base: staging

Conversation

@pedrofluxa (Contributor) commented Oct 18, 2023

The current implementation of type_infer is not suitable for distributed compute environments (i.e. it is not scalable): it can only run on a single node and needs to load all of the data into memory. This makes type_infer unsuitable for analyzing large datasets that do not fit in RAM.

The internal workings of type_infer allow for a relatively straightforward distributed implementation: split the data into subsets (each subset loaded by a different worker), infer the data types on each subset independently, and then apply a voting mechanism to choose a single type per column.

The voting mechanism shall be aware of the data type hierarchy. For example, consider the case of 4 workers: worker 1 identifies its subset of a column as type text, while workers 2, 3, and 4 identify their subsets as type integer. Because text is a more general data type than integer (one level higher in the data type hierarchy), the entire column should be cast as text rather than integer, even though integer received more votes. It is worth mentioning that the current implementation does not handle this situation; it may look like an edge case, but it is likely very common in practice. A minimal sketch of such a hierarchy-aware vote is given below.
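
The TYPE_HIERARCHY ordering and the resolve_votes helper below are hypothetical illustrations, not part of the current type_infer API; the real ordering would come from type_infer's own dtype definitions.

```python
# Hypothetical ordering from most specific to most general.
TYPE_HIERARCHY = ["integer", "float", "date", "datetime", "categorical", "text"]

def resolve_votes(votes):
    """Merge per-worker type votes for a single column.

    The column is promoted to the most general type that any worker
    observed, so a single 'text' vote overrides three 'integer' votes.
    Plain majority voting could still be used as a tie-breaker among
    equally general candidates.
    """
    return max(votes, key=TYPE_HIERARCHY.index)

# Example from the description above: 1 vote for text, 3 for integer.
assert resolve_votes(["text", "integer", "integer", "integer"]) == "text"
```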

The proposed implementation shall use torch.distributed to distribute the work across nodes. Because torch is a heavy dependency, this capability shall only be available if the user installs type_infer by running

pip install type_infer[distributed]
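
For reference, this is one way such an optional extra could be declared, assuming the package is built with setuptools (the actual build configuration of type_infer may differ):

```python
# setup.py (sketch)
from setuptools import setup, find_packages

setup(
    name="type_infer",
    packages=find_packages(),
    install_requires=[
        # existing core dependencies stay here, unchanged
    ],
    extras_require={
        # torch is only pulled in by `pip install type_infer[distributed]`
        "distributed": ["torch"],
    },
)
```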

All of the distributed functionality should be encapsulated in a sub-module called distributed, so that the existing code-base keeps working unchanged.
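
A rough sketch of how that sub-module could combine per-worker results with torch.distributed is shown below. Here infer_subset is a hypothetical stand-in for running the existing single-node inference on one worker's chunk of the data, and resolve_votes is the hierarchy-aware vote sketched above; neither exists in type_infer today.

```python
import torch.distributed as dist

def infer_types_distributed(local_chunk):
    """Infer column types on this worker's chunk of rows, then merge
    all workers' verdicts into one decision per column.

    Assumes the process group has already been initialised, e.g. via
    dist.init_process_group(backend="gloo") when launched with torchrun.
    """
    # Hypothetical: reuse the existing single-node inference locally.
    local_types = infer_subset(local_chunk)  # {column_name: type_name}

    # Gather every worker's per-column verdicts on all ranks.
    all_votes = [None] * dist.get_world_size()
    dist.all_gather_object(all_votes, local_types)

    # Hierarchy-aware vote per column (see resolve_votes above).
    return {
        col: resolve_votes([votes[col] for votes in all_votes])
        for col in local_types
    }
```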

@pedrofluxa pedrofluxa self-assigned this Oct 18, 2023
@pedrofluxa pedrofluxa added the enhancement (New feature or request) label Oct 18, 2023
… + classifier. So far the system nails it for airline_delays dataset just fine, while being trained with only 3000 rows of data (used_car_price, airline_sentiment and individual_household_power_compsution).
@paxcema paxcema changed the title Make type_infer scalable Distributed type inference Oct 18, 2023
@paxcema paxcema mentioned this pull request Dec 21, 2023
Labels: enhancement (New feature or request)
Projects: Status: to review