Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Training Test Split - Post Merge Review #1362

Open
wants to merge 5 commits into
base: Test_split_review
Choose a base branch
from

Conversation

aolfat
Copy link
Contributor

@aolfat aolfat commented Feb 29, 2024

Description

Type of change

Does this correspond to an open issue?

Select type(s) of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Checklist:

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have fixed any merge conflicts

@aolfat aolfat changed the title Team review Training Test Split - Post Merge Review Feb 29, 2024
TEST = 2; // Client is requesting test data
}

message TrainingTestSplitResponse {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wasn't the happiest with this,

in retrospect and initalized probably isn't necessary.
iterator done means first iterator is done and it's a response we send so the client knows to finish iterating on the first iterator but keep the stream open; i'll add a comment

if random_state is None:
random_state = 0

train, test = TrainingSetTestSplit(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

entry point to splitting



@dataclass
class TrainingSetSplitDetails:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Core Python logic


return self

def send_request(self, request_type):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add comment here and return value.

def __iter__(self):
return self

def __next__(self) -> Tuple[np.ndarray, np.ndarray]:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

core iterator logic that interacts with the backend

if isTestFinished && isTrainFinished {
// If both iterators are finished, we can close the stream
serv.Logger.Infow("Both iterators are finished, closing stream")
return nil
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

returning nil closes the stream

}
}

func (serv *FeatureServer) handleSplitInitializeRequest(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually might not end up needing this

@@ -82,13 +81,170 @@ func (serv *FeatureServer) TrainingData(req *pb.TrainingDataRequest, stream pb.F
return nil
}

func (serv *FeatureServer) TrainingTestSplit(stream pb.Feature_TrainingTestSplitServer) error {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

core logic I would like to be reviewed

@@ -287,6 +287,7 @@ type OfflineStore interface {
CreateTrainingSet(TrainingSetDef) error
UpdateTrainingSet(TrainingSetDef) error
GetTrainingSet(id ResourceID) (TrainingSetIterator, error)
GetTrainingSetTestSplit(id ResourceID, testSize float32, shuffle bool, randomState int) (TrainingSetIterator, TrainingSetIterator, func() error, error)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need eyes on this

@@ -954,6 +955,56 @@ func (store *clickHouseOfflineStore) CreateTrainingSet(def TrainingSetDef) error
return nil
}

func (store *clickHouseOfflineStore) CreateTrainingTestSplit(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is what I really really want a review on

Squashed commit of the following:

commit 0c88d69
Author: Ali Olfat <ali@featureform.com>
Date:   Wed Feb 28 17:45:31 2024 -0800

    remove unused function

commit 2d39369
Author: Ali Olfat <ali@featureform.com>
Date:   Wed Feb 28 17:01:38 2024 -0800

    even more clean up

commit 0ec53b5
Author: Ali Olfat <ali@featureform.com>
Date:   Wed Feb 28 16:55:05 2024 -0800

    small one

commit 59caf8d
Author: Ali Olfat <ali@featureform.com>
Date:   Wed Feb 28 16:52:02 2024 -0800

    some more clean up

commit bfa318c
Author: Ali Olfat <ali@featureform.com>
Date:   Sun Feb 18 18:00:34 2024 -0800

    move client to a separate file and refactor
logger.Errorw("Failed to get training set iterator", "Error", err)
return err
}
defer dropViews()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not totally convinced I need this yet:

What i was thinking

dataset = get_training_set(ts.name, ts.var)
train, test = dataset.training_test_split(random_state=0)

if you don't drop the views after these are consumed, its extra stuff in their provider -- which really isn't a big deal -- and also running dataset.training_test_split(random_state=0) won't be a new random set

I could probably get rid of the view dropping

Copy link
Contributor

@ahmadnazeri ahmadnazeri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a few comments; overall great work!

One question, can the user convert train and test into pandas dataframe? if yes, can we include it in docs?

@@ -1,3 +1,4 @@
import featureform.resources
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: usually, there is an order for imports

raise ValueError("test_size must be between 0 and 1")
if train_size > 1 or train_size < 0:
raise ValueError("train_size must be between 0 and 1")
if test_size != 0 and train_size != 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you can remove this conditional by moving the nested conditional to the both at the first level.

        if test_size == 0 and train_size != 0:
             test_size = 1 - train_size
         if test_size != 0 and train_size == 0:
             train_size = 1 - test_size

         if test_size + train_size != 1:
             raise ValueError("test_size + train_size must equal 1")

         return test_size, train_size

break

# Process and store the row data
from featureform.serving import Row
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason this import is here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

circular imports

@@ -0,0 +1,237 @@
from featureform.serving import Dataset
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: import ordering



def response(req_type, iterator_done):
if req_type == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no idea, what the numbers mean

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it shows it right under

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but i fixed it

if err != nil {
return
}
for set.Next() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should there be some sort of check?

@@ -91,3 +96,106 @@ func createClickHouseDatabase(c pc.ClickHouseConfig) error {
}
return nil
}

func TestTrainingSet(t *testing.T) {
t.Skip()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests are getting skipped?

return err
}
default:
if err := serv.handleSplitDataRequest(stream, req, &trainIter, &testIter, &isTestFinished, &isTrainFinished, logger); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is kind of weird that all these variables are getting updated within this method especially for the isTestFinished and isTrainFinished

@@ -780,6 +781,11 @@ func (store *memoryOfflineStore) GetTrainingSet(id ResourceID) (TrainingSetItera
}
return data.(trainingRows).Iterator(), nil
}

func (store *memoryOfflineStore) GetTrainingSetTestSplit(id ResourceID, testSize float32, shuffle bool, randomState int) (TrainingSetIterator, TrainingSetIterator, func() error, error) {
return nil, nil, nil, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this have a not implemented too?

@@ -2264,6 +2264,10 @@ func (spark *SparkOfflineStore) GetTrainingSet(id ResourceID) (TrainingSetIterat
return fileStoreGetTrainingSet(id, spark.Store, spark.Logger)
}

func (spark *SparkOfflineStore) GetTrainingSetTestSplit(id ResourceID, testSize float32, shuffle bool, randomState int) (TrainingSetIterator, TrainingSetIterator, func() error, error) {
return nil, nil, nil, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add a not implemented or rather not supported for Spark error message?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants