Training Test Split - Post Merge Review #1362

aolfat · 2024-02-29T01:37:24Z

Description

Type of change

Does this correspond to an open issue?

Select type(s) of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update

Checklist:

I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have fixed any merge conflicts

aolfat · 2024-02-29T01:38:56Z

proto/serving.proto

+  TEST = 2;      // Client is requesting test data
+}
+
+message TrainingTestSplitResponse {


I wasn't the happiest with this.

In retrospect, initialized probably isn't necessary.
iterator_done means the first iterator is done. It's a response we send so the client knows to stop iterating on the first iterator while keeping the stream open; I'll add a comment.
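A rough sketch of how a client might consume that kind of stream, purely for illustration. SplitResponse and drain_one_iterator are hypothetical stand-ins (not the generated protobuf classes or the actual client code), and the stream is simulated with a plain Python iterator rather than gRPC.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class SplitResponse:
    # Hypothetical stand-in for TrainingTestSplitResponse; field names are assumed.
    row: Optional[list] = None
    iterator_done: bool = False


def drain_one_iterator(responses: Iterator[SplitResponse]) -> List[list]:
    """Collect rows until the server signals that this iterator is exhausted.

    The shared response stream itself is not closed, so a later call can
    keep reading rows for the other iterator from the same stream.
    """
    rows = []
    for resp in responses:
        if resp.iterator_done:
            break  # stop this iterator only; the stream stays usable
        rows.append(resp.row)
    return rows


# Simulated stream: two train rows, a done marker, one test row, a done marker.
stream = iter([
    SplitResponse(row=[1.0, 0]),
    SplitResponse(row=[2.0, 1]),
    SplitResponse(iterator_done=True),
    SplitResponse(row=[3.0, 0]),
    SplitResponse(iterator_done=True),
])

train_rows = drain_one_iterator(stream)
test_rows = drain_one_iterator(stream)
```

The done marker ends one logical iterator without ending the underlying stream, which is why the second call can pick up where the first left off.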

aolfat · 2024-02-29T01:39:28Z

client/src/featureform/serving.py

+        if random_state is None:
+            random_state = 0
+
+        train, test = TrainingSetTestSplit(


entry point to splitting

aolfat · 2024-02-29T01:39:46Z

client/src/featureform/training_test_split.py

+
+
+@dataclass
+class TrainingSetSplitDetails:


Core Python logic

aolfat · 2024-02-29T01:41:03Z

client/src/featureform/training_test_split.py

+
+        return self
+
+    def send_request(self, request_type):


nit: add comment here and return value.

aolfat · 2024-02-29T01:41:45Z

client/src/featureform/training_test_split.py

+    def __iter__(self):
+        return self
+
+    def __next__(self) -> Tuple[np.ndarray, np.ndarray]:


core iterator logic that interacts with the backend

aolfat · 2024-02-29T01:42:25Z

serving/serving.go

+		if isTestFinished && isTrainFinished {
+			// If both iterators are finished, we can close the stream
+			serv.Logger.Infow("Both iterators are finished, closing stream")
+			return nil


returning nil closes the stream

aolfat · 2024-02-29T01:42:45Z

serving/serving.go

+	}
+}
+
+func (serv *FeatureServer) handleSplitInitializeRequest(


I actually might not end up needing this

aolfat · 2024-02-29T01:43:19Z

serving/serving.go

@@ -82,13 +81,170 @@ func (serv *FeatureServer) TrainingData(req *pb.TrainingDataRequest, stream pb.F
 	return nil
 }

+func (serv *FeatureServer) TrainingTestSplit(stream pb.Feature_TrainingTestSplitServer) error {


core logic I would like to be reviewed

aolfat · 2024-02-29T01:43:45Z

provider/offline.go

@@ -287,6 +287,7 @@ type OfflineStore interface {
 	CreateTrainingSet(TrainingSetDef) error
 	UpdateTrainingSet(TrainingSetDef) error
 	GetTrainingSet(id ResourceID) (TrainingSetIterator, error)
+	GetTrainingSetTestSplit(id ResourceID, testSize float32, shuffle bool, randomState int) (TrainingSetIterator, TrainingSetIterator, func() error, error)


need eyes on this

aolfat · 2024-02-29T01:44:12Z

provider/clickhouse.go

@@ -954,6 +955,56 @@ func (store *clickHouseOfflineStore) CreateTrainingSet(def TrainingSetDef) error
 	return nil
 }

+func (store *clickHouseOfflineStore) CreateTrainingTestSplit(


this is what I really really want a review on

Squashed commit of the following:

commit 0c88d69
Author: Ali Olfat <ali@featureform.com>
Date: Wed Feb 28 17:45:31 2024 -0800

    remove unused function

commit 2d39369
Author: Ali Olfat <ali@featureform.com>
Date: Wed Feb 28 17:01:38 2024 -0800

    even more clean up

commit 0ec53b5
Author: Ali Olfat <ali@featureform.com>
Date: Wed Feb 28 16:55:05 2024 -0800

    small one

commit 59caf8d
Author: Ali Olfat <ali@featureform.com>
Date: Wed Feb 28 16:52:02 2024 -0800

    some more clean up

commit bfa318c
Author: Ali Olfat <ali@featureform.com>
Date: Sun Feb 18 18:00:34 2024 -0800

    move client to a separate file and refactor

aolfat · 2024-02-29T02:04:39Z

serving/serving.go

+		logger.Errorw("Failed to get training set iterator", "Error", err)
+		return err
+	}
+	defer dropViews()


Not totally convinced I need this yet.

What I was thinking:

dataset = get_training_set(ts.name, ts.var)
train, test = dataset.training_test_split(random_state=0)

If you don't drop the views after these are consumed, it's extra stuff left in their provider (which really isn't a big deal), and running dataset.training_test_split(random_state=0) again won't produce a new random set.

I could probably get rid of the view dropping.
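For context on the random_state point, here is a generic illustration of a seeded split in plain Python. It is not the ClickHouse view logic from this PR; seeded_split is a hypothetical helper that only shows why a fixed seed yields the same partition on every call.

```python
import random
from typing import List, Sequence, Tuple


def seeded_split(rows: Sequence, test_size: float, random_state: int) -> Tuple[List, List]:
    """Shuffle with a private RNG seeded by random_state, then cut into train/test."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the result independent of global RNG state.
    random.Random(random_state).shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))
first = seeded_split(rows, test_size=0.2, random_state=0)
second = seeded_split(rows, test_size=0.2, random_state=0)
```

Both calls return identical train and test lists, so with a fixed seed a repeated split is reproducible whether or not anything is cached on the provider side.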

ahmadnazeri

added a few comments; overall great work!

One question: can the user convert train and test into pandas DataFrames? If yes, can we include it in the docs?

ahmadnazeri · 2024-02-29T02:05:51Z

client/src/featureform/client.py

@@ -1,3 +1,4 @@
+import featureform.resources


nit: usually, there is an order for imports

ahmadnazeri · 2024-02-29T02:13:04Z

client/src/featureform/serving.py

+            raise ValueError("test_size must be between 0 and 1")
+        if train_size > 1 or train_size < 0:
+            raise ValueError("train_size must be between 0 and 1")
+        if test_size != 0 and train_size != 0:


nit: you can remove this conditional by moving the nested conditionals up to the first level.

if test_size == 0 and train_size != 0:
    test_size = 1 - train_size
if test_size != 0 and train_size == 0:
    train_size = 1 - test_size
if test_size + train_size != 1:
    raise ValueError("test_size + train_size must equal 1")
return test_size, train_size
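A self-contained version of that flattened check might look like the sketch below. The function name resolve_split_sizes and the default arguments are assumptions, and math.isclose is swapped in for the exact != 1 comparison to tolerate floating-point rounding; that swap is not part of the suggestion above.

```python
import math
from typing import Tuple


def resolve_split_sizes(test_size: float = 0, train_size: float = 0) -> Tuple[float, float]:
    """Validate the two sizes and derive whichever one was left at 0."""
    if not 0 <= test_size <= 1:
        raise ValueError("test_size must be between 0 and 1")
    if not 0 <= train_size <= 1:
        raise ValueError("train_size must be between 0 and 1")
    # Flat, first-level conditionals: fill in the missing size from the other.
    if test_size == 0 and train_size != 0:
        test_size = 1 - train_size
    if test_size != 0 and train_size == 0:
        train_size = 1 - test_size
    # Assumption: tolerate rounding error instead of requiring an exact sum of 1.
    if not math.isclose(test_size + train_size, 1.0):
        raise ValueError("test_size + train_size must equal 1")
    return test_size, train_size
```

Note that in this sketch passing 0 for both sizes raises a ValueError; whether the real function should instead fall back to a default split is not shown in the excerpt.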

ahmadnazeri · 2024-02-29T02:17:32Z

client/src/featureform/training_test_split.py

+                break
+
+            # Process and store the row data
+            from featureform.serving import Row


any reason this import is here?

circular imports

ahmadnazeri · 2024-02-29T02:18:53Z

client/tests/test_train_test_split.py

@@ -0,0 +1,237 @@
+from featureform.serving import Dataset


nit: import ordering

ahmadnazeri · 2024-02-29T02:19:21Z

client/tests/test_train_test_split.py

+
+
+def response(req_type, iterator_done):
+    if req_type == 0:


no idea what the numbers mean

It shows it right underneath,

but I fixed it.
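One way to avoid bare numbers in a test helper like this is a named enum. The sketch below is only an illustration: TEST = 2 comes from the proto excerpt earlier in this review, while the names and values INITIALIZE = 0 and TRAINING = 1 are assumptions, as is the response_kind helper.

```python
from enum import IntEnum


class RequestType(IntEnum):
    # Only TEST = 2 appears in the proto excerpt; the other entries are assumed.
    INITIALIZE = 0
    TRAINING = 1
    TEST = 2


def response_kind(req_type: int) -> str:
    """Map a raw request-type number to a readable name."""
    return RequestType(req_type).name.lower()


kind = response_kind(2)
```

Because IntEnum members compare equal to plain integers, a check such as req_type == RequestType.TEST still works when req_type arrives as a raw int. The constants from the generated protobuf module could likely serve the same purpose.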

ahmadnazeri · 2024-02-29T02:32:23Z

provider/clickhouse_test.go

+	if err != nil {
+		return
+	}
+	for set.Next() {


should there be some sort of check?

ahmadnazeri · 2024-02-29T02:32:42Z

provider/clickhouse_test.go

@@ -91,3 +96,106 @@ func createClickHouseDatabase(c pc.ClickHouseConfig) error {
 	}
 	return nil
 }
+
+func TestTrainingSet(t *testing.T) {
+	t.Skip()


tests are getting skipped?

ahmadnazeri · 2024-02-29T02:37:45Z

serving/serving.go

+				return err
+			}
+		default:
+			if err := serv.handleSplitDataRequest(stream, req, &trainIter, &testIter, &isTestFinished, &isTrainFinished, logger); err != nil {


It is kind of weird that all these variables are getting updated within this method, especially isTestFinished and isTrainFinished.

ahmadnazeri · 2024-02-29T02:39:09Z

provider/offline.go

@@ -780,6 +781,11 @@ func (store *memoryOfflineStore) GetTrainingSet(id ResourceID) (TrainingSetItera
 	}
 	return data.(trainingRows).Iterator(), nil
 }
+
+func (store *memoryOfflineStore) GetTrainingSetTestSplit(id ResourceID, testSize float32, shuffle bool, randomState int) (TrainingSetIterator, TrainingSetIterator, func() error, error) {
+	return nil, nil, nil, nil


should this return a "not implemented" error too?

ahmadnazeri · 2024-02-29T02:40:22Z

provider/spark.go

@@ -2264,6 +2264,10 @@ func (spark *SparkOfflineStore) GetTrainingSet(id ResourceID) (TrainingSetIterat
 	return fileStoreGetTrainingSet(id, spark.Store, spark.Logger)
 }

+func (spark *SparkOfflineStore) GetTrainingSetTestSplit(id ResourceID, testSize float32, shuffle bool, randomState int) (TrainingSetIterator, TrainingSetIterator, func() error, error) {
+	return nil, nil, nil, nil


should we add a "not implemented", or rather a "not supported for Spark", error message?

aolfat added 3 commits February 28, 2024 17:35

Proto changes

becbc20

Client side changes

c00adc6

changes to api main

99368da

aolfat had a problem deploying to Integration testing February 29, 2024 01:37 — with GitHub Actions Error

aolfat changed the title ~~Team review~~ Training Test Split - Post Merge Review Feb 29, 2024


aolfat added 2 commits February 28, 2024 17:45

misc unimportant changes

2177eb7

aolfat force-pushed the Team_Review branch from 2c7bb5b to 2177eb7 Compare February 29, 2024 01:46

aolfat had a problem deploying to Integration testing February 29, 2024 01:46 — with GitHub Actions Failure
