[Question] Multi covariates setting and scaling problem using own data #2350

Closed
ALH84007 opened this issue Apr 25, 2024 · 3 comments
Labels
question Further information is requested

ALH84007 commented Apr 25, 2024

I imported my data, converted it into TimeSeries, set the target and covariates, split them into training, validation, and test sets, and scaled each split. What is not clear to me is how to split and scale the covariates after stacking or concatenating more than two of them.
If I set:
past_covariates = concatenate([A, B, C, D], axis=1)
stacked_covariates = past_covariates1.stack([past_covariates2, past_covariates3, ..., past_covariatesN])
how should I split and scale the past covariates?

Below is my code. The splitting and scaling of the training, validation, and test sets also seems cumbersome, so any suggestions are welcome.

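import matplotlib.pyplot as plt
import pandas as pd
from darts import TimeSeries, concatenate
from darts.dataprocessing.transformers import Scaler

train_ratio = 0.8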
val_ratio = 0.1
test_ratio = 0.1

file_path = 'D:/XXX.csv'
df = pd.read_csv(file_path)

df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df.set_index('Timestamp', inplace=True)

# TODO: multivariate time series
# target_column_names = ['algal', 'chlorophyll']
target_column_names = ['chlorophyll']
covariate_columns_names = ['TEM', 'PH', 'DO', 'conductivity', 'turbidity', 'PV', 'AN', 'TP', 'TN']

target_series = TimeSeries.from_dataframe(df, value_cols=target_column_names, freq='H')
covariate_series = TimeSeries.from_dataframe(df, value_cols=covariate_columns_names, freq='H')

# time series split
train_target, temp_target = target_series.split_before(train_ratio)
val_target, test_target = temp_target.split_before(val_ratio / (1 - train_ratio))
train_covariates, temp_covariates = covariate_series.split_before(train_ratio)
val_covariates, test_covariates = temp_covariates.split_before(val_ratio / (1 - train_ratio))
# scale the series (each scaler is fitted on the training split only,
# so no validation or test statistics leak into the scaling)
scaler_target = Scaler()
scaler_covariates = Scaler()

train_target_scaled = scaler_target.fit_transform(train_target)
val_target_scaled = scaler_target.transform(val_target)
model_target_scaled = concatenate([train_target_scaled, val_target_scaled])
test_target_scaled = scaler_target.transform(test_target)

train_covariates_scaled = scaler_covariates.fit_transform(train_covariates)
val_covariates_scaled = scaler_covariates.transform(val_covariates)
model_covariates_scaled = concatenate([train_covariates_scaled, val_covariates_scaled])
test_covariates_scaled = scaler_covariates.transform(test_covariates)
all_covariates_scaled = concatenate([model_covariates_scaled, test_covariates_scaled])

# plot
train_target_scaled.plot(label="training")
val_target_scaled.plot(label="validation")
test_target_scaled.plot(label="test")
plt.show()

madtoinou added the question (Further information is requested) label Apr 25, 2024

madtoinou (Collaborator) commented

Hi @ALH84007,

Your code looks great: you fit the Scaler on the training split of the target, then apply it to the validation and test sets before concatenating them.

Having multivariate covariates does not change anything: the Scaler processes each component individually (independently of the other components' ranges), so you can keep your code as it is. I am not sure I understand what the problem is here?

ALH84007 commented Apr 25, 2024

Thank you for your reply. I was wondering whether I need to stack or concatenate the covariates, and whether, after stacking, I can still split and scale them with the existing code. Your reply made it clear.

madtoinou (Collaborator) commented

If the new covariates are new components, rather than a temporal continuation of existing components, you do indeed need to stack them.

The code will continue to work as long as the new covariates (components) are added before the scaler is fitted for the first time; otherwise, it will complain about the dimensions of the series.
