Look-Ahead Bias in Generated Features #2731

Closed
Nasser-Alkhulaifi opened this issue May 20, 2024 · 5 comments

Comments

@Nasser-Alkhulaifi

Nasser-Alkhulaifi commented May 20, 2024

Hi,

I've noticed that some of the features generated by Featuretools exhibit look-ahead bias, which is critical and must be avoided in machine learning regression problems. Specifically, the features in X_train contain the exact values found in the same row of y_train, which leads to data leakage.

Example:
In the attached screenshot, you can see that X_train (features) includes values that are present in the same row of y_train. This creates look-ahead bias. Such features (e.g., lags or rolling-window statistics) should be shifted so that only data available at forecasting time is used for prediction.

Questions:

  1. Why does this look-ahead bias exist in the generated features?
  2. Am I using the tool incorrectly?
  3. Is there a specific setting or method I am missing to avoid this issue?

Thank you.

[Attached screenshot: FT]

```python
import featuretools as ft
import pandas as pd
from featuretools.primitives import list_primitives

df = pd.read_csv(r"xxxxxxxC.csv")
df['DateTime'] = pd.to_datetime(df['DateTime'])

# Create an EntitySet
es = ft.EntitySet(id="data")

# Add the DataFrame to the EntitySet
es = es.add_dataframe(
    dataframe_name="df",
    dataframe=df,
    index="index",
    make_index=True,
    time_index="DateTime",
)

# List all available primitives
primitives = list_primitives()
agg_primitives = primitives[primitives['type'] == 'aggregation']['name'].tolist()
trans_primitives = primitives[primitives['type'] == 'transform']['name'].tolist()

# Run DFS to create new features
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="df",
    agg_primitives=agg_primitives,  # Use all aggregation primitives
    trans_primitives=trans_primitives,  # Use all transformation primitives
)

feature_matrix
```

@thehomebrewnerd
Contributor

@Nasser-Alkhulaifi By default Featuretools is going to attempt to generate features from every column in the input dataframe that you provide. It has no way of knowing that it shouldn't generate features for a given column that is present in the data unless you instruct it to ignore the column.

There are multiple ways you can handle this based on your particular problem. For example, you can simply drop the column from your dataframe before creating the EntitySet. You could also use the ignore_columns argument when calling ft.dfs to tell Featuretools to not generate features from that column.

@Nasser-Alkhulaifi
Author

@thehomebrewnerd thank you for your quick response!

I appreciate your suggestions on how to exclude columns to avoid look-ahead bias. However, my concern is not about ignoring specific columns. My issue is related to the inherent look-ahead bias in the generated features and the lack of appropriate shifting to avoid this bias.

As you know, in time series forecasting it is crucial that the features used for prediction do not include future information relative to the target variable. This means that features such as lags and rolling-window statistics need to be shifted appropriately so that only past data up to the prediction time is used.

From my observations, some of the features generated by Featuretools include exact values that correspond to the same row in the target variable (y_train). This introduces look-ahead bias and leads to data leakage, as the model gets access to information that would not be available at prediction time.

For instance, consider using lag1 as a feature. It must be shifted one step back so that it does not sit on the same row/index as the target variable y_train. The first row of any feature generated from the target variable should then be NaN, because after shifting there is no earlier value, and at prediction time (t0) the current value is not yet known.
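The shifting described here can be sketched in plain pandas, independent of Featuretools. The series values and the feature names (`lag1`, `rolling_mean_2`, `leaky_rolling_mean_2`) are made up for illustration. The leaky column is included only to show the difference between a shifted and an unshifted window.

```python
import pandas as pd

# Hypothetical target series on a daily index.
y = pd.Series(
    [10.0, 12.0, 11.0, 15.0, 14.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
    name="target",
)

features = pd.DataFrame({
    # Value from the previous row; the first row has no past value, so it is NaN.
    "lag1": y.shift(1),
    # Shift first, then roll: the window at row t covers rows t-2 and t-1 only.
    "rolling_mean_2": y.shift(1).rolling(window=2).mean(),
    # Unshifted window: includes y[t] itself, which leaks the target.
    "leaky_rolling_mean_2": y.rolling(window=2).mean(),
})

print(pd.concat([y, features], axis=1))
```

Rows where the shifted features are NaN would normally be dropped from both X_train and y_train before fitting.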

[Attached image]

Does my point make sense? Is this clear to you and can it be added as a feature?

To put it simply, we need to avoid aligning any features that have information that won't be available at forecasting time on the same row/index as the target variable. I know I can work around this after the new dataframe of generated features is created, but I'm looking for a method or setting in Featuretools that ensures only past data is considered when creating features for time-series forecasting tasks.

Thank you!

@thehomebrewnerd
Contributor

@Nasser-Alkhulaifi Yes, your point makes sense. Featuretools has a set of primitives for creating features for time-series problems. Take a look at this guide for more information: https://featuretools.alteryx.com/en/stable/guides/time_series.html

@Nasser-Alkhulaifi
Author

Thank you @thehomebrewnerd

@thehomebrewnerd
Contributor

Closing this for now. Feel free to reopen if you encounter additional problems or find behavior in Featuretools that seems incorrect.


2 participants