Look-Ahead Bias in Generated Features #2731

Nasser-Alkhulaifi · 2024-05-20T13:26:57Z

Hi,

I've noticed that some of the features generated by Featuretools exhibit look-ahead bias, which must be avoided in machine learning regression problems. Specifically, the features in X_train contain the exact values that appear in the same row of y_train, which leads to data leakage.

Example:
In the attached screenshot, you can see that X_train (features) includes values that are present in the same row of y_train. This creates look-ahead bias. Such features (e.g., lags or rolling-window statistics) should be shifted so that only data available at forecasting time is used for prediction.

Questions:

Why does this look-ahead bias exist in the generated features?
Am I using the tool incorrectly?
Is there a specific setting or method I am missing to avoid this issue?

Thank you.

```python
import featuretools as ft
import pandas as pd
from featuretools.primitives import list_primitives

df = pd.read_csv(r"xxxxxxxC.csv")
df['DateTime'] = pd.to_datetime(df['DateTime'])

# Create an EntitySet
es = ft.EntitySet(id="data")

# Add the DataFrame to the EntitySet
es = es.add_dataframe(
    dataframe_name="df",
    dataframe=df,
    index="index",
    make_index=True,
    time_index="DateTime",
)

# List all available primitives
primitives = list_primitives()
agg_primitives = primitives[primitives['type'] == 'aggregation']['name'].tolist()
trans_primitives = primitives[primitives['type'] == 'transform']['name'].tolist()

# Run DFS to create new features
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="df",
    agg_primitives=agg_primitives,  # Use all aggregation primitives
    trans_primitives=trans_primitives,  # Use all transformation primitives
)

feature_matrix
```


thehomebrewnerd · 2024-05-20T14:40:26Z

@Nasser-Alkhulaifi By default Featuretools is going to attempt to generate features from every column in the input dataframe that you provide. It has no way of knowing that it shouldn't generate features for a given column that is present in the data unless you instruct it to ignore the column.

There are multiple ways you can handle this based on your particular problem. For example, you can simply drop the column from your dataframe before creating the EntitySet. You could also use the ignore_columns argument when calling ft.dfs to tell Featuretools to not generate features from that column.

Nasser-Alkhulaifi · 2024-05-20T15:17:05Z

@thehomebrewnerd thank you for your quick response!

I appreciate your suggestions on how to exclude columns to avoid look-ahead bias. However, my concern is not about ignoring specific columns. My issue is related to the inherent look-ahead bias in the generated features and the lack of appropriate shifting to avoid this bias.

As you know, in time series forecasting, it is crucial to ensure that the features used for prediction do not include future information relative to the target variable. This means that features such as lags, rolling-window statistics, etc. need to be shifted appropriately so that only past data up to the prediction time is used!

From my observations, some of the features generated by Featuretools include exact values that correspond to the same row in the target variable (y_train). This introduces look-ahead bias and leads to data leakage, as the model gets access to future information that would not be available at prediction time!

For instance, consider using lag1 as a feature (which must be shifted one step back) to avoid being on the same row/index as the target variable y_train. The first row of any feature generated from the target variable should have NaN for this feature because it has been shifted and can't be used at prediction time (t0) as this information won't be known!
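To make the shifting concrete, here is a minimal pandas-only sketch on made-up values (the series and the column names `lag0_leaky`, `lag1`, and `rolling_mean_2_past` are hypothetical). It contrasts an unshifted copy of the target, which leaks, with a properly shifted lag and a rolling mean computed only on past rows:

```python
import pandas as pd

# Hypothetical target series, one value per time step.
y = pd.Series([10.0, 12.0, 11.0, 15.0], name="target")

frame = pd.DataFrame({
    "target": y,
    # Leaky: same row as the target, so the model sees the answer.
    "lag0_leaky": y.shift(0),
    # Safe: value from the previous step; the first row has no past, so NaN.
    "lag1": y.shift(1),
    # Safe: shift first, then roll, so the window never includes the current row.
    "rolling_mean_2_past": y.shift(1).rolling(window=2).mean(),
})

print(frame)
```

The key detail is the order of operations: shifting before rolling keeps the current row's target value out of its own window.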

Does my point make sense? Is this clear to you and can it be added as a feature?

To put it simply, we need to avoid aligning any features that have information that won't be available at forecasting time on the same row/index as the target variable. I know I can work around this after the new dataframe of generated features is created, but I'm looking for a method or setting in Featuretools that ensures only past data is considered when creating features for time-series forecasting tasks.

Thank you!

thehomebrewnerd · 2024-05-20T15:22:47Z

@Nasser-Alkhulaifi Yes, your point makes sense. Featuretools has a set of primitives for creating features for time-series problems. Take a look at this guide for more information: https://featuretools.alteryx.com/en/stable/guides/time_series.html

Nasser-Alkhulaifi · 2024-05-21T09:21:12Z

Thank you @thehomebrewnerd

thehomebrewnerd · 2024-05-21T18:56:58Z

Closing this for now. Feel free to reopen if you encounter additional problems or find behavior in Featuretools that seems incorrect.

thehomebrewnerd closed this as completed May 21, 2024
