
# Data science knowledge/tricks

## pandas

Time since the last occurrence of a particular value. Remove the `.shift()` if you're not doing target encoding.

```python
>>> import pandas as pd

>>> df = pd.DataFrame([
...     (1, 'cloudy'),
...     (2, 'cloudy'),
...     (3, 'sunny'),
...     (4, 'sunny'),
...     (5, 'cloudy'),
...     (6, 'sunny')
... ], columns=['time', 'location'])

>>> (df['time'] - df['time'].groupby(df['location'].shift().eq('cloudy').cumsum()).transform('first'))
0    0
1    0
2    0
3    1
4    2
5    0
Name: time, dtype: int64
```

## Ensemble

(not too sure about the exact vocabulary)

- Blending is averaging the predictions of several models
- Bagging is averaging predictions from models trained on bootstrap samples, i.e. resampled with replacement
- Pasting is the same as bagging but the samples are drawn without replacement
- Bumping is training a model on several resamples and keeping the one that performs best on the original dataset
- Stacking is training a model on the predictions made by other models
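To make the bagging/pasting/blending distinction concrete, here is a dependency-free sketch where each "model" simply predicts the mean of its resampled training targets. The function name, the sample sizes, and the mean-predictor "model" are made up for illustration.

```python
import random
import statistics

def bagged_mean(y, n_models=10, with_replacement=True, seed=42):
    """Average the predictions of n_models resample-trained 'models'.

    with_replacement=True is bagging; False is pasting.
    """
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        if with_replacement:
            # Bagging: a bootstrap sample of the same size, drawn with replacement
            sample = [rng.choice(y) for _ in y]
        else:
            # Pasting: a smaller sample drawn without replacement
            sample = rng.sample(y, k=max(1, len(y) // 2))
        # The 'model' here is just the mean of its training targets
        preds.append(statistics.mean(sample))
    # Blending: average the individual models' predictions
    return statistics.mean(preds)

print(bagged_mean([1, 2, 3, 4, 5, 6]))
```

In practice the per-resample model would be a real estimator (e.g. a decision tree); the resampling and averaging logic stays the same.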

## Missing values
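A sketch of common imputation patterns with pandas; the column names and the chosen strategies are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'age': [25, None, 40], 'city': ['Paris', None, 'Lyon']})

# Keep the missingness signal before overwriting it
df['age_was_missing'] = df['age'].isna()

# Numeric column: impute with a central value
df['age'] = df['age'].fillna(df['age'].median())

# Categorical column: impute with an explicit "missing" level
df['city'] = df['city'].fillna('unknown')
```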

## Feature engineering

### Temporal features

Day of week, hour, and minute are cyclic ordinal features; cosine and sine transforms should be used to express the cycle. See this Stack Exchange discussion.

```python
from math import cos, sin, pi

hours = list(range(24))

# The full cycle is 24 hours, hence the 2 * pi factor
hours_cos = [cos(2 * pi * h / 24) for h in hours]
hours_sin = [sin(2 * pi * h / 24) for h in hours]
```

### Binning continuous variables
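A sketch with pandas: `pd.cut` for fixed edges, `pd.qcut` for quantile bins. The edges and labels are made up for illustration.

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 40, 67])

# Fixed-width bins with explicit edges
bins = pd.cut(ages, bins=[0, 18, 65, 100], labels=['minor', 'adult', 'senior'])

# Quantile bins: roughly equal counts per bin
halves = pd.qcut(ages, q=2, labels=['low', 'high'])

print(bins.tolist())  # ['minor', 'minor', 'adult', 'adult', 'senior']
```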

### Encoding categorical variables
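Two common encodings sketched with pandas; the toy `color` column is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'red', 'blue']})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df['color'], prefix='color')

# Ordinal encoding: map categories to integer codes
# (codes follow the lexical order of the categories)
df['color_code'] = df['color'].astype('category').cat.codes
```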

### Adstock transformation

Use the adstock transformation to take lag effects into account when measuring marketing campaign impacts.

```python
advertising = [6, 27, 0, 0, 20, 0, 20]  # marketing campaign intensities

# Each period carries over half of the previous period's effect
for i in range(1, len(advertising)):
    advertising[i] += advertising[i - 1] * 0.5

print(advertising)
# [6, 30.0, 15.0, 7.5, 23.75, 11.875, 25.9375]
```

## Dealing with unbalanced classes

- Read this
- Try under-sampling if there is a lot of data
- Try over-sampling if there is not a lot of data
- Always under/over-sample on the training set only. Don't resample the entire dataset before the train/test split; otherwise duplicates will exist between the two sets and the scores will be skewed
- Instead of predicting a class, predict a probability and pick a manual threshold to trade precision against recall as you wish
- Use class weights/costs
- Limit the over-represented class
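A sketch of the manual-threshold idea from the list above: predict probabilities and choose the cutoff yourself instead of relying on the default 0.5. The probabilities are made-up numbers for illustration.

```python
# Hypothetical predicted probabilities for the positive (rare) class
probas = [0.05, 0.30, 0.45, 0.60, 0.90]

# Lowering the threshold below 0.5 trades precision for recall:
# more positives are caught, at the cost of more false alarms
threshold = 0.3
preds = [int(p >= threshold) for p in probas]
print(preds)  # [0, 1, 1, 1, 1]
```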

## Timeseries forecasting

Spectral analysis (e.g. a periodogram) can uncover recurring patterns.
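A minimal sketch of the spectral-analysis idea with NumPy: build a periodogram from the FFT and read the cycle length off the dominant frequency. The synthetic series and its 12-step season are made up for illustration.

```python
import numpy as np

# Synthetic series: a 12-step seasonal cycle plus a little noise
rng = np.random.default_rng(0)
t = np.arange(120)
series = np.sin(2 * np.pi * t / 12) + 0.1 * rng.standard_normal(len(t))

# Periodogram via the FFT; subtract the mean to suppress the DC component
freqs = np.fft.rfftfreq(len(series))
spectrum = np.abs(np.fft.rfft(series - series.mean()))

dominant = freqs[np.argmax(spectrum)]
period = 1 / dominant
print(period)  # ≈ 12, the recurring cycle length
```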

## Kaggle tricks

- Adversarial validation can help make relevant cross-validation splits
- Pseudo-labeling: augment the training set with part of the labeled test set

## Target transformation

- Taking the log of the target, training, and then applying exp to the predictions is naive: because exp is convex, exp(E[log y]) underestimates E[y], so the back-transformed predictions are biased downwards
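One standard bias correction is Duan's smearing estimator: instead of assuming the exponentiated residuals average to 1, multiply exp(prediction) by their empirical mean. A sketch with made-up targets and made-up log-scale predictions:

```python
import math
import statistics

# Hypothetical targets and log-scale predictions for illustration
y = [10, 12, 50, 55, 200]
log_pred = [2.4, 2.4, 3.9, 3.9, 5.3]

# Residuals on the log scale
residuals = [math.log(a) - p for a, p in zip(y, log_pred)]

# Smearing factor: empirical mean of the exponentiated residuals
smearing = statistics.mean(math.exp(r) for r in residuals)

naive = [math.exp(p) for p in log_pred]                 # biased downwards
corrected = [math.exp(p) * smearing for p in log_pred]  # bias-corrected
```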