
# Data science knowledge/tricks

## pandas

Time since the last occurrence of a particular value. Remove the `.shift()` if you're not doing target encoding.

```python
>>> import pandas as pd

>>> df = pd.DataFrame([
...     (1, 'cloudy'),
...     (2, 'cloudy'),
...     (3, 'sunny'),
...     (4, 'sunny'),
...     (5, 'cloudy'),
...     (6, 'sunny')
... ], columns=['time', 'location'])

>>> (df['time'] - df['time'].groupby(df['location'].shift().eq('cloudy').cumsum()).transform('first'))
0    0
1    0
2    0
3    1
4    2
5    0
Name: time, dtype: int64
```

## Ensemble

(not too sure about the exact vocabulary)

- Blending is averaging the predictions of several models
- Bagging is averaging predictions from models trained on bootstrap samples, i.e. resampled with replacement
- Pasting is the same as bagging but the samples are drawn without replacement
- Bumping is training a model on several resamples and keeping the one that performs best on the original dataset
- Stacking is training a model on the predictions made by other models
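To make the bagging/pasting/blending distinction concrete, here is a dependency-free sketch where each "model" simply predicts the mean of its resampled training targets. The function name, the sample sizes, and the mean-predictor "model" are made up for illustration.

```python
import random
import statistics

def bagged_mean(y, n_models=10, with_replacement=True, seed=42):
    """Average the predictions of n_models resample-trained 'models'.

    with_replacement=True is bagging; False is pasting.
    """
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        if with_replacement:
            # Bagging: a bootstrap sample of the same size, drawn with replacement
            sample = [rng.choice(y) for _ in y]
        else:
            # Pasting: a smaller sample drawn without replacement
            sample = rng.sample(y, k=max(1, len(y) // 2))
        # The 'model' here is just the mean of its training targets
        preds.append(statistics.mean(sample))
    # Blending: average the individual models' predictions
    return statistics.mean(preds)

print(bagged_mean([1, 2, 3, 4, 5, 6]))
```

In practice the per-resample model would be a real estimator (e.g. a decision tree); the resampling and averaging logic stays the same.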

## Missing values
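A sketch of common imputation patterns with pandas; the column names and the chosen strategies are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'age': [25, None, 40], 'city': ['Paris', None, 'Lyon']})

# Keep the missingness signal before overwriting it
df['age_was_missing'] = df['age'].isna()

# Numeric column: impute with a central value
df['age'] = df['age'].fillna(df['age'].median())

# Categorical column: impute with an explicit "missing" level
df['city'] = df['city'].fillna('unknown')
```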

## Feature engineering

### Temporal features

Day of week, hour, and minute are cyclic ordinal features; cosine and sine transforms should be used to express the cycle. See this Stack Exchange discussion.

```python
from math import cos, sin, pi

hours = list(range(24))

# The full cycle is 24 hours, hence the 2 * pi factor
hours_cos = [cos(2 * pi * h / 24) for h in hours]
hours_sin = [sin(2 * pi * h / 24) for h in hours]
```

### Binning continuous variables
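A sketch with pandas: `pd.cut` for fixed edges, `pd.qcut` for quantile bins. The edges and labels are made up for illustration.

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 40, 67])

# Fixed-width bins with explicit edges
bins = pd.cut(ages, bins=[0, 18, 65, 100], labels=['minor', 'adult', 'senior'])

# Quantile bins: roughly equal counts per bin
halves = pd.qcut(ages, q=2, labels=['low', 'high'])

print(bins.tolist())  # ['minor', 'minor', 'adult', 'adult', 'senior']
```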

### Encoding categorical variables
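Two common encodings sketched with pandas; the toy `color` column is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'red', 'blue']})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df['color'], prefix='color')

# Ordinal encoding: map categories to integer codes
# (codes follow the lexical order of the categories)
df['color_code'] = df['color'].astype('category').cat.codes
```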

### Adstock transformation

Use the adstock transformation to take lag effects into account when measuring marketing campaign impacts.

```python
advertising = [6, 27, 0, 0, 20, 0, 20]  # marketing campaign intensities

# Each period carries over half of the previous period's effect
for i in range(1, len(advertising)):
    advertising[i] += advertising[i - 1] * 0.5

print(advertising)
# [6, 30.0, 15.0, 7.5, 23.75, 11.875, 25.9375]
```

## Dealing with unbalanced classes

- Read this
- Try under-sampling if there is a lot of data
- Try over-sampling if there is not a lot of data
- Always under/over-sample on the training set only. Don't resample the entire dataset before the train/test split; otherwise duplicates will exist between the two sets and the scores will be skewed
- Instead of predicting a class, predict a probability and pick a manual threshold to trade precision against recall as you wish
- Use class weights/costs
- Limit the over-represented class
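A sketch of the manual-threshold idea from the list above: predict probabilities and choose the cutoff yourself instead of relying on the default 0.5. The probabilities are made-up numbers for illustration.

```python
# Hypothetical predicted probabilities for the positive (rare) class
probas = [0.05, 0.30, 0.45, 0.60, 0.90]

# Lowering the threshold below 0.5 trades precision for recall:
# more positives are caught, at the cost of more false alarms
threshold = 0.3
preds = [int(p >= threshold) for p in probas]
print(preds)  # [0, 1, 1, 1, 1]
```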

## Timeseries forecasting

Spectral analysis (e.g. a periodogram) can uncover recurring patterns.
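A minimal sketch of the spectral-analysis idea with NumPy: build a periodogram from the FFT and read the cycle length off the dominant frequency. The synthetic series and its 12-step season are made up for illustration.

```python
import numpy as np

# Synthetic series: a 12-step seasonal cycle plus a little noise
rng = np.random.default_rng(0)
t = np.arange(120)
series = np.sin(2 * np.pi * t / 12) + 0.1 * rng.standard_normal(len(t))

# Periodogram via the FFT; subtract the mean to suppress the DC component
freqs = np.fft.rfftfreq(len(series))
spectrum = np.abs(np.fft.rfft(series - series.mean()))

dominant = freqs[np.argmax(spectrum)]
period = 1 / dominant
print(period)  # ≈ 12, the recurring cycle length
```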

## Kaggle tricks

- Adversarial validation can help make relevant cross-validation splits
- Pseudo-labeling: augment the training set with part of the labeled test set

## Target transformation

- Taking the log of the target, training, and then applying exp to the predictions is naive: because exp is convex, exp(E[log y]) underestimates E[y], so the back-transformed predictions are biased downwards
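One standard bias correction is Duan's smearing estimator: instead of assuming the exponentiated residuals average to 1, multiply exp(prediction) by their empirical mean. A sketch with made-up targets and made-up log-scale predictions:

```python
import math
import statistics

# Hypothetical targets and log-scale predictions for illustration
y = [10, 12, 50, 55, 200]
log_pred = [2.4, 2.4, 3.9, 3.9, 5.3]

# Residuals on the log scale
residuals = [math.log(a) - p for a, p in zip(y, log_pred)]

# Smearing factor: empirical mean of the exponentiated residuals
smearing = statistics.mean(math.exp(r) for r in residuals)

naive = [math.exp(p) for p in log_pred]                 # biased downwards
corrected = [math.exp(p) * smearing for p in log_pred]  # bias-corrected
```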