Machine Learning

from cider.datastore import DataStore
from cider.ml import Learner

Initialize the data store object and then the learner, which automatically loads the feature file produced by the featurizer along with the file of data labels, and merges features to labels.

# This path should point to your cider installation, where configs and data for this demo are located.
from pathlib import Path
cider_installation_directory = Path('../../cider')

datastore = DataStore(config_file_path_string=cider_installation_directory / 'configs' / 'config_quickstart.yml')
learner = Learner(datastore=datastore, clean_folders=True)
Number of observations with features: 1000 (1000 unique)
Number of observations with labels: 50 (50 unique)
Number of matched observations: 50 (50 unique)

Experiment quickly with untuned models to get a sense of accuracy. Lasso, ridge, random forest, and gradient boosting models are implemented natively; other models can be implemented by hand (a sketch follows the output below).

lasso_scores = learner.untuned_model('lasso')
randomforest_scores = learner.untuned_model('randomforest')
print('LASSO', lasso_scores)
print('Random Forest', randomforest_scores)
LASSO {'train_r2': '1.00 (0.00)', 'test_r2': '-0.19 (0.48)', 'train_rmse': '6.65 (-3.02)', 'test_rmse': '15665.32 (-1859.46)'}
Random Forest {'train_r2': '0.84 (0.01)', 'test_r2': '-0.02 (0.12)', 'train_rmse': '6200.40 (-319.50)', 'test_rmse': '14937.78 (-1482.62)'}
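The built-in models cover the common cases, but any estimator with a scikit-learn-style interface can be evaluated by hand on the same merged data. The sketch below is illustrative only, not part of the cider API: the file paths and column names ('features.csv', 'labels.csv', 'name', 'label') are assumptions standing in for wherever your features and labels live.

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVR

# Hypothetical paths and column names -- substitute your own files.
features = pd.read_csv('features.csv')  # one row per subscriber: 'name' plus feature columns
labels = pd.read_csv('labels.csv')      # columns: 'name', 'label'
merged = features.merge(labels, on='name')

X = merged.drop(columns=['name', 'label']).fillna(0)
y = merged['label']

# Cross-validated R^2 for a model cider does not implement natively
scores = cross_val_score(LinearSVR(max_iter=10000), X, y, cv=5, scoring='r2')
print('Linear SVR test r2: %.2f (%.2f)' % (scores.mean(), scores.std()))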

Fine-tune a gradient boosting model, selecting hyperparameters via cross-validation, and produce out-of-sample predictions for every labeled observation. Then generate predictions for all subscribers in the feature dataset.

gradientboosting_scores = learner.tuned_model('gradientboosting')
print('Gradient Boosting (Tuned)', gradientboosting_scores)
learner.oos_predictions('gradientboosting', kind='tuned')
learner.population_predictions('gradientboosting', kind='tuned')
[LightGBM] [Warning] min_data_in_leaf is set=10, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=10
Gradient Boosting (Tuned) {'train_r2': '0.80 (0.03)', 'test_r2': '0.07 (0.12)', 'train_rmse': '6808.46 (-409.84)', 'test_rmse': '14259.12 (-1920.92)'}
          name     predicted
0   dsBHAdXrrk  67249.768675
1   JGPCbfDGes  69289.940765
2   dYwshzRseD  85109.109600
3   ygMEXUQDbn  82018.137664
4   YtvkGlMWwe  76983.875546
..         ...           ...
5   amzyXHglBx  83987.844414
6   zZkqaZFAtz  82984.849245
7   uXZrufHOmE  87731.514440
8   dJSvXqUVSY  76191.160752
9   YosNCLWrFL  85338.034626

1000 rows × 2 columns
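The out-of-sample predictions also support diagnostics beyond the built-in plots. A minimal sketch, assuming oos_predictions returns a dataframe with 'true' and 'predicted' columns (the return value and column names are assumptions; check the output of your cider version):

from scipy.stats import spearmanr

# Assumption: a dataframe with one row per labeled observation and
# 'true' and 'predicted' columns; adjust names to match your version.
oos = learner.oos_predictions('gradientboosting', kind='tuned')
rho, _ = spearmanr(oos['true'], oos['predicted'])
print('Out-of-sample Spearman correlation: %.2f' % rho)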

Evaluate the model’s accuracy. Produce a scatterplot of true vs. predicted values with a LOESS fit and a bar plot of the most important features. Generate a table showing the targeting accuracy, precision, and recall of the predictions for nine hypothetical targeting scenarios (targeting between 10% and 90% of the population).

learner.scatter_plot('gradientboosting', kind='tuned')
learner.feature_importances_plot('gradientboosting', kind='tuned')
learner.targeting_table('gradientboosting', kind='tuned')
[Figure: scatterplot of true vs. predicted values with LOESS fit] [Figure: bar plot of feature importances]
  Proportion of Population Targeted  Pearson  Spearman   AUC  Accuracy  Precision  Recall
0                               10%     0.56       0.5  0.73       93%        65%     65%
1                               20%     0.56       0.5  0.73       81%        53%     53%
2                               30%     0.56       0.5  0.73       71%        52%     52%
3                               40%     0.56       0.5  0.73       67%        59%     59%
4                               50%     0.56       0.5  0.73       65%        65%     65%
5                               60%     0.56       0.5  0.73       69%        74%     74%
6                               70%     0.56       0.5  0.73       79%        85%     85%
7                               80%     0.56       0.5  0.73       75%        84%     84%
8                               90%     0.56       0.5  0.73       84%        91%     91%
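In each scenario the poorest p% of observations by true value are the intended targets, and the poorest p% by predicted value are the ones targeted; because the two groups have the same size, precision and recall coincide, as the table shows. A self-contained sketch of that computation on synthetic data (an illustration, not cider's implementation):

import numpy as np

def targeting_metrics(true, predicted, share):
    """Accuracy, precision, and recall when targeting the poorest `share` of the population."""
    true, predicted = np.asarray(true), np.asarray(predicted)
    is_poor = true <= np.quantile(true, share)                 # ground-truth targets
    is_targeted = predicted <= np.quantile(predicted, share)   # whom the model targets
    accuracy = (is_poor == is_targeted).mean()
    precision = is_poor[is_targeted].mean()  # fraction of targeted who are truly poor
    recall = is_targeted[is_poor].mean()     # fraction of the poor who are targeted
    return accuracy, precision, recall

# Synthetic example: noisy predictions of a consumption-like outcome
rng = np.random.default_rng(0)
true = rng.normal(loc=75000, scale=10000, size=1000)
predicted = true + rng.normal(scale=10000, size=1000)
print(targeting_metrics(true, predicted, share=0.3))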