PyCaret


PyCaret is an open-source, low-code machine learning library in Python that aims to reduce the cycle time from hypothesis to insights. It is well suited both for seasoned data scientists who want to make their ML experiments more productive and for citizen data scientists or newcomers with little or no coding background. PyCaret lets you go from preparing your data to deploying your model in minutes, in your choice of notebook environment.


Install PyCaret

pip install pycaret
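A quick sanity check that the install worked; per PyCaret's docs for the 1.x/2.x releases, a version() helper is available:

from pycaret.utils import version

version()  # prints the installed PyCaret version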


setup()

setup() is the first function you have to call: it initializes the environment and configures all of the data transformations that will be applied before modeling. The only required parameters are data and target.
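Note that setup() and the other functions used below come from the task-specific module, which the snippets here assume has been star-imported:

# For a classification experiment:
from pycaret.classification import *

# ...or, for the regression example further down:
# from pycaret.regression import *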

Classification

clf1 = setup(data = train,
             target = 'Survived',
             numeric_imputation = 'mean',
             categorical_features = ['Sex','Embarked'],
             ignore_features = ['Name','Ticket','Cabin'],
             silent = True)



target: the value we are trying to predict

numeric_imputation: what to replace missing numerical values with

categorical_features: which features (columns) are categorical

ignore_features: which features to exclude from training

silent: when True, confirmation of the inferred data types is skipped and preprocessing runs automatically
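For context, the train DataFrame here is assumed to be the Kaggle Titanic training set; a minimal sketch of loading it:

import pandas as pd

# Assumption: train.csv is the Kaggle Titanic training file in the working directory
train = pd.read_csv('train.csv')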

Regression

reg = setup(data = train,
            target = 'SalePrice',
            numeric_imputation = 'mean',
            categorical_features = ['MSZoning','Exterior1st','Exterior2nd','KitchenQual','Functional','SaleType',
                                    'Street','LotShape','LandContour','LotConfig','LandSlope','Neighborhood',
                                    'Condition1','Condition2','BldgType','HouseStyle','RoofStyle','RoofMatl',
                                    'MasVnrType','ExterQual','ExterCond','Foundation','BsmtQual','BsmtCond',
                                    'BsmtExposure','BsmtFinType1','BsmtFinType2','Heating','HeatingQC','CentralAir',
                                    'Electrical','GarageType','GarageFinish','GarageQual','GarageCond','PavedDrive',
                                    'SaleCondition'],
            ignore_features = ['Alley','PoolQC','MiscFeature','Fence','FireplaceQu','Utilities'],
            normalize = True,
            silent = True)


target, numeric_imputation, categorical_features, ignore_features, and silent behave exactly as in the classification setup above. The one new parameter:

normalize: normalizes your numerical values. You can choose the normalization method with normalize_method, as in the sketch below.
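A sketch of picking a scaling method, assuming this PyCaret version supports the documented normalize_method options ('zscore' is the default; 'minmax', 'maxabs', and 'robust' are alternatives):

reg = setup(data = train,
            target = 'SalePrice',
            normalize = True,
            normalize_method = 'minmax',  # rescale numeric features to [0, 1]
            silent = True)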

compare_models()

Classification
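The call itself needs no arguments: it trains and cross-validates every estimator in PyCaret's library and prints a scoring grid like the one below. In recent PyCaret versions it also returns the top-scoring model, so a minimal sketch is:

best = compare_models()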



Sample output of compare_models() for classification

  • Accuracy - How many classifications were accurate

  • AUC - The area under the ROC curve, which plots the True Positive Rate against the False Positive Rate

    • AUC is scale-invariant because it measures how predictions are ranked, not their absolute values

    • It is classification-threshold-invariant because it measures the quality of predictions irrespective of the classification threshold

      • You don't always want this if the classification threshold matters for your problem

  • Recall - Also known as the True Positive Rate: the number of true positives divided by the sum of true positives and false negatives

    • $TPR = \frac{TP}{TP + FN}$

  • Precision - The proportion of predicted positive cases that were correctly identified

    • $Precision = \frac{TP}{TP + FP}$

  • F1 - Harmonic mean of precision and recall

    • $F1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$

  • Kappa - Compares the observed accuracy with the expected accuracy (random chance)

    • $\kappa = \frac{\text{observed accuracy} - \text{expected accuracy}}{1 - \text{expected accuracy}}$

  • MCC - Produces a high score only if the prediction obtained good results in all four confusion matrix categories

    • $MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
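To make these definitions concrete, here is a minimal scikit-learn sketch computing the same metrics (y_true, y_pred, and y_score are toy values; AUC is computed from probability scores rather than hard labels):

from sklearn.metrics import (accuracy_score, roc_auc_score, recall_score,
                             precision_score, f1_score, cohen_kappa_score,
                             matthews_corrcoef)

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                   # ground-truth labels
y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]                   # hard predictions
y_score = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]   # predicted probabilities

print('Accuracy :', accuracy_score(y_true, y_pred))
print('AUC      :', roc_auc_score(y_true, y_score))  # uses scores, not labels
print('Recall   :', recall_score(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))
print('F1       :', f1_score(y_true, y_pred))
print('Kappa    :', cohen_kappa_score(y_true, y_pred))
print('MCC      :', matthews_corrcoef(y_true, y_pred))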

Regression

compare_models()




Sample output of compare_models() for regression

  • MAE - Mean Absolute Error. Errors contribute linearly, so large errors are not penalized disproportionately.

  • MSE - Mean Squared Error. Squaring the errors penalizes large errors heavily.

  • RMSE - Root Mean Squared Error. Penalizes large errors while staying in the same units as the target.

  • R2 - Measures the strength of the relationship between the independent and dependent variables: the proportion of the variance of the dependent variable explained by the independent variables.

  • RMSLE - Root Mean Squared Log Error

  • MAPE - Mean Absolute Percentage Error
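Again as a minimal scikit-learn sketch with toy values (mean_absolute_percentage_error assumes scikit-learn >= 0.24):

import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error, r2_score,
                             mean_squared_log_error,
                             mean_absolute_percentage_error)

y_true = [3.0, 5.0, 2.5, 7.0]   # actual target values
y_pred = [2.5, 5.0, 4.0, 8.0]   # model predictions

mse = mean_squared_error(y_true, y_pred)
print('MAE  :', mean_absolute_error(y_true, y_pred))
print('MSE  :', mse)
print('RMSE :', np.sqrt(mse))
print('R2   :', r2_score(y_true, y_pred))
print('RMSLE:', np.sqrt(mean_squared_log_error(y_true, y_pred)))
print('MAPE :', mean_absolute_percentage_error(y_true, y_pred))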

create_model()

Classification

lgbm  = create_model('lightgbm')


Regression

lb = create_model('lightgbm') # CatBoost was not available in this environment, so LightGBM is used instead
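create_model() takes the estimator's ID string. A few other IDs it accepts, as a sketch (in recent PyCaret versions, models() prints the full table of available IDs):

# Assumption: standard IDs from PyCaret's model zoo
lr  = create_model('lr')       # logistic regression (classification) / linear regression (regression)
rf  = create_model('rf')       # random forest
xgb = create_model('xgboost')  # XGBoost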


tune_model()

Tunes the hyperparameters of a model and scores it using cross-validation (stratified k-fold for classification).

Classification

tuned_lightgbm = tune_model(lgbm) # Pass in the model object itself, not the model's name string


Regression

tuned_lb = tune_model(lb)
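The search can also be customized; a sketch assuming the documented n_iter and optimize parameters:

# Try 50 hyperparameter candidates and optimize for RMSE instead of the default metric
tuned_lb = tune_model(lb, n_iter = 50, optimize = 'RMSE')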


plot_model()

Classification

Learning Curve

plot_model(estimator = tuned_lightgbm, plot = 'learning')

AUC Curve

plot_model(estimator = tuned_lightgbm, plot = 'auc')




Confusion Matrix

plot_model(estimator = tuned_lightgbm, plot = 'confusion_matrix')




Feature Importance

plot_model(estimator = tuned_lightgbm, plot = 'feature')




Regression
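The same plot calls work against the tuned regression model, for example (the residuals plot is assumed to be available for regressors in this PyCaret version):

plot_model(estimator = tuned_lb, plot = 'learning')   # learning curve
plot_model(estimator = tuned_lb, plot = 'residuals')  # residuals vs. predictions
plot_model(estimator = tuned_lb, plot = 'feature')    # feature importance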

evaluate_model()

Classification

evaluate_model(tuned_lightgbm)


Regression

evaluate_model(tuned_lb)

interpret_model()

The interpret_model() function helps you analyze a model by showing which features drive its predictions. It plots SHAP (SHapley Additive exPlanations) values.

Classification

interpret_model(tuned_lightgbm)




Regression

interpret_model(tuned_lb)




Reason Plot

interpret_model(tuned_lb, plot = 'reason', observation = 10)


predict_model()

Classification

predict_model(tuned_lightgbm, data=test)


Regression


predictions = predict_model(tuned_lb, data = test)
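predict_model() returns the input DataFrame with the predictions appended (as a Label column in this PyCaret version, plus Score for classification). A sketch of turning that into a Kaggle submission, assuming the test set has an Id column:

# Assumption: 'Label' holds the predicted SalePrice and 'Id' identifies each row
submission = predictions[['Id', 'Label']].rename(columns = {'Label': 'SalePrice'})
submission.to_csv('submission.csv', index = False)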

__________________________________________________________________________________

Google Colab LINK

Kaggle LINK

Official Site here



