PyCaret
PyCaret is an open-source, low-code machine learning library in Python that aims to reduce the cycle time from hypothesis to insights. It is well suited for seasoned data scientists who want to increase the productivity of their ML experiments by using PyCaret in their workflows or for citizen data scientists and those new to data science with little or no background in coding. PyCaret allows you to go from preparing your data to deploying your model within seconds using your choice of notebook environment.
Install PyCaret
pip install pycaret
setup()
setup() must be called before any other PyCaret function. It initializes the environment and configures all of the data transformations that will be applied to your models. The only required parameters are data and target.
Classification
clf1 = setup(data = train,
             target = 'Survived',
             numeric_imputation = 'mean',
             categorical_features = ['Sex','Embarked'],
             ignore_features = ['Name','Ticket','Cabin'],
             silent = True)
target: What value are we trying to predict
numeric_imputation: If we're missing numerical values, what do we replace them with
categorical_features: Which features (columns) are categorical
ignore_features: What features would you like to ignore
silent: When True, confirmation of the inferred data types is skipped and preprocessing runs automatically
Regression
reg = setup(data = train,
            target = 'SalePrice',
            numeric_imputation = 'mean',
            categorical_features = ['MSZoning','Exterior1st','Exterior2nd','KitchenQual','Functional','SaleType',
                                    'Street','LotShape','LandContour','LotConfig','LandSlope','Neighborhood',
                                    'Condition1','Condition2','BldgType','HouseStyle','RoofStyle','RoofMatl',
                                    'MasVnrType','ExterQual','ExterCond','Foundation','BsmtQual','BsmtCond',
                                    'BsmtExposure','BsmtFinType1','BsmtFinType2','Heating','HeatingQC','CentralAir',
                                    'Electrical','GarageType','GarageFinish','GarageQual','GarageCond','PavedDrive',
                                    'SaleCondition'],
            ignore_features = ['Alley','PoolQC','MiscFeature','Fence','FireplaceQu','Utilities'],
            normalize = True,
            silent = True)
target: What value are we trying to predict
numeric_imputation: If we're missing numerical values, what do we replace them with
categorical_features: Which features (columns) are categorical
ignore_features: What features would you like to ignore
normalize: Normalizes your numerical values. You can define the normalization method using normalize_method
silent: When True, confirmation of the inferred data types is skipped and preprocessing runs automatically
compare_models()
Classification
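The classification call is a one-liner once setup() has run; a minimal sketch, assuming the PyCaret 2.x behavior where compare_models() trains and cross-validates every available classifier and returns the top-scoring one (best_model is just an illustrative name):
best_model = compare_models()  # prints a scoring grid, sorted by Accuracy by default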
Sample Output
Accuracy - The proportion of predictions that were correct
AUC - The area under the ROC curve, which plots the True Positive Rate against the False Positive Rate
AUC is scale-invariant because it measures the rankings of predictions, not absolute values
It's classification-threshold-invariant because it measures the quality of predictions irrespective of what classification threshold is used
This is not always desirable, e.g. when performance at one specific classification threshold is what matters
Recall - Also known as the True Positive Rate, it is the number of true positives divided by the number of true positives plus false negatives
$TPR = \frac{TP}{TP + FN}$
Precision - The proportion of predicted positive cases that were actually positive
$Precision = \frac{TP}{TP+FP}$
F1 - The harmonic mean of precision and recall
$F1 = 2 * \frac{precision * recall}{precision + recall}$
Kappa - Compares the observed accuracy with the expected accuracy (random chance)
$Kappa = \frac{observed\ acc - expected\ acc}{1 - expected\ acc}$
MCC - Produces a high score only if the prediction obtained good results in all four confusion matrix categories
$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
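These are all standard definitions, so you can sanity-check PyCaret's scoring grid by hand; a minimal sketch using scikit-learn (one of PyCaret's own dependencies), with toy label vectors:
from sklearn.metrics import (accuracy_score, roc_auc_score, recall_score,
                             precision_score, f1_score, cohen_kappa_score,
                             matthews_corrcoef)

y_true = [0, 1, 1, 0, 1, 0, 1, 1]                   # ground-truth labels (toy data)
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]                   # hard class predictions
y_prob = [0.2, 0.9, 0.4, 0.1, 0.8, 0.6, 0.7, 0.9]   # positive-class scores

print('Accuracy :', accuracy_score(y_true, y_pred))
print('AUC      :', roc_auc_score(y_true, y_prob))  # ranks scores, not hard labels
print('Recall   :', recall_score(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))
print('F1       :', f1_score(y_true, y_pred))
print('Kappa    :', cohen_kappa_score(y_true, y_pred))
print('MCC      :', matthews_corrcoef(y_true, y_pred))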
Regression
compare_models()
Example output of Regression Compare Models
MAE - Mean Absolute Error. It does not disproportionately penalize large errors.
MSE - Mean Squared Error. It penalizes large errors more heavily than small ones.
RMSE - Root Mean Squared Error. Like MSE it penalizes large errors, but it is expressed in the same units as the target.
R2 - The coefficient of determination: the proportion of the variance in the dependent variable that is explained by the independent variables
RMSLE - Root Mean Squared Log Error
MAPE - Mean Absolute Percentage Error
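These likewise match scikit-learn's implementations, so a quick hand check is easy; a sketch with toy numbers (mean_absolute_percentage_error requires scikit-learn >= 0.24):
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             r2_score, mean_squared_log_error,
                             mean_absolute_percentage_error)

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # toy targets
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # toy predictions

print('MAE  :', mean_absolute_error(y_true, y_pred))
print('MSE  :', mean_squared_error(y_true, y_pred))
print('RMSE :', np.sqrt(mean_squared_error(y_true, y_pred)))
print('R2   :', r2_score(y_true, y_pred))
print('RMSLE:', np.sqrt(mean_squared_log_error(y_true, y_pred)))
print('MAPE :', mean_absolute_percentage_error(y_true, y_pred))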
create_model()
Classification
lgbm = create_model('lightgbm')
Regression
lb = create_model('lightgbm') # CatBoost was not available for some reason
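create_model() trains a single estimator identified by a string ID ('lightgbm', 'rf', 'dt', etc.) and cross-validates it, with 10 folds by default in PyCaret 2.x. Assuming that API, the fold count can be overridden:
dt = create_model('dt', fold = 5)  # a decision tree with 5-fold CV instead of the default 10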
tune_model()
Tunes the hyperparameters of a model and scores the result using cross-validation (stratified k-fold for classification)
Classification
tuned_lightgbm = tune_model(lgbm) # Pass in a model object, not the model's name string (unlike some older PyCaret tutorials/workbooks)
Regression
tuned_lb = tune_model(lb)
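tune_model() also exposes tuning knobs; a sketch, assuming the PyCaret 2.x signature where n_iter sets the number of random-search iterations and optimize picks the metric to maximize:
tuned_lb = tune_model(lb, n_iter = 50, optimize = 'RMSE')  # more iterations, tuned for RMSE instead of the default metric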
plot_model()
Classification
Learning Curve
plot_model(estimator = tuned_lightgbm, plot = 'learning')
AUC Curve
plot_model(estimator = tuned_lightgbm, plot = 'auc')
Confusion Matrix
plot_model(estimator = tuned_lightgbm, plot = 'confusion_matrix')
Feature Importance
plot_model(estimator = tuned_lightgbm, plot = 'feature')
Regression
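The same function covers regression models; a sketch, assuming the 'residuals' and 'error' plot types available in PyCaret 2.x:
plot_model(estimator = tuned_lb, plot = 'residuals')  # residuals vs. predicted values
plot_model(estimator = tuned_lb, plot = 'error')      # prediction error (predicted vs. actual)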
evaluate_model()
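evaluate_model() renders an interactive widget in the notebook with a button for every available plot, so you can click through all of the visualizations for a trained model instead of calling plot_model() once per plot type.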
Classification
evaluate_model(tuned_lightgbm)
Regression
evaluate_model(tuned_lb)
interpret_model()
The interpret_model() method helps you analyze a model by showing which features matter to it. It plots SHAP (SHapley Additive exPlanations) values.
Classification
interpret_model(tuned_lightgbm)
Regression
interpret_model(tuned_lb)
Reason Plot
interpret_model(tuned_lb, plot = 'reason', observation = 10)
predict_model()
Classification
predict_model(tuned_lightgbm, data=test)
Regression
predictions = predict_model(tuned_lb, data = test)
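predict_model() returns a copy of the data with the predictions appended; in PyCaret 2.x the predicted value lands in a Label column (classification additionally gets a Score column with the class probability). Assuming that column naming:
predictions['Label'].head()  # the predicted SalePrice values (PyCaret 2.x naming)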
__________________________________________________________________________________
Google Colab LINK
Kaggle LINK
Official Site here
If you have any doubts, please let me know.