StatQuest is a YouTube channel; the following are some notes from those videos.
1. ML Intro
https://youtu.be/Gv9_4yMHFhI
1. Fit a line to the data to show the trend or to make predictions.
2. So ML is all about making predictions & classifications.
You can fit a straight line or a squiggle (which connects all the training data), but all that matters is how good your predictions are. Whether it's a line or a squiggle, the model should do well on the testing data.
Once you have a model, measure the difference between the real values and the predictions on the testing data. Do this for both the straight line and the squiggle; whichever gives the smaller differences, choose that one.
SO ---
A model that fits the training data well but makes poor predictions is overfit; the tension between fitting the training data and predicting well is the BIAS-VARIANCE TRADEOFF (more in section 5).
It's all about which model works well on our testing data.
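A minimal sketch of that idea in Python (the data and both models are invented for illustration, not from the video): fit a straight line and a "squiggle" (a high-degree polynomial) to the training data, then choose whichever has the smaller error on the testing data.

```python
# Sketch only: compare a straight line vs. a "squiggle" on TESTING data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 1))             # invented 1-D data
y = np.sin(X).ravel() + rng.normal(0, 0.3, 60)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

line = LinearRegression().fit(X_train, y_train)
squiggle = make_pipeline(PolynomialFeatures(degree=9), LinearRegression()).fit(X_train, y_train)

# Compare real vs. predicted values on the testing data only;
# the model with the smaller error wins.
for name, model in [("straight line", line), ("squiggle", squiggle)]:
    print(name, mean_squared_error(y_test, model.predict(X_test)))
```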
2. Cross Validation
https://youtu.be/fSytzGwwBVw
How do we decide which ML method to use? That's what cross validation is for.
We could divide the data 75% / 25% for training & testing, but how do we know that particular split is a fair one? That's where cross validation comes to the rescue: it tries all the combinations, so every block of data gets a turn as the testing set.
Four-fold cross validation
Leave-one-out cross validation
In practice, people use ten-fold cross validation. How to use it?
Tuning parameter: a setting the model can't estimate from the training data directly (e.g., the number of neighbors in k-nearest neighbors); cross validation is also used to pick a good value for it, as in the sketch below.
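A rough sketch of ten-fold cross validation with scikit-learn (the dataset and model are my own example, not from the video); it also uses CV to pick a tuning parameter, the number of neighbors k:

```python
# Sketch: 10-fold cross validation, also used to pick a tuning parameter.
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

for k in [1, 3, 5, 7]:
    model = KNeighborsClassifier(n_neighbors=k)
    # cv=10 splits the data into 10 blocks; each block takes one turn
    # as the testing data while the other 9 blocks are used for training.
    scores = cross_val_score(model, X, y, cv=10)
    print(f"k={k}: mean accuracy {scores.mean():.3f}")
```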
3. Confusion Matrix
https://youtu.be/Kdsp6soqA7o
Heart disease prediction example.
You need to summarize how each method performed on the testing data, and that's what a Confusion Matrix is for.
                             ACTUAL
                             Has Heart Disease    Doesn't Have Heart Disease
PRED  Has Heart Disease      True Positive        False Positive
      Doesn't Have Disease   False Negative       True Negative
Calculate a confusion matrix like this for each of the different ML models; whichever model's confusion matrix looks best, choose that one.
Favorite Movie example:
                        ACTUAL
                        Troll 2    Gore Police    Cool as Ice
PRED  Troll 2
      Gore Police
      Cool as Ice
So the size of the confusion matrix depends on how many categories you are predicting: 2 categories give a 2 x 2 matrix, 5 give a 5 x 5, 40 give a 40 x 40.
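A quick sketch of building a confusion matrix with scikit-learn (the labels and predictions are invented for illustration). Note that scikit-learn puts actual values in rows and predictions in columns, the transpose of the layout used in these notes:

```python
# Sketch: build a confusion matrix from actual vs. predicted labels.
from sklearn.metrics import confusion_matrix

actual    = ["disease", "disease", "no disease", "no disease", "disease", "no disease"]
predicted = ["disease", "no disease", "no disease", "disease", "disease", "no disease"]

# Rows = actual, columns = predicted, in the order given by `labels`.
cm = confusion_matrix(actual, predicted, labels=["disease", "no disease"])
print(cm)
# With 5 possible labels this would be a 5 x 5 matrix, with 40 labels a 40 x 40, etc.
```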
4. Sensitivity & Specificity
https://youtu.be/vP06aMoz4v8
A confusion matrix with 2 rows & 2 columns:
Rows are the predictions
Columns are the truth
                             ACTUAL
                             Has Heart Disease    Doesn't Have Heart Disease
PRED  Has Heart Disease      True Positive        False Positive
      Doesn't Have Disease   False Negative       True Negative
Sensitivity -- what % of patients WITH heart disease are correctly identified: True Positives / (True Positives + False Negatives)
Specificity -- what % of patients WITHOUT heart disease are correctly identified: True Negatives / (True Negatives + False Positives)
The following values are from a logistic regression model:
                             ACTUAL
                             Has Heart Disease    Doesn't Have Heart Disease
PRED  Has Heart Disease      139                  20
      Doesn't Have Disease   32                   112
Sensitivity = 139/(139+32) = 0.81
Specificity = 112/(112+20) = 0.85
The following values are from a random forest model:
                             ACTUAL
                             Has Heart Disease    Doesn't Have Heart Disease
PRED  Has Heart Disease      142                  22
      Doesn't Have Disease   29                   110
Sensitivity = 142/(142+29) = 0.83
Specificity = 110/(110+22) = 0.83
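Checking that arithmetic in code (the numbers are copied straight from the two matrices above):

```python
# Verify the sensitivity/specificity calculations for both models.
def sensitivity_specificity(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)  # % of patients WITH the disease correctly found
    specificity = tn / (tn + fp)  # % of patients WITHOUT the disease correctly found
    return sensitivity, specificity

print("Logistic Regression:", sensitivity_specificity(tp=139, fp=20, fn=32, tn=112))  # (0.81, 0.85)
print("Random Forest:", sensitivity_specificity(tp=142, fp=22, fn=29, tn=110))        # (0.83, 0.83)
```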
--------------- Confusion matrices with 3 or more categories
Favorite Movie example:
                        ACTUAL
                        Troll 2    Gore Police    Cool as Ice
PRED  Troll 2           12         102            93
      Gore Police       112        23             77
      Cool as Ice       83         92             17
Sensitivity for Troll 2 = True Positives for Troll 2 / (True Positives for Troll 2 + False Negatives for Troll 2)
Sensitivity = 12/(12 + (112+83)) = 0.06
Specificity for Troll 2 = True Negatives for Troll 2 / (True Negatives for Troll 2 + False Positives for Troll 2)
Specificity = (23+77+92+17)/((23+77+92+17) + (102+93)) = 0.52
and the same goes for Gore Police and Cool as Ice.
Use sensitivity & specificity to decide which ML method does the best job of correctly identifying the data, or of making predictions.
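A sketch of that Troll 2 calculation in code, written so it works for any class of a bigger confusion matrix (rows = predicted, columns = actual, matching the layout above):

```python
# Sensitivity & specificity for one class of a multi-class confusion matrix.
import numpy as np

cm = np.array([[ 12, 102,  93],   # predicted Troll 2
               [112,  23,  77],   # predicted Gore Police
               [ 83,  92,  17]])  # predicted Cool as Ice

c = 0                              # index of the Troll 2 row/column
tp = cm[c, c]
fn = cm[:, c].sum() - tp           # actually Troll 2, predicted something else
fp = cm[c, :].sum() - tp           # predicted Troll 2, actually something else
tn = cm.sum() - tp - fn - fp       # everything else

print("Sensitivity:", tp / (tp + fn))  # 12/207  ~ 0.06
print("Specificity:", tn / (tn + fp))  # 209/404 ~ 0.52
```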
5. Bias & Variance
https://youtu.be/EuBBz3bI-aA
Height vs. Weight of Mice example.
1. Linear Regression (Least Squares) fits a straight line to the dataset.
Here, since the straight line can't bend to touch all the data points (you would need a curved line for that), it can't capture the true relationship between weight & height. That inability of a model (in this case, linear regression) to capture the true relationship is called bias.
2. Another ML method might fit a squiggly line, which has very little bias on the training data.
Now we measure the distances from the fitted lines to the data points, square them (so negative distances don't cancel out), and add them up. We do this for both the training data and the testing data. The difference in fits between datasets is called variance.
On the training data the squiggly line wins, but it fails on the testing data, and that is what we call overfitting in ML.
In ML, the ideal model has low bias (it can accurately model the true relationship) and also low variability (it produces consistent predictions across different datasets).
Three commonly used methods for finding the sweet spot between simple and complicated models are regularization, boosting, and bagging; see the sketch below.
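As a hedged sketch of the first of those methods, regularization: ridge regression penalizes large coefficients, which tames the squiggle. The data and alpha values are invented for illustration; a bigger alpha means more regularization, and the sweet spot is wherever the testing error is lowest.

```python
# Sketch: regularization (ridge) trading training fit for testing fit.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(40, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 40)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

for alpha in [1e-6, 1e-3, 0.1, 10.0]:  # bigger alpha = more regularization
    model = make_pipeline(PolynomialFeatures(degree=12), Ridge(alpha=alpha)).fit(X_tr, y_tr)
    print(f"alpha={alpha}: "
          f"train MSE {mean_squared_error(y_tr, model.predict(X_tr)):.3f}, "
          f"test MSE {mean_squared_error(y_te, model.predict(X_te)):.3f}")
```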
6. ROC & AUC
https://youtu.be/4jRBRDbJemM
Obese vs. Weight example.
Logistic regression with classification threshold 0.5.
Now create a confusion matrix from the logistic regression model's results on the testing data:
                        ACTUAL
                        Is Obese    Is Not Obese
PRED  Is Obese          3           1
      Is Not Obese      1           3
and then calculate sensitivity and specificity.
Now, what if we pick a different threshold instead of 0.5? That depends on how costly false positives are versus false negatives.
E.g., has Ebola or doesn't: it's important not to miss any Ebola cases. So how do we decide which threshold to set?
You could create a different confusion matrix for every threshold, but that would be too tedious.
So instead, use an ROC (Receiver Operating Characteristic) graph.
Basically, calculate the sensitivity (true positive rate) and 1 - specificity (false positive rate) for each threshold and plot those points.
Anything on the diagonal line means the proportion of true positives equals the proportion of false positives.
Area Under the Curve [AUC]
Here, the logistic regression ROC curve sits above the random forest ROC curve, so logistic regression has the larger AUC and is the better model.
Precision = True Positives / (True Positives + False Positives). Precision doesn't involve true negatives, so it can be more informative than 1 - specificity when the data is imbalanced (e.g., very few positive cases).
ROC graphs are basically used to identify which threshold to use out of the many possibilities.
AUC helps choose between models by comparing their ROC graphs.
https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/
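A rough sketch of computing an ROC curve and AUC with scikit-learn (the dataset, model, and threshold-picking rule below are my own choices, not from the video):

```python
# Sketch: ROC curve points (one per threshold) and the AUC.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]   # predicted probability of class 1

# One (1 - specificity, sensitivity) point per threshold.
fpr, tpr, thresholds = roc_curve(y_te, probs)
print("AUC:", roc_auc_score(y_te, probs))

# One simple rule: pick the threshold closest to the top-left corner.
best = np.argmin(fpr**2 + (1 - tpr)**2)
print("Suggested threshold:", thresholds[best])
```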