StatQuest is a YouTube channel; the following are some notes from those videos.
1. ML Intro
https://youtu.be/Gv9_4yMHFhI
1. Fit a line to the data to show the trend or to make predictions.
2. So ML is all about making predictions & classifications.
You can fit a straight line or a squiggle (which connects all the training data), but all that matters is how good your predictions are. Whether it's a line or a squiggle, the model should do well on the testing data.
Once you have a model, measure the difference between the real values and the predictions on the testing data. Do this for both the straight line and the squiggle; whichever gives the smaller differences, choose that one.
SO ---
A model that fits the training data well but makes poor predictions is overfit; the tension between fitting the training data and predicting well is the BIAS-VARIANCE TRADEOFF (more in section 5).
It's all about which model works well on our testing data.
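A minimal sketch of that idea in Python (the data and both models are invented for illustration, not from the video): fit a straight line and a "squiggle" (a high-degree polynomial) to the training data, then choose whichever has the smaller error on the testing data.

```python
# Sketch only: compare a straight line vs. a "squiggle" on TESTING data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 1))             # invented 1-D data
y = np.sin(X).ravel() + rng.normal(0, 0.3, 60)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

line = LinearRegression().fit(X_train, y_train)
squiggle = make_pipeline(PolynomialFeatures(degree=9), LinearRegression()).fit(X_train, y_train)

# Compare real vs. predicted values on the testing data only;
# the model with the smaller error wins.
for name, model in [("straight line", line), ("squiggle", squiggle)]:
    print(name, mean_squared_error(y_test, model.predict(X_test)))
```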
2. Cross Validation
https://youtu.be/fSytzGwwBVw
How do we decide which ML method to use? That's what cross validation is for.
We could divide the data 75% / 25% for training & testing, but how do we know that particular split is a fair one? That's where cross validation comes to the rescue: it tries all the combinations, so every block of data gets a turn as the testing set.
Four-fold cross validation
Leave-one-out cross validation
In practice, people use ten-fold cross validation. How to use it?
Tuning parameter: a setting the model can't estimate from the training data directly (e.g., the number of neighbors in k-nearest neighbors); cross validation is also used to pick a good value for it, as in the sketch below.
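A rough sketch of ten-fold cross validation with scikit-learn (the dataset and model are my own example, not from the video); it also uses CV to pick a tuning parameter, the number of neighbors k:

```python
# Sketch: 10-fold cross validation, also used to pick a tuning parameter.
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

for k in [1, 3, 5, 7]:
    model = KNeighborsClassifier(n_neighbors=k)
    # cv=10 splits the data into 10 blocks; each block takes one turn
    # as the testing data while the other 9 blocks are used for training.
    scores = cross_val_score(model, X, y, cv=10)
    print(f"k={k}: mean accuracy {scores.mean():.3f}")
```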
3. Confusion Matrix
https://youtu.be/Kdsp6soqA7o
Heart disease prediction example.
You need to summarize how each method performed on the testing data, and that's what a Confusion Matrix is for.
                             ACTUAL
                             Has Heart Disease    Doesn't Have Heart Disease
PRED  Has Heart Disease      True Positive        False Positive
      Doesn't Have Disease   False Negative       True Negative
Calculate a confusion matrix like this for each of the different ML models; whichever model's confusion matrix looks best, choose that one.
Favorite Movie example:
                        ACTUAL
                        Troll 2    Gore Police    Cool as Ice
PRED  Troll 2
      Gore Police
      Cool as Ice
So the size of the confusion matrix depends on how many categories you are predicting: 2 categories give a 2 x 2 matrix, 5 give a 5 x 5, 40 give a 40 x 40.
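A quick sketch of building a confusion matrix with scikit-learn (the labels and predictions are invented for illustration). Note that scikit-learn puts actual values in rows and predictions in columns, the transpose of the layout used in these notes:

```python
# Sketch: build a confusion matrix from actual vs. predicted labels.
from sklearn.metrics import confusion_matrix

actual    = ["disease", "disease", "no disease", "no disease", "disease", "no disease"]
predicted = ["disease", "no disease", "no disease", "disease", "disease", "no disease"]

# Rows = actual, columns = predicted, in the order given by `labels`.
cm = confusion_matrix(actual, predicted, labels=["disease", "no disease"])
print(cm)
# With 5 possible labels this would be a 5 x 5 matrix, with 40 labels a 40 x 40, etc.
```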
4. Sensitivity & Specificity
https://youtu.be/vP06aMoz4v8
A confusion matrix with 2 rows & 2 columns:
Rows are the predictions
Columns are the truth
                             ACTUAL
                             Has Heart Disease    Doesn't Have Heart Disease
PRED  Has Heart Disease      True Positive        False Positive
      Doesn't Have Disease   False Negative       True Negative
Sensitivity -- what % of patients WITH heart disease are correctly identified: True Positives / (True Positives + False Negatives)
Specificity -- what % of patients WITHOUT heart disease are correctly identified: True Negatives / (True Negatives + False Positives)
The following values are from a logistic regression model:
                             ACTUAL
                             Has Heart Disease    Doesn't Have Heart Disease
PRED  Has Heart Disease      139                  20
      Doesn't Have Disease   32                   112
Sensitivity = 139/(139+32) = 0.81
Specificity = 112/(112+20) = 0.85
The following values are from a random forest model:
                             ACTUAL
                             Has Heart Disease    Doesn't Have Heart Disease
PRED  Has Heart Disease      142                  22
      Doesn't Have Disease   29                   110
Sensitivity = 142/(142+29) = 0.83
Specificity = 110/(110+22) = 0.83
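Checking that arithmetic in code (the numbers are copied straight from the two matrices above):

```python
# Verify the sensitivity/specificity calculations for both models.
def sensitivity_specificity(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)  # % of patients WITH the disease correctly found
    specificity = tn / (tn + fp)  # % of patients WITHOUT the disease correctly found
    return sensitivity, specificity

print("Logistic Regression:", sensitivity_specificity(tp=139, fp=20, fn=32, tn=112))  # (0.81, 0.85)
print("Random Forest:", sensitivity_specificity(tp=142, fp=22, fn=29, tn=110))        # (0.83, 0.83)
```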
--------------- Confusion matrices with 3 or more categories
Favorite Movie example:
                        ACTUAL
                        Troll 2    Gore Police    Cool as Ice
PRED  Troll 2           12         102            93
      Gore Police       112        23             77
      Cool as Ice       83         92             17
Sensitivity for Troll 2 = True Positives for Troll 2 / (True Positives for Troll 2 + False Negatives for Troll 2)
Sensitivity = 12/(12 + (112+83)) = 0.06
Specificity for Troll 2 = True Negatives for Troll 2 / (True Negatives for Troll 2 + False Positives for Troll 2)
Specificity = (23+77+92+17)/((23+77+92+17) + (102+93)) = 0.52
and the same goes for Gore Police and Cool as Ice.
Use sensitivity & specificity to decide which ML method does the best job of correctly identifying the data, or of making predictions.
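A sketch of that Troll 2 calculation in code, written so it works for any class of a bigger confusion matrix (rows = predicted, columns = actual, matching the layout above):

```python
# Sensitivity & specificity for one class of a multi-class confusion matrix.
import numpy as np

cm = np.array([[ 12, 102,  93],   # predicted Troll 2
               [112,  23,  77],   # predicted Gore Police
               [ 83,  92,  17]])  # predicted Cool as Ice

c = 0                              # index of the Troll 2 row/column
tp = cm[c, c]
fn = cm[:, c].sum() - tp           # actually Troll 2, predicted something else
fp = cm[c, :].sum() - tp           # predicted Troll 2, actually something else
tn = cm.sum() - tp - fn - fp       # everything else

print("Sensitivity:", tp / (tp + fn))  # 12/207  ~ 0.06
print("Specificity:", tn / (tn + fp))  # 209/404 ~ 0.52
```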
5. Bias & Variance
https://youtu.be/EuBBz3bI-aA
Height vs. Weight of Mice example.
1. Linear Regression (Least Squares) fits a straight line to the dataset.
Here, since the straight line can't bend to touch all the data points (you would need a curved line for that), it can't capture the true relationship between weight & height. That inability of a model (in this case, linear regression) to capture the true relationship is called bias.
2. Another ML method might fit a squiggly line, which has very little bias on the training data.
Now we measure the distances from the fitted lines to the data points, square them (so negative distances don't cancel out), and add them up. We do this for both the training data and the testing data. The difference in fits between datasets is called variance.
On the training data the squiggly line wins, but it fails on the testing data, and that is what we call overfitting in ML.
In ML, the ideal model has low bias (it can accurately model the true relationship) and also low variability (it produces consistent predictions across different datasets).
Three commonly used methods for finding the sweet spot between simple and complicated models are regularization, boosting, and bagging; see the sketch below.
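As a hedged sketch of the first of those methods, regularization: ridge regression penalizes large coefficients, which tames the squiggle. The data and alpha values are invented for illustration; a bigger alpha means more regularization, and the sweet spot is wherever the testing error is lowest.

```python
# Sketch: regularization (ridge) trading training fit for testing fit.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(40, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 40)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

for alpha in [1e-6, 1e-3, 0.1, 10.0]:  # bigger alpha = more regularization
    model = make_pipeline(PolynomialFeatures(degree=12), Ridge(alpha=alpha)).fit(X_tr, y_tr)
    print(f"alpha={alpha}: "
          f"train MSE {mean_squared_error(y_tr, model.predict(X_tr)):.3f}, "
          f"test MSE {mean_squared_error(y_te, model.predict(X_te)):.3f}")
```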
6. ROC & AUC
https://youtu.be/4jRBRDbJemM
Obese vs. Weight example.
Logistic regression with classification threshold 0.5.
Now create a confusion matrix from the logistic regression model's results on the testing data:
                        ACTUAL
                        Is Obese    Is Not Obese
PRED  Is Obese          3           1
      Is Not Obese      1           3
and then calculate sensitivity and specificity.
Now, what if we pick a different threshold instead of 0.5? That depends on how costly false positives are versus false negatives.
E.g., has Ebola or doesn't: it's important not to miss any Ebola cases. So how do we decide which threshold to set?
You could create a different confusion matrix for every threshold, but that would be too tedious.
So instead, use an ROC (Receiver Operating Characteristic) graph.
Basically, calculate the sensitivity (true positive rate) and 1 - specificity (false positive rate) for each threshold and plot those points.
Anything on the diagonal line means the proportion of true positives equals the proportion of false positives.
Area Under the Curve [AUC]
Here, the logistic regression ROC curve sits above the random forest ROC curve, so logistic regression has the larger AUC and is the better model.
Precision = True Positives / (True Positives + False Positives). Precision doesn't involve true negatives, so it can be more informative than 1 - specificity when the data is imbalanced (e.g., very few positive cases).
ROC graphs are basically used to identify which threshold to use out of the many possibilities.
AUC helps choose between models by comparing their ROC graphs.
https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/
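A rough sketch of computing an ROC curve and AUC with scikit-learn (the dataset, model, and threshold-picking rule below are my own choices, not from the video):

```python
# Sketch: ROC curve points (one per threshold) and the AUC.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]   # predicted probability of class 1

# One (1 - specificity, sensitivity) point per threshold.
fpr, tpr, thresholds = roc_curve(y_te, probs)
print("AUC:", roc_auc_score(y_te, probs))

# One simple rule: pick the threshold closest to the top-left corner.
best = np.argmin(fpr**2 + (1 - tpr)**2)
print("Suggested threshold:", thresholds[best])
```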