Saturday, February 1, 2020

Machine Learning

StatQuest is a YouTube channel; the following are some notes from it...

1. ML Intro
https://youtu.be/Gv9_4yMHFhI

1. fit a line to data to show the trend or make predictions
2. so ML is all about making predictions & classifications

you can fit a line or a squiggle (which will connect all the training data..) -- but all that matters is how good your predictions are.. whether it's a line or a squiggle... the model should perform well on testing data...

Once you have a model.. find the difference between the real values & the predictions using testing data.. do this for both the straight line and the squiggle.... whichever model gives the smaller difference between real & predicted values.. choose that one..

SO ---
The tension between fitting the training data well and making good predictions on new data is called the BIAS-VARIANCE TRADEOFF

It's all about which model will work well on our testing data..
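
The model-choosing idea above can be sketched in plain Python. All the numbers below are invented for illustration, and `sum_squared_error` is a hypothetical helper, not something from the video:

```python
# Pick the model with the smaller error on *testing* data,
# not the one that hugs the training data.

def sum_squared_error(actual, predicted):
    """Sum of squared differences between real and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

# Hypothetical test-set values and predictions from two fitted models
test_actual    = [3.0, 5.0, 7.0, 9.0]
line_preds     = [3.2, 4.8, 7.1, 8.9]   # straight line: small, consistent errors
squiggle_preds = [1.0, 7.5, 4.0, 12.0]  # squiggle: fit training perfectly, poor here

line_err     = sum_squared_error(test_actual, line_preds)
squiggle_err = sum_squared_error(test_actual, squiggle_preds)

# Whichever model gives the smaller testing error wins
best = "line" if line_err < squiggle_err else "squiggle"
print(best, line_err, squiggle_err)
```

Whichever model produces the smaller total error on the testing data is the one to keep.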


2. Cross Validation
https://youtu.be/fSytzGwwBVw

How do we decide which ML method to use? That's what Cross Validation is for.

Divide the data 75% / 25% for training & testing.. but is that split the right one? That's where cross validation comes to the rescue.. it tries every combination of training/testing blocks to validate the data..

Four-fold cross validation
Leave one out cross validation
In practice, Ten-Fold Cross Validation is used.... How to use it?
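
The splitting behind k-fold cross validation can be sketched in plain Python. `k_fold_splits` is a hypothetical helper (real libraries such as scikit-learn provide this); the point is that every sample is used for testing exactly once and for training k-1 times:

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# Four-fold cross validation on 8 samples: each fold tests on 2 samples
for train, test in k_fold_splits(8, 4):
    print(train, test)

# Leave-one-out is just k_fold_splits(n_samples, n_samples)
```

Train the model on each `train` block, evaluate on each `test` block, and average the results.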

Tuning Parameter ?? what's that? (A parameter that isn't estimated from the data itself but has to be chosen by the user.. cross validation can also be used to pick it.)

3. Confusion Matrix
https://youtu.be/Kdsp6soqA7o

Heart Disease prediction ...

You need to summarize how each method performed on the testing data.. and for that a Confusion Matrix is used

                             Has Heart Disease      Doesn't Have Heart Disease

Has Heart Disease            True Positive          False Positive

Doesn't Have Heart Disease   False Negative         True Negative


Calculate this confusion matrix for all the different ML models.. whichever model's confusion matrix works best (most true positives/true negatives, fewest errors).. choose that one....


Favorite Movie

                   Troll 2            Gore Police            Cool as Ice

Troll 2

Gore Police

Cool as Ice

So, the size of the confusion matrix depends on how many things you are going to predict.. if you want to choose between 2 things then it's 2 x 2, if it's 5 then 5 x 5, or if it's 40 then 40 x 40
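
Tallying a confusion matrix from labels can be sketched like this. The labels and samples below are made up, and `confusion_matrix` is a hypothetical helper (not scikit-learn's function of the same name):

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Return counts[(predicted_label, actual_label)] for every label pair."""
    counts = Counter(zip(predicted, actual))
    return {(p, a): counts[(p, a)] for p in labels for a in labels}

labels = ["disease", "no disease"]
actual    = ["disease", "disease", "no disease", "no disease", "disease"]
predicted = ["disease", "no disease", "no disease", "disease", "disease"]

cm = confusion_matrix(actual, predicted, labels)
# ("disease", "disease") -> true positives, ("disease", "no disease") -> false positives,
# ("no disease", "disease") -> false negatives, ("no disease", "no disease") -> true negatives
print(cm)
```

The matrix grows with the label list, which is why 5 classes give a 5 x 5 and 40 classes a 40 x 40.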

4. Sensitivity & Specificity
https://youtu.be/vP06aMoz4v8

Confusion matrix with 2 rows & 2 cols

Rows are predictions
Columns are the truth
                      
                                             ACTUAL

                                 Has Heart Disease        Doesn't Have Heart Disease

        Has Heart Disease        True Positive            False Positive
PRED
        Doesn't Have             False Negative           True Negative


Sensitivity -- what % of patients with heart disease are correctly identified
Specificity -- what % of patients without heart disease are correctly identified

The following values are from a Logistic Regression model

                                             ACTUAL

                                 Has Heart Disease        Doesn't Have Heart Disease

        Has Heart Disease        139                      20
PRED
        Doesn't Have             32                       112


Sensitivity = 139/(139+32) = 0.81

Specificity = 112/(112+20) = 0.85


The following values are from a Random Forest model

                                             ACTUAL

                                 Has Heart Disease        Doesn't Have Heart Disease

        Has Heart Disease        142                      22
PRED
        Doesn't Have             29                       110


Sensitivity = 142/(142+29) = 0.83

Specificity = 110/(110+22) = 0.83
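
The four figures above can be re-computed from the raw counts; a minimal sketch:

```python
def sensitivity(tp, fn):
    """Fraction of actual positives that were correctly identified."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of actual negatives that were correctly identified."""
    return tn / (tn + fp)

# Logistic regression: TP=139, FP=20, FN=32, TN=112
print(round(sensitivity(139, 32), 2))  # 0.81
print(round(specificity(112, 20), 2))  # 0.85

# Random forest: TP=142, FP=22, FN=29, TN=110
print(round(sensitivity(142, 29), 2))  # 0.83
print(round(specificity(110, 22), 2))  # 0.83
```

So logistic regression is slightly better at catching disease-free patients, random forest slightly better at catching patients with disease.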

--------------- 3 x 3 or bigger confusion matrices

Favorite Movie

                                       ACTUAL

                       Troll 2        Gore Police        Cool as Ice

        Troll 2          12              102                 93

PRED    Gore Police     112               23                 77

        Cool as Ice      83               92                 17


Sensitivity for Troll 2 = True Positives for Troll 2 / (True Positives for Troll 2 + False Negatives for Troll 2)

Sensitivity = 12/(12 + (112+83)) = 0.06

Specificity for Troll 2 = True Negatives for Troll 2 / (True Negatives for Troll 2 + False Positives for Troll 2)

Specificity = (23+77+92+17)/((23+77+92+17) + (102+93)) = 0.52

and the same goes for Gore Police and Cool as Ice
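
The Troll 2 calculations above can be re-computed from the 3 x 3 matrix; a sketch, assuming the row/column order from the table (rows are predictions, columns are the truth):

```python
# matrix[predicted][actual]; order: Troll 2, Gore Police, Cool as Ice
matrix = [
    [12, 102, 93],   # predicted Troll 2
    [112, 23, 77],   # predicted Gore Police
    [83,  92, 17],   # predicted Cool as Ice
]

def sensitivity_for(matrix, i):
    """TP / (TP + FN): column i holds everyone whose true favorite is class i."""
    tp = matrix[i][i]
    fn = sum(matrix[r][i] for r in range(len(matrix)) if r != i)
    return tp / (tp + fn)

def specificity_for(matrix, i):
    """TN / (TN + FP): everything outside row i and column i is a true negative."""
    n = len(matrix)
    tn = sum(matrix[r][c] for r in range(n) for c in range(n) if r != i and c != i)
    fp = sum(matrix[i][c] for c in range(n) if c != i)
    return tn / (tn + fp)

print(round(sensitivity_for(matrix, 0), 2))  # 0.06 for Troll 2
print(round(specificity_for(matrix, 0), 2))  # 0.52 for Troll 2
```

Calling the same functions with i = 1 or i = 2 gives the Gore Police and Cool as Ice numbers.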


Use sensitivity & specificity to identify which ML method should be used to correctly classify the data... or for prediction..

5. Bias & Variance
https://youtu.be/EuBBz3bI-aA

Height Vs Weight of Mice

1. Linear Regression (Least Squares) -- it fits a straight line to the data set..
here.. since the line doesn't touch all the data points [as you'd need a curved line to touch all data points] it can't capture the true relationship between weight & height, and that inability of a model [in this case linear regression] to capture the true relationship is called bias..

2. Another ML method might fit a squiggly line.. which has very little bias.. on the training data..

Now, we measure the distances from the fitted lines to the data, square them (since some distances are negative) and add them up.. but this we have to do for the testing data as well, not just the training data.. the difference in fits between datasets is called variance..

On the training data.. the squiggly line will win.. but it will fail on the testing data... and that is what we call overfitting in ML..

In ML, the ideal model has low bias, so it can accurately model the true relationship.. and it also has low variability, producing consistent predictions across different datasets...

Three commonly used methods for finding the sweet spot between simple and complicated models are regularization, boosting and bagging
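
The overfitting story above can be sketched with sums of squares. The data and fits below are invented: the "squiggle" passes through every training point (zero training error) but misses badly on testing data, while the straight line is consistent on both:

```python
def ssr(actual, predicted):
    """Sum of squared residuals between the data and a fitted line."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

train_actual = [2.0, 4.1, 5.9, 8.2]
test_actual  = [3.1, 5.0, 7.0]

line_train,     line_test     = [2.1, 4.0, 6.0, 8.0], [3.0, 5.0, 7.0]
squiggle_train, squiggle_test = train_actual[:],      [1.5, 8.0, 4.5]  # hits every training point

# Low bias, high variance: perfect on training, poor on testing -> overfit
print(ssr(train_actual, squiggle_train), ssr(test_actual, squiggle_test))

# Higher bias, low variance: similar, small error on both datasets
print(ssr(train_actual, line_train), ssr(test_actual, line_test))
```

The squiggle's training error of zero is exactly why it looks great until you show it new data.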


6. ROC & AUC
https://youtu.be/4jRBRDbJemM

Obese Vs Weight

Logistic Regression with Classification Threshold 0.5

Now, create a confusion matrix from the test results of the logistic regression model

                                       ACTUAL

                              Is Obese        Is Not Obese

           Is Obese               3                1
PREDICTED
           Is Not Obese           1                3

and then calculate sensitivity and specificity


now, what if we pick a different threshold instead of 0.5.. depending on whether you care more about false positives or false negatives?

e.g. has Ebola or doesn't.. it's important not to miss any Ebola cases.. so how do we decide which threshold to set?
You could create a different confusion matrix for every threshold.. but that would be too tedious

So, instead use a ROC [Receiver Operator Characteristic] graph...

So, basically calculate Sensitivity & (1 - Specificity) for each threshold and plot those points on a graph..


anything on the diagonal line has the same proportion of true positives and false positives..

Area Under the Curve [AUC]

Here.. the Logistic Regression ROC curve sits above the Random Forest ROC curve.. so Logistic Regression has the larger AUC and is the better model

Precision = True +ve / (True +ve + False +ve)

ROC is basically used to identify which threshold to use out of the many possibilities
AUC helps to choose between models by comparing their ROC graphs
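
Building ROC points and AUC by hand can be sketched like this. The scores and labels are invented, and `roc_point` is a hypothetical helper:

```python
def roc_point(scores, labels, threshold):
    """Return (false positive rate, true positive rate) classifying by score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    return fp / (fp + tn), tp / (tp + fn)

scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]   # model probabilities of "obese"
labels = [0,   0,   1,   0,   1,   1]     # 1 = is obese, 0 = is not obese

# One ROC point per threshold, sorted left-to-right along the x axis
points = sorted(roc_point(scores, labels, t) for t in [0.0, 0.2, 0.5, 0.8, 1.1])

# Area under the curve by the trapezoid rule: bigger area = better model overall
auc = sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(points, round(auc, 2))
```

Each point is one threshold's confusion matrix collapsed to two numbers, which is exactly why the ROC graph saves you from building all those matrices by hand.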

https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/















