Thursday, 3 October 2013

Logistic Regression in Mahout

Logistic Regression (LR) is a type of regression analysis used to predict the probability that an event occurs. It uses several predictors, which may be either numerical or categorical.
It refers specifically to the problem in which the dependent variable is dichotomous, e.g.
predicting whether a patient has a given disease or not, or whether a user will buy a product or not.

It can be implemented in Mahout as well as in R. Here we'll talk about the Mahout implementation.
The Mahout implementation uses Stochastic Gradient Descent (SGD), which updates the model one training record at a time and so scales to large training data sets.
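
To make the SGD idea concrete, here is a minimal Python sketch of logistic regression trained one record at a time. It is not Mahout's code: the function names, the toy data and the fixed learning rate of 0.1 are my own assumptions, and a real learner adds refinements on top of this.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_sgd(rows, labels, passes=100, rate=0.1):
    """Fit logistic regression weights by plain SGD (one record per update)."""
    weights = [0.0] * (len(rows[0]) + 1)  # weights[0] is the intercept
    for _ in range(passes):
        for x, y in zip(rows, labels):
            features = [1.0] + list(x)
            p = sigmoid(sum(w * f for w, f in zip(weights, features)))
            # The log-likelihood gradient for one record is (y - p) * feature.
            for i, f in enumerate(features):
                weights[i] += rate * (y - p) * f
    return weights

def predict(weights, x):
    """Predicted probability that record x belongs to class 1."""
    features = [1.0] + list(x)
    return sigmoid(sum(w * f for w, f in zip(weights, features)))

# Toy data (made up): the label is 1 when the single predictor is large.
rows = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
labels = [0, 0, 0, 1, 1, 1]
weights = train_sgd(rows, labels)
print(predict(weights, (0.5,)), predict(weights, (4.5,)))
```

The first printed probability should come out well below 0.5 and the second well above it, since the toy classes are split around 2.5.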

Following are the steps to run LR:

# To train the model
This step takes a training dataset as input and produces a model that can be used to classify data in the same format.


$MAHOUT_HOME/bin/mahout org.apache.mahout.classifier.sgd.TrainLogistic --passes 100 --rate 60 --input $MAHOUT_HOME/examples/src/main/resources/donut.csv --features 100 --output output/donutmodel.model --target color --categories 2 --predictors x y xx xy yy a b c --types n


"input" : path to the training data
"output" : path to the file where the model will be written
"target" : dependent variable which is to be predicted
"categories" : number of distinct values the target can take
"predictors" : list of field names that are used to predict the target variable
"types" : data types of the items in the predictor list ("n" stands for numeric)
"passes" : number of passes over the input data
"features" : size of the internal feature vector
"lambda" : amount of coefficient decay (regularization) to use; not set in the command above
"rate" : initial learning rate

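The "features" option deserves a short illustration. As far as I understand it, Mahout hashes each predictor into a slot of a fixed-size vector, which would explain why the training output further down ends with a long row of numbers that are mostly zero. Here is a minimal Python sketch of that idea; the crc32 hash, the function names and the sample values are my own stand-ins, not what Mahout actually uses.

```python
import zlib

def hashed_index(name, num_features):
    """Pick a slot for a predictor name in a vector of fixed size."""
    # zlib.crc32 is a stand-in hash chosen for this sketch, not Mahout's hash.
    return zlib.crc32(name.encode("utf-8")) % num_features

def encode(record, num_features=100):
    """Add each predictor's value into its hashed slot."""
    vector = [0.0] * num_features
    for name, value in record.items():
        vector[hashed_index(name, num_features)] += value
    return vector

record = {"x": 0.5, "y": 0.25, "xx": 0.25}  # made-up values
vector = encode(record)
print(sum(1 for v in vector if v != 0.0), "non-zero slots out of", len(vector))
```

A larger vector makes it less likely that two predictors land in the same slot, at the cost of more memory.
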
This prints output like the following and writes a model file to the given location:


Running on hadoop, using /home/hadoop/hadoop-0.20.203.0/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /data/dataAnalytics/mahout-distribution-0.7-CUSTOM/mahout-examples-0.7-job.jar
13/10/03 11:02:48 WARN driver.MahoutDriver: No org.apache.mahout.classifier.sgd.TrainLogistic.props found on classpath, will use command-line arguments only
100
color ~ 6.214*Intercept Term + 0.894*a + -1.255*b + -26.279*c + 4.623*x + -5.436*xx + 3.050*xy + 6.001*y + -6.190*yy
      Intercept Term 6.21450
                   a 0.89445
                   b -1.25489
                   c -26.27914
                   x 4.62344
                  xx -5.43578
                  xy 3.04982
                   y 6.00145
                  yy -6.19029
    0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     4.623441607     0.000000000     0.000000000     6.214498855     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000    -5.435784604     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000   -26.279139691     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000    -1.254893124     0.000000000     0.000000000    -6.190291596     0.000000000     0.894450921     0.000000000     3.049819437     0.000000000     0.000000000     0.000000000     0.000000000     6.001446962     0.000000000     0.000000000
13/10/03 11:02:48 INFO driver.MahoutDriver: Program took 616 ms (Minutes: 0.010266666666666667)
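
The "color ~ ..." line in this output is the fitted model: a weighted sum of the predictors that is passed through the logistic function to get a probability. A minimal Python sketch of that calculation follows. The coefficients are copied from the output above, but the record values are made up, and which of the two color values the probability refers to depends on how Mahout numbered the categories, so treat this only as an illustration.

```python
import math

# Coefficients copied from the training output above.
COEFFICIENTS = {
    "Intercept Term": 6.21450,
    "a": 0.89445,
    "b": -1.25489,
    "c": -26.27914,
    "x": 4.62344,
    "xx": -5.43578,
    "xy": 3.04982,
    "y": 6.00145,
    "yy": -6.19029,
}

def score(record):
    """Weighted sum of the predictors, passed through the logistic function."""
    z = COEFFICIENTS["Intercept Term"]
    for name, value in record.items():
        z += COEFFICIENTS[name] * value
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical record: these predictor values are made up for illustration.
record = {"x": 0.5, "y": 0.5, "xx": 0.25, "xy": 0.25, "yy": 0.25,
          "a": 0.7, "b": 0.7, "c": 0.2}
print(round(score(record), 3))
```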

# To test the model
We generated the model in the first step. Now we'll run it on test data to see how accurately it classifies records.


$MAHOUT_HOME/bin/mahout org.apache.mahout.classifier.sgd.RunLogistic --input $MAHOUT_HOME/examples/src/main/resources/donut-test.csv --model output/donutmodel.model --auc --confusion


Output would be like this:

Running on hadoop, using /home/hadoop/hadoop-0.20.203.0/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /data/dataAnalytics/mahout-distribution-0.7-CUSTOM/mahout-examples-0.7-job.jar
13/10/03 11:03:13 WARN driver.MahoutDriver: No org.apache.mahout.classifier.sgd.RunLogistic.props found on classpath, will use command-line arguments only
AUC = 0.97
confusion: [[24.0, 2.0], [3.0, 11.0]]
entropy: [[-0.2, -3.4], [-4.8, -0.1]]
13/10/03 11:03:14 INFO driver.MahoutDriver: Program took 130 ms (Minutes: 0.0021666666666666666)


where AUC : Area under the ROC curve. It ranges from 0 to 1. A value of 1 means the model ranks every positive record above every negative one, a value of 0.5 means it does no better than random guessing, and a value near 0 means its ranking is consistently inverted. The 0.97 here shows that the model separates the two classes well.
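
To see what AUC measures, here is a small Python sketch that computes it directly as the fraction of (positive, negative) pairs that are ranked correctly. This is the textbook pairwise definition, not necessarily how Mahout computes it internally, and the labels and scores are made up.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]                     # made-up true classes
print(auc(labels, [0.1, 0.2, 0.8, 0.9]))  # perfect ranking   -> 1.0
print(auc(labels, [0.9, 0.8, 0.2, 0.1]))  # inverted ranking  -> 0.0
print(auc(labels, [0.5, 0.5, 0.5, 0.5]))  # no information    -> 0.5
```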

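confusion : the confusion matrix, which counts the test records for each combination of actual and predicted class. The diagonal entries (24 and 11) are records that were classified correctly; the off-diagonal entries (2 and 3) are misclassifications.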
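
The confusion matrix can be turned into a few summary numbers. Here is a short Python sketch using the matrix from the output above. The accuracy does not depend on the layout of the matrix, but the per-class ratios assume that each row is one actual class, which I have not verified against Mahout's source.

```python
confusion = [[24.0, 2.0], [3.0, 11.0]]  # copied from the RunLogistic output above

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total
print(f"{correct:.0f} of {total:.0f} correct, accuracy = {accuracy:.3f}")

# Assumption: each row is one actual class. If rows are predicted classes
# instead, these ratios are per-class precision rather than recall.
per_class = [row[i] / sum(row) for i, row in enumerate(confusion)]
for i, ratio in enumerate(per_class):
    print(f"class {i}: {ratio:.3f}")
```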

Now we can use the model generated above to make predictions on our test data.

So start using LR for solving your problems!
