About dataset:
I took this dataset from Kaggle (https://www.kaggle.com/mig555/mushroom-classification/data), though it was originally contributed to the UCI Machine Learning Repository nearly 30 years ago.
This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for poison oak and ivy.
Problem & Approach:
The goal is to develop a binary classifier that predicts whether a mushroom is poisonous or edible. I will build a Naive Bayes classifier after some basic EDA of the data. Later I will also test Decision Tree & Random Forest models on this dataset.
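The setup chunk of this notebook is not shown. A minimal sketch of what it might look like is given below, assuming the Kaggle CSV has been saved as mushrooms.csv in the working directory (a hypothetical path) and that the packages providing the functions used later are installed. The guards are only there so the snippet runs harmlessly when a package or the file is absent.

```r
# Packages that provide the functions called in this notebook:
# purrr (map_df, %>%), knitr (kable), kableExtra (kable_styling),
# e1071 (naiveBayes), caret (confusionMatrix), rpart, rpart.plot, randomForest.
pkgs <- c("purrr", "knitr", "kableExtra", "e1071", "caret",
          "rpart", "rpart.plot", "randomForest")

# Attach the packages only if all of them are installed.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
if (all(installed)) {
  invisible(lapply(pkgs, library, character.only = TRUE))
} else {
  message("Missing packages: ", paste(pkgs[!installed], collapse = ", "))
}

# Assumed file name; adjust to wherever the Kaggle download was saved.
csv_path <- "mushrooms.csv"
if (file.exists(csv_path)) {
  # stringsAsFactors = TRUE gives an all-factor data frame like the one str() shows.
  mushroom <- read.csv(csv_path, stringsAsFactors = TRUE)
}
```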
Let's check the data structure
str(mushroom)
## 'data.frame': 8124 obs. of 23 variables:
## $ class : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
## $ cap.shape : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
## $ cap.surface : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
## $ cap.color : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
## $ bruises : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ...
## $ odor : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
## $ gill.attachment : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
## $ gill.spacing : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
## $ gill.size : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
## $ gill.color : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
## $ stalk.shape : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
## $ stalk.root : Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
## $ stalk.surface.above.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk.surface.below.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk.color.above.ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ stalk.color.below.ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ veil.type : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
## $ veil.color : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ ring.number : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
## $ ring.type : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
## $ spore.print.color : Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ...
## $ population : Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ...
## $ habitat : Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ...
Renaming all entities
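As you might have noticed, all attribute values are coded by single letters only. Let's convert these to proper names for clarity, and also convert all attributes to factors, since every attribute here is categorical. R orders factor levels alphabetically, so each vector of new names must follow the alphabetical order of the original letter codes.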
colnames(mushroom) <- c("Edibility", "CapShape", "CapSurface",
"CapColor", "Bruises", "Odor",
"GillAttachment", "GillSpacing", "GillSize",
"GillColor", "StalkShape", "StalkRoot",
"StalkSurfaceAboveRing", "StalkSurfaceBelowRing", "StalkColorAboveRing",
"StalkColorBelowRing", "VeilType", "VeilColor",
"RingNumber", "RingType", "SporePrintColor",
"Population", "Habitat")
mushroom <- mushroom %>% map_df(function(.x) as.factor(.x))
levels(mushroom$Edibility) <- c("edible", "poisonous")
levels(mushroom$CapShape) <- c("bell", "conical", "flat", "knobbed", "sunken", "convex")
levels(mushroom$CapColor) <- c("buff", "cinnamon", "red", "gray", "brown", "pink",
"green", "purple", "white", "yellow")
levels(mushroom$CapSurface) <- c("fibrous", "grooves", "smooth", "scaly")  # codes f, g, s, y
levels(mushroom$Bruises) <- c("no", "yes")
levels(mushroom$Odor) <- c("almond", "creosote", "foul", "anise", "musty", "none", "pungent", "spicy", "fishy")
levels(mushroom$GillAttachment) <- c("attached", "free")
levels(mushroom$GillSpacing) <- c("close", "crowded")
levels(mushroom$GillSize) <- c("broad", "narrow")
levels(mushroom$GillColor) <- c("buff", "red", "gray", "chocolate", "black", "brown", "orange",
"pink", "green", "purple", "white", "yellow")
levels(mushroom$StalkShape) <- c("enlarging", "tapering")
levels(mushroom$StalkRoot) <- c("missing", "bulbous", "club", "equal", "rooted")
levels(mushroom$StalkSurfaceAboveRing) <- c("fibrous", "silky", "smooth", "scaly")
levels(mushroom$StalkSurfaceBelowRing) <- c("fibrous", "silky", "smooth", "scaly")
levels(mushroom$StalkColorAboveRing) <- c("buff", "cinnamon", "red", "gray", "brown", "orange",
"pink", "white", "yellow")
levels(mushroom$StalkColorBelowRing) <- c("buff", "cinnamon", "red", "gray", "brown", "orange",
"pink", "white", "yellow")
levels(mushroom$VeilType) <- "partial"
levels(mushroom$VeilColor) <- c("brown", "orange", "white", "yellow")
levels(mushroom$RingNumber) <- c("none", "one", "two")
levels(mushroom$RingType) <- c("evanescent", "flaring", "large", "none", "pendant")
levels(mushroom$SporePrintColor) <- c("buff", "chocolate", "black", "brown", "orange",
"green", "purple", "white", "yellow")
levels(mushroom$Population) <- c("abundant", "clustered", "numerous", "scattered", "several", "solitary")
levels(mushroom$Habitat) <- c("wood", "grasses", "leaves", "meadows", "paths", "urban", "waste")
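Assigning levels() by position only works when the new names are listed in the alphabetical order of the letter codes, which is easy to get wrong. One way to make the mapping explicit is a named lookup vector. The sketch below uses a small made-up factor and the cap surface codes from the dataset description (f = fibrous, g = grooves, s = smooth, y = scaly); the names cap_surface_codes and x are illustrative only.

```r
# Letter code -> readable name, written in any order.
cap_surface_codes <- c(s = "smooth", y = "scaly", f = "fibrous", g = "grooves")

# A small made-up factor standing in for mushroom$CapSurface.
x <- factor(c("s", "y", "f", "s", "g"))

# Look each existing level up by its letter, so position no longer matters.
levels(x) <- unname(cap_surface_codes[levels(x)])

as.character(x)
```

The same pattern could be repeated for every attribute, with one lookup vector per column.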
Let's check a few records from the dataset now
head(mushroom) %>% kable("html") %>%
kable_styling()
Edibility | CapShape | CapSurface | CapColor | Bruises | Odor | GillAttachment | GillSpacing | GillSize | GillColor | StalkShape | StalkRoot | StalkSurfaceAboveRing | StalkSurfaceBelowRing | StalkColorAboveRing | StalkColorBelowRing | VeilType | VeilColor | RingNumber | RingType | SporePrintColor | Population | Habitat |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
poisonous | convex | smooth | brown | yes | pungent | free | close | narrow | black | enlarging | equal | smooth | smooth | white | white | partial | white | one | pendant | black | scattered | urban |
edible | convex | smooth | yellow | yes | almond | free | close | broad | black | enlarging | club | smooth | smooth | white | white | partial | white | one | pendant | brown | numerous | grasses |
edible | bell | smooth | white | yes | anise | free | close | broad | brown | enlarging | club | smooth | smooth | white | white | partial | white | one | pendant | brown | numerous | meadows |
poisonous | convex | scaly | white | yes | pungent | free | close | narrow | brown | enlarging | equal | smooth | smooth | white | white | partial | white | one | pendant | black | scattered | urban |
edible | convex | smooth | gray | no | none | free | crowded | broad | black | tapering | equal | smooth | smooth | white | white | partial | white | one | evanescent | brown | abundant | grasses |
edible | convex | scaly | yellow | yes | almond | free | close | broad | brown | enlarging | club | smooth | smooth | white | white | partial | white | one | pendant | black | numerous | grasses |
Let's check the structure of the data now
str(mushroom)
## Classes 'tbl_df', 'tbl' and 'data.frame': 8124 obs. of 23 variables:
## $ Edibility : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
## $ CapShape : Factor w/ 6 levels "bell","conical",..: 6 6 1 6 6 6 1 1 6 1 ...
## $ CapSurface : Factor w/ 4 levels "fibrous","grooves",..: 3 3 3 4 3 4 3 4 4 3 ...
## $ CapColor : Factor w/ 10 levels "buff","cinnamon",..: 5 10 9 9 4 10 9 9 9 10 ...
## $ Bruises : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ Odor : Factor w/ 9 levels "almond","creosote",..: 7 1 4 7 6 1 1 4 7 1 ...
## $ GillAttachment : Factor w/ 2 levels "attached","free": 2 2 2 2 2 2 2 2 2 2 ...
## $ GillSpacing : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ...
## $ GillSize : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
## $ GillColor : Factor w/ 12 levels "buff","red","gray",..: 5 5 6 6 5 6 3 6 8 3 ...
## $ StalkShape : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ...
## $ StalkRoot : Factor w/ 5 levels "missing","bulbous",..: 4 3 3 4 4 3 3 3 4 3 ...
## $ StalkSurfaceAboveRing: Factor w/ 4 levels "fibrous","silky",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ StalkSurfaceBelowRing: Factor w/ 4 levels "fibrous","silky",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ StalkColorAboveRing : Factor w/ 9 levels "buff","cinnamon",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ StalkColorBelowRing : Factor w/ 9 levels "buff","cinnamon",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ VeilType : Factor w/ 1 level "partial": 1 1 1 1 1 1 1 1 1 1 ...
## $ VeilColor : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ RingNumber : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
## $ RingType : Factor w/ 5 levels "evanescent","flaring",..: 5 5 5 5 1 5 5 5 5 5 ...
## $ SporePrintColor : Factor w/ 9 levels "buff","chocolate",..: 3 4 4 3 4 3 3 4 3 3 ...
## $ Population : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
## $ Habitat : Factor w/ 7 levels "wood","grasses",..: 6 2 4 6 2 2 4 4 2 4 ...
Let's find out more about each category of each attribute
summary(mushroom)
## Edibility CapShape CapSurface CapColor Bruises
## edible :4208 bell : 452 fibrous:2320 brown :2284 no :4748
## poisonous:3916 conical: 4 grooves: 4 gray :1840 yes:3376
## flat :3152 smooth :2556 red :1500
## knobbed: 828 scaly :3244 yellow :1072
## sunken : 32 white :1040
## convex :3656 buff : 168
## (Other): 220
## Odor GillAttachment GillSpacing GillSize
## none :3528 attached: 210 close :6812 broad :5612
## foul :2160 free :7914 crowded:1312 narrow:2512
## spicy : 576
## fishy : 576
## almond : 400
## anise : 400
## (Other): 484
## GillColor StalkShape StalkRoot StalkSurfaceAboveRing
## buff :1728 enlarging:3516 missing:2480 fibrous: 552
## pink :1492 tapering :4608 bulbous:3776 silky :2372
## white :1202 club : 556 smooth :5176
## brown :1048 equal :1120 scaly : 24
## gray : 752 rooted : 192
## chocolate: 732
## (Other) :1170
## StalkSurfaceBelowRing StalkColorAboveRing StalkColorBelowRing
## fibrous: 600 white :4464 white :4384
## silky :2304 pink :1872 pink :1872
## smooth :4936 gray : 576 gray : 576
## scaly : 284 brown : 448 brown : 512
## buff : 432 buff : 432
## orange : 192 orange : 192
## (Other): 140 (Other): 156
## VeilType VeilColor RingNumber RingType
## partial:8124 brown : 96 none: 36 evanescent:2776
## orange: 96 one :7488 flaring : 48
## white :7924 two : 600 large :1296
## yellow: 8 none : 36
## pendant :3968
##
##
## SporePrintColor Population Habitat
## white :2388 abundant : 384 wood :3148
## brown :1968 clustered: 340 grasses:2148
## black :1872 numerous : 400 leaves : 832
## chocolate:1632 scattered:1248 meadows: 292
## green : 72 several :4040 paths :1144
## buff : 48 solitary :1712 urban : 368
## (Other) : 144 waste : 192
Checking Frequency of mushroom classes
freq <- function(x){table(x)/length(x)*100}
freq(mushroom$Edibility)
## x
## edible poisonous
## 51.79714 48.20286
Poisonous & Edible classes are almost balanced.
Bar Charts comparing Edibility across all mushroom features
Classifying Mushrooms:
Creating Train Test Splits
I will take a 70% sample of the data (5686 mushrooms) for training & the remaining 30% (2438 mushrooms) for testing.
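set.seed(2)  # make the random split reproducible
s <- sample(1:nrow(mushroom), 0.7 * nrow(mushroom))  # row indices of the training set
mush_train <- mushroom[s, ]
mush_test <- mushroom[-s, ]
mush_test1 <- mush_test[, -1]  # test predictors only (Edibility column dropped)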
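The sizes of the two sets follow from the row count. A small check, using only the 8124 rows reported by str(); the sample size is passed through floor() here to make the truncation of 0.7 * 8124 = 5686.8 explicit:

```r
n <- 8124                      # rows reported by str(mushroom)
n_train <- floor(0.7 * n)      # 0.7 * 8124 = 5686.8, truncated to 5686
n_test <- n - n_train          # the remaining rows

set.seed(2)
s <- sample(seq_len(n), n_train)   # training row indices, drawn without replacement

c(train = length(s), test = n - length(s))
```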
Creating Model using Naive Bayes Classifier
The Naive Bayes classifier is based on Bayes' theorem with an assumption of independence among predictors. In simple terms, it assumes that, within a class, the presence of a particular feature is unrelated to the presence of any other feature.
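To make the independence assumption concrete, here is a hand-rolled sketch on a tiny made-up table (the rows are hypothetical, not taken from the mushroom data). It multiplies the class prior by the per-feature conditional frequencies and normalises; naive_posterior is an illustrative helper, not the e1071 implementation, and it applies no smoothing.

```r
# Toy data: hypothetical values, not taken from the mushroom dataset.
toy <- data.frame(
  Edibility = c("edible", "edible", "edible", "poisonous", "poisonous", "poisonous"),
  Odor      = c("none", "none", "almond", "foul", "foul", "none"),
  Bruises   = c("yes", "yes", "no", "no", "no", "yes"),
  stringsAsFactors = TRUE
)

# Posterior P(class | features) under the naive independence assumption.
naive_posterior <- function(data, target, new_obs) {
  classes <- levels(data[[target]])
  scores <- sapply(classes, function(cl) {
    rows <- data[data[[target]] == cl, , drop = FALSE]
    prior <- nrow(rows) / nrow(data)                      # P(class)
    likelihoods <- sapply(names(new_obs), function(f) {
      mean(rows[[f]] == new_obs[[f]])                     # P(feature value | class)
    })
    prior * prod(likelihoods)                             # product of independent terms
  })
  scores / sum(scores)                                    # normalise to sum to 1
}

post <- naive_posterior(toy, "Edibility", list(Odor = "none", Bruises = "yes"))
post
```

For this toy table the posterior is 0.8 for edible and 0.2 for poisonous. A feature value never seen in a class makes that class's whole product zero, which is why naiveBayes() offers a laplace argument for smoothing.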
model <- naiveBayes(Edibility ~. , data = mush_train)
Predicting Mushroom Class on Testset
Let's test our model on the remaining 30% test data
pred <- predict(model, mush_test1)
Model Evaluation
confusionMatrix(pred,mush_test$Edibility)
## Confusion Matrix and Statistics
##
## Reference
## Prediction edible poisonous
## edible 1245 147
## poisonous 9 1037
##
## Accuracy : 0.936
## 95% CI : (0.9256, 0.9454)
## No Information Rate : 0.5144
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8715
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9928
## Specificity : 0.8758
## Pos Pred Value : 0.8944
## Neg Pred Value : 0.9914
## Prevalence : 0.5144
## Detection Rate : 0.5107
## Detection Prevalence : 0.5710
## Balanced Accuracy : 0.9343
##
## 'Positive' Class : edible
##
In mushroom classification a few False Negatives are tolerable, but even a single False Positive can take someone's life. With edible as the 'Positive' class, a False Positive is a poisonous mushroom predicted as edible. We measure these as Sensitivity & Specificity.
We are getting a Sensitivity (True Positive Rate) of 99.28%, which is good: it represents our predictions for edible mushrooms, with only 0.7% False Negatives (9 mushrooms).
But Specificity (True Negative Rate), our ability to classify poisonous mushrooms, is 87.58%, which is not so good, as more than 12% of poisonous mushrooms get identified as edible. This model has 147 False Positives, which is not acceptable, so let's try a decision tree based model now.
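These rates can be recomputed directly from the four counts in the Naive Bayes confusion matrix. A short sketch, with rows as predictions and columns as the reference labels (the variable names are illustrative):

```r
# Counts copied from the Naive Bayes confusion matrix (filled column by column).
cm <- matrix(
  c(1245, 9, 147, 1037), nrow = 2,
  dimnames = list(Prediction = c("edible", "poisonous"),
                  Reference  = c("edible", "poisonous"))
)

# 'edible' is the positive class.
sensitivity <- cm["edible", "edible"] / sum(cm[, "edible"])           # 1245 / 1254
specificity <- cm["poisonous", "poisonous"] / sum(cm[, "poisonous"])  # 1037 / 1184
false_positive_rate <- cm["edible", "poisonous"] / sum(cm[, "poisonous"])  # 147 / 1184

round(c(sensitivity = sensitivity,
        specificity = specificity,
        false_positive_rate = false_positive_rate), 4)
```

The false positive rate of roughly 12.4% is the share of poisonous test mushrooms that the model would call edible.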
Creating a Decision Tree based Classifier
A decision tree is a type of supervised learning algorithm that works for both categorical and continuous input and output variables. In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter / differentiator among the input variables.
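One common way to score a candidate splitter is information gain: the drop in entropy of the class labels after splitting on a feature. The sketch below computes it by hand on made-up vectors; entropy and info_gain are illustrative helpers, and rpart's internal computation may differ in details.

```r
# Shannon entropy (in bits) of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]                       # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}

# Information gain of splitting `labels` by the values of `feature`.
info_gain <- function(labels, feature) {
  weights <- table(feature) / length(feature)        # share of rows in each child
  child_entropy <- tapply(labels, feature, entropy)  # entropy within each child
  w <- as.numeric(weights[names(child_entropy)])
  entropy(labels) - sum(w * as.numeric(child_entropy))
}

# Hypothetical toy data: 4 edible and 4 poisonous mushrooms.
labels  <- c("edible", "edible", "edible", "edible",
             "poisonous", "poisonous", "poisonous", "poisonous")
odor    <- c("none", "none", "none", "almond", "foul", "foul", "foul", "none")
bruises <- c("yes", "no", "yes", "no", "yes", "no", "yes", "no")

c(odor = info_gain(labels, odor), bruises = info_gain(labels, bruises))
```

In this toy example odor separates the classes far better than bruises, so a tree would split on odor first.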
tree.model <- rpart(Edibility ~ .,data = mush_train,method = "class",parms = list(split ="information"))
#prp(tree.model)
#summary(tree.model)
rpart.plot(tree.model, extra = 3, fallen.leaves = TRUE)
tree.predict <- predict(tree.model, mush_test[,-c(1)], type = "class")
confusionMatrix(tree.predict, mush_test$Edibility)
## Confusion Matrix and Statistics
##
## Reference
## Prediction edible poisonous
## edible 1254 16
## poisonous 0 1168
##
## Accuracy : 0.9934
## 95% CI : (0.9894, 0.9962)
## No Information Rate : 0.5144
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9869
## Mcnemar's Test P-Value : 0.0001768
##
## Sensitivity : 1.0000
## Specificity : 0.9865
## Pos Pred Value : 0.9874
## Neg Pred Value : 1.0000
## Prevalence : 0.5144
## Detection Rate : 0.5144
## Detection Prevalence : 0.5209
## Balanced Accuracy : 0.9932
##
## 'Positive' Class : edible
##
This model classifies every edible mushroom correctly (Sensitivity of 1), but 16 poisonous mushrooms are still predicted as edible, so Specificity is 98.65%. That is a big improvement over Naive Bayes, yet any False Positive is dangerous here. Decision trees are also prone to overfitting, so the model may perform poorly on unseen data, and the algorithm has chosen only 2 attributes, Odor & SporePrintColor, for making this prediction, which might be questionable. Let's also try a final model, Random Forest, which is robust & often gives a more generalizable model.
Creating Random Forest Classifier
Random forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
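The "mode of the classes" step can be illustrated without any package. In the sketch below the votes are made up: each row is one mushroom and each column is the class predicted by one tree; majority_vote is an illustrative helper.

```r
# Hypothetical votes from 5 trees (columns) for 3 mushrooms (rows).
votes <- matrix(
  c("edible",    "edible",    "poisonous", "edible",    "edible",
    "poisonous", "poisonous", "poisonous", "edible",    "poisonous",
    "edible",    "poisonous", "poisonous", "poisonous", "edible"),
  nrow = 3, byrow = TRUE
)

# The forest's answer is the most common vote in each row.
majority_vote <- function(v) names(which.max(table(v)))
forest_prediction <- apply(votes, 1, majority_vote)
forest_prediction
```

An odd number of trees avoids ties between the two classes in this toy example.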
set.seed(1234)
rf_model <- randomForest(Edibility~.,data=mush_train, importance = TRUE, ntree = 1000)
rf_model
##
## Call:
## randomForest(formula = Edibility ~ ., data = mush_train, importance = TRUE, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 0%
## Confusion matrix:
## edible poisonous class.error
## edible 2954 0 0
## poisonous 0 2732 0
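The "No. of variables tried at each split: 4" line matches randomForest's default for classification, which is the floor of the square root of the number of predictors. With 22 predictors (23 columns minus Edibility):

```r
n_predictors <- 23 - 1                      # all columns except Edibility
mtry_default <- floor(sqrt(n_predictors))   # sqrt(22) is about 4.69
mtry_default
```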
varImpPlot(rf_model)
The random forest of 1000 trees has also identified Odor & SporePrintColor as the most important variables. Let's check its predictions on test data now.
rf_prediction <- predict(rf_model, mush_test[,-c(1)])
confusionMatrix(rf_prediction, mush_test$Edibility)
## Confusion Matrix and Statistics
##
## Reference
## Prediction edible poisonous
## edible 1254 0
## poisonous 0 1184
##
## Accuracy : 1
## 95% CI : (0.9985, 1)
## No Information Rate : 0.5144
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.5144
## Detection Rate : 0.5144
## Detection Prevalence : 0.5144
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : edible
##
We are getting 100% Accuracy, Sensitivity & Specificity now. I am not
sure if this is correct. Please share your feedback on errors &
improvements. Thanks for reading this notebook.