The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope – a boring and time-consuming task. Other measurements, which are easier to obtain, such as length, height, weight etc. can be used to predict the age (or rings).
Data comes from an original (non-machine-learning) study:
Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn and Wes B Ford (1994) “The Population Biology of Abalone (_Haliotis_ species) in Tasmania. I. Blacklip Abalone (_H. rubra_) from the North Coast and Islands of Bass Strait”, Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288)
Source: UCI Machine Learning Repository
Number of Instances: 4177
Number of Attributes: 5 predictors plus the response (Rings)
Attribute information: Given is the attribute name, attribute type, the measurement unit and a brief description.
Name | Data Type | Description |
---|---|---|
Sex | nominal | M, F, and I (infant) |
Length (mm) | continuous | Longest shell measurement |
Diameter (mm) | continuous | perpendicular to length |
Height (mm) | continuous | with meat in shell |
Whole weight (grams) | continuous | whole abalone |
Rings | integer | +1.5 gives the age in years |
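The last row of the table implies a simple conversion from ring count to age. A minimal sketch, using the ring counts of the first five rows printed by head(main.df) in this document (the vector names are illustrative only):

```r
# Ring counts of the first five observations printed from main.df
rings <- c(15, 7, 10, 7, 8)

# Age in years is the ring count plus 1.5
age_years <- rings + 1.5
print(age_years)
```

Because the offset is constant, predicting rings and predicting age are equivalent problems.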
We will solve this prediction problem in five steps:
A. Prepare
B. Validate
C. Explore
D. Fit the model
E. Analyse
We have divided the entire dataset into two data files:
test.data, which contains 500 (~12%) randomly selected observations
main.data, which contains the remaining 3677 (~88%) observations
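The code used to create the two files is not shown. One way such a random split could be produced is sketched below; the seed and all variable names are assumptions made for illustration, not the procedure actually used.

```r
set.seed(1)        # arbitrary seed (an assumption), only for reproducibility

n_total <- 4177    # number of instances in the full dataset
n_test  <- 500     # size of the hold-out file

# Draw 500 distinct row numbers for the test set; the rest form the main set
test_rows <- sample(n_total, n_test)
main_rows <- setdiff(seq_len(n_total), test_rows)

length(main_rows)                # size of the main set
round(100 * n_test / n_total)    # share of the test set in percent

# With the full data in a data frame (hypothetical name abalone.df):
#   test.df <- abalone.df[test_rows, ]
#   main.df <- abalone.df[main_rows, ]
```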
We read the data from main.data and test.data into data frames called main.df and test.df respectively, and print the first 5 lines of each data frame.
main.df = read.table("main.data", header = TRUE, stringsAsFactors = TRUE)  # read sex as a factor so summary() reports counts
head(main.df, n = 5)
## sex length diameter height weight rings
## 1 M 0.455 0.365 0.095 0.5140 15
## 2 M 0.350 0.265 0.090 0.2255 7
## 3 M 0.440 0.365 0.125 0.5160 10
## 4 I 0.330 0.255 0.080 0.2050 7
## 5 I 0.425 0.300 0.095 0.3515 8
test.df = read.table("test.data", header = TRUE, stringsAsFactors = TRUE)  # read sex as a factor so summary() reports counts
head(test.df, n = 5)
## sex length diameter height weight rings
## 1 F 0.530 0.420 0.135 0.6770 9
## 2 F 0.530 0.415 0.150 0.7775 20
## 3 M 0.490 0.380 0.135 0.5415 11
## 4 M 0.450 0.320 0.100 0.3810 9
## 5 M 0.575 0.425 0.140 0.8635 11
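Since main.data itself is not included here, the following sketch reproduces the read step on the five rows printed above by passing them through the text argument of read.table(). It assumes the files are whitespace-separated with a header line, which matches the read.table() defaults used in the calls; demo.df is an illustrative name.

```r
# First five rows of main.df, copied from the printed output
txt <- "sex length diameter height weight rings
M 0.455 0.365 0.095 0.5140 15
M 0.350 0.265 0.090 0.2255 7
M 0.440 0.365 0.125 0.5160 10
I 0.330 0.255 0.080 0.2050 7
I 0.425 0.300 0.095 0.3515 8"

# stringsAsFactors = TRUE turns the sex column into a factor;
# since R 4.0.0 the default is FALSE, which would leave it as character
demo.df <- read.table(text = txt, header = TRUE, stringsAsFactors = TRUE)

str(demo.df)
summary(demo.df$sex)   # counts per level, because sex is a factor
```

With sex stored as a factor, summary() returns one count per level, which is the form of output used in the validation step.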
Here we try to identify any mistakes in the data using summaries and plots. One way to handle erroneous observations is to remove them (the approach used here); another is to substitute the erroneous value with the mean of the variable.
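The two options mentioned above can be sketched on a small vector. The values below are toy numbers chosen for illustration, with a zero standing in for an erroneous measurement; they are not a recommendation for one approach over the other.

```r
# Toy height values; 0.000 plays the role of an erroneous measurement
height <- c(0.095, 0.090, 0.125, 0.000, 0.080)
is_bad <- height <= 0

# Option 1: remove the erroneous observation
removed <- height[!is_bad]

# Option 2: substitute it with the mean of the valid observations
imputed <- replace(height, is_bad, mean(height[!is_bad]))

removed
imputed
```

Removal shrinks the sample, while mean substitution keeps the sample size but pulls the variable's variance down.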
According to the data description, the categorical variable Sex has three possible values: M, F and I. Let’s check this:
summary(main.df$sex)
## F I M
## 1154 1191 1332
summary(test.df$sex)
## F I M
## 153 151 196
Looks fine.
summary(main.df[,-c(1,6)])
## length diameter height weight
## Min. :0.075 Min. :0.0550 Min. :0.0000 Min. :0.0020
## 1st Qu.:0.450 1st Qu.:0.3500 1st Qu.:0.1150 1st Qu.:0.4430
## Median :0.545 Median :0.4250 Median :0.1400 Median :0.7950
## Mean :0.524 Mean :0.4078 Mean :0.1395 Mean :0.8286
## 3rd Qu.:0.615 3rd Qu.:0.4800 3rd Qu.:0.1650 3rd Qu.:1.1540
## Max. :0.815 Max. :0.6500 Max. :1.1300 Max. :2.8255
library(s20x)  # provides pairs20x() and cooks20x()
pairs20x(main.df[,-c(1,6)])
summary(test.df[,-c(1,6)])
## length diameter height weight
## Min. :0.1100 Min. :0.0900 Min. :0.0200 Min. :0.0080
## 1st Qu.:0.4537 1st Qu.:0.3500 1st Qu.:0.1150 1st Qu.:0.4339
## Median :0.5500 Median :0.4300 Median :0.1425 Median :0.8255
## Mean :0.5240 Mean :0.4082 Mean :0.1397 Mean :0.8301
## 3rd Qu.:0.6100 3rd Qu.:0.4800 3rd Qu.:0.1650 3rd Qu.:1.1413
## Max. :0.7750 Max. :0.5950 Max. :0.2350 Max. :2.3235
pairs20x(test.df[,-c(1,6)])
There seems to be a problem with the values of the height variable in main.df, as the pairs plot shows a couple of potentially influential height points. test.df looks fine.
Constraint 3.1: length is the longest measurement, so neither height nor diameter should exceed it.
main.df[which(main.df$height > main.df$length | main.df$diameter > main.df$length),]
## sex length diameter height weight rings
## 1065 I 0.185 0.375 0.12 0.4645 6
## 1799 F 0.455 0.355 1.13 0.5940 8
test.df[which(test.df$height > test.df$length | test.df$diameter > test.df$length),]
## [1] sex length diameter height weight rings
## <0 rows> (or 0-length row.names)
Found two points in main.df that violate this condition. test.df looks fine.
Constraint 3.2: length, diameter, height and weight should all be greater than zero.
main.df[which(main.df$length <= 0 | main.df$diameter <= 0 | main.df$height <= 0 | main.df$weight <= 0),]
## sex length diameter height weight rings
## 1109 I 0.430 0.34 0 0.428 8
## 3521 I 0.315 0.23 0 0.134 6
test.df[(which(test.df$length <= 0 | test.df$diameter <= 0 | test.df$height <= 0 | test.df$weight <= 0)),]
## [1] sex length diameter height weight rings
## <0 rows> (or 0-length row.names)
Two observations in main.df (1109 and 3521) have a height of zero. test.df has no issues here.
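Both checks can be combined into a single logical flag. The sketch below uses two valid rows from head(main.df) plus the four rows flagged in the output above; flag_invalid() is a hypothetical helper written for this illustration, not a function from any package.

```r
# Two valid rows and the four flagged rows, copied from the printed output
abalone <- data.frame(
  sex      = c("M", "M", "I", "F", "I", "I"),
  length   = c(0.455, 0.350, 0.185, 0.455, 0.430, 0.315),
  diameter = c(0.365, 0.265, 0.375, 0.355, 0.340, 0.230),
  height   = c(0.095, 0.090, 0.120, 1.130, 0.000, 0.000),
  weight   = c(0.5140, 0.2255, 0.4645, 0.5940, 0.4280, 0.1340),
  rings    = c(15, 7, 6, 8, 8, 6),
  row.names = c("1", "2", "1065", "1799", "1109", "3521")
)

# Hypothetical helper: TRUE for rows that break either constraint
flag_invalid <- function(df) {
  # length must be the longest measurement
  not_longest <- df$height > df$length | df$diameter > df$length
  # all measurements must be strictly positive
  non_positive <- df$length <= 0 | df$diameter <= 0 |
    df$height <= 0 | df$weight <= 0
  not_longest | non_positive
}

bad <- flag_invalid(abalone)
rownames(abalone)[bad]    # labels of the offending rows
clean <- abalone[!bad, ]  # keep only rows that pass both checks
```

Filtering with a logical vector (`!bad`) is a deliberate choice: `df[-which(cond), ]` returns zero rows when nothing matches, whereas `df[!cond, ]` returns the data unchanged.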
Removing observations 1065, 1109, 1799 and 3521 from main.df, as they violate constraints 3.1 and 3.2, and plotting the data again.
newmain.df = main.df[-c(1065, 1109, 1799, 3521),]
pairs20x(length ~ height + diameter, data=newmain.df )
newmain.df[which(newmain.df$height > 0.5),]
## sex length diameter height weight rings
## 1249 M 0.705 0.565 0.515 2.21 10
Observation 1249 seems to be another suspicious data point, as its weight is very low relative to its dimensions. However, it does satisfy conditions 3.1 and 3.2, so we keep 1249 in newmain.df for now and remove only 1065, 1109, 1799 and 3521.
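Note that removing rows by position keeps the original row labels, so the label 1249 no longer matches the row's position in newmain.df. A small sketch with a stand-in data frame of the same size as main.df (df, newdf and the id column are illustrative names):

```r
# Stand-in for main.df: 3677 rows, id equals the original row number
df <- data.frame(id = 1:3677)

# Drop the same four rows by position; row labels of the rest are preserved
newdf <- df[-c(1065, 1109, 1799, 3521), , drop = FALSE]

# Position of the row labelled "1249" after the removal
pos <- which(rownames(newdf) == "1249")
pos

# Indexing by label (character) and by position (number) reach the same row
newdf["1249", "id"]
newdf[pos, "id"]
```

Two of the removed rows (1065 and 1109) come before 1249, so the row shifts up by two positions.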
Let us explore relationships between variables conditional on the level of another variable.
library(lattice)
xyplot(rings~height|sex,data=newmain.df,col="#FF000030",
panel=function(...){panel.xyplot(...)
panel.grid(h=5, v= 4, col="blue")})
There seems to be a linear relationship between height and rings in all three groups. It is almost the same for males and females, which suggests that there are inherently two groups: adults and infants.
xyplot(length~weight|sex,data=newmain.df,col="#FF000030",
panel=function(...){panel.xyplot(...)
panel.grid(h=3, v= 4, col="blue")})
Males and females have a lower slope than infants, and the relationship appears to be curved (possibly quadratic). Again, males and females show almost the same relationship, which suggests that there are inherently two groups: adults and infants.
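If one wanted to act on this observation, the three sex levels could be collapsed into a two-level factor. This is only a sketch of one possible encoding, using the sex values of the first five rows of main.df; the name stage is hypothetical and the models fitted in this document do not necessarily use it.

```r
# sex values of the first five rows of main.df
sex <- factor(c("M", "M", "M", "I", "I"), levels = c("F", "I", "M"))

# Collapse M and F into "Adult", keep I as "Infant"
stage <- factor(ifelse(sex == "I", "Infant", "Adult"),
                levels = c("Adult", "Infant"))

table(sex, stage)
```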
We will fit three models to estimate the number of rings based on the values of the remaining variables.
rings.lm = lm (rings ~ ., data = newmain.df)
plot(rings.lm,which=1)
# 1249 looks influential
cooks20x(rings.lm)
# Observation 1247 (= data point 1249 of main.df) has a Cook's distance greater than 0.4; it was the initial suspect as well.
# Updating model by removing influential point
newmain.df[1247,]
## sex length diameter height weight rings
## 1249 M 0.705 0.565 0.515 2.21 10
new.df = newmain.df[-1247,]
rings.lm2 = lm (rings ~ . , data = new.df)
plot(rings.lm2, which=1)
# EOV (equality of variance) check seems to be satisfied
cooks20x(rings.lm2)
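cooks20x() comes from the s20x package. As a rough sketch of the same diagnostic using base R only, cooks.distance() can be applied to any lm fit. The data below are made up for illustration (a perfect line with one corrupted point) and are not taken from the abalone files; toy, fit and fit2 are illustrative names.

```r
# Made-up data: y = 2x exactly, except for a corrupted last point
toy <- data.frame(x = 1:10, y = 2 * (1:10))
toy$y[10] <- 60

fit <- lm(y ~ x, data = toy)

# Cook's distance for every observation; the largest marks the most
# influential point
cd <- cooks.distance(fit)
worst <- which.max(cd)
worst

# Refit without the most influential point, as done for rings.lm2
fit2 <- lm(y ~ x, data = toy[-worst, ])
coef(fit2)
```

Dropping the flagged point and refitting mirrors the step from rings.lm to rings.lm2; whether removal is justified for real data remains a judgment call.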