Problem Statement

The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope – a boring and time-consuming task. Other measurements, which are easier to obtain, such as length, height, weight etc. can be used to predict the age (or rings).

Data Description

Data comes from an original (non-machine-learning) study:

Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn and Wes B Ford (1994) “The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait”, Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288)


Source: UCI Machine Learning Repository


Number of Instances: 4177


Number of Attributes: 6 (5 predictors and the response, Rings)


Attribute information: for each attribute, the name (with measurement unit), data type and a brief description are given.

Name                    Data Type    Description
Sex                     nominal      M, F, and I (infant)
Length (mm)             continuous   Longest shell measurement
Diameter (mm)           continuous   Perpendicular to length
Height (mm)             continuous   With meat in shell
Whole weight (grams)    continuous   Whole abalone
Rings                   integer      +1.5 gives the age in years
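The last row of the table implies a simple conversion from ring count to age. A minimal sketch, using the ring counts of the first five rows of main.data printed in section A; the variable names rings and age_years are only illustrative:

```r
# Ring counts taken from the first five rows of main.data (see section A).
rings <- c(15, 7, 10, 7, 8)

# Age in years is the ring count plus 1.5, as stated in the attribute table.
age_years <- rings + 1.5
print(age_years)
```

Because age is just a shifted version of rings, predicting one is equivalent to predicting the other.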

Game Plan

We will solve this prediction problem in five steps:
A. Prepare
B. Validate
C. Explore
D. Fit the model
E. Analyse

A. Prepare

We have divided the entire dataset into two data files:
test.data which contains 500 (~12%) randomly selected observations
main.data which contains the remaining 3677 (~88%) observations
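The code that produced the two files is not shown, so the following is only a sketch of how such a random split could be made. The seed, the synthetic stand-in data frame abalone.df and the names test.idx, test.sketch.df and main.sketch.df are all assumptions made for illustration:

```r
set.seed(42)  # assumed seed, used only to make this sketch reproducible

# Synthetic stand-in with the same number of rows as the full dataset.
n <- 4177
abalone.df <- data.frame(id = seq_len(n),
                         rings = sample(1:29, n, replace = TRUE))

# Draw 500 row indices without replacement for the test set.
test.idx <- sample(n, size = 500)
test.sketch.df <- abalone.df[test.idx, ]
main.sketch.df <- abalone.df[-test.idx, ]

# Sizes of the two parts and the test share in percent.
print(c(test = nrow(test.sketch.df), main = nrow(main.sketch.df)))
print(round(100 * nrow(test.sketch.df) / n))
```

Each part could then be written to disk with write.table() to obtain files like test.data and main.data.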

We read the data from main.data and test.data into data frames called main.df and test.df respectively, and print the first 5 rows of each data frame.

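# stringsAsFactors = TRUE keeps sex as a factor (not the default since R 4.0.0)
main.df = read.table("main.data", header = TRUE, stringsAsFactors = TRUE)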
head(main.df, n = 5)
##   sex length diameter height weight rings
## 1   M  0.455    0.365  0.095 0.5140    15
## 2   M  0.350    0.265  0.090 0.2255     7
## 3   M  0.440    0.365  0.125 0.5160    10
## 4   I  0.330    0.255  0.080 0.2050     7
## 5   I  0.425    0.300  0.095 0.3515     8
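# stringsAsFactors = TRUE keeps sex as a factor (not the default since R 4.0.0)
test.df = read.table("test.data", header = TRUE, stringsAsFactors = TRUE)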
head(test.df, n = 5)
##   sex length diameter height weight rings
## 1   F  0.530    0.420  0.135 0.6770     9
## 2   F  0.530    0.415  0.150 0.7775    20
## 3   M  0.490    0.380  0.135 0.5415    11
## 4   M  0.450    0.320  0.100 0.3810     9
## 5   M  0.575    0.425  0.140 0.8635    11

B. Validate

Here we try to identify any mistakes in the data using summaries and plots. One way to handle erroneous observations is to remove them (the approach used here); another is to substitute the offending values, for example with the mean.
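The two options can be compared on a small example. This is a sketch on made-up data; the data frame toy.df and the rule used to flag a height as erroneous are assumptions for illustration only:

```r
# Toy data with two implausible heights (0 and 1.13).
toy.df <- data.frame(height = c(0.095, 0.090, 0, 0.125, 1.13),
                     rings  = c(15, 7, 8, 10, 8))

# Assumed rule for this sketch: a height of zero or above 1 is erroneous.
bad <- toy.df$height <= 0 | toy.df$height > 1

# Option 1: remove the erroneous observations.
removed.df <- toy.df[!bad, ]

# Option 2: keep the rows but substitute the mean of the valid heights.
imputed.df <- toy.df
imputed.df$height[bad] <- mean(toy.df$height[!bad])

print(removed.df)
print(imputed.df)
```

Removal loses the other measurements on the affected rows, while substitution keeps them at the cost of inventing a value.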

1. The categorical variable

According to the data description, the categorical variable Sex has three possible values: M, F and I. Let’s check that this holds:

summary(main.df$sex)
##    F    I    M 
## 1154 1191 1332
summary(test.df$sex)
##   F   I   M 
## 153 151 196

Looks fine.

2. The numerical variables

summary(main.df[,-c(1,6)])
##      length         diameter          height           weight      
##  Min.   :0.075   Min.   :0.0550   Min.   :0.0000   Min.   :0.0020  
##  1st Qu.:0.450   1st Qu.:0.3500   1st Qu.:0.1150   1st Qu.:0.4430  
##  Median :0.545   Median :0.4250   Median :0.1400   Median :0.7950  
##  Mean   :0.524   Mean   :0.4078   Mean   :0.1395   Mean   :0.8286  
##  3rd Qu.:0.615   3rd Qu.:0.4800   3rd Qu.:0.1650   3rd Qu.:1.1540  
##  Max.   :0.815   Max.   :0.6500   Max.   :1.1300   Max.   :2.8255
library(s20x)  # provides pairs20x() and cooks20x()
pairs20x(main.df[,-c(1,6)])

summary(test.df[,-c(1,6)])
##      length          diameter          height           weight      
##  Min.   :0.1100   Min.   :0.0900   Min.   :0.0200   Min.   :0.0080  
##  1st Qu.:0.4537   1st Qu.:0.3500   1st Qu.:0.1150   1st Qu.:0.4339  
##  Median :0.5500   Median :0.4300   Median :0.1425   Median :0.8255  
##  Mean   :0.5240   Mean   :0.4082   Mean   :0.1397   Mean   :0.8301  
##  3rd Qu.:0.6100   3rd Qu.:0.4800   3rd Qu.:0.1650   3rd Qu.:1.1413  
##  Max.   :0.7750   Max.   :0.5950   Max.   :0.2350   Max.   :2.3235
pairs20x(test.df[,-c(1,6)])

There seems to be a problem with the height variable in main.df: the minimum is 0, the maximum (1.13) is far above the third quartile, and the pairs plot shows a couple of influential height points. test.df looks fine.

3. Verifying constraints

3.1. length is the longest measurement

main.df[(which(main.df$height > main.df$length | main.df$diameter > main.df$length)),]
##      sex length diameter height weight rings
## 1065   I  0.185    0.375   0.12 0.4645     6
## 1799   F  0.455    0.355   1.13 0.5940     8
test.df[which(test.df$height > test.df$length | test.df$diameter > test.df$length),]
## [1] sex      length   diameter height   weight   rings   
## <0 rows> (or 0-length row.names)

Found two points in main.df that violate this condition. test.df looks fine.

3.2. length, diameter, height, weight should be greater than zero

main.df[(which(main.df$length <= 0 | main.df$diameter <= 0 | main.df$height <= 0 | main.df$weight <= 0)),]
##      sex length diameter height weight rings
## 1109   I  0.430     0.34      0  0.428     8
## 3521   I  0.315     0.23      0  0.134     6
test.df[(which(test.df$length <= 0 | test.df$diameter <= 0 | test.df$height <= 0 | test.df$weight <= 0)),]
## [1] sex      length   diameter height   weight   rings   
## <0 rows> (or 0-length row.names)

Found two observations in main.df with a height of zero. test.df looks fine.
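Both constraints can also be checked in one pass. The helper below is a sketch, not part of the original analysis; the function name find_invalid and the toy data frame are assumptions, with toy rows modelled on the violations printed above:

```r
# Return the rows that violate constraint 3.1 or constraint 3.2.
find_invalid <- function(df) {
  not_longest  <- df$height > df$length | df$diameter > df$length
  not_positive <- df$length <= 0 | df$diameter <= 0 |
                  df$height <= 0 | df$weight <= 0
  df[which(not_longest | not_positive), ]
}

# Toy data: row 1 is valid, rows 2 to 4 each break one constraint.
toy.df <- data.frame(
  sex      = c("M", "I", "F", "I"),
  length   = c(0.455, 0.185, 0.455, 0.430),
  diameter = c(0.365, 0.375, 0.355, 0.340),
  height   = c(0.095, 0.120, 1.130, 0),
  weight   = c(0.5140, 0.4645, 0.5940, 0.428),
  rings    = c(15, 6, 8, 8)
)

invalid.df <- find_invalid(toy.df)
print(invalid.df)
```

Wrapping the condition in which() mirrors the checks above and means rows with missing values are skipped rather than returned as NA rows.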

We remove observations 1065, 1109, 1799 and 3521 from main.df, as they violate constraints 3.1 and 3.2, and plot the data again.

newmain.df = main.df[-c(1065, 1109, 1799, 3521),]
pairs20x(length ~ height + diameter, data=newmain.df )

newmain.df[which(newmain.df$height > 0.5),]
##      sex length diameter height weight rings
## 1249   M  0.705    0.565  0.515   2.21    10

Observation 1249 is another suspicious data point, as its weight looks very low for its dimensions. However, it does satisfy conditions 3.1 and 3.2, so we keep 1249 in newmain.df for now and remove only 1065, 1109, 1799 and 3521.

C. Explore

Let us explore relationships between variables conditional on the level of another variable.

library(lattice)
xyplot(rings~height|sex,data=newmain.df,col="#FF000030",
panel=function(...){panel.xyplot(...)
panel.grid(h=5, v= 4, col="blue")})

There seems to be a linear relationship between height and rings in all three groups. It is almost the same for males and females, which suggests that there are inherently two groups: adults and infants.

xyplot(length~weight|sex,data=newmain.df,col="#FF000030",
panel=function(...){panel.xyplot(...)
panel.grid(h=3, v= 4, col="blue")})

Infants have a steeper slope than males and females, and the relationship appears to be curved (possibly quadratic). Again, males and females show almost the same relationship, which suggests that there are inherently two groups: adults and infants.
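If one wanted to act on this observation, the three sex levels could be collapsed into two. The following is only a sketch of that idea and is not necessarily what the analysis does next; the variable name group and the labels "Adult" and "Infant" are assumptions:

```r
# Sex values from the first rows of main.data, plus one female for illustration.
sex <- factor(c("M", "M", "M", "I", "I", "F"))

# Collapse M and F into a single "Adult" level; I becomes "Infant".
group <- factor(ifelse(sex == "I", "Infant", "Adult"))

print(table(group))
```

A two-level factor like this could replace sex as a predictor if males and females really do behave alike.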

D. Fit the model

We will fit three models to estimate the number of rings based on the values of the remaining variables.

(a) The ordinary regression model using rings as the response.

rings.lm = lm (rings ~ ., data = newmain.df)
plot(rings.lm,which=1)

# Observation 1249 looks influential

cooks20x(rings.lm)

# Observation 1247 of newmain.df (= data point 1249 of main.df) has a Cook's distance greater than 0.4; it was our initial suspect as well.

# Updating model by removing influential point
newmain.df[1247,]
##      sex length diameter height weight rings
## 1249   M  0.705    0.565  0.515   2.21    10
new.df = newmain.df[-1247,]
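The shift from 1249 to 1247 happens because two earlier rows (1065 and 1109) were removed, while the printed row names keep the original numbering. A small sketch on made-up data shows how a row name can be mapped back to its current position; toy.df, sub.df and the other names are assumptions for illustration:

```r
toy.df <- data.frame(x = c(10, 20, 30, 40, 50),
                     y = c(1, 2, 3, 4, 5))

# Dropping rows 2 and 3 leaves row names "1", "4" and "5".
sub.df <- toy.df[-c(2, 3), ]

# The row named "5" now sits at position 3.
pos <- which(rownames(sub.df) == "5")
print(pos)

# Removing by position and removing by row name give the same rows.
by.pos.df  <- sub.df[-pos, ]
by.name.df <- sub.df[rownames(sub.df) != "5", ]
print(by.pos.df)
```

Selecting by row name avoids having to recompute the offset by hand each time rows are dropped.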

rings.lm2 = lm (rings ~ . , data = new.df)

plot(rings.lm2, which=1)

# The equality of variance (EOV) check seems to be satisfied

cooks20x(rings.lm2)
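The same influence check can be done numerically with base R's cooks.distance(), which returns one value per observation. This is a sketch on simulated data, not the abalone fit; the data, the seed and the planted outlier are assumptions, and the 0.4 cutoff is the one used in the comment above:

```r
set.seed(1)  # assumed seed for reproducibility of this sketch

# 30 points close to the line y = 2 + 3x, plus one planted extreme point (row 31).
x <- c(seq(0.1, 1, length.out = 30), 3)
y <- c(2 + 3 * x[1:30] + rnorm(30, sd = 0.1), 2)
toy.df <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = toy.df)

# Cook's distance for every observation; flag those above 0.4.
d <- cooks.distance(fit)
flagged <- which(d > 0.4)
print(flagged)

# Refit without the flagged observations, as done for rings.lm2.
fit2 <- lm(y ~ x, data = toy.df[-flagged, ])
print(coef(fit2))
```

Working with the numeric distances makes it easy to list every observation above a chosen cutoff instead of reading them off the plot.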