Problem Statement

We are given the salaries for all academic staff at a US university. We are interested in whether gender has a direct effect on salary, which would constitute unlawful discrimination.

Data Description

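The salary.txt file contains salaries for all academic staff at a US university. It is part of a file made public in a gender discrimination lawsuit. The variables are gender (F, M), deg (highest degree: Other, PhD, Prof), yrdeg (year the highest degree was completed), field (Arts, Other, Prof), startyr (year of starting work at the institution), year, rank (Assist, Assoc, Full), admin (0/1 indicator) and salary.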

Game Plan

A. Explore causality
B. Guess the missing variables
C. Fit the model
D. Analyse the results
E. Provide executive summary

A. Explore causality

To study the direct effect (causality), let us first draw the causal graph with all the variables in the dataset.

Causal graph-1

Based on the given dataset, gender would influence the choice of field, the year the highest degree was completed (yrdeg) and the year of starting work at the institution (startyr). The field, in turn, would determine whether the highest degree is a PhD or something else (deg). Overall experience can be calculated from year and yrdeg, and experience at the institution from year and startyr. Both kinds of experience, together with deg, would affect the rank and admin variables. Finally, rank, admin and field would affect salary. The red edge between gender and salary represents the relationship of interest.
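
The graph described above can also be written down as an edge list, which makes it easy to query. The following is a minimal sketch in base R; `exper_total` and `exper_inst` are hypothetical names for the two derived experience variables (they are not columns in salary.txt), and the `toy` data frame is made up for illustration.

```r
# Edges of the causal graph as described in the paragraph above.
# exper_total and exper_inst are hypothetical names for the derived
# experience variables; they do not exist in salary.txt.
edges <- data.frame(
  from = c("gender", "gender", "gender", "field",
           "year", "yrdeg", "year", "startyr",
           "exper_total", "exper_total", "exper_inst", "exper_inst",
           "deg", "deg", "rank", "admin", "field", "gender"),
  to   = c("field", "yrdeg", "startyr", "deg",
           "exper_total", "exper_total", "exper_inst", "exper_inst",
           "rank", "admin", "rank", "admin",
           "rank", "admin", "salary", "salary", "salary", "salary"),
  stringsAsFactors = FALSE
)

# Direct causes (parents) of a node in the graph
parents <- function(node) edges$from[edges$to == node]
parents("salary")

# Derived experience variables on a small made-up data frame
toy <- data.frame(year = c(95, 95, 95),
                  yrdeg = c(70, 85, 92),
                  startyr = c(75, 85, 93))
toy$exper_total <- toy$year - toy$yrdeg     # years since the highest degree
toy$exper_inst  <- toy$year - toy$startyr   # years at the institution
toy
```

Listing the parents of salary shows which variables sit on the paths through which gender can act indirectly, next to the direct gender edge itself.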

B. Guess the missing variables

We think the following important variables, which are not in the dataset, are vital for this causal question:

  1. family - whether the staff member has family responsibilities, such as childcare or caring for a sick person at home. This could affect the number of hours worked and/or the workload. Women more often carry childcare responsibilities, which can result in stronger preferences for certain shift timings, places of work, etc.

  2. courses - the number of courses a staff member is responsible for. More courses would mean a higher salary, and only certain ranks may be eligible to teach or co-ordinate more courses.

  3. grantfunds - research grant funds or external sponsorship. Certain fields may have research grant funds at the departmental level that allow staff to earn extra salary, e.g. for summer research. Since the choice of field depends on individual preferences, staff whose departments have no such funds would miss out on this component.

Let us see how these missing variables can affect the causality.

Causal graph-2

C. Fit the model

Since we do not have information about these missing variables, we restrict ourselves to the variables in the dataset and pretend that no other variables are needed. We then estimate the overall effect and the direct effect of gender on salary (this may depend, e.g., on year).
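
To illustrate the difference between the overall and the direct effect, here is a minimal sketch on simulated data. Everything in it (the variable `full`, the sample size and the effect sizes) is an assumption made for illustration, not a result from salary.txt. Leaving a mediator such as rank out of the model gives the overall effect; including it gives the direct effect, assuming no other variables are needed.

```r
set.seed(42)
n <- 20000
gender <- factor(sample(c("F", "M"), n, replace = TRUE, prob = c(0.25, 0.75)))
male <- as.numeric(gender == "M")

# Simulated mediator: men are more likely to hold the higher rank
full <- rbinom(n, 1, 0.3 + 0.3 * male)

# Simulated log-salary: direct gender effect 0.05, rank effect 0.30
log_sal <- 8.4 + 0.05 * male + 0.30 * full + rnorm(n, sd = 0.2)
sim <- data.frame(gender, full, log_sal)

overall <- lm(log_sal ~ gender, data = sim)          # mediator left out
direct  <- lm(log_sal ~ gender + full, data = sim)   # mediator held fixed

coef(overall)["genderM"]   # approx. 0.05 + 0.30 * 0.3 = 0.14
coef(direct)["genderM"]    # approx. 0.05
```

In this simulation the overall effect combines the direct path with the path through rank, which is why it is larger than the direct effect.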

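library(s20x)  # provides pairs20x() and normcheck(), used below
# stringsAsFactors = TRUE so that summary() tabulates the categorical columns (needed in R >= 4.0)
salary.df=read.table("salary.txt", header = TRUE, stringsAsFactors = TRUE)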
summary(salary.df)
##  gender      deg           yrdeg         field         startyr    
##  F: 408   Other: 142   Min.   :48.00   Arts : 220   Min.   :48.0  
##  M:1187   PhD  :1350   1st Qu.:69.00   Other:1065   1st Qu.:73.0  
##           Prof : 103   Median :75.00   Prof : 310   Median :83.0  
##                        Mean   :76.08                Mean   :81.1  
##                        3rd Qu.:84.00                3rd Qu.:90.0  
##                        Max.   :96.00                Max.   :95.0  
##                                                                   
##       year        rank         admin           salary     
##  Min.   :95   Assist:313   Min.   :0.000   Min.   : 3042  
##  1st Qu.:95   Assoc :437   1st Qu.:0.000   1st Qu.: 4743  
##  Median :95   Full  :845   Median :0.000   Median : 5962  
##  Mean   :95                Mean   :0.106   Mean   : 6392  
##  3rd Qu.:95                3rd Qu.:0.000   3rd Qu.: 7602  
##  Max.   :95                Max.   :1.000   Max.   :14464  
##                                            NA's   :1
# Remove the row with a missing salary (handle missing data)
salary.df.na.rm=na.omit(salary.df)
summary(salary.df.na.rm)
##  gender      deg           yrdeg         field         startyr     
##  F: 408   Other: 142   Min.   :48.00   Arts : 220   Min.   :48.00  
##  M:1186   PhD  :1349   1st Qu.:69.00   Other:1064   1st Qu.:73.00  
##           Prof : 103   Median :75.00   Prof : 310   Median :83.00  
##                        Mean   :76.06                Mean   :81.09  
##                        3rd Qu.:84.00                3rd Qu.:90.00  
##                        Max.   :95.00                Max.   :95.00  
##       year        rank         admin           salary     
##  Min.   :95   Assist:312   Min.   :0.000   Min.   : 3042  
##  1st Qu.:95   Assoc :437   1st Qu.:0.000   1st Qu.: 4743  
##  Median :95   Full  :845   Median :0.000   Median : 5962  
##  Mean   :95                Mean   :0.106   Mean   : 6392  
##  3rd Qu.:95                3rd Qu.:0.000   3rd Qu.: 7602  
##  Max.   :95                Max.   :1.000   Max.   :14464
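# Pairs plot of gender, yrdeg, field, startyr and salary (columns 1, 3, 4, 5 and 9)
pairs20x(salary.df.na.rm[c(1, 3, 4, 5, 9)])

# No unusual points seen, so let's model it

salary.lm=lm(salary ~ gender + field + startyr + yrdeg + rank + admin + deg, data = salary.df.na.rm)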

plot(salary.lm, which = 1)

# The equality-of-variance (EOV) check fails: the residuals are not evenly spread

normcheck(salary.lm)

# The Q-Q plot shows right skewness

# Let's log the response
salary.lm2=lm(log(salary) ~ gender + field + startyr + yrdeg + rank + admin + deg, data = salary.df.na.rm)

plot(salary.lm2, which = 1)
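
With a logged response, coefficients act multiplicatively on the original salary scale, so `exp(coef) - 1` gives the approximate percentage difference in median salary. A minimal sketch on simulated data follows; the sample size and the 10% effect are made up for illustration and are not results from salary.txt.

```r
set.seed(7)
n <- 5000
gender <- factor(sample(c("F", "M"), n, replace = TRUE))

# Simulated salaries: men earn about 10% more (multiplicative effect)
salary <- exp(8.5 + log(1.10) * (gender == "M") + rnorm(n, sd = 0.2))

fit <- lm(log(salary) ~ gender)

# Back-transform the coefficient and its confidence interval to percentages
pct <- 100 * (exp(coef(fit)["genderM"]) - 1)
ci  <- 100 * (exp(confint(fit)["genderM", ]) - 1)
pct
ci
```

The same back-transformation could be applied to the gender coefficient of salary.lm2 once its assumptions have been checked.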