We are given the salaries of all academic staff at a US university. We are interested in whether gender has a direct effect on salary, which would be unlawful discrimination.
The salary.txt file contains salaries for all academic staff at a US university. It is part of a file made public in a gender discrimination lawsuit. The variables are:
gender: M or F
deg: highest degree is PhD or Other
yrdeg: year of highest degree
field: Visual and Performing Arts vs Professional (Medicine, Law, Nursing, etc) vs Other
startyr: first year of employment at this institution
year: year of most recent salary data for this person
rank: Assistant Professor (Lecturer), Associate Professor (Senior Lecturer), Full Professor (A/Prof, Prof)
admin: has extra pay for administrative/managerial responsibility (Deans, etc)
salary: monthly salary
A. Explore causality
B. Guess the missing variables
C. Fit the model
D. Analyse the results
E. Provide executive summary
To study the direct effect (causality) let us first draw the causal graph with all the variables given in the dataset.
Based on the given dataset, gender would influence the choice of field, the year in which the highest degree was completed (yrdeg) and the year of starting work at this institution (startyr). Similarly, field would determine whether the highest degree is a PhD or other (deg). Overall experience can be calculated from year and yrdeg, and experience at this institution from year and startyr. Overall experience and institutional experience would both affect rank and admin, as would deg. In turn, rank, admin and field would affect salary. The red edge between gender and salary represents the relationship of interest.
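The graph described above can also be written down as an edge list, which makes it easy to check which variables lie on a path from gender to salary. The following is a minimal base-R sketch, assuming the edges exactly as stated in the text; the node names exp_total and exp_inst are hypothetical labels for overall and institutional experience, and the helper functions descendants() and ancestors() are written here for illustration only.

```r
# Edges of the causal graph as described in the text (cause -> effect).
# exp_total and exp_inst are hypothetical names for the two derived experience variables.
edges <- data.frame(
  from = c("gender", "gender", "gender", "field",
           "year", "yrdeg", "year", "startyr",
           "exp_total", "exp_total", "exp_inst", "exp_inst",
           "deg", "deg", "rank", "admin", "field", "gender"),
  to   = c("field", "yrdeg", "startyr", "deg",
           "exp_total", "exp_total", "exp_inst", "exp_inst",
           "rank", "admin", "rank", "admin",
           "rank", "admin", "salary", "salary", "salary", "salary"),
  stringsAsFactors = FALSE
)

# All nodes reachable from `node` by following edges forwards.
descendants <- function(node, edges) {
  found <- character(0)
  frontier <- node
  while (length(frontier) > 0) {
    children <- unique(edges$to[edges$from %in% frontier])
    frontier <- setdiff(children, found)
    found <- union(found, frontier)
  }
  found
}

# All nodes from which `node` can be reached (follow edges backwards).
ancestors <- function(node, edges) {
  reversed <- data.frame(from = edges$to, to = edges$from, stringsAsFactors = FALSE)
  descendants(node, reversed)
}

# Mediators: variables that gender can affect and that can in turn affect salary.
mediators <- intersect(descendants("gender", edges), ancestors("salary", edges))
print(sort(mediators))
```

Under this graph, the variables in mediators carry the indirect effect of gender. Leaving them out of a model targets the overall effect, while including them targets the direct (red) edge, provided the graph is complete.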
We think some important variables that are not in the dataset but are vital for this causal question are:
family - whether the staff member has family responsibilities such as childcare or caring for a sick person at home. This could affect the number of hours worked and/or the workload. Women more often carry childcare responsibilities, which can lead to stronger preferences for certain working hours, places of work, etc.
courses - the number of courses a staff member is responsible for. More courses would usually mean a higher salary. Only certain ranks may be eligible to teach or co-ordinate more courses.
grantfunds - research grant funds or external sponsorship. Certain fields may have research grant funds at departmental level that allow staff to earn extra salary, e.g. for summer research. Since the choice of field depends on individual preferences, staff whose departments have no such funds would miss out on this component.
Let us see how these missing variables can affect the causality.
Since we have no information about these missing variables, we restrict ourselves to the variables in the dataset, pretend that no other variables are needed, and estimate the overall effect and the direct effect of gender on salary (this may depend, e.g., on year).
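To make the distinction between the overall and the direct effect concrete, here is a small simulated sketch. It does not use the salary data: the variables male, senior and log_salary and all numbers are made up, and the simulated population has no direct gender effect by construction. Gender only acts through rank.

```r
set.seed(1)
n <- 5000

# Made-up population: gender affects rank, and rank affects salary.
male   <- rbinom(n, 1, 0.7)
senior <- rbinom(n, 1, ifelse(male == 1, 0.6, 0.3))

# Log salary depends on rank only, so the direct effect of gender is zero here.
log_salary <- 8.5 + 0.4 * senior + rnorm(n, sd = 0.2)
sim <- data.frame(male, senior, log_salary)

# Overall effect: do not adjust for the mediator (rank).
total.fit <- lm(log_salary ~ male, data = sim)

# Direct effect: adjust for the mediator.
direct.fit <- lm(log_salary ~ male + senior, data = sim)

total_effect  <- unname(coef(total.fit)["male"])
direct_effect <- unname(coef(direct.fit)["male"])
print(c(total = total_effect, direct = direct_effect))
```

In this sketch the coefficient of male is clearly positive when rank is left out and close to zero once rank is included. The same contrast, with and without the mediating variables, is what separates the overall from the direct effect in the real data, subject to the assumption that no important variables are missing.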
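library(s20x)  # provides pairs20x() and normcheck(), used below
# stringsAsFactors = TRUE so that summary() tabulates the categorical columns (R >= 4.0 reads strings as character by default)
salary.df = read.table("salary.txt", header = TRUE, stringsAsFactors = TRUE)
summary(salary.df)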
##  gender     deg           yrdeg          field         startyr
##  F: 408   Other: 142   Min.   :48.00   Arts : 220   Min.   :48.0
##  M:1187   PhD  :1350   1st Qu.:69.00   Other:1065   1st Qu.:73.0
##           Prof : 103   Median :75.00   Prof : 310   Median :83.0
##                        Mean   :76.08                Mean   :81.1
##                        3rd Qu.:84.00                3rd Qu.:90.0
##                        Max.   :96.00                Max.   :95.0
##
##       year         rank         admin           salary
##  Min.   :95   Assist:313   Min.   :0.000   Min.   : 3042
##  1st Qu.:95   Assoc :437   1st Qu.:0.000   1st Qu.: 4743
##  Median :95   Full  :845   Median :0.000   Median : 5962
##  Mean   :95                Mean   :0.106   Mean   : 6392
##  3rd Qu.:95                3rd Qu.:0.000   3rd Qu.: 7602
##  Max.   :95                Max.   :1.000   Max.   :14464
## NA's :1
# Remove NA's (handle missing data)
salary.df.na.rm=na.omit(salary.df)
summary(salary.df.na.rm)
##  gender     deg           yrdeg          field         startyr
##  F: 408   Other: 142   Min.   :48.00   Arts : 220   Min.   :48.00
##  M:1186   PhD  :1349   1st Qu.:69.00   Other:1064   1st Qu.:73.00
##           Prof : 103   Median :75.00   Prof : 310   Median :83.00
##                        Mean   :76.06                Mean   :81.09
##                        3rd Qu.:84.00                3rd Qu.:90.00
##                        Max.   :95.00                Max.   :95.00
##       year         rank         admin           salary
##  Min.   :95   Assist:312   Min.   :0.000   Min.   : 3042
##  1st Qu.:95   Assoc :437   1st Qu.:0.000   1st Qu.: 4743
##  Median :95   Full  :845   Median :0.000   Median : 5962
##  Mean   :95                Mean   :0.106   Mean   : 6392
##  3rd Qu.:95                3rd Qu.:0.000   3rd Qu.: 7602
##  Max.   :95                Max.   :1.000   Max.   :14464
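Before dropping rows with na.omit(), it can help to check which column holds the missing value and which row is lost. The sketch below shows one way to do this with base R on a made-up stand-in data frame (toy.df is hypothetical, not the salary data). The same calls should work on salary.df.

```r
# Hypothetical toy data standing in for salary.df, with one missing value.
toy.df <- data.frame(
  gender = c("F", "M", "M", "F"),
  yrdeg  = c(75, 96, 84, 69),
  salary = c(5000, NA, 7600, 4700)
)

# Number of missing values in each column.
na.per.column <- colSums(is.na(toy.df))

# The rows that na.omit() would drop.
incomplete <- toy.df[!complete.cases(toy.df), ]

# The data with incomplete rows removed.
toy.clean <- na.omit(toy.df)

print(na.per.column)
print(incomplete)
```

With a single incomplete row out of roughly 1600, dropping it is unlikely to matter, but inspecting it first confirms that nothing unusual is being discarded.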
# Columns 1, 3, 4, 5, 9: gender, yrdeg, field, startyr, salary
pairs20x(salary.df.na.rm[c(1,3,4,5,9)])
# No unusual points seen, so let's model it
salary.lm = lm(salary ~ gender + field + startyr + yrdeg + rank + admin + deg, data = salary.df.na.rm)
plot(salary.lm, which = 1)
# The equality-of-variance (EOV) check fails: the residuals are not evenly spread
normcheck(salary.lm)
# The Q-Q plot shows right skewness
# Let's log the response
salary.lm2 = lm(log(salary) ~ gender + field + startyr + yrdeg + rank + admin + deg, data = salary.df.na.rm)
plot(salary.lm2, which = 1)
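Once the response is logged, coefficients act multiplicatively on salary: a coefficient beta corresponds to a change of 100 * (exp(beta) - 1) percent. A minimal sketch of this back-transformation follows; the helper pct.change() and the example coefficient values are hypothetical and are not estimates from salary.lm2.

```r
# Convert a coefficient from a model for log(salary) into a percentage difference in salary.
pct.change <- function(beta) 100 * (exp(beta) - 1)

# Hypothetical coefficient values, for illustration only (not estimates from salary.lm2).
example.betas <- c(-0.05, 0, 0.10)
print(round(pct.change(example.betas), 2))
```

Assuming the fitted model is adequate, the same transformation could be applied to the fitted coefficients and their confidence limits, e.g. pct.change(confint(salary.lm2)).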