Introduction

The data consisting of student’s achievement in two Portuguese secondary schools includes grades, demographic, social and school related features.

Collected using school reports and questionnaires, it includes two subjects: Mathematics (mat) and Portuguese language (por) which were modeled under binary/five-level classification and regression tasks.

Note: target attribute G3 has a strong correlation with attributes G2 and G1.
This occurs because G3 is the final year grade, while G1 and G2 corresponds to 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful.

Data: Student

Relevant papers: Data Mining Prediction

Attributes: GitHub

Import data and libraries

setwd("D:/et4_e")

# Import and prepare the student performance dataset
school1 = read.table("student-mat.csv",sep=";",header=TRUE)
school2 = read.table("student-por.csv",sep=";",header=TRUE)
schools = merge(school1,school2,by=c("school","sex","age","address","famsize","Pstatus","Medu","Fedu","Mjob","Fjob","reason","nursery","internet"))


# Changing characters to factors
school1[sapply(school1, is.character)] <- lapply(school1[sapply(school1, is.character)], 
                                                 as.factor)
school2[sapply(school2, is.character)] <- lapply(school2[sapply(school2, is.character)], 
                                                 as.factor)
schools[sapply(schools, is.character)] <- lapply(schools[sapply(schools, is.character)], 
                                                 as.factor)
# Check for missing values
sum(is.na(school1))
## [1] 0
sum(is.na(school2))
## [1] 0
sum(is.na(schools))
## [1] 0

Details of data

# Brief details of data given
library(dplyr)
library(tidyr)
library(ggplot2)

table(school1$school)
## 
##  GP  MS 
## 349  46
table(school2$school)
## 
##  GP  MS 
## 423 226
table(schools$school)
## 
##  GP  MS 
## 342  40
table1 <- rbind(c(mean(school1$G1), mean(school1$G2), mean(school1$G3)),c(mean(schools$G1.x), mean(schools$G2.x), mean(schools$G3.x)))
colnames(table1) <- c("G1","G2","G3")
rownames(table1) <- c("Math","Math&Port")
table1
##                 G1       G2       G3
## Math      10.90886 10.71392 10.41519
## Math&Port 10.86126 10.71204 10.38743
## Mean Grade of students in math did not improve throughout period.

Looking into the mean grades scored, we can observe that all math students and those who take both courses experienced a slight decrease in the mean grade as the school term goes on.


table2 <- rbind(c(mean(school2$G1), mean(school2$G2), mean(school2$G3)),c(mean(schools$G1.y), mean(schools$G2.y), mean(schools$G3.y)))
colnames(table2) <- c("G1","G2","G3")
rownames(table2) <- c("Port","Port&Math")
table2
##                 G1       G2       G3
## Port      11.39908 11.57011 11.90601
## Port&Math 12.11257 12.23822 12.51571
## Mean Grade of students in Portuguese have improved throughout period.

All Portuguese students and those who take both courses experienced a slight increase in mean grade as the school term goes on.



Finding relationship between variables and Final Grade

ggplot(schools, aes(x=age)) + geom_density()

No normal distribution so age varies.


# Reason affecting grades
ggplot(school1, aes(x=G3, fill = reason)) + geom_density(alpha = .3) + ggtitle("Reason affecting Math grade")

ggplot(school2, aes(x=G3, fill = reason)) + geom_density(alpha = .3) + ggtitle("Reason affecting Portuguese grade")

# Address affecting grades
ggplot(school1, aes(x=G3, fill = address)) + geom_density(alpha = .3) + ggtitle("Address affecting Math grade")

ggplot(school2, aes(x=G3, fill = address)) + geom_density(alpha = .3) + ggtitle("Address affecting Portuguese grade")

## Students in Urban region have higher mean final grade than rural region.

We can infer that the final grade can be influenced by whether a student lives in a rural or urban region.
Students in the urban region are more likely to score better for the final grade


# School affecting grades
ggplot(school1, aes(x= G3, fill = school)) + geom_density(alpha = .3) + ggtitle("School affecting Math grade")

ggplot(school2, aes(x= G3, fill = school)) + geom_density(alpha = .3) + ggtitle("School affecting Portuguese grade")

The type of school also influences the final grade, in which students enrolled in GP are more likely to score better in the Portuguese course than in Math.


# Desire for higher education affecting grades
ggplot(school1, aes(x= G3, fill = higher)) + geom_density(alpha = .3) + ggtitle("Higher Edu desire affecting Math grade")

ggplot(school2, aes(x= G3, fill = higher)) + geom_density(alpha = .3) + ggtitle("Higher Edu desire affecting Portuguese grade")

# Father's Education affecting grades
ggplot(school1, aes(x= G3)) + geom_boxplot(aes(color=as.factor(Fedu))) + ggtitle("Father's Education affecting Math grade")

ggplot(school2, aes(x= G3)) + geom_boxplot(aes(color=as.factor(Fedu))) + ggtitle("Father's Education affecting Portuguese grade")

# Mother's Education affecting grades
ggplot(school1, aes(x= G3)) + geom_boxplot(aes(color=as.factor(Medu))) + ggtitle("Mother's Education affecting Math grade")

ggplot(school2, aes(x= G3)) + geom_boxplot(aes(color=as.factor(Medu))) + ggtitle("Mother's Education affecting Portuguese grade")

The mother’s education can also slightly affect grades as seen above. Other variables such as the reason for choosing the school and desire for higher education are also slightly influential.

We will test out in the following models if these variables are indeed significant.



Hold-out Validation method

# Hold-out Validation method
set.seed(123)
library(caTools)

Msample = sample.split(school1, SplitRatio = 0.7)
Mtrain = subset(school1, Msample==TRUE,select = -c(G1,G2))
Mtest = subset(school1, Msample==FALSE,select = -c(G1,G2))

Psample = sample.split(school2, SplitRatio = 0.7)
Ptrain = subset(school2, Psample==TRUE,select = -c(G1,G2))
Ptest = subset(school2, Psample==FALSE,select = -c(G1,G2))

Using training and test sets for both courses, two linear regression models are created.



Linear Regression

# Linear Regression equations without G1, G2 variables
m1 <- lm(scale(G3) ~ ., data=Mtrain)
summary(m1)
## 
## Call:
## lm(formula = scale(G3) ~ ., data = Mtrain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.63629 -0.39473  0.04351  0.55411  1.58925 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.938757   1.171289   0.801   0.4237    
## schoolMS          0.112067   0.204339   0.548   0.5839    
## sexM              0.301670   0.132599   2.275   0.0238 *  
## age              -0.084854   0.057933  -1.465   0.1443    
## addressU          0.187970   0.151808   1.238   0.2169    
## famsizeLE3        0.107281   0.127340   0.842   0.4004    
## PstatusT          0.010046   0.185521   0.054   0.9569    
## Medu              0.053769   0.085197   0.631   0.5286    
## Fedu             -0.037413   0.074412  -0.503   0.6156    
## Mjobhealth        0.333242   0.297567   1.120   0.2639    
## Mjobother        -0.073581   0.191238  -0.385   0.7008    
## Mjobservices      0.202353   0.205641   0.984   0.3261    
## Mjobteacher      -0.183133   0.275427  -0.665   0.5068    
## Fjobhealth        0.116873   0.401195   0.291   0.7711    
## Fjobother        -0.309049   0.244010  -1.267   0.2066    
## Fjobservices     -0.366905   0.257570  -1.424   0.1556    
## Fjobteacher       0.146425   0.328052   0.446   0.6558    
## reasonhome        0.015640   0.144540   0.108   0.9139    
## reasonother       0.112157   0.205276   0.546   0.5853    
## reasonreputation  0.084010   0.156742   0.536   0.5925    
## guardianmother    0.041294   0.138769   0.298   0.7663    
## guardianother     0.099976   0.262609   0.381   0.7038    
## traveltime        0.007881   0.088094   0.089   0.9288    
## studytime         0.111833   0.075468   1.482   0.1397    
## failures         -0.365094   0.080207  -4.552 8.53e-06 ***
## schoolsupyes     -0.307894   0.174556  -1.764   0.0791 .  
## famsupyes        -0.168093   0.124482  -1.350   0.1782    
## paidyes           0.097423   0.124735   0.781   0.4356    
## activitiesyes    -0.160953   0.117128  -1.374   0.1707    
## nurseryyes        0.013323   0.143816   0.093   0.9263    
## higheryes         0.413487   0.261036   1.584   0.1145    
## internetyes      -0.035061   0.168832  -0.208   0.8357    
## romanticyes      -0.247584   0.120981  -2.046   0.0418 *  
## famrel            0.132873   0.061758   2.152   0.0325 *  
## freetime          0.009596   0.062436   0.154   0.8780    
## goout            -0.144626   0.061019  -2.370   0.0186 *  
## Dalc             -0.059154   0.082667  -0.716   0.4750    
## Walc              0.098673   0.065583   1.505   0.1338    
## health           -0.078134   0.043458  -1.798   0.0735 .  
## absences          0.011796   0.007271   1.622   0.1061    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8834 on 235 degrees of freedom
## Multiple R-squared:  0.3307, Adjusted R-squared:  0.2196 
## F-statistic: 2.977 on 39 and 235 DF,  p-value: 1.85e-07
## failures, goout, internet, romantic relationship, gender are the significant variables for Math course

For Math students, the number of past class failures would be the significant variable that affects the model, followed by the amount of going out with friends, being in a romantic relationship and having school support.


m2 <- lm(scale(G3) ~ ., data=Ptrain)
summary(m2)
## 
## Call:
## lm(formula = scale(G3) ~ ., data = Ptrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9024 -0.4445 -0.0029  0.4736  2.1803 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.432951   0.769040  -0.563 0.573757    
## schoolMS         -0.345492   0.101133  -3.416 0.000698 ***
## sexM             -0.130141   0.094934  -1.371 0.171166    
## age               0.022795   0.039585   0.576 0.565026    
## addressU          0.119400   0.099849   1.196 0.232460    
## famsizeLE3        0.089959   0.090522   0.994 0.320915    
## PstatusT          0.094951   0.133880   0.709 0.478588    
## Medu             -0.051827   0.059299  -0.874 0.382635    
## Fedu              0.078977   0.053229   1.484 0.138642    
## Mjobhealth        0.415715   0.205649   2.021 0.043875 *  
## Mjobother         0.050454   0.116354   0.434 0.664785    
## Mjobservices      0.163656   0.140206   1.167 0.243780    
## Mjobteacher       0.344471   0.188073   1.832 0.067734 .  
## Fjobhealth       -0.263540   0.298599  -0.883 0.377972    
## Fjobother        -0.225694   0.189139  -1.193 0.233450    
## Fjobservices     -0.319539   0.195397  -1.635 0.102741    
## Fjobteacher      -0.053279   0.265920  -0.200 0.841301    
## reasonhome        0.006333   0.106930   0.059 0.952803    
## reasonother      -0.291601   0.145165  -2.009 0.045213 *  
## reasonreputation  0.010924   0.114433   0.095 0.923995    
## guardianmother   -0.021224   0.103134  -0.206 0.837053    
## guardianother     0.178871   0.211657   0.845 0.398545    
## traveltime        0.025677   0.059758   0.430 0.667656    
## studytime         0.098113   0.051788   1.895 0.058856 .  
## failures         -0.384632   0.073541  -5.230  2.7e-07 ***
## schoolsupyes     -0.377004   0.140223  -2.689 0.007465 ** 
## famsupyes         0.002073   0.086341   0.024 0.980861    
## paidyes          -0.011827   0.173477  -0.068 0.945680    
## activitiesyes     0.145483   0.084906   1.713 0.087377 .  
## nurseryyes        0.029700   0.101106   0.294 0.769095    
## higheryes         0.578472   0.149760   3.863 0.000130 ***
## internetyes       0.081091   0.105562   0.768 0.442817    
## romanticyes      -0.138853   0.087984  -1.578 0.115293    
## famrel            0.020627   0.045066   0.458 0.647413    
## freetime         -0.091201   0.042730  -2.134 0.033400 *  
## goout             0.018394   0.040400   0.455 0.649136    
## Dalc             -0.086457   0.057208  -1.511 0.131479    
## Walc             -0.034165   0.044661  -0.765 0.444721    
## health           -0.058375   0.030220  -1.932 0.054081 .  
## absences         -0.011531   0.009228  -1.250 0.212135    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8355 on 413 degrees of freedom
## Multiple R-squared:  0.3622, Adjusted R-squared:  0.3019 
## F-statistic: 6.013 on 39 and 413 DF,  p-value: < 2.2e-16
## school, higher edu, failures, school support are significant variables of Portuguese language course

For Portuguese students, the number of past class failures would also be the significant variable affecting the model alongside having the desire to pursue higher education, followed by having school support.


# Relationship of failures with G3 for Math and G3 in Portuguese language
m1failure <- lm(G3 ~failures, data = Mtrain)
summary(m1failure)
## 
## Call:
## lm(formula = G3 ~ failures, data = Mtrain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.4915  -2.1468   0.5085   2.8532   8.8532 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.4915     0.2864   40.13  < 2e-16 ***
## failures     -2.3446     0.3216   -7.29 3.36e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.317 on 273 degrees of freedom
## Multiple R-squared:  0.1629, Adjusted R-squared:  0.1599 
## F-statistic: 53.14 on 1 and 273 DF,  p-value: 3.36e-12
plot(x = school1$failures, y = school1$G3,
     pch = 18,
     col = gray(.1, .1),
     xlab = "Past Class Failures",
     ylab = "Final Grade",
     main = "Math Students"
)
abline(m1failure, lty = 2)

Final Grade scores have a negative linear relationship with the amount of past class failures for Math.


m2failure <- lm(G3 ~failures, data = Ptrain)
plot(x = school2$failures, y = school2$G3,
     pch = 20,
     col = gray(.1, .1),
     xlab = "Past Class Failures",
     ylab = "Final Grade",
     main = "Portuguese language Students"
)
abline(m2failure, lty = 2)

Final grade scores have a negative linear relationship with the amount of past class failures for Portuguese language.

There is a negative linear relationship between the number of past class failures and final grades for both Math and Portuguese.

Hence, the more the number of past class failures, the lower the final grade achieved for students in both courses



Predicting Grades with attributes

A new linear regression model with specific variables is created for both Math and Portuguese courses to predict and test the linear regression model’s accuracy.


Math grade

# Predicting data using Linear Regression

#Math prediction
sch1_m1 <- lm(G3 ~ sex + age + goout + failures,data=Mtrain)
summary(sch1_m1)
## 
## Call:
## lm(formula = G3 ~ sex + age + goout + failures, data = Mtrain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6662  -1.9848   0.1945   2.8467   9.6036 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.8222     3.5231   4.207 3.52e-05 ***
## sexM          1.0072     0.5268   1.912    0.057 .  
## age          -0.1650     0.2093  -0.788    0.431    
## goout        -0.3443     0.2381  -1.446    0.149    
## failures     -2.2441     0.3328  -6.744 9.35e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.292 on 270 degrees of freedom
## Multiple R-squared:  0.182,  Adjusted R-squared:  0.1698 
## F-statistic: 15.01 on 4 and 270 DF,  p-value: 4.291e-11
# Predicting model with selective dataframe
newdf1 <- data.frame(sex = "F", age = 18, goout = 5,failures = 3)
predict(object = sch1_m1,newdata = newdf1)
##        1 
## 3.398869
## An 18 year old female student who went out 5 times and have 3 past class failures will be predicted to have a final math grade of 3.398869.

An 18-year-old female student who went out with friends 5 times, with 3 past class failures, will be predicted to have a final math grade of 3.298847.


Portuguese language

# Portuguese language prediction
sch2_m1 <- lm(G3 ~ sex+ age+ higher+ failures+ schoolsup,data=Ptrain)
summary(sch2_m1)
## 
## Call:
## lm(formula = G3 ~ sex + age + higher + failures + schoolsup, 
##     data = Ptrain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.8956  -1.5840   0.1044   1.7298   6.6209 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.22902    2.15230   4.288 2.21e-05 ***
## sexM         -0.51652    0.27129  -1.904   0.0576 .  
## age           0.05447    0.12021   0.453   0.6507    
## higheryes     2.68621    0.46844   5.734 1.80e-08 ***
## failures     -1.51196    0.23157  -6.529 1.80e-10 ***
## schoolsupyes -1.01134    0.44740  -2.260   0.0243 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.813 on 447 degrees of freedom
## Multiple R-squared:  0.2115, Adjusted R-squared:  0.2027 
## F-statistic: 23.98 on 5 and 447 DF,  p-value: < 2.2e-16
# Predicting the model with selective dataframe
newdf2 <- data.frame(sex = "M", age = 19, higher = "yes", failures = 1, schoolsup = "yes")
predict(object = sch2_m1, newdata = newdf2 )
##        1 
## 9.910286
## A 19 year old Male student who wants to pursue higher education, have 2 class failures and school support will be predicted to have a final Portuguese language grade of 9.910286.

A 19-year-old Male student who wants to pursue higher education, has 2 class failures and school support, will be predicted to have a final Portuguese language grade of 9.910286.


The results have a high similarity to its calculated value, where values are inserted into the new regression equation. To further prove the accuracy, we will also calculate the RMSE value.


Checking accuracy with RMSE

#Checking accuracy with RMSE
library(caret)

predictions1 <- m1 %>% predict(Mtest)
RMSE(predictions1, Mtest$G3)
## [1] 10.68777
predictions1 <- m1 %>% predict(Mtrain)
RMSE(predictions1, Mtrain$G3)
## [1] 11.49578
predictions3 <- m2 %>% predict(Ptest)
RMSE(predictions1, Ptest$G3)
## [1] 12.51613
predictions3 <- m2 %>% predict(Ptrain)
RMSE(predictions1, Ptrain$G3)
## [1] 12.2923
# The RMSE for train and test sets are very similar, hence a good model.
For Math and Portuguese courses to both be a good model, the RMSE value for their training and test value should be similar (Peterson, 2013).

Hence, in this case, even if the RMSE value is very high, which means that the model fails to account for significant features underlying the data (Moody, 2019), we are still able to conclude that it is a good model.


Plotted data

# Plotted data
plot(m1)

plot(m2)


Random Forest

Random Forest (RF) is a supervised learning algorithm used for classification and regression tasks. We can observe the Variable Importance Plots with RF to see the significant variables based on students who passed math and Portuguese. (Bhalla, 2015)


# Random Forest
library(randomForest)

# Cleaning and creating new dataframe
newdata1.rf = school1
newdata1.rf$passed = factor(ifelse(school1$G3 >= 10, "yes", "no"))
newdata2.rf = school2
newdata2.rf$passed = factor(ifelse(school2$G3 >= 10, "yes", "no"))
newdata1.rf = subset(newdata1.rf, select = -c(G1,G2,G3))
newdata2.rf = subset(newdata2.rf, select = -c(G1,G2,G3))

# making RF with 100 trees
pre <- ncol(newdata1.rf)
passedmath_rf <- randomForest(passed~., data=newdata1.rf, mtry=pre, ntree=100)
passedportuguese_rf <- randomForest(passed~., data=newdata2.rf, mtry=pre, ntree=100)

importance(passedmath_rf)
##            MeanDecreaseGini
## school            0.8649265
## sex               3.0189109
## age               8.3644034
## address           1.3523935
## famsize           1.9011923
## Pstatus           0.9941314
## Medu              4.8954322
## Fedu              6.8803743
## Mjob             10.4839020
## Fjob              7.1916762
## reason            6.6636272
## guardian          4.4369410
## traveltime        4.0690519
## studytime         6.9992196
## failures         22.3614037
## schoolsup         4.2130681
## famsup            1.8696568
## paid              1.7504652
## activities        1.4143022
## nursery           1.9675330
## higher            1.3425988
## internet          1.9692244
## romantic          1.7520544
## famrel            6.6821874
## freetime          9.0805313
## goout            11.5930671
## Dalc              3.0772864
## Walc              7.0508054
## health            8.2373854
## absences         21.4592608
varImpPlot(passedmath_rf)

importance(passedportuguese_rf)
##            MeanDecreaseGini
## school            7.6979885
## sex               2.0303048
## age               5.7866996
## address           1.2115635
## famsize           3.6780696
## Pstatus           0.4811528
## Medu              4.7106471
## Fedu              5.8887606
## Mjob              6.2460825
## Fjob              4.2780834
## reason           10.0501329
## guardian          4.1225158
## traveltime        5.0561509
## studytime         3.7093507
## failures         26.9670464
## schoolsup         1.0980053
## famsup            1.5211129
## paid              2.2724300
## activities        1.8969210
## nursery           3.2710682
## higher            8.6610220
## internet          1.8867788
## romantic          2.0739413
## famrel            7.3229006
## freetime          7.1521834
## goout             7.2556843
## Dalc              7.3141559
## Walc              8.3455655
## health            6.3352306
## absences         12.2714252
varImpPlot(passedportuguese_rf)

# Both absences and amount of past failed classes are influential variables.

passedmath_rf
## 
## Call:
##  randomForest(formula = passed ~ ., data = newdata1.rf, mtry = pre,      ntree = 100) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 30
## 
##         OOB estimate of  error rate: 29.11%
## Confusion matrix:
##     no yes class.error
## no  55  75   0.5769231
## yes 40 225   0.1509434
passedportuguese_rf
## 
## Call:
##  randomForest(formula = passed ~ ., data = newdata2.rf, mtry = pre,      ntree = 100) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 30
## 
##         OOB estimate of  error rate: 16.18%
## Confusion matrix:
##     no yes class.error
## no  33  67  0.67000000
## yes 38 511  0.06921676
## About 70% accuracy for Math, and 84% accuracy for Portuguese

Both failures and absences are significant predictors that affect whether a student passes in Math and Portuguese individually.
Using Accuracy = 1-Out-of-bag error, we see that the Math model has an accuracy of around 70% and the Portuguese model with an accuracy of almost 84%.



Ridge Regression

Ridge Regression is a shrinkage method to regularise the linear regression model coefficients towards zero, although seldom hitting zero, and the model can also be used to prevent multicollinearity (Jain, 2017)


# Ridge Regression
library(glmnet)
library(caTools)
Mridge <- select_if(school1,is.numeric)
Pridge <- select_if(school2,is.numeric)

# Train & Testsets for Math students
Msample <- sample.split(Mridge, SplitRatio = 0.7)
Mtrain.RR <- na.omit(Mridge[Msample,])
Mtrain.X <- as.matrix(Mtrain.RR[,-c(14,15,16)])
Mtrain.Y <- Mtrain.RR[,16]

Mtest.RR <- na.omit(Mridge[-Msample,])
Mtest.X <- as.matrix(Mtest.RR[,-c(14,15,16)])
Mtest.Y <- Mtest.RR[,16]

# Train & Testsets for Portuguese students
Psample <- sample.split(Pridge, SplitRatio = 0.7)
Ptrain.RR <- na.omit(Pridge[Psample,])
Ptrain.X <- as.matrix(Ptrain.RR[,-c(14,15,16)])
Ptrain.Y <- Ptrain.RR[,16]

Ptest.RR <- na.omit(Pridge[-Psample,])
Ptest.X <- as.matrix(Ptest.RR[,-c(14,15,16)])
Ptest.Y <- Ptest.RR[,16]

schgrid <- 10^seq(-3,1,length=100)


Math Students

# Math Students
M_ridge.m1 <- glmnet(Mtrain.X, Mtrain.Y, alpha = 1,intercept = FALSE, lambda = schgrid)

Mcv <- cv.glmnet(Mtrain.X, Mtrain.Y, alpha = 1,intercept = FALSE, lambda = schgrid)
plot(Mcv)

goodlamda <- Mcv$lambda.min
goodlamda
## [1] 0.2656088
M_ridge.pd <- predict(M_ridge.m1, s = goodlamda, newx = Mtest.X)
M_RMSE.ridge <- sqrt(mean((M_ridge.pd-Mtest.Y)^2))
M_RMSE.ridge
## [1] 4.299984
M_glm <- glmnet(Mtrain.X, Mtrain.Y,alpha=1,lambda = schgrid)
M_ridge.coef <- predict(M_glm,type="coefficients",s=goodlamda)
M_ridge.coef
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept) 11.53516136
## age          .         
## Medu         0.18812823
## Fedu         .         
## traveltime  -0.40483960
## studytime    0.03831025
## failures    -1.66514181
## famrel       0.07732977
## freetime     .         
## goout       -0.24090839
## Dalc         .         
## Walc         .         
## health      -0.06971458
## absences     .

For Math students, the number of past failures is the significant variable that will negatively affect their grade in math, followed by the number of times they went out with friends.


Portuguese language students

# Portuguese
P_ridge.m1 <- glmnet(Ptrain.X, Ptrain.Y, alpha = 1,intercept = F, lambda = schgrid)

Pcv <- cv.glmnet(Ptrain.X, Ptrain.Y, alpha = 1,intercept = F, lambda = schgrid)
plot(Pcv)

goodlamda2 <- Pcv$lambda.min
goodlamda2
## [1] 0.03430469
P_ridge.pred <- predict(P_ridge.m1, s = goodlamda2, newx = Ptest.X)
P_RMSE.ridge <- sqrt(mean((P_ridge.pred-Ptest.Y)^2))
P_RMSE.ridge
## [1] 2.855098
P_out <- glmnet(Ptrain.X, Ptrain.Y,alpha=1,lambda=schgrid)
P_ridge.coef <- predict(P_out,type="coefficients",s=goodlamda2)
P_ridge.coef
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)  1.114769e+01
## age          .           
## Medu         3.382646e-01
## Fedu         2.384303e-01
## traveltime  -7.754733e-05
## studytime    6.024020e-01
## failures    -1.556690e+00
## famrel       7.941203e-02
## freetime    -1.550735e-01
## goout       -4.601989e-02
## Dalc        -2.512736e-01
## Walc        -3.569721e-02
## health      -1.685151e-01
## absences    -2.475563e-02

For Portuguese students, the number of past failures is also the significant variable that will negatively affect their grade in Portuguese, followed by the amount of workday alcohol consumption.


A low RMSE value would indicate a good fit for the model. In this case, the RMSE for Portuguese is lower, hence a better fit.


Conclusion

Initially from the relationships plots, we inferred from the plots that the type of school and address would be significant in determining grades. However, we find out from the models that those variables are not substantial enough to influence the final grade.

Also, across all models used, the most significant variable that influences the student’s grade for both courses would be the number of past failures.

Additionally, for math students, the number of times they went out with friends is also a slightly significant predictor.

Without the previous grades G1 and G2, it would be quite tough to make a prediction model for final grades with high accuracy since it is stated that there is a high correlation between G1, G2 and G3.
Multicollinearity would be an issue since it reduces the statistical significance of an independent coefficient. (Springer,1997)

Lastly, comparing the RMSE output of the Linear and Ridge Regression, it is evident that the RMSE value for Ridge Regression is lower; hence the Ridge Regression model is a better fit (Chugh, 2020).

Ridge Regressions also solves data with multicollinearity problems well. However, the Random Forest algorithm would be a slightly better predictive model with its accuracy above 70%.


[LinkedIn] [GitHub]

© Copyright. Evangeline Tan 2021.