The survey studies the well-being of individuals across Europe in 2016 and how it affected their outlook on their occupation.
Responses were collected from 7,647 individuals, covering 11 questions about their life (Q2a, Q2b and Q87a-e) and job satisfaction (Q90a, Q90b, Q90c, Q90f).
Data: Sixth European Working Conditions Survey
Attributes: GitHub
setwd("D:/et4_e")
ewcs = read.table("ewcs2016.csv",sep=",",header=TRUE)
ewcs[ewcs == -999] <- NA  # recode the -999 missing-value sentinel to NA
kk = complete.cases(ewcs)
ewcs = ewcs[kk,]
# Check missing values
sum(is.na(ewcs))
## [1] 0
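The recode-then-filter step above can be illustrated on a tiny toy data frame (hypothetical values, not the survey data):

```r
# Toy illustration of the cleaning step: recode the -999
# missing-value sentinel to NA, then keep only complete rows.
toy <- data.frame(a = c(1, -999, 3), b = c(2, 5, -999))
toy[toy == -999] <- NA             # logical-matrix indexing recodes every cell at once
toy <- toy[complete.cases(toy), ]  # drop any row containing an NA
nrow(toy)                          # only the fully observed first row remains
```

The same two lines scale unchanged to the full 7,647-row survey table.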
We first inspect basic details of the data, such as variable names, means and response counts, to make the subsequent analysis clearer.
# Basic Details of data
names(ewcs)
## [1] "Q2a" "Q2b" "Q87a" "Q87b" "Q87c" "Q87d" "Q87e" "Q90a" "Q90b" "Q90c"
## [11] "Q90f"
apply(ewcs,2,mean)
## Q2a Q2b Q87a Q87b Q87c Q87d Q87e Q90a
## 1.490127 43.160194 2.426180 2.606120 2.415065 2.717275 2.407611 2.126324
## Q90b Q90c Q90f
## 2.194063 2.175363 1.530535
str(ewcs)
## 'data.frame': 7647 obs. of 11 variables:
## $ Q2a : int 1 2 2 1 2 1 1 2 2 2 ...
## $ Q2b : int 63 58 32 35 27 19 23 24 22 54 ...
## $ Q87a: int 3 2 2 3 2 2 2 3 3 3 ...
## $ Q87b: int 3 3 2 2 2 2 3 3 3 2 ...
## $ Q87c: int 3 2 3 2 3 2 3 1 2 2 ...
## $ Q87d: int 3 3 2 2 3 2 3 2 2 3 ...
## $ Q87e: int 3 2 3 3 2 3 3 3 3 3 ...
## $ Q90a: int 2 2 2 2 2 2 1 2 2 3 ...
## $ Q90b: int 2 3 2 2 4 1 1 2 1 2 ...
## $ Q90c: int 2 2 2 2 2 2 2 2 2 3 ...
## $ Q90f: int 2 2 2 2 2 1 1 1 1 1 ...
table(ewcs$Q87a)
##
## 1 2 3 4 5 6
## 1444 3160 1960 588 413 82
table(ewcs$Q87b)
##
## 1 2 3 4 5 6
## 1312 2803 2012 803 539 178
table(ewcs$Q87c)
##
## 1 2 3 4 5 6
## 1556 3136 1821 606 415 113
table(ewcs$Q87d)
##
## 1 2 3 4 5 6
## 1165 2744 1986 837 670 245
table(ewcs$Q87e)
##
## 1 2 3 4 5 6
## 1768 2896 1777 617 482 107
table(ewcs$Q90a)
##
## 1 2 3 4 5
## 1712 3758 1781 291 105
table(ewcs$Q90b)
##
## 1 2 3 4 5
## 2105 2961 1788 578 215
table(ewcs$Q90c)
##
## 1 2 3 4 5
## 2091 2911 2017 469 159
table(ewcs$Q90f)
##
## 1 2 3 4 5
## 4201 2970 377 63 36
The cleaned EWCS 2016 dataset contains a total of 11 variables and 7,647 observations.
In the following report, we will use unsupervised learning methods, PCA and K-Means clustering, to visualise and summarise the data.
There are 3,899 males and 3,748 females in the data, with an age range of 15-87.
summary(ewcs)
## Q2a Q2b Q87a Q87b Q87c
## Min. :1.00 Min. :15.00 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.00 1st Qu.:34.00 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000
## Median :1.00 Median :43.00 Median :2.000 Median :2.000 Median :2.000
## Mean :1.49 Mean :43.16 Mean :2.426 Mean :2.606 Mean :2.415
## 3rd Qu.:2.00 3rd Qu.:52.00 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :2.00 Max. :87.00 Max. :6.000 Max. :6.000 Max. :6.000
## Q87d Q87e Q90a Q90b
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:1.000
## Median :2.000 Median :2.000 Median :2.000 Median :2.000
## Mean :2.717 Mean :2.408 Mean :2.126 Mean :2.194
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :6.000 Max. :6.000 Max. :5.000 Max. :5.000
## Q90c Q90f
## Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000
## Median :2.000 Median :1.000
## Mean :2.175 Mean :1.531
## 3rd Qu.:3.000 3rd Qu.:2.000
## Max. :5.000 Max. :5.000
Summarising the data, we see that almost all the survey questions have a mean between options 2 and 3, indicating that respondents reported positive feelings more than half of the time.
The exception is Q90f, whose mean lies between options 1 and 2: respondents largely felt positive that they are good at their job.
We will then further use unsupervised learning to understand the data.
PCA reduces the dimensionality of the large EWCS dataset, aiming to find the minimum number of principal components that explain the maximum variance (Vidya, 2016).
# PCA
pr.ewcs <- prcomp(ewcs,center= TRUE, scale. = TRUE)
pr.ewcs$rotation
## PC1 PC2 PC3 PC4 PC5 PC6
## Q2a 0.03203956 -0.1386327 0.796373784 0.57638908 -6.171166e-02 0.01266193
## Q2b 0.07652230 -0.2204528 -0.584133839 0.76073419 7.105450e-02 0.00515712
## Q87a 0.39103574 -0.1996019 -0.038763673 -0.07849823 -3.148653e-02 0.02786038
## Q87b 0.37759153 -0.2359578 0.077079602 -0.16741716 -4.488656e-02 0.08133873
## Q87c 0.39652146 -0.2056496 -0.004550283 -0.03679735 -1.796326e-02 0.05172394
## Q87d 0.37141006 -0.2534245 0.062704331 -0.09378305 -5.747547e-05 0.14878121
## Q87e 0.36263461 -0.1259478 -0.059239797 -0.08174241 -3.299479e-02 -0.14466435
## Q90a 0.33784962 0.3007859 0.002609147 0.12630062 1.210966e-01 -0.20735048
## Q90b 0.27485090 0.4436706 0.054692725 0.05645151 2.715430e-01 -0.62004729
## Q90c 0.22363116 0.5038874 0.015633656 0.08498139 3.729994e-01 0.71576928
## Q90f 0.17680118 0.4160141 -0.080175825 0.10806045 -8.713190e-01 0.08308652
## PC7 PC8 PC9 PC10 PC11
## Q2a -0.092896333 0.01060833 -0.02186704 0.005570217 -0.009937025
## Q2b 0.005490225 -0.12077633 0.01372336 0.070150927 0.028533515
## Q87a -0.179630910 -0.07249015 -0.62888270 -0.261821916 -0.544290070
## Q87b 0.158131884 -0.36972267 -0.28416331 0.574242537 0.432370395
## Q87c 0.097909301 0.16394260 0.07090937 -0.668278083 0.554994648
## Q87d 0.360503139 -0.19974893 0.62699603 0.013748398 -0.446981468
## Q87e -0.712223779 0.37711846 0.29943899 0.284447504 0.019249723
## Q90a 0.495518337 0.62792742 -0.15661941 0.226314662 -0.078670047
## Q90b -0.100309447 -0.47561321 0.09433706 -0.131861974 0.026155642
## Q90c -0.177731352 -0.06749933 0.01374979 0.001025211 0.028835352
## Q90f -0.008177733 -0.09643027 0.04070890 -0.020991855 -0.002154017
dim(pr.ewcs$x)
## [1] 7647 11
summary(pr.ewcs)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0985 1.1874 1.01058 0.97061 0.88319 0.74969 0.71299
## Proportion of Variance 0.4003 0.1282 0.09284 0.08564 0.07091 0.05109 0.04621
## Cumulative Proportion 0.4003 0.5285 0.62133 0.70697 0.77788 0.82898 0.87519
## PC8 PC9 PC10 PC11
## Standard deviation 0.65201 0.59566 0.56499 0.52321
## Proportion of Variance 0.03865 0.03226 0.02902 0.02489
## Cumulative Proportion 0.91384 0.94609 0.97511 1.00000
The first two principal components together capture approximately 53% of the variance, with PC1 alone accounting for 40% of the total variance in the data.
Hence, to determine the ideal number of principal components to retain, and to check whether the EWCS data is suitable for PCA, we examine the scree plots below.
biplot(pr.ewcs, main = "Biplot",scale=0)
pr.var = pr.ewcs$sdev^2
pve.ewcs = pr.var/sum(pr.var)
plot(pve.ewcs, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1), type = "b", main = "Scree Plot")
plot(cumsum(pve.ewcs), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b", main = "Cumulative Scree Plot")
library(ggfortify)
autoplot(pr.ewcs)
autoplot(pr.ewcs, loadings = TRUE, loadings.label = TRUE,
data = ewcs, colour = 'Q87a')
autoplot(pr.ewcs, loadings = TRUE, loadings.label = TRUE,
data = ewcs, colour = 'Q90a')
# Find the smallest number of principal components covering >80% of total variance
which(cumsum(pve.ewcs) >= 0.8)[1]
## [1] 6
The steep "elbow" in the scree plot indicates that we should keep two principal components; reducing the data to two components explains approximately 53% of the variability, while six components would be required to explain at least 80% of the total variance.
Looking at the loadings, PC1 is driven by the Q87 and Q90 items, so it reflects the feelings individuals have about their well-being and their job.
In the biplot, the Q2 variables have shorter arrows, indicating that they contribute less variance than Q87 and Q90.
Taking Q87a and Q90a as examples, the two coloured score plots look very similar.
For both questions, most responses sit at the positive (low) end of the scale, while the relatively few 5-6 responses concentrate along the direction of that question's loading arrow.
K-Means algorithm classifies the dataset into “K” clusters, then minimises the distance between points with their centroid. (Sharma, 2019)
Firstly, we will need to find the optimal number of clusters “K”, which can be interpreted via both the Elbow and Silhouette method.
# K-means Clustering
ewcs.scaled <- scale(ewcs)
library(factoextra)
library(NbClust)
fviz_nbclust(ewcs.scaled, kmeans, method = "wss") +
geom_vline(xintercept = 2, linetype = 2) +
labs(subtitle = "Elbow method") #Elbow
fviz_nbclust(ewcs.scaled, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method") #Silhouette
fviz_cluster(kmeans(ewcs.scaled, centers = 2), geom = "point", data = ewcs.scaled)
According to both methods, the optimal number of clusters is approximately two.
Using k = 2, we can visualise the data in two dimensions, as seen above.
Then, we will further investigate the pattern of clusters in the values for variables.
# Further info on cluster using PCA
pca1 <- prcomp(ewcs.scaled)
set.seed(123)  # fix the random start so the cluster labels below are reproducible
ewcs$cluster <- as.factor(kmeans(ewcs.scaled, centers = 2)$cluster)
# cluster assignments added to the initial ewcs data
ewcs$PC1 <- pca1$x[,1]
ewcs$PC2 <- pca1$x[,2]
# PC1 and PC2 scores added to the initial ewcs data
ggplot(aes(x=PC1, y=PC2, col=cluster), data=ewcs) + geom_point() + facet_grid(.~Q2a) + ggtitle("Q2a")
The cluster pattern is similar across both levels of Q2a (gender).
ggplot(aes(x=PC1, y=PC2, col=cluster), data=ewcs) + geom_point() + facet_grid(.~Q87a) + ggtitle("Q87a")
For Q87, the majority of people who voted '6' were in Cluster 1, while the majority who voted '1' were in Cluster 2.
ggplot(aes(x=PC1, y=PC2, col=cluster), data=ewcs) + geom_point() + facet_grid(.~Q90a) + ggtitle("Q90a")
ggplot(aes(x=PC1, y=PC2, col=cluster), data=ewcs) + geom_point() + facet_grid(.~Q90f) + ggtitle("Q90f")
For Q90, the majority of people who voted '5' were in Cluster 1, while the majority who voted '1' were in Cluster 2.
Since the cluster patterns within the Q87 and Q90 items are generally quite similar, Q87a and Q90a are used as representative examples.
For Q87, most people who voted '1', indicating positive feelings over the last two weeks, were in Cluster 2, while the majority who voted '6', indicating negative feelings, were in Cluster 1.
For Q90, most people who voted '5' were in Cluster 1 and the majority who voted '1' were in Cluster 2; for Q90f, the two clusters were fairly balanced, with people in both holding positive opinions about being good at their job.
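The facetted plots can also be quantified with a contingency table of response level against cluster, showing which responses dominate each cluster. A minimal, self-contained sketch on simulated single-item responses (`resp` is a hypothetical stand-in; the survey file is not re-read here):

```r
set.seed(123)
# 60 low (positive) responses and 40 high (negative) ones as a stand-in item
resp <- c(sample(1:2, 60, replace = TRUE), sample(5:6, 40, replace = TRUE))
km <- kmeans(scale(resp), centers = 2, nstart = 25)
# Row proportions: for each response level, the share falling in each cluster
round(prop.table(table(response = resp, cluster = km$cluster), margin = 1), 2)
```

On the report's data, the analogous call is `table(ewcs$cluster, ewcs$Q87a)` once the cluster column has been added.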
We then further investigate the gender and age proportions in the clusters.
set.seed(123)
k2 <- kmeans(ewcs.scaled, centers=2) # set k = 2 to see natural clusters of 2.
#Checking Q2b
k1results <- data.frame(ewcs$Q2a, ewcs$Q2b, k2$cluster)
cluster1_2b <- subset(k1results, k2$cluster==1)
cluster2_2b <- subset(k1results, k2$cluster==2)
summary(cluster1_2b$ewcs.Q2b)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 36.00 45.00 45.46 54.00 87.00
summary(cluster2_2b$ewcs.Q2b)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 33.00 41.00 41.85 50.00 87.00
We observe that Cluster 1 has a slightly older mean age than Cluster 2 (45.5 vs 41.9), while males make up a similar share of each cluster: 48% of Cluster 1 and 53% of Cluster 2.
# Checking Q87a
k2results <- data.frame(ewcs$Q2a, ewcs$Q87a, k2$cluster)
cluster1_Q87a <- subset(k2results, k2$cluster==1)
cluster2_Q87a <- subset(k2results, k2$cluster==2)
summary(cluster1_Q87a$ewcs.Q87a)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 3.000 3.384 4.000 6.000
summary(cluster2_Q87a$ewcs.Q87a)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 2.00 1.88 2.00 6.00
Cluster 1 has a higher mean value than Cluster 2 for Q87a.
# Checking Q90a
k3results <- data.frame(ewcs$Q2a, ewcs$Q90a, k2$cluster)
cluster1_Q90a <- subset(k3results, k2$cluster==1)
cluster2_Q90a <- subset(k3results, k2$cluster==2)
summary(cluster1_Q90a$ewcs.Q90a)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 2.739 3.000 5.000
summary(cluster2_Q90a$ewcs.Q90a)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 1.777 2.000 5.000
The mean values of both Q87a and Q90a are higher in Cluster 1 than in Cluster 2. Since lower scores indicate more positive responses, we can infer that people in Cluster 2 are more likely to have had positive feelings over the last two weeks and while working, compared to Cluster 1.
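Instead of one summary() call per variable, the cluster comparison can be condensed into a single table of per-cluster means. A minimal sketch on simulated two-item data (`df`, `q1` and `q2` are hypothetical):

```r
set.seed(42)
# Simulated stand-in items: the second block of rows scores higher on both
df <- data.frame(q1 = c(rnorm(50, mean = 1.9), rnorm(50, mean = 3.4)),
                 q2 = c(rnorm(50, mean = 1.8), rnorm(50, mean = 2.7)))
df$cluster <- kmeans(scale(df[, c("q1", "q2")]), centers = 2, nstart = 25)$cluster
# One row per cluster, one column per item: each cluster's "profile"
aggregate(cbind(q1, q2) ~ cluster, data = df, FUN = mean)
```

Applied to the survey data, the same call over the 11 items grouped by `ewcs$cluster` compares the clusters across every question at once.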
cluster1_2b$ewcs.Q2a <- factor(cluster1_2b$ewcs.Q2a)
cluster2_2b$ewcs.Q2a <- factor(cluster2_2b$ewcs.Q2a)
round(prop.table(table(cluster1_2b$ewcs.Q2a)),2)
##
## 1 2
## 0.48 0.52
round(prop.table(table(cluster2_2b$ewcs.Q2a)),2)
##
## 1 2
## 0.53 0.47
48% of Cluster 1 are male, compared with 53% of Cluster 2.
# Goodness of Fit Test
# Is Cluster 1 statistically same as Cluster 2 in terms of Q87?
M <- as.matrix(table(cluster1_Q87a$ewcs.Q87a))
p.null <- as.vector(prop.table(table(cluster2_Q87a$ewcs.Q87a)))
chisq.test(M, p=p.null)
##
## Chi-squared test for given probabilities
##
## data: M
## X-squared = 46935, df = 5, p-value < 2.2e-16
Cluster 1's Q87a proportions differ significantly from Cluster 2's, so K-Means clustering identifies Q87a as a significant differentiator.
# Is Cluster 1 statistically same as Cluster 2 in terms of Q90?
Z <- as.matrix(table(cluster1_Q90a$ewcs.Q90a))
p.null1 <- as.vector(prop.table(table(cluster2_Q90a$ewcs.Q90a)))
chisq.test(Z, p=p.null1)
##
## Chi-squared test for given probabilities
##
## data: Z
## X-squared = 11795, df = 4, p-value < 2.2e-16
Cluster 1's Q90a proportions differ significantly from Cluster 2's, so K-Means clustering likewise identifies Q90a as a significant differentiator.
Using the goodness-of-fit tests, we find that the Q87 and Q90 response proportions differ between Clusters 1 and 2; K-Means clustering therefore concludes that the Q87 and Q90 questions are significant differentiators.
PCA and K-Means clustering can also be used together on large datasets: PCA first reduces the dimensionality, and K-Means is then applied in the reduced space to obtain a more accurate clustering result.
Overall, the people in the EWCS data are more likely to report positive feelings both over the last two weeks and at work.
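A hedged, self-contained sketch of that combined pipeline on simulated data (the group structure is planted, so we can check that clustering in PC space recovers it):

```r
set.seed(1)
# Simulated data with two planted groups across 6 variables
X <- matrix(rnorm(200 * 6), ncol = 6)
X[1:100, ] <- X[1:100, ] + 3                 # shift the first 100 rows to create structure
pc <- prcomp(X, center = TRUE, scale. = TRUE)
km <- kmeans(pc$x[, 1:2], centers = 2, nstart = 25)  # cluster in the 2-D PC space
# Cross-tabulate planted group vs recovered cluster (labels are arbitrary)
table(true_group = rep(1:2, each = 100), cluster = km$cluster)
```

On the EWCS data the equivalent step would be `kmeans(pr.ewcs$x[, 1:2], centers = 2)`, clustering the scores of the first two components instead of all 11 scaled variables.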
© Copyright. Evangeline Tan 2021.