Introduction

The survey studies the well-being of individuals across Europe in 2016 and how it affected their outlook on their occupation.
Responses from 7,647 individuals were collected, covering 11 questions about their life (Q2a, Q2b and Q87a-e) and job satisfaction (Q90a, Q90b, Q90c and Q90f).

Data: Sixth European Working Conditions Survey

Attributes: GitHub

Import data and libraries

setwd("D:/et4_e")

ewcs = read.table("ewcs2016.csv", sep = ",", header = TRUE)
ewcs[ewcs == -999] <- NA   # the survey codes missing responses as -999
kk = complete.cases(ewcs)
ewcs = ewcs[kk, ]          # keep complete cases only

# Check missing values
sum(is.na(ewcs))
## [1] 0

Details of data

We first inspect basic details of the data, such as variable names, means, and response counts, to make the subsequent analysis clearer.

# Basic Details of data
names(ewcs)
##  [1] "Q2a"  "Q2b"  "Q87a" "Q87b" "Q87c" "Q87d" "Q87e" "Q90a" "Q90b" "Q90c"
## [11] "Q90f"
apply(ewcs,2,mean)
##       Q2a       Q2b      Q87a      Q87b      Q87c      Q87d      Q87e      Q90a 
##  1.490127 43.160194  2.426180  2.606120  2.415065  2.717275  2.407611  2.126324 
##      Q90b      Q90c      Q90f 
##  2.194063  2.175363  1.530535
str(ewcs)
## 'data.frame':    7647 obs. of  11 variables:
##  $ Q2a : int  1 2 2 1 2 1 1 2 2 2 ...
##  $ Q2b : int  63 58 32 35 27 19 23 24 22 54 ...
##  $ Q87a: int  3 2 2 3 2 2 2 3 3 3 ...
##  $ Q87b: int  3 3 2 2 2 2 3 3 3 2 ...
##  $ Q87c: int  3 2 3 2 3 2 3 1 2 2 ...
##  $ Q87d: int  3 3 2 2 3 2 3 2 2 3 ...
##  $ Q87e: int  3 2 3 3 2 3 3 3 3 3 ...
##  $ Q90a: int  2 2 2 2 2 2 1 2 2 3 ...
##  $ Q90b: int  2 3 2 2 4 1 1 2 1 2 ...
##  $ Q90c: int  2 2 2 2 2 2 2 2 2 3 ...
##  $ Q90f: int  2 2 2 2 2 1 1 1 1 1 ...
table(ewcs$Q87a)
## 
##    1    2    3    4    5    6 
## 1444 3160 1960  588  413   82
table(ewcs$Q87b)
## 
##    1    2    3    4    5    6 
## 1312 2803 2012  803  539  178
table(ewcs$Q87c)
## 
##    1    2    3    4    5    6 
## 1556 3136 1821  606  415  113
table(ewcs$Q87d)
## 
##    1    2    3    4    5    6 
## 1165 2744 1986  837  670  245
table(ewcs$Q87e)
## 
##    1    2    3    4    5    6 
## 1768 2896 1777  617  482  107
table(ewcs$Q90a)
## 
##    1    2    3    4    5 
## 1712 3758 1781  291  105
table(ewcs$Q90b)
## 
##    1    2    3    4    5 
## 2105 2961 1788  578  215
table(ewcs$Q90c)
## 
##    1    2    3    4    5 
## 2091 2911 2017  469  159
table(ewcs$Q90f)
## 
##    1    2    3    4    5 
## 4201 2970  377   63   36
The cleaned EWCS 2016 dataset contains a total of 11 variables and 7,647 observations.


In the following report, we will use unsupervised learning methods, namely PCA and K-means clustering, to visualise and summarise the data.

There are 3,899 males and 3,748 females in the data, with ages ranging from 15 to 87.
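These figures can be verified directly from the cleaned data (a quick added check, assuming the usual coding Q2a: 1 = male, 2 = female, which is consistent with the counts above):

table(ewcs$Q2a)   # 1 = male, 2 = female (assumed coding)
range(ewcs$Q2b)   # age range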


summary(ewcs)
##       Q2a            Q2b             Q87a            Q87b            Q87c      
##  Min.   :1.00   Min.   :15.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.00   1st Qu.:34.00   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :1.00   Median :43.00   Median :2.000   Median :2.000   Median :2.000  
##  Mean   :1.49   Mean   :43.16   Mean   :2.426   Mean   :2.606   Mean   :2.415  
##  3rd Qu.:2.00   3rd Qu.:52.00   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :2.00   Max.   :87.00   Max.   :6.000   Max.   :6.000   Max.   :6.000  
##       Q87d            Q87e            Q90a            Q90b      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:1.000  
##  Median :2.000   Median :2.000   Median :2.000   Median :2.000  
##  Mean   :2.717   Mean   :2.408   Mean   :2.126   Mean   :2.194  
##  3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :6.000   Max.   :6.000   Max.   :5.000   Max.   :5.000  
##       Q90c            Q90f      
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000  
##  Median :2.000   Median :1.000  
##  Mean   :2.175   Mean   :1.531  
##  3rd Qu.:3.000   3rd Qu.:2.000  
##  Max.   :5.000   Max.   :5.000
Summarising the data, we see that almost all the survey questions have a mean value between options 2 and 3; since lower values indicate more frequent positive responses, people on average had positive feelings more than half of the time.

The exception is Q90f, with a mean value between options 1 and 2, indicating that respondents felt strongly that they are good at their job.
We then use unsupervised learning to understand the data further.


Principal Component Analysis (PCA)

PCA reduces the dimensionality of the large EWCS dataset, aiming to find the minimum number of principal components that explain the maximum variance (Vidya, 2016).

# PCA
pr.ewcs <- prcomp(ewcs,center= TRUE, scale. = TRUE)
pr.ewcs$rotation
##             PC1        PC2          PC3         PC4           PC5         PC6
## Q2a  0.03203956 -0.1386327  0.796373784  0.57638908 -6.171166e-02  0.01266193
## Q2b  0.07652230 -0.2204528 -0.584133839  0.76073419  7.105450e-02  0.00515712
## Q87a 0.39103574 -0.1996019 -0.038763673 -0.07849823 -3.148653e-02  0.02786038
## Q87b 0.37759153 -0.2359578  0.077079602 -0.16741716 -4.488656e-02  0.08133873
## Q87c 0.39652146 -0.2056496 -0.004550283 -0.03679735 -1.796326e-02  0.05172394
## Q87d 0.37141006 -0.2534245  0.062704331 -0.09378305 -5.747547e-05  0.14878121
## Q87e 0.36263461 -0.1259478 -0.059239797 -0.08174241 -3.299479e-02 -0.14466435
## Q90a 0.33784962  0.3007859  0.002609147  0.12630062  1.210966e-01 -0.20735048
## Q90b 0.27485090  0.4436706  0.054692725  0.05645151  2.715430e-01 -0.62004729
## Q90c 0.22363116  0.5038874  0.015633656  0.08498139  3.729994e-01  0.71576928
## Q90f 0.17680118  0.4160141 -0.080175825  0.10806045 -8.713190e-01  0.08308652
##               PC7         PC8         PC9         PC10         PC11
## Q2a  -0.092896333  0.01060833 -0.02186704  0.005570217 -0.009937025
## Q2b   0.005490225 -0.12077633  0.01372336  0.070150927  0.028533515
## Q87a -0.179630910 -0.07249015 -0.62888270 -0.261821916 -0.544290070
## Q87b  0.158131884 -0.36972267 -0.28416331  0.574242537  0.432370395
## Q87c  0.097909301  0.16394260  0.07090937 -0.668278083  0.554994648
## Q87d  0.360503139 -0.19974893  0.62699603  0.013748398 -0.446981468
## Q87e -0.712223779  0.37711846  0.29943899  0.284447504  0.019249723
## Q90a  0.495518337  0.62792742 -0.15661941  0.226314662 -0.078670047
## Q90b -0.100309447 -0.47561321  0.09433706 -0.131861974  0.026155642
## Q90c -0.177731352 -0.06749933  0.01374979  0.001025211  0.028835352
## Q90f -0.008177733 -0.09643027  0.04070890 -0.020991855 -0.002154017
dim(pr.ewcs$x)
## [1] 7647   11
summary(pr.ewcs)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0985 1.1874 1.01058 0.97061 0.88319 0.74969 0.71299
## Proportion of Variance 0.4003 0.1282 0.09284 0.08564 0.07091 0.05109 0.04621
## Cumulative Proportion  0.4003 0.5285 0.62133 0.70697 0.77788 0.82898 0.87519
##                            PC8     PC9    PC10    PC11
## Standard deviation     0.65201 0.59566 0.56499 0.52321
## Proportion of Variance 0.03865 0.03226 0.02902 0.02489
## Cumulative Proportion  0.91384 0.94609 0.97511 1.00000
## The first two principal components capture ~53% of the variance.
We can see that PC1 alone accounts for 40% of the total variance in the data.
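As a short aside (an addition, not part of the original analysis): because the variables were centred and scaled, these proportions are just the normalised eigenvalues of the correlation matrix, which can be verified by hand:

# PCA by hand: component variances are the eigenvalues of the
# correlation matrix when the data are scaled (compare pr.ewcs$sdev^2)
eig <- eigen(cor(ewcs))
round(eig$values / sum(eig$values), 4)   # proportion of variance per PC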

Hence, to determine the ideal number of principal components to keep, and to check whether the EWCS data is suitable for PCA, we examine the scree plots below.

biplot(pr.ewcs, main = "Biplot",scale=0)

pr.var = pr.ewcs$sdev^2
pve.ewcs = pr.var/sum(pr.var)

plot(pve.ewcs, xlab="Principal Component",
     ylab="Proportion of Variance Explained", ylim=c(0,1),
     type='b', main = "Scree Plot")

plot(cumsum(pve.ewcs), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", 
     ylim=c(0,1),
     type='b',main = "Cumulative Scree Plot")

library(ggfortify)

autoplot(pr.ewcs)

autoplot(pr.ewcs, loadings = TRUE, loadings.label = TRUE,
         data = ewcs, colour = 'Q87a')

autoplot(pr.ewcs, loadings = TRUE, loadings.label = TRUE,
         data = ewcs, colour = 'Q90a')

# Finding the top n principal component covering >80 % total variance
which(cumsum(pve.ewcs) >= 0.8)[1]
## [1] 6

The steep "elbow" point in the scree plot shows that the number of principal components we should keep is two. Reducing the data to two principal components explains approximately 53% of the variability, while six principal components are needed to explain at least 80% of the total variance.

From the loading matrix, all the Q87 and Q90 items have positive loadings on PC1, with the Q87 well-being items loading most strongly, while the Q90 job-satisfaction items load most heavily on PC2.

PC1 can therefore be read as an overall component of how individuals feel, both in general and about their job.
The Q2 variables have shorter arrows in the biplot; hence they contribute less variance to the first two components than the Q87 and Q90 items.


Colouring the score plot by Q87a and by Q90a as examples, we can see that the scatter patterns for the two questions are quite similar.
In both plots, most respondents sit at the lower, more positive end of the scale, while the rarer 5-6 responses stand out along the direction of the corresponding loading arrow.
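As a simple numeric companion to this visual impression (an added check, not in the original analysis), the pairwise correlation of the two representative items can be computed directly:

# Pairwise correlation between the two representative items
cor(ewcs$Q87a, ewcs$Q90a)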


K-Means Clustering

The K-means algorithm partitions the dataset into K clusters, minimising the distance between each point and its cluster centroid (Sharma, 2019).
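As a minimal sketch of that objective (an illustration, not part of the original analysis), the total within-cluster sum of squares that kmeans() reports as tot.withinss can be recomputed by hand for any assignment:

# Total within-cluster sum of squared distances to the centroids (WSS),
# the quantity K-means seeks to minimise
wss <- function(x, cluster) {
  sum(sapply(unique(cluster), function(k) {
    xk <- x[cluster == k, , drop = FALSE]
    sum(sweep(xk, 2, colMeans(xk))^2)   # squared distances to centroid k
  }))
}
# e.g. after km <- kmeans(x, centers = 2), wss(x, km$cluster) matches km$tot.withinss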


First, we need to find the optimal number of clusters K, which can be determined using both the elbow and silhouette methods.

# K-means Clustering
ewcs.scaled <- scale(ewcs)

library(factoextra)
library(NbClust)

fviz_nbclust(ewcs.scaled, kmeans, method = "wss") +
  geom_vline(xintercept = 2, linetype = 2) + 
  labs(subtitle = "Elbow method") #Elbow

fviz_nbclust(ewcs.scaled, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette method") #Silhouette 

set.seed(123)   # seed so the displayed clustering is reproducible
fviz_cluster(kmeans(ewcs.scaled, centers = 2), geom = "point", data = ewcs.scaled)

According to both methods, the optimal number of clusters is approximately two.
Using k = 2, we can visualise the data in two dimensions, as seen above.
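To make the silhouette criterion concrete (an added check using the cluster package; note that dist() on all 7,647 rows is memory-intensive), the average silhouette width for k = 2 can be computed directly:

library(cluster)
set.seed(123)
km2 <- kmeans(ewcs.scaled, centers = 2, nstart = 25)
sil <- silhouette(km2$cluster, dist(ewcs.scaled))
mean(sil[, "sil_width"])   # average silhouette width for k = 2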


Next, we investigate how the cluster pattern varies across the values of the individual variables.


Further info on clusters using PCA

# Further info on cluster using PCA
pca1 <- prcomp(ewcs.scaled)

set.seed(123)   # seed so the cluster labels are reproducible
ewcs$cluster <- as.factor(kmeans(ewcs.scaled, centers = 2)$cluster)
## added cluster to initial ewcs data

ewcs$PC1 <- pca1$x[,1]
ewcs$PC2 <- pca1$x[,2]
## PC1 and PC2 added into initial ewcs data

ggplot(aes(x=PC1, y=PC2, col=cluster), data=ewcs) + geom_point() + facet_grid(.~Q2a) + ggtitle("Q2a")

The cluster pattern is similar across both genders (Q2a).


ggplot(aes(x=PC1, y=PC2, col=cluster), data=ewcs) + geom_point() + facet_grid(.~Q87a) + ggtitle("Q87a")

For Q87a, the majority of people who answered '6' were in Cluster 1, while the majority who answered '1' were in Cluster 2.


ggplot(aes(x=PC1, y=PC2, col=cluster), data=ewcs) + geom_point() + facet_grid(.~Q90a) + ggtitle("Q90a")

ggplot(aes(x=PC1, y=PC2, col=cluster), data=ewcs) + geom_point() + facet_grid(.~Q90f) + ggtitle("Q90f")

For Q90a, the majority of people who answered '5' were in Cluster 1, while the majority who answered '1' were in Cluster 2.
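These visual impressions can be cross-checked numerically (an added check using the cluster column created above); each column below sums to 1 and gives the share of that response falling in each cluster:

round(prop.table(table(ewcs$cluster, ewcs$Q87a), margin = 2), 2)
round(prop.table(table(ewcs$cluster, ewcs$Q90a), margin = 2), 2)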

Since the cluster patterns within the Q87 and Q90 item groups are generally quite similar, Q87a and Q90a are used as representative examples.

For Q87, most people who answered '1', meaning they had positive feelings over the last two weeks, were in Cluster 2, while the majority who answered '6', meaning they rarely had positive feelings, were in Cluster 1 (consistent with the cluster means below).

For Q90, most people who answered '5' were in Cluster 1 and the majority who answered '1' were in Cluster 2. For Q90f in particular, the two clusters were fairly balanced, with people in both clusters holding positive opinions about being good at their job.


Cluster proportions

We then investigate the gender and age composition of the clusters.

set.seed(123)
k2 <- kmeans(ewcs.scaled, centers=2)  # k = 2 natural clusters; seed set above for reproducibility

#Checking Q2b
k1results <- data.frame(ewcs$Q2a, ewcs$Q2b, k2$cluster)
cluster1_2b <- subset(k1results, k2$cluster==1)
cluster2_2b <- subset(k1results, k2$cluster==2)
summary(cluster1_2b$ewcs.Q2b)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   36.00   45.00   45.46   54.00   87.00
summary(cluster2_2b$ewcs.Q2b)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   33.00   41.00   41.85   50.00   87.00

We observe that Cluster 1 has a slightly older mean age (45.5 vs 41.9) than Cluster 2, while males make up a relatively close proportion of the two clusters: 48% of Cluster 1 and 53% of Cluster 2 (see the tables further below).
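The age difference can also be visualised compactly (an added plot, using the k1results frame defined above):

# Age distribution by cluster
boxplot(ewcs.Q2b ~ k2.cluster, data = k1results,
        xlab = "Cluster", ylab = "Age (Q2b)", main = "Age by cluster")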


# Checking Q87a
k2results <- data.frame(ewcs$Q2a, ewcs$Q87a, k2$cluster)
cluster1_Q87a <- subset(k2results, k2$cluster==1)
cluster2_Q87a <- subset(k2results, k2$cluster==2)

summary(cluster1_Q87a$ewcs.Q87a)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   3.000   3.384   4.000   6.000
summary(cluster2_Q87a$ewcs.Q87a)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    2.00    1.88    2.00    6.00

Cluster 1 has a higher mean value than Cluster 2 for Q87a.


# Checking Q90a
k3results <- data.frame(ewcs$Q2a, ewcs$Q90a, k2$cluster)
cluster1_Q90a <- subset(k3results, k2$cluster==1)
cluster2_Q90a <- subset(k3results, k2$cluster==2)
summary(cluster1_Q90a$ewcs.Q90a)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   2.739   3.000   5.000
summary(cluster2_Q90a$ewcs.Q90a)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.777   2.000   5.000

The mean values of both Q87a and Q90a are higher in Cluster 1 than in Cluster 2. Since lower scores indicate more positive responses, we can infer that people in Cluster 2 are more likely to have had positive feelings over the last two weeks and while working, compared with Cluster 1.
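A compact way to see this pattern across every item at once (an added check; columns 1-11 are the original survey variables, before the cluster and PC columns were appended):

# Per-cluster means for all survey items
round(aggregate(ewcs[, 1:11], by = list(cluster = k2$cluster), FUN = mean), 2)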


cluster1_2b$ewcs.Q2a <- factor(cluster1_2b$ewcs.Q2a)
cluster2_2b$ewcs.Q2a <- factor(cluster2_2b$ewcs.Q2a)
round(prop.table(table(cluster1_2b$ewcs.Q2a)),2)
## 
##    1    2 
## 0.48 0.52
round(prop.table(table(cluster2_2b$ewcs.Q2a)),2)
## 
##    1    2 
## 0.53 0.47

48% of Cluster 1 are male, compared with 53% of Cluster 2.


Goodness of Fit Test

# Goodness of Fit Test
# Are Cluster 1's Q87a proportions statistically the same as Cluster 2's?
M <- as.matrix(table(cluster1_Q87a$ewcs.Q87a))
p.null <- as.vector(prop.table(table(cluster2_Q87a$ewcs.Q87a)))
chisq.test(M, p=p.null)
## 
##  Chi-squared test for given probabilities
## 
## data:  M
## X-squared = 46935, df = 5, p-value < 2.2e-16
Cluster 1's Q87a proportions are significantly different from Cluster 2's Q87a proportions (p < 2.2e-16).

The goodness-of-fit test therefore identifies Q87a as a significant differentiator between the clusters.


# Are Cluster 1's Q90a proportions statistically the same as Cluster 2's?
Z <- as.matrix(table(cluster1_Q90a$ewcs.Q90a))
p.null1 <- as.vector(prop.table(table(cluster2_Q90a$ewcs.Q90a)))
chisq.test(Z, p=p.null1)
## 
##  Chi-squared test for given probabilities
## 
## data:  Z
## X-squared = 11795, df = 4, p-value < 2.2e-16
Cluster 1's Q90a proportions are likewise significantly different from Cluster 2's Q90a proportions.

The test therefore identifies Q90a as a significant differentiator as well.
Using the goodness-of-fit tests, both Q87 and Q90 show different response proportions between Cluster 1 and Cluster 2.
Hence, the Q87 and Q90 questions are significant differentiators between the clusters.
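A complementary formulation (an added check, not in the original analysis) is a chi-squared test of homogeneity on the full cluster-by-response contingency table, which avoids treating one cluster's observed proportions as a fixed null distribution:

chisq.test(table(k2$cluster, ewcs$Q87a))
chisq.test(table(k2$cluster, ewcs$Q90a))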


Conclusion

PCA and K-means clustering can be used together on large datasets: PCA first reduces the dimensionality, after which K-means clustering can be applied to obtain a more stable result, as sketched below.
Overall, we can see that the people in the EWCS data are more likely to have had positive feelings over the last two weeks and at work.
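As a sketch of that combined approach (assuming the objects defined earlier; not part of the original analysis), one could cluster on the leading principal components instead of the raw scaled data and compare the assignments:

# K-means on the first six PCs (>80% of variance, per the scree analysis)
set.seed(123)
km.pca <- kmeans(pr.ewcs$x[, 1:6], centers = 2, nstart = 25)
table(PCA_based = km.pca$cluster, raw_based = k2$cluster)   # agreement check

Close agreement between the two assignments would indicate that the reduced representation preserves the cluster structure at lower computational cost.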


