Our goal is to use the massive amount of airline data to visualise and study the flight patterns and predict if a flight will be delayed. For this study, both Python and R will be used to investigate 2 years’ worth of data, since two full business cycles are adequate in reducing bias for one cycle.
This notebook aims to look at these questions regarding flight travel:
library(ggplot2)
library(dplyr)
library(zoo)
library(tidyr)
library("data.table")
library(plyr)
options(warn=-1)
setwd("D:/et4_e/coursework/2021")
# Import & prepare the 2006 and 2007 dataset
df_2006 = fread("2006.csv.bz2")
|--------------------------------------------------|
|==================================================|
df_2007 = fread("2007.csv.bz2")
|--------------------------------------------------|
|==================================================|
mergeddf = rbind(df_2006, df_2007)
airport_df <- read.csv2('airports.csv',sep = ",",header = TRUE)
carrier_df <- read.csv2('carriers.csv',sep = ",",header = TRUE)
planes_df <- read.csv2('D:/et4_e/coursework/2021/plane-data.csv',sep = ",",header = TRUE)
# Create Date column
mergeddf$date <- as.Date(with(mergeddf, paste(Year, Month, DayofMonth, sep="-")), "%Y-%m-%d")
# Create copy of merged data
merged_df = mergeddf
head(merged_df)
dim(merged_df)
[1] 14595137 30
There are almost 14.6 million records with 29 variable columns. The columns include the airline and flight details etc. and are mostly time-related (in mins).
summary(merged_df)
Year Month DayofMonth DayOfWeek
Min. :2006 Min. : 1.000 Min. : 1.00 Min. :1.000
1st Qu.:2006 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.:2.000
Median :2007 Median : 7.000 Median :16.00 Median :4.000
Mean :2007 Mean : 6.538 Mean :15.73 Mean :3.942
3rd Qu.:2007 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:6.000
Max. :2007 Max. :12.000 Max. :31.00 Max. :7.000
DepTime CRSDepTime ArrTime CRSArrTime
Min. : 1 Min. : 0 Min. : 1 Min. : 0
1st Qu.: 930 1st Qu.: 930 1st Qu.:1108 1st Qu.:1115
Median :1329 Median :1324 Median :1515 Median :1520
Mean :1340 Mean :1331 Mean :1484 Mean :1496
3rd Qu.:1732 3rd Qu.:1720 3rd Qu.:1911 3rd Qu.:1906
Max. :2930 Max. :2359 Max. :2955 Max. :2400
NA's :282682 NA's :316047
UniqueCarrier FlightNum TailNum
Length:14595137 Min. : 1 Length:14595137
Class :character 1st Qu.: 587 Class :character
Mode :character Median :1501 Mode :character
Mean :2187
3rd Qu.:3499
Max. :9619
ActualElapsedTime CRSElapsedTime AirTime
Min. : 5.0 Min. :-1240.0 Min. :-1425.0
1st Qu.: 75.0 1st Qu.: 77.0 1st Qu.: 54.0
Median : 108.0 Median : 109.0 Median : 84.0
Mean : 126.2 Mean : 127.2 Mean : 102.8
3rd Qu.: 156.0 3rd Qu.: 157.0 3rd Qu.: 131.0
Max. :1879.0 Max. : 1430.0 Max. : 1958.0
NA's :316047 NA's :998 NA's :316047
ArrDelay DepDelay Origin
Min. :-592.00 Min. :-1200.00 Length:14595137
1st Qu.: -9.00 1st Qu.: -4.00 Class :character
Median : -1.00 Median : 0.00 Mode :character
Mean : 9.45 Mean : 10.76
3rd Qu.: 13.00 3rd Qu.: 10.00
Max. :2598.00 Max. : 2601.00
NA's :316047 NA's :282682
Dest Distance TaxiIn
Length:14595137 Min. : 11.0 Min. : 0.000
Class :character 1st Qu.: 317.0 1st Qu.: 4.000
Mode :character Median : 569.0 Median : 5.000
Mean : 723.8 Mean : 6.873
3rd Qu.: 951.0 3rd Qu.: 8.000
Max. :4962.0 Max. :1501.000
TaxiOut Cancelled CancellationCode
Min. : 0.00 Min. :0.00000 Length:14595137
1st Qu.: 10.00 1st Qu.:0.00000 Class :character
Median : 13.00 Median :0.00000 Mode :character
Mean : 16.03 Mean :0.01937
3rd Qu.: 19.00 3rd Qu.:0.00000
Max. :602.00 Max. :1.00000
Diverted CarrierDelay WeatherDelay
Min. :0.000000 Min. : 0.000 Min. : 0.0000
1st Qu.:0.000000 1st Qu.: 0.000 1st Qu.: 0.0000
Median :0.000000 Median : 0.000 Median : 0.0000
Mean :0.002286 Mean : 3.636 Mean : 0.7258
3rd Qu.:0.000000 3rd Qu.: 0.000 3rd Qu.: 0.0000
Max. :1.000000 Max. :2580.000 Max. :1429.0000
NASDelay SecurityDelay LateAircraftDelay
Min. : 0.000 Min. : 0.0000 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.000
Median : 0.000 Median : 0.0000 Median : 0.000
Mean : 3.687 Mean : 0.0273 Mean : 4.813
3rd Qu.: 0.000 3rd Qu.: 0.0000 3rd Qu.: 0.000
Max. :1392.000 Max. :382.0000 Max. :1366.000
date
Min. :2006-01-01
1st Qu.:2006-07-08
Median :2007-01-08
Mean :2007-01-04
3rd Qu.:2007-07-06
Max. :2007-12-31
str(merged_df)
Classes ‘data.table’ and 'data.frame': 14595137 obs. of 30 variables:
$ Year : int 2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
$ Month : int 1 1 1 1 1 1 1 1 1 1 ...
$ DayofMonth : int 11 11 11 11 11 11 11 11 11 11 ...
$ DayOfWeek : int 3 3 3 3 3 3 3 3 3 3 ...
$ DepTime : int 743 1053 1915 1753 824 627 825 942 1239 1642 ...
$ CRSDepTime : int 745 1053 1915 1755 832 630 820 945 1245 1645 ...
$ ArrTime : int 1024 1313 2110 1925 1015 834 1041 1155 1438 1841 ...
$ CRSArrTime : int 1018 1318 2133 1933 1015 832 1021 1148 1445 1845 ...
$ UniqueCarrier : chr "US" "US" "US" "US" ...
$ FlightNum : int 343 613 617 300 765 295 349 356 775 1002 ...
$ TailNum : chr "N657AW" "N834AW" "N605AW" "N312AW" ...
$ ActualElapsedTime: int 281 260 235 152 171 127 136 133 119 119 ...
$ CRSElapsedTime : int 273 265 258 158 163 122 121 123 120 120 ...
$ AirTime : int 223 214 220 126 132 108 111 121 103 105 ...
$ ArrDelay : int 6 -5 -23 -8 0 2 20 7 -7 -4 ...
$ DepDelay : int -2 0 0 -2 -8 -3 5 -3 -6 -3 ...
$ Origin : chr "ATL" "ATL" "ATL" "AUS" ...
$ Dest : chr "PHX" "PHX" "PHX" "PHX" ...
$ Distance : int 1587 1587 1587 872 872 644 644 644 644 644 ...
$ TaxiIn : int 45 27 4 16 27 6 4 4 4 4 ...
$ TaxiOut : int 13 19 11 10 12 13 21 8 12 10 ...
$ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ...
$ CancellationCode : chr "" "" "" "" ...
$ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...
$ CarrierDelay : int 0 0 0 0 0 0 0 0 0 0 ...
$ WeatherDelay : int 0 0 0 0 0 0 0 0 0 0 ...
$ NASDelay : int 0 0 0 0 0 0 20 0 0 0 ...
$ SecurityDelay : int 0 0 0 0 0 0 0 0 0 0 ...
$ LateAircraftDelay: int 0 0 0 0 0 0 0 0 0 0 ...
$ date : Date, format: "2006-01-11" ...
- attr(*, ".internal.selfref")=<externalptr>
Due to the large dataset requiring more time to execute, we will randomly select 10% of the data for quick analysis.
We then filter the data into Cancelled and Non-cancelled flights.
# Remove original data & take sample (10%) of merged data (save space & load faster)
set.seed(42)
merged_df = sample_frac(merged_df, 0.10, replace = FALSE)
# Rename CancellationCode Column
merged_df$CancellationCode <- mapvalues(merged_df$CancellationCode,
from = c("A", "B", "C", "D"),
to = c("Carrier", "Weather", "National Air System (NAS)", "Security"))
merged_df$Cancelled <- mapvalues(merged_df$Cancelled,
from = c(1, 0),
to = c("Cancelled", "Not Cancelled"))
merged_df$Diverted <- mapvalues(merged_df$Diverted,
from = c(1, 0),
to = c("Diverted", "Not Diverted"))
# Check for missing values
sum(is.na(merged_df))
[1] 184149
# Missing values per column
sapply(merged_df,function(x)sum(is.na(x)))
Year Month DayofMonth
0 0 0
DayOfWeek DepTime CRSDepTime
0 28496 0
ArrTime CRSArrTime UniqueCarrier
31764 0 0
FlightNum TailNum ActualElapsedTime
0 0 31764
CRSElapsedTime AirTime ArrDelay
101 31764 31764
DepDelay Origin Dest
28496 0 0
Distance TaxiIn TaxiOut
0 0 0
Cancelled CancellationCode Diverted
0 0 0
CarrierDelay WeatherDelay NASDelay
0 0 0
SecurityDelay LateAircraftDelay date
0 0 0
# Check for duplicates
sum(duplicated(merged_df))
[1] 0
distinct(merged_df)
Missing data is also handled using linear interpolation to estimate unknown data values between known data values, and duplicates are removed.
# Imputate Null values with interpolation
merged_df <- merged_df %>%
mutate(DepTime = na.approx(DepTime))
merged_df <- merged_df %>%
mutate(ArrTime = na.approx(ArrTime))
merged_df <- merged_df %>%
mutate(ActualElapsedTime = na.approx(ActualElapsedTime))
merged_df <- merged_df %>%
mutate(CRSElapsedTime = na.approx(CRSElapsedTime))
merged_df <- merged_df %>%
mutate(AirTime = na.approx(AirTime))
merged_df <- merged_df %>%
mutate(ArrDelay = na.approx(ArrDelay))
merged_df <- merged_df %>%
mutate(DepDelay = na.approx(DepDelay))
# Change selected data types (Numeric to Categorical)
merged_df$Year <- as.factor(merged_df$Year)
merged_df$Month <- as.factor(merged_df$Month)
merged_df$DayofMonth <- as.factor(merged_df$DayofMonth)
merged_df$DayOfWeek <- as.factor(merged_df$DayOfWeek)
merged_df$FlightNum <- as.factor(merged_df$FlightNum)
merged_df$Cancelled <- as.factor(merged_df$Cancelled)
merged_df$Diverted <- as.factor(merged_df$Diverted)
# Create labelled column for easier visualisation
merged_df$Month_label <- month.abb[merged_df$Month]
# Check for missing values per column
sapply(merged_df,function(x)sum(is.na(x)))
Year Month DayofMonth
0 0 0
DayOfWeek DepTime CRSDepTime
0 0 0
ArrTime CRSArrTime UniqueCarrier
0 0 0
FlightNum TailNum ActualElapsedTime
0 0 0
CRSElapsedTime AirTime ArrDelay
0 0 0
DepDelay Origin Dest
0 0 0
Distance TaxiIn TaxiOut
0 0 0
Cancelled CancellationCode Diverted
0 0 0
CarrierDelay WeatherDelay NASDelay
0 0 0
SecurityDelay LateAircraftDelay date
0 0 0
Month_label
0
The null values have all been removed from the dataframe.
flight_cancelled <- merged_df %>% filter_at(vars(Cancelled), any_vars(. %in% c('Cancelled')))
flight_notcancelled <- merged_df %>% filter_at(vars(Cancelled), any_vars(. %in% c('Not Cancelled')))
We assume that a delayed flight is equivalent to arriving late for more than 15 minutes at its destination. (ArrDelay > 15 mins)
Since flights can be delayed on its Departure but still arrive on time, hence we do not classify those as a delayed flight.
Hence we create a DelayStatus column into the main dataframe (merged_df),where 0 = No Delay, 1 = Delay.
# Creating new column showing ArrDelay > 15mins
# 0 = No Delay, 1= Delay
flight_notcancelled$DelayStatus <- ifelse(flight_notcancelled$ArrDelay > 15, 1, 0)
table(flight_notcancelled$DelayStatus)
0 1
1102581 328437
prop.table(table(flight_notcancelled$DelayStatus))
0 1
0.7704872 0.2295128
This shows that 77% have no delays (ArrDelay > 15 minutes), where they either arrived early or on time. Also, 23% of flights were delayed. Equivalent to about 1 out of every 5 flights being delayed.
We will be looking at the different variables to get a better understanding of the data.
prop.table(table(merged_df$Month))
1 2 3 4 5 6
0.08206499 0.07508801 0.08511806 0.08237742 0.08450758 0.08426093
7 8 9 10 11 12
0.08654045 0.08835270 0.08098038 0.08534827 0.08177654 0.08358467
prop.table(table(merged_df$DayOfWeek))
1 2 3 4 5 6
0.1483665 0.1443227 0.1459657 0.1475135 0.1475149 0.1257288
7
0.1405879
# Percentage Distribution of Month
ggplot(merged_df, aes(x = Month)) +
geom_bar(aes(y = (..count..)/sum(..count..)), fill = 'cornflowerblue') + ggtitle("Month (%)") +
ylab("Percentage (%)")
# Percentage Distribution of DayOfWeek
ggplot(merged_df, aes(x = DayOfWeek)) +
geom_bar(aes(y = (..count..)/sum(..count..)),fill = 'cornflowerblue') + ggtitle("DayOfWeek (%)") +
ylab("Percentage (%)")
The data is almost evenly distributed between Month and DayOfWeek, with February and Saturday having the least number of total flights.
#Total Number of flights per Airline
ggplot(merged_df, aes(x = forcats::fct_infreq(UniqueCarrier))) +
geom_bar(aes(y = (..count..)),fill = 'cornflowerblue') + ggtitle("Total Number of Flights per Airline") +
xlab("UniqueCarrier")
carrier_df %>% filter_all(any_vars(. %in% c('WN', 'AA','OO','MQ','US')))
The top 5 airlines with the most flights are WN, AA, OO, MQ, UA.
#Boxplot Distribution of Total ArrDelay per Carrier
boxplot(flight_notcancelled$ArrDelay~flight_notcancelled$UniqueCarrier,
main = "Distribution of Total ArrDelay per Carrier",
xlab = "Carrier/Airline",
ylab = "Total ArrDelay (mins)",
border = "black"
)
# Percentage Distribution of Cancellation
ggplot(flight_cancelled, aes(x = CancellationCode)) +
geom_bar(aes(y = (..count..)/sum(..count..)), fill = 'cornflowerblue') + ggtitle("Cancellation Reasons (%)") +
ylab("Percentage (%)")
We can conclude that Cancellations are mostly due to Carrier, Weather and NAS with around 43%, 35% and 20% respectively.
#Total Number of Cancelled flights per Airline
ggplot(flight_cancelled, aes(x = forcats::fct_infreq(UniqueCarrier))) +
geom_bar(aes(y = (..count..)),fill = 'cornflowerblue') + ggtitle("Total Number of Cancelled Flights per Airline") +
xlab("UniqueCarrier")
carrier_df %>% filter_all(any_vars(. %in% c('MQ', 'AA','OO')))
The top 3 most cancelled flights throughout these 2 years are:
#Total Number of Cancelled flights per Month
ggplot(flight_cancelled, aes(x = forcats::fct_infreq(Month))) +
geom_bar(aes(y = (..count..)),fill = 'cornflowerblue') + ggtitle("Total Number of Cancelled Flights per Month") +
xlab("UniqueCarrier")
#Total Number of Cancelled flights per DayOfWeek
ggplot(flight_cancelled, aes(x = forcats::fct_infreq(DayOfWeek))) +
geom_bar(aes(y = (..count..)),fill = 'cornflowerblue') + ggtitle("Total Number of Cancelled Flights per DayOfWeek") +
xlab("UniqueCarrier")
The flights are mostly likely to be cancelled in December and February, on a Thursday and Friday.
cor.test(flight_notcancelled$ArrDelay, flight_notcancelled$DepDelay, method = "pearson")
Pearson's product-moment correlation
data: flight_notcancelled$ArrDelay and flight_notcancelled$DepDelay
t = 2841.3, df = 1431016, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9213976 0.9218910
sample estimates:
cor
0.9216447
ArrDelay and DepDelay have a strong positive linear relationship, implying that a Departure Delay will almost certainly result in an Arrival Delay.
# BarPlot
ggplot(flight_notcancelled, aes(x= UniqueCarrier)) + geom_bar(aes(fill=as.factor(DelayStatus))) + ggtitle("Barplot of DelayStatus counts per UniqueCarrier")
carrier_df %>% filter_all(any_vars(. %in% c('WN', 'AA','OO')))
The top 3 most delayed flights throughout these 2 years are:
We will breakdown this question into three parts, where we will find the airline carrier and time period least likely to have delayed flights:
# Create copy of dataframe
df_q1 <- flight_notcancelled
Time Intervals column ‘ArrPeriod’ was created based on ‘ArrTime’. 24 hours in a day will split into 6 different periods with at least 3 to 5-hour intervals since different timings like 5am and 11am are better not generalised together into a single timeframe. The period is split as such:
# Categorising ArrTime and DepTime by 6 periods: Midnight, Early Morning, Late Morning, Afternoon, Evening, Night
df_q1 <- df_q1 %>%
mutate(ArrPeriod = case_when(ArrTime >= 500 & ArrTime < 900 ~ 'Early Morning',
ArrTime >= 900 & ArrTime < 1200 ~ 'Late Morning',
ArrTime >= 1200 & ArrTime < 1700 ~ 'Afternoon',
ArrTime >= 1700 & ArrTime < 2100 ~ 'Evening',
ArrTime >= 2100 & ArrTime < 2400 ~ 'Night',
TRUE ~ 'Midnight'))
# Create table counts
counts <- table(df_q1$DelayStatus,df_q1$ArrPeriod)
# Plot grouped barplot
barplot(counts, col = c("white","cornflowerblue"),
xlab = "Time Period", ylab = "Total Delay Count",
main = "Total Delay per Period",beside=TRUE,
legend = rownames(counts))
Early Morning followed by Midnight has the lowest count for number of flights delayed. However, we will proceed on to check if having the lowest number of count will equate to the shortest average delayed time.
# Boxplot
ggplot(df_q1, aes(x= ArrDelay)) + geom_boxplot(aes(color=as.factor(ArrPeriod))) + ggtitle("Boxplot of ArrDelay (mins) vs ArrPeriod")
The boxplot shows that the mean of ArrDelay for midnight is more than the other time period. Hence, validating our assumption that lowest flight delay counts does not equate to lowest average delayed time.
#Set Sequence to ArrPeriod
df_q1$ArrPeriod <- factor(df_q1$ArrPeriod,levels = c("Midnight", "Early Morning", "Late Morning", "Afternoon", "Evening", "Night"))
df_q1 %>% group_by(ArrPeriod) %>% dplyr::summarize(mean = mean(ArrDelay)) %>%
ggplot(aes(x= ArrPeriod, y = mean)) +
geom_point() +
geom_line(aes(group = 1), col = "cornflowerblue") + ggtitle("Average ArrDelay vs Time Period") +
xlab("Time Period") + ylab("Average ArrDelay (mins)")
The best time period to minimise flight delays would be in the Early Morning from 5am to 9am.
# Create table for counts
counts <- table(df_q1$DelayStatus,df_q1$DayOfWeek)
# Plot grouped barplot
barplot(counts, col = c("white","cornflowerblue"),
xlab = "DayOfWeek",ylab = "Total Delay Count",
main = "Total Delay per DayOfWeek",beside=TRUE,
legend = rownames(counts))
#Line Plot
df_q1 %>% group_by(DayOfWeek) %>% dplyr::summarize(mean = mean(ArrDelay)) %>%
ggplot(aes(x= DayOfWeek, y = mean)) +
geom_point() +
geom_line(aes(group = 1), col = "cornflowerblue") + ggtitle("Average ArrDelay vs DayOfWeek") +
xlab("DayOfWeek") + ylab("Average ArrDelay (mins)")
The Best Day of the Week to minimize delays is to travel on Saturday, followed by Tuesday, with average ArrDelay timing of less than 6 minutes and 8 minutes, respectively.
The longest average delays of 12-13 minutes are expected in the middle of the week, from Thursday to Friday.
# create table counts
counts <- table(df_q1$DelayStatus,df_q1$Month)
# Plot grouped barplot
barplot(counts, col = c("white","cornflowerblue"),
xlab = "Month",ylab = "Total Delay Count",
main = "Total Delay per Month",beside=TRUE,
legend = rownames(counts))
df_q1 %>% group_by(Month) %>% dplyr::summarize(mean = mean(ArrDelay)) %>%
ggplot(aes(x= Month, y = mean)) +
geom_point() +
geom_line(aes(group = 1), col = "cornflowerblue") + ggtitle("Average ArrDelay vs Month") +
xlab("Month") + ylab("Average ArrDelay (mins)")
Based on ArrDelay, the best time of year to minimise travel delay is November, then September, with both averaging approximately 6 minutes of delay, as opposed to June and December, with more than twice the number of minutes delayed.
Due to the U.S. summer and winter vacation (School Holidays USA, 2022), June and December are projected to be popular months to travel.
df_q1 %>% group_by(DayofMonth) %>% dplyr::summarize(mean = mean(ArrDelay)) %>%
ggplot(aes(x= DayofMonth, y = mean)) +
geom_point() +
geom_line(aes(group = 1), col = "cornflowerblue") + ggtitle("Average ArrDelay vs DayofMonth") +
xlab("DayofMonth") + ylab("Average ArrDelay (mins)")
Also, travelling in the first half of the month, around the 8th – 9th, results in an average ArrDelay of around 6 minutes, and the second half of the month with double the delay amount
Hence overall, the top 2 recommended time period to avoid flight delay is:
All the above periods are delayed by an average of 0 to 6 minutes. Passengers may reduce their flight delays even further by booking flights with airlines that are below the delay threshold (%).
# To save up space
rm(df_q1)
The issue date of planes will be extracted from the planes_df data since we can judge the age of the aircraft and make our analysis based on that.
To determine the age of the aircrafts, the engine’s issue date is obtained by mapping the plane-data.csv consisting of the plane’s ‘issue_date’, into a duplicated data frame of the ‘flight_notcancelled’ data as a new column.
# Create a new dataframe
df_q2 <- flight_notcancelled
# Check for missing value
sapply(df_q2,function(x)sum(is.na(x)))
Year Month DayofMonth DayOfWeek DepTime CRSDepTime
0 0 0 0 0 0
ArrTime CRSArrTime UniqueCarrier FlightNum TailNum ActualElapsedTime
0 0 0 0 0 0
CRSElapsedTime AirTime ArrDelay DepDelay Origin Dest
0 0 0 0 0 0
Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
0 0 0 0 0 0
CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay date
0 0 0 0 0 0
Month_label DelayStatus issue_date
0 0 166043
# Check for missing value
sapply(df_q2,function(x)sum(is.na(x)))
Year Month DayofMonth
0 0 0
DayOfWeek DepTime CRSDepTime
0 0 0
ArrTime CRSArrTime UniqueCarrier
0 0 0
FlightNum TailNum ActualElapsedTime
0 0 0
CRSElapsedTime AirTime ArrDelay
0 0 0
DepDelay Origin Dest
0 0 0
Distance TaxiIn TaxiOut
0 0 0
Cancelled CancellationCode Diverted
0 0 0
CarrierDelay WeatherDelay NASDelay
0 0 0
SecurityDelay LateAircraftDelay date
0 0 0
Month_label DelayStatus issue_date
0 0 166043
Since our dataset is huge with more than 2 million records, we will proceed to clean the 166k records of missing issue_date.
# Remove missing & None values
df_q2 <- na.omit(df_q2)
df_q2 <- df_q2 %>%
filter(!grepl('None', issue_date))
# Convert to datetime format
df_q2$issue_date <- as.Date(df_q2$issue_date, "%m/%d/%Y")
# Total Sum of Delay by issue_date
ggplot(data = df_q2, aes(x=issue_date, y=ArrDelay))+
geom_line() + ggtitle("Total ArrDelay with Issue Date") +
xlab("Issue Date") + ylab("Total ArrDelay (mins)")
From the graph, we can see three peak points around year 1986, 2000 and 2004. Hence, we proceed to check with the mean value.
# Average of ArrDelay by issue_date
df_q2 %>% group_by(issue_date) %>% dplyr::summarize(mean = mean(ArrDelay)) %>%
ggplot(aes(x=issue_date, y=mean))+
geom_line() + ggtitle("Average ArrDelay with Issue Date") +
xlab("Issue Date") + ylab("Average ArrDelay (mins)")
Based on Issue Date, there seem to be more delay occurrences for planes issued after 1998. Also, the low total minutes delayed between 1976 and 1984 might be attributed to planes older than 24 years being removed owing to maintenance faults, or the airline scheduling them with ample time to avoid delays.
A new column of issue year and plane age is then extracted from the issue date.
df_q2$issue_year = format(as.Date(df_q2$issue_date, format="%m/%d/%Y"),"%Y")
df_q2 %>% group_by(issue_year) %>% dplyr::summarize(mean = mean(ArrDelay)) %>%
ggplot(aes(x= issue_year, y = mean)) +
geom_point() +
geom_line(aes(group = 1), size=1, col = "cornflowerblue") + ggtitle("Average ArrDelay vs issue_year") +
xlab("issue_year") + ylab("Average ArrDelay (mins)")
Older planes issued before 1984 have a higher average delay of more than 11 minutes, compared to those 1984 onwards with under 11 minutes of average delay.
Further investigations will then be made with Issue Date rather than its Year, as delays became increasingly prevalent after year 2000.
The average age of U.S planes is 11 years, with about 25% of planes above 15 years old (Mayerowitz, 2011).
Hence, we will deem an aircraft to be old when in operation for 15+ years, where data will split into two, with planes issued before 1993 termed old, and those issued 1993 onwards termed normal..
# Plot a subset of older planes
ggplot(data=subset(df_q2, issue_date < as.Date("1993-01-01")),
aes(x=issue_date, y=ArrDelay))+ geom_line() + ggtitle("Total ArrDelay with Issue Date (Old Planes (Before 1993))") +
xlab("Issue Date") + ylab("Total ArrDelay (mins)")
# Plot a subset of normal planes
ggplot(data=subset(df_q2, issue_date > as.Date("1992-12-31")),
aes(x=issue_date, y=ArrDelay))+ geom_line() + ggtitle("Total ArrDelay with Issue Date (Normal Planes (1993 Onwards))") +
xlab("Issue Date") + ylab("Total ArrDelay (mins)")
# Plot Average ArrDelay of older planes
ggplot(data=subset(df_q2, issue_date < as.Date("1993-01-01")),
aes(x=issue_date, y=ArrDelay))+ stat_summary(aes(y = ArrDelay,group=1), fun=mean, geom="line",group=1) + ggtitle("Average ArrDelay with Issue Date (Old Planes (Before 1993))") +
xlab("Issue Date") + ylab("Average ArrDelay (mins)")
# Plot Average ArrDelay of Normal planes
ggplot(data=subset(df_q2, issue_date > as.Date("1992-12-31")),
aes(x=issue_date, y=ArrDelay))+ stat_summary(aes(y = ArrDelay,group=1), fun=mean, geom="line",group=1) + ggtitle("Average ArrDelay with Issue Date (Normal Planes (1993 Onwards))") +
xlab("Issue Date") + ylab("Average ArrDelay (mins)")
# Plot Average ArrDelay by Distance of Old/Normal planes
ggplot() + stat_summary(data=subset(df_q2, issue_date < as.Date("1993-01-01")), aes(x=Distance, y = ArrDelay,group=1, color = 'red'), fun=mean, geom="line",group=1, size=0.5) + stat_summary(data=subset(df_q2, issue_date > as.Date("1992-12-31")), aes(x=Distance, y = ArrDelay,group=1, color = 'cornflowerblue'), fun=mean, geom="line",group=1, size=0.5) + ggtitle("Average ArrDelay by Distance") +
xlab("Distance (miles)") + ylab("Average ArrDelay (mins)") +
scale_colour_manual(name = 'Planes',
values =c('cornflowerblue'='cornflowerblue','red'='red'), labels = c('Normal Planes (1993 Onwards)','Old Planes (Before 1993)'))+ theme(legend.position = c(0.82,0.88))
Older planes are more likely than normal planes to have larger Average Delays as the Distance grows, with Average Arrival Delays surpassing 100 minutes or more. This further confirms that older planes tend to suffer from more delays when on a long-haul flight. Further investigations are done with Distance to identify the different delay factors that may impact old and normal planes.
library(reshape2)
package 㤼㸱reshape2㤼㸲 was built under R version 4.0.3
Attaching package: 㤼㸱reshape2㤼㸲
The following objects are masked from 㤼㸱package:data.table㤼㸲:
dcast, melt
The following object is masked from 㤼㸱package:tidyr㤼㸲:
smiths
# Organizing data
df_q2a <- df_q2 %>% select(issue_date, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(df_q2a , id.vars = 'issue_date', variable.name = 'Delay')
# Plot Total Individual Delays for full data
ggplot(df_q2a, aes(issue_date, value)) +
geom_line(aes(colour = Delay),size=0.8) + ggtitle("Total Delay by issue_date") +
xlab("issue_date") + ylab("Total Delay (mins)")+ theme(legend.position = c(0.12,0.78))
# Organizing data
df_q2a <- df_q2 %>% select(issue_date, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(df_q2a , id.vars = 'issue_date', variable.name = 'Delay')
# Plot Average Individual Delays for full data
ggplot(data=df_q2a,
aes(x=issue_date, y=value))+ stat_summary(aes(y = value,group=1,colour = Delay), fun=mean, geom="line",group=1, size=1) + ggtitle("Average Delay with Issue Date") +
xlab("Issue Date") + ylab("Average Delay (mins)") + theme(legend.position = c(0.12,0.78))
The full data shows Carrier Delay being prominent with planes issued 1990-1992 and 2000.
# Organizing data
df_q2a <- df_q2 %>% select(issue_date, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(df_q2a , id.vars = 'issue_date', variable.name = 'Delay')
# Plot Total Individual Delays for old planes
ggplot(data=subset(df_q2a, issue_date < as.Date("1993-01-01")),
aes(x=issue_date, y=value))+ geom_line(aes(colour = Delay),size=1) + ggtitle("Total Delay with Issue Date (Old Planes)") +
xlab("Issue Date") + ylab("Total Delay (mins)") + theme(legend.position = c(0.12,0.78))
# Organizing data
df_q2a <- df_q2 %>% select(issue_date, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(df_q2a , id.vars = 'issue_date', variable.name = 'Delay')
# Plot Average Individual Delays for old planes
ggplot(data=subset(df_q2a, issue_date < as.Date("1993-01-01")),
aes(x=issue_date, y=value))+ stat_summary(aes(y = value,group=1,colour = Delay), fun=mean, geom="line",group=1, size=1) + ggtitle("Average Delay with Issue Date (Old Planes)") +
xlab("Issue Date") + ylab("Average Delay (mins)") + theme(legend.position = c(0.5,0.78))
Old Planes issued around 1990 are more likely to have delays due CarrierDelay, which includes maintainence etc.
# Organizing data
df_q2a <- df_q2 %>% select(issue_date, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(df_q2a , id.vars = 'issue_date', variable.name = 'Delay')
# Plot Total Individual Delays for normal planes
ggplot(data=subset(df_q2a, issue_date > as.Date("1992-12-31")),
aes(x=issue_date, y=value))+ geom_line(aes(colour = Delay),size=1) + ggtitle("Total Delay with Issue Date (Normal Planes)") +
xlab("Issue Date") + ylab("Total Delay (mins)")
# Organizing data
df_q2a <- df_q2 %>% select(issue_date, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(df_q2a , id.vars = 'issue_date', variable.name = 'Delay')
# Plot Average Individual Delays for normal planes
ggplot(data=subset(df_q2a, issue_date > as.Date("1992-12-31")),
aes(x=issue_date, y=value))+ stat_summary(aes(y = value,group=1,colour = Delay), fun=mean, geom="line",group=1, size =1) + ggtitle("Average Delay with Issue Date (Normal Planes)") +
xlab("Issue Date") + ylab("Average Delay (mins)") + theme(legend.position = c(0.88,0.77))
Normal Planes are likely to have CarrierDelays as well.
Distance is then used to check the delays that Old and Normal Planes will have with respect to the number of miles that they have travelled.
# Organizing data
old_planes=subset(df_q2, issue_date < as.Date("1993-01-01"))
old_planes <- old_planes %>% select(Distance, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(old_planes, id.vars = 'Distance', variable.name = 'Delay')
# Plot Total Individual Delays by Distance for Old planes
ggplot(data = df_q2a,
aes(x=Distance, y=value))+ geom_line(aes(colour = Delay),size=1) + ggtitle("Total Delay with Distance (Old Planes)") +
xlab("Distance (miles)") + ylab("Total Delay (mins)") + theme(legend.position = c(0.88,0.78))
# Organizing data
old_planes=subset(df_q2, issue_date < as.Date("1993-01-01"))
old_planes <- old_planes %>% select(Distance, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(old_planes, id.vars = 'Distance', variable.name = 'Delay')
# Plot Average Individual Delays by Distance for Old planes
ggplot(data = df_q2a,
aes(x=Distance, y=value))+ stat_summary(aes(y = value,group=1,colour = Delay), fun=mean, geom="line",group=1,size=1)+ ggtitle("Average Delay with Distance (Old Planes)") +
xlab("Distance (miles)") + ylab("Average Delay (mins)") + theme(legend.position = c(0.88,0.78))
In R, although the LateAircraftDelay and NASDelay have higher average values, Carrier Delays also have occurrences of average delay ranging around 50 minutes at 1500 miles.
# Organizing data
normal_planes=subset(df_q2, issue_date > as.Date("1992-12-31"))
normal_planes <- normal_planes %>% select(Distance, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(normal_planes, id.vars = 'Distance', variable.name = 'Delay')
# Plot Total Individual Delays by Distance for Normal planes
ggplot(data = df_q2a,
aes(x=Distance, y=value))+ geom_line(aes(colour = Delay),size=1) + ggtitle("Total Delay with Distance (Normal Planes)") +
xlab("Distance (miles)") + ylab("Total Delay (mins)")
# Organizing data
normal_planes=subset(df_q2, issue_date > as.Date("1992-12-31"))
normal_planes <- normal_planes %>% select(Distance, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(normal_planes, id.vars = 'Distance', variable.name = 'Delay')
# Plot Average Individual Delays by Distance for Normal planes
ggplot(data = df_q2a,
aes(x=Distance, y=value))+ stat_summary(aes(y = value,group=1,colour = Delay), fun=mean, geom="line",group=1)+ ggtitle("Average Delay with Distance (Normal Planes)") +
xlab("Distance (miles)") + ylab("Average Delay (mins)") + theme(legend.position = c(0.88,0.78))
Normal planes have lower average values of Carrier Delays of less than 80 minutes.
Hence, it is evident that older planes would suffer more delays by having a larger value of Carrier Delay minutes while flying routes longer than 1500 miles.
Overall, the grouping by Issue Year was clear in showing that older planes do suffer from more delay, but only by up to 8 minutes on average, where LateAircraftDelay, followed by Carrier/NASDelay are the primary causes.When utilized for long-haul flights, older planes are also more likely to have delays due to carrier delays that might have resulted from aircraft maintenance or inspection.
The difference in delay however is quite negligible and it might be due to airlines “padding” and scheduling extra time for flights to prevent flights from being classified as delayed (Kramer, 2019).
# To save up space
rm(df_q2)
rm(df_q2a)
rm(counts)
rm(normal_planes)
rm(old_planes)
We will find the most popular routes to gauge the number of people flying between these different locations.
# Create new dataframe
df_q3 <- flight_notcancelled
To begin, the Origin and Destination are combined into a new ‘FlightRoute’ column (e.g., OGG to HNL).
# Create new column as its flight route
df_q3$FlightRoute <- paste(df_q3$Origin, "to", df_q3$Dest)
# Most popular routes
df_q3 %>%
group_by(df_q3$FlightRoute) %>%
dplyr::summarize(Count=n()) %>%
arrange(desc(Count))
The top five most popular routes are:
Out of the 5196 distinct routes, the Top 5 routes with the highest count are identified to examine if the number of flights has changed over the course of the year.
# BarPlot
ggplot(df_q3, aes(x= Month)) + geom_bar(aes(fill=as.factor(FlightRoute))) + ggtitle("Barplot of FlightRoute counts")
# BarPlot
ggplot(df_q3, aes(x= Month)) + geom_bar(aes(fill=as.factor(FlightRoute))) + ggtitle("Barplot of FlightRoute counts")
# FlightRoute Line Plot
ggplot(df_q3, aes(x=Month, group = factor(FlightRoute) , colour=factor(FlightRoute))) +
geom_line(stat = 'count',size=1) + ggtitle("Count of flights per Flight Route by Month") +
xlab("Month") + ylab("Total Flight Counts")
For the first five months, the flight routes [OGG to HNL] and [HNL to OGG] had roughly 50 fewer flights than the other three routes.
However, from June to August, both routes begin to increase in their number of flights, reaching over 100 more flights than the other routes.
Between September and December, these routes continue to have more flights than the others.
To make the most of the data, the States were queried to justify the number of people flying interstate and intrastate.
# Query state from airport_df into dataframe
df_q3$state <- airport_df$state[match(df_q3$Origin, airport_df$iata)]
# Query state from airport_df into dataframe
df_q3$state <- airport_df$state[match(df_q3$Origin, airport_df$iata)]
# Most popular states
df_q3 %>%
group_by(df_q3$state) %>%
dplyr::summarize(Count=n()) %>%
arrange(desc(Count))
Top 3 popular states are:
# Select the top 3 states
df_q3 <- df_q3[df_q3$state %in% c('CA', 'TX', 'IL'), ]
# Create new column of state flight routes
df_q3$FlightRoute <- paste(df_q3$state, "to", df_q3$state[-1:0])
# Create new column of state flight routes
df_q3$FlightRoute <- paste(df_q3$state, "to", df_q3$state[-1:0])
# Flight Routes of Top 3 states
ggplot(df_q3, aes(x=Month, group = factor(FlightRoute) , colour=factor(FlightRoute))) +
geom_line(stat = 'count',size=1) + ggtitle("Count of flights per Flight Route by Month") +
xlab("Month") + ylab("Total Flight Counts")
For both intrastate and interstate, February and September have the fewest flights with around 100 less than other months. As expected, the number of flights increases in the middle of the year, between June and August.
With two distinct methodologies, it is evident that February has the fewest flights, followed by September. The more popular travelling months are during June to August, which is likely due to the summer holidays in the USA which last 11 weeks from June to August (School Holidays USA, 2022).
# To save up space
rm(df_q3)
Cascading failures occur when a flight delay for one plane in an airport causes a flight delay in another.
This is explained by the existing variable ‘LateAircraftDelay’ which describes how a particular flight delayed in its Origin arrives late in its Destination, then affecting the next flight’s departure since the same plane was used. The ripple impact of a previous delay at downstream airports hence causes cascading failures.
Since Tail Numbers are identification numbers on aircraft, it will be easier to focus on data with ‘LateAircraftDelay’, then focus on a selected aircraft and observe its flight schedule.
We will approach the question as follow:
# Create new dataframe
df_q4 <- merged_df
# Find Date with highest value counts
df_q4 %>%
group_by(df_q4$TailNum) %>%
dplyr::summarize(Count=n()) %>%
arrange(desc(Count))
‘0’ and ‘000000’ are likely private confidential TailNums, so we ignore those data and focus on the top 2 highest counts of TailNum:
We will approach the dataset in the following way to effectively illustrate cascading failures, that is, delays in one airport will cause delays in another:
This approach will allow us to assess if a delayed flight in one airport may cause a delay in another. To ensure that the data extraction and analysis approaches are valid, two separate tests will be conducted.
The initial test used N308SW to identify the highest date counts.
# First extraction of data with top most TailNum counts
df_q4_1 <- merged_df[merged_df$TailNum %in% c('N308SW') & merged_df$LateAircraftDelay > 15, ]
df_q4_1 %>%
group_by(df_q4_1$date) %>%
dplyr::summarize(Count=n()) %>%
arrange(desc(Count))
We will focus on the date with the highest count to see if there is any relation that leads to cascading failure.
# Extracting dataset that matches our findings
df_q4_1 <- merged_df[merged_df$TailNum %in% c('N308SW') & merged_df$date == ('2006-03-31'), ]
df_q4_1[order(DepTime)][-c(1),]
It is clear that one flight’s delay in an airport can cause cascading failures in another.
Taking Flight 458 as an example: has to depart at 1835 (6:35pm) but delayed for 52 mins till 1927 (7:27pm), flying from DAL to SAT. Supposed to reach 1935 (7:35pm) but delayed 50 mins and reached 2025 (8:25pm).
This 77 mins delay thereafter caused the next flight from SAT to DAL (FlightNum 178) to have departure delay for 45 minutes (43 mins for LateAircraftDelay) also since its scheduled departure time is at 2005 (8:05pm), but it only took off at 2050 (8:50pm). It also arrived 43 minutes later than expected time of 2100 at 2143.
We will further check on the full data (same date, same TailNum) to see the flights that were not included in the sample that we extracted, so as to make a more conclusive statement.
# Matching with full data to confirm assumption
df_q4_1 <- mergeddf[mergeddf$TailNum %in% c('N308SW') & mergeddf$date == ('2006-03-31'), ]
df_q4_1[order(DepTime)][-c(1:8, 10,12,14),] %>% select(10,5:8,29,15:18,30,9,11) #Select important columns
Considering Flight 755: there was a 52 minutes ArrDelay in the Destination Airport (LIT). Due to the delay, Flight 458, scheduled to depart at 1700 (5pm), was delayed for 60 mins till 1800 (6pm), flying from LIT to DAL.
Supposed to reach 1805 (6:05pm) but reached 1900 (7pm). With the 55-min delay, the next flight from DAL to SAT (FlightNum 458) was subsequently delayed for 52 minutes (50-min for LateAircraftDelay) since its scheduled departure time was at 1835 (6:35pm), but it only took off at 1927 (7:27pm). It then landed 50 minutes later, at 2143, than its expected time of 2100.
Since the DepTime overlapped with its previous flight’s CRSArrTime, the delay persisted and escalated till near midnight of the day, resulting in cascading failure.
With this, we can observe the cascading failure for Carrier WN of same TailNum, which persisted throughout the night at different airports (LIT, DAL, SAT, TUL etc.)
Since the CRSDepTime overlapped with its previous flight’s ArrTime, the delay persisted and escalated till near midnight of the day, resulting in cascading failures.
The secondary test used N478HA to identify the highest date counts.
# First extraction of data with top most TailNum counts
df_q4_2 <- merged_df[merged_df$TailNum %in% c('N478HA') & merged_df$LateAircraftDelay > 15, ]
df_q4_2 %>%
group_by(df_q4_2$date) %>%
dplyr::summarize(Count=n()) %>%
arrange(desc(Count))
We will focus on the date with the highest count to see if there is any relation that leads to cascading failure.
# Extracting dataset that matches our findings
df_q4_2 <- merged_df[merged_df$TailNum %in% c('N478HA') & merged_df$date == ('2006-03-31'), ]
df_q4_2[order(DepTime)]
Taking Flight 179 as an example: has to depart at 1328 (1:28pm) but delayed for 82 mins till 1450 (2:50pm), flying from KOA to OGG. Supposed to reach 1357 (1:57pm) but delayed 87 mins and reached 1524 (3:24pm).
We will further check on the full data (same date, same TailNum) to see the flights that were not included in the sample that we extracted, so as to make a more conclusive statement.
# Matching with full data to confirm assumption
df_q4_2 <- mergeddf[mergeddf$TailNum %in% c('N478HA') & mergeddf$date == ('2006-03-31'), ]
df_q4_2[order(DepTime)][-c(1:4),] %>% select(10,5:8,29,15:18,30,9,11) #Select important columns
NA
This 87 mins arrival delay of FlightNum 179 thereafter caused the next flight from OGG to HNL (FlightNum 179) to have departure delay for 79 minutes (89 mins for LateAircraftDelay) also since its scheduled departure time is at 1425 (2:25pm), but it only took off at 1544 (3:44pm). It also arrived 89 minutes later than expected time of 1459 at 1628. We can also see from the day that one flight delay in an airport will cause cascading failures, contributing to ArrDelay and LateAircraftDelay for flights, hence confirming our observation.
It is also worth noting the instances where subsequent flights might not use the same TailNum, but they are assumed due to a lack of schedule information. A total of two tests were conducted in R, to demonstrate the same effect of cascading failures where delays at one airport causes delays in another. The previous flight’s late arrival, which used the same plane that would be departing, caused the subsequent flight’s delay. As a result, the current flight will depart late, setting off a chain reaction causing passengers at other airports to board the plane much later as well.
# To save space
rm(df_q4)
rm(df_q4_1)
rm(df_q4_2)
This works by allowing the model to predict the label of new data points based on past data.
To predict delays, these few supervised learning models with selected variables are used:
# Select variable columns
df_q5 <- df_q5 %>% select(Month, UniqueCarrier, DepTime, CRSDepTime, ArrTime, CRSArrTime, ArrDelay, DepDelay,Distance,TaxiIn,TaxiOut, ActualElapsedTime,AirTime, DelayStatus)
# Select variable columns
df_q5 <- df_q5 %>% select(Month, UniqueCarrier, DepTime, CRSDepTime, ArrTime, CRSArrTime, ArrDelay, DepDelay,Distance,TaxiIn,TaxiOut, ActualElapsedTime,AirTime, DelayStatus)
Time-related and Factor columns are selected as predictor varaibles for analysis.
# Change selected var as factors
df_q5$UniqueCarrier<- as.factor(df_q5$UniqueCarrier)
df_q5$Month <- as.factor(df_q5$Month)
# Hold-out Validation method
library(caret)
set.seed(42)
F_sample = createDataPartition(y=df_q5$ArrDelay, p = 0.7, list = F)
train = df_q5[F_sample,]
test = df_q5[-F_sample,]
First, ‘CreateDataPartition’ feature will create a train-test split to prevent overfitting, with Trainset accounting for 70%, and Testset for 30%. The model will be trained using the Trainset data, and its performance will be evaluated by predicting with the unseen Testset. For reproduction, random state is set to a random seed 42.
The response variable Y, for the Regression model will comprise of ArrDelay (minutes).
MLR is a regression model which enables us to understand and estimate relationships between multiple variables.
# Load model
mlr_model <- lm(ArrDelay ~. -DelayStatus, data = train)
summary(mlr_model)
Call:
lm(formula = ArrDelay ~ . - DelayStatus, data = train)
Residuals:
Min 1Q Median 3Q Max
-544.76 -4.62 0.28 4.76 964.24
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.291e+01 9.164e-02 -250.057 < 2e-16 ***
Month2 5.663e-01 5.064e-02 11.183 < 2e-16 ***
Month3 1.807e-01 4.890e-02 3.694 0.00022 ***
Month4 6.833e-01 4.921e-02 13.887 < 2e-16 ***
Month5 8.454e-01 4.885e-02 17.308 < 2e-16 ***
Month6 1.209e+00 4.908e-02 24.634 < 2e-16 ***
Month7 8.584e-01 4.870e-02 17.626 < 2e-16 ***
Month8 7.547e-01 4.840e-02 15.594 < 2e-16 ***
Month9 5.705e-01 4.946e-02 11.535 < 2e-16 ***
Month10 8.010e-01 4.882e-02 16.406 < 2e-16 ***
Month11 -5.906e-01 4.928e-02 -11.985 < 2e-16 ***
Month12 1.041e-02 4.924e-02 0.212 0.83248
UniqueCarrierAA 3.581e+00 8.346e-02 42.911 < 2e-16 ***
UniqueCarrierAQ 1.144e+01 1.526e-01 74.939 < 2e-16 ***
UniqueCarrierAS 4.942e+00 1.014e-01 48.753 < 2e-16 ***
UniqueCarrierB6 -1.331e+00 9.994e-02 -13.316 < 2e-16 ***
UniqueCarrierCO -1.652e+00 9.025e-02 -18.308 < 2e-16 ***
UniqueCarrierDL 8.364e-01 8.509e-02 9.830 < 2e-16 ***
UniqueCarrierEV 5.848e-01 9.131e-02 6.405 1.51e-10 ***
UniqueCarrierF9 6.085e+00 1.159e-01 52.504 < 2e-16 ***
UniqueCarrierFL 3.379e+00 9.269e-02 36.459 < 2e-16 ***
UniqueCarrierHA 1.273e+01 1.379e-01 92.327 < 2e-16 ***
UniqueCarrierMQ 2.741e+00 8.388e-02 32.683 < 2e-16 ***
UniqueCarrierNW 3.897e+00 8.614e-02 45.239 < 2e-16 ***
UniqueCarrierOH 1.801e-02 9.246e-02 0.195 0.84558
UniqueCarrierOO 4.397e+00 8.338e-02 52.737 < 2e-16 ***
UniqueCarrierTZ 7.791e+00 2.775e-01 28.073 < 2e-16 ***
UniqueCarrierUA 2.812e+00 8.549e-02 32.895 < 2e-16 ***
UniqueCarrierUS 2.171e+00 8.489e-02 25.568 < 2e-16 ***
UniqueCarrierWN 4.935e+00 7.991e-02 61.761 < 2e-16 ***
UniqueCarrierXE 8.736e-02 8.571e-02 1.019 0.30808
UniqueCarrierYV 1.168e+00 9.015e-02 12.951 < 2e-16 ***
DepTime -4.595e-04 8.438e-05 -5.445 5.17e-08 ***
CRSDepTime 8.421e-04 8.466e-05 9.947 < 2e-16 ***
ArrTime 8.532e-04 4.092e-05 20.851 < 2e-16 ***
CRSArrTime -1.974e-03 4.941e-05 -39.958 < 2e-16 ***
DepDelay 9.816e-01 3.036e-04 3233.017 < 2e-16 ***
Distance -4.051e-02 7.202e-05 -562.542 < 2e-16 ***
TaxiIn 7.041e-02 1.078e-03 65.327 < 2e-16 ***
TaxiOut 5.560e-01 1.466e-03 379.253 < 2e-16 ***
ActualElapsedTime 2.747e-01 1.116e-03 246.238 < 2e-16 ***
AirTime 4.863e-02 9.890e-04 49.168 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.984 on 1001672 degrees of freedom
Multiple R-squared: 0.931, Adjusted R-squared: 0.931
F-statistic: 3.295e+05 on 41 and 1001672 DF, p-value: < 2.2e-16
The model summary is then used to identify the significant variables affecting ArrDelay, excluding DelayStatus since they are related.
The R2 score is 0.931, where predictor variables in the model explained 93.1% of the variation in Y (ArrDelay). RMSE is also considerably fitting at 9.98.
# Predicting delays
delayPred <- predict(mlr_model, test)
# Create ActualnPred dataframe
ActualnPred <- data.frame(cbind(Actual=test$ArrDelay, Predicted=round(delayPred)))
ActualnPred
# 96.4% correlation accuracy of similar directional movement
correlation_accuracy <- cor(ActualnPred)
correlation_accuracy
Actual Predicted
Actual 1.0000000 0.9641986
Predicted 0.9641986 1.0000000
The Correlation Accuracy of 96.4% shows how the actual and predicted values have similar directional movements.
sqrt(mean((test$ArrDelay - delayPred)^2))
[1] 9.996467
The R2 score is 0.9369 and Root Mean Squared Error (RMSE) is 9.676569. The R2 value shows that the predictor variables in the model are able to explain 93.69% of the variation in ArrDelay.
R uses classification model where the output is chosen by a majority vote among decision trees.
set.seed(42)
require(caTools)
sample = sample.split(df_q5$DelayStatus, SplitRatio = .70)
train = subset(df_q5, select = -c(ArrDelay), sample == TRUE)
test = subset(df_q5, select = -c(ArrDelay), sample == FALSE)
train$DelayStatus <- as.character(train$DelayStatus)
train$DelayStatus <- as.factor(train$DelayStatus)
Hence, a new train-test set with response variable Y as DelayStatus of similar parameters (30% Test) is created, with the ArrDelay column removed to prevent multicollinearity.
rf
Call:
randomForest(formula = DelayStatus ~ ., data = train, ntree = 10, random_state = 42)
Type of random forest: classification
Number of trees: 10
No. of variables tried at each split: 3
OOB estimate of error rate: 6.9%
Confusion matrix:
0 1 class.error
0 734411 29334 0.03840811
1 39044 188485 0.17160010
rf
Call:
randomForest(formula = DelayStatus ~ ., data = train, ntree = 10, random_state = 42)
Type of random forest: classification
Number of trees: 10
No. of variables tried at each split: 3
OOB estimate of error rate: 6.9%
Confusion matrix:
0 1 class.error
0 734411 29334 0.03840811
1 39044 188485 0.17160010
The Out-of-bag error estimate is at 6.9%, where Accuracy = 1 -OOB error, so the model has an accuracy of 93.1%.
delayPred <- predict(rf, test)
delayPred <- predict(rf, test)
test$DelayStatus <- as.factor(test$DelayStatus)
# Create ActualnPred dataframe
ActualnPred <- data.frame(cbind(Actual=test$DelayStatus, Predicted=delayPred))
ActualnPred
Since DepDelay and ArrDelay have a positive linear relationship which might lead to multicollinearity, we will investigate to see if there are changes to the factors that might affect DelayStatus.
rf <- randomForest(DelayStatus ~ .- DepDelay, data=train, ntree=10, random_state =42)
rf
Call:
randomForest(formula = DelayStatus ~ . - DepDelay, data = train, ntree = 10, random_state = 42)
Type of random forest: classification
Number of trees: 10
No. of variables tried at each split: 3
OOB estimate of error rate: 6.18%
Confusion matrix:
0 1 class.error
0 740506 23440 0.0306828
1 37786 189765 0.1660551
delayPred <- predict(rf, test)
Models with and without DepDelay show the changes in attributing factors causing DelayStatus, where the top 6 factors for both models are similar.
delayPred <- predict(rf, test)
test$DelayStatus <- as.factor(test$DelayStatus)
# Create ActualnPred dataframe
ActualnPred <- data.frame(cbind(Actual=test$DelayStatus, Predicted=delayPred))
ActualnPred
In total, one regression and one classification method were tested out in R. Overall, the best model to predict the continuous ArrDelay would be MLR with 95.7% accuracy. To predict the binary outcome DelayStatus, Random Forest model is around 93.1% accurate.