Introduction

Every year, nearly 25% of airline flights are delayed or cancelled, costing travellers over $30 billion in lost time and money Flight delays have long been a cause of dissatisfaction in the airline industry, as well as a source of annoyance for passengers and carriers.

Our goal is to use the massive amount of airline data to visualise and study the flight patterns and predict if a flight will be delayed. For this study, both Python and R will be used to investigate 2 years’ worth of data, since two full business cycles are adequate in reducing bias for one cycle.

This notebook aims to look at these questions regarding flight travel:

  1. When is the best time of day, day of the week, and time of year to fly to minimise delays?
  2. Do older planes suffer more delays?
  3. How does the number of people flying between different locations change over time?
  4. Can you detect cascading failures as delays in one airport create delays in others?
  5. Use the available variables to construct a model that predicts delays.

Data: Airlines - Harvard Dataverse

Import data and libraries

library(ggplot2)
library(dplyr)
library(zoo)
library(tidyr)
library("data.table")
library(plyr)
options(warn=-1)

setwd("D:/et4_e/coursework/2021")

# Import & prepare the 2006 and 2007 dataset
df_2006 = fread("2006.csv.bz2")
|--------------------------------------------------|
|==================================================|
df_2007 = fread("2007.csv.bz2")
|--------------------------------------------------|
|==================================================|
mergeddf = rbind(df_2006, df_2007)
airport_df <- read.csv2('airports.csv',sep = ",",header = TRUE)
carrier_df <- read.csv2('carriers.csv',sep = ",",header = TRUE)
planes_df <- read.csv2('D:/et4_e/coursework/2021/plane-data.csv',sep = ",",header = TRUE)


Understanding Data

# Create Date column
mergeddf$date <- as.Date(with(mergeddf, paste(Year, Month, DayofMonth, sep="-")), "%Y-%m-%d")
# Create copy of merged data
merged_df = mergeddf
head(merged_df)
dim(merged_df)
[1] 14595137       30


There are almost 14.6 million records with 29 variable columns. The columns include the airline and flight details etc. and are mostly time-related (in mins).

summary(merged_df)
      Year          Month          DayofMonth      DayOfWeek    
 Min.   :2006   Min.   : 1.000   Min.   : 1.00   Min.   :1.000  
 1st Qu.:2006   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2.000  
 Median :2007   Median : 7.000   Median :16.00   Median :4.000  
 Mean   :2007   Mean   : 6.538   Mean   :15.73   Mean   :3.942  
 3rd Qu.:2007   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:6.000  
 Max.   :2007   Max.   :12.000   Max.   :31.00   Max.   :7.000  
                                                                
    DepTime         CRSDepTime      ArrTime         CRSArrTime  
 Min.   :   1     Min.   :   0   Min.   :   1     Min.   :   0  
 1st Qu.: 930     1st Qu.: 930   1st Qu.:1108     1st Qu.:1115  
 Median :1329     Median :1324   Median :1515     Median :1520  
 Mean   :1340     Mean   :1331   Mean   :1484     Mean   :1496  
 3rd Qu.:1732     3rd Qu.:1720   3rd Qu.:1911     3rd Qu.:1906  
 Max.   :2930     Max.   :2359   Max.   :2955     Max.   :2400  
 NA's   :282682                  NA's   :316047                 
 UniqueCarrier        FlightNum      TailNum         
 Length:14595137    Min.   :   1   Length:14595137   
 Class :character   1st Qu.: 587   Class :character  
 Mode  :character   Median :1501   Mode  :character  
                    Mean   :2187                     
                    3rd Qu.:3499                     
                    Max.   :9619                     
                                                     
 ActualElapsedTime CRSElapsedTime       AirTime       
 Min.   :   5.0    Min.   :-1240.0   Min.   :-1425.0  
 1st Qu.:  75.0    1st Qu.:   77.0   1st Qu.:   54.0  
 Median : 108.0    Median :  109.0   Median :   84.0  
 Mean   : 126.2    Mean   :  127.2   Mean   :  102.8  
 3rd Qu.: 156.0    3rd Qu.:  157.0   3rd Qu.:  131.0  
 Max.   :1879.0    Max.   : 1430.0   Max.   : 1958.0  
 NA's   :316047    NA's   :998       NA's   :316047   
    ArrDelay          DepDelay           Origin         
 Min.   :-592.00   Min.   :-1200.00   Length:14595137   
 1st Qu.:  -9.00   1st Qu.:   -4.00   Class :character  
 Median :  -1.00   Median :    0.00   Mode  :character  
 Mean   :   9.45   Mean   :   10.76                     
 3rd Qu.:  13.00   3rd Qu.:   10.00                     
 Max.   :2598.00   Max.   : 2601.00                     
 NA's   :316047    NA's   :282682                       
     Dest              Distance          TaxiIn        
 Length:14595137    Min.   :  11.0   Min.   :   0.000  
 Class :character   1st Qu.: 317.0   1st Qu.:   4.000  
 Mode  :character   Median : 569.0   Median :   5.000  
                    Mean   : 723.8   Mean   :   6.873  
                    3rd Qu.: 951.0   3rd Qu.:   8.000  
                    Max.   :4962.0   Max.   :1501.000  
                                                       
    TaxiOut         Cancelled       CancellationCode  
 Min.   :  0.00   Min.   :0.00000   Length:14595137   
 1st Qu.: 10.00   1st Qu.:0.00000   Class :character  
 Median : 13.00   Median :0.00000   Mode  :character  
 Mean   : 16.03   Mean   :0.01937                     
 3rd Qu.: 19.00   3rd Qu.:0.00000                     
 Max.   :602.00   Max.   :1.00000                     
                                                      
    Diverted         CarrierDelay       WeatherDelay      
 Min.   :0.000000   Min.   :   0.000   Min.   :   0.0000  
 1st Qu.:0.000000   1st Qu.:   0.000   1st Qu.:   0.0000  
 Median :0.000000   Median :   0.000   Median :   0.0000  
 Mean   :0.002286   Mean   :   3.636   Mean   :   0.7258  
 3rd Qu.:0.000000   3rd Qu.:   0.000   3rd Qu.:   0.0000  
 Max.   :1.000000   Max.   :2580.000   Max.   :1429.0000  
                                                          
    NASDelay        SecurityDelay      LateAircraftDelay 
 Min.   :   0.000   Min.   :  0.0000   Min.   :   0.000  
 1st Qu.:   0.000   1st Qu.:  0.0000   1st Qu.:   0.000  
 Median :   0.000   Median :  0.0000   Median :   0.000  
 Mean   :   3.687   Mean   :  0.0273   Mean   :   4.813  
 3rd Qu.:   0.000   3rd Qu.:  0.0000   3rd Qu.:   0.000  
 Max.   :1392.000   Max.   :382.0000   Max.   :1366.000  
                                                         
      date           
 Min.   :2006-01-01  
 1st Qu.:2006-07-08  
 Median :2007-01-08  
 Mean   :2007-01-04  
 3rd Qu.:2007-07-06  
 Max.   :2007-12-31  
                     
str(merged_df)
Classes ‘data.table’ and 'data.frame':  14595137 obs. of  30 variables:
 $ Year             : int  2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
 $ Month            : int  1 1 1 1 1 1 1 1 1 1 ...
 $ DayofMonth       : int  11 11 11 11 11 11 11 11 11 11 ...
 $ DayOfWeek        : int  3 3 3 3 3 3 3 3 3 3 ...
 $ DepTime          : int  743 1053 1915 1753 824 627 825 942 1239 1642 ...
 $ CRSDepTime       : int  745 1053 1915 1755 832 630 820 945 1245 1645 ...
 $ ArrTime          : int  1024 1313 2110 1925 1015 834 1041 1155 1438 1841 ...
 $ CRSArrTime       : int  1018 1318 2133 1933 1015 832 1021 1148 1445 1845 ...
 $ UniqueCarrier    : chr  "US" "US" "US" "US" ...
 $ FlightNum        : int  343 613 617 300 765 295 349 356 775 1002 ...
 $ TailNum          : chr  "N657AW" "N834AW" "N605AW" "N312AW" ...
 $ ActualElapsedTime: int  281 260 235 152 171 127 136 133 119 119 ...
 $ CRSElapsedTime   : int  273 265 258 158 163 122 121 123 120 120 ...
 $ AirTime          : int  223 214 220 126 132 108 111 121 103 105 ...
 $ ArrDelay         : int  6 -5 -23 -8 0 2 20 7 -7 -4 ...
 $ DepDelay         : int  -2 0 0 -2 -8 -3 5 -3 -6 -3 ...
 $ Origin           : chr  "ATL" "ATL" "ATL" "AUS" ...
 $ Dest             : chr  "PHX" "PHX" "PHX" "PHX" ...
 $ Distance         : int  1587 1587 1587 872 872 644 644 644 644 644 ...
 $ TaxiIn           : int  45 27 4 16 27 6 4 4 4 4 ...
 $ TaxiOut          : int  13 19 11 10 12 13 21 8 12 10 ...
 $ Cancelled        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ CancellationCode : chr  "" "" "" "" ...
 $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ CarrierDelay     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ WeatherDelay     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ NASDelay         : int  0 0 0 0 0 0 20 0 0 0 ...
 $ SecurityDelay    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ LateAircraftDelay: int  0 0 0 0 0 0 0 0 0 0 ...
 $ date             : Date, format: "2006-01-11" ...
 - attr(*, ".internal.selfref")=<externalptr> 

Data Pre-Processing/Cleaning

Creating sample of entire data

Due to the large dataset requiring more time to execute, we will randomly select 10% of the data for quick analysis.

We then filter the data into Cancelled and Non-cancelled flights.

# Remove original data & take sample (10%) of merged data (save space & load faster)
set.seed(42)
merged_df = sample_frac(merged_df, 0.10, replace = FALSE)
# Rename CancellationCode Column
merged_df$CancellationCode <- mapvalues(merged_df$CancellationCode,
                           from = c("A", "B", "C", "D"),
                           to = c("Carrier", "Weather", "National Air System (NAS)", "Security"))

merged_df$Cancelled <- mapvalues(merged_df$Cancelled,
                           from = c(1, 0),
                           to = c("Cancelled", "Not Cancelled"))
merged_df$Diverted <- mapvalues(merged_df$Diverted,
                           from = c(1, 0),
                           to = c("Diverted", "Not Diverted"))
# Check for missing values
sum(is.na(merged_df))
[1] 184149
# Missing values per column
sapply(merged_df,function(x)sum(is.na(x)))
             Year             Month        DayofMonth 
                0                 0                 0 
        DayOfWeek           DepTime        CRSDepTime 
                0             28496                 0 
          ArrTime        CRSArrTime     UniqueCarrier 
            31764                 0                 0 
        FlightNum           TailNum ActualElapsedTime 
                0                 0             31764 
   CRSElapsedTime           AirTime          ArrDelay 
              101             31764             31764 
         DepDelay            Origin              Dest 
            28496                 0                 0 
         Distance            TaxiIn           TaxiOut 
                0                 0                 0 
        Cancelled  CancellationCode          Diverted 
                0                 0                 0 
     CarrierDelay      WeatherDelay          NASDelay 
                0                 0                 0 
    SecurityDelay LateAircraftDelay              date 
                0                 0                 0 
# Check for duplicates
sum(duplicated(merged_df))
[1] 0
distinct(merged_df)


Missing data is also handled using linear interpolation to estimate unknown data values between known data values, and duplicates are removed.

# Imputate Null values with interpolation

merged_df <- merged_df %>%
        mutate(DepTime = na.approx(DepTime))
merged_df <- merged_df %>%
        mutate(ArrTime = na.approx(ArrTime))
merged_df <- merged_df %>%
        mutate(ActualElapsedTime = na.approx(ActualElapsedTime))
merged_df <- merged_df %>%
        mutate(CRSElapsedTime = na.approx(CRSElapsedTime))
merged_df <- merged_df %>%
        mutate(AirTime = na.approx(AirTime))
merged_df <- merged_df %>%
        mutate(ArrDelay = na.approx(ArrDelay))
merged_df <- merged_df %>%
        mutate(DepDelay = na.approx(DepDelay))
# Change selected data types (Numeric to Categorical)
merged_df$Year <- as.factor(merged_df$Year)
merged_df$Month <- as.factor(merged_df$Month)
merged_df$DayofMonth <- as.factor(merged_df$DayofMonth)
merged_df$DayOfWeek <- as.factor(merged_df$DayOfWeek)
merged_df$FlightNum <- as.factor(merged_df$FlightNum)
merged_df$Cancelled <- as.factor(merged_df$Cancelled)
merged_df$Diverted <- as.factor(merged_df$Diverted)
# Create labelled column for easier visualisation
merged_df$Month_label <- month.abb[merged_df$Month]
# Check for missing values per column
sapply(merged_df,function(x)sum(is.na(x)))
             Year             Month        DayofMonth 
                0                 0                 0 
        DayOfWeek           DepTime        CRSDepTime 
                0                 0                 0 
          ArrTime        CRSArrTime     UniqueCarrier 
                0                 0                 0 
        FlightNum           TailNum ActualElapsedTime 
                0                 0                 0 
   CRSElapsedTime           AirTime          ArrDelay 
                0                 0                 0 
         DepDelay            Origin              Dest 
                0                 0                 0 
         Distance            TaxiIn           TaxiOut 
                0                 0                 0 
        Cancelled  CancellationCode          Diverted 
                0                 0                 0 
     CarrierDelay      WeatherDelay          NASDelay 
                0                 0                 0 
    SecurityDelay LateAircraftDelay              date 
                0                 0                 0 
      Month_label 
                0 


The null values have all been removed from the dataframe.

flight_cancelled <- merged_df %>% filter_at(vars(Cancelled), any_vars(. %in% c('Cancelled')))
flight_notcancelled <- merged_df %>% filter_at(vars(Cancelled), any_vars(. %in% c('Not Cancelled')))


Creating Delay Status

We assume that a delayed flight is equivalent to arriving late for more than 15 minutes at its destination. (ArrDelay > 15 mins)

Since flights can be delayed on its Departure but still arrive on time, hence we do not classify those as a delayed flight.

Hence we create a DelayStatus column into the main dataframe (merged_df),where 0 = No Delay, 1 = Delay.

# Creating new column showing ArrDelay > 15mins
# 0 = No Delay, 1= Delay
flight_notcancelled$DelayStatus <- ifelse(flight_notcancelled$ArrDelay > 15, 1, 0)
table(flight_notcancelled$DelayStatus)

      0       1 
1102581  328437 
prop.table(table(flight_notcancelled$DelayStatus))

        0         1 
0.7704872 0.2295128 

This shows that 77% have no delays (ArrDelay > 15 minutes), where they either arrived early or on time. Also, 23% of flights were delayed. Equivalent to about 1 out of every 5 flights being delayed.


Exploratory Data Analysis (EDA)

We will be looking at the different variables to get a better understanding of the data.

  • Total Flight Distribution
  • Cancellation
  • Delay

Total Flight Distribution

Total Flight Distribution of Full Data by Month

prop.table(table(merged_df$Month))

         1          2          3          4          5          6 
0.08206499 0.07508801 0.08511806 0.08237742 0.08450758 0.08426093 
         7          8          9         10         11         12 
0.08654045 0.08835270 0.08098038 0.08534827 0.08177654 0.08358467 
prop.table(table(merged_df$DayOfWeek))

        1         2         3         4         5         6 
0.1483665 0.1443227 0.1459657 0.1475135 0.1475149 0.1257288 
        7 
0.1405879 
# Percentage Distribution of Month
ggplot(merged_df, aes(x = Month)) +  
  geom_bar(aes(y = (..count..)/sum(..count..)), fill = 'cornflowerblue') + ggtitle("Month (%)") +
  ylab("Percentage (%)")

# Percentage Distribution of DayOfWeek
ggplot(merged_df, aes(x = DayOfWeek)) +  
  geom_bar(aes(y = (..count..)/sum(..count..)),fill = 'cornflowerblue') + ggtitle("DayOfWeek (%)") +
  ylab("Percentage (%)")


The data is almost evenly distributed between Month and DayOfWeek, with February and Saturday having the least number of total flights.

#Total Number of flights per Airline
ggplot(merged_df, aes(x = forcats::fct_infreq(UniqueCarrier))) +  
  geom_bar(aes(y = (..count..)),fill = 'cornflowerblue') + ggtitle("Total Number of Flights per Airline") +
  xlab("UniqueCarrier")

carrier_df %>% filter_all(any_vars(. %in% c('WN', 'AA','OO','MQ','US')))

The top 5 airlines with the most flights are WN, AA, OO, MQ, UA.

  1. Southwest Airlines
  2. American Airlines
  3. Skywest Airlines
  4. American Eagle Airlines
  5. United Airlines
#Boxplot Distribution of Total ArrDelay per Carrier
boxplot(flight_notcancelled$ArrDelay~flight_notcancelled$UniqueCarrier,
        main = "Distribution of Total ArrDelay per Carrier",
        xlab = "Carrier/Airline",
        ylab = "Total ArrDelay (mins)",
        border = "black"
        )

Cancellation

# Percentage Distribution of Cancellation
ggplot(flight_cancelled, aes(x = CancellationCode)) +  
  geom_bar(aes(y = (..count..)/sum(..count..)), fill = 'cornflowerblue') + ggtitle("Cancellation Reasons (%)") +
  ylab("Percentage (%)")

We can conclude that Cancellations are mostly due to Carrier, Weather and NAS with around 43%, 35% and 20% respectively.

#Total Number of Cancelled flights per Airline
ggplot(flight_cancelled, aes(x = forcats::fct_infreq(UniqueCarrier))) +  
  geom_bar(aes(y = (..count..)),fill = 'cornflowerblue') + ggtitle("Total Number of Cancelled Flights per Airline") +
  xlab("UniqueCarrier")

carrier_df %>% filter_all(any_vars(. %in% c('MQ', 'AA','OO')))

The top 3 most cancelled flights throughout these 2 years are:

  1. American Eagle Airlines
  2. American Airlines
  3. Skywest Airlines
#Total Number of Cancelled flights per Month
ggplot(flight_cancelled, aes(x = forcats::fct_infreq(Month))) +  
  geom_bar(aes(y = (..count..)),fill = 'cornflowerblue') + ggtitle("Total Number of Cancelled Flights per Month") +
  xlab("UniqueCarrier")

#Total Number of Cancelled flights per DayOfWeek
ggplot(flight_cancelled, aes(x = forcats::fct_infreq(DayOfWeek))) +  
  geom_bar(aes(y = (..count..)),fill = 'cornflowerblue') + ggtitle("Total Number of Cancelled Flights per DayOfWeek") +
  xlab("UniqueCarrier")

The flights are mostly likely to be cancelled in December and February, on a Thursday and Friday.

Delay

cor.test(flight_notcancelled$ArrDelay, flight_notcancelled$DepDelay, method = "pearson")

    Pearson's product-moment correlation

data:  flight_notcancelled$ArrDelay and flight_notcancelled$DepDelay
t = 2841.3, df = 1431016, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9213976 0.9218910
sample estimates:
      cor 
0.9216447 

ArrDelay and DepDelay have a strong positive linear relationship, implying that a Departure Delay will almost certainly result in an Arrival Delay.

# BarPlot
ggplot(flight_notcancelled, aes(x= UniqueCarrier)) + geom_bar(aes(fill=as.factor(DelayStatus))) + ggtitle("Barplot of DelayStatus counts per UniqueCarrier")

carrier_df %>% filter_all(any_vars(. %in% c('WN', 'AA','OO')))

The top 3 most delayed flights throughout these 2 years are:

  1. Southwest Airlines
  2. American Airlines
  3. Skywest Airlines

Q1. When is the best time of day, day of the week, and time of year to fly to minimise delays?

We will breakdown this question into three parts, where we will find the airline carrier and time period least likely to have delayed flights:

  • Best Time of the Day
  • Best Day of the Week
  • Best Month of the Year
  • Best Day of the Month

Best Time of the Day

Distribution of Average Delay by Time Period

# Create copy of dataframe
df_q1 <- flight_notcancelled


Time Intervals column ‘ArrPeriod’ was created based on ‘ArrTime’. 24 hours in a day will split into 6 different periods with at least 3 to 5-hour intervals since different timings like 5am and 11am are better not generalised together into a single timeframe. The period is split as such:

  • Midnight (12am - 5am)
  • Early Morning (5am - 9am)
  • Late Morning (9am - 12pm)
  • Afternoon (12pm - 5pm)
  • Evening (5pm - 9pm)
  • Night (9pm - 12am)
# Categorising ArrTime and DepTime by 6 periods: Midnight, Early Morning, Late Morning, Afternoon, Evening, Night
df_q1 <- df_q1 %>%
  mutate(ArrPeriod = case_when(ArrTime >= 500 & ArrTime < 900 ~ 'Early Morning', 
                          ArrTime >= 900 & ArrTime < 1200 ~ 'Late Morning',
                          ArrTime >= 1200 & ArrTime < 1700 ~ 'Afternoon',
                          ArrTime >= 1700 & ArrTime < 2100 ~ 'Evening',
                          ArrTime >= 2100 & ArrTime < 2400 ~ 'Night',
                          TRUE ~  'Midnight'))
# Create table counts
counts <- table(df_q1$DelayStatus,df_q1$ArrPeriod)

# Plot grouped barplot
barplot(counts, col = c("white","cornflowerblue"),
                       xlab = "Time Period", ylab = "Total Delay Count",
                       main = "Total Delay per Period",beside=TRUE, 
                       legend = rownames(counts))

Early Morning followed by Midnight has the lowest count for number of flights delayed. However, we will proceed on to check if having the lowest number of count will equate to the shortest average delayed time.

# Boxplot
ggplot(df_q1, aes(x= ArrDelay)) + geom_boxplot(aes(color=as.factor(ArrPeriod))) + ggtitle("Boxplot of ArrDelay (mins) vs ArrPeriod")

The boxplot shows that the mean of ArrDelay for midnight is more than the other time period. Hence, validating our assumption that lowest flight delay counts does not equate to lowest average delayed time.

#Set Sequence to ArrPeriod
df_q1$ArrPeriod <- factor(df_q1$ArrPeriod,levels = c("Midnight", "Early Morning", "Late Morning", "Afternoon", "Evening", "Night"))

df_q1  %>% group_by(ArrPeriod) %>% dplyr::summarize(mean = mean(ArrDelay)) %>%
  ggplot(aes(x= ArrPeriod, y = mean)) +
  geom_point() +
  geom_line(aes(group = 1), col = "cornflowerblue") + ggtitle("Average ArrDelay vs Time Period") +
  xlab("Time Period") + ylab("Average ArrDelay (mins)")

The best time period to minimise flight delays would be in the Early Morning from 5am to 9am.

Best Day of the Week

Distribution of Day Delay

# Create table for counts
counts <- table(df_q1$DelayStatus,df_q1$DayOfWeek)
# Plot grouped barplot
barplot(counts, col = c("white","cornflowerblue"),
                       xlab = "DayOfWeek",ylab = "Total Delay Count",
                       main = "Total Delay per DayOfWeek",beside=TRUE, 
                       legend = rownames(counts))

#Line Plot
df_q1  %>% group_by(DayOfWeek) %>% dplyr::summarize(mean = mean(ArrDelay)) %>%
  ggplot(aes(x= DayOfWeek, y = mean)) +
  geom_point() +
  geom_line(aes(group = 1), col = "cornflowerblue") + ggtitle("Average ArrDelay vs DayOfWeek") +
  xlab("DayOfWeek") + ylab("Average ArrDelay (mins)")

The Best Day of the Week to minimize delays is to travel on Saturday, followed by Tuesday, with average ArrDelay timing of less than 6 minutes and 8 minutes, respectively.

The longest average delays of 12-13 minutes are expected in the middle of the week, from Thursday to Friday.

Best Month of the Year

Distribution of Monthly Delay

# create table counts
counts <- table(df_q1$DelayStatus,df_q1$Month)
# Plot grouped barplot
barplot(counts, col = c("white","cornflowerblue"),
                       xlab = "Month",ylab = "Total Delay Count",
                       main = "Total Delay per Month",beside=TRUE, 
                       legend = rownames(counts))

df_q1 %>% group_by(Month) %>% dplyr::summarize(mean = mean(ArrDelay)) %>%
  ggplot(aes(x= Month, y = mean)) +
  geom_point() +
  geom_line(aes(group = 1), col = "cornflowerblue") + ggtitle("Average ArrDelay vs Month") +
  xlab("Month") + ylab("Average ArrDelay (mins)")

Based on ArrDelay, the best time of year to minimise travel delay is November, then September, with both averaging approximately 6 minutes of delay, as opposed to June and December, with more than twice the number of minutes delayed.

Due to the U.S. summer and winter vacation (School Holidays USA, 2022), June and December are projected to be popular months to travel.

Best Day of the Month

df_q1 %>% group_by(DayofMonth) %>% dplyr::summarize(mean = mean(ArrDelay)) %>%
  ggplot(aes(x= DayofMonth, y = mean)) +
  geom_point() +
  geom_line(aes(group = 1), col = "cornflowerblue") + ggtitle("Average ArrDelay vs DayofMonth") +
  xlab("DayofMonth") + ylab("Average ArrDelay (mins)")

Also, travelling in the first half of the month, around the 8th – 9th, results in an average ArrDelay of around 6 minutes, and the second half of the month with double the delay amount


Hence overall, the top 2 recommended time period to avoid flight delay is:

  • Early Morning (5am-9am) & Late Morning (9am-12pm) (≈0 minutes)
  • Saturday & Tuesday (≈6 minutes)
  • September & November (≈6 minutes)
  • Around 8th – 9th (Specifically on 8th) (≈6 minutes)

All the above periods are delayed by an average of 0 to 6 minutes. Passengers may reduce their flight delays even further by booking flights with airlines that are below the delay threshold (%).

# To save up space
rm(df_q1)


Q2. Do older planes suffer more delays?

The issue date of planes will be extracted from the planes_df data since we can judge the age of the aircraft and make our analysis based on that.

  • Distribution of Arr/Dep Delay by Issue Date
  • Distribution of Arr/Dep Delay by Issue Year
  • Comparison of Old and Normal Planes

To determine the age of the aircrafts, the engine’s issue date is obtained by mapping the plane-data.csv consisting of the plane’s ‘issue_date’, into a duplicated data frame of the ‘flight_notcancelled’ data as a new column.

# Create a new dataframe
df_q2 <- flight_notcancelled
# Check for missing value
sapply(df_q2,function(x)sum(is.na(x)))
             Year             Month        DayofMonth         DayOfWeek           DepTime        CRSDepTime 
                0                 0                 0                 0                 0                 0 
          ArrTime        CRSArrTime     UniqueCarrier         FlightNum           TailNum ActualElapsedTime 
                0                 0                 0                 0                 0                 0 
   CRSElapsedTime           AirTime          ArrDelay          DepDelay            Origin              Dest 
                0                 0                 0                 0                 0                 0 
         Distance            TaxiIn           TaxiOut         Cancelled  CancellationCode          Diverted 
                0                 0                 0                 0                 0                 0 
     CarrierDelay      WeatherDelay          NASDelay     SecurityDelay LateAircraftDelay              date 
                0                 0                 0                 0                 0                 0 
      Month_label       DelayStatus        issue_date 
                0                 0            166043 
# Check for missing value
sapply(df_q2,function(x)sum(is.na(x)))
             Year             Month        DayofMonth 
                0                 0                 0 
        DayOfWeek           DepTime        CRSDepTime 
                0                 0                 0 
          ArrTime        CRSArrTime     UniqueCarrier 
                0                 0                 0 
        FlightNum           TailNum ActualElapsedTime 
                0                 0                 0 
   CRSElapsedTime           AirTime          ArrDelay 
                0                 0                 0 
         DepDelay            Origin              Dest 
                0                 0                 0 
         Distance            TaxiIn           TaxiOut 
                0                 0                 0 
        Cancelled  CancellationCode          Diverted 
                0                 0                 0 
     CarrierDelay      WeatherDelay          NASDelay 
                0                 0                 0 
    SecurityDelay LateAircraftDelay              date 
                0                 0                 0 
      Month_label       DelayStatus        issue_date 
                0                 0            166043 


Since our dataset is huge with more than 2 million records, we will proceed to clean the 166k records of missing issue_date.

# Remove missing & None values
df_q2 <- na.omit(df_q2)
df_q2 <- df_q2 %>% 
  filter(!grepl('None', issue_date))

# Convert to datetime format
df_q2$issue_date <- as.Date(df_q2$issue_date, "%m/%d/%Y")

Distribution of Arr Delay with Issue Date

Total Sum of ArrDelay by Issue Date

# Total Sum of Delay by issue_date
ggplot(data = df_q2, aes(x=issue_date, y=ArrDelay))+
  geom_line() + ggtitle("Total ArrDelay with Issue Date") +
  xlab("Issue Date") + ylab("Total ArrDelay (mins)")

From the graph, we can see three peak points around year 1986, 2000 and 2004. Hence, we proceed to check with the mean value.

# Average of ArrDelay by issue_date
df_q2 %>% group_by(issue_date) %>% dplyr::summarize(mean = mean(ArrDelay)) %>%
ggplot(aes(x=issue_date, y=mean))+
  geom_line() + ggtitle("Average ArrDelay with Issue Date") +
  xlab("Issue Date") + ylab("Average ArrDelay (mins)")

Based on Issue Date, there seem to be more delay occurrences for planes issued after 1998. Also, the low total minutes delayed between 1976 and 1984 might be attributed to planes older than 24 years being removed owing to maintenance faults, or the airline scheduling them with ample time to avoid delays.

Distribution of Arr Delay with Issue Year

A new column of issue year and plane age is then extracted from the issue date.

df_q2$issue_year = format(as.Date(df_q2$issue_date, format="%m/%d/%Y"),"%Y")
df_q2 %>% group_by(issue_year) %>% dplyr::summarize(mean = mean(ArrDelay)) %>%
  ggplot(aes(x= issue_year, y = mean)) +
  geom_point() +
  geom_line(aes(group = 1), size=1, col = "cornflowerblue") + ggtitle("Average ArrDelay vs issue_year") +
  xlab("issue_year") + ylab("Average ArrDelay (mins)")

Older planes issued before 1984 have a higher average delay of more than 11 minutes, compared to those 1984 onwards with under 11 minutes of average delay.

Further investigations will then be made with Issue Date rather than its Year, as delays became increasingly prevalent after year 2000.

Comparing Old Planes and Normal Planes

The average age of U.S planes is 11 years, with about 25% of planes above 15 years old (Mayerowitz, 2011).

Hence, we will deem an aircraft to be old when in operation for 15+ years, where data will split into two, with planes issued before 1993 termed old, and those issued 1993 onwards termed normal..

Distribution of Old & Normal Planes

# Plot a subset of older planes
ggplot(data=subset(df_q2, issue_date < as.Date("1993-01-01")), 
       aes(x=issue_date, y=ArrDelay))+ geom_line() + ggtitle("Total ArrDelay with Issue Date (Old Planes (Before 1993))") +
  xlab("Issue Date") + ylab("Total ArrDelay (mins)")

# Plot a subset of normal planes
ggplot(data=subset(df_q2, issue_date > as.Date("1992-12-31")), 
       aes(x=issue_date, y=ArrDelay))+ geom_line() + ggtitle("Total ArrDelay with Issue Date (Normal Planes (1993 Onwards))") +
  xlab("Issue Date") + ylab("Total ArrDelay (mins)")

# Plot Average ArrDelay of older planes
ggplot(data=subset(df_q2, issue_date < as.Date("1993-01-01")), 
       aes(x=issue_date, y=ArrDelay))+  stat_summary(aes(y = ArrDelay,group=1), fun=mean, geom="line",group=1) + ggtitle("Average ArrDelay with Issue Date (Old Planes (Before 1993))") +
  xlab("Issue Date") + ylab("Average ArrDelay (mins)")

# Plot Average ArrDelay of Normal planes
ggplot(data=subset(df_q2, issue_date > as.Date("1992-12-31")), 
       aes(x=issue_date, y=ArrDelay))+  stat_summary(aes(y = ArrDelay,group=1), fun=mean, geom="line",group=1) + ggtitle("Average ArrDelay with Issue Date (Normal Planes (1993 Onwards))") +
  xlab("Issue Date") + ylab("Average ArrDelay (mins)")

# Plot Average ArrDelay by Distance of Old/Normal planes
ggplot() +  stat_summary(data=subset(df_q2, issue_date < as.Date("1993-01-01")), aes(x=Distance, y = ArrDelay,group=1, color = 'red'), fun=mean, geom="line",group=1, size=0.5) +  stat_summary(data=subset(df_q2, issue_date  > as.Date("1992-12-31")), aes(x=Distance, y = ArrDelay,group=1, color = 'cornflowerblue'), fun=mean, geom="line",group=1, size=0.5) + ggtitle("Average ArrDelay by Distance") +
  xlab("Distance (miles)") + ylab("Average ArrDelay (mins)") + 
    scale_colour_manual(name = 'Planes', 
         values =c('cornflowerblue'='cornflowerblue','red'='red'), labels = c('Normal Planes (1993 Onwards)','Old Planes (Before 1993)'))+ theme(legend.position = c(0.82,0.88))

Older planes are more likely than normal planes to have larger Average Delays as the Distance grows, with Average Arrival Delays surpassing 100 minutes or more. This further confirms that older planes tend to suffer from more delays when on a long-haul flight. Further investigations are done with Distance to identify the different delay factors that may impact old and normal planes.

library(reshape2)
package 㤼㸱reshape2㤼㸲 was built under R version 4.0.3
Attaching package: 㤼㸱reshape2㤼㸲

The following objects are masked from 㤼㸱package:data.table㤼㸲:

    dcast, melt

The following object is masked from 㤼㸱package:tidyr㤼㸲:

    smiths
# Organizing data
df_q2a <- df_q2 %>% select(issue_date, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(df_q2a ,  id.vars = 'issue_date', variable.name = 'Delay')
# Plot Total Individual Delays for full data
ggplot(df_q2a, aes(issue_date, value)) +
  geom_line(aes(colour = Delay),size=0.8) + ggtitle("Total Delay by issue_date") +
  xlab("issue_date") + ylab("Total Delay (mins)")+ theme(legend.position = c(0.12,0.78))

# Organizing data
df_q2a <- df_q2 %>% select(issue_date, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(df_q2a ,  id.vars = 'issue_date', variable.name = 'Delay')
# Plot Average Individual Delays for full data
ggplot(data=df_q2a, 
       aes(x=issue_date, y=value))+  stat_summary(aes(y = value,group=1,colour = Delay), fun=mean, geom="line",group=1, size=1) + ggtitle("Average Delay with Issue Date") +
  xlab("Issue Date") + ylab("Average Delay (mins)") + theme(legend.position = c(0.12,0.78))

The full data shows Carrier Delay being prominent with planes issued 1990-1992 and 2000.

# Organizing data
df_q2a <- df_q2 %>% select(issue_date, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(df_q2a ,  id.vars = 'issue_date', variable.name = 'Delay')
# Plot Total Individual Delays for old planes
ggplot(data=subset(df_q2a, issue_date < as.Date("1993-01-01")), 
       aes(x=issue_date, y=value))+ geom_line(aes(colour = Delay),size=1) + ggtitle("Total Delay with Issue Date (Old Planes)") +
  xlab("Issue Date") + ylab("Total Delay (mins)") + theme(legend.position = c(0.12,0.78))

# Organizing data
df_q2a <- df_q2 %>% select(issue_date, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(df_q2a ,  id.vars = 'issue_date', variable.name = 'Delay')
# Plot Average Individual Delays for old planes
ggplot(data=subset(df_q2a, issue_date < as.Date("1993-01-01")), 
       aes(x=issue_date, y=value))+  stat_summary(aes(y = value,group=1,colour = Delay), fun=mean, geom="line",group=1, size=1) + ggtitle("Average Delay with Issue Date (Old Planes)") +
  xlab("Issue Date") + ylab("Average Delay (mins)")  + theme(legend.position = c(0.5,0.78))                   

Old Planes issued around 1990 are more likely to have delays due CarrierDelay, which includes maintainence etc.

# Organizing data
df_q2a <- df_q2 %>% select(issue_date, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(df_q2a ,  id.vars = 'issue_date', variable.name = 'Delay')
# Plot Total Individual Delays for normal planes
ggplot(data=subset(df_q2a, issue_date > as.Date("1992-12-31")), 
       aes(x=issue_date, y=value))+ geom_line(aes(colour = Delay),size=1) + ggtitle("Total Delay with Issue Date (Normal Planes)") +
  xlab("Issue Date") + ylab("Total Delay (mins)")                                                            

# Organizing data
df_q2a <- df_q2 %>% select(issue_date, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(df_q2a ,  id.vars = 'issue_date', variable.name = 'Delay')
# Plot Average Individual Delays for normal planes
ggplot(data=subset(df_q2a, issue_date  > as.Date("1992-12-31")), 
       aes(x=issue_date, y=value))+  stat_summary(aes(y = value,group=1,colour = Delay), fun=mean, geom="line",group=1, size =1) + ggtitle("Average Delay with Issue Date (Normal Planes)") +
  xlab("Issue Date") + ylab("Average Delay (mins)") + theme(legend.position = c(0.88,0.77))                     

Normal Planes are likely to have CarrierDelays as well.

Distance Distribution of Old & Normal Planes

Distance is then used to check the delays that Old and Normal Planes will have with respect to the number of miles that they have travelled.

# Organizing data
old_planes=subset(df_q2, issue_date < as.Date("1993-01-01")) 
old_planes <- old_planes %>% select(Distance, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(old_planes,  id.vars = 'Distance', variable.name = 'Delay')
# Plot Total Individual Delays by Distance for Old planes
ggplot(data = df_q2a, 
       aes(x=Distance, y=value))+ geom_line(aes(colour = Delay),size=1) + ggtitle("Total Delay with Distance (Old Planes)") +
  xlab("Distance (miles)") + ylab("Total Delay (mins)") + theme(legend.position = c(0.88,0.78))

# Organizing data
old_planes=subset(df_q2, issue_date < as.Date("1993-01-01")) 
old_planes <- old_planes %>% select(Distance, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(old_planes,  id.vars = 'Distance', variable.name = 'Delay')
# Plot Average Individual Delays by Distance for Old planes
ggplot(data = df_q2a, 
       aes(x=Distance, y=value))+  stat_summary(aes(y = value,group=1,colour = Delay), fun=mean, geom="line",group=1,size=1)+ ggtitle("Average Delay with Distance (Old Planes)") +
  xlab("Distance (miles)") + ylab("Average Delay (mins)") + theme(legend.position = c(0.88,0.78))

In R, although the LateAircraftDelay and NASDelay have higher average values, Carrier Delays also have occurrences of average delay ranging around 50 minutes at 1500 miles.

# Organizing data
normal_planes=subset(df_q2, issue_date > as.Date("1992-12-31")) 
normal_planes <- normal_planes %>% select(Distance, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(normal_planes,  id.vars = 'Distance', variable.name = 'Delay')
# Plot Total Individual Delays by Distance for Normal planes
ggplot(data = df_q2a, 
       aes(x=Distance, y=value))+ geom_line(aes(colour = Delay),size=1) + ggtitle("Total Delay with Distance (Normal Planes)") +
  xlab("Distance (miles)") + ylab("Total Delay (mins)") 

# Organizing data
normal_planes=subset(df_q2, issue_date > as.Date("1992-12-31")) 
normal_planes <- normal_planes %>% select(Distance, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(normal_planes,  id.vars = 'Distance', variable.name = 'Delay')
# Plot Average Individual Delays by Distance for Normal planes
ggplot(data = df_q2a, 
       aes(x=Distance, y=value))+  stat_summary(aes(y = value,group=1,colour = Delay), fun=mean, geom="line",group=1)+ ggtitle("Average Delay with Distance (Normal Planes)") +
  xlab("Distance (miles)") + ylab("Average Delay (mins)") + theme(legend.position = c(0.88,0.78))

Normal planes have lower average values of Carrier Delays of less than 80 minutes.

Hence, it is evident that older planes would suffer more delays by having a larger value of Carrier Delay minutes while flying routes longer than 1500 miles.

Overall, the grouping by Issue Year was clear in showing that older planes do suffer from more delay, but only by up to 8 minutes on average, where LateAircraftDelay, followed by Carrier/NASDelay are the primary causes.

When utilized for long-haul flights, older planes are also more likely to have delays due to carrier delays that might have resulted from aircraft maintenance or inspection.

The difference in delay however is quite negligible and it might be due to airlines “padding” and scheduling extra time for flights to prevent flights from being classified as delayed (Kramer, 2019).

# To save up space
rm(df_q2)
rm(df_q2a)
rm(counts)
rm(normal_planes)
rm(old_planes)


Q3. How does the number of people flying between different locations change over time

We will find the most popular routes to gauge the number of people flying between these different locations.

  • Distribution by Flight Routes
  • Distribution by State
# Create new dataframe
df_q3 <- flight_notcancelled

Distribution by Flight Routes

To begin, the Origin and Destination are combined into a new ‘FlightRoute’ column (e.g., OGG to HNL).

# Create new column as its flight route
df_q3$FlightRoute <- paste(df_q3$Origin, "to", df_q3$Dest)
# Most popular routes
df_q3 %>% 
  group_by(df_q3$FlightRoute) %>%
  dplyr::summarize(Count=n()) %>%
  arrange(desc(Count))

The top five most popular routes are:

  1. OGG to HNL
  2. HNL to OGG
  3. LAX to LAS
  4. SAN to LAX
  5. LAX to SAN

Out of the 5196 distinct routes, the Top 5 routes with the highest count are identified to examine if the number of flights has changed over the course of the year.

# BarPlot
ggplot(df_q3, aes(x= Month)) + geom_bar(aes(fill=as.factor(FlightRoute))) + ggtitle("Barplot of FlightRoute counts")

# BarPlot
ggplot(df_q3, aes(x= Month)) + geom_bar(aes(fill=as.factor(FlightRoute))) + ggtitle("Barplot of FlightRoute counts")

# FlightRoute Line Plot
ggplot(df_q3, aes(x=Month, group = factor(FlightRoute) , colour=factor(FlightRoute))) + 
  geom_line(stat = 'count',size=1) + ggtitle("Count of flights per Flight Route by Month") +
  xlab("Month") + ylab("Total Flight Counts")

For the first five months, the flight routes [OGG to HNL] and [HNL to OGG] had roughly 50 fewer flights than the other three routes.

However, from June to August, both routes begin to increase in their number of flights, reaching over 100 more flights than the other routes.

Between September and December, these routes continue to have more flights than the others.

Distribution by State

To make the most of the data, the States were queried to justify the number of people flying interstate and intrastate.

# Query state from airport_df into dataframe
df_q3$state <- airport_df$state[match(df_q3$Origin, airport_df$iata)]
# Query state from airport_df into dataframe
df_q3$state <- airport_df$state[match(df_q3$Origin, airport_df$iata)]
# Most popular states
df_q3 %>% 
  group_by(df_q3$state) %>%
  dplyr::summarize(Count=n()) %>%
  arrange(desc(Count))

Top 3 popular states are:

  1. California
  2. Texas
  3. Illinois
# Select the top 3 states
df_q3 <- df_q3[df_q3$state %in% c('CA', 'TX', 'IL'), ]
# Create new column of state flight routes
df_q3$FlightRoute <- paste(df_q3$state, "to", df_q3$state[-1:0])
# Create new column of state flight routes
df_q3$FlightRoute <- paste(df_q3$state, "to", df_q3$state[-1:0])
# Flight Routes of Top 3 states
ggplot(df_q3, aes(x=Month, group = factor(FlightRoute) , colour=factor(FlightRoute))) + 
  geom_line(stat = 'count',size=1) + ggtitle("Count of flights per Flight Route by Month") +
  xlab("Month") + ylab("Total Flight Counts")

For both intrastate and interstate, February and September have the fewest flights with around 100 less than other months. As expected, the number of flights increases in the middle of the year, between June and August.

With two distinct methodologies, it is evident that February has the fewest flights, followed by September. The more popular travelling months are during June to August, which is likely due to the summer holidays in the USA which last 11 weeks from June to August (School Holidays USA, 2022).

# To save up space
rm(df_q3)


Q4. Can you detect cascading failures as delays in one airport create delays in others?

Cascading failures occur when a flight delay for one plane in an airport causes a flight delay in another.

This is explained by the existing variable ‘LateAircraftDelay’ which describes how a particular flight delayed in its Origin arrives late in its Destination, then affecting the next flight’s departure since the same plane was used. The ripple impact of a previous delay at downstream airports hence causes cascading failures.

Since Tail Numbers are identification numbers on aircraft, it will be easier to focus on data with ‘LateAircraftDelay’, then focus on a selected aircraft and observe its flight schedule.

We will approach the question as follow:

  • Methodology
  • Initial Test
  • Secondary Testing
# Create new dataframe
df_q4 <- merged_df
# Find Date with highest value counts
df_q4 %>% 
  group_by(df_q4$TailNum) %>%
  dplyr::summarize(Count=n()) %>%
  arrange(desc(Count))

‘0’ and ‘000000’ are likely private confidential TailNums, so we ignore those data and focus on the top 2 highest counts of TailNum:

  • N478HA
  • N308SW

Methodology

We will approach the dataset in the following way to effectively illustrate cascading failures, that is, delays in one airport will cause delays in another:

  1. Find Highest ‘TailNum’ count (Higher chance of continuous flights in a day)
  2. Extract data fitting condition of Highest ‘TailNum’ & ‘LateAircraftDelay’ > 15 minutes
  3. Find Highest ‘Date’ count from extracted data (Higher chance of continuous flights in a day)
  4. Extract overall data fitting conditions of Highest ‘TailNum’ & Highest ‘Date’ count
  5. Sort data by ‘DepTime’ (To see the flight schedule of same aircraft by Departure Timing)

This approach will allow us to assess if a delayed flight in one airport may cause a delay in another. To ensure that the data extraction and analysis approaches are valid, two separate tests will be conducted.

Initial Test

The initial test used N308SW to identify the highest date counts.

# First extraction of data with top most TailNum counts
df_q4_1 <- merged_df[merged_df$TailNum %in% c('N308SW') & merged_df$LateAircraftDelay > 15, ]
df_q4_1 %>% 
  group_by(df_q4_1$date) %>%
  dplyr::summarize(Count=n()) %>%
  arrange(desc(Count))


We will focus on the date with the highest count to see if there is any relation that leads to cascading failure.

# Extracting dataset that matches our findings
df_q4_1 <- merged_df[merged_df$TailNum %in% c('N308SW') & merged_df$date == ('2006-03-31'), ]
df_q4_1[order(DepTime)][-c(1),]

It is clear that one flight’s delay in an airport can cause cascading failures in another.

Taking Flight 458 as an example: has to depart at 1835 (6:35pm) but delayed for 52 mins till 1927 (7:27pm), flying from DAL to SAT. Supposed to reach 1935 (7:35pm) but delayed 50 mins and reached 2025 (8:25pm).

This 77 mins delay thereafter caused the next flight from SAT to DAL (FlightNum 178) to have departure delay for 45 minutes (43 mins for LateAircraftDelay) also since its scheduled departure time is at 2005 (8:05pm), but it only took off at 2050 (8:50pm). It also arrived 43 minutes later than expected time of 2100 at 2143.

We will further check on the full data (same date, same TailNum) to see the flights that were not included in the sample that we extracted, so as to make a more conclusive statement.

# Matching with full data to confirm assumption
df_q4_1 <- mergeddf[mergeddf$TailNum %in% c('N308SW') & mergeddf$date == ('2006-03-31'), ]
df_q4_1[order(DepTime)][-c(1:8, 10,12,14),] %>% select(10,5:8,29,15:18,30,9,11) #Select important columns

Considering Flight 755: there was a 52 minutes ArrDelay in the Destination Airport (LIT). Due to the delay, Flight 458, scheduled to depart at 1700 (5pm), was delayed for 60 mins till 1800 (6pm), flying from LIT to DAL.

Supposed to reach 1805 (6:05pm) but reached 1900 (7pm). With the 55-min delay, the next flight from DAL to SAT (FlightNum 458) was subsequently delayed for 52 minutes (50-min for LateAircraftDelay) since its scheduled departure time was at 1835 (6:35pm), but it only took off at 1927 (7:27pm). It then landed 50 minutes later, at 2143, than its expected time of 2100.

Since the DepTime overlapped with its previous flight’s CRSArrTime, the delay persisted and escalated till near midnight of the day, resulting in cascading failure.

With this, we can observe the cascading failure for Carrier WN of same TailNum, which persisted throughout the night at different airports (LIT, DAL, SAT, TUL etc.)

Since the CRSDepTime overlapped with its previous flight’s ArrTime, the delay persisted and escalated till near midnight of the day, resulting in cascading failures.

Secondary Testing

The secondary test used N478HA to identify the highest date counts.

# First extraction of data with top most TailNum counts
df_q4_2 <- merged_df[merged_df$TailNum %in% c('N478HA') & merged_df$LateAircraftDelay > 15, ]
df_q4_2 %>% 
  group_by(df_q4_2$date) %>%
  dplyr::summarize(Count=n()) %>%
  arrange(desc(Count))


We will focus on the date with the highest count to see if there is any relation that leads to cascading failure.

# Extracting dataset that matches our findings
df_q4_2 <- merged_df[merged_df$TailNum %in% c('N478HA') & merged_df$date == ('2006-03-31'), ]
df_q4_2[order(DepTime)]

Taking Flight 179 as an example: has to depart at 1328 (1:28pm) but delayed for 82 mins till 1450 (2:50pm), flying from KOA to OGG. Supposed to reach 1357 (1:57pm) but delayed 87 mins and reached 1524 (3:24pm).

We will further check on the full data (same date, same TailNum) to see the flights that were not included in the sample that we extracted, so as to make a more conclusive statement.

# Matching with full data to confirm assumption
df_q4_2 <- mergeddf[mergeddf$TailNum %in% c('N478HA') & mergeddf$date == ('2006-03-31'), ]
df_q4_2[order(DepTime)][-c(1:4),]  %>% select(10,5:8,29,15:18,30,9,11) #Select important columns
NA

This 87 mins arrival delay of FlightNum 179 thereafter caused the next flight from OGG to HNL (FlightNum 179) to have departure delay for 79 minutes (89 mins for LateAircraftDelay) also since its scheduled departure time is at 1425 (2:25pm), but it only took off at 1544 (3:44pm). It also arrived 89 minutes later than expected time of 1459 at 1628. We can also see from the day that one flight delay in an airport will cause cascading failures, contributing to ArrDelay and LateAircraftDelay for flights, hence confirming our observation.

It is also worth noting the instances where subsequent flights might not use the same TailNum, but they are assumed due to a lack of schedule information. A total of two tests were conducted in R, to demonstrate the same effect of cascading failures where delays at one airport causes delays in another. The previous flight’s late arrival, which used the same plane that would be departing, caused the subsequent flight’s delay. As a result, the current flight will depart late, setting off a chain reaction causing passengers at other airports to board the plane much later as well.

# To save space
rm(df_q4)
rm(df_q4_1)
rm(df_q4_2)


Q5. Use the available variables to construct a model that predicts delays.

With the flight data labelled, Supervised Learning algorithms such as Multiple Linear Regression and Random Forest are used to construct Regression and Classification prediction models in R.

This works by allowing the model to predict the label of new data points based on past data.

To predict delays, these few supervised learning models with selected variables are used:

  • Multiple Linear Regression
  • Random Forest
# Select variable columns
df_q5 <- df_q5 %>% select(Month, UniqueCarrier, DepTime, CRSDepTime, ArrTime, CRSArrTime, ArrDelay, DepDelay,Distance,TaxiIn,TaxiOut, ActualElapsedTime,AirTime, DelayStatus)
# Select variable columns
df_q5 <- df_q5 %>% select(Month, UniqueCarrier, DepTime, CRSDepTime, ArrTime, CRSArrTime, ArrDelay, DepDelay,Distance,TaxiIn,TaxiOut, ActualElapsedTime,AirTime, DelayStatus)

Time-related and Factor columns are selected as predictor varaibles for analysis.

# Change selected var as factors
df_q5$UniqueCarrier<- as.factor(df_q5$UniqueCarrier)
df_q5$Month <- as.factor(df_q5$Month)
# Hold-out Validation method
library(caret)
set.seed(42)

F_sample = createDataPartition(y=df_q5$ArrDelay, p = 0.7, list = F)
train = df_q5[F_sample,]
test = df_q5[-F_sample,]

First, ‘CreateDataPartition’ feature will create a train-test split to prevent overfitting, with Trainset accounting for 70%, and Testset for 30%. The model will be trained using the Trainset data, and its performance will be evaluated by predicting with the unseen Testset. For reproduction, random state is set to a random seed 42.

The response variable Y, for the Regression model will comprise of ArrDelay (minutes).

Multiple Linear Regression

MLR is a regression model which enables us to understand and estimate relationships between multiple variables.

# Load model
mlr_model <- lm(ArrDelay ~. -DelayStatus, data = train)
summary(mlr_model)

Call:
lm(formula = ArrDelay ~ . - DelayStatus, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-544.76   -4.62    0.28    4.76  964.24 

Coefficients:
                    Estimate Std. Error  t value Pr(>|t|)    
(Intercept)       -2.291e+01  9.164e-02 -250.057  < 2e-16 ***
Month2             5.663e-01  5.064e-02   11.183  < 2e-16 ***
Month3             1.807e-01  4.890e-02    3.694  0.00022 ***
Month4             6.833e-01  4.921e-02   13.887  < 2e-16 ***
Month5             8.454e-01  4.885e-02   17.308  < 2e-16 ***
Month6             1.209e+00  4.908e-02   24.634  < 2e-16 ***
Month7             8.584e-01  4.870e-02   17.626  < 2e-16 ***
Month8             7.547e-01  4.840e-02   15.594  < 2e-16 ***
Month9             5.705e-01  4.946e-02   11.535  < 2e-16 ***
Month10            8.010e-01  4.882e-02   16.406  < 2e-16 ***
Month11           -5.906e-01  4.928e-02  -11.985  < 2e-16 ***
Month12            1.041e-02  4.924e-02    0.212  0.83248    
UniqueCarrierAA    3.581e+00  8.346e-02   42.911  < 2e-16 ***
UniqueCarrierAQ    1.144e+01  1.526e-01   74.939  < 2e-16 ***
UniqueCarrierAS    4.942e+00  1.014e-01   48.753  < 2e-16 ***
UniqueCarrierB6   -1.331e+00  9.994e-02  -13.316  < 2e-16 ***
UniqueCarrierCO   -1.652e+00  9.025e-02  -18.308  < 2e-16 ***
UniqueCarrierDL    8.364e-01  8.509e-02    9.830  < 2e-16 ***
UniqueCarrierEV    5.848e-01  9.131e-02    6.405 1.51e-10 ***
UniqueCarrierF9    6.085e+00  1.159e-01   52.504  < 2e-16 ***
UniqueCarrierFL    3.379e+00  9.269e-02   36.459  < 2e-16 ***
UniqueCarrierHA    1.273e+01  1.379e-01   92.327  < 2e-16 ***
UniqueCarrierMQ    2.741e+00  8.388e-02   32.683  < 2e-16 ***
UniqueCarrierNW    3.897e+00  8.614e-02   45.239  < 2e-16 ***
UniqueCarrierOH    1.801e-02  9.246e-02    0.195  0.84558    
UniqueCarrierOO    4.397e+00  8.338e-02   52.737  < 2e-16 ***
UniqueCarrierTZ    7.791e+00  2.775e-01   28.073  < 2e-16 ***
UniqueCarrierUA    2.812e+00  8.549e-02   32.895  < 2e-16 ***
UniqueCarrierUS    2.171e+00  8.489e-02   25.568  < 2e-16 ***
UniqueCarrierWN    4.935e+00  7.991e-02   61.761  < 2e-16 ***
UniqueCarrierXE    8.736e-02  8.571e-02    1.019  0.30808    
UniqueCarrierYV    1.168e+00  9.015e-02   12.951  < 2e-16 ***
DepTime           -4.595e-04  8.438e-05   -5.445 5.17e-08 ***
CRSDepTime         8.421e-04  8.466e-05    9.947  < 2e-16 ***
ArrTime            8.532e-04  4.092e-05   20.851  < 2e-16 ***
CRSArrTime        -1.974e-03  4.941e-05  -39.958  < 2e-16 ***
DepDelay           9.816e-01  3.036e-04 3233.017  < 2e-16 ***
Distance          -4.051e-02  7.202e-05 -562.542  < 2e-16 ***
TaxiIn             7.041e-02  1.078e-03   65.327  < 2e-16 ***
TaxiOut            5.560e-01  1.466e-03  379.253  < 2e-16 ***
ActualElapsedTime  2.747e-01  1.116e-03  246.238  < 2e-16 ***
AirTime            4.863e-02  9.890e-04   49.168  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.984 on 1001672 degrees of freedom
Multiple R-squared:  0.931, Adjusted R-squared:  0.931 
F-statistic: 3.295e+05 on 41 and 1001672 DF,  p-value: < 2.2e-16

The model summary is then used to identify the significant variables affecting ArrDelay, excluding DelayStatus since they are related.

The R2 score is 0.931, where predictor variables in the model explained 93.1% of the variation in Y (ArrDelay). RMSE is also considerably fitting at 9.98.

# Predicting delays
delayPred <- predict(mlr_model, test)  
# Create ActualnPred dataframe
ActualnPred <- data.frame(cbind(Actual=test$ArrDelay, Predicted=round(delayPred)))
ActualnPred
# 96.4% correlation accuracy of similar directional movement
correlation_accuracy <- cor(ActualnPred)
correlation_accuracy
             Actual Predicted
Actual    1.0000000 0.9641986
Predicted 0.9641986 1.0000000

The Correlation Accuracy of 96.4% shows how the actual and predicted values have similar directional movements.

sqrt(mean((test$ArrDelay - delayPred)^2))
[1] 9.996467

The R2 score is 0.9369 and Root Mean Squared Error (RMSE) is 9.676569. The R2 value shows that the predictor variables in the model are able to explain 93.69% of the variation in ArrDelay.

Random Forest

Used for Classification and Regression, Random Forest is a Supervised Learning algorithm that constructs many decision trees.

R uses classification model where the output is chosen by a majority vote among decision trees.

set.seed(42)
require(caTools)
sample = sample.split(df_q5$DelayStatus, SplitRatio = .70)
train = subset(df_q5, select = -c(ArrDelay), sample == TRUE)
test  = subset(df_q5, select = -c(ArrDelay), sample == FALSE)
train$DelayStatus <- as.character(train$DelayStatus)
train$DelayStatus <- as.factor(train$DelayStatus)

Hence, a new train-test set with response variable Y as DelayStatus of similar parameters (30% Test) is created, with the ArrDelay column removed to prevent multicollinearity.

rf

Call:
 randomForest(formula = DelayStatus ~ ., data = train, ntree = 10,      random_state = 42) 
               Type of random forest: classification
                     Number of trees: 10
No. of variables tried at each split: 3

        OOB estimate of  error rate: 6.9%
Confusion matrix:
       0      1 class.error
0 734411  29334  0.03840811
1  39044 188485  0.17160010
rf

Call:
 randomForest(formula = DelayStatus ~ ., data = train, ntree = 10,      random_state = 42) 
               Type of random forest: classification
                     Number of trees: 10
No. of variables tried at each split: 3

        OOB estimate of  error rate: 6.9%
Confusion matrix:
       0      1 class.error
0 734411  29334  0.03840811
1  39044 188485  0.17160010

The Out-of-bag error estimate is at 6.9%, where Accuracy = 1 -OOB error, so the model has an accuracy of 93.1%.

delayPred <- predict(rf, test)  
delayPred <- predict(rf, test)  
test$DelayStatus <- as.factor(test$DelayStatus)

# Create ActualnPred dataframe
ActualnPred <- data.frame(cbind(Actual=test$DelayStatus, Predicted=delayPred))
ActualnPred

RF Model without DepDelay

Since DepDelay and ArrDelay have a positive linear relationship which might lead to multicollinearity, we will investigate to see if there are changes to the factors that might affect DelayStatus.

rf <- randomForest(DelayStatus ~ .- DepDelay, data=train, ntree=10, random_state =42)
rf

Call:
 randomForest(formula = DelayStatus ~ . - DepDelay, data = train,      ntree = 10, random_state = 42) 
               Type of random forest: classification
                     Number of trees: 10
No. of variables tried at each split: 3

        OOB estimate of  error rate: 6.18%
Confusion matrix:
       0      1 class.error
0 740506  23440   0.0306828
1  37786 189765   0.1660551
delayPred <- predict(rf, test)  

Models with and without DepDelay show the changes in attributing factors causing DelayStatus, where the top 6 factors for both models are similar.

delayPred <- predict(rf, test)  
test$DelayStatus <- as.factor(test$DelayStatus)

# Create ActualnPred dataframe
ActualnPred <- data.frame(cbind(Actual=test$DelayStatus, Predicted=delayPred))
ActualnPred

In total, one regression and one classification method were tested out in R. Overall, the best model to predict the continuous ArrDelay would be MLR with 95.7% accuracy. To predict the binary outcome DelayStatus, Random Forest model is around 93.1% accurate.

---
title: "Flight Analysis in the U.S."
author: 'Evangeline Tan'
output: 
  html_notebook:
    toc: true
    toc_depth: 2
    toc_float: true
    theme: united
---


# Introduction

Every year, nearly 25% of airline flights are delayed or cancelled, costing travellers over $30 billion in lost time and money  Flight delays have long been a cause of dissatisfaction in the airline industry, as well as a source of annoyance for passengers and carriers. <p>
Our goal is to use the massive amount of airline data to visualise and study the flight patterns and predict if a flight will be delayed. For this study, both Python and R will be used to investigate 2 years’ worth of data, since two full business cycles are adequate in reducing bias for one cycle.


This notebook aims to look at these questions regarding flight travel:

1. When is the best time of day, day of the week, and time of year to fly to minimise delays?
2. Do older planes suffer more delays?
3. How does the number of people flying between different locations change over time?
4. Can you detect cascading failures as delays in one airport create delays in others?
5. Use the available variables to construct a model that predicts delays.

<p>

Data: [Airlines - Harvard Dataverse](https://doi.org/10.7910/DVN/HG7NV7)
<br>

# Import data and libraries
```{r data, message=FALSE, warning=FALSE}
library(ggplot2)
library(dplyr)
library(zoo)
library(tidyr)
library("data.table")
library(plyr)
options(warn=-1)

setwd("D:/et4_e/coursework/2021")

# Import & prepare the 2006 and 2007 dataset
df_2006 = fread("2006.csv.bz2")
df_2007 = fread("2007.csv.bz2")

mergeddf = rbind(df_2006, df_2007)
airport_df <- read.csv2('airports.csv',sep = ",",header = TRUE)
carrier_df <- read.csv2('carriers.csv',sep = ",",header = TRUE)
planes_df <- read.csv2('D:/et4_e/coursework/2021/plane-data.csv',sep = ",",header = TRUE)
```

<br>

# Understanding Data
```{r details}
# Create Date column
mergeddf$date <- as.Date(with(mergeddf, paste(Year, Month, DayofMonth, sep="-")), "%Y-%m-%d")
# Create copy of merged data
merged_df = mergeddf
head(merged_df)
dim(merged_df)
```
 <br>
There are almost 14.6 million records with 29 variable columns. The columns include the airline and flight details etc. and are mostly time-related (in mins).
<br>

```{r summary}
summary(merged_df)
str(merged_df)
```

# Data Pre-Processing/Cleaning

## Creating sample of entire data {.unlisted .unnumbered}

Due to the large dataset requiring more time to execute, we will randomly select 10% of the data for quick analysis.

We then filter the data into Cancelled and Non-cancelled flights.
<br>
```{r sample}
# Remove original data & take sample (10%) of merged data (save space & load faster)
rm(df_2006)
rm(df_2007)
set.seed(42)
merged_df = sample_frac(merged_df, 0.10, replace = FALSE)
```


```{r rename}
# Rename CancellationCode Column
merged_df$CancellationCode <- mapvalues(merged_df$CancellationCode,
                           from = c("A", "B", "C", "D"),
                           to = c("Carrier", "Weather", "National Air System (NAS)", "Security"))

merged_df$Cancelled <- mapvalues(merged_df$Cancelled,
                           from = c(1, 0),
                           to = c("Cancelled", "Not Cancelled"))
merged_df$Diverted <- mapvalues(merged_df$Diverted,
                           from = c(1, 0),
                           to = c("Diverted", "Not Diverted"))
```




```{r missing}
# Check for missing values
sum(is.na(merged_df))

# Missing values per column
sapply(merged_df,function(x)sum(is.na(x)))
```

```{r cleaning duplicates}
# Check for duplicates
sum(duplicated(merged_df))
distinct(merged_df)
```
<br>

Missing data is also handled using linear interpolation to estimate unknown data values between known data values, and duplicates are removed. 

```{r interpolation}
# Imputate Null values with interpolation

merged_df <- merged_df %>%
        mutate(DepTime = na.approx(DepTime))
merged_df <- merged_df %>%
        mutate(ArrTime = na.approx(ArrTime))
merged_df <- merged_df %>%
        mutate(ActualElapsedTime = na.approx(ActualElapsedTime))
merged_df <- merged_df %>%
        mutate(CRSElapsedTime = na.approx(CRSElapsedTime))
merged_df <- merged_df %>%
        mutate(AirTime = na.approx(AirTime))
merged_df <- merged_df %>%
        mutate(ArrDelay = na.approx(ArrDelay))
merged_df <- merged_df %>%
        mutate(DepDelay = na.approx(DepDelay))
```

```{r type}
# Change selected data types (Numeric to Categorical)
merged_df$Year <- as.factor(merged_df$Year)
merged_df$Month <- as.factor(merged_df$Month)
merged_df$DayofMonth <- as.factor(merged_df$DayofMonth)
merged_df$DayOfWeek <- as.factor(merged_df$DayOfWeek)
merged_df$FlightNum <- as.factor(merged_df$FlightNum)
merged_df$Cancelled <- as.factor(merged_df$Cancelled)
merged_df$Diverted <- as.factor(merged_df$Diverted)
```

```{r month label}
# Create labelled column for easier visualisation
merged_df$Month_label <- month.abb[merged_df$Month]

```


```{r check missing}
# Check for missing values per column
sapply(merged_df,function(x)sum(is.na(x)))

```
<br>
The null values have all been removed from the dataframe. 
<br>

```{r filter}
flight_cancelled <- merged_df %>% filter_at(vars(Cancelled), any_vars(. %in% c('Cancelled')))
flight_notcancelled <- merged_df %>% filter_at(vars(Cancelled), any_vars(. %in% c('Not Cancelled')))

```
<br>

## Creating Delay Status

We assume that a delayed flight is equivalent to arriving late for more than 15 minutes at its destination. (ArrDelay > 15 mins)

Since flights can be delayed on its Departure but still arrive on time, hence we do not classify those as a delayed flight.

Hence we create a DelayStatus column into the main dataframe (merged_df),where 0 = No Delay, 1 = Delay.

```{r delay status}
# Creating new column showing ArrDelay > 15mins
# 0 = No Delay, 1= Delay
flight_notcancelled$DelayStatus <- ifelse(flight_notcancelled$ArrDelay > 15, 1, 0)

```


```{r propdelayed}
table(flight_notcancelled$DelayStatus)
prop.table(table(flight_notcancelled$DelayStatus))

```


This shows that 77% have no delays (ArrDelay > 15 minutes), where they either arrived early or on time. Also, 23% of flights were delayed. Equivalent to about 1 out of every 5 flights being delayed.

<br>

# Exploratory Data Analysis (EDA)

We will be looking at the different variables to get a better understanding of the data.

* Total Flight Distribution
* Cancellation
* Delay
<br>

## Total Flight Distribution

### Total Flight Distribution of Full Data by Month

```{r alldistr}
prop.table(table(merged_df$Month))
prop.table(table(merged_df$DayOfWeek))
```



```{r plot total}
# Percentage Distribution of Month
ggplot(merged_df, aes(x = Month)) +  
  geom_bar(aes(y = (..count..)/sum(..count..)), fill = 'cornflowerblue') + ggtitle("Month (%)") +
  ylab("Percentage (%)")
# Percentage Distribution of DayOfWeek
ggplot(merged_df, aes(x = DayOfWeek)) +  
  geom_bar(aes(y = (..count..)/sum(..count..)),fill = 'cornflowerblue') + ggtitle("DayOfWeek (%)") +
  ylab("Percentage (%)")
```
<br>
The data is almost evenly distributed between Month and DayOfWeek, with February and Saturday having the least number of total flights.
<br>

```{r totalflights per airline}
#Total Number of flights per Airline
ggplot(merged_df, aes(x = forcats::fct_infreq(UniqueCarrier))) +  
  geom_bar(aes(y = (..count..)),fill = 'cornflowerblue') + ggtitle("Total Number of Flights per Airline") +
  xlab("UniqueCarrier")
```

```{r query mostflights}
carrier_df %>% filter_all(any_vars(. %in% c('WN', 'AA','OO','MQ','US')))
```

The top 5 airlines with the most flights are WN, AA, OO, MQ, UA.

1. Southwest Airlines
2. American Airlines
3. Skywest Airlines
4. American Eagle Airlines
5. United Airlines

```{r boxplot depdelay}
#Boxplot Distribution of Total ArrDelay per Carrier
boxplot(flight_notcancelled$ArrDelay~flight_notcancelled$UniqueCarrier,
        main = "Distribution of Total ArrDelay per Carrier",
        xlab = "Carrier/Airline",
        ylab = "Total ArrDelay (mins)",
        border = "black"
        )
```

### Cancellation
```{r}
# Percentage Distribution of Cancellation
ggplot(flight_cancelled, aes(x = CancellationCode)) +  
  geom_bar(aes(y = (..count..)/sum(..count..)), fill = 'cornflowerblue') + ggtitle("Cancellation Reasons (%)") +
  ylab("Percentage (%)")
```

We can conclude that Cancellations are mostly due to Carrier, Weather and NAS with around 43%, 35% and 20% respectively.

```{r totalcancelledflights per airline}
#Total Number of Cancelled flights per Airline
ggplot(flight_cancelled, aes(x = forcats::fct_infreq(UniqueCarrier))) +  
  geom_bar(aes(y = (..count..)),fill = 'cornflowerblue') + ggtitle("Total Number of Cancelled Flights per Airline") +
  xlab("UniqueCarrier")
```

```{r query mostcancelledflights}
carrier_df %>% filter_all(any_vars(. %in% c('MQ', 'AA','OO')))
```

The top 3 most cancelled flights throughout these 2 years are:

1. American Eagle Airlines
2. American Airlines
3. Skywest Airlines

```{r}
#Total Number of Cancelled flights per Month
ggplot(flight_cancelled, aes(x = forcats::fct_infreq(Month))) +  
  geom_bar(aes(y = (..count..)),fill = 'cornflowerblue') + ggtitle("Total Number of Cancelled Flights per Month") +
  xlab("UniqueCarrier")
#Total Number of Cancelled flights per DayOfWeek
ggplot(flight_cancelled, aes(x = forcats::fct_infreq(DayOfWeek))) +  
  geom_bar(aes(y = (..count..)),fill = 'cornflowerblue') + ggtitle("Total Number of Cancelled Flights per DayOfWeek") +
  xlab("UniqueCarrier")
```

The flights are mostly likely to be cancelled in December and February, on a Thursday and Friday.

<p>

### Delay

```{r}
cor.test(flight_notcancelled$ArrDelay, flight_notcancelled$DepDelay, method = "pearson")
```

ArrDelay and DepDelay have a strong positive linear relationship, implying that a Departure Delay will almost certainly result in an Arrival Delay.

```{r}
# BarPlot
ggplot(flight_notcancelled, aes(x= UniqueCarrier)) + geom_bar(aes(fill=as.factor(DelayStatus))) + ggtitle("Barplot of DelayStatus counts per UniqueCarrier")
```

```{r query mostdelayedflights}
carrier_df %>% filter_all(any_vars(. %in% c('WN', 'AA','OO')))
```

The top 3 most delayed flights throughout these 2 years are:

1. Southwest Airlines
2. American Airlines
3. Skywest Airlines


# Q1. When is the best time of day, day of the week, and time of year to fly to minimise delays?

We will breakdown this question into three parts, where we will find the airline carrier and time period least likely to have delayed flights:

* Best Time of the Day
* Best Day of the Week
* Best Month of the Year
* Best Day of the Month
<br>

## Best Time of the Day
### Distribution of Average Delay by Time Period
```{r copydata1}
# Create copy of dataframe
df_q1 <- flight_notcancelled

```
<br>
Time Intervals column ‘ArrPeriod’ was created based on ‘ArrTime’. 24 hours in a day will split into 6 different periods with at least 3 to 5-hour intervals since different timings like 5am and 11am are better not generalised together into a single timeframe. The period is split as such: 

* Midnight (12am - 5am)
* Early Morning (5am - 9am)
* Late Morning (9am - 12pm)
* Afternoon (12pm - 5pm)
* Evening (5pm - 9pm)
* Night (9pm - 12am)

```{r timeperiod}
# Categorising ArrTime and DepTime by 6 periods: Midnight, Early Morning, Late Morning, Afternoon, Evening, Night
df_q1 <- df_q1 %>%
  mutate(ArrPeriod = case_when(ArrTime >= 500 & ArrTime < 900 ~ 'Early Morning', 
                          ArrTime >= 900 & ArrTime < 1200 ~ 'Late Morning',
                          ArrTime >= 1200 & ArrTime < 1700 ~ 'Afternoon',
                          ArrTime >= 1700 & ArrTime < 2100 ~ 'Evening',
                          ArrTime >= 2100 & ArrTime < 2400 ~ 'Night',
                          TRUE ~  'Midnight'))
```


```{r timebar}
# Create table counts
counts <- table(df_q1$DelayStatus,df_q1$ArrPeriod)

# Plot grouped barplot
barplot(counts, col = c("white","cornflowerblue"),
                       xlab = "Time Period", ylab = "Total Delay Count",
                       main = "Total Delay per Period",beside=TRUE, 
                       legend = rownames(counts))
```

Early Morning followed by Midnight has the lowest count for number of flights delayed. However, we will proceed on to check if having the lowest number of count will equate to the shortest average delayed time.

```{r time boxplot}
# Boxplot
ggplot(df_q1, aes(x= ArrDelay)) + geom_boxplot(aes(color=as.factor(ArrPeriod))) + ggtitle("Boxplot of ArrDelay (mins) vs ArrPeriod")
```

The boxplot shows that the mean of ArrDelay for midnight is more than the other time period. Hence, validating our assumption that lowest flight delay counts does not equate to lowest average delayed time.

```{r timeline}
#Set Sequence to ArrPeriod
df_q1$ArrPeriod <- factor(df_q1$ArrPeriod,levels = c("Midnight", "Early Morning", "Late Morning", "Afternoon", "Evening", "Night"))

df_q1  %>% group_by(ArrPeriod) %>% dplyr::summarize(mean = mean(ArrDelay)) %>%
  ggplot(aes(x= ArrPeriod, y = mean)) +
  geom_point() +
  geom_line(aes(group = 1), col = "cornflowerblue") + ggtitle("Average ArrDelay vs Time Period") +
  xlab("Time Period") + ylab("Average ArrDelay (mins)")
```


The best time period to minimise flight delays would be in the Early Morning from 5am to 9am.


 <p>

## Best Day of the Week
### Distribution of Day Delay
```{r daybar}
# Create table for counts
counts <- table(df_q1$DelayStatus,df_q1$DayOfWeek)
# Plot grouped barplot
barplot(counts, col = c("white","cornflowerblue"),
                       xlab = "DayOfWeek",ylab = "Total Delay Count",
                       main = "Total Delay per DayOfWeek",beside=TRUE, 
                       legend = rownames(counts))
```

```{r dayline}
#Line Plot
df_q1  %>% group_by(DayOfWeek) %>% dplyr::summarize(mean = mean(ArrDelay)) %>%
  ggplot(aes(x= DayOfWeek, y = mean)) +
  geom_point() +
  geom_line(aes(group = 1), col = "cornflowerblue") + ggtitle("Average ArrDelay vs DayOfWeek") +
  xlab("DayOfWeek") + ylab("Average ArrDelay (mins)")
```

The Best Day of the Week to minimize delays is to travel on Saturday, followed by Tuesday, with average ArrDelay timing of less than 6 minutes and 8 minutes, respectively.

The longest average delays of 12-13 minutes are expected in the middle of the week, from Thursday to Friday.


 <p>

## Best Month of the Year
### Distribution of Monthly Delay
```{r monthbar}
# create table counts
counts <- table(df_q1$DelayStatus,df_q1$Month)
# Plot grouped barplot
barplot(counts, col = c("white","cornflowerblue"),
                       xlab = "Month",ylab = "Total Delay Count",
                       main = "Total Delay per Month",beside=TRUE, 
                       legend = rownames(counts))
```

```{r monthline}
df_q1 %>% group_by(Month) %>% dplyr::summarize(mean = mean(ArrDelay)) %>%
  ggplot(aes(x= Month, y = mean)) +
  geom_point() +
  geom_line(aes(group = 1), col = "cornflowerblue") + ggtitle("Average ArrDelay vs Month") +
  xlab("Month") + ylab("Average ArrDelay (mins)")
```

Based on ArrDelay, the best time of year to minimise travel delay is November, then September, with both averaging approximately 6 minutes of delay, as opposed to June and December, with more than twice the number of minutes delayed.

Due to the U.S. summer and winter vacation (School Holidays USA, 2022), June and December are projected to be popular months to travel.


 <p>

## Best Day of the Month
```{r}
df_q1 %>% group_by(DayofMonth) %>% dplyr::summarize(mean = mean(ArrDelay)) %>%
  ggplot(aes(x= DayofMonth, y = mean)) +
  geom_point() +
  geom_line(aes(group = 1), col = "cornflowerblue") + ggtitle("Average ArrDelay vs DayofMonth") +
  xlab("DayofMonth") + ylab("Average ArrDelay (mins)")
```
Also, travelling in the first half of the month, around the 8th – 9th, results in an average ArrDelay of around 6 minutes, and the second half of the month with double the delay amount
 
<br>

Hence overall, the top 2 recommended time period to avoid flight delay is:

* Early Morning (5am-9am) & Late Morning (9am-12pm) (≈0 minutes)
* Saturday & Tuesday (≈6 minutes)
* September & November (≈6 minutes)
* Around 8th – 9th (Specifically on 8th) (≈6 minutes)

All the above periods are delayed by an average of 0 to 6 minutes. Passengers may reduce their flight delays even further by booking flights with airlines that are below the delay threshold (%).


```{r remove dfq1}
# To save up space
rm(df_q1)
```
<br>

# Q2. Do older planes suffer more delays?

The issue date of planes will be extracted from the planes_df data since we can judge the age of the aircraft and make our analysis based on that.


* Distribution of Arr/Dep Delay by Issue Date
* Distribution of Arr/Dep Delay by Issue Year
* Comparison of Old and Normal Planes


To determine the age of the aircrafts, the engine's issue date is obtained by mapping the plane-data.csv consisting of the plane's ‘issue_date’, into a duplicated data frame of the 'flight_notcancelled' data as a new column.

```{r copydata2}
# Create a new dataframe
df_q2 <- flight_notcancelled
```


```{r extractissue_date}
# Extract issue_date from plane_df
df_q2$issue_date <- planes_df$issue_date[match(df_q2$TailNum, planes_df$tailnum)]
```

```{r checkmissing2}
# Check for missing value
sapply(df_q2,function(x)sum(is.na(x)))
```
<br>

Since our dataset is huge with more than 2 million records, we will proceed to clean the 166k records of missing issue_date.


```{r qn2 cleaning}
# Remove missing & None values
df_q2 <- na.omit(df_q2)
df_q2 <- df_q2 %>% 
  filter(!grepl('None', issue_date))

# Convert to datetime format
df_q2$issue_date <- as.Date(df_q2$issue_date, "%m/%d/%Y")
```


## Distribution of Arr Delay with Issue Date

### Total Sum of ArrDelay by Issue Date
```{r plot issue_date}
# Total Sum of Delay by issue_date
ggplot(data = df_q2, aes(x=issue_date, y=ArrDelay))+
  geom_line() + ggtitle("Total ArrDelay with Issue Date") +
  xlab("Issue Date") + ylab("Total ArrDelay (mins)")
```

From the graph, we can see three peak points around year 1986, 2000 and 2004. Hence, we proceed to check with the mean value.

```{r plot mean issue_date}
# Average of ArrDelay by issue_date
df_q2 %>% group_by(issue_date) %>% dplyr::summarize(mean = mean(ArrDelay)) %>%
ggplot(aes(x=issue_date, y=mean))+
  geom_line() + ggtitle("Average ArrDelay with Issue Date") +
  xlab("Issue Date") + ylab("Average ArrDelay (mins)")
```

Based on Issue Date, there seem to be more delay occurrences for planes issued after 1998. Also, the low total minutes delayed between 1976 and 1984 might be attributed to planes older than 24 years being removed owing to maintenance faults, or the airline scheduling them with ample time to avoid delays.


<p>

## Distribution of Arr Delay with Issue Year


A new column of issue year and plane age is then extracted from the issue date.
```{r issueyear}
df_q2$issue_year = format(as.Date(df_q2$issue_date, format="%m/%d/%Y"),"%Y")
```

```{r issueyearline, fig.width=10}
df_q2 %>% group_by(issue_year) %>% dplyr::summarize(mean = mean(ArrDelay)) %>%
  ggplot(aes(x= issue_year, y = mean)) +
  geom_point() +
  geom_line(aes(group = 1), size=1, col = "cornflowerblue") + ggtitle("Average ArrDelay vs issue_year") +
  xlab("issue_year") + ylab("Average ArrDelay (mins)")

```

Older planes issued before 1984 have a higher average delay of more than 11 minutes, compared to those 1984 onwards with under 11 minutes of average delay.

Further investigations will then be made with Issue Date rather than its Year, as delays became increasingly prevalent after year 2000.



## Comparing Old Planes and Normal Planes

The average age of U.S planes is 11 years, with about 25% of planes above 15 years old (Mayerowitz, 2011). 

Hence, we will deem an aircraft to be old when in operation for 15+ years, where data will split into two, with planes issued before 1993 termed old, and those issued 1993 onwards termed normal..

<p>

### Distribution of Old & Normal Planes
```{r totalplotdist oldnormal}
# Plot a subset of older planes
ggplot(data=subset(df_q2, issue_date < as.Date("1993-01-01")), 
       aes(x=issue_date, y=ArrDelay))+ geom_line() + ggtitle("Total ArrDelay with Issue Date (Old Planes (Before 1993))") +
  xlab("Issue Date") + ylab("Total ArrDelay (mins)")
# Plot a subset of normal planes
ggplot(data=subset(df_q2, issue_date > as.Date("1992-12-31")), 
       aes(x=issue_date, y=ArrDelay))+ geom_line() + ggtitle("Total ArrDelay with Issue Date (Normal Planes (1993 Onwards))") +
  xlab("Issue Date") + ylab("Total ArrDelay (mins)")
```

```{r meanplotdist oldnormal}
# Plot Average ArrDelay of older planes
ggplot(data=subset(df_q2, issue_date < as.Date("1993-01-01")), 
       aes(x=issue_date, y=ArrDelay))+  stat_summary(aes(y = ArrDelay,group=1), fun=mean, geom="line",group=1) + ggtitle("Average ArrDelay with Issue Date (Old Planes (Before 1993))") +
  xlab("Issue Date") + ylab("Average ArrDelay (mins)")
# Plot Average ArrDelay of Normal planes
ggplot(data=subset(df_q2, issue_date > as.Date("1992-12-31")), 
       aes(x=issue_date, y=ArrDelay))+  stat_summary(aes(y = ArrDelay,group=1), fun=mean, geom="line",group=1) + ggtitle("Average ArrDelay with Issue Date (Normal Planes (1993 Onwards))") +
  xlab("Issue Date") + ylab("Average ArrDelay (mins)")
# Plot Average ArrDelay by Distance of Old/Normal planes
ggplot() +  stat_summary(data=subset(df_q2, issue_date < as.Date("1993-01-01")), aes(x=Distance, y = ArrDelay,group=1, color = 'red'), fun=mean, geom="line",group=1, size=0.5) +  stat_summary(data=subset(df_q2, issue_date  > as.Date("1992-12-31")), aes(x=Distance, y = ArrDelay,group=1, color = 'cornflowerblue'), fun=mean, geom="line",group=1, size=0.5) + ggtitle("Average ArrDelay by Distance") +
  xlab("Distance (miles)") + ylab("Average ArrDelay (mins)") + 
    scale_colour_manual(name = 'Planes', 
         values =c('cornflowerblue'='cornflowerblue','red'='red'), labels = c('Normal Planes (1993 Onwards)','Old Planes (Before 1993)'))+ theme(legend.position = c(0.82,0.88))
```

Older planes are more likely than normal planes to have larger Average Delays as the Distance grows, with Average Arrival Delays surpassing 100 minutes or more. This further confirms that older planes tend to suffer from more delays when on a long-haul flight. Further investigations are done with Distance to identify the different delay factors that may impact old and normal planes. 

```{r indv delay total}
library(reshape2)
# Organizing data
df_q2a <- df_q2 %>% select(issue_date, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(df_q2a ,  id.vars = 'issue_date', variable.name = 'Delay')
# Plot Total Individual Delays for full data
ggplot(df_q2a, aes(issue_date, value)) +
  geom_line(aes(colour = Delay),size=0.8) + ggtitle("Total Delay by issue_date") +
  xlab("issue_date") + ylab("Total Delay (mins)")+ theme(legend.position = c(0.12,0.78))
```

```{r avg indv delay full}
# Organizing data
df_q2a <- df_q2 %>% select(issue_date, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(df_q2a ,  id.vars = 'issue_date', variable.name = 'Delay')
# Plot Average Individual Delays for full data
ggplot(data=df_q2a, 
       aes(x=issue_date, y=value))+  stat_summary(aes(y = value,group=1,colour = Delay), fun=mean, geom="line",group=1, size=1) + ggtitle("Average Delay with Issue Date") +
  xlab("Issue Date") + ylab("Average Delay (mins)") + theme(legend.position = c(0.12,0.78))
```

The full data shows Carrier Delay being prominent with planes issued 1990-1992 and 2000.

```{r total indv delay old}
# Organizing data
df_q2a <- df_q2 %>% select(issue_date, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(df_q2a ,  id.vars = 'issue_date', variable.name = 'Delay')
# Plot Total Individual Delays for old planes
ggplot(data=subset(df_q2a, issue_date < as.Date("1993-01-01")), 
       aes(x=issue_date, y=value))+ geom_line(aes(colour = Delay),size=1) + ggtitle("Total Delay with Issue Date (Old Planes)") +
  xlab("Issue Date") + ylab("Total Delay (mins)") + theme(legend.position = c(0.12,0.78))
```

```{r avg indv delay old}
# Organizing data
df_q2a <- df_q2 %>% select(issue_date, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(df_q2a ,  id.vars = 'issue_date', variable.name = 'Delay')
# Plot Average Individual Delays for old planes
ggplot(data=subset(df_q2a, issue_date < as.Date("1993-01-01")), 
       aes(x=issue_date, y=value))+  stat_summary(aes(y = value,group=1,colour = Delay), fun=mean, geom="line",group=1, size=1) + ggtitle("Average Delay with Issue Date (Old Planes)") +
  xlab("Issue Date") + ylab("Average Delay (mins)")  + theme(legend.position = c(0.5,0.78))                   
```

Old Planes issued around 1990 are more likely to have delays due CarrierDelay, which includes maintainence etc.

```{r total indv delay normal}
# Organizing data
df_q2a <- df_q2 %>% select(issue_date, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(df_q2a ,  id.vars = 'issue_date', variable.name = 'Delay')
# Plot Total Individual Delays for normal planes
ggplot(data=subset(df_q2a, issue_date > as.Date("1992-12-31")), 
       aes(x=issue_date, y=value))+ geom_line(aes(colour = Delay),size=1) + ggtitle("Total Delay with Issue Date (Normal Planes)") +
  xlab("Issue Date") + ylab("Total Delay (mins)")                                                            
```

```{r avg indv delay normal}
# Organizing data
df_q2a <- df_q2 %>% select(issue_date, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(df_q2a ,  id.vars = 'issue_date', variable.name = 'Delay')
# Plot Average Individual Delays for normal planes
ggplot(data=subset(df_q2a, issue_date  > as.Date("1992-12-31")), 
       aes(x=issue_date, y=value))+  stat_summary(aes(y = value,group=1,colour = Delay), fun=mean, geom="line",group=1, size =1) + ggtitle("Average Delay with Issue Date (Normal Planes)") +
  xlab("Issue Date") + ylab("Average Delay (mins)") + theme(legend.position = c(0.88,0.77))                     
```

Normal Planes are likely to have CarrierDelays as well.


### Distance Distribution of Old & Normal Planes

Distance is then used to check the delays that Old and Normal Planes will have with respect to the number of miles that they have travelled.
```{r total delay w dist old}
# Organizing data
old_planes=subset(df_q2, issue_date < as.Date("1993-01-01")) 
old_planes <- old_planes %>% select(Distance, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(old_planes,  id.vars = 'Distance', variable.name = 'Delay')
# Plot Total Individual Delays by Distance for Old planes
ggplot(data = df_q2a, 
       aes(x=Distance, y=value))+ geom_line(aes(colour = Delay),size=1) + ggtitle("Total Delay with Distance (Old Planes)") +
  xlab("Distance (miles)") + ylab("Total Delay (mins)") + theme(legend.position = c(0.88,0.78))
```

```{r avg delay w dist old}
# Organizing data
old_planes=subset(df_q2, issue_date < as.Date("1993-01-01")) 
old_planes <- old_planes %>% select(Distance, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(old_planes,  id.vars = 'Distance', variable.name = 'Delay')
# Plot Average Individual Delays by Distance for Old planes
ggplot(data = df_q2a, 
       aes(x=Distance, y=value))+  stat_summary(aes(y = value,group=1,colour = Delay), fun=mean, geom="line",group=1,size=1)+ ggtitle("Average Delay with Distance (Old Planes)") +
  xlab("Distance (miles)") + ylab("Average Delay (mins)") + theme(legend.position = c(0.88,0.78))
```

In R, although the LateAircraftDelay and NASDelay have higher average values, Carrier Delays also have occurrences of average delay ranging around 50 minutes at 1500 miles. 

```{r total delay w dist normal}
# Organizing data
normal_planes=subset(df_q2, issue_date > as.Date("1992-12-31")) 
normal_planes <- normal_planes %>% select(Distance, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(normal_planes,  id.vars = 'Distance', variable.name = 'Delay')
# Plot Total Individual Delays by Distance for Normal planes
ggplot(data = df_q2a, 
       aes(x=Distance, y=value))+ geom_line(aes(colour = Delay),size=1) + ggtitle("Total Delay with Distance (Normal Planes)") +
  xlab("Distance (miles)") + ylab("Total Delay (mins)") 
```

```{r avg delay w dist normal}
# Organizing data
normal_planes=subset(df_q2, issue_date > as.Date("1992-12-31")) 
normal_planes <- normal_planes %>% select(Distance, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
df_q2a <- melt(normal_planes,  id.vars = 'Distance', variable.name = 'Delay')
# Plot Average Individual Delays by Distance for Normal planes
ggplot(data = df_q2a, 
       aes(x=Distance, y=value))+  stat_summary(aes(y = value,group=1,colour = Delay), fun=mean, geom="line",group=1)+ ggtitle("Average Delay with Distance (Normal Planes)") +
  xlab("Distance (miles)") + ylab("Average Delay (mins)") + theme(legend.position = c(0.88,0.78))
```

Normal planes have lower average values of Carrier Delays of less than 80 minutes.

Hence, it is evident that older planes would suffer more delays by having a larger value of Carrier Delay minutes while flying routes longer than 1500 miles.


Overall, the grouping by Issue Year was clear in showing that older planes do suffer from more delay, but only by up to 8 minutes on average, where LateAircraftDelay, followed by Carrier/NASDelay are the primary causes. <p>
    When utilized for long-haul flights, older planes are also more likely to have delays due to carrier delays that might have resulted from aircraft maintenance or inspection. <p>
        The difference in delay however is quite negligible and it might be due to airlines “padding” and scheduling extra time for flights to prevent flights from being classified as delayed (Kramer, 2019).

```{r remove dfq2}
# To save up space
rm(df_q2)
rm(df_q2a)
rm(counts)
rm(normal_planes)
rm(old_planes)
```

<br>

# Q3. How does the number of people flying between different locations change over time


We will find the most popular routes to gauge the number of people flying between these different locations.

* Distribution by Flight Routes
* Distribution by State

```{r copydata3}
# Create new dataframe
df_q3 <- flight_notcancelled
```


## Distribution by Flight Routes


To begin, the Origin and Destination are combined into a new ‘FlightRoute’ column (e.g., OGG to HNL).

```{r OriginToDest}
# Create new column as its flight route
df_q3$FlightRoute <- paste(df_q3$Origin, "to", df_q3$Dest)
```

```{r valuecounts routes}
# Most popular routes
df_q3 %>% 
  group_by(df_q3$FlightRoute) %>%
  dplyr::summarize(Count=n()) %>%
  arrange(desc(Count))
```


The top five most popular routes are:


1. OGG to HNL
2. HNL to OGG
3. LAX to LAS
4. SAN to LAX
5. LAX to SAN


Out of the 5196 distinct routes, the Top 5 routes with the highest count are identified to examine if the number of flights has changed over the course of the year. 

```{r top5routes}
# Extract the top 5 flight routes
df_q3 <- df_q3[df_q3$FlightRoute %in% c('OGG to HNL', 'HNL to OGG', 'LAX to LAS', 'LAX to SAN','SAN to LAX'), ]
```

```{r barplot flightroute}
# BarPlot
ggplot(df_q3, aes(x= Month)) + geom_bar(aes(fill=as.factor(FlightRoute))) + ggtitle("Barplot of FlightRoute counts")
```


```{r flightroute line}
# FlightRoute Line Plot
ggplot(df_q3, aes(x=Month, group = factor(FlightRoute) , colour=factor(FlightRoute))) + 
  geom_line(stat = 'count',size=1) + ggtitle("Count of flights per Flight Route by Month") +
  xlab("Month") + ylab("Total Flight Counts")
```

For the first five months, the flight routes [OGG to HNL] and [HNL to OGG] had roughly 50 fewer flights than the other three routes. <p>
    However, from June to August, both routes begin to increase in their number of flights, reaching over 100 more flights than the other routes. <p>
        Between September and December, these routes continue to have more flights than the others.


<p>

## Distribution by State

To make the most of the data, the States were queried to justify the number of people flying interstate and intrastate.

```{r replacedata3}
# Create new dataframe
df_q3 <- flight_notcancelled
```

```{r extract state}
# Query state from airport_df into dataframe
df_q3$state <- airport_df$state[match(df_q3$Origin, airport_df$iata)]
```

```{r valuecounts state}
# Most popular states
df_q3 %>% 
  group_by(df_q3$state) %>%
  dplyr::summarize(Count=n()) %>%
  arrange(desc(Count))
```

Top 3 popular states are:

1. California
2. Texas
3. Illinois

```{r top3states}
# Select the top 3 states
df_q3 <- df_q3[df_q3$state %in% c('CA', 'TX', 'IL'), ]
```

```{r state linecount}
# Check top 3 states travel flight counts
ggplot(df_q3, aes(x=Month, group = factor(state) , colour=factor(state))) + 
  geom_line(stat = 'count',size=1) + ggtitle("Count of flights per Flight Route by Month") +
  xlab("Month") + ylab("Total Flight Counts")
```

```{r OriginToDest1}
# Create new column of state flight routes
df_q3$FlightRoute <- paste(df_q3$state, "to", df_q3$state[-1:0])
```

```{r state line}
# Flight Routes of Top 3 states
ggplot(df_q3, aes(x=Month, group = factor(FlightRoute) , colour=factor(FlightRoute))) + 
  geom_line(stat = 'count',size=1) + ggtitle("Count of flights per Flight Route by Month") +
  xlab("Month") + ylab("Total Flight Counts")
```

For both intrastate and interstate, February and September have the fewest flights with around 100 less than other months. As expected, the number of flights increases in the middle of the year, between June and August.
<p>

With two distinct methodologies, it is evident that February has the fewest flights, followed by September. The more popular travelling months are during June to August, which is likely due to the summer holidays in the USA which last 11 weeks from June to August (School Holidays USA, 2022).

```{r removedfq3}
# To save up space
rm(df_q3)
```
<br>

# Q4. Can you detect cascading failures as delays in one airport create delays in others?


Cascading failures occur when a flight delay for one plane in an airport causes a flight delay in another. 

This is explained by the existing variable 'LateAircraftDelay' which describes how a particular flight delayed in its Origin arrives late in its Destination, then affecting the next flight's departure since the same plane was used. The ripple impact of a previous delay at downstream airports hence causes cascading failures.

Since Tail Numbers are identification numbers on aircraft, it will be easier to focus on data with ‘LateAircraftDelay’, then focus on a selected aircraft and observe its flight schedule.

We will approach the question as follow:

* Methodology
* Initial Test
* Secondary Testing

```{r copydata4}
# Create new dataframe
df_q4 <- merged_df
```

```{r valuecounts tailnum}
# Find Date with highest value counts
df_q4 %>% 
  group_by(df_q4$TailNum) %>%
  dplyr::summarize(Count=n()) %>%
  arrange(desc(Count))
```

'0' and '000000' are likely private confidential TailNums, so we ignore those data and focus on the top 2 highest counts of TailNum:

* N478HA
* N308SW
<br>

## Methodology

We will approach the dataset in the following way to effectively illustrate cascading failures, that is, delays in one airport will cause delays in another:

1.	Find Highest ‘TailNum’ count (Higher chance of continuous flights in a day)
2.	Extract data fitting condition of Highest ‘TailNum’ & 'LateAircraftDelay' > 15 minutes
3.	Find Highest ‘Date’ count from extracted data (Higher chance of continuous flights in a day)
4.	Extract overall data fitting conditions of Highest ‘TailNum’ & Highest ‘Date’ count
5.	Sort data by ‘DepTime’ (To see the flight schedule of same aircraft by Departure Timing)

This approach will allow us to assess if a delayed flight in one airport may cause a delay in another. 
To ensure that the data extraction and analysis approaches are valid, two separate tests will be conducted.

<p>

### Initial Test

The initial test used N308SW to identify the highest date counts.
```{r query lateaircrafts}
# First extraction of data with top most TailNum counts
df_q4_1 <- merged_df[merged_df$TailNum %in% c('N308SW') & merged_df$LateAircraftDelay > 15, ]
```

```{r valuecount dates1}
df_q4_1 %>% 
  group_by(df_q4_1$date) %>%
  dplyr::summarize(Count=n()) %>%
  arrange(desc(Count))
```
<br>
We will focus on the date with the highest count to see if there is any relation that leads to cascading failure.


```{r query tailnum & date}
# Extracting dataset that matches our findings
df_q4_1 <- merged_df[merged_df$TailNum %in% c('N308SW') & merged_df$date == ('2006-03-31'), ]
df_q4_1[order(DepTime)][-c(1),]
```

It is clear that one flight's delay in an airport can cause cascading failures in another.

Taking Flight 458 as an example: has to depart at 1835 (6:35pm) but delayed for 52 mins till 1927 (7:27pm), flying from DAL to SAT. Supposed to reach 1935 (7:35pm) but delayed 50 mins and reached 2025 (8:25pm).

This 77 mins delay thereafter caused the next flight from SAT to DAL (FlightNum 178) to have departure delay for 45 minutes (43 mins for LateAircraftDelay) also since its scheduled departure time is at 2005 (8:05pm), but it only took off at 2050 (8:50pm). It also arrived 43 minutes later than expected time of 2100 at 2143.

We will further check on the full data (same date, same TailNum) to see the flights that were not included in the sample that we extracted, so as to make a more conclusive statement.

```{r query1 tailnum & date}
# Matching with full data to confirm assumption
df_q4_1 <- mergeddf[mergeddf$TailNum %in% c('N308SW') & mergeddf$date == ('2006-03-31'), ]
df_q4_1[order(DepTime)][-c(1:8, 10,12,14),] %>% select(10,5:8,29,15:18,30,9,11) #Select important columns
```

Considering Flight 755: there was a 52 minutes ArrDelay in the Destination Airport (LIT). Due to the delay, Flight 458, scheduled to depart at 1700 (5pm), was delayed for 60 mins till 1800 (6pm), flying from LIT to DAL. 

Supposed to reach 1805 (6:05pm) but reached 1900 (7pm). With the 55-min delay, the next flight from DAL to SAT (FlightNum 458) was subsequently delayed for 52 minutes (50-min for LateAircraftDelay) since its scheduled departure time was at 1835 (6:35pm), but it only took off at 1927 (7:27pm). It then landed 50 minutes later, at 2143, than its expected time of 2100. 

Since the DepTime overlapped with its previous flight’s CRSArrTime, the delay persisted and escalated till near midnight of the day, resulting in cascading failure.

With this, we can observe the cascading failure for Carrier WN of same TailNum, which persisted throughout the night at different airports (LIT, DAL, SAT, TUL etc.)
<br>

Since the CRSDepTime overlapped with its previous flight’s ArrTime, the delay persisted and escalated till near midnight of the day, resulting in cascading failures.


<p>

### Secondary Testing

The secondary test used N478HA to identify the highest date counts.

```{r query2 lateaircrafts}
# First extraction of data with top most TailNum counts
df_q4_2 <- merged_df[merged_df$TailNum %in% c('N478HA') & merged_df$LateAircraftDelay > 15, ]
```

```{r valuecount dates2}
df_q4_2 %>% 
  group_by(df_q4_2$date) %>%
  dplyr::summarize(Count=n()) %>%
  arrange(desc(Count))
```
<br>
We will focus on the date with the highest count to see if there is any relation that leads to cascading failure.


```{r query2 tailnum & date}
# Extracting dataset that matches our findings
df_q4_2 <- merged_df[merged_df$TailNum %in% c('N478HA') & merged_df$date == ('2006-03-31'), ]
df_q4_2[order(DepTime)]
```

Taking Flight 179 as an example: has to depart at 1328 (1:28pm) but delayed for 82 mins till 1450 (2:50pm), flying from KOA to OGG. Supposed to reach 1357 (1:57pm) but delayed 87 mins and reached 1524 (3:24pm).

We will further check on the full data (same date, same TailNum) to see the flights that were not included in the sample that we extracted, so as to make a more conclusive statement.

```{r query3 tailnum & date}
# Matching with full data to confirm assumption
df_q4_2 <- mergeddf[mergeddf$TailNum %in% c('N478HA') & mergeddf$date == ('2006-03-31'), ]
df_q4_2[order(DepTime)][-c(1:4),]  %>% select(10,5:8,29,15:18,30,9,11) #Select important columns

```

This 87 mins arrival delay of FlightNum 179 thereafter caused the next flight from OGG to HNL (FlightNum 179) to have departure delay for 79 minutes (89 mins for LateAircraftDelay) also since its scheduled departure time is at 1425 (2:25pm), but it only took off at 1544 (3:44pm). It also arrived 89 minutes later than expected time of 1459 at 1628.
We can also see from the day that one flight delay in an airport will cause cascading failures, contributing to ArrDelay and LateAircraftDelay for flights, hence confirming our observation.


<p>

It is also worth noting the instances where subsequent flights might not use the same TailNum, but they are assumed due to a lack of schedule information. A total of two tests were conducted in R, to demonstrate the same effect of cascading failures where delays at one airport causes delays in another. The previous flight's late arrival, which used the same plane that would be departing, caused the subsequent flight’s delay. As a result, the current flight will depart late, setting off a chain reaction causing passengers at other airports to board the plane much later as well.

```{r remove dfq4}
# To save space
rm(df_q4)
rm(df_q4_1)
rm(df_q4_2)
```
<br>

# Q5. Use the available variables to construct a model that predicts delays.


With the flight data labelled, Supervised Learning algorithms such as Multiple Linear Regression and Random Forest are used to construct Regression and Classification prediction models in R. <p>
    This works by allowing the model to predict the label of new data points based on past data. 

<p>

To predict delays, these few supervised learning models with selected variables are used:

* Multiple Linear Regression
* Random Forest

```{r copydata5}
# Create new dataframe
df_q5 <- flight_notcancelled
```

```{r select var}
# Select variable columns
df_q5 <- df_q5 %>% select(Month, UniqueCarrier, DepTime, CRSDepTime, ArrTime, CRSArrTime, ArrDelay, DepDelay,Distance,TaxiIn,TaxiOut, ActualElapsedTime,AirTime, DelayStatus)
```

Time-related and Factor columns are selected as predictor varaibles for analysis.

```{r changefactors}
# Change selected var as factors
df_q5$UniqueCarrier<- as.factor(df_q5$UniqueCarrier)
df_q5$Month <- as.factor(df_q5$Month)
```


```{r traintestsplit}
# Hold-out Validation method
library(caret)
set.seed(42)

F_sample = createDataPartition(y=df_q5$ArrDelay, p = 0.7, list = F)
train = df_q5[F_sample,]
test = df_q5[-F_sample,]
```

First, ‘CreateDataPartition’ feature will create a train-test split to prevent overfitting, with Trainset accounting for 70%, and Testset for 30%. The model will be trained using the Trainset data, and its performance will be evaluated by predicting with the unseen Testset. For reproduction, random state is set to a random seed 42.

The response variable Y, for the Regression model will comprise of ArrDelay (minutes).

<p>

## Multiple Linear Regression

MLR is a regression model which enables us to understand and estimate relationships between multiple variables.

```{r mlr model}
# Load model
mlr_model <- lm(ArrDelay ~. -DelayStatus, data = train)
summary(mlr_model)
```

The model summary is then used to identify the significant variables affecting ArrDelay, excluding DelayStatus since they are related. 

<p>

The R2 score is 0.931, where predictor variables in the model explained 93.1% of the variation in Y (ArrDelay). RMSE is also considerably fitting at 9.98. 


```{r}
# Predicting delays
delayPred <- predict(mlr_model, test)  
```

```{r}
# Create ActualnPred dataframe
ActualnPred <- data.frame(cbind(Actual=test$ArrDelay, Predicted=round(delayPred)))
ActualnPred
# 96.4% correlation accuracy of similar directional movement
correlation_accuracy <- cor(ActualnPred)
correlation_accuracy
```

The Correlation Accuracy of 96.4% shows how the actual and predicted values have similar directional movements.

```{r rmse}
sqrt(mean((test$ArrDelay - delayPred)^2))
```

The R2 score is 0.9369 and Root Mean Squared Error (RMSE) is 9.676569.
The R2 value shows that the predictor variables in the model are able to explain 93.69% of the variation in ArrDelay.

## Random Forest

Used for Classification and Regression, Random Forest is a Supervised Learning algorithm that constructs many decision trees. <p>
R uses classification model where the output is chosen by a majority vote among decision trees.

```{r}
set.seed(42)
require(caTools)
sample = sample.split(df_q5$DelayStatus, SplitRatio = .70)
train = subset(df_q5, select = -c(ArrDelay), sample == TRUE)
test  = subset(df_q5, select = -c(ArrDelay), sample == FALSE)
train$DelayStatus <- as.character(train$DelayStatus)
train$DelayStatus <- as.factor(train$DelayStatus)
```

Hence, a new train-test set with response variable Y as DelayStatus of similar parameters (30% Test) is created, with the ArrDelay column removed to prevent multicollinearity. 

```{r}
library(randomForest)
rf <- randomForest(DelayStatus ~ ., data=train, ntree=10, random_state =42)
```

```{r}
rf
```

The Out-of-bag error estimate is at 6.9%, where Accuracy = 1 -OOB error, so the model has an accuracy of 93.1%. 

```{r}
importance(rf)        
varImpPlot(rf)   
```

```{r}
delayPred <- predict(rf, test)  
```

```{r}
test$DelayStatus <- as.factor(test$DelayStatus)

# Create ActualnPred dataframe
ActualnPred <- data.frame(cbind(Actual=test$DelayStatus, Predicted=delayPred))
ActualnPred
```



### RF Model without DepDelay

Since DepDelay and ArrDelay have a positive linear relationship which might lead to multicollinearity, we will investigate to see if there are changes to the factors that might affect DelayStatus.

```{r}
rf <- randomForest(DelayStatus ~ .- DepDelay, data=train, ntree=10, random_state =42)
```

```{r}
rf
```


```{r}
importance(rf)        
varImpPlot(rf)   
```
Models with and without DepDelay show the changes in attributing factors causing DelayStatus, where the top 6 factors for both models are similar.

```{r}
delayPred <- predict(rf, test)  
```

```{r}
test$DelayStatus <- as.factor(test$DelayStatus)

# Create ActualnPred dataframe
ActualnPred <- data.frame(cbind(Actual=test$DelayStatus, Predicted=delayPred))
ActualnPred
```

In total, one regression and one classification method were tested out in R. Overall, the best model to predict the continuous ArrDelay would be MLR with 95.7% accuracy. To predict the binary outcome DelayStatus, Random Forest model is around 93.1% accurate.
