Flight Analysis in the U.S.

Table of Contents

  1. Introduction
  2. Import Data & Libraries
  3. Understanding Data
  4. Data Pre-Processing/Cleaning
  5. Exploratory Data Analysis (EDA))
  6. Q1. Best time, day and time of year to minimise flight delays?
  7. Q2. Are older planes more prone to delays?
  8. Q3. Show how the no. of people flying between various locations change with time.
  9. Q4. Can cascading failures where an airport's delay causing delay in another be detected?
  10. Q5. Construct a model predicting delays.

Introduction

Every year, nearly 25% of airline flights are delayed or cancelled, costing travellers over $30 billion in lost time and money. Flight delays have long been a cause of dissatisfaction in the airline industry, as well as a source of annoyance for passengers and carriers.

Our goal is to use the massive amount of airline data to visualise and study the flight patterns and predict if a flight will be delayed. For this study, both Python and R will be used to investigate 2 years’ worth of data, since two full business cycles are adequate in reducing bias for one cycle.

This notebook aims to look at these questions regarding flight travel:

  1. Best time, day and time of year to minimise flight delays?
  2. Are older planes more prone to delays?
  3. Show how the no. of people flying between various locations change with time.
  4. Can cascading failures where an airport's delay causing delay in another be detected?
  5. Construct a model predicting delays.

The dataset can be retrieved from Airlines - Harvard Dataverse

Import Data & Libraries

Understanding Data

There are almost 14.6 million records with 29 variable columns. The columns include the airline and flight details etc. and are mostly time-related (in mins).

Both ArrDelay and DepDelay has a moderately higher correlation with LateAircraftDelay, CarrierDelay and NASDelay.

Data Pre-Processing/Cleaning

Missing data is also handled using linear interpolation to estimate unknown data values between known data values, and duplicates are removed.

CancellationCode is filled with 14.3 million missing values since non-cancelled flights do not have a Cancellation Code.

Out of the 14.6 million records from 2007 to 2008, there were about 282k cancelled flights .

Creating Delay Status

We assume that a delayed flight is equivalent to arriving late for more than 15 minutes at its destination. (ArrDelay > 15 mins)

Since flights can be delayed on its Departure but still arrive on time, hence we do not classify those as a delayed flight.
Hence we create a DelayStatus column into the main dataframe (merged_df), where 0 = No Delay, 1 = Delay.

This shows that 76.52% have no delays (within 15 minutes), where they either arrived early or on time. Also, 23.47% of flights were delayed. Equivalent to about 1 out of every 5 flights being delayed.

Additionally, there was also a 2% increase in delayed flights in 2007 compared to 2006.

Creating sample of entire data

Due to the large dataset requiring more time to execute, we will randomly select 10% of the data for quick analysis.

We then filter the data into Cancelled and Non-cancelled flights.

Checking for outliers

Outliers are not removed since the full data would better represent the delays and our further analysis.

Checking for missing data between Origin & Dest

We can see that there is an unequal number of Origin and Destination Airport Codes.
Hence we will investigate the different airports not listed and whether they have a significant count.

The 5 airports that are not similar in the Origin and Dest data only has 6 flights in total out of the sample of 1.46 million flights, hence it is not significant enough for us to make changes.

Exploratory Data Analysis (EDA)

We will be looking at the different variables to get a better understanding of the data.

  1. Total Flight Distribution
  2. Cancellation
  3. Delayed Flights

Total Flight Distribution

Total Flight Distribution of Full Data by Month

The data is almost evenly distributed between Month and DayOfWeek, with February and Saturday having the least number of total flights.

Total Number of Flights per Airline

The top 5 airlines with the most flights are WN, AA, OO, MQ, UA.

  1. Southwest Airlines
  2. American Airlines
  3. Skywest Airlines
  4. American Eagle Airlines
  5. United Airlines

Distribution of Distance

Cancellation

Cancelled flights are analysed as well to have a better understanding of the data.

The main areas of focus:

Cancellation Reasons

CancellationCodes were previously changed to its category name to better visualise the data. We then compare the results to the sampled data of cancellation flights to see if it is suitable representative of the data.

The sample data can accurately represent the main data.
We can also conclude that Cancellations are mostly due to Carrier, Weather and NAS with 43.72%, 35.25% and 20.96% respectively.

Cancelled Flights Distribution

There were more cancelled flights in 2007 than in 2006.

The top 3 most cancelled flights throughout these 2 years are:

  1. American Eagle Airlines
  2. American Airlines
  3. Skywest Airlines

Month & Day with the most cancelled flights

The flights are mostly likely to be cancelled in December and February, on a Thursday and Friday.

Delayed Flights

Delayed flights are then analysed in these areas:

Relationship between ArrDelay & Depdelay

ArrDelay and DepDelay have a strong positive linear relationship, implying that a Departure Delay will almost certainly result in an Arrival Delay.

No exact relationship but we can infer from the mean that shorter distance flights might have higher chance of delay.

Relationship of Type of Delay & Average ArrDelay

On average, LateAircraftDelay has the highest average ArrDelay, followed by CarrierDelay and NASDelay.
We can refer to the correlation heatmap above that these factors are also the more significant variables causing delays.

Total Sum of Delay (per Airline)

The top 5 Carriers with the highest Total Delay (mins) of Delayed flights are

  1. Southwest Airlines
  2. American Airlines
  3. American Eagle Airlines
  4. United Air Lines
  5. Skywest Airlines

However, it is unsubstantiable to assume that Southwest Airline flights are usually delayed since it has the highest total delayed minutes. Hence we will further investigate the proportion and justify if this pattern was caused by Southwest Airlines having more flights than the others.

Q1. Best time, day and time of year to minimise flight delays?

We will breakdown this question into four parts, where we will find the airline carrier and time period least likely to have delayed flights:

Best Time of the Day

Distribution of Average Delay by Time Period

24 hours in a day will split into 6 different periods with at least 3 to 5-hour intervals since different timings like 5am and 11am are better not generalised together into a single timeframe. The period is split as such:

Time Interval column ‘ArrPeriod’ was created based on ‘ArrTime’.

Early Morning followed by Midnight has the lowest count for number of flights delayed. However, we will proceed on to check if having the lowest number of count will equate to the shortest average delayed time.

The best time period to minimise flight delays would be in the Early Morning from 5am to 9am.

Midnight Flights are more likely to have longer delays due to LateAircraftDelay (which was caused by cascading failure).
Early Morning are also less likely to have any of the delay types.

Best Day of the Week

Distribution of Day Delay

Best Day of the Week to minimize delays is to travel on Saturday, followed by Tuesday, with average ArrDelay timing of less than 6 minutes and 8 minutes, respectively.

Saturdays and Tuesdays are also likely to create a 9-minute delay on average for DepDelay. The longest average delays of 12-13 minutes are expected in the middle of the week, from Thursday to Friday.

Thursday to Friday have a higher average of 6 minutes caused by LateAircraftDelay, as compared to the other days.

Best Month of the Year

Distribution of Monthly Delay

Based on ArrDelay, the best time of year to minimise travel delay is November, then September, with both averaging approximately 6 minutes of delay, as opposed to June and December, with more than twice the number of minutes delayed.

Based on DepDelay, September will have a roughly 1-minute less average delay than November.

Due to the U.S. summer and winter vacation (School Holidays USA, 2022), June and December are projected to be popular months to travel, resulting in increased delays likely caused by LateAircraftDelay.

Best Day of the Month

Distribution of DayofMonth Delay

Also, travelling in the first half of the month, around the 8th – 9th, results in an average ArrDelay of around 6 minutes, DepDelay of less than 9 minutes, and the second half of the month with double the delay amount.

Delayed Flights (%) per Airline

In addition, we evaluate the airline with the highest percentage of delayed flights to complement the optimal period for minimizing delays.

According to our EDA, the total number of delayed flights was 23.47%, thus a threshold line was drawn at that point. Any airlines that exceed that line will have a higher chance of experiencing flight delays.

By comparing the number of flights per airline, we see that Southwest Airlines (WN) and SkyWest Airlines (OO) are in the Top 5 for Total Flights, yet their delayed flights are well below the threshold.

WN is the third least delayed airline, despite having the Highest Number of Flights and the Highest Total Delay (minutes), with about 5% below the threshold value, hence a reliable airline.

OO was also in the Top 5 for both Highest Number of Flights and Highest Total Delay (mins), but its delay percentage is 21.11% which is also below the threshold value.

As a result, airlines falling below the threshold delay value are more reliable and have a lower chance of experiencing flight delays.

Hence overall, the top 2 recommended time period to avoid flight delay is:

All the above periods are delayed by an average of 0 to 6 minutes. Passengers may reduce their flight delays even further by booking flights with airlines that are below the delay threshold (%).

Q2. Are older planes more prone to delays?

The issue date of planes will be extracted from the planes_df data since we can judge the age of the aircraft and make our analysis based on that.

To determine the age of the aircrafts, the engine's issue date is obtained by mapping the plane-data.csv consisting of the plane's ‘issue_date’, into a duplicated data frame of the 'flight_notcancelled' data as a new column.

Since our dataset is huge with more than 2 million records, we will proceed to clean the 352k records of missing issue_date.

Distribution of Arr/Dep Delay by Issue Date

Total Sum of Delay by Issue Date

From the graph, we can see three peak points around year 1986, 2000 and 2004. However, we checked with the value counts and discovered that these dates are in fact those with the highest issue_date counts. Hence, we proceed to check with the mean value.

Average Delay with Issue Date

Based on Issue Date, there seem to be more delay occurrences for planes issued after 1998. Also, the low total minutes delayed between 1976 and 1984 might be attributed to planes older than 24 years being removed owing to maintenance faults, or the airline scheduling them with ample time to avoid delays.

Distribution of Arr/Dep Delay by Issue Year

A new column of issue year and plane age is then extracted from the issue date.

The increasing trend of average delays as Aircraft Age increases becomes clear. Older planes of 25 years and above have a higher average delay of more than 11 minutes, compared to those from before 25 with under 11 minutes of average delay.

Older planes are also more likely to have more delays due to individual average delay factors, such as LateAircraftDelay, which has an average delay of roughly 10 minutes. Those below 20 years of plane age have less than 6 minutes of average delays.

Further investigations will then be made with Issue Date rather than its Year, as delays became increasingly prevalent after year 2000.

Comparison of Old and Normal Planes

The average age of U.S planes is 11 years, with about 25% of planes above 15 years old (Mayerowitz, 2011).

Hence, we will deem an aircraft to be old when in operation for 15+ years, where data will split into two, with planes issued before 1993 termed old, and those issued 1993 onwards termed normal.

Distribution of Individual Delays

Average Delay by Issue Date per Delay Type

The full data shows Carrier Delay being prominent with planes issued 1990-1992 and 2000-2004.

Old Planes issued around 1990 are more likely to have delays due CarrierDelay, which includes maintainence etc.

Normal Planes are more likely to have CarrierDelays.

Distance Distribution of Old & Normal Planes

Average Delay by Distance

Distance is then used to check the delays that Old and Normal Planes will have with respect to the number of miles that they have travelled.

As observed, Old planes issued before 1993 had a greater average distance travelled, ranging from 650 to 900 miles. As a result, we can assume that older planes are typically employed for long-haul trips.

Older planes are more likely than normal planes to have larger Average Delays as the Distance grows, with Average Arrival Delays surpassing 100 minutes or more. This further confirms that older planes tend to suffer from more delays when on a long-haul flight. Further investigations are done with Distance to identify the different delay factors that may impact old and normal planes.

When comparing Distance, older planes have more occurrences of having a higher average carrier delay of above 50 minutes for flights around 1500 miles and above 2000 miles.

Normal planes have lower average values of Carrier Delays of less than 80 minutes.

Hence, it is evident that older planes would suffer more delays by having a larger value of Carrier Delay minutes while flying routes longer than 1500 miles.

Lastly, the plane’s age, which ranged from 0 to 31, was correlated against other variables to identify possible related factors. There is a positive linear relationship with most of the variables, but they are all small values that are not explanatory enough to investigate.

Overall, the grouping by Issue Year was clear in showing that older planes do suffer from more delay, but only by up to 8 minutes on average, where LateAircraftDelay, followed by Carrier/NASDelay are the primary causes.

When utilized for long-haul flights, older planes are also more likely to have delays due to carrier delays that might have resulted from aircraft maintenance or inspection.

The difference in delay however is quite negligible and it might be due to airlines “padding” and scheduling extra time for flights to prevent flights from being classified as delayed (Kramer, 2019).

Q3. Show how the no. of people flying between various locations change with time.

We will find the most popular routes to gauge the number of people flying between these different locations.

Distribution by Flight Routes

To begin, the Origin and Destination are combined into a new ‘FlightRoute’ column (e.g., OGG to HNL).

The top five most popular routes are:

  1. OGG to HNL
  2. HNL to OGG
  3. LAX to LAS
  4. SAN to LAX
  5. LAX to SAN

Out of the 5196 distinct routes, the Top 5 routes with the highest count are identified to examine if the number of flights has changed over the course of the year.

For the first five months, the flight routes [OGG to HNL] and [HNL to OGG] had roughly 50 fewer flights than the other three routes.

However, from June to August, both routes begin to increase in their number of flights, reaching over 100 more flights than the other routes.

Between September and December, these routes continue to have more flights than the others.

Distribution by State

To make the most of the data, the States were queried to justify the number of people flying interstate and intrastate.

Top 3 popular states are:

  1. California
  2. Texas
  3. Illinois

For both intrastate and interstate, February and September have the fewest flights with around 100 less than other months. As expected, the number of flights increases in the middle of the year, between June and August.

With two distinct methodologies, it is evident that February has the fewest flights, followed by September. The more popular travelling months are during June to August, which is likely due to the summer holidays in the USA which last 11 weeks from June to August (School Holidays USA, 2022).

Q4. Can cascading failures where an airport's delay causing delay in another be detected?

Cascading failures occur when a flight delay for one plane in an airport causes a flight delay in another.

This is explained by the existing variable 'LateAircraftDelay' which describes how a particular flight delayed in its Origin arrives late in its Destination, then affecting the next flight's departure since the same plane was used. The ripple impact of a previous delay at downstream airports hence causes cascading failures.

Since Tail Numbers are identification numbers on aircraft, it will be easier to focus on data with ‘LateAircraftDelay’, then focus on a selected aircraft and observe its flight schedule.

We will approach the question as follow:

'0' and '000000' are likely private confidential TailNums, so we ignore those data and focus on the top 2 highest counts of TailNum:

Methodology

We will approach the dataset in the following way to effectively illustrate cascading failures, that is, delays in one airport will cause delays in another:

  1. Find Highest ‘TailNum’ count (Higher chance of continuous flights in a day)
  2. Extract data fitting condition of Highest ‘TailNum’ & 'LateAircraftDelay' > 15 minutes
  3. Find Highest ‘Date’ count from extracted data (Higher chance of continuous flights in a day)
  4. Extract overall data fitting conditions of Highest ‘TailNum’ & Highest ‘Date’ count
  5. Sort data by ‘DepTime’ (To see the flight schedule of same aircraft by Departure Timing)

This approach will allow us to assess if a delayed flight in one airport may cause a delay in another.
To ensure that the data extraction and analysis approaches are valid, two separate tests will be conducted.

Initial Test

The initial test used N308SW to identify the highest date counts.

We will focus on the date with the highest count to see if there is any relation that leads to cascading failure.

There are 2 different carriers (WN and OO) using the plane with same TailNum at different locations, however we will focus on WN since LateAircraftDelay was detected.

It is clear that one flight's delay in an airport can cause cascading failures in another.
We can observe from FlightNum 2662 that there was a 14 minutes ArrDelay in the Destination Airport (OAK). This led to the following FlightNum 1717 that was supposed to also depart from OAK to ONT to be delayed for 22 minutes (14 being LateAircraftDelay).

Then the delay for this flight also caused the next FlightNum 1067 from ONT to SMF to be delayed for 33 minutes. The delay persisted till the end of day near midnight.

Taking Flight 1417 as an example: has to depart at 1315 (1:15pm) but delayed for 65 mins till 1420 (2:20pm), flying from SMF to ONT. Supposed to reach 1430 (2:30pm) but reached 1535 (3:25pm).

This one-hour delay thereafter caused the next flight from ONT to SJC (FlightNum 936) to be delayed for 75 minutes (60 mins for LateAircraftDelay) also since its scheduled departure time is at 1455 (2:55pm), but it only took off at 1610 (4:10pm). It also arrived 69 minutes later than expected time of 1610 at 1719.

Since the CRSDepTime overlapped with its previous flight’s ArrTime, the delay persisted and escalated till near midnight of the day, resulting in cascading failures.

Secondary Testing

The secondary test used N478HA to identify the highest date counts.

We will focus on the date with the highest count to see if there is any relation that leads to cascading failure.

Taking FlightNum 178 as an example: has to depart at 1215 (12:pm) but delayed for 71 mins till 1326 (1:26pm), flying from HNL to KOA.

Supposed to reach 1258 (12:58pm) but reached 80 minutes later 1418 (2:18pm).

This 80 mins delay thereafter caused the next flight from KOA to OGG (FlightNum179) to be delayed for 82 minutes also since its scheduled departure time is at 1328 (1:28pm), but it only took off at 1450 (2:50pm). It also arrived 87 minutes later than expected time of 1357 at 1524.

Hence, a 38-min ArrDelay from FlightNum 315 triggered a snowball effect, causing all subsequent flights until FlightNum 307 to have ArrDelay ranging from 38 minutes to 89 minutes, with the majority attributable to LateAircraftDelay.

It is also worth noting the instances where subsequent flights might not use the same TailNum, but they are assumed due to a lack of schedule information. A total of two tests were conducted in Python, to demonstrate the same effect of cascading failures where delays at one airport causes delays in another. The previous flight's late arrival, which used the same plane that would be departing, caused the subsequent flight’s delay. As a result, the current flight will depart late, setting off a chain reaction causing passengers at other airports to board the plane much later as well.

Q5. Construct a model predicting delays.

With the flight data labelled, Supervised Learning algorithms such as Multiple Linear Regression and Random Forest are used to construct Regression and Classification prediction models in Python.

This works by allowing the model to predict the label of new data points based on past data.

To predict delays, these few supervised learning models with selected variables are used:

First, new X and Y features will be created, with Y as the response variable to be predicted and X as the independent predictor variable comprising of remaining relevant factors. X will contain 11 columns of predictor variables (Time-related & Factor columns e.g., ArrTime, DepTime, TaxiIn and TaxiOut etc.). For Y, the Regression model will comprise of ArrDelay (minutes).

Then, a train-test split is done to prevent overfitting, with Trainset accounting for 70%, and Testset for 30%. The model will be trained using the Trainset data, and its performance will be evaluated by predicting with the unseen Testset. For reproduction, random_state is set to a random number 42.

Multiple Linear Regression

MLR is a regression model which enables us to understand and estimate relationships between multiple variables.

With Y: ArrDelay, the Linear Regression model is loaded first, then fitted to the training data.
Next, the model is tested against unseen X_test data to get predictions.

The R2 score is 0.957, where predictor variables in the model explained 95.7% of the variation in Y.
Root Mean Square Error (RMSE) representing the standard deviation of prediction errors is considerably fitting with 7.88.

Then, the model is predicted on the feature X to get the full prediction set.

Random Forest

Used for Classification and Regression, Random Forest is a Supervised Learning algorithm that constructs many decision trees.
For this, the Regression method will be used, with Y being continuous (ArrDelay), giving an output of the mean prediction of decision trees.

The Random Forest Regression model is first loaded then fitted to the Trainset. Next, the model is tested against unseen X_test data for predictions.

The R2 score is 0.962, where predictor variables in the model explains 96.2% of the variation in Y.
RMSE is considerably fitting with 7.38.

Then, the model is predicted on the feature X to get the full prediction set, resulting in the above predicted output.

Logistic Regression

Logistic Regression is a Classification model that predicts a binary outcome (DelayStatus).

With an R2 value of 0.98, predictor variables can explain 98% of the variation in Y (DelayStatus).

In total, two regression and one classification methods were tested out in Python. Overall, the best model to predict the continuous ArrDelay would be Random Forest with 96.2% accuracy, followed by MLR with 95.7% accuracy. To predict the binary outcome DelayStatus, Logistic Regression model is around 98% accurate.