Comfortable reading and full navigation (recommended)

Web report (HTML):
https://crgalleguillos.github.io/flight-price-prediction/

Note: GitHub may not reliably support internal navigation inside .ipynb files.
The web report is intended as the primary reading version.

Flight Price Prediction (Economy & Business)¶

Author: Cristián Andrés Galleguillos Vega
Objective: Build and evaluate machine learning models to predict flight prices in Economy and Business segments.

Notebook map¶

  1. Context and objective
  2. Data and preparation
  3. EDA (univariate and bivariate)
  4. Feature engineering
  5. Modeling and validation
  6. Results
  7. Conclusions and next steps
  8. Credits and technologies

1. Context and objective¶

Flight price estimation is a key task for decision-making in pricing strategies and commercial planning.
In this project, we address flight price prediction for two relevant market segments: Economy and Business.

Main objective:
Train and compare machine learning models capable of predicting the final flight price, quantifying prediction error and evaluating performance stability.

Success criteria (KPIs):

  • Low MAE (higher accuracy)
  • RMSE to measure error magnitude
  • High R² (greater explained variance)
  • Stability between cross-validation and test results (generalization)

Business value:
This analysis helps identify which variables drive flight prices and supports dynamic pricing strategies and market monitoring.

2. Data and preparation¶

Data sources:
Files business.xlsx and economy.xlsx provided as part of the challenge.

Structure:
Each file contains detailed flight information (origin, destination, airline, duration, stops, etc.) and the target variable price.

Steps performed:

  1. Review of missing values and duplicates
  2. Detection and correction of outliers using IQR capping
  3. Conversion of duration into total minutes
  4. Encoding of categorical variables (One-Hot Encoding)
  5. Train/test split using random_state=123

3. Exploratory Data Analysis (EDA)¶

In this stage, we perform an exploratory data analysis (EDA) to understand the structure, distributions, and relationships between variables.

EDA allows us to detect outliers, correlations, and differences between the Economy and Business segments before applying predictive models.

EDA objectives:

  1. Assess data quality (missing values, duplicates, outliers)
  2. Explore numerical variable distributions
  3. Analyze relationships between relevant variables and price
  4. Identify segment-specific patterns

3.1 Initial dataset review¶

In [35]:
import os
print(os.listdir('../data'))
['.gitkeep', 'business.xlsx', 'economy.xlsx', 'get_data.py']
In [36]:
# MAIN LIBRARIES

# Data analysis and manipulation
import pandas as pd
import numpy as np
import re

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling and evaluation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# General configuration
import warnings
warnings.filterwarnings('ignore')
In [37]:
# Load data

econ = pd.read_excel('../data/economy.xlsx')
bus  = pd.read_excel('../data/business.xlsx')

# Display basic structure
print("Economy:", econ.shape, "rows x columns")
print("Business:", bus.shape, "rows x columns")

# First rows
display(econ.head(5))
display(bus.head(5))

# Data types
print("\nData types – Economy")
display(econ.dtypes)
Economy: (206774, 11) rows x columns
Business: (93487, 11) rows x columns
date airline ch_code num_code dep_time from time_taken stop arr_time to price
0 2022-02-11 SpiceJet SG 8709 18:55:00 Delhi 02h 10m non-stop 21:05:00 Mumbai 5953
1 2022-02-11 SpiceJet SG 8157 06:20:00 Delhi 02h 20m non-stop 08:40:00 Mumbai 5953
2 2022-02-11 AirAsia I5 764 04:25:00 Delhi 02h 10m non-stop 06:35:00 Mumbai 5956
3 2022-02-11 Vistara UK 995 10:20:00 Delhi 02h 15m non-stop 12:35:00 Mumbai 5955
4 2022-02-11 Vistara UK 963 08:50:00 Delhi 02h 20m non-stop 11:10:00 Mumbai 5955
date airline ch_code num_code dep_time from time_taken stop arr_time to price
0 2022-02-11 Air India AI 868 18:00:00 Delhi 02h 00m non-stop 20:00:00 Mumbai 25612
1 2022-02-11 Air India AI 624 19:00:00 Delhi 02h 15m non-stop 21:15:00 Mumbai 25612
2 2022-02-11 Air India AI 531 20:00:00 Delhi 24h 45m 1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t... 20:45:00 Mumbai 42220
3 2022-02-11 Air India AI 839 21:25:00 Delhi 26h 30m 1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t... 23:55:00 Mumbai 44450
4 2022-02-11 Air India AI 544 17:15:00 Delhi 06h 40m 1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t... 23:55:00 Mumbai 46690
Data types – Economy
date          datetime64[ns]
airline               object
ch_code               object
num_code               int64
dep_time              object
from                  object
time_taken            object
stop                  object
arr_time              object
to                    object
price                  int64
dtype: object

Initial observations¶

  • Both datasets (Economy and Business) share the same structure:
    date, airline, ch_code, num_code, dep_time, from, time_taken, stop, arr_time, to, price
  • Columns such as date, dep_time, arr_time, and time_taken are stored as text or mixed formats and must be converted into datetime or numeric values.
  • Variables from and to represent origin and destination cities; they will be renamed to origin and destination (from is a Python reserved word, and the new names improve readability).
  • price is the target variable, stored as an integer (int64), confirming this is a supervised regression problem.
  • num_code is a flight identifier and likely provides no predictive value.

Data quality: missing values and duplicates¶

In [38]:
# Percentage of missing values
nulls_econ = econ.isna().mean() * 100
nulls_bus = bus.isna().mean() * 100

print("Missing values in Economy:")
display(nulls_econ[nulls_econ > 0].sort_values(ascending=False))

print("\nMissing values in Business:")
display(nulls_bus[nulls_bus > 0].sort_values(ascending=False))

# Duplicates
dup_econ = econ.duplicated().sum()
dup_bus = bus.duplicated().sum()

print(f"\nDuplicates – Economy: {dup_econ} | Business: {dup_bus}")
Missing values in Economy:
Series([], dtype: float64)
Missing values in Business:
Series([], dtype: float64)
Duplicates – Economy: 2 | Business: 0
In [39]:
# Remove duplicates from the Economy dataset
econ = econ.drop_duplicates().reset_index(drop=True)
print(f"Economy after removing duplicates: {econ.shape}")
Economy after removing duplicates: (206772, 11)

Note¶

Results:

  • No missing values were found in either Economy or Business datasets.
  • 2 duplicate records were detected in the Economy dataset and 0 duplicates in Business.
  • The final Economy dataset was reduced from 206,774 to 206,772 rows, while Business remained at 93,487 rows.

Action taken:

  • Duplicate records in Economy were removed to preserve data integrity.

Conclusion:

  • The datasets show excellent initial quality, with no missing values or major inconsistencies.
  • This allows us to proceed directly to exploratory analysis and feature transformation without the need for imputation.

3.2 Distribution of numerical variables¶

We analyze the distribution of key numerical variables:

  • price, to assess dispersion and skewness across segments
  • duration_mins, to understand the distribution of flight durations in minutes

This helps identify biases, long tails, and extreme values prior to outlier treatment.

Data preparation and cleaning¶

In [40]:
# 1) Ensure `price` is numeric (handles comma formats like "1,14,434", spaces, or other symbols)
def clean_price(series: pd.Series) -> pd.Series:
    s = series.astype(str).str.strip()
    # Remove all non-numeric characters (supports "Indian-style" comma formats)
    s = s.str.replace(r'[^0-9]', '', regex=True)
    # Empty strings -> NA; use Int64 to tolerate NA during cleaning
    s = s.replace('', pd.NA).astype('Int64')
    return s

econ['price'] = clean_price(econ['price'])
bus['price']  = clean_price(bus['price'])

# Remove rows with NA price
before_e, before_b = len(econ), len(bus)
econ = econ.dropna(subset=['price']).reset_index(drop=True)
bus  = bus.dropna(subset=['price']).reset_index(drop=True)
print(f"[Economy] rows: {before_e} -> {len(econ)} (after normalizing 'price')")
print(f"[Business] rows: {before_b} -> {len(bus)} (after normalizing 'price')")

# Final cast to int (no NA values remaining)
econ['price'] = econ['price'].astype(int)
bus['price']  = bus['price'].astype(int)

# 2) Convert `time_taken` -> total minutes in `duration_mins`
def to_minutes(duration_str):
    if isinstance(duration_str, str):
        hours = re.findall(r'(\d+)h', duration_str)
        mins  = re.findall(r'(\d+)m', duration_str)
        h = int(hours[0]) if hours else 0
        m = int(mins[0])  if mins  else 0
        return h * 60 + m
    return pd.NA

econ['duration_mins'] = econ['time_taken'].apply(to_minutes).astype('Int64')
bus['duration_mins']  = bus['time_taken'].apply(to_minutes).astype('Int64')

print("Types (price, duration_mins) →",
      econ['price'].dtype, econ['duration_mins'].dtype,
      "|", bus['price'].dtype, bus['duration_mins'].dtype)
[Economy] rows: 206772 -> 206772 (after normalizing 'price')
[Business] rows: 93487 -> 93487 (after normalizing 'price')
Types (price, duration_mins) → int32 Int64 | int32 Int64

Price histograms by segment¶

In [41]:
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

sns.histplot(econ['price'].dropna(), bins=50, kde=True, ax=axes[0])
axes[0].set_title('Price distribution – Economy')
axes[0].set_xlabel('Price')

sns.histplot(bus['price'].dropna(), bins=50, kde=True, ax=axes[1])
axes[1].set_title('Price distribution – Business')
axes[1].set_xlabel('Price')

plt.tight_layout()
plt.show()
[Histogram – Price distribution, Economy vs Business]

Duration_mins histograms by segment¶

In [42]:
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

sns.histplot(econ['duration_mins'].dropna(), bins=50, kde=True, ax=axes[0])
axes[0].set_title('Flight duration (min) – Economy')
axes[0].set_xlabel('Duration (min)')

sns.histplot(bus['duration_mins'].dropna(), bins=50, kde=True, ax=axes[1])
axes[1].set_title('Flight duration (min) – Business')
axes[1].set_xlabel('Duration (min)')

plt.tight_layout()
plt.show()
[Histogram – Flight duration (min), Economy vs Business]

Boxplots for visual outlier detection¶

In [43]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.boxplot(y=econ['price'].dropna(), ax=axes[0])
axes[0].set_title('Outliers in price – Economy')

sns.boxplot(y=bus['price'].dropna(), ax=axes[1])
axes[1].set_title('Outliers in price – Business')

plt.tight_layout()
plt.show()
[Boxplot – Price outliers, Economy vs Business]

Observations from numerical analysis¶

Price distribution

  • In the Economy segment, prices are mainly concentrated between 3,000 and 10,000 units, with a long right tail indicating the presence of significantly more expensive flights.

  • Several local peaks are observed, likely associated with specific airlines or route types.

  • In the Business segment, the distribution is wider and centered around 50,000–60,000 units.

  • It also exhibits positive skewness with much greater dispersion, reflecting heterogeneous pricing based on service level or flight class.

  • Both histograms confirm that price does not follow a normal distribution, making scaling or outlier capping necessary before modeling.

Flight duration (duration_mins)

  • In both segments, most flights fall between 100 and 2,000 minutes (approximately 1.5 to 33 hours).
  • Long right tails indicate routes with multiple stops or anomalous records, especially in Economy.

Boxplots

  • Economy prices cluster below 10,000 units, with many high outliers above 30,000.
  • Business prices show a median around 55,000 units, with extreme values reaching up to 120,000.

Conclusion:
Both price and duration_mins show skewed distributions and the presence of outliers.
This is common in flight data and does not indicate data errors, but it does require controlled statistical treatment.

In the next step, IQR capping (winsorization) will be applied to mitigate the impact of extreme values without removing observations.

3.3 Outlier treatment (IQR capping)¶

Objective¶

The outliers detected in price and duration_mins may negatively affect the performance of Machine Learning models, especially in regression tasks.
We apply the Interquartile Range (IQR) method to limit extreme values within a statistically reasonable range, without removing observations.

In [44]:
# IQR-based winsorization function

def iqr_capping(df, column, multiplier=1.5):
    """
    Applies IQR-based winsorization.
    Caps values below (Q1 - 1.5*IQR)
    and above (Q3 + 1.5*IQR).
    Returns the modified DataFrame and the limits.
    """
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - multiplier * IQR
    upper = Q3 + multiplier * IQR
    capped = df[column].clip(lower, upper)
    print(f"{column}: IQR range [{lower:.2f}, {upper:.2f}]")
    print(f"Capped values: {(df[column] != capped).sum()}")
    df[column] = capped
    return df, lower, upper

# Apply to both variables in both segments
econ, l1, u1 = iqr_capping(econ, 'price')
bus,  l2, u2 = iqr_capping(bus,  'price')
econ, l3, u3 = iqr_capping(econ, 'duration_mins')
bus,  l4, u4 = iqr_capping(bus,  'duration_mins')
price: IQR range [-1186.50, 13105.50]
Capped values: 13750
price: IQR range [22368.50, 83212.50]
Capped values: 3650
duration_mins: IQR range [-435.00, 1725.00]
Capped values: 2652
duration_mins: IQR range [-280.00, 1880.00]
Capped values: 90

Outlier treatment (IQR capping)¶

Results¶

  • Economy – price: IQR range applied [-1,186.5, 13,105.5]; 13,750 values capped (≈ 6.6%).
  • Business – price: IQR range applied [22,368.5, 83,212.5]; 3,650 values capped (≈ 3.9%).
  • Economy – duration_mins: IQR range applied [-435, 1,725] min; 2,652 records adjusted.
  • Business – duration_mins: IQR range applied [-280, 1,880] min; 90 records adjusted.

Note: the negative lower bounds have no practical effect; clip() only caps values that fall outside the range, and all observed values are positive (> 0), so no lower-end capping actually occurs.
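A minimal check illustrates this behavior of clip() (illustrative values only, not part of the pipeline):

```python
import pandas as pd

s = pd.Series([5953, 5956, 42220])
# The lower bound sits below min(s), so only the upper bound takes effect
print(s.clip(lower=-1186.5, upper=13105.5).tolist())  # [5953.0, 5956.0, 13105.5]
```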

Visualization before and after¶

In [45]:
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

sns.boxplot(y=econ['price'], ax=axes[0,0], color='skyblue')
axes[0,0].set_title('Economy – price (after IQR capping)')

sns.boxplot(y=bus['price'], ax=axes[0,1], color='salmon')
axes[0,1].set_title('Business – price (after IQR capping)')

sns.boxplot(y=econ['duration_mins'], ax=axes[1,0], color='steelblue')
axes[1,0].set_title('Economy – duration_mins (after IQR capping)')

sns.boxplot(y=bus['duration_mins'], ax=axes[1,1], color='coral')
axes[1,1].set_title('Business – duration_mins (after IQR capping)')

plt.tight_layout()
plt.show()
[Boxplots – price and duration_mins after IQR capping, by segment]

Boxplot interpretation¶

  • After capping, the distributions become more compact and symmetric, with no visible outliers.
  • Economy (price) preserves a median around ~6,000, indicating that adjustments mainly affected marginal values.
  • Business (price) maintains a median around ~55,000, with controlled dispersion after removing extreme values (> 100,000).
  • duration_mins is mainly stabilized between ~300 and 1,500 minutes, consistent with domestic and international routes.

Conclusion¶

  • IQR capping reduced the impact of extreme values without removing observations, preserving dataset integrity.
  • Variables are now more robust for modeling (lower sensitivity to extremes in regression/boosting).
  • price and duration_mins are ready for bivariate analysis and the modeling stage.

3.4 Bivariate analysis¶

Objective¶

Explore the relationships between price (target variable) and other dataset variables to identify patterns and factors influencing flight prices.

The following are analyzed:

  • Numerical relationships (duration_mins vs price) using correlation and scatter plots.
  • Categorical relationships (stop, airline, from, to) using comparative boxplots.

Correlation between numerical variables¶

In [46]:
# Calculate correlations for numerical variables
corr_econ = econ[['price', 'duration_mins']].corr()
corr_bus  = bus[['price', 'duration_mins']].corr()

print("Correlation – Economy:")
display(corr_econ)

print("Correlation – Business:")
display(corr_bus)
Correlation – Economy:
price duration_mins
price 1.000000 0.313434
duration_mins 0.313434 1.000000
Correlation – Business:
price duration_mins
price 1.000000 0.243108
duration_mins 0.243108 1.000000

Correlation between numerical variables¶

| Segment | Correlation price – duration_mins |
|----------|-----------------------------------|
| Economy | 0.31 |
| Business | 0.24 |

Observations¶

  • In both segments, a positive correlation is observed: longer flights tend to be more expensive.
  • The relationship strength is moderate in Economy (0.31) and weaker in Business (0.24).
  • This suggests that in Business, price also depends on other factors (service level, airline, class, schedules), while in Economy, flight duration carries slightly more weight.
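
As a complement to the Pearson coefficients above, a rank-based correlation could also be checked; it is less sensitive to outliers and to monotonic non-linearity. A sketch (not run in this notebook):

```python
# Spearman rank correlation between price and duration, per segment
cols = ['price', 'duration_mins']
print(econ[cols].astype(float).corr(method='spearman'))
print(bus[cols].astype(float).corr(method='spearman'))
```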

Visualization: price vs duration_mins¶

In [18]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.scatterplot(data=econ, x='duration_mins', y='price', alpha=0.3, s=20, ax=axes[0], color='steelblue')
axes[0].set_title('Economy – price vs duration_mins')
axes[0].set_xlabel('Duration (min)')
axes[0].set_ylabel('Price')

sns.scatterplot(data=bus, x='duration_mins', y='price', alpha=0.3, s=20, ax=axes[1], color='salmon')
axes[1].set_title('Business – price vs duration_mins')
axes[1].set_xlabel('Duration (min)')
axes[1].set_ylabel('Price')

plt.tight_layout()
plt.show()

Observation: Relationship between price and duration (duration_mins)¶

[Scatter plot – Price vs Duration]

Observations¶

  • In Economy, a slightly upward trend is observed: flights longer than 1,000 minutes tend to concentrate higher prices, although dispersion increases.
  • In Business, the point cloud is more scattered, confirming that price does not depend solely on duration, but also on airline and flight type.
  • Both plots show a non-linear relationship, suggesting that logarithmic or polynomial transformations could be useful in predictive models (a quick sketch follows).
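
For illustration, a log transform of the target could be tested as follows. This is a sketch, not part of the notebook's pipeline; it assumes the econ DataFrame defined above and some fitted regressor `model`:

```python
import numpy as np

# Fit on log(1 + price) to compress the long right tail
y_log = np.log1p(econ['price'].astype(float))
# ...train a regressor on (X, y_log), then invert predictions:
# y_pred = np.expm1(model.predict(X_test))
```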

Cleaning the stop variable¶

In [19]:
def normalize_stop(s):
    """Map raw stop descriptions to 'non-stop', '1-stop', or '2+-stops'."""
    if pd.isna(s):
        return pd.NA
    t = str(s).lower()
    t = re.sub(r'\s+', ' ', t).strip()  # collapse line breaks, tabs, extra spaces
    t = t.replace('non stop', 'non-stop').replace('non-stop', '0-stop')
    m = re.search(r'(\d+)\s*[- ]*\s*stop', t)  # e.g. "1-stop", "2 stops"
    if '0-stop' in t:
        n = 0
    elif m:
        n = int(m.group(1))
    else:
        mplus = re.search(r'(\d+)\s*\+\s*stop', t)  # e.g. "2+ stops" means at least n+1
        if mplus:
            n = int(mplus.group(1)) + 1
        else:
            return pd.NA  # unparseable values become missing
    if n == 0:
        return 'non-stop'
    elif n == 1:
        return '1-stop'
    else:
        return '2+-stops'

econ['stop_clean'] = econ['stop'].apply(normalize_stop)
bus['stop_clean']  = bus['stop'].apply(normalize_stop)

print(econ['stop_clean'].value_counts())
print(bus['stop_clean'].value_counts())
stop_clean
1-stop      166625
non-stop     27942
Name: count, dtype: int64
stop_clean
1-stop      84302
non-stop     8102
Name: count, dtype: int64

Price by number of stops (clean boxplots)¶

In [20]:
order = ['non-stop', '1-stop', '2+-stops']

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.boxplot(
    x='stop_clean', y='price',
    data=econ.dropna(subset=['stop_clean']),
    order=order, ax=axes[0]
)
axes[0].set_title('Economy – price by number of stops')
axes[0].set_xlabel('Stops')
axes[0].set_ylabel('Price')

sns.boxplot(
    x='stop_clean', y='price',
    data=bus.dropna(subset=['stop_clean']),
    order=order, ax=axes[1]
)

axes[1].set_title('Business – price by number of stops')
axes[1].set_xlabel('Stops')
axes[1].set_ylabel('Price')

plt.tight_layout()
plt.show()

Observation: Price by number of stops (stop_clean)¶

| Segment | 1-stop | non-stop |
|----------|---------|----------|
| Economy | 166,625 | 27,942 |
| Business | 84,302 | 8,102 |

[Boxplot – Price vs Stops]

Observations¶

  • In both segments, non-stop flights show higher median prices and lower dispersion.
  • In Economy, the difference between non-stop and 1-stop flights is very pronounced, indicating that direct flights are usually more expensive.
  • In Business, the difference exists but with greater internal variability, likely due to different service levels and schedules.
  • The 2+-stops category does not appear visually in the boxplots, indicating that such flights are extremely rare or nonexistent in the sample.

Average price by airline¶

In [21]:
# Top 10 airlines by average price (segment-wise)
mean_price_econ = (
    econ.groupby('airline', dropna=False)['price']
        .mean()
        .sort_values(ascending=False)
        .head(10)
)
mean_price_bus = (
    bus.groupby('airline', dropna=False)['price']
       .mean()
       .sort_values(ascending=False)
       .head(10)
)

# Quick check
display(mean_price_econ.to_frame('Average price – Economy'))
display(mean_price_bus.to_frame('Average price – Business'))
Average price – Economy
airline
Vistara 7484.399829
Air India 7001.725559
SpiceJet 6046.644601
GO FIRST 5587.036027
Indigo 5206.398098
StarAir 4860.909836
AirAsia 4007.103305
Trujet 3244.634146
Average price – Business
airline
Vistara 55373.605324
Air India 47195.887622
In [22]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.barplot(
    x=mean_price_econ.values, 
    y=mean_price_econ.index, 
    hue=mean_price_econ.index, 
    palette='Blues_r', 
    legend=False, 
    ax=axes[0]
)
axes[0].set_title('Economy – Top 10 airlines by average price')
axes[0].set_xlabel('Average price')
axes[0].set_ylabel('Airline')

sns.barplot(
    x=mean_price_bus.values, 
    y=mean_price_bus.index, 
    hue=mean_price_bus.index, 
    palette='Reds_r', 
    legend=False, 
    ax=axes[1]
)
axes[1].set_title('Business – Top 10 airlines by average price')
axes[1].set_xlabel('Average price')
axes[1].set_ylabel('Airline')

plt.tight_layout()
plt.show()

Observation: Average price by airline¶

[Barplot – Average price by airline]

Observations¶

  • In Economy, airlines such as Vistara, Air India, and SpiceJet stand out with higher average prices.
  • In Business, the same airlines lead the ranking, but with much larger differences in average price.
  • Low-cost airlines (e.g., Trujet, AirAsia) show significantly lower prices, reinforcing their market positioning.
  • This result suggests that the airline variable could be a relevant predictor in machine learning models.

Conclusions¶

  • Coherent relationships between variables are confirmed:
    • Flight duration (duration_mins): positive correlation with price.
    • Number of stops (stop_clean): direct flights are more expensive.
    • Airline (airline): significantly influences average price.
  • Differences between segments (Economy vs Business) reflect distinct market behaviors:
    • Economy shows more structured and predictable relationships.
    • Business exhibits greater dispersion, suggesting additional non-linear effects.
  • These findings support the validity and quality of the dataset after cleaning and provide a solid foundation for the predictive modeling stage.

Overall EDA conclusion¶

The exploratory analysis made it possible to understand the structure, consistency, and most relevant relationships within flight data for the Economy and Business segments.

Key findings:

  • The datasets showed excellent quality, with no missing values and only 2 duplicate records in Economy, which were removed.
  • Numerical variables price and duration_mins exhibited positive skewness, indicating high prices and long durations in a small subset of flights.
  • Outlier treatment using IQR capping improved statistical stability by reducing extreme dispersion.
  • The correlation between price and duration_mins was positive but moderate (≈0.31 in Economy, ≈0.24 in Business), suggesting that longer flights tend to be more expensive, with variability by airline and number of stops.
  • In bivariate analysis, non-stop flights were consistently more expensive than 1-stop flights, with lower price dispersion.
  • Traditional airlines (Air India, Vistara) dominated the upper range of average prices, especially in the Business segment.
  • The 2+-stops category showed very low frequency, resulting in no meaningful graphical representation.

Conclusion:
The EDA confirms that the data are clean, well-structured, and exhibit interpretable relationships among key variables.
Clear differences exist between segments and airlines, as well as pricing patterns associated with flight duration and number of stops.

4. Feature engineering¶

Objective¶

Optimize the representation of variables, generate relevant new features, and prepare the datasets for predictive modeling.
The goal is to improve the models’ ability to capture patterns in the data without introducing redundancy or multicollinearity.

4.1 Data preprocessing¶

Objective¶

Standardize and prepare input data for predictive modeling, ensuring column consistency, clean text fields, and the creation of time-based variables that add value to the model.

The purpose is to ensure that variables are ready to be interpreted by Machine Learning algorithms by removing sources of noise or redundancy and improving the quality of feature representations.

Cleaning and transforming categorical variables¶

Main tasks:

  1. Column renaming:
    Reserved or ambiguous names are replaced (from → origin, to → destination) to avoid conflicts with keywords and improve dataset readability.

  2. Removing irrelevant columns:
    Identifier fields (ch_code, num_code) are removed because they add no predictive value and may introduce noise.

  3. Category normalization:
    The stop column is cleaned from residual characters (line breaks, tabs, extra spaces) and standardized into a new variable stop_clean.
    This enables correct differentiation between direct flights (non-stop), one-stop flights (1-stop), and multi-stop flights (2+-stops).

  4. Time feature extraction:
    From date, three attributes are derived:

    • month: captures seasonality.
    • day: reflects potential day-of-month patterns.
    • weekday: helps identify different behaviors between weekdays and weekends.
  5. Deep text cleaning:
    Special characters and unnecessary spaces are removed across all text columns, standardizing categorical values for correct downstream encoding.

  6. Removing raw time columns:
    Original time columns (date, dep_time, arr_time, time_taken) are removed since their relevant components have already been extracted and transformed into modeling-ready features.

In [23]:
# Clean copies
econ_clean = econ.copy()
bus_clean  = bus.copy()

# Rename columns
econ_clean = econ_clean.rename(columns={'from': 'origin', 'to': 'destination'})
bus_clean  = bus_clean.rename(columns={'from': 'origin', 'to': 'destination'})

# Drop irrelevant identifier columns
econ_clean = econ_clean.drop(columns=['ch_code', 'num_code'])
bus_clean  = bus_clean.drop(columns=['ch_code', 'num_code'])

# Normalize stops (preserve all possible categories)
econ_clean['stop_clean'] = econ_clean['stop'].astype(str).str.replace(r'[\n\t]+', ' ', regex=True).str.strip()
bus_clean['stop_clean']  = bus_clean['stop'].astype(str).str.replace(r'[\n\t]+', ' ', regex=True).str.strip()

# Extract month, day, and weekday from the date
econ_clean['month'] = pd.to_datetime(econ_clean['date'], errors='coerce').dt.month
econ_clean['day'] = pd.to_datetime(econ_clean['date'], errors='coerce').dt.day
econ_clean['weekday'] = pd.to_datetime(econ_clean['date'], errors='coerce').dt.weekday

bus_clean['month'] = pd.to_datetime(bus_clean['date'], errors='coerce').dt.month
bus_clean['day'] = pd.to_datetime(bus_clean['date'], errors='coerce').dt.day
bus_clean['weekday'] = pd.to_datetime(bus_clean['date'], errors='coerce').dt.weekday

# Remove raw date/time columns (no longer needed in X)
cols_time = ['date', 'dep_time', 'arr_time', 'time_taken']
econ_clean = econ_clean.drop(columns=[c for c in cols_time if c in econ_clean.columns])
bus_clean  = bus_clean.drop(columns=[c for c in cols_time if c in bus_clean.columns])

# Deep string cleaning: remove line breaks, tabs, and extra spaces
def limpiar_texto(col):
    if col.dtype == 'object':
        return col.astype(str).str.replace(r'[\n\t]+', ' ', regex=True).str.strip()
    return col

econ_clean = econ_clean.apply(limpiar_texto)
bus_clean  = bus_clean.apply(limpiar_texto)

4.2 Encoding categorical variables¶

Objective¶

Transform categorical variables into numerical representations without losing information or interpretability, allowing algorithms to process them correctly.

Applied strategy¶

One-Hot Encoding (OHE) is applied to the main categorical variables:

  • airline → Represents the operating airline.
  • origin and destination → Capture route and airport effects on prices.
  • stop_clean → Indicates the number of stops (a key factor in the final fare).

OHE creates new binary columns for each category (e.g., airline_Indigo, destination_Mumbai, etc.), assigning values 1 or 0 accordingly.
This way, numerical models can correctly interpret these variables without imposing an artificial order, which would occur with Label Encoding.

Numerical variables (duration_mins, month, day, weekday) are kept unchanged at this stage, as they will be scaled later.

In [24]:
cat_vars_e = [c for c in ['airline','origin','destination','stop_clean'] if c in econ_clean.columns]
cat_vars_b = [c for c in ['airline','origin','destination','stop_clean'] if c in bus_clean.columns]

num_vars_e = [c for c in ['duration_mins','month','day','weekday'] if c in econ_clean.columns]
num_vars_b = [c for c in ['duration_mins','month','day','weekday'] if c in bus_clean.columns]

# OHE without touching numeric variables
X_econ = pd.concat([econ_clean[num_vars_e], pd.get_dummies(econ_clean[cat_vars_e], drop_first=True)], axis=1)
X_bus  = pd.concat([bus_clean[num_vars_b],  pd.get_dummies(bus_clean[cat_vars_b],  drop_first=True)], axis=1)

y_econ = econ_clean['price'].astype(int)
y_bus  = bus_clean['price'].astype(int)

print("Economy – X:", X_econ.shape, "| y:", y_econ.shape)
print("Business – X:", X_bus.shape,  "| y:", y_bus.shape)
Economy – X: (206772, 57) | y: (206772,)
Business – X: (93487, 39) | y: (93487,)

Note¶

The increase in the number of columns reflects the creation of binary variables from the original categories.
This is expected: the Economy dataset, with more airlines and routes, produces 57 columns after encoding, while Business, with fewer combinations, produces 39.

4.3 Scaling numerical variables¶

Objective¶

Normalize numerical variables so they operate within the same magnitude range and no single feature dominates the learning process.

Variables with different units (e.g., duration in minutes vs. month index) can negatively affect training for models based on distances or gradients, since larger values tend to exert more influence.

Applied method¶

We use MinMaxScaler from sklearn, which transforms each variable to the [0, 1] range, preserving the original distribution shape and facilitating algorithm convergence.

Unlike earlier versions, scaling is applied only to continuous numerical variables, avoiding any alteration to one-hot encoded categorical variables (dummies), which are already binary.
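
For reference, MinMaxScaler applies the standard rescaling

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

so each scaled feature lies in [0, 1] while its distribution shape is preserved.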

Scaled variables:

['duration_mins', 'month', 'day', 'weekday']
In [25]:
scaler_e = MinMaxScaler()
scaler_b = MinMaxScaler()

X_econ_scaled = X_econ.copy()
X_bus_scaled  = X_bus.copy()

X_econ_scaled[num_vars_e] = scaler_e.fit_transform(X_econ_scaled[num_vars_e])
X_bus_scaled[num_vars_b]  = scaler_b.fit_transform(X_bus_scaled[num_vars_b])

print("Scaling applied only to:", num_vars_e)
Scaling applied only to: ['duration_mins', 'month', 'day', 'weekday']

Note¶

Only numerical variables have been rescaled, while dummy variables remain unchanged.

This ensures that the model:

  • Does not distort relationships between categories.
  • Learns magnitude-driven patterns in a balanced way.
  • Becomes more stable and efficient during parameter optimization.

In practical terms, this step improves training robustness and prevents biases caused by scale differences across heterogeneous variables.

4.4 Train/test split¶

Objective¶

Evaluate the model’s generalization capability by splitting the dataset into two subsets:

  • Training (80%): used to fit model parameters.
  • Test (20%): reserved to measure performance on unseen data.

This split helps detect overfitting and ensures the model can predict correctly in new scenarios.

Implementation¶

We use train_test_split from sklearn.model_selection, ensuring reproducibility with random_state=123.

In [26]:
X_train_e, X_test_e, y_train_e, y_test_e = train_test_split(
    X_econ_scaled, y_econ, test_size=0.2, random_state=123
)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X_bus_scaled, y_bus, test_size=0.2, random_state=123
)

print("Economy – Train/Test:", X_train_e.shape, X_test_e.shape)
print("Business – Train/Test:", X_train_b.shape, X_test_b.shape)
Economy – Train/Test: (165417, 57) (41355, 57)
Business – Train/Test: (74789, 39) (18698, 39)

Note¶

Each segment retains an appropriate ratio for supervised learning, with enough examples for models to capture general patterns and validate performance under realistic conditions.

In addition, the final feature dimensions (57 columns for Economy and 39 for Business) confirm that encoding and scaling were performed correctly, ensuring consistency across train and test matrices.

4.5 Removing residual columns (object / datetime)¶

Objective¶

Verify that no incompatible data types (object or datetime) remain in the training and test sets before modeling.

Many Machine Learning algorithms—such as Ridge, Random Forest, and Gradient Boosting—only accept numerical inputs.
Therefore, this step ensures the entire feature matrix is in the proper format for training.

Implementation¶

The code checks both segments (Economy and Business) and removes any residual columns with object type:

In [27]:
### 4.5 Removing residual columns (object / datetime)

# Remove any residual object-type columns (e.g., 'stop')
cols_obj = X_train_e.select_dtypes(include='object').columns
if len(cols_obj) > 0:
    print("Text columns removed (Economy):", list(cols_obj))
    X_train_e = X_train_e.drop(columns=cols_obj)
    X_test_e  = X_test_e.drop(columns=cols_obj)

cols_obj_b = X_train_b.select_dtypes(include='object').columns
if len(cols_obj_b) > 0:
    print("Text columns removed (Business):", list(cols_obj_b))
    X_train_b = X_train_b.drop(columns=cols_obj_b)
    X_test_b  = X_test_b.drop(columns=cols_obj_b)

Note¶

In this case, no residual columns were detected, confirming that preprocessing was complete and consistent.
This means all predictors (X_train and X_test) are numeric and ready to be used by the models in the next stage.

Overall conclusion of Feature Engineering¶

After the transformation and preparation process, the Economy and Business datasets are fully ready for supervised modeling.

Summary of steps performed:¶

  • Columns were renamed (from → origin, to → destination) to avoid conflicts with reserved words and improve dataset semantics.
  • Irrelevant variables (ch_code, num_code) were removed, and categorical text fields were cleaned (line breaks, tabs, extra spaces).
  • The stop_clean column was normalized, preserving all valid categories (non-stop, 1-stop, 2+-stops) without collapsing information.
  • From the date (date), new time-based features were extracted:
    • month (flight month)
    • day (day of month)
    • weekday (day of week)
      These features capture potential seasonal and temporal patterns.
  • One-Hot Encoding was applied to categorical variables (airline, origin, destination, stop_clean) to convert them into binary indicators.
  • Only numerical variables (duration_mins, month, day, weekday) were scaled using Min–Max Scaling, keeping dummy variables unchanged.
  • Data were split into training (80%) and test (20%) sets, ensuring reproducibility (random_state=123).
  • A final check confirmed that no residual object or datetime columns remained.

Results:¶

  • Economy: 206,772 records → 57 final features.
  • Business: 93,487 records → 39 final features.
  • Splits:
    • Economy: 165,417 train / 41,355 test
    • Business: 74,789 train / 18,698 test

Conclusion¶

Feature engineering produced datasets that are numerically consistent, enriched, and modeling-ready, combining clean transformations, useful time features, and efficient encoding.

These decisions increase predictive capacity and model stability, improve interpretability, and ensure a coherent data pipeline from extraction through the machine learning stage.

5. Modeling and Validation¶

Objective¶

Train, evaluate, and compare different supervised Machine Learning models with the goal of predicting flight prices in the Economy and Business segments.

The objective is to determine which algorithm achieves the best balance between accuracy, generalization capability, and computational efficiency, considering the tabular and heterogeneous nature of the dataset.

The results allow us to identify which model performs best for each market segment, with potential applications in real-world contexts such as:

  • dynamic pricing systems (pricing engines),
  • route and schedule optimization, or
  • profitability analysis tools by airline or destination.

5.1 Selected models¶

Three representative algorithms were selected, each reflecting a different supervised learning approach.
The goal is to evaluate both linear models and more complex ensembles, with different ways of handling non-linearity and overfitting.

| Model | Type | Main characteristics |
|-------|------|----------------------|
| Ridge Regression | Linear | Introduces L2 regularization to control overfitting. Used as a baseline model due to its simplicity and interpretability. |
| Random Forest Regressor | Ensemble (bagging) | Combines multiple decision trees trained on random subsets. Captures non-linear relationships and handles variance effectively. |
| Gradient Boosting Regressor | Ensemble (boosting) | Trains trees sequentially, correcting residual errors from previous models. Often achieves high accuracy on tabular data at the cost of higher computational complexity. |

Interpretation¶

This selection balances interpretability and predictive power:

  • Ridge helps understand the contribution of each variable to price.
  • Random Forest provides robustness and stability against noise and outliers.
  • Gradient Boosting aims to maximize final accuracy by leveraging non-linear relationships and feature interactions.

Together, these models allow us to evaluate how linear and ensemble techniques respond to the same data, providing a solid foundation for comparative analysis.

5.2 Model training and evaluation¶

Objective¶

Train and evaluate each model on both segments (Economy and Business) using three key metrics that capture both accuracy and generalization capability:

  • MAE (Mean Absolute Error): measures the average absolute error. Easy to interpret, as it is expressed in the same units as the target variable (price).
  • RMSE (Root Mean Squared Error): penalizes large errors more heavily, making it sensitive to highly deviated predictions.
  • R² (Coefficient of determination): indicates the proportion of price variability explained by the model (values closer to 1 imply better fit).
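
For reference, with $y_i$ the true prices, $\hat{y}_i$ the predictions, $\bar{y}$ the mean price, and $n$ the number of test observations:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$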

Implementation¶

The following function generalizes the training, prediction, and metric computation process for any supervised regression model:

In [28]:
def evaluar_modelo(nombre, modelo, X_train, y_train, X_test, y_test):
    """Fit the model, predict on the test set, and return a one-row DataFrame of metrics."""
    modelo.fit(X_train, y_train)
    y_pred = modelo.predict(X_test)

    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE = sqrt(MSE)
    r2 = r2_score(y_test, y_pred)

    resultados = pd.DataFrame({
        "Modelo": [nombre],
        "MAE": [mae],
        "RMSE": [rmse],
        "R²": [r2]
    })

    return resultados

Implementation note¶

This function standardizes the evaluation of the three selected models, ensuring consistent comparisons.

Additionally:

  • A low MAE indicates that average predictions closely match true values.
  • An RMSE significantly larger than MAE may reveal outliers or isolated large errors.
  • An R² close to 1 indicates that the model captures most of the variability in price, serving as a strong global performance indicator.
In [29]:
# Models
ridge = Ridge(alpha=1.0)
rf = RandomForestRegressor(n_estimators=200, random_state=123)
gb = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, random_state=123)
In [30]:
# Economy evaluation
resultados_econ = pd.concat([
    evaluar_modelo("Ridge", ridge, X_train_e, y_train_e, X_test_e, y_test_e),
    evaluar_modelo("Random Forest", rf, X_train_e, y_train_e, X_test_e, y_test_e),
    evaluar_modelo("Gradient Boosting", gb, X_train_e, y_train_e, X_test_e, y_test_e)
], ignore_index=True)
In [31]:
# Business evaluation
resultados_bus = pd.concat([
    evaluar_modelo("Ridge", ridge, X_train_b, y_train_b, X_test_b, y_test_b),
    evaluar_modelo("Random Forest", rf, X_train_b, y_train_b, X_test_b, y_test_b),
    evaluar_modelo("Gradient Boosting", gb, X_train_b, y_train_b, X_test_b, y_test_b)
], ignore_index=True)

display(resultados_econ)
display(resultados_bus)
Modelo MAE RMSE R²
0 Ridge 1467.493852 1920.838018 0.615342
1 Random Forest 585.683525 1110.255849 0.871489
2 Gradient Boosting 1135.728870 1556.337215 0.747477
Modelo MAE RMSE R²
0 Ridge 6690.613904 8614.589314 0.534524
1 Random Forest 2549.234778 4782.555600 0.856534
2 Gradient Boosting 5644.624314 7436.028760 0.653175

Observation: model evaluation¶

Below are the results obtained by the three evaluated techniques — Ridge, Random Forest, and Gradient Boosting — on the Economy and Business datasets.

Performance by segment¶

| Segment | Model | MAE | RMSE | R² |
|---------|-------|------|------|-----|
| Economy | Ridge | 1467.49 | 1920.84 | 0.615 |
| Economy | Random Forest | 585.68 | 1110.26 | 0.871 |
| Economy | Gradient Boosting | 1135.73 | 1556.34 | 0.747 |
| Business | Ridge | 6690.61 | 8614.59 | 0.535 |
| Business | Random Forest | 2549.23 | 4782.56 | 0.857 |
| Business | Gradient Boosting | 5644.62 | 7436.03 | 0.653 |

Interpretation¶

The results confirm expected patterns in tabular regression problems:

  1. Random Forest emerges as the best-performing model overall in both segments:

    • It achieves the lowest errors (MAE and RMSE).
    • It reaches the highest R² values (0.87 in Economy and 0.86 in Business), indicating excellent explanatory power.
    • Its ensemble nature allows it to capture non-linear relationships and handle outliers without overfitting.
  2. Gradient Boosting delivers intermediate results:

    • Performance is solid but inferior to Random Forest.
    • This behavior suggests that more refined hyperparameter tuning (e.g., learning rate, max depth, n_estimators) could further improve results; a tuning sketch follows after this list.
  3. Ridge Regression fulfills its role as a baseline model:

    • Its results are significantly lower, indicating that the relationship between predictors and price is not purely linear.
    • Nevertheless, it maintains a reasonable R² (0.61 in Economy and 0.53 in Business), confirming that prior feature transformations were appropriate.
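
As an illustration of what such tuning could look like (a sketch, not executed in this notebook; the grid values are assumptions, not recommendations):

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Cross-validated search over the parameters mentioned above
param_grid = {
    'n_estimators': [200, 300, 500],
    'learning_rate': [0.03, 0.05, 0.1],
    'max_depth': [3, 4, 5],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=123),
    param_grid,
    scoring='neg_mean_absolute_error',  # optimize the same KPI reported above (MAE)
    cv=3,
    n_jobs=-1,
)
search.fit(X_train_e, y_train_e)
print(search.best_params_, -search.best_score_)
```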

Conclusion¶

The superior performance of ensemble models indicates that:

  • Flight price patterns depend on non-linear interactions between variables (such as airline, origin city, destination, and duration).
  • The Random Forest model offers the best balance between accuracy, robustness, and generalization, making it the most suitable candidate for practical implementation.

These results provide a solid foundation for future steps, such as hyperparameter optimization or the use of more advanced models (e.g., XGBoost, LightGBM).

5.3 Visual comparison of results¶

In [32]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.barplot(x='Modelo', y='R²', hue='Modelo', data=resultados_econ,
            palette='Blues', legend=False, ax=axes[0])
axes[0].set_title('Economy – R² comparison')

sns.barplot(x='Modelo', y='R²', hue='Modelo', data=resultados_bus,
            palette='Reds', legend=False, ax=axes[1])
axes[1].set_title('Business – R² comparison')

plt.tight_layout()
plt.show()
[Barplot – R² comparison by model, Economy vs Business]

Observation: visual comparison of R²¶

The following chart compares the performance of the three models across both segments (Economy and Business), using the coefficient of determination R² as the primary metric.

Interpretation¶

  • In both segments, Random Forest clearly outperforms the other models, with R² values above 0.85, reflecting its strong ability to explain most of the variability in flight prices.

  • Gradient Boosting remains competitive but slightly inferior, suggesting that the model could benefit from more refined hyperparameter tuning (e.g., learning_rate, max_depth, or n_estimators).

  • Ridge Regression, by contrast, shows the lowest R² values in both cases. This confirms that predictor–price relationships are not strictly linear, limiting the performance of simple linear methods.

Conclusion¶

The visual comparison reinforces the numerical results:

  • Random Forest is the most robust and generalizable model for both Economy and Business, achieving an optimal balance between accuracy and stability.
  • This positions it as the best candidate for deployment in real-world price prediction environments, where interpretability can be balanced with high predictive power.

Overall modeling conclusion¶

The results confirm that the Random Forest Regressor delivers the best overall performance across both segments.
Its ability to capture non-linear relationships and handle complex feature interactions allows it to significantly outperform linear models and more configuration-sensitive ensembles.

Consequently, Random Forest is established as the recommended baseline model for future project iterations, due to both its accuracy and its robustness to data variability.
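
Since the success criteria in Section 1 include stability between cross-validation and test results, a quick consistency check could look like this (a sketch, not executed here):

```python
from sklearn.model_selection import cross_val_score

# 5-fold R² on the training set; compare mean ± std with the test R² above
cv_scores = cross_val_score(rf, X_train_e, y_train_e, cv=5, scoring='r2', n_jobs=-1)
print(cv_scores.mean(), cv_scores.std())
```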


6. Results¶

6.1 Overall Summary¶

The predictive analysis enabled the construction of models capable of estimating flight prices with good accuracy, clearly differentiating between the Economy and Business segments.
The Random Forest Regressor showed the best performance in both cases, achieving an R² above 0.85, which indicates excellent explanatory power.

| Segment | MAE (Mean Absolute Error) | RMSE (Root Mean Squared Error) | R² |
|---------|---------------------------|--------------------------------|-----|
| Economy | 586 | 1110 | 0.87 |
| Business | 2550 | 4783 | 0.86 |

6.2 Interpretation of Results¶

  • In the Economy segment, the model predicts prices with an average error below 600 monetary units, representing high accuracy given the variability of the airline market.
  • In Business, although the absolute error is higher (due to higher and more dispersed prices), the overall model fit remains strong, with a similar proportion of explained variance.

These results suggest that temporal variables (month, day, weekday), together with airline, origin city, destination, and number of stops, are key factors in explaining flight price variability.

6.3 Business Implications¶

  • Airlines or travel agencies could anticipate price fluctuations based on seasonality, route, and service type, optimizing their pricing strategies.
  • Customers could benefit from price alert systems or recommendation engines, driven by predictions tailored to their routes and search behavior.
  • From an operational perspective, this type of model can be integrated into revenue management platforms to improve dynamic pricing planning.

6.4 Next Steps¶

To further strengthen model accuracy and scalability, the following actions are recommended:

  • Experiment with hyperparameter optimization techniques (e.g., GridSearchCV, Optuna).
  • Incorporate additional features (e.g., aircraft type, actual flight duration, holidays).
  • Evaluate more advanced models such as XGBoost or LightGBM, which may provide marginal performance improvements.
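
As a sketch of the last point (illustrative only; xgboost is not installed or used in this notebook, and the parameters are assumptions):

```python
from xgboost import XGBRegressor  # assumes: pip install xgboost

xgb = XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    random_state=123,
)
xgb.fit(X_train_e, y_train_e)
print(r2_score(y_test_e, xgb.predict(X_test_e)))  # r2_score imported in Section 3.1
```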

General Conclusion¶

This study demonstrates that, with proper feature engineering and a well-configured ensemble model, it is possible to reliably predict commercial flight prices.
These results lay the groundwork for real-world predictive intelligence applications in the airline industry, supporting both commercial efficiency and customer experience.


7. Conclusions and Next Steps¶

7.1 General Conclusions¶

This project successfully built a complete Machine Learning pipeline for flight price prediction, from data exploration to model validation.
The analysis showed that:

  • Economy and Business segment data exhibit distinct patterns, yet both can be effectively modeled using ensemble algorithms.
  • The Random Forest Regressor emerged as the most balanced model, achieving high levels of accuracy (low MAE) and explanatory power (R² ≈ 0.85–0.87).
  • Temporal variables (month, day, weekday), together with airline, origin, destination, and stop type, proved to be decisive predictors of price.
  • The use of controlled feature engineering and selective scaling improved model stability and generalization.

7.2 Key Learnings¶

This project reinforces the importance of:

  • Applying thorough and consistent data cleaning prior to modeling.
  • Balancing interpretability and performance by selecting models that deliver both technical value and business relevance.
  • Designing notebooks with a reproducible and well-documented structure, facilitating deployment in collaborative environments such as GitHub or Kaggle.

7.3 Next Steps¶

  • Implement an automated hyperparameter tuning phase to further optimize ensemble models.
  • Evaluate the integration of more advanced algorithms such as XGBoost, LightGBM, or CatBoost.
  • Develop an interactive dashboard or prediction API to query estimated prices in real time (a persistence sketch follows below).
  • Explore additional data sources (weather, historical demand, special events) to enrich predictive features.
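
A plausible first step toward that dashboard or API is persisting the trained model. A minimal sketch, assuming joblib is available; the models/ path is hypothetical:

```python
import joblib

# Save the fitted Random Forest for later serving
joblib.dump(rf, 'models/rf_economy.joblib')   # hypothetical path
rf_loaded = joblib.load('models/rf_economy.joblib')
```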

Final Conclusion¶

This work demonstrates how Data Science can deliver tangible value to the airline industry by transforming large volumes of data into actionable insights.
The resulting model not only predicts prices accurately but also provides a foundation for recommendation systems, dynamic pricing strategies, and market analysis tools, strengthening strategic decision-making in real-world environments.


8. Credits and Technologies Used¶

8.1 Authorship¶

Author: Cristián Andrés Galleguillos Vega
Role: Data Scientist | Biologist | MSc in Data Science & Big Data | MSc in Natural Resources Engineering
Repository: GitHub – Flight Price Prediction (update link once published)

This project was developed as part of a professional portfolio consolidation process in Data Science applied to predictive analytics and resource management.
It combines technical modeling expertise with a strong focus on result interpretation and real-world business and sustainability applications.


8.2 Main Technologies and Libraries¶

| Category | Tools |
|----------|-------|
| Language | Python 3.11 |
| Data analysis & manipulation | pandas, numpy |
| Visualization | matplotlib, seaborn |
| Predictive modeling | scikit-learn |
| Preprocessing & scaling | MinMaxScaler, pd.get_dummies (One-Hot Encoding) |
| Environment management | Anaconda, Jupyter Notebook |
| Version control | Git & GitHub |

8.3 Project Structure¶

The notebook follows a modular and reproducible structure:

  1. Context and analysis objectives
  2. Data loading and preparation
  3. EDA (univariate and bivariate exploratory analysis)
  4. Feature engineering and preprocessing
  5. Supervised modeling and validation
  6. Results and interpretation
  7. Conclusions and next steps
  8. Credits and technologies used

8.4 Final Reflection¶

This project represents a comprehensive applied Data Science implementation, combining analytical principles with the scientific perspective developed through biology and natural resources engineering.
Each step—from data cleaning to model validation—reflects a systematic, transparent, and continuous-learning-oriented approach.