Patrick Mugisha, "On my honor, as a student, I have neither given nor received unauthorized aid on this academic work."

Table of Contents

ETL (Extract, Transform, Load): What? Why?

ETL covers a broad range of data transformation and processing techniques. It involves cleaning and transforming raw data so that it is ready for analysis.

Titanic Dataset
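
A minimal sketch of loading the dataset with pandas; the file name "train.csv" is an assumption (the standard Kaggle training split).

```python
import pandas as pd

# Load the Titanic training data; "train.csv" is an assumed file name
# (the standard Kaggle training split).
df = pd.read_csv("train.csv")

print(df.shape)   # number of rows and columns
print(df.head())  # first few rows
```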

Essential Data Transformation

Finding out data types & missing values

Numerical columns: PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare

Categorical columns: Name, Sex, Ticket, Cabin, Embarked

There are missing values in some columns (Age, Cabin, Embarked)

This dataset needs data cleaning & transformation
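
A minimal sketch of how these checks could be done with pandas, assuming the DataFrame is named df as above:

```python
# Data types: numerical columns vs. categorical (object) columns
print(df.dtypes)

# Missing values per column; only Age, Cabin, and Embarked have any
print(df.isnull().sum())

# df.info() shows both views at once: dtypes and non-null counts
df.info()
```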

Basic statistics

There are a number of data quality issues
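
One way to get the basic statistics and spot potential quality issues is pandas' describe(); a sketch assuming the same df:

```python
# Summary statistics for the numerical columns
# (count, mean, std, min, quartiles, max)
print(df.describe())

# The categorical columns as well (count, unique values, most frequent value)
print(df.describe(include="object"))
```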

Remove (unnecessary) columns

Handling missing values

There are different ways to handle missing values

Recommended heuristics for handling missing values in real-world datasets

  1. Remove unnecessary (or unmeaningful) columns (e.g., Ticket number): "unnecessary" means those columns you don't consider for the analysis. For example, it is obvious that there is no need to analyze ticket numbers.
  2. Some columns' missing values could be replaced with a specific value (e.g., mean, median), while replacement is not possible for other columns. For the latter case, drop the rows with missing values "mindfully" (see the sketch after this list).
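
A sketch of the two heuristics applied to the Titanic data; which columns are dropped versus imputed is an assumption for illustration.

```python
# Heuristic 1: drop columns that are not needed for the analysis.
# Ticket adds nothing, and Cabin is mostly missing anyway.
df = df.drop(columns=["Ticket", "Cabin"])

# Heuristic 2: Age (or Embarked) could instead be imputed, e.g.
#   df["Age"] = df["Age"].fillna(df["Age"].median())
# Here the rows with remaining missing values are dropped "mindfully",
# which is the aggressive approach described below.
df = df.dropna(subset=["Age", "Embarked"])

print(len(df))            # 712 rows remain
print(df.isnull().sum())  # no missing values left
```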

The dataset now has 712 rows and no missing values. This is an aggressive approach.

Handling Inaccurate/Inconsistent Data
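
A hypothetical sketch of the kinds of checks one might run for inaccurate or inconsistent values in this dataset; the specific rules below are assumptions for illustration.

```python
# Hypothetical consistency checks -- the specific rules are assumptions.

# Values outside a plausible range (e.g., non-positive or extreme ages)
print(df[(df["Age"] <= 0) | (df["Age"] > 100)])

# A fare of 0 may indicate a data-entry issue worth reviewing
print(df[df["Fare"] == 0])

# Categorical columns should contain only the expected labels
print(df["Sex"].unique())       # expected: ['male', 'female']
print(df["Embarked"].unique())  # expected: ['S', 'C', 'Q']
```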

After applying multiple data-cleaning techniques, the data now looks good. This cleaned data can likely be used for analysis and for building a predictive model. The predictive model would predict who is likely to survive, using Survived as the target (y) value.
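
A sketch of what such a predictive model could look like, continuing from the cleaned df above; the choice of logistic regression and the exact feature set are assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Feature set is an assumption; categorical columns are one-hot encoded
X = pd.get_dummies(
    df[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]],
    drop_first=True,
)
y = df["Survived"]  # Survived is the y value the model predicts

# Hold out part of the passengers to estimate predictive performance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out passengers
```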

Intermediate-level Data Transformation

Some passengers in the upper class did not survive.
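
A sketch of an intermediate-level transformation that surfaces this, assuming the cleaned df from above: group by passenger class and compare survival rates.

```python
# Survival rate by passenger class (Pclass 1 = upper class)
print(df.groupby("Pclass")["Survived"].mean())

# Upper-class passengers who did not survive
upper_class_lost = df[(df["Pclass"] == 1) & (df["Survived"] == 0)]
print(len(upper_class_lost))
```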

References