Investigate Titanic Dataset

Last Updated on

Project Description

GITHUB: URL

Dataset chosen: Titanic

The latter question will be analyzed via other, more detailed questions, such as:

  1. Does gender have any impact on the survival rate?
  2. Do passengers in a higher class survive more often?
  3. Which age range has the highest survival rate?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
# Render plots inline in the Jupyter notebook
%matplotlib inline
def display_piechart(survive, death, xlabel):
    # Data to plot
    labels = 'Survive', 'Death'
    sizes = [survive, death]
    colors = ['yellowgreen', 'gold']
    explode = (0.1, 0)
    # Plot
    plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True)
    plt.axis('equal')
    plt.xlabel(xlabel)
    plt.show()

READ FROM DATASET

  • Survived: (0:No, 1:Yes)
  • Pclass: Passenger class (1:First Class, 2:Second Class, 3:Third Class)
  • Name: Name of the passenger
  • Sex: Gender of the passenger
  • Age: Age of the passenger
  • Fare: Passenger Fare
  • Cabin: Cabin of the passenger
  • Embarked: Embarkation Port (C = Cherbourg, Q = Queenstown, S = Southampton)
df = pd.read_csv('titanic-data.csv')
display(df.head())
print("Data information")
df.info()
print("\nFigure out missing data")
display(df.isnull().sum())
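To put the missing-value counts in proportion, the share of missing values per column can be computed too. Below is a minimal sketch on a small made-up frame (the values are illustrative assumptions, not rows from titanic-data.csv); on the real data the same call would be applied to df.

```python
import numpy as np
import pandas as pd

# Hypothetical sample, NOT taken from titanic-data.csv
sample = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Sex":   ["male", "female", "female", "male"],
})

# isnull() marks missing cells; the mean of those booleans is the missing fraction
missing_share = sample.isnull().mean().mul(100).round(1)
print(missing_share)
```

A column with a very high missing share, like Cabin in this toy sample, is a natural candidate for dropping.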

Data Decided to Be Dropped

  • PassengerId – Just an identifier for the passenger.
  • Cabin – Dropped because of the high amount of NaN data.

Data not related to the questions

  • SibSp – Number of siblings or spouses aboard.
  • Parch – Number of parents or children aboard.
  • Ticket – Ticket number.
  • Fare – Passenger fare.
  • Embarked – Embarkation port.
df = pd.read_csv('titanic-data.csv')
# Drop unnecessary column
df.drop(["PassengerId", "SibSp", "Parch", "Ticket", "Cabin", "Embarked","Fare"], axis=1,inplace=True)
rows = df.shape[0]
columns = df.shape[1]
print("The dataset consists of {} rows of records and {} columns of variables.".format(rows, columns))
display(df.head())
display(df.describe())

Does gender have any impact on the survival rate?

num_passenger = df.shape[0]
num_male = df.loc[df['Sex'] == "male"].shape[0]
num_female = df.loc[df['Sex'] == "female"].shape[0]
print ("We had total number of {} record with {} male and {} female.\n".format(num_passenger, num_male, num_female))
print(df.groupby('Sex').size(), '\n')
print(pd.crosstab(df['Sex'], df['Survived']))
# Visualize Survivability
table = pd.crosstab(df['Survived'],df['Sex'])
axes = table.plot.pie(subplots=True, labels=['Death','Survived'], autopct='%1.1f%%');
plt.suptitle('Survivability across sex',y=0.8)
for ax in axes:
    ax.legend_.remove()
    ax.set_aspect('equal')
We had total number of 891 record with 577 male and 314 female.

Sex
female    314
male      577
dtype: int64 

Survived    0    1
Sex               
female     81  233
male      468  109
Figure: Survivability across sex

Answer: Wow! The survival rate is 74.2% for females and only 18.9% for males.
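The percentages can be checked directly from the crosstab counts above. This is a minimal sketch, assuming the counts are typed in by hand rather than read from the CSV; on the full DataFrame, pd.crosstab with normalize='index' should give the same row-wise shares.

```python
import pandas as pd

# Counts copied from the crosstab output above (rows: Sex, columns: Survived)
counts = pd.DataFrame(
    {0: [81, 468], 1: [233, 109]},
    index=pd.Index(["female", "male"], name="Sex"),
)

# Divide each row by its row total to get the share of each outcome
rates = counts.div(counts.sum(axis=1), axis=0).mul(100).round(1)
print(rates[1])  # survival rate per sex, in percent
```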

Do passengers in a higher class survive more often?

num_passenger = df.shape[0]
num_Pclass_1 = df.loc[df['Pclass'] == 1].shape[0]
num_Pclass_2 = df.loc[df['Pclass'] == 2].shape[0]
num_Pclass_3 = df.loc[df['Pclass'] == 3].shape[0]
print ("We had total number of {} record with {} for first class and {} for second class and {} for third class.".format(num_passenger, num_Pclass_1, num_Pclass_2,num_Pclass_3))
survived_Pclass_1 = (df['Pclass'] == 1) & (df['Survived'] == 1)
survived_Pclass_2 = (df['Pclass'] == 2) & (df['Survived'] == 1)
survived_Pclass_3 = (df['Pclass'] == 3) & (df['Survived'] == 1)
# Calculate number of survival and death
num_survived_Pclass_1 = df.loc[survived_Pclass_1].shape[0]
num_survived_Pclass_2 = df.loc[survived_Pclass_2].shape[0]
num_survived_Pclass_3 = df.loc[survived_Pclass_3].shape[0]
num_not_survived_Pclass_1 = num_Pclass_1 - num_survived_Pclass_1
num_not_survived_Pclass_2 = num_Pclass_2 - num_survived_Pclass_2
num_not_survived_Pclass_3 = num_Pclass_3 - num_survived_Pclass_3
print("Total of {} in first class, {} survived and {} did not survive.".format(num_Pclass_1, num_survived_Pclass_1, num_not_survived_Pclass_1))
print("Total of {} in second class, {} survived and {} did not survive.".format(num_Pclass_2, num_survived_Pclass_2, num_not_survived_Pclass_2))
print("Total of {} in third class, {} survived and {} did not survive.".format(num_Pclass_3, num_survived_Pclass_3, num_not_survived_Pclass_3))
display_piechart(num_survived_Pclass_1, num_not_survived_Pclass_1, 'Survival Rate for First Class')
display_piechart(num_survived_Pclass_2, num_not_survived_Pclass_2, 'Survival Rate for Second Class')
display_piechart(num_survived_Pclass_3, num_not_survived_Pclass_3, 'Survival Rate for Third Class')
We had total number of 891 record with 216 for first class and 184 for second class and 491 for third class.
Total of 216 in first class, 136 survived and 80 did not survive.
Total of 184 in second class, 87 survived and 97 did not survive.
Total of 491 in third class, 119 survived and 372 did not survive.
Figure: Survival rate for first class
Figure: Survival rate for second class
Figure: Survival rate for third class

Answer: Wow! According to the pie charts, first class has the highest survival rate at 63.0%. Second class has only 47.3%, and third class has only 24.2%.
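The per-class counting can also be written as a single groupby call. Here is a minimal sketch, assuming a frame rebuilt from the per-class counts printed above (not the original rows); on the real data, df.groupby('Pclass')['Survived'] would be used the same way.

```python
import pandas as pd

# (class, survived, died) counts copied from the printed output above
class_counts = [(1, 136, 80), (2, 87, 97), (3, 119, 372)]

# Rebuild one row per passenger with only the two columns needed here
rows = []
for pclass, survived, died in class_counts:
    rows += [(pclass, 1)] * survived + [(pclass, 0)] * died
passengers = pd.DataFrame(rows, columns=["Pclass", "Survived"])

# For a 0/1 column, count gives the group size, sum the survivors, mean the rate
summary = passengers.groupby("Pclass")["Survived"].agg(
    total="count", survived="sum", rate="mean"
)
summary["rate"] = summary["rate"].mul(100).round(1)
print(summary)
```

This replaces the three sets of per-class variables with one table, which also avoids copy-paste slips in the printed labels.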

Which age range has the highest survival rate?

# Drop records with a missing passenger age (Age is the only remaining column with NaN values)
df = df.dropna(subset=['Age'])
df[df.Survived==1].Age.plot.hist(bins=range(0,81,10),alpha=0.5,color="blue",figsize=(6,4),label='Survived')
df[df.Survived==0].Age.plot.hist(bins=range(0,81,10),alpha=0.5,color="red",figsize=(6,4),label='Death')
plt.legend()
plt.xlabel("Age distribution of passengers who survived and who died")
Figure: Age distribution of passengers who survived and who died

Answer: The 0-10 age range has the highest survival rate. However, the 20-30 age range has the highest risk.
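The histogram shows counts per age band rather than rates, so comparing survival rates across bands takes one more step. Below is a minimal sketch using pd.cut on a small made-up sample (the ages and outcomes are illustrative assumptions, not Titanic records); on the real data the same two calls would be applied to df.

```python
import pandas as pd

# Hypothetical sample, NOT taken from titanic-data.csv
sample = pd.DataFrame({
    "Age":      [4, 8, 22, 25, 28, 35, 47, 63],
    "Survived": [1, 1, 0, 1, 0, 1, 0, 0],
})

# Same 10-year edges as the histogram; right=False gives [0, 10), [10, 20), ...
# Note: an age of exactly 80 falls outside these bands, whereas the
# histogram's last bin is closed and would include it.
bins = list(range(0, 81, 10))
sample["AgeBand"] = pd.cut(sample["Age"], bins=bins, right=False)

# Mean of the 0/1 Survived column per band is the survival rate
rate_by_band = sample.groupby("AgeBand", observed=True)["Survived"].mean().mul(100)
print(rate_by_band)
```

Bands with very few passengers give unstable rates, so it helps to look at the group sizes alongside the percentages.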

Conclusion

According to the above analysis, several factors affect the survival rate. As analysed, age, class and gender each have a large impact on survival.

Limitations

This analysis has some limitations:

  • Missing data: The Titanic dataset is missing some passenger ages, and most of the cabin data is missing.
  • Ignored data: I dropped the entire Cabin column because of its large amount of missing data, along with the columns not needed for the questions.

Other variables

  • Passenger Career
  • Life boat number
  • Passenger Health Status
  • Incorrectly recorded data
  • Passenger Reputation (maybe some of them were superstars?)
  • Ship maintenance report
  • Passenger Background

It would be interesting if the other variables exist!

Reference

https://www.kaggle.com/c/titanic/data
