Titanic Training Machine learning#

用鐵達尼號資料集作為 Machine learning 的基礎練習

我需要完成的步驟:

  • 了解 Competition 的目標是什麼?

  • 理解資料的意義

  • 依照條件我可以去查看資料的 Pattern

需要用到的套件

  • Pandas (CSV)

  • metaplotlib (視覺化 EDA)

  • Scikit-learn (ML 演算法)

import numpy as np
import pandas as pd

## option for displayed rows
pd.set_option('display.max_rows', 20)

## load titanic titanic_clean 
titanic = pd.read_csv('./data/titanic_train.csv')

## 數值資料可以善用 describe 查看資料的平均、中位數...等統計資訊
titanic.columns.values
array(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype=object)
## 查看缺失值
titanic.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin            0
Embarked         2
Family           0
dtype: int64
titanic[(titanic.Survived == 0) & (titanic.Cabin == 'Unknown')].loc[:,['Survived','Pclass','Sex','Age','Family','Fare','Cabin', 'Embarked']].sort_values('Pclass', ascending=True)

# titanic[(titanic.Age.isnull() == True)].loc[:100,['Survived','Pclass','Sex','Age','SibSp','Parch','Fare','Cabin', 'Embarked']]
Survived Pclass Sex Age Family Fare Cabin Embarked
83 0 1 male 28.0 0 47.1000 Unknown S
168 0 1 male NaN 0 25.9250 Unknown S
793 0 1 male NaN 0 30.6958 Unknown C
694 0 1 male 60.0 0 26.5500 Unknown S
270 0 1 male NaN 0 31.0000 Unknown S
... ... ... ... ... ... ... ... ...
352 0 3 male 15.0 2 7.2292 Unknown C
350 0 3 male 23.0 0 9.2250 Unknown S
349 0 3 male 42.0 0 8.6625 Unknown S
519 0 3 male 32.0 0 7.8958 Unknown S
890 0 3 male 32.0 0 7.7500 Unknown Q

481 rows × 8 columns

titanic[(titanic.Survived == 1) & (titanic.Cabin == 'Unknown')].loc[:,['Survived','Pclass','Sex','Age','Family','Fare','Cabin', 'Embarked']].sort_values('Pclass', ascending=True)
Survived Pclass Sex Age Family Fare Cabin Embarked
660 1 1 male 50.0 2 133.6500 Unknown S
513 1 1 female 54.0 1 59.4000 Unknown C
290 1 1 female 26.0 0 78.8500 Unknown S
708 1 1 female 22.0 0 151.5500 Unknown S
306 1 1 female NaN 0 110.8833 Unknown C
... ... ... ... ... ... ... ... ...
328 1 3 female 31.0 2 20.5250 Unknown S
330 1 3 female NaN 2 23.2500 Unknown Q
338 1 3 male 45.0 0 8.0500 Unknown S
286 1 3 male 30.0 0 9.5000 Unknown S
414 1 3 male 44.0 0 7.9250 Unknown S

206 rows × 8 columns

目前得知 Cabin 為 Unknown 的情況下,生存的人數少於死亡人數

avgAge = titanic.Age.dropna().mean()
avgAge
29.69911764705882
titanic[(titanic.Survived == 0) & (titanic.Age < avgAge)].loc[:,['Survived','Pclass','Sex','Age','Family','Fare','Cabin', 'Embarked']].sort_values('Fare', ascending=True)
Survived Pclass Sex Age Family Fare Cabin Embarked
302 0 3 male 19.0 0 0.0000 Unknown S
378 0 3 male 20.0 0 4.0125 Unknown C
371 0 3 male 18.0 1 6.4958 Unknown S
143 0 3 male 19.0 0 6.7500 Unknown Q
654 0 3 female 18.0 0 6.7500 Unknown Q
... ... ... ... ... ... ... ... ...
297 0 1 female 2.0 3 151.5500 C22 C26 S
498 0 1 female 25.0 3 151.5500 C22 C26 S
377 0 1 male 27.0 2 211.5000 C82 C
118 0 1 male 24.0 1 247.5208 B58 B60 C
27 0 1 male 19.0 5 263.0000 C23 C25 C27 S

228 rows × 8 columns

titanic[(titanic.Survived == 1) & (titanic.Age < avgAge)].loc[:,['Survived','Pclass','Sex','Age','SibSp','Parch','Fare','Cabin', 'Embarked']].sort_values('Fare', ascending=True)
Survived Pclass Sex Age SibSp Parch Fare Cabin Embarked
271 1 3 male 25.0 0 0 0.0000 Unknown S
804 1 3 male 27.0 0 0 6.9750 Unknown S
127 1 3 male 24.0 0 0 7.1417 Unknown S
875 1 3 female 15.0 0 0 7.2250 Unknown C
553 1 3 male 22.0 0 0 7.2250 Unknown C
... ... ... ... ... ... ... ... ... ...
700 1 1 female 18.0 1 0 227.5250 C62 C64 C
311 1 1 female 18.0 2 2 262.3750 B57 B59 B63 B66 C
742 1 1 female 21.0 2 2 262.3750 B57 B59 B63 B66 C
341 1 1 female 24.0 3 2 263.0000 C23 C25 C27 S
88 1 1 female 23.0 3 2 263.0000 C23 C25 C27 S

156 rows × 9 columns

avg_fare = titanic.Fare.mean()
avg_fare
32.204207968574636
print('用 Survived 跟 Fare 做比較,並以票價平均做基準')
print('Fare 平均: ',avg_fare)
print('票價低於平均的死亡人數',len(titanic[(titanic.Survived == 0) & (titanic.Fare < avg_fare )]))
print('票價高於平均的死亡人數',len(titanic[(titanic.Survived == 0) & (titanic.Fare > avg_fare )]))


titanic[(titanic.Survived == 0) & (titanic.Fare > avg_fare )].loc[:,['Survived','Pclass','Sex','Age','SibSp','Parch','Fare','Cabin', 'Embarked']].sort_values('Fare', ascending=True)
用 Survived 跟 Fare 做比較,並以票價平均做基準
Fare 平均:  32.204207968574636
票價低於平均的死亡人數 464
票價高於平均的死亡人數 85
Survived Pclass Sex Age SibSp Parch Fare Cabin Embarked
625 0 1 male 61.0 0 0 32.3208 D50 S
848 0 2 male 28.0 0 1 33.0000 Unknown S
170 0 1 male 61.0 0 0 33.5000 B19 S
515 0 1 male 47.0 0 0 34.0208 D46 S
436 0 3 female 21.0 2 2 34.3750 Unknown S
... ... ... ... ... ... ... ... ... ...
527 0 1 male NaN 0 0 221.7792 C95 S
557 0 1 male NaN 0 0 227.5250 Unknown C
118 0 1 male 24.0 0 1 247.5208 B58 B60 C
438 0 1 male 64.0 1 4 263.0000 C23 C25 C27 S
27 0 1 male 19.0 3 2 263.0000 C23 C25 C27 S

85 rows × 9 columns

得知 Fare 在小於平均值的情形下,在死亡人口中有明顯差距,約佔了死亡人口的 84 %

titanic[['Pclass','Survived']].groupby('Pclass',as_index=False).mean().sort_values('Survived',ascending=False)
Pclass Survived
0 1 0.629630
1 2 0.472826
2 3 0.242363
titanic[['Sex','Survived']].groupby('Sex',as_index=False).mean().sort_values('Survived',ascending=False)
Sex Survived
0 female 0.742038
1 male 0.188908
titanic[['Embarked','Survived']].groupby('Embarked',as_index=False).mean().sort_values('Survived',ascending=False)
Embarked Survived
0 C 0.553571
1 Q 0.389610
2 S 0.336957
titanic[['Parch','Survived']].groupby('Parch',as_index=False).mean().sort_values('Survived',ascending=False)
Parch Survived
3 3 0.600000
1 1 0.550847
2 2 0.500000
0 0 0.343658
5 5 0.200000
4 4 0.000000
6 6 0.000000
titanic[['SibSp','Survived']].groupby('SibSp',as_index=False).mean().sort_values('Survived',ascending=False)
SibSp Survived
1 1 0.535885
2 2 0.464286
0 0 0.345395
3 3 0.250000
4 4 0.166667
5 5 0.000000
6 8 0.000000
titanic[['Family','Survived']].groupby('Family',as_index=False).mean().sort_values('Survived',ascending=False)
Family Survived
3 3 0.724138
2 2 0.578431
1 1 0.552795
6 6 0.333333
0 0 0.303538
4 4 0.200000
5 5 0.136364
7 7 0.000000
8 10 0.000000

Data 前處理#

需要將會用到的資料轉成數值進行分析

  • Sex (性別)

  • Fare (票價) 轉成 qt

  • Age 將 missing value 填入平均數

  • Embarked 先捨棄

  • 捨棄 Cabin (missing value 過多)

titanic_clean = titanic.drop(['PassengerId','Name','Ticket','Cabin','Embarked'],axis=1)

# Sex convert to 0 and 1
titanic_clean.Sex = titanic_clean.Sex.map({'female': 1, 'male': 0}).astype(int)
# Age fill in mean (average)
titanic_clean.Age.fillna(titanic_clean.Age.dropna().mean(),inplace=True)

titanic_clean.loc[ titanic_clean['Fare'] <= 7.91, 'Fare'] = 0
titanic_clean.loc[(titanic_clean['Fare'] > 7.91) & (titanic_clean['Fare'] <= 14.454), 'Fare'] = 1
titanic_clean.loc[(titanic_clean['Fare'] > 14.454) & (titanic_clean['Fare'] <= 31), 'Fare']   = 2
titanic_clean.loc[ titanic_clean['Fare'] > 31, 'Fare'] = 3
titanic_clean['Fare'] = titanic_clean['Fare'].astype(int)

titanic_clean
Survived Pclass Sex Age SibSp Parch Fare
0 0 3 0 22.000000 1 0 0
1 1 1 1 38.000000 1 0 3
2 1 3 1 26.000000 0 0 1
3 1 1 1 35.000000 1 0 3
4 0 3 0 35.000000 0 0 1
... ... ... ... ... ... ... ...
886 0 2 0 27.000000 0 0 1
887 1 1 1 19.000000 0 0 2
888 0 3 1 29.699118 1 2 2
889 1 1 0 26.000000 0 0 2
890 0 3 0 32.000000 0 0 0

891 rows × 7 columns

#titanic_clean['AgeBand'] = pd.qcut(titanic_clean['Age'],4)
#titanic_clean[['AgeBand','Survived']].groupby('AgeBand',as_index=False).mean().sort_values('AgeBand',ascending=False)

titanic_clean.loc[titanic_clean['Age'] <= 16, 'Age'] = 0
titanic_clean.loc[(titanic_clean['Age'] > 16) & (titanic_clean['Age'] <= 32), 'Age'] = 1
titanic_clean.loc[(titanic_clean['Age'] > 32) & (titanic_clean['Age'] <= 48), 'Age'] = 2
titanic_clean.loc[(titanic_clean['Age'] > 48) & (titanic_clean['Age'] <= 64), 'Age'] = 3
titanic_clean.loc[ titanic_clean['Age'] > 64, 'Age'] = 4
titanic_clean['Age'].astype(int)

titanic_clean.drop(['AgeBand'],axis=1)
Survived Pclass Sex Age SibSp Parch Fare
0 0 3 0 0.0 1 0 0
1 1 1 1 0.0 1 0 3
2 1 3 1 0.0 0 0 1
3 1 1 1 0.0 1 0 3
4 0 3 0 0.0 0 0 1
... ... ... ... ... ... ... ...
886 0 2 0 0.0 0 0 1
887 1 1 1 0.0 0 0 2
888 0 3 1 0.0 1 2 2
889 1 1 0 0.0 0 0 2
890 0 3 0 0.0 0 0 0

891 rows × 7 columns