Titanic Training Machine learning#
用鐵達尼號資料集作為 Machine learning 的基礎練習
我需要完成的步驟:
了解 Competition 的目標是什麼?
理解資料的意義
依照條件我可以去查看資料的 Pattern
需要用到的套件
Pandas (CSV)
metaplotlib (視覺化 EDA)
Scikit-learn (ML 演算法)
import numpy as np
import pandas as pd
## option for displayed rows
pd.set_option('display.max_rows', 20)
## load titanic titanic_clean
titanic = pd.read_csv('./data/titanic_train.csv')
## 數值資料可以善用 describe 查看資料的平均、中位數...等統計資訊
titanic.columns.values
array(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype=object)
## 查看缺失值
titanic.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 0
Embarked 2
Family 0
dtype: int64
titanic[(titanic.Survived == 0) & (titanic.Cabin == 'Unknown')].loc[:,['Survived','Pclass','Sex','Age','Family','Fare','Cabin', 'Embarked']].sort_values('Pclass', ascending=True)
# titanic[(titanic.Age.isnull() == True)].loc[:100,['Survived','Pclass','Sex','Age','SibSp','Parch','Fare','Cabin', 'Embarked']]
Survived | Pclass | Sex | Age | Family | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|
83 | 0 | 1 | male | 28.0 | 0 | 47.1000 | Unknown | S |
168 | 0 | 1 | male | NaN | 0 | 25.9250 | Unknown | S |
793 | 0 | 1 | male | NaN | 0 | 30.6958 | Unknown | C |
694 | 0 | 1 | male | 60.0 | 0 | 26.5500 | Unknown | S |
270 | 0 | 1 | male | NaN | 0 | 31.0000 | Unknown | S |
... | ... | ... | ... | ... | ... | ... | ... | ... |
352 | 0 | 3 | male | 15.0 | 2 | 7.2292 | Unknown | C |
350 | 0 | 3 | male | 23.0 | 0 | 9.2250 | Unknown | S |
349 | 0 | 3 | male | 42.0 | 0 | 8.6625 | Unknown | S |
519 | 0 | 3 | male | 32.0 | 0 | 7.8958 | Unknown | S |
890 | 0 | 3 | male | 32.0 | 0 | 7.7500 | Unknown | Q |
481 rows × 8 columns
titanic[(titanic.Survived == 1) & (titanic.Cabin == 'Unknown')].loc[:,['Survived','Pclass','Sex','Age','Family','Fare','Cabin', 'Embarked']].sort_values('Pclass', ascending=True)
Survived | Pclass | Sex | Age | Family | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|
660 | 1 | 1 | male | 50.0 | 2 | 133.6500 | Unknown | S |
513 | 1 | 1 | female | 54.0 | 1 | 59.4000 | Unknown | C |
290 | 1 | 1 | female | 26.0 | 0 | 78.8500 | Unknown | S |
708 | 1 | 1 | female | 22.0 | 0 | 151.5500 | Unknown | S |
306 | 1 | 1 | female | NaN | 0 | 110.8833 | Unknown | C |
... | ... | ... | ... | ... | ... | ... | ... | ... |
328 | 1 | 3 | female | 31.0 | 2 | 20.5250 | Unknown | S |
330 | 1 | 3 | female | NaN | 2 | 23.2500 | Unknown | Q |
338 | 1 | 3 | male | 45.0 | 0 | 8.0500 | Unknown | S |
286 | 1 | 3 | male | 30.0 | 0 | 9.5000 | Unknown | S |
414 | 1 | 3 | male | 44.0 | 0 | 7.9250 | Unknown | S |
206 rows × 8 columns
目前得知 Cabin 為 Unknown 的情況下,生存的人數少於死亡人數
avgAge = titanic.Age.dropna().mean()
avgAge
29.69911764705882
titanic[(titanic.Survived == 0) & (titanic.Age < avgAge)].loc[:,['Survived','Pclass','Sex','Age','Family','Fare','Cabin', 'Embarked']].sort_values('Fare', ascending=True)
Survived | Pclass | Sex | Age | Family | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|
302 | 0 | 3 | male | 19.0 | 0 | 0.0000 | Unknown | S |
378 | 0 | 3 | male | 20.0 | 0 | 4.0125 | Unknown | C |
371 | 0 | 3 | male | 18.0 | 1 | 6.4958 | Unknown | S |
143 | 0 | 3 | male | 19.0 | 0 | 6.7500 | Unknown | Q |
654 | 0 | 3 | female | 18.0 | 0 | 6.7500 | Unknown | Q |
... | ... | ... | ... | ... | ... | ... | ... | ... |
297 | 0 | 1 | female | 2.0 | 3 | 151.5500 | C22 C26 | S |
498 | 0 | 1 | female | 25.0 | 3 | 151.5500 | C22 C26 | S |
377 | 0 | 1 | male | 27.0 | 2 | 211.5000 | C82 | C |
118 | 0 | 1 | male | 24.0 | 1 | 247.5208 | B58 B60 | C |
27 | 0 | 1 | male | 19.0 | 5 | 263.0000 | C23 C25 C27 | S |
228 rows × 8 columns
titanic[(titanic.Survived == 1) & (titanic.Age < avgAge)].loc[:,['Survived','Pclass','Sex','Age','SibSp','Parch','Fare','Cabin', 'Embarked']].sort_values('Fare', ascending=True)
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|
271 | 1 | 3 | male | 25.0 | 0 | 0 | 0.0000 | Unknown | S |
804 | 1 | 3 | male | 27.0 | 0 | 0 | 6.9750 | Unknown | S |
127 | 1 | 3 | male | 24.0 | 0 | 0 | 7.1417 | Unknown | S |
875 | 1 | 3 | female | 15.0 | 0 | 0 | 7.2250 | Unknown | C |
553 | 1 | 3 | male | 22.0 | 0 | 0 | 7.2250 | Unknown | C |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
700 | 1 | 1 | female | 18.0 | 1 | 0 | 227.5250 | C62 C64 | C |
311 | 1 | 1 | female | 18.0 | 2 | 2 | 262.3750 | B57 B59 B63 B66 | C |
742 | 1 | 1 | female | 21.0 | 2 | 2 | 262.3750 | B57 B59 B63 B66 | C |
341 | 1 | 1 | female | 24.0 | 3 | 2 | 263.0000 | C23 C25 C27 | S |
88 | 1 | 1 | female | 23.0 | 3 | 2 | 263.0000 | C23 C25 C27 | S |
156 rows × 9 columns
avg_fare = titanic.Fare.mean()
avg_fare
32.204207968574636
print('用 Survived 跟 Fare 做比較,並以票價平均做基準')
print('Fare 平均: ',avg_fare)
print('票價低於平均的死亡人數',len(titanic[(titanic.Survived == 0) & (titanic.Fare < avg_fare )]))
print('票價高於平均的死亡人數',len(titanic[(titanic.Survived == 0) & (titanic.Fare > avg_fare )]))
titanic[(titanic.Survived == 0) & (titanic.Fare > avg_fare )].loc[:,['Survived','Pclass','Sex','Age','SibSp','Parch','Fare','Cabin', 'Embarked']].sort_values('Fare', ascending=True)
用 Survived 跟 Fare 做比較,並以票價平均做基準
Fare 平均: 32.204207968574636
票價低於平均的死亡人數 464
票價高於平均的死亡人數 85
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|
625 | 0 | 1 | male | 61.0 | 0 | 0 | 32.3208 | D50 | S |
848 | 0 | 2 | male | 28.0 | 0 | 1 | 33.0000 | Unknown | S |
170 | 0 | 1 | male | 61.0 | 0 | 0 | 33.5000 | B19 | S |
515 | 0 | 1 | male | 47.0 | 0 | 0 | 34.0208 | D46 | S |
436 | 0 | 3 | female | 21.0 | 2 | 2 | 34.3750 | Unknown | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
527 | 0 | 1 | male | NaN | 0 | 0 | 221.7792 | C95 | S |
557 | 0 | 1 | male | NaN | 0 | 0 | 227.5250 | Unknown | C |
118 | 0 | 1 | male | 24.0 | 0 | 1 | 247.5208 | B58 B60 | C |
438 | 0 | 1 | male | 64.0 | 1 | 4 | 263.0000 | C23 C25 C27 | S |
27 | 0 | 1 | male | 19.0 | 3 | 2 | 263.0000 | C23 C25 C27 | S |
85 rows × 9 columns
得知 Fare 在小於平均值的情形下,在死亡人口中有明顯差距,約佔了死亡人口的 84 %
titanic[['Pclass','Survived']].groupby('Pclass',as_index=False).mean().sort_values('Survived',ascending=False)
Pclass | Survived | |
---|---|---|
0 | 1 | 0.629630 |
1 | 2 | 0.472826 |
2 | 3 | 0.242363 |
titanic[['Sex','Survived']].groupby('Sex',as_index=False).mean().sort_values('Survived',ascending=False)
Sex | Survived | |
---|---|---|
0 | female | 0.742038 |
1 | male | 0.188908 |
titanic[['Embarked','Survived']].groupby('Embarked',as_index=False).mean().sort_values('Survived',ascending=False)
Embarked | Survived | |
---|---|---|
0 | C | 0.553571 |
1 | Q | 0.389610 |
2 | S | 0.336957 |
titanic[['Parch','Survived']].groupby('Parch',as_index=False).mean().sort_values('Survived',ascending=False)
Parch | Survived | |
---|---|---|
3 | 3 | 0.600000 |
1 | 1 | 0.550847 |
2 | 2 | 0.500000 |
0 | 0 | 0.343658 |
5 | 5 | 0.200000 |
4 | 4 | 0.000000 |
6 | 6 | 0.000000 |
titanic[['SibSp','Survived']].groupby('SibSp',as_index=False).mean().sort_values('Survived',ascending=False)
SibSp | Survived | |
---|---|---|
1 | 1 | 0.535885 |
2 | 2 | 0.464286 |
0 | 0 | 0.345395 |
3 | 3 | 0.250000 |
4 | 4 | 0.166667 |
5 | 5 | 0.000000 |
6 | 8 | 0.000000 |
titanic[['Family','Survived']].groupby('Family',as_index=False).mean().sort_values('Survived',ascending=False)
Family | Survived | |
---|---|---|
3 | 3 | 0.724138 |
2 | 2 | 0.578431 |
1 | 1 | 0.552795 |
6 | 6 | 0.333333 |
0 | 0 | 0.303538 |
4 | 4 | 0.200000 |
5 | 5 | 0.136364 |
7 | 7 | 0.000000 |
8 | 10 | 0.000000 |
Data 前處理#
需要將會用到的資料轉成數值進行分析
Sex (性別)
Fare (票價) 轉成 qt
Age 將 missing value 填入平均數
Embarked 先捨棄
捨棄 Cabin (missing value 過多)
titanic_clean = titanic.drop(['PassengerId','Name','Ticket','Cabin','Embarked'],axis=1)
# Sex convert to 0 and 1
titanic_clean.Sex = titanic_clean.Sex.map({'female': 1, 'male': 0}).astype(int)
# Age fill in mean (average)
titanic_clean.Age.fillna(titanic_clean.Age.dropna().mean(),inplace=True)
titanic_clean.loc[ titanic_clean['Fare'] <= 7.91, 'Fare'] = 0
titanic_clean.loc[(titanic_clean['Fare'] > 7.91) & (titanic_clean['Fare'] <= 14.454), 'Fare'] = 1
titanic_clean.loc[(titanic_clean['Fare'] > 14.454) & (titanic_clean['Fare'] <= 31), 'Fare'] = 2
titanic_clean.loc[ titanic_clean['Fare'] > 31, 'Fare'] = 3
titanic_clean['Fare'] = titanic_clean['Fare'].astype(int)
titanic_clean
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
0 | 0 | 3 | 0 | 22.000000 | 1 | 0 | 0 |
1 | 1 | 1 | 1 | 38.000000 | 1 | 0 | 3 |
2 | 1 | 3 | 1 | 26.000000 | 0 | 0 | 1 |
3 | 1 | 1 | 1 | 35.000000 | 1 | 0 | 3 |
4 | 0 | 3 | 0 | 35.000000 | 0 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... |
886 | 0 | 2 | 0 | 27.000000 | 0 | 0 | 1 |
887 | 1 | 1 | 1 | 19.000000 | 0 | 0 | 2 |
888 | 0 | 3 | 1 | 29.699118 | 1 | 2 | 2 |
889 | 1 | 1 | 0 | 26.000000 | 0 | 0 | 2 |
890 | 0 | 3 | 0 | 32.000000 | 0 | 0 | 0 |
891 rows × 7 columns
#titanic_clean['AgeBand'] = pd.qcut(titanic_clean['Age'],4)
#titanic_clean[['AgeBand','Survived']].groupby('AgeBand',as_index=False).mean().sort_values('AgeBand',ascending=False)
titanic_clean.loc[titanic_clean['Age'] <= 16, 'Age'] = 0
titanic_clean.loc[(titanic_clean['Age'] > 16) & (titanic_clean['Age'] <= 32), 'Age'] = 1
titanic_clean.loc[(titanic_clean['Age'] > 32) & (titanic_clean['Age'] <= 48), 'Age'] = 2
titanic_clean.loc[(titanic_clean['Age'] > 48) & (titanic_clean['Age'] <= 64), 'Age'] = 3
titanic_clean.loc[ titanic_clean['Age'] > 64, 'Age'] = 4
titanic_clean['Age'].astype(int)
titanic_clean.drop(['AgeBand'],axis=1)
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
0 | 0 | 3 | 0 | 0.0 | 1 | 0 | 0 |
1 | 1 | 1 | 1 | 0.0 | 1 | 0 | 3 |
2 | 1 | 3 | 1 | 0.0 | 0 | 0 | 1 |
3 | 1 | 1 | 1 | 0.0 | 1 | 0 | 3 |
4 | 0 | 3 | 0 | 0.0 | 0 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... |
886 | 0 | 2 | 0 | 0.0 | 0 | 0 | 1 |
887 | 1 | 1 | 1 | 0.0 | 0 | 0 | 2 |
888 | 0 | 3 | 1 | 0.0 | 1 | 2 | 2 |
889 | 1 | 1 | 0 | 0.0 | 0 | 0 | 2 |
890 | 0 | 3 | 0 | 0.0 | 0 | 0 | 0 |
891 rows × 7 columns