[ML-16]Catboost

16. Catboost(Categorical Boost)

what?

ordering-principle을 통해 data-leakage로 인한 prediction-shift 문제를 해결하고 high cardinality를 가진 categorical 변수에 대한 전처리 문제를 해결한 Gradient Boosting 알고리즘
Catboost는 Ordered TBS와 Ordered Boosting을 적용하여 위와 같은 문제를 해결하였다.

1) Ordered TBS(Target-based Statistics)

기존에는 categorical를 변환하기 위해 mean encoding을 사용하였다. 예를 들어, X 변수에 {A, A, A, B, B}가 있고, Target이 {1, 0, 0, 1, 0}이면, A 범주에서 Target 비율은 0.3333, B 범주에서 Target 비율은 0.5 이므로, {0.3333, 0.3333, 0.3333, 0.5, 0.5}로 변환이 된다.
하지만 이로 인해 target label에 대한 정보가 유출되는 target leakage가 발생하고 train 데이터와 test 데이터가 서로 다른 분포를 가지게 되어 성능이 떨어지게 된다.
이를 해결하기 위해 categrocial 변수를 변환하기 전에 랜덤으로 섞은 후 아래와 같은 식으로 변환한다. \(avg target = \frac {countInClass + prior} {totalCount+1}\)
denote
avgtarget : 변환하고자 하는 단일 categorical value
countInClass : 현재 행에 있는 범주 기준으로 이전까지 target ‘1’의 갯수
prior : 미리 지정한 paramter
totalCount : 현재 행에 있는 범주 기준으로 이전까지 해당 범주 갯수

2) Ordered Boosting

일반 GBM은 다음 스텝의 새로운 트리를 만들 때 현재 모델에서 쓰인 데이터를 Gradient로 다시 쓰기 때문에 Overfitting이 발생하는 Prediction shift 현상이 나타난다.
이를 해결하기 위해 기존의 tree structure를 고른 후 leaf 구하는 방식을 반대로 해서 leaf 구하고 tree structure 고르는 과정을 통해 해결한다.
1. (t-1)번째 트리를 학습시키고
2. 이를 토대로 t번째 데이터를 (t-1)번째 트리를 통해 예측하여
3. t번째 residual 값을 update하여 t번째 모형을 만든다
이를 위해서는 t번째 트리를 만들기 위해서 그 전의 트리들을 메모리에 저장해야 하고 이를 위해서 oblivious tree 구조를 사용하였다. 이는 기존 xgboost, lightgbm과 달리 depth-wise 접근법을 사용하여 다소 느리지만 overfitting을 방지하는데 효과적이다. 뿐만 아니라 정보를 이진화된 벡터로 저장하여 전체 tree 구조를 저장할 필요가 없어 메모리 면에서도 효율성을 가진다.

Cap 2019-10-09 18-28-28-184

why?

범주형 변수를 변환할 때 기존 방법대비 target leakage를 덜 했다

why not?

encoding을 할 때 배열된 순서에 따라 달라지기 때문에, 과연 경우의수를 몇번 뽑아서 학습을 해야 괜찮은지에 대한 기준이 애매하다는 점이 단점

how?

`Input`

1)Data(\(x_i, y_i\))
2) diffrentiable Loss function : \(L(y_i, \gamma) = \frac 1 2 \sum_{i=1}^n(y_i - \hat{y})^2\)
3) prior
4) preprocessing categorical values by Ordered TBS

`Step 1` Initialize model with a constant value

\(F_0(x) = \underset \gamma {argmin} \sum_{i=1}^nL(y_i, \gamma)\)

\(\gamma\) : predicted value
\(F_0(x)\) : initial predicted value(=average of all target values)

`Step 2` build trees

1) Compute \(r_{im} = - [\frac {\partial L(y_i, F(x_i))} {\partial F(x_i)}]_{F(x)=F_{m-1}(x)}\) for i = 1,…,n

\(r_{im}\) 계산 결과 \((y_i - \hat(y_i))\)라는 residual이 나온다. 이를 pseudo residual이라고 한다.
모든 샘플에 대하여 residual을 구한다.
denote \(r\) : residual
\(i\) : sample number
\(n\) : max sample number \(m\) : tree index

for m = 1 to M(tree number 1 to M)

2) Fit a regression tree to the \(r_{im}\) values and create terminal regions \(R_{jm}\), for j = 1,…,\(J_m\)

target label이 아닌 residual을 예측하기 위한 결정 트리를 만든다.
denote \(j\) : leaf node index
\(R_{jm}\) : m번째 트리의 j번째 leaf node

3) For j = 1,…,\(J_m\) compute \(r_{jm} = \underset \gamma {argmin} \underset {x_i \in R_{ij}}{\sum} L(y_i, F_{m-1}(x_i) + \gamma)\)

각 leaf node마다 residual에 대한 output을 계산
\(r_{im}\) : leaf node에 속한 샘플의 평균

4) Update \(F_M(x) = F_{m-1}(x) + \nu\sum_{j=1}^{J_m}r_{jm}I(x\in R_{jm})\)

predict target label for each samples
denote \(F_{m-1}(x)\) : last prediction
\(\nu\) : learning rate
\(r_{jm}I(x\in R_{jm})\) : sample x가 있는 leaf node의 \(r_{jm}\)(predicted residual)
작은 learning rate는 각각의 트리가 전체 예측에 미치는 영향력을 줄여준다. 이를 통해 gradient boosting은 장기적으로 높은 정확도를 가지게 된다.

5) Ordered Boosting

predict (m+1)th data by using \(F_{m}\) model and get \(r_{im}\)
2)번 과정으로 돌아감

`Step 3` output \(F_M(x)\)

Code usage

# data preparation
import pandas as pd, numpy as np, time
from sklearn.model_selection import train_test_split

data = pd.read_csv("flights.csv")
data = data.sample(frac = 0.1, random_state=10)

data = data[["MONTH","DAY","DAY_OF_WEEK","AIRLINE","FLIGHT_NUMBER","DESTINATION_AIRPORT",
                 "ORIGIN_AIRPORT","AIR_TIME", "DEPARTURE_TIME","DISTANCE","ARRIVAL_DELAY"]]
data.dropna(inplace=True)

data["ARRIVAL_DELAY"] = (data["ARRIVAL_DELAY"]>10)*1

cols = ["AIRLINE","FLIGHT_NUMBER","DESTINATION_AIRPORT","ORIGIN_AIRPORT"]
for item in cols:
    data[item] = data[item].astype("category").cat.codes +1
 
train, test, y_train, y_test = train_test_split(data.drop(["ARRIVAL_DELAY"], axis=1), data["ARRIVAL_DELAY"],
                                                random_state=10, test_size=0.25)

import catboost as cb
cat_features_index = [0,1,2,3,4,5,6]

def auc(m, train, test): 
    return (metrics.roc_auc_score(y_train,m.predict_proba(train)[:,1]),
                            metrics.roc_auc_score(y_test,m.predict_proba(test)[:,1]))

params = {'depth': [4, 7, 10],
          'learning_rate' : [0.03, 0.1, 0.15],
         'l2_leaf_reg': [1,4,9],
         'iterations': [300]}
cb = cb.CatBoostClassifier()
cb_model = GridSearchCV(cb, params, scoring="roc_auc", cv = 3)
cb_model.fit(train, y_train)

With Categorical features
clf = cb.CatBoostClassifier(eval_metric="AUC", depth=10, iterations= 500, l2_leaf_reg= 9, learning_rate= 0.15)
clf.fit(train,y_train)
auc(clf, train, test)

With Categorical features
clf = cb.CatBoostClassifier(eval_metric="AUC",one_hot_max_size=31, \
                            depth=10, iterations= 500, l2_leaf_reg= 9, learning_rate= 0.15)
clf.fit(train,y_train, cat_features= cat_features_index)
auc(clf, train, test)

Reference

Catboost 설명
Ordered TBS 잘 설명
Catboost 코드
사용한 데이터셋