When building machine learning models on an imbalanced dataset, there are two main approaches:
1. Scale the class weights during hyperparameter tuning
2. Oversampling
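The first option boils down to giving minority-class errors a larger weight. A minimal sketch, assuming binary labels: it computes the "balanced" heuristic n_samples / (n_classes * class_count) and the negative/positive ratio that is often used as a starting value for a positive-class weight. The function names and toy labels are hypothetical and only for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def positive_scale_weight(labels, positive=1):
    """Ratio of negative to positive samples (assumes binary labels)."""
    counts = Counter(labels)
    n_pos = counts[positive]
    n_neg = len(labels) - n_pos
    return n_neg / n_pos


# Toy labels (hypothetical): 95 negatives, 5 positives
y_toy = [0] * 95 + [1] * 5
weights = balanced_class_weights(y_toy)
scale = positive_scale_weight(y_toy)
print(weights)
print(scale)
```

The rarer the class, the larger its weight, so the model is penalized more for misclassifying it. The actual value is usually still tuned together with the other hyperparameters.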
Among oversampling techniques, SMOTE (Synthetic Minority Oversampling Technique) is the most commonly used.
Here we oversample with a related technique, ADASYN (Adaptive Synthetic Sampling Approach for Imbalanced Learning).
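What makes ADASYN "adaptive" is that minority points surrounded by more majority points receive a larger share of the synthetic samples. The sketch below illustrates only that allocation step, not the interpolation that creates the new points, and it is not imbalanced-learn's implementation. The function name, toy data, and parameters are hypothetical.

```python
import math


def adasyn_allocation(X, y, minority=1, k=5, n_new=10):
    """Return how many synthetic samples each minority point would get.

    A minority point whose k nearest neighbours contain more majority
    points gets a larger share of n_new.
    """
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    ratios = []
    for i in minority_idx:
        # k nearest neighbours of X[i] among all other points
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )[:k]
        n_majority = sum(1 for j in neighbours if y[j] != minority)
        ratios.append(n_majority / k)
    total = sum(ratios)
    if total == 0:
        # No minority point has majority neighbours: nothing to adapt to
        return {i: 0 for i in minority_idx}
    # Normalize the ratios and split n_new proportionally
    return {i: round(n_new * r / total) for i, r in zip(minority_idx, ratios)}


# Toy data (hypothetical): index 4 is a minority point inside the majority
# cluster, indices 5-7 form a separate minority cluster
X_toy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
         (0.05, 0.05),
         (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]
allocation = adasyn_allocation(X_toy, y_toy, minority=1, k=2, n_new=10)
print(allocation)
```

In this toy case all new samples go to the minority point sitting inside the majority cluster, because the other minority points only have minority neighbours.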
#1 Data preparation
# Check column dtypes and non-null counts of the imputed training features
X_train_features_imputed.info()
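#2 Oversampling with ADASYN
- sampling_strategy: the target ratio (as a float, minority count / majority count after resampling)
- n_neighbors (default=5): the k used for the nearest-neighbor search
- random_state: seed for reproducibility
- These three settings are the ones to watch; see the docs URL below for the rest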
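To get a feel for what a float sampling_strategy asks for, the sketch below estimates how many minority samples would have to be added to reach a given minority/majority ratio. It assumes a binary problem. The helper name and toy counts are hypothetical, and ADASYN itself only hits this number approximately. As far as I understand, imbalanced-learn raises an error when the requested ratio is not above the current one, which the sketch mimics.

```python
from collections import Counter


def synthetic_samples_needed(labels, ratio, minority=1):
    """Approximate number of minority samples to add so that
    minority / majority is about `ratio` (binary labels assumed)."""
    counts = Counter(labels)
    n_min = counts[minority]
    n_maj = len(labels) - n_min
    n_new = round(ratio * n_maj) - n_min
    if n_new <= 0:
        raise ValueError("ratio must be above the current minority/majority ratio")
    return n_new


# Toy labels (hypothetical): 1000 majority, 5 minority, current ratio 0.005
y_toy = [0] * 1000 + [1] * 5
needed = synthetic_samples_needed(y_toy, ratio=0.02)
print(needed)
```

With sampling_strategy=0.02, the minority class only grows to about 2% of the majority class, so the data stays heavily imbalanced. A value closer to 1.0 gives a nearly balanced set.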
from collections import Counter
from imblearn.over_sampling import ADASYN
SEED = 42
# Class counts before resampling
print('Original dataset shape %s' % Counter(y_train))
# sampling_strategy=0.02: target minority/majority ratio of 2% after resampling
ada = ADASYN(sampling_strategy=0.02, random_state=SEED)
X_res, y_res = ada.fit_resample(X_train_features_imputed, y_train)
# Class counts after resampling
print('Resampled dataset shape %s' % Counter(y_res))
ADASYN docs: imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.ADASYN.html#imblearn.over_sampling.ADASYN
#3 Result visualization
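# Visualize the oversampling result
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Fit PCA on the original data only, then project both sets onto the same
# two components so the two plots share the same axes
pca = PCA(n_components=2)
X_vis = pca.fit_transform(X_train_features_imputed)
X_res_vis = pca.transform(X_res)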
# Two subplots, unpack the axes array immediately
# (figsize has to be set when the figure is created)
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 10))
c0 = ax1.scatter(X_vis[y_train == 0, 0], X_vis[y_train == 0, 1], label="Class #0",
                 alpha=0.5)
c1 = ax1.scatter(X_vis[y_train == 1, 0], X_vis[y_train == 1, 1], label="Class #1",
                 alpha=0.5)
ax1.set_title('Original set')
ax2.scatter(X_res_vis[y_res == 0, 0], X_res_vis[y_res == 0, 1],
            label="Class #0", alpha=.5)
ax2.scatter(X_res_vis[y_res == 1, 0], X_res_vis[y_res == 1, 1],
            label="Class #1", alpha=.5)
ax2.set_title('ADASYN')
# Tidy up the axes
for ax in (ax1, ax2):
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    ax.spines['left'].set_position(('outward', 10))
    ax.spines['bottom'].set_position(('outward', 10))
    ax.set_xlim([-6, 8])
    ax.set_ylim([-6, 6])
plt.figlegend((c0, c1), ('Class #0', 'Class #1'), loc='lower center',
              ncol=2, labelspacing=0.)
plt.tight_layout(pad=3)
plt.show()