Discretization with KBinsDiscretizer from sklearn.preprocessing
Useful for grouping a continuous variable into bins
Groups the rows using a single continuous column
The three strategies made easy to follow with n_bins=2
#1. strategy = 'uniform'
#2. strategy = 'quantile'
#3. strategy = 'kmeans'
Purpose: group the rows of the DataFrame y_train_pd by its 'pred-true' column
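#0. data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import KBinsDiscretizer
train_pt = pd.DataFrame(y_train_pd['pred-true'])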
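The DataFrame y_train_pd itself is not shown here, so the snippets cannot be run as-is. The sketch below builds a hypothetical stand-in so they can be tried end to end. The column names 'pred' and 'true', the random data, and the reading of 'pred-true' as prediction minus ground truth are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for y_train_pd; the real data is not part of this post.
rng = np.random.default_rng(0)
y_true = rng.normal(loc=0.0, scale=1.0, size=200)
y_pred = y_true + rng.normal(loc=0.5, scale=2.0, size=200)

y_train_pd = pd.DataFrame({'pred': y_pred, 'true': y_true})
# Assumption: 'pred-true' holds the difference prediction minus ground truth.
y_train_pd['pred-true'] = y_train_pd['pred'] - y_train_pd['true']

# Same construction as in the post: a one-column DataFrame,
# which gives KBinsDiscretizer the 2D input (n_samples, 1) it expects.
train_pt = pd.DataFrame(y_train_pd['pred-true'])
print(train_pt.shape)
```

Wrapping the column in pd.DataFrame keeps it two-dimensional; passing a plain Series would be one-dimensional and is rejected by the estimator.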
#1. strategy = 'uniform' : equal-width bins; with n_bins=2 the split is the midpoint of min and max (not the mean)
#discretizer strategy = uniform
est_uni = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='uniform')
est_uni.fit(train_pt)
Xt_uni=est_uni.transform(train_pt)
print("edges : ", est_uni.bin_edges_[0])
unique, counts = np.unique(Xt_uni, return_counts=True)
print(dict(zip(unique, counts)))
plt.figure(figsize=(5.5, 6))  # set the figure size before drawing
plt.hist(Xt_uni, bins='auto')
plt.xticks([0, 0.5, 1])
plt.show()
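Where the 'uniform' split falls can be checked on a small toy array. This is a minimal sketch with made-up, skewed data (not the data from this post): equal-width bins put the inner edge halfway between the minimum and the maximum, which is generally not the mean.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy, deliberately skewed data (illustrative only).
X = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])

est = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='uniform')
Xt = est.fit_transform(X)

edges = est.bin_edges_[0]                # [min, inner edge, max]
midpoint = (X.min() + X.max()) / 2       # halfway between min and max
mean = X.mean()

print("edges    :", edges)
print("midpoint :", midpoint)            # matches the inner edge
print("mean     :", mean)                # differs from the inner edge for skewed data
print("bins     :", Xt.ravel())
```

On this array only the value 10 lands in the upper bin, because the other four values lie below the midpoint of 5.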
#2. strategy = 'quantile' : with n_bins=2 the split is the median
#discretizer strategy = quantile
est_quan = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='quantile')
est_quan.fit(train_pt)
Xt_quan=est_quan.transform(train_pt)
print("edges : ", est_quan.bin_edges_[0])
unique_q, counts_q = np.unique(Xt_quan, return_counts=True)
print(dict(zip(unique_q, counts_q)))
plt.figure(figsize=(5.5, 6))  # set the figure size before drawing
plt.hist(Xt_quan, bins='auto')
plt.xticks([0, 0.5, 1])
plt.show()
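The 'quantile' split can be checked the same way. This is a minimal sketch on a made-up toy array (not the data from this post): with n_bins=2 the inner edge is the 50th percentile, so it should match np.median. Recent scikit-learn releases may print a FutureWarning about the default quantile method; it does not change the result for this example.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy, deliberately skewed data with an odd number of points (illustrative only).
X = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])

est = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='quantile')
Xt = est.fit_transform(X)

inner_edge = est.bin_edges_[0][1]        # the single split point for n_bins=2
median = np.median(X)

unique, counts = np.unique(Xt, return_counts=True)
print("inner edge :", inner_edge)
print("median     :", median)            # matches the inner edge
print("counts     :", dict(zip(unique, counts)))
```

Because the split sits at the median, the two bins hold roughly the same number of rows, regardless of how skewed the values are.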
#3. strategy = 'kmeans' : 1D k-means with two clusters; the split is the midpoint between the two cluster centers
#discretizer strategy = kmeans
est_km = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='kmeans')
est_km.fit(train_pt)
Xt_km=est_km.transform(train_pt)
print("edges : ", est_km.bin_edges_[0])
unique_k, counts_k = np.unique(Xt_km, return_counts=True)
print(dict(zip(unique_k, counts_k)))
plt.figure(figsize=(5.5, 6))  # set the figure size before drawing
plt.hist(Xt_km, bins='auto')
plt.xticks([0, 0.5, 1])
plt.show()
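For 'kmeans', the split can be related to the cluster centers. This is a minimal sketch on a made-up toy array (not the data from this post): the values are clustered in one dimension, and the inner edge is placed halfway between the two cluster centers. The centers are recovered here as the mean of the values in each bin rather than read from the estimator.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy, deliberately skewed data (illustrative only).
X = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])

est = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='kmeans')
Xt = est.fit_transform(X)

inner_edge = est.bin_edges_[0][1]

# Recover the two cluster centers as the mean of the values assigned to each bin.
labels = Xt.ravel()
center_low = X[labels == 0].mean()
center_high = X[labels == 1].mean()
halfway = (center_low + center_high) / 2

print("inner edge :", inner_edge)
print("centers    :", center_low, center_high)
print("halfway    :", halfway)           # matches the inner edge
```

On this array the three strategies give three different split points: 5 for 'uniform', 2 for 'quantile', and 5.75 for 'kmeans'.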
Details: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html