Almost every dataset contains missing values.
There are two basic ways to handle missing values:
- Delete them: deletion
- Fill them in: imputation
Imputation methods fall into two broad groups:
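As a minimal sketch of the two options, assuming pandas is available: the toy DataFrame and its column names (`age`, `income`) are made up for illustration and are not from the notes.

```python
import numpy as np
import pandas as pd

# Toy data invented for this example: two numeric columns with gaps.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "income": [50.0, 62.0, np.nan, 80.0],
})

# Deletion: drop every row that has at least one missing value.
deleted = df.dropna()

# Imputation: keep all rows and fill each gap with that column's mean.
imputed = df.fillna(df.mean())

print(deleted)
print(imputed)
```

Deletion keeps only fully observed rows (two of four here), while imputation keeps every row at the cost of inventing values.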
- simple
- multivariate
simple imputation (the strategy options of sklearn's SimpleImputer)
- If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.
- If “median”, then replace missing values using the median along each column. Can only be used with numeric data.
- If “most_frequent”, then replace missing values using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.
- If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.
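The four strategies above can be tried side by side with sklearn's `SimpleImputer`. This is only a sketch: the small array `X` and the choice of `fill_value=0.0` are assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numeric data invented for this example; np.nan marks missing entries.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 10.0],
    [6.0, 40.0],
])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    # Each strategy is computed independently per column.
    results[strategy] = SimpleImputer(strategy=strategy).fit_transform(X)

# "constant" fills every gap with fill_value (0.0 is an arbitrary choice here).
results["constant"] = SimpleImputer(
    strategy="constant", fill_value=0.0
).fit_transform(X)

for name, filled in results.items():
    print(name)
    print(filled)
```

In the first column every observed value occurs once, so "most_frequent" falls back to the smallest one, as the tie rule above describes.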
multivariate imputation (MICE, Multiple Imputation by Chained Equations)
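A more sophisticated approach than simple imputation.
Available through sklearn's IterativeImputer, which can be combined with four regression algorithms (passed as its estimator):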
- BayesianRidge: regularized linear regression
- DecisionTreeRegressor: non-linear regression
- ExtraTreesRegressor: similar to missForest in R
- KNeighborsRegressor: comparable to other KNN imputation approaches
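The four estimators listed above can each be plugged into `IterativeImputer` through its `estimator` parameter. The following is a rough sketch, not a benchmark: the synthetic data, the 10% missing rate, and the hyperparameters (`n_estimators=10`, `n_neighbors=15`, `max_iter=10`) are all assumptions chosen for illustration. `IterativeImputer` is experimental in sklearn, so it has to be enabled explicitly before import.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data invented for this example: the third column depends on
# the first two, so the other columns carry information about it.
rng = np.random.RandomState(0)
X_full = rng.normal(size=(100, 3))
X_full[:, 2] = X_full[:, 0] + X_full[:, 1] + 0.1 * rng.normal(size=100)

# Knock out roughly 10% of the entries at random.
X = X_full.copy()
mask = rng.rand(*X.shape) < 0.1
X[mask] = np.nan

estimators = {
    "BayesianRidge": BayesianRidge(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "ExtraTreesRegressor": ExtraTreesRegressor(n_estimators=10, random_state=0),
    "KNeighborsRegressor": KNeighborsRegressor(n_neighbors=15),
}

imputed = {}
for name, estimator in estimators.items():
    # Each feature with missing values is regressed on the other features,
    # cycling through the columns for up to max_iter rounds.
    imputer = IterativeImputer(estimator=estimator, max_iter=10, random_state=0)
    imputed[name] = imputer.fit_transform(X)
    # Mean absolute error on the entries that were removed.
    print(name, np.abs(imputed[name][mask] - X_full[mask]).mean())
```

Some estimators may emit a convergence warning within 10 iterations; that does not stop the imputation from completing.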
ref: https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html