[Note] Personal study

Regression

y = b0 + b1X1 + b2X2 + ... + bkXk

https://support.minitab.com/ko-kr/minitab/18/help-and-how-to/modeling-statistics/regression/how-to/fit-regression-model/interpret-the-results/all-statistics-and-graphs/regression-equation/

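R² : coefficient of determination. A goodness-of-fit measure ranging from 0 to 1 that indicates how useful the regression model is.

SST : total sum of squares, the total variation. Sum of squared deviations of each observed y from the mean of y.

SSE : explained sum of squares, the explained variation. Sum of squared deviations of the regression-estimated y from the mean of y.

SSR : residual sum of squares, the unexplained variation. Sum of squared residuals. (Some textbooks swap these two abbreviations, using SSR for regression and SSE for error.)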

https://datalabbit.tistory.com/54

https://igija.tistory.com/256
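
The relationship between these sums can be checked numerically. The following is a minimal sketch using a one-variable least-squares fit; the data and the function names (fit_simple_ols, sums_of_squares) are made up for illustration, and the naming follows these notes (SSE = explained, SSR = residual).

```python
def fit_simple_ols(x, y):
    """Least-squares fit of y = b0 + b1*x for a single predictor."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


def sums_of_squares(y, y_hat):
    """Return (SST, SSE, SSR) with SSE = explained and SSR = residual."""
    y_mean = sum(y) / len(y)
    sst = sum((yi - y_mean) ** 2 for yi in y)
    sse = sum((fi - y_mean) ** 2 for fi in y_hat)
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return sst, sse, ssr


# Illustrative data only.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]
b0, b1 = fit_simple_ols(x, y)
y_hat = [b0 + b1 * xi for xi in x]

sst, sse, ssr = sums_of_squares(y, y_hat)
r2 = sse / sst  # equivalently 1 - ssr / sst for a least-squares fit with intercept
print(round(sst, 6), round(sse, 6), round(ssr, 6), round(r2, 6))
```

For a least-squares fit with an intercept, SST = SSE + SSR, so R² can be read as the share of total variation that the model explains.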

Leverage

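A measure of how much an individual observation influences the regression result, specifically how strongly its own observed y pulls its own fitted value.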

= self-influence, self-sensitivity

https://datascienceschool.net/03%20machine%20learning/05.03%20%EB%A0%88%EB%B2%84%EB%A6%AC%EC%A7%80%EC%99%80%20%EC%95%84%EC%9B%83%EB%9D%BC%EC%9D%B4%EC%96%B4.html
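
As a rough illustration, for a regression with one predictor and an intercept the leverage of observation i has the closed form h_i = 1/n + (x_i - mean(x))² / Sxx. The sketch below assumes that simple case; the function name and data are hypothetical.

```python
def leverages(x):
    """Leverage h_i = 1/n + (x_i - mean)^2 / Sxx for simple linear regression."""
    n = len(x)
    x_mean = sum(x) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    return [1 / n + (xi - x_mean) ** 2 / sxx for xi in x]


# Illustrative data: the last point lies far from the other x values.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
h = leverages(x)
print([round(v, 2) for v in h])  # the far-away point has the largest leverage
print(round(sum(h), 6))          # leverages sum to the number of parameters (2 here)
```

Leverage depends only on x, not on y, so a high-leverage point is not necessarily an outlier; it is a point that would move the fit a lot if its y were unusual.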

ridge (L2), lasso (L1, feature reduction), elastic net (L1 + L2)

Used to prevent overfitting.

https://tyami.github.io/machine%20learning/regularization-Ridge-Lasso-ElasticNet/
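
To see why lasso performs feature reduction while ridge only shrinks, here is a minimal sketch for a single feature without an intercept. It assumes the objectives (1/2)Σ(y - b·x)² + (lam/2)·b² for ridge and (1/2)Σ(y - b·x)² + lam·|b| for lasso; the function names and data are illustrative.

```python
def ridge_coef(x, y, lam):
    """Closed-form ridge coefficient for one feature, no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


def lasso_coef(x, y, lam):
    """Closed-form lasso coefficient for one feature (soft-thresholding)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    if abs(sxy) <= lam:
        return 0.0  # the coefficient is set exactly to zero
    sign = 1.0 if sxy > 0 else -1.0
    return sign * (abs(sxy) - lam) / sxx


# Illustrative data with a weak relationship (unpenalized slope = 0.1).
x = [1.0, 2.0, 3.0]
y = [0.1, 0.2, 0.3]
ridge_b = ridge_coef(x, y, lam=2.0)
lasso_b = lasso_coef(x, y, lam=2.0)
print(ridge_b, lasso_b)
```

With the same penalty strength, ridge leaves a small nonzero coefficient, while lasso removes the feature entirely.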

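Normalization : rescale values to [0, 1] (min-max scaling)

Standardization : rescale values to mean 0 and standard deviation 1 (z-score)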

https://realblack0.github.io/2020/03/29/normalization-standardization-regularization.html
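
The two rescalings can be sketched with the standard library alone. This is an illustrative example; the function names are made up, and the population standard deviation is an assumption.

```python
import statistics


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


values = [2.0, 4.0, 6.0, 8.0]
normalized = min_max_normalize(values)
standardized = standardize(values)
print(normalized)
print(standardized)
```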

performance metrics

sensitivity, specificity, precision, recall

https://en.wikipedia.org/wiki/Sensitivity_and_specificity
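
All four metrics come from the counts of a binary confusion matrix. A small sketch with made-up counts (the function name is hypothetical):

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the four metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true positive rate; identical to recall
    specificity = tn / (tn + fp)  # true negative rate
    precision = tp / (tp + fp)    # share of predicted positives that are correct
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "recall": sensitivity,
    }


# Illustrative counts only.
metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)
print(metrics)
```

Sensitivity and recall are two names for the same quantity; specificity and precision answer different questions (how well negatives are recognized vs how trustworthy a positive prediction is).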

odds ratio (OR)

https://bioinformaticsandme.tistory.com/287
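
The odds ratio compares the odds of an outcome in an exposed group with the odds in an unexposed group. A minimal sketch from a 2x2 table, with made-up counts and a hypothetical function name:

```python
def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Odds ratio = (odds of the outcome if exposed) / (odds if unexposed)."""
    odds_exposed = exposed_cases / exposed_noncases
    odds_unexposed = unexposed_cases / unexposed_noncases
    return odds_exposed / odds_unexposed


# Illustrative table: 20 of 100 exposed and 10 of 100 unexposed have the outcome.
or_value = odds_ratio(20, 80, 10, 90)
print(or_value)  # a value above 1 means higher odds in the exposed group
```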

support vector

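1) C (cost) : hyperparameter that controls how many outliers (margin violations) are tolerated around the decision boundary.

C↑ => fewer violations tolerated, overfit↑

C↓ => more violations tolerated, overfit↓ --> underfit↑

2) RBF kernel (Radial Basis Function).

gamma : controls how far the influence of a single data sample reaches; a larger gamma means a shorter reach.

gamma↑ => overfit↑

gamma↓ => overfit↓ --> underfit↑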

https://hleecaster.com/ml-svm-concept/
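
The effect of gamma can be seen directly in the RBF kernel, K(a, b) = exp(-gamma * ||a - b||²). The sketch below only evaluates the kernel; it does not train an SVM, and the points and gamma values are illustrative.

```python
import math


def rbf_kernel(a, b, gamma):
    """RBF kernel value between two points given as sequences of floats."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


point = (0.0, 0.0)
neighbor = (1.0, 0.0)  # at distance 1 from `point`

low_gamma = rbf_kernel(point, neighbor, gamma=0.1)    # influence still strong
high_gamma = rbf_kernel(point, neighbor, gamma=10.0)  # influence nearly gone
print(round(low_gamma, 4), round(high_gamma, 6))
```

With a large gamma each sample only affects its immediate surroundings, so the decision boundary can wrap tightly around individual points, which is why a large gamma tends toward overfitting.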

missing value

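MCAR (Missing Completely At Random) : missingness is unrelated to any variable, ex) a system error

MAR (Missing At Random) : missingness is related to other observed variables, ex) men being unable to answer survey items about women

MNAR (Missing Not At Random) : neither of the above; missingness is related to the missing value itself, ex) being unable to complete a survey because of depression

Handling

1. deletion

1) listwise : if a record contains any NA value, delete the whole record

2) pairwise : drop a record only from the analyses that use the column where its NA occurs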

2. imputation

1) categorical

random
NA
most frequent
KNN
deep learning

2) continuous

mean
median
MICE (Multivariate Imputation by Chained Equations)
Random Forest
deep learning
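
The simplest of these options can be sketched with the standard library. This is an illustrative example in which None stands for NA; the function names and data are hypothetical.

```python
import statistics


def listwise_delete(rows):
    """Drop every record that contains at least one missing value."""
    return [row for row in rows if None not in row]


def impute(column, how):
    """Fill missing entries with the mean, median, or most frequent observed value."""
    observed = [v for v in column if v is not None]
    fill_functions = {
        "mean": statistics.mean,
        "median": statistics.median,
        "most_frequent": statistics.mode,
    }
    fill = fill_functions[how](observed)
    return [fill if v is None else v for v in column]


rows = [(20, "red"), (None, "blue"), (30, None), (100, "red")]
ages = [row[0] for row in rows]
colors = [row[1] for row in rows]

print(listwise_delete(rows))            # only the complete records remain
print(impute(ages, "mean"))             # pulled upward by the extreme value 100
print(impute(ages, "median"))           # less affected by the extreme value
print(impute(colors, "most_frequent"))  # categorical column
```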

About seeds

random numbers : on a computer, "random" means pseudo-random

A perfectly random number cannot be generated, but numbers that are close to random can.

Generation method - middle-square method

PRNG (Pseudorandom Number Generator)
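
The middle-square method is simple enough to sketch in a few lines: square the current number, pad the result to twice the digit count, and take the middle digits as the next number. The seed and function name below are illustrative, and the method is shown only as a historical example, not as a generator to use in practice.

```python
def middle_square(seed, count, digits=4):
    """Generate `count` pseudo-random numbers with the middle-square method."""
    values = []
    state = seed
    for _ in range(count):
        squared = str(state * state).zfill(2 * digits)  # pad to 2 * digits
        start = digits // 2
        state = int(squared[start:start + digits])      # keep the middle digits
        values.append(state)
    return values


print(middle_square(1234, 3))  # [5227, 3215, 3362]
```

The same seed always yields the same sequence, which is exactly the property that makes results reproducible when a seed is fixed.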

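Typical sources of randomness

1) random number table + date/time

2) external noise, ex) atmospheric noise, hardware noise

Where the seed matters when working with big data

Everything that involves randomness

1) train test split

2) data order

3) algorithm : initial weights

4) sampling

0 and 42 are the most common choices (42 comes from the science-fiction novel The Hitchhiker's Guide to the Galaxy)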
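
A small sketch of what fixing the seed buys: the same seed reproduces the same shuffled data order. It uses Python's built-in random module; the function name is hypothetical.

```python
import random


def shuffled_order(n, seed):
    """Return the indices 0..n-1 in a pseudo-random order determined by `seed`."""
    rng = random.Random(seed)  # independent generator, does not touch global state
    order = list(range(n))
    rng.shuffle(order)
    return order


first = shuffled_order(10, seed=42)
second = shuffled_order(10, seed=42)
print(first == second)  # True: same seed, same order
```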

Train-Test-Split

stratify : sample by strata so that the outcome ratio is preserved (the goal is to split train and test so that both follow the population distribution)
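
The idea behind stratification can be sketched without any library: split each outcome class separately, then merge. This is a simplified illustration, not how any particular library implements it; the function name, test ratio, and seed are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_ratio=0.25, seed=42):
    """Return (train_indices, test_indices) that preserve the class ratio."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    train_indices, test_indices = [], []
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)  # same share from every class
        test_indices.extend(indices[:n_test])
        train_indices.extend(indices[n_test:])
    return sorted(train_indices), sorted(test_indices)


# Illustrative imbalanced outcome: 80% class 0, 20% class 1.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
test_positive_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_positive_share)
```

Both splits keep the original 20% share of class 1, which a purely random split would only achieve approximately.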

genetic programming

A tool for selecting algorithms.

natural selection vs artificial selection (selective breeding)

natural selection : in a given environment, populations with traits suited to survival survive and reproduce better than populations without them.

https://ko.wikipedia.org/wiki/%EC%9C%A0%EC%A0%84_%ED%94%84%EB%A1%9C%EA%B7%B8%EB%9E%98%EB%B0%8D

TPOT (auto-ML tool)

http://epistasislab.github.io/tpot/

hyperparameter

1) generations : number of iterations to run the pipeline optimization process, default=100

2) population_size : number of individuals retained in every generation, default=100

3) cv : Cross-Validation, default=5

iterative imputation vs simple imputer

MICE (Multivariate Imputation by Chained Equations)

estimator

BayesianRidge - linear regression
ExtraTreesRegressor - non-linear
KNeighborsRegressor - non-linear
DecisionTreeRegressor - non-linear

output

change : how much the imputed values changed in the latest iteration
scaled tolerance : the target

Iteration continues until change falls below the scaled tolerance.

Discretization

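python : KBinsDiscretizer

hyperparameter

1) n_bins : number of groups to split into

2) encode : onehot-dense (other options: onehot, ordinal)

3) strategy

uniform : equal-width bins (if n==2, the cut is the midpoint of min and max)
quantile : equal-frequency bins (if n==2, the cut is the median)
kmeans : bins follow 1D k-means clusters (each value goes to the bin of its nearest cluster center)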
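
The difference between the uniform and quantile strategies is easiest to see with two bins. The sketch below reimplements only the two-bin cut point with the standard library; it does not call KBinsDiscretizer, and the function names and data are illustrative.

```python
import statistics


def two_bin_cut(values, strategy):
    """Cut point that splits `values` into two bins under the given strategy."""
    if strategy == "uniform":
        return (min(values) + max(values)) / 2  # equal-width: midpoint of the range
    if strategy == "quantile":
        return statistics.median(values)        # equal-frequency: the median
    raise ValueError(f"unsupported strategy: {strategy}")


def assign_bins(values, cut):
    """Bin 0 for values below the cut, bin 1 otherwise."""
    return [0 if v < cut else 1 for v in values]


# Illustrative skewed data with one extreme value.
values = [1, 2, 3, 4, 100]
uniform_cut = two_bin_cut(values, "uniform")
quantile_cut = two_bin_cut(values, "quantile")
print(uniform_cut, assign_bins(values, uniform_cut))
print(quantile_cut, assign_bins(values, quantile_cut))
```

On skewed data the uniform strategy can leave almost everything in one bin, while the quantile strategy keeps the bins roughly balanced.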

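Shapley value (SHAP)

In the summary plot, by default red marks high feature values and blue marks low feature values; a point to the right of zero has a positive impact on the outcome and a point to the left has a negative impact.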

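Regression model evaluation metrics

MSE (Mean Squared Error) : the unit is squared, and it is sensitive to outliers
RMSE (Root Mean Squared Error) : same unit as y, but still sensitive to outliers
MAE (Mean Absolute Error) : the most intuitive metric; same unit as y, represents the error well, and is robust to outliers
MAPE (Mean Absolute Percentage Error) : shows the model's overall error as a percentage of y
MPE (Mean Percentage Error) : the sign shows whether the model tends to under-predict or over-predict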

https://machinelearningmastery.com/regression-metrics-for-machine-learning/
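
These metrics are short enough to compute by hand. The sketch below assumes the error is defined as actual minus predicted, so a negative MPE means the model over-predicts on average; the data and function name are made up.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, MAPE (%) and MPE (%) for paired values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]  # actual - predicted
    mse = sum(e ** 2 for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mape = 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    mpe = 100 * sum(e / t for e, t in zip(errors, y_true)) / n
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "MAPE": mape, "MPE": mpe}


# Illustrative values only; y_true must not contain zeros for MAPE and MPE.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 330.0, 360.0]
scores = regression_metrics(y_true, y_pred)
print(scores)
```

The single large error (400 vs 360) dominates MSE and RMSE far more than MAE, which is the outlier sensitivity described above.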

Understanding the ROC curve

sensitivity-biased (bulging upward) vs specificity-biased (bulging sideways)

https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
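
An ROC curve is built by sweeping the decision threshold from high to low and recording (1 - specificity, sensitivity) at each step. The following is a minimal sketch that assumes all scores are distinct; the labels, scores, and function names are illustrative.

```python
def roc_points(labels, scores):
    """Return ROC points (false positive rate, true positive rate)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold one sample at a time, highest score first.
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
print(curve)
print(round(auc(curve), 4))
```

The AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative; here 8 of the 9 positive-negative pairs are ordered correctly.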

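Statistics and machine learning

Advances in statistics made artificial intelligence possible.

If one difference has to be named, it is the intended use in practice.

Statistics : focuses on the group and on understanding a phenomenon (the group represents the individual).

Machine learning : focuses on the individual; it computes every case that can be tailored to an individual => well suited to practical use (data large enough to represent individuals, i.e. big data, makes it possible to train a model on every possible case).

So rather than choosing one over the other, use whichever fits the purpose.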
