[python][mljar] automated machine learning - Part2 : mljar

by Chandler.j 2021. 5. 26.

fig1. title

Contents

1. mljar : automated machine learning

2. install - pip, conda

3. run code

4. report


1. mljar : automated machine learning

- A framework that automates machine learning (see the figure below)

fig2. summary of mljar

Reference: https://mljar.com/automated-machine-learning/


2. install - pip, conda

The package can be installed with pip, or with conda from the conda-forge channel.

pip install mljar-supervised

Reference: https://pypi.org/project/mljar-supervised/

conda install -c conda-forge mljar-supervised

Reference: https://anaconda.org/conda-forge/mljar-supervised


3. run code

Source code: https://github.com/mljar/mljar-supervised

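Three modes are commonly used:

1. Optuna, 2. Explain, 3. Perform. Pick the one that suits your purpose; in practice, just use whichever gives the best performance.

As far as I know, the modes differ in how much hyperparameter tuning they do.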

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from supervised.automl import AutoML
from sklearn.utils.class_weight import compute_sample_weight

# SEED is used below but was never defined in the original post; 42 is an assumed value
SEED = 42

# data load and split X, y
df=pd.read_csv("data/60model_train_share.csv")

X_train = df.drop('eGFR_ab', axis=1)
y_train = df['eGFR_ab'].astype("int64")

weights = compute_sample_weight(class_weight="balanced", y=y_train)

# 1. Optuna: tune CatBoost with Optuna
# (optuna_time_budget and total_time_limit are given in seconds)

automl = AutoML(mode="Optuna", ml_task="binary_classification",
    algorithms=["CatBoost"], eval_metric="auc",
    optuna_time_budget=10*60,
    total_time_limit=24*3600,
    golden_features=False,
    features_selection=False,
    train_ensemble=True,
    stack_models="auto",
    random_state=SEED, results_path="optuna")

automl.fit(X_train, y_train, sample_weight=weights)

# 2. Explain: compare several algorithms, without an ensemble

automl = AutoML(mode="Explain", ml_task="binary_classification",
    algorithms=["Baseline", "CatBoost", "Xgboost", "Random Forest", "Extra Trees", "LightGBM", "Neural Network"],
    eval_metric="auc",
    train_ensemble=False,
    random_state=SEED,
    results_path="explain-wt")

automl.fit(X_train, y_train, sample_weight=weights)

# 3. Perform: CatBoost and Xgboost only, no ensembling or stacking

automl = AutoML(mode="Perform", ml_task="binary_classification",
    algorithms=["CatBoost", "Xgboost"],
    eval_metric="auc",
    golden_features=False,
    features_selection=False,
    train_ensemble=False,
    stack_models=False,
    random_state=SEED,
    results_path="perform")

automl.fit(X_train, y_train, sample_weight=weights)
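
The `weights` passed to `fit` come from `compute_sample_weight(class_weight="balanced", ...)`, which gives each sample the weight n_samples / (n_classes * number of samples in its class), so the rare class counts for more. A minimal pure-Python sketch of that arithmetic follows; the function name and the example labels are made up for illustration and are not part of the original code.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Weight each sample by n_samples / (n_classes * count of its class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return [n_samples / (n_classes * counts[label]) for label in labels]


# Hypothetical labels: six negatives and two positives
y_example = [0, 0, 0, 0, 0, 0, 1, 1]
weights_example = balanced_sample_weights(y_example)

# Each class ends up with the same total weight
print(weights_example)
```

With these weights, the two positive samples together weigh as much as the six negative ones.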

4. report

It also runs in Jupyter and other environments, but Visual Studio Code works best for viewing the results.

fig3. report of mljar

 - The report comes out in a very easy-to-read form

 - You can pull out the best-performing model

 - You can get a good model in a relatively short time
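
As a rough sketch of pulling out the best model from a finished run, the snippet below reads a leaderboard file and returns the top entry. It assumes the results folder (for example `perform`) contains a `leaderboard.csv` with `name` and `metric_value` columns; that layout, the helper name `best_model_name`, and the sign convention of the metric are assumptions to check against your own output folder.

```python
import csv
import os


def best_model_name(leaderboard_path, higher_is_better=True):
    """Return (name, metric_value) of the best row in a leaderboard CSV."""
    with open(leaderboard_path, newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        raise ValueError("leaderboard is empty")
    pick = max if higher_is_better else min
    best = pick(rows, key=lambda row: float(row["metric_value"]))
    return best["name"], float(best["metric_value"])


# Assumed path: results_path="perform" from the run above
path = os.path.join("perform", "leaderboard.csv")
if os.path.exists(path):
    print(best_model_name(path))
```

With mljar-supervised installed, creating `AutoML(results_path="perform")` again and calling `predict` on new data should reuse the saved models rather than retrain them.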

 


Additional note

There is also a report that MLJAR performed well compared with other AutoML frameworks.

ref : https://towardsdatascience.com/compare-popular-automl-frameworks-on-10-tabular-kaggle-competitions-9b1420e8942d

