
[python][TPOT] automated machine learning - Part1 : TPOT(GPU)

by Chandler.j 2021. 4. 26.

 


- Setting up an XGBoost environment

#0. TPOT and XGBoost

#1. CPU : conda install

#2. Single GPU : cuml

#3. Multi GPU : rapids, conda install tpot and dask_ml


#0 TPOT and XGBoost

#0.0 TPOT for Automated Machine Learning

 

Automated Machine Learning (AutoML)

A technique that automatically searches for well-performing models for a predictive modeling task, with very little user intervention

 

TPOT

- An open-source library for performing AutoML in Python

- Uses the popular scikit-learn machine learning library for data transformations and machine learning algorithms

- Uses a stochastic global search procedure based on genetic programming

- Efficiently searches for the best-performing model pipeline for a given dataset

fig2. Overview of the TPOT pipeline search. Taken from: Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science, 2016.

Documentation: epistasislab.github.io/tpot/

 


Tutorial: machinelearningmastery.com/tpot-for-automated-machine-learning-in-python/

 


#0.1 XGBoost

XGBoost

- An optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable

- Implements machine learning algorithms under the gradient boosting framework

- Provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately

- The same code runs on major distributed environments (Hadoop, SGE, MPI)
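Before wiring XGBoost into TPOT, here is a minimal stand-alone sketch of its scikit-learn-style API (toy data and parameter values are my own, not from this post):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Toy binary classification data (hypothetical; replace with your own)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# XGBClassifier follows the scikit-learn estimator interface,
# which is what lets TPOT tune it like any other sklearn model
model = XGBClassifier(n_estimators=100, max_depth=3)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # mean accuracy on the held-out split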

 

Documentation: xgboost.readthedocs.io/en/latest/

 



1. We want to build a machine learning model with the high-performing XGBoost algorithm.

2. However, XGBoost has so many hyperparameters that tuning them by hand is not easy.

3. So let's use TPOT to find the best hyperparameters.


#1 CPU : conda install

 

Install by following this guide: epistasislab.github.io/tpot/installing/

 


I prefer conda.

 

#1.0 TPOT install

conda install -c conda-forge tpot

#1.1 XGBoost install

conda install -c conda-forge xgboost

#1.2 Modify the config

import numpy as np

classifier_config_dict = {

    # xgboost, CPU only

    'xgboost.XGBClassifier': {
        'n_estimators': [100, 250, 500, 750, 1000],
        'learning_rate': [1e-2, 1e-1, 0.3],
        'max_depth': range(2, 11),
        'min_child_weight': range(1, 21),
        'gamma': np.arange(0, 1.01, 0.2),
        'subsample': np.arange(0.2, 1.01, 0.2),
        'colsample_bytree': np.arange(0.4, 1.01, 0.2),
        'reg_alpha': [0, 0.25, 0.5, 0.75, 1],
        'reg_lambda': [1, 2, 4, 6, 8],
        'scale_pos_weight': [estimate],  # user-supplied class-imbalance ratio, see below
        'objective': ['binary:logistic'],
        'n_jobs': [1],
        'verbosity': [0]
    },

}

The goal is just to find the best parameters for XGBoost, so only the classifier config is modified, with a reasonable search range for each hyperparameter.
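One caveat: `estimate` in the dict above is not defined anywhere in this post. A common heuristic for scale_pos_weight (my assumption, not something the original shows) is the ratio of negative to positive samples:

import numpy as np

# Hypothetical helper: y_train is assumed to hold binary 0/1 labels.
# scale_pos_weight is typically set to (negative count) / (positive count).
estimate = float(np.sum(y_train == 0)) / float(np.sum(y_train == 1))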

XGBoost parameter reference: xgboost.readthedocs.io/en/latest/parameter.html#

 


#1.3 Using TPOT

from tpot import TPOTClassifier

# SEED, X_train_features, and y_train are assumed to be defined by the user
tpot = TPOTClassifier(scoring="roc_auc",
                      cv=5,
                      random_state=SEED,
                      n_jobs=4,
                      verbosity=3,
                      generations=100,
                      population_size=100,
                      use_dask=False,
                      warm_start=False,
                      config_dict=classifier_config_dict,
                      template='Classifier')

training_features = X_train_features.copy(deep=True)
tpot.fit(training_features, y_train)
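Once the search finishes, the best pipeline can be inspected and scored on held-out data (a sketch I added; X_test_features and y_test are assumed to be the test-split counterparts of the variables above):

# Score the winning pipeline on held-out data and print its structure
print(tpot.score(X_test_features, y_test))
print(tpot.fitted_pipeline_)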

Parameter reference: epistasislab.github.io/tpot/api/

 


#1.4 Export best hyperparameters

tpot.export('model_best_parameters.py')
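The export call writes a plain Python script containing the best pipeline found. Roughly, the generated file looks like the sketch below (illustrative only: the placeholder strings come from TPOT's export template, and the XGBClassifier values here are hypothetical, depending on what the search finds):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# NOTE: the outcome column is assumed to be labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'], random_state=None)

# Hypothetical winning pipeline; the actual hyperparameters are whatever TPOT found
exported_pipeline = XGBClassifier(learning_rate=0.1, max_depth=4, min_child_weight=3,
                                  n_estimators=500, objective='binary:logistic')
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)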

 

 

#2 Single GPU : cuml

#2.0 TPOT-cuml install

conda env create -f tpot-cuml.yml -n tpot-cuml

Reference: epistasislab.github.io/tpot/installing/#installation-for-using-tpot-cuml-configuration
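The tpot-cuml.yml environment file ships with the TPOT repository (per the linked guide); if it is not already present locally, something like the following should fetch it before the conda env create step above, and the new environment then needs to be activated (the raw URL is my assumption from the repo layout):

wget https://raw.githubusercontent.com/EpistasisLab/tpot/master/tpot-cuml.yml
conda activate tpot-cuml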

 


#2.1 Modify the config

Add the tree_method option 'gpu_hist' to the XGBoost entry:

import numpy as np

classifier_config_dict = {

    # xgboost with tree_method = 'gpu_hist'

    'xgboost.XGBClassifier': {
        'n_estimators': [100, 250, 500, 750, 1000],
        'learning_rate': [1e-2, 1e-1, 0.3],
        'max_depth': range(2, 11),
        'min_child_weight': range(1, 21),
        'gamma': np.arange(0, 1.01, 0.2),
        'subsample': np.arange(0.2, 1.01, 0.2),
        'colsample_bytree': np.arange(0.4, 1.01, 0.2),
        'reg_alpha': [0, 0.25, 0.5, 0.75, 1],
        'reg_lambda': [1, 2, 4, 6, 8],
        'scale_pos_weight': [estimate],  # user-supplied class-imbalance ratio, as in #1.2
        'objective': ['binary:logistic'],
        'tree_method': ['gpu_hist'],
        'n_jobs': [1],
        'verbosity': [0]
    },

}

Config reference: github.com/EpistasisLab/tpot/blob/master/tpot/config/classifier_cuml.py
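As the linked config file suggests, recent TPOT releases also bundle this cuML/GPU configuration, so it can be selected by name instead of maintaining a hand-written dict (a minimal sketch; the other constructor arguments would be as in #1.3):

from tpot import TPOTClassifier

# Built-in GPU configuration selected by name rather than a custom dict
tpot = TPOTClassifier(config_dict='TPOT cuML',
                      generations=100,
                      population_size=100,
                      verbosity=3)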

 


#2.2 Using TPOT and export output

from tpot import TPOTClassifier

# Same call as in #1.3; the GPU is used inside XGBoost via tree_method='gpu_hist'
tpot = TPOTClassifier(scoring="roc_auc",
                      cv=5,
                      random_state=SEED,
                      n_jobs=4,
                      verbosity=3,
                      generations=100,
                      population_size=100,
                      use_dask=False,
                      warm_start=False,
                      config_dict=classifier_config_dict,
                      template='Classifier')

training_features = X_train_features.copy(deep=True)
tpot.fit(training_features, y_train)

tpot.export('model_best_parameters.py')

 

#3 Multi GPU : rapids, conda install tpot and dask_ml

#3.0 Install RAPIDS

Reference: rapids.ai/start.html#get-rapids

 


On the RAPIDS page, select your version, then copy and paste the generated command.

 

#3.1 Install TPOT and dask-ml

conda install -c conda-forge tpot dask-ml

 

#3.2 Dask setting

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# Create a Dask Cluster with one worker per GPU
cluster = LocalCUDACluster()
client = Client(cluster)
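A quick sanity check I like to add: printing the client should show one worker per detected GPU.

# One Dask-CUDA worker is spawned per visible GPU
print(client)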

Reference: dask-cuda.readthedocs.io/en/latest/quickstart.html

 


#3.3 Modify the config

Add the tree_method option 'gpu_hist' to the XGBoost entry, exactly as in #2.1:

import numpy as np

classifier_config_dict = {

    # xgboost with tree_method = 'gpu_hist'

    'xgboost.XGBClassifier': {
        'n_estimators': [100, 250, 500, 750, 1000],
        'learning_rate': [1e-2, 1e-1, 0.3],
        'max_depth': range(2, 11),
        'min_child_weight': range(1, 21),
        'gamma': np.arange(0, 1.01, 0.2),
        'subsample': np.arange(0.2, 1.01, 0.2),
        'colsample_bytree': np.arange(0.4, 1.01, 0.2),
        'reg_alpha': [0, 0.25, 0.5, 0.75, 1],
        'reg_lambda': [1, 2, 4, 6, 8],
        'scale_pos_weight': [estimate],  # user-supplied class-imbalance ratio, as in #1.2
        'objective': ['binary:logistic'],
        'tree_method': ['gpu_hist'],
        'n_jobs': [1],
        'verbosity': [0]
    },

}

#3.4 Using TPOT and export output

You must set use_dask=True.

from tpot import TPOTClassifier

# With use_dask=True, TPOT distributes pipeline evaluations
# across the Dask-CUDA cluster created in #3.2
tpot = TPOTClassifier(scoring="roc_auc",
                      cv=5,
                      random_state=SEED,
                      n_jobs=4,
                      verbosity=3,
                      generations=100,
                      population_size=100,
                      use_dask=True,
                      warm_start=False,
                      config_dict=classifier_config_dict,
                      template='Classifier')

training_features = X_train_features.copy(deep=True)
tpot.fit(training_features, y_train)

tpot.export('model_best_parameters.py')
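While the search runs, the Dask dashboard gives a live view of the GPU workers (my addition):

# URL of the Dask dashboard for monitoring the cluster during tpot.fit()
print(client.dashboard_link)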

 

 

 

 

 

 

