Use a metric after a classifier in a Pipeline

I am continuing to look into Pipeline. My aim is to run every step of machine learning through the pipeline alone, which should make it more flexible and easier to adapt to another use case. So this is what I do:

Step 1: Fill NaN values
Step 2: Transform categorical values into numbers
Step 3: Classifier
Step 4: GridSearch
Step 5: Add a metric (failed)

Here is my code:

import pandas as pd 
from sklearn.base import BaseEstimator, TransformerMixin 
from sklearn.feature_selection import SelectKBest 
from sklearn.preprocessing import LabelEncoder 
from sklearn.model_selection import GridSearchCV 
from sklearn.model_selection import train_test_split 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.pipeline import Pipeline 
from sklearn.metrics import roc_curve, auc 
import matplotlib.pyplot as plt 
from sklearn.metrics import confusion_matrix 
from sklearn.metrics import f1_score 


class FillNa(BaseEstimator, TransformerMixin):
    """Fill NaN: most frequent value for non-numeric columns, mean otherwise."""

    def transform(self, x, y=None):
        non_numerics_columns = x.columns.difference(
            x._get_numeric_data().columns)
        for column in x.columns:
            if column in non_numerics_columns:
                # Use x, not the global df, so the transformer only
                # looks at the data it is given.
                x.loc[:, column] = x.loc[:, column].fillna(
                    x[column].value_counts().idxmax())
            else:
                x.loc[:, column] = x.loc[:, column].fillna(
                    x.loc[:, column].mean())
        return x

    def fit(self, x, y=None):
        return self


class CategoricalToNumerical(BaseEstimator, TransformerMixin):

    def transform(self, x, y=None):
        non_numerics_columns = x.columns.difference(
            x._get_numeric_data().columns)
        le = LabelEncoder()
        for column in non_numerics_columns:
            x.loc[:, column] = x.loc[:, column].fillna(
                x.loc[:, column].value_counts().idxmax())
            le.fit(x.loc[:, column])
            x.loc[:, column] = le.transform(x.loc[:, column]).astype(int)
        return x

    def fit(self, x, y=None):
        return self


class Perf(BaseEstimator, TransformerMixin):

    def fit(self, clf, x, y, perf="all"):
        """Only for classifier model.

        Plot the ROC curve and return AUC, confusion matrix and F1 score
        for a fitted classifier and a df.
        perf is a comma-separated string naming the evals to compute.
        Example: perf='auc,roc,cm,f1' is equivalent to perf='all'.
        """
        evals = {}
        y_pred_proba = clf.predict_proba(x)[:, 1]
        y_pred = clf.predict(x)
        perf_list = perf.split(',')
        # ("all" or "roc") evaluates to "all", so test each name separately.
        if "all" in perf_list or "roc" in perf_list:
            fpr, tpr, _ = roc_curve(y, y_pred_proba)
            roc_auc = round(auc(fpr, tpr), 3)
            plt.style.use('bmh')
            plt.figure(figsize=(12, 9))
            plt.title('ROC Curve')
            plt.plot(fpr, tpr, 'b',
                     label='AUC = {}'.format(roc_auc))
            plt.legend(loc='lower right', borderpad=1, labelspacing=1,
                       prop={"size": 12}, facecolor='white')
            plt.plot([0, 1], [0, 1], 'r--')
            plt.xlim([-0.1, 1.])
            plt.ylim([-0.1, 1.])
            plt.ylabel('True Positive Rate')
            plt.xlabel('False Positive Rate')
            plt.show()

        if "all" in perf_list or "auc" in perf_list:
            fpr, tpr, _ = roc_curve(y, y_pred_proba)
            evals['auc'] = auc(fpr, tpr)

        if "all" in perf_list or "cm" in perf_list:
            evals['cm'] = confusion_matrix(y, y_pred)

        if "all" in perf_list or "f1" in perf_list:
            evals['f1'] = f1_score(y, y_pred)

        return evals


path = '~/proj/akd-doc/notebooks/data/' 
df = pd.read_csv(path + 'titanic_tuto.csv', sep=';') 
y = df.pop('Survival-Status').replace(to_replace=['dead', 'alive'], 
             value=[0., 1.]) 
X = df.copy() 
X_train, X_test, y_train, y_test = train_test_split(
    X.copy(), y.copy(), test_size=0.2, random_state=42) 

percent = 0.50 
nb_features = round(percent * df.shape[1]) + 1 
clf = RandomForestClassifier() 
pipeline = Pipeline([('fillna', FillNa()), 
        ('categorical_to_numerical', CategoricalToNumerical()), 
        ('features_selection', SelectKBest(k=nb_features)), 
        ('random_forest', clf), 
        ('perf', Perf())]) 

params = dict(random_forest__max_depth=list(range(8, 12)), 
       random_forest__n_estimators=list(range(30, 110, 10))) 
cv = GridSearchCV(pipeline, param_grid=params) 
cv.fit(X_train, y_train)

Step 5, adding the metric, is the part that fails. I know that printing a ROC curve here is not ideal, but that is not the problem right now. When I run this code, I get:

TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator Pipeline(steps=[('fillna', FillNa()), ('categorical_to_numerical', CategoricalToNumerical()), ('features_selection', SelectKBest(k=10, score_func=<function f_classif at 0x7f4ed4c3eae8>)), ('random_forest', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', 
      max_depth=None,...=1, oob_score=False, random_state=None, 
      verbose=0, warm_start=False)), ('perf', Perf())]) does not.

I am interested in any ideas ...


2017-05-04 Jeremie Guez

As the error states, you need to specify the scoring parameter in GridSearchCV.

Use

GridSearchCV(pipeline, param_grid=params, scoring = 'accuracy')
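
To get several metrics out of the same search, scoring can also be given as a list of scorer names, with refit naming the metric used to pick the best model. The following is only a minimal sketch, assuming scikit-learn 0.19 or later; the synthetic data from make_classification stands in for the Titanic file, the names demo_pipeline and demo_search are illustrative, and the grid is kept deliberately small. Note that the classifier is the last step of the pipeline here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic binary data standing in for the Titanic features (assumption).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, random_state=42)

# The classifier is the final step, so the pipeline can predict and score.
demo_pipeline = Pipeline([
    ('features_selection', SelectKBest(k=5)),
    ('random_forest', RandomForestClassifier(n_estimators=20,
                                             random_state=42)),
])

demo_search = GridSearchCV(
    demo_pipeline,
    param_grid={'random_forest__max_depth': [3, 5]},
    scoring=['accuracy', 'f1', 'roc_auc'],  # several metrics at once
    refit='f1',                             # metric used to select the best model
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# One mean cross-validated score per parameter combination and metric.
for metric in ('accuracy', 'f1', 'roc_auc'):
    print(metric, demo_search.cv_results_['mean_test_' + metric])
```

The per-split scores are available in cv_results_ as well, and demo_search.best_estimator_ is the pipeline refitted with the parameters that scored best on the refit metric.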

Edit (based on the questions in the comments):

If you want the ROC curve, AUC and F1 for the whole of X_train and y_train (and not for each split of GridSearchCV), it is better to keep the Perf class out of the pipeline.

pipeline = Pipeline([('fillna', FillNa()), 
        ('categorical_to_numerical', CategoricalToNumerical()), 
        ('features_selection', SelectKBest(k=nb_features)), 
        ('random_forest', clf)]) 

#Fit the data in the pipeline 
pipeline.fit(X_train, y_train) 

performance_meas = Perf() 
performance_meas.fit(pipeline, X_train, y_train)
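
The same idea can be applied to a held-out split, which usually gives a more honest estimate than scoring on the training data. Below is a minimal sketch without plotting, assuming a binary target; the helper name summarize_classifier and the synthetic data are illustrative and not part of the code above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split


def summarize_classifier(fitted_clf, X, y):
    """Return AUC, confusion matrix and F1 for a fitted binary classifier."""
    y_pred = fitted_clf.predict(X)
    # Probability of the positive class (second column of predict_proba).
    y_score = fitted_clf.predict_proba(X)[:, 1]
    return {
        'auc': roc_auc_score(y, y_score),
        'cm': confusion_matrix(y, y_pred),
        'f1': f1_score(y, y_pred),
    }


# Synthetic binary data standing in for the real features (assumption).
X_all, y_all = make_classification(n_samples=300, n_features=8,
                                   random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=30, random_state=0)
model.fit(X_tr, y_tr)

# Evaluate on the held-out split only.
report = summarize_classifier(model, X_te, y_te)
print(report)
```

A fitted Pipeline, or the best_estimator_ of a fitted GridSearchCV, can be passed in place of model, since both expose predict and predict_proba when the last step is a classifier.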


2017-05-04 15:35:56

Great! But it is not possible to plot my ROC curve this way, right? Will it be possible to get the accuracy and the F1 score in the same pipeline?

Yes, it is possible. Are you not getting the results? On further inspection of your code, it looks like it will throw another error even after this one is solved.

If I delete my class Perf and call cv = GridSearchCV(pipeline, param_grid=params, scoring='accuracy') followed by cv.fit(X_train, y_train), I get no error. Do we mean the same thing?
