How to properly pickle sklearn pipeline when using custom transformer

Question

I found a pretty straightforward solution. Assuming you are using Jupyter notebooks for training:

Create a .py file where the custom transformer is defined and import it to the Jupyter notebook.

This is the file custom_transformer.py

from sklearn.pipeline import TransformerMixin

class FilterOutBigValuesTransformer(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        self.biggest_value = X.c1.max()
        return self

    def transform(self, X):
        return X.loc[X.c1 <= self.biggest_value]

Train your model importing this class from the .py file and save it using joblib.

import joblib
from custom_transformer import FilterOutBigValuesTransformer
from sklearn.externals import joblib
from sklearn.preprocessing import MinMaxScaler

pipeline = Pipeline([
    ('filter', FilterOutBigValuesTransformer()),
    ('encode', MinMaxScaler()),
])

X=load_some_pandas_dataframe()
pipeline.fit(X)

joblib.dump(pipeline, 'pipeline.pkl')

When loading the .pkl file in a different python script, you will have to import the .py file in order to make it work:

import joblib
from utils import custom_transformer # decided to save it in a utils directory

pipeline = joblib.load('pipeline.pkl')

Leave a Comment Cancel reply