Auto-Sklearn: AutoML in Python

Machine learning is the driving force of modern technology and smart applications. While highly efficient methods and implementations are broadly available, applying them successfully is hard, because a myriad of design decisions have to be made correctly before an ML pipeline achieves peak performance.
Such decisions include how to preprocess features (e.g. how to replace missing values), which model class to use (e.g. neural networks or boosted trees), and finally, how to set the hyperparameters of this model class (e.g. the learning rate and number of epochs). Manually searching this vast design space either requires a lot of experience, a lot of computing resources, or both. AutoML is here to help!
AutoML automatically finds well-performing machine learning pipelines and thus frees the human expert from this tedious task. This reduces the barrier to broadly apply machine learning and makes it available for everyone. In this post, we’ll have a look at the AutoML tool Auto-sklearn.
Auto-sklearn is an open-source tool, so we are happy to receive stars, pull requests, and issues: www.github.com/automl/auto-sklearn.
What you’ll get out of this post and what you’ll need to run the code
You’ll learn how to replace a manually designed scikit-learn pipeline with an Auto-sklearn estimator. We provide all code in this Colab Notebook.
Step 1: Load data
As a first step, we’ll use the built-in data loading method from scikit-learn to load the credit-g dataset and split it into train and test data.
import sklearn.datasets
import sklearn.model_selection

# We fetch the data using openml.org
X, y = sklearn.datasets.fetch_openml(data_id=31, return_X_y=True, as_frame=True)

# Split the data into train and test
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=0.4, random_state=42
)

X_train.info()

Step 2: Manually build a pipeline
Now, we turn to building our pipeline. We’ll use a Support Vector Machine (SVM). However, to get good performance with an SVM we need to preprocess the data: in particular, we apply one-hot encoding to deal with categorical values and scale the numerical features, which live on very different ranges (e.g. credit_amount goes up to 20,000, while duration does not exceed 80).
Note: For demonstration, we use the default hyperparameters set by scikit-learn for this pipeline; however, in practice, these need to be tuned to achieve top performance.
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Create the estimator using the default parameters from the library
estimator_svc = SVC(
    C=1.0,
    kernel='rbf',
    gamma='scale',
    shrinking=True,
    tol=1e-3,
    cache_size=200,
    verbose=False,
    max_iter=-1,
    random_state=42,
)

# Build and fit the pipeline: one-hot encode the categorical columns,
# pass the numerical columns through, then scale everything.
categorical_columns = [
    col for col in X_train.columns if X_train[col].dtype.name == 'category'
]
encoder = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_columns)
    ],
    remainder='passthrough',
)
pipeline_svc = Pipeline([
    ('encoder', encoder),
    ('scaler', StandardScaler()),
    ('svc', estimator_svc),
])
pipeline_svc.fit(X_train, y_train)
After constructing the pipeline and training it on the training data, we measure the performance on the test set and obtain an accuracy of 76.75%.
# Score the model
prediction = pipeline_svc.predict(X_test)
performance_svc = accuracy_score(y_test, prediction)
print(f"SVC performance is {performance_svc}")
We also tried other classifiers, such as a Gradient Boosting Classifier and a Decision Tree; their test accuracies were 73.5% and 70.75%, respectively.
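We don’t reproduce those pipelines in full here, but as a rough sketch (not necessarily the exact configuration behind the numbers above) they can be built the same way, by swapping out the final estimator and reusing the encoder defined above with scikit-learn’s default hyperparameters:

from sklearn.base import clone
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

# Tree-based models do not need feature scaling, but the categorical
# columns still have to be one-hot encoded. We clone the encoder so each
# pipeline gets its own, freshly fitted copy.
pipeline_gb = Pipeline([
    ('encoder', clone(encoder)),
    ('gb', GradientBoostingClassifier(random_state=42)),
])
pipeline_gb.fit(X_train, y_train)
print(f"GB performance is {accuracy_score(y_test, pipeline_gb.predict(X_test))}")

pipeline_dt = Pipeline([
    ('encoder', clone(encoder)),
    ('dt', DecisionTreeClassifier(random_state=42)),
])
pipeline_dt.fit(X_train, y_train)
print(f"DT performance is {accuracy_score(y_test, pipeline_dt.predict(X_test))}")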
Step 3: Use Auto-sklearn as a drop-in replacement
Finally, we’ll demonstrate how easy it is to use auto-sklearn as a drop-in replacement for the manually constructed estimator pipelines discussed above.
Instead of manually specifying a pipeline, we can just use the Auto-sklearn estimator object; all that’s left is to decide how much time and compute to spend on searching for the best pipeline. We set this limit to 5 minutes and a single CPU core. As we have a small dataset at hand, we also turn on cross-validation.
Note: Large datasets require more computational resources to achieve good results.
Afterwards, we’ll have an estimator object that can be handled like any scikit-learn estimator or pipeline and used to predict labels for new data; in this case, it achieves a test accuracy of 77.5%, better than the manually designed pipeline and without any need for manual work.
import autosklearn.classification

# Create and train the estimator
estimator_askl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,
    seed=42,
    resampling_strategy='cv',
    n_jobs=1,
)
estimator_askl.fit(X_train, y_train)

# Score the model
prediction = estimator_askl.predict(X_test)
performance_askl = accuracy_score(y_test, prediction)
print(f"Auto-Sklearn Classifier performance is {performance_askl}")
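To get a quick overview of what happened during those 5 minutes, the fitted estimator provides a short text summary (the exact content varies from run to run and between Auto-sklearn versions):

# Print a short summary of the search: the metric used, the number of
# successful and failed runs, and the best validation score found.
print(estimator_askl.sprint_statistics())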
Wrapping up on Auto-sklearn
You might wonder, what does Auto-sklearn do internally? Well, the short answer is: It searches a huge space with more than 100 dimensions for a pipeline that does well on your dataset and then automatically ensembles the best-performing pipelines for prediction.
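If you want to peek at what was found, recent Auto-sklearn versions let you inspect the ensemble members directly; a small sketch (the exact columns and output format depend on the installed version):

# List the pipelines in the final ensemble, ranked by validation score;
# ensemble_weight indicates how much each one contributes to predictions.
print(estimator_askl.leaderboard())

# Show the full configurations (preprocessor, model, hyperparameters)
# of the ensemble members.
print(estimator_askl.show_models())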
If this sounds interesting to you and you want to take a deep dive into the methodology behind Auto-sklearn and other up-to-date AutoML systems, and learn how to apply Auto-sklearn to your machine learning problem, we have two events for you at the upcoming ODSC Europe on Wednesday, June 9th:
Frank Hutter will present the methods behind Auto-sklearn and other recent AutoML systems in his presentation (10.50-11.35).
- Automated Machine Learning with Python – from scikit-learn to auto-sklearn
Afterward, Matthias Feurer and Katharina Eggensperger will do a deep dive into how to apply Auto-sklearn to your machine learning problem (11.55-13.25).
Also, if you like Auto-sklearn, give us a star at www.github.com/automl/auto-sklearn!
About the authors/ODSC Europe speakers:



Frank holds a PhD from the University of British Columbia (UBC, 2009) and a Diplom (eq. MSc) from TU Darmstadt (2004). He received the 2010 CAIAC doctoral dissertation award for the best thesis in AI in Canada, and with his coauthors, several best paper awards and prizes in international competitions on machine learning, SAT solving, and AI planning.



