Lesson 40 – Train models with Scikit-Learn

This lesson walks through using scikit-learn and MLflow within the Fabric Data Science experience. It covers installing scikit-learn, preparing data, training a model, and evaluating it. Along the way, parameters, metrics, and the trained model are logged to MLflow so that experiments are easy to track and compare.

Scikit-Learn

Scikit-learn is a popular open-source machine learning library for Python. It provides a comprehensive set of tools for various machine learning tasks, including classification, regression, clustering, dimensionality reduction, and more. Scikit-learn is built on NumPy, SciPy, and Matplotlib, and it is designed to be user-friendly, efficient, and extensible.

Key features of Scikit-Learn

Some of the key features of Scikit-Learn are:

Scikit-learn provides a wide array of machine learning algorithms, from traditional methods to advanced models, for various tasks.
Its user-friendly and consistent API simplifies the implementation and evaluation of machine learning models.
Leveraging NumPy and SciPy, scikit-learn is optimized for efficient numerical computations, enabling effective handling of large datasets.
Scikit-learn offers a rich set of tools encompassing data preprocessing, feature selection, model selection, and performance evaluation, facilitating a comprehensive machine learning workflow.
Scikit-learn’s integration with Pandas, Matplotlib, and IPython facilitates a unified and efficient machine learning workflow, enhancing data manipulation, visualization, and interactive coding capabilities.

Prerequisites

Launch https://app.fabric.microsoft.com/
Sign in to Microsoft Fabric with your subscription or trial account. Refer to Lesson 3 – Getting started with Microsoft Fabric – Pearl Innovations

Steps

Navigate to the menu bar on the left side and choose “Workspaces” to create a new workspace.
Create a Lakehouse, or work in an existing Lakehouse.
Click Open notebook -> New notebook to create a new notebook.

Within a few seconds, a new notebook will open, comprising a single cell. Notebooks consist of cells that can contain either code or markdown (formatted text).

Install scikit-learn

To begin using scikit-learn, make sure it is installed in your notebook environment. You can install scikit-learn by executing the following command (append --upgrade to upgrade an existing installation):

!pip install scikit-learn

Create Machine Learning Experiment

Initiate a machine learning experiment using the MLflow API. The mlflow.set_experiment() function creates a new machine learning experiment if it doesn’t already exist.

Set up the environment

Import necessary libraries including mlflow, mlflow.sklearn, numpy, pandas, and scikit-learn modules.

Load data

Load the Iris dataset from scikit-learn.
load_iris() is a function from scikit-learn that loads the Iris dataset.
The loaded dataset is stored in the variable iris.
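As a short sketch, loading the dataset and checking its size might look like this:

```python
from sklearn.datasets import load_iris

# Load the bundled Iris dataset into the variable iris
iris = load_iris()

# 150 samples, each with 4 measurements
print(iris.data.shape)
print(iris.target_names.tolist())
```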

Selecting Features and Splitting Data

Select relevant features for the model, including the label column (“target”). Split the data into training and testing sets.
In the context of feature selection, X represents the feature matrix extracted from the Iris dataset. It contains measurements for sepal length, sepal width, petal length, and petal width. Similarly, y is the target variable containing the class labels (0 for setosa, 1 for versicolor, and 2 for virginica).
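The selection of X and y can be sketched as follows, using the data and target attributes of the loaded dataset:

```python
from sklearn.datasets import load_iris

iris = load_iris()

X = iris.data    # feature matrix: one row per flower, one column per measurement
y = iris.target  # class labels encoded as 0, 1, 2

# Column names of X and the label-to-species mapping for y
print(iris.feature_names)
print(dict(enumerate(iris.target_names.tolist())))
```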

The train_test_split function is used to divide the feature matrix (X) and target variable (y) into training and testing sets. The split ratio is 80% for training (X_train, y_train) and 20% for testing (X_test, y_test), and a random seed of 42 is set for reproducibility.
This partition enables training the model on one subset and evaluating it on another, promoting an impartial evaluation of the model’s performance.
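A sketch of the 80/20 split described above, with the same random seed:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Hold out 20% of the rows for testing; the fixed seed makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

With 150 samples in total, this leaves 120 rows for training and 30 for testing.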

Train the Model

Training fits the model to the input data: the algorithm adjusts the model’s internal parameters to reduce prediction errors on the training set.

In this step, a RandomForestClassifier model is initialized with 100 trees (n_estimators=100) and a fixed random seed (random_state=42). The model is then fitted (trained) using the training features (X_train) and corresponding labels (y_train).
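A minimal sketch of the training step, repeating the data loading and split so that it runs on its own:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# 100 trees, with a fixed seed so repeated runs build the same forest
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print(len(model.estimators_))  # number of fitted trees
```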

Make Predictions

Use the trained model to make predictions on the test set.

Calling predict on the trained RandomForestClassifier with the test features (X_test) returns one predicted label per test sample. The predicted labels are stored in the variable `y_pred`.
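The prediction step can be sketched like this (the setup lines repeat the earlier steps so the snippet is self-contained):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# One predicted class label (0, 1, or 2) per row of X_test
y_pred = model.predict(X_test)
print(y_pred[:5], y_test[:5])
```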

Evaluate the Model

Assessing the model’s performance means measuring its accuracy, that is, the fraction of predictions that match the actual outcomes on a validation or test dataset.

The accuracy of the trained model is calculated by comparing the predicted labels (`y_pred`) with the actual labels (`y_test`). The resulting accuracy is stored in the variable `accuracy` and printed to two decimal places.
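A sketch of the evaluation step, assuming scikit-learn's accuracy_score is the metric function used:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Fraction of test samples whose predicted label equals the true label
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
```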

Result

The displayed accuracy of 1.00 (100%) means that every prediction matched the actual label in the test set. Since the test set holds only 30 samples (20% of 150), a perfect score here is not a guarantee of similar accuracy on other data.

Log Parameters and Metrics with MLflow

Use mlflow.start_run() to start a new MLflow run and log parameters, metrics, and the trained model.

Inside the `mlflow.start_run()` context, the run logs the parameters (`n_estimators` and `random_state`), the accuracy metric, and the trained RandomForestClassifier model (`model`).
This step captures essential experiment details in the MLflow tracking server for comparison and reproduction of experiments.

Useful links	How to train models with scikit-learn in Microsoft Fabric