How to Build a Machine Learning Model in Python

Authored By: Ankita Prajapati

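Building a machine learning model in Python involves several steps. In this tutorial, we’ll go through the process of building a simple classification model using the popular diabetes dataset that ships with scikit-learn. Keep in mind that this dataset describes patients who already have diabetes and records a quantitative measure of disease progression one year after baseline, so the prediction task is really about disease progression (for example, whether it is above or below the median) rather than about whether a patient has diabetes or not.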

Here are the steps we’ll cover:

  1. Importing the necessary libraries
  2. Loading the dataset
  3. Exploring the dataset
  4. Preprocessing the data
  5. Splitting the data into training and testing sets
  6. Building the machine learning model
  7. Evaluating the model’s performance

Let’s get started!

Step 1: Importing the necessary libraries

First, we need to import the necessary libraries. We’ll be using the pandas library to load and manipulate the dataset, and the scikit-learn library to build the machine learning model. Here’s the code to import these libraries:

				
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
				
			

Step 2: Loading the dataset

Next, we need to load the dataset into our Python environment. The diabetes dataset is available in scikit-learn, so we can easily load it using the following code:

				
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
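
Before going further, it can help to check what load_diabetes actually returned. The sketch below uses only attributes that scikit-learn documents for this dataset (data, target, feature_names); the variable names n_samples and n_features are just illustrative.

```python
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()

# data is a 2-D array: one row per patient, one column per feature
n_samples, n_features = diabetes.data.shape
print(f"Samples: {n_samples}, features: {n_features}")

# feature_names lists the column names in order
print(diabetes.feature_names)

# target is a continuous measure of disease progression, not a yes/no label
print(diabetes.target[:5])
print(f"Target range: {diabetes.target.min()} to {diabetes.target.max()}")
```

Because the target is continuous, it has to be turned into class labels before a classifier can do anything meaningful with it.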
				
			

Step 3: Exploring the dataset

Before we start preprocessing the data, let’s take a closer look at the dataset to see what we’re working with. We can use pandas to load the dataset into a dataframe and explore the data using various methods.

				
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df['target'] = diabetes.target
print(df.head())
				
			

This code will print the first few rows of the dataset, along with the target variable. We can also use pandas to get some basic statistics about the dataset:

				
print(df.describe())
				
			

This code will print the mean, standard deviation, and other statistics for each feature in the dataset.
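
A couple of further checks are often worth running at this stage. The sketch below, which is an addition rather than part of the original steps, counts missing values, summarizes column types, and ranks the features by their correlation with the target. The names missing_per_column and correlations are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df["target"] = diabetes.target

# Count missing values per column (all zeros means nothing to impute)
missing_per_column = df.isna().sum()
print(missing_per_column)

# Column types and non-null counts in one summary
df.info()

# Correlation of each feature with the target, strongest first
correlations = df.corr()["target"].drop("target").sort_values(ascending=False)
print(correlations)
```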

Step 4: Preprocessing the data

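Now that we’ve explored the dataset, we can preprocess the data to prepare it for machine learning. Two things need attention here. First, the target in scikit-learn’s diabetes dataset is a quantitative measure of disease progression, not a yes/no label, so we convert it into two classes before training a classifier: 1 if a patient’s progression is above the median, 0 otherwise. Second, we scale the features using the StandardScaler from scikit-learn. Putting all features on the same scale can improve the performance of some machine learning algorithms. Decision trees are largely insensitive to scaling, and load_diabetes already returns mean-centered, scaled features, so here the step is mainly illustrative.

import numpy as np
from sklearn.preprocessing import StandardScaler
# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
X = scaler.fit_transform(diabetes.data)
# Binarize the continuous target: 1 = above-median progression, 0 = otherwise
y = (diabetes.target > np.median(diabetes.target)).astype(int)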
				
			

Step 5: Splitting the data into training and testing sets

Before we build the machine learning model, we need to split the data into training and testing sets. This will allow us to evaluate the performance of the model on data that it hasn’t seen before. We’ll use the train_test_split function from scikit-learn to split the data into 80% training data and 20% testing data.

				
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
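
One common refinement, offered here as a sketch rather than as part of the original steps, is to split first and fit the scaler on the training rows only, so that no statistics from the test set leak into preprocessing. The binary label (above-median progression), the stratify argument, and the names X_raw, X_train_raw, and X_test_raw are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

diabetes = load_diabetes()
X_raw = diabetes.data
# Assumption: a binary label derived from the continuous target (above median = 1)
y = (diabetes.target > np.median(diabetes.target)).astype(int)

# Split first, stratifying so both splits keep a similar class balance
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training rows only, then apply it to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)
X_test = scaler.transform(X_test_raw)

print(X_train.shape, X_test.shape)
```

With 442 rows and test_size=0.2, this leaves 353 rows for training and 89 for testing.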
				
			

Step 6: Building the machine learning model

Now we’re ready to build the machine learning model. In this case, we’ll use a decision tree classifier from scikit-learn. This is a simple algorithm that works well for classification problems.

				
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
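
A decision tree with default settings keeps splitting until it fits the training data almost perfectly, which can hurt accuracy on unseen data. The sketch below compares an unrestricted tree with a depth-limited one on both splits. The binary label (above-median progression) and the choice of max_depth=3 are assumptions made for illustration, not tuned values.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

diabetes = load_diabetes()
X = diabetes.data
# Assumption: a binary label derived from the continuous target (above median = 1)
y = (diabetes.target > np.median(diabetes.target)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for depth in (None, 3):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    # score() returns accuracy for classifiers
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, test={results[depth][1]:.3f}")
```

A large gap between training and test accuracy is a sign that the tree is memorizing the training set rather than learning a pattern that generalizes.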
				
			

Step 7: Evaluating the model’s performance

Finally, we can evaluate the performance of the machine learning model on the testing data. We’ll use the accuracy_score and confusion_matrix functions from scikit-learn to get a sense of how well the model is performing.

				
# Predict labels for the held-out test set
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print(f"Confusion matrix:\n{conf_matrix}")
				
			

The accuracy_score function compares the predicted values (y_pred) with the actual values (y_test) and returns the fraction of correct predictions, a value between 0 and 1. The confusion_matrix function returns a table of counts with the true classes as rows and the predicted classes as columns; for a two-class problem, its four cells are the true negatives, false positives, false negatives, and true positives.

By looking at the accuracy and confusion matrix, we can get a sense of how well our model is performing. If the accuracy is high and the confusion matrix shows a small number of false positives and false negatives, then our model is doing a good job of telling the two classes apart.
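
To make the confusion matrix concrete, here is a small sketch on made-up labels. The lists y_true and y_hat are hypothetical data for eight patients, not output from the model above; the point is only to show how the four cells map to precision and recall.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels for eight patients (1 = positive class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=3, FP=1, FN=1, TP=3

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
print(f"Precision: {precision:.2f}, recall: {recall:.2f}")

# Per-class precision, recall, and F1 in one table
print(classification_report(y_true, y_hat))
```

Accuracy alone can hide an imbalance between the two kinds of error, which is why it is worth reading it alongside precision and recall.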

Conclusion

That’s it! We’ve successfully built a machine learning classifier in Python on the diabetes dataset, from loading the data through to evaluating predictions on a held-out test set.

Of course, this is a very simple example, and there are many other machine learning algorithms and techniques that we could use to improve the performance of our model.
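
As one possible next step, the sketch below compares a few scikit-learn classifiers using 5-fold cross-validation instead of a single train/test split. The binary label (above-median progression), the particular models, and their settings are assumptions chosen for illustration; the results are not meant as a benchmark.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

diabetes = load_diabetes()
X = diabetes.data
# Assumption: a binary label derived from the continuous target (above median = 1)
y = (diabetes.target > np.median(diabetes.target)).astype(int)

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
    # The pipeline refits the scaler inside each cross-validation fold
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

cv_means = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Averaging over several folds gives a steadier estimate than one split, which matters on a dataset with only a few hundred rows.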

This should give you a good starting point for building your own machine learning models in Python.
