Machine Learning Life Cycle:
Machine learning gives computer systems the ability to learn from data
without being explicitly programmed. But how does a machine learning system
work? It can be described using the machine learning life cycle, a cyclic
process for building an efficient machine learning project. The main purpose
of the life cycle is to find a solution to the problem the project addresses.
Machine learning life cycle involves seven major steps, which are
given below:
- Gathering Data
- Data preparation
- Data Wrangling
- Analyse Data
- Train the model
- Test the model
- Deployment
The most important thing in the whole process is to understand the problem
and its purpose. Therefore, before starting the life cycle, we need to
understand the problem, because a good result depends on a good understanding
of it.
Throughout the life cycle, to solve a problem we create a machine learning
system called a "model", and this model is created by "training" it. But to
train a model we need data, so the life cycle starts with collecting data.
1. Gathering Data
Data gathering is the first step of the machine learning life cycle. The goal
of this step is to identify and obtain all the data related to the problem.
In this step, we need to identify the different data sources, as data can be
collected from various sources such as files, databases, the internet, or
mobile devices. It is one of the most important steps of the life cycle. The
quantity and quality of the collected data determine the quality of the
output: in general, the more relevant, good-quality data we have, the more
accurate the predictions tend to be.
This step includes the following tasks:
- Identify various data sources
- Collect data
- Integrate the data obtained from different sources
By performing the above tasks, we get a coherent set of data, also called a
dataset. It will be used in the following steps.
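The three tasks above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the CSV text and the in-memory SQLite table are hypothetical stand-ins for a real file and a real database, and the names `CSV_SOURCE`, `read_csv_source`, `read_db_source`, and the `customers` table are invented for this example.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: CSV text standing in for a file on disk.
CSV_SOURCE = "country,age,salary\nIndia,38,48000\nFrance,43,45000\n"


def read_csv_source(text):
    """Collect records from CSV text, converting numeric fields to int."""
    return [
        {"country": r["country"], "age": int(r["age"]), "salary": int(r["salary"])}
        for r in csv.DictReader(io.StringIO(text))
    ]


def read_db_source():
    """Collect records from a hypothetical database table (in-memory SQLite)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (country TEXT, age INTEGER, salary INTEGER)")
    conn.execute("INSERT INTO customers VALUES ('Germany', 30, 54000)")
    rows = conn.execute("SELECT country, age, salary FROM customers").fetchall()
    conn.close()
    return [{"country": c, "age": a, "salary": s} for c, a, s in rows]


# Integrate: both sources now share one schema, so they can be combined
# into a single dataset.
dataset = read_csv_source(CSV_SOURCE) + read_db_source()

for record in dataset:
    print(record)
```

In practice the hard part of integration is usually agreeing on one schema (field names, types, units) across sources; the sketch sidesteps that by giving both sources the same three fields.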
2. Data Preparation
After collecting the data, we need to prepare it for the following steps.
Data preparation is the step where we put our data in a suitable place and
prepare it for use in machine learning training.
In this step, we first put all the data together and then randomize its
order.
This step can be further divided into two processes:
- Data exploration: This is used to understand the nature of the data we have to work with. We need to understand the characteristics, format, and quality of the data. A better understanding of the data leads to a more effective outcome. Here we look for correlations, general trends, and outliers.
- Data pre-processing: The next step is to pre-process the data for analysis.
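The randomizing and exploration described in this step can be sketched as follows. The toy `records` are made up for illustration, the fixed seed is only there to make the sketch repeatable, and the 1.5 * IQR rule used in `iqr_outliers` is one common rule of thumb for flagging outliers, not the only option.

```python
import random
import statistics

# Hypothetical toy records: (age, salary). Values are invented for illustration.
records = [(25, 30000), (30, 36000), (35, 41000), (40, 47000),
           (45, 52000), (50, 58000), (38, 250000)]

# Randomize the order so later steps do not depend on collection order.
rng = random.Random(42)  # fixed seed only to make the sketch repeatable
shuffled = records[:]
rng.shuffle(shuffled)

ages = [age for age, _ in shuffled]
salaries = [salary for _, salary in shuffled]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def iqr_outliers(values):
    """Return values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if v < q1 - 1.5 * spread or v > q3 + 1.5 * spread]


print("mean salary:", statistics.mean(salaries))
print("age/salary correlation:", round(pearson(ages, salaries), 3))
print("salary outliers:", iqr_outliers(salaries))
```

A single extreme salary is enough to weaken the measured correlation considerably, which is why exploration usually comes before any modelling.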
3. Data Wrangling
Data wrangling is the process of cleaning raw data and converting it into a
usable format. It involves cleaning the data, selecting the variables to use,
and transforming the data into a proper format to make it more suitable for
analysis in the next step. It is one of the most important steps of the whole
process. Cleaning the data is required to address quality issues.
Not all of the data we have collected is necessarily useful. In real-world
applications, collected data may have various issues, including:
- Missing values
- Duplicate data
- Invalid data
- Noise
So, we use various filtering techniques to clean the data.
It is essential to detect and remove the above issues because they can
negatively affect the quality of the outcome.
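A minimal cleaning pass might look like the sketch below. It covers three of the issues listed above (missing values, duplicates, and invalid values); noise usually needs domain-specific filtering and is not shown. The `raw` records and the plausibility range in `is_valid` are assumptions made for this example, and dropping bad rows is only one strategy (filling in missing values is another).

```python
# Hypothetical raw records; values are invented for illustration.
raw = [
    {"country": "India", "age": 38, "salary": 48000},
    {"country": "India", "age": 38, "salary": 48000},     # duplicate
    {"country": "France", "age": None, "salary": 45000},  # missing value
    {"country": "Germany", "age": -5, "salary": 54000},   # invalid value
    {"country": "Spain", "age": 41, "salary": 51000},
]


def is_complete(record):
    """True when no field is missing (None)."""
    return all(value is not None for value in record.values())


def is_valid(record):
    """True when fields fall in an assumed plausible range."""
    return 0 < record["age"] < 120 and record["salary"] >= 0


def clean(records):
    """Drop incomplete, invalid, and duplicate records, keeping first copies."""
    seen = set()
    cleaned = []
    for record in records:
        # Completeness is checked first so is_valid never sees a None field.
        if not is_complete(record) or not is_valid(record):
            continue
        key = tuple(sorted(record.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


cleaned = clean(raw)
print([record["country"] for record in cleaned])
```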
4. Data Analysis
Now the cleaned and prepared data is passed on
to the analysis step. This step involves:
- Selection of analytical techniques
- Building models
- Reviewing the result
The aim of this step is to build a machine learning model that analyzes the
data using various analytical techniques, and to review the outcome. It
starts with determining the type of problem, where we select a machine
learning technique such as classification, regression, cluster analysis, or
association. We then build the model using the prepared data and evaluate it.
Hence, in this step, we take the data and use
machine learning algorithms to build the model.
5. Train Model
The next step is to train the model. In this step, we train our model to
improve its performance and get a better outcome for the problem.
We use datasets to train the model using various machine learning algorithms.
Training is required so that the model can learn the various patterns, rules,
and features in the data.
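To make "training" concrete, here is a minimal sketch that fits a straight line to toy data by ordinary least squares. Linear regression is chosen here only because it is short; the text does not prescribe any particular algorithm, and the function names and data are hypothetical.

```python
def train_linear_model(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a trained (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training data that follows y = 2x + 1 exactly.
train_x = [1, 2, 3, 4, 5]
train_y = [3, 5, 7, 9, 11]

model = train_linear_model(train_x, train_y)
print("learned (slope, intercept):", model)
print("prediction for x = 6:", predict(model, 6))
```

The "pattern" the model learns here is just two numbers, the slope and the intercept; more complex models learn many more parameters, but the idea of fitting them to training data is the same.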
6. Test Model
Once our machine learning model has been trained on a given dataset, we test
it. In this step, we check the accuracy of our model by giving it a test
dataset.
Testing the model determines its percentage accuracy, which is then compared
with the requirements of the project or problem.
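Percentage accuracy is simply the share of test examples the model predicts correctly. The sketch below assumes a classification task; `predict_purchase` is a hypothetical stand-in for a trained model, and the test data are invented for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


def predict_purchase(salary, threshold=50000):
    """Hypothetical stand-in for a trained model: a simple salary threshold."""
    return "Yes" if salary > threshold else "No"


# Hypothetical test dataset, kept separate from any training data.
test_salaries = [42000, 61000, 55000, 38000, 70000]
test_labels = ["No", "Yes", "No", "No", "Yes"]

predictions = [predict_purchase(s) for s in test_salaries]
test_accuracy = accuracy(test_labels, predictions)
print(f"accuracy: {test_accuracy:.0%}")
```

Here four of the five test examples are predicted correctly, so the sketch reports 80%. Whether that is good enough depends on the requirement set for the project.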
7. Deployment
The last step of the machine learning life cycle is deployment, where we
deploy the model in a real-world system.
If the model prepared above produces accurate results that meet our
requirements at an acceptable speed, we deploy it in the real system. But
before deploying the project, we check whether its performance is improving
with the available data. The deployment phase is similar to making the final
report for a project.
How to get datasets for Machine Learning
The key to success in machine learning, or to becoming a great data
scientist, is to practice with different types of datasets. But finding a
suitable dataset for each kind of machine learning project is a difficult
task. So, in this topic, we describe the sources from which you can easily
get a dataset suited to your project.
Before knowing the sources of the machine learning dataset, let's
discuss datasets.
What is a dataset?
A dataset is a collection of data arranged in some order. A dataset can
contain anything from a simple array to a database table. The table below
shows an example of a dataset:
Country | Age | Salary | Purchased
India   | 38  | 48000  | No
France  | 43  | 45000  | Yes
Germany | 30  | 54000  | No
A tabular dataset can be understood as a database table or matrix, where each
column corresponds to a particular variable and each row corresponds to one
record of the dataset. The most widely supported file type for a tabular
dataset is the comma-separated values (CSV) file. To store tree-like data,
the JSON format is often a better fit.
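The difference between the two formats can be seen in a short sketch using Python's standard `csv` and `json` modules. The CSV text reuses the example table; the nested JSON record (`JSON_TEXT` and its field names) is hypothetical and only shows the kind of tree-like structure that does not fit neatly into rows and columns.

```python
import csv
import io
import json

# The example table as CSV text: one header line, then one line per record.
CSV_TEXT = """Country,Age,Salary,Purchased
India,38,48000,No
France,43,45000,Yes
Germany,30,54000,No
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Each row is one record; each key is one column (variable).
# Note that csv returns every value as a string.
for row in rows:
    print(row["Country"], int(row["Age"]), int(row["Salary"]), row["Purchased"])

# Hypothetical tree-like record: a customer with a nested list of orders.
JSON_TEXT = """
{"country": "India",
 "customer": {"age": 38,
              "orders": [{"item": "laptop", "price": 900},
                         {"item": "mouse", "price": 20}]}}
"""

tree = json.loads(JSON_TEXT)
order_total = sum(order["price"] for order in tree["customer"]["orders"])
print("total spent:", order_total)
```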
Types of data in datasets
- Numerical data: Such as house price, temperature, etc.
- Categorical data: Such as Yes/No, True/False, Blue/Green, etc.
- Ordinal data: These data are similar to categorical data, but the categories can be ranked by comparison (for example, low/medium/high).
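The distinction matters when data is converted to numbers for a model. As a rough sketch: categorical values are often one-hot encoded because they have no order, while ordinal values can be mapped to ranks. The level names in `ORDINAL_LEVELS` and both helper functions are hypothetical examples, not a fixed convention.

```python
# Hypothetical ordinal scale, listed from lowest to highest.
ORDINAL_LEVELS = ["low", "medium", "high"]


def encode_ordinal(value, levels=ORDINAL_LEVELS):
    """Map an ordinal value to its rank so that comparisons stay meaningful."""
    return levels.index(value)


def one_hot(value, categories):
    """Map a categorical value to a 0/1 vector with no implied order."""
    return [1 if value == category else 0 for category in categories]


colours = ["blue", "green"]

print(one_hot("green", colours))
print(encode_ordinal("medium"))
print(encode_ordinal("low") < encode_ordinal("high"))
```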
Note: A real-world dataset is often huge, which makes it difficult to manage
and process at the beginner level. Therefore, to practice machine learning
algorithms, we can use a dummy dataset.
Need for a Dataset
To work on machine learning projects, we need a large amount of data, because
without data one cannot train ML/AI models. Collecting and preparing the
dataset is one of the most crucial parts of creating an ML/AI project.
The technology behind an ML project cannot work properly if the dataset is
not well prepared and pre-processed.
During the development of an ML project, developers rely heavily on datasets.
In building ML applications, datasets are divided into two parts:
- Training dataset
- Test dataset
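A simple way to produce these two parts is to shuffle the records and cut the list at a chosen point. In the sketch below, the 80/20 ratio is a common convention assumed for this example rather than something the text specifies, and `train_test_split` is a hand-written helper, not a library function.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split it into (train, test) lists."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # seed fixed for repeatability
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset of ten record identifiers.
data = list(range(10))
train, test = train_test_split(data)

print("training records:", train)
print("test records:", test)
```

Keeping the two parts disjoint is the point of the split: the test dataset must contain records the model has not seen during training, otherwise the measured accuracy is misleading.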
Note: The datasets are large, so you need a fast internet connection to
download them.
Popular sources for Machine Learning datasets
Below is a list of dataset sources that are freely available for the public
to work with:
1. Kaggle Datasets
2. UCI Machine Learning Repository
Google Dataset Search is a search engine launched by Google on September 5,
2018. It helps researchers find online datasets that are freely available for
use.
The link for the Google dataset search engine is https://toolbox.google.com/datasetsearch.