Supervised Learning Algorithms (K-NN, SVM and Naive Bayes)

In this blog we shall look at a few supervised learning algorithms, with examples to aid understanding. This blog provides a basic understanding of the algorithms; implementing them yourself will clarify the concepts further.

Supervised Machine Learning Algorithm

In this type of machine learning we have both the input and the output data. The algorithm trains a model on this data to map inputs to outputs.

Depending on the type of problem, we treat our data set as either a regression or a classification problem and choose an algorithm accordingly.

If the target is continuous, then regression algorithms are more suitable. For example: stock prices, house prices and weather forecasts are continuous outputs that cannot be reduced to discrete categories or a simple yes or no.

For categorical targets we use classification algorithms. Before we start, it is important to understand the term classification: it is the process of dividing a data set into different categories or groups by assigning labels. Classification algorithms are used in fraud detection, categorizing fruits or other products, species classification, etc.

There are several classification algorithms, such as: Decision Trees, Random Forest, K-NN, Naive Bayes, SVM, Logistic Regression, etc.

Classification Algorithms

1. K-Nearest Neighbor (K-NN)

K-NN is a non-parametric method used for classification and regression. It is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). K-NN has long been used in statistical estimation and pattern recognition.

The k-NN classifier extends the nearest-neighbor idea by taking the k nearest training points to a new object and assigning the object the class held by the majority of those neighbors.

K here is the number of nearest neighbors the algorithm examines when deciding the class of a new point.
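
As a minimal sketch of the idea, here is how a K-NN classifier might look with scikit-learn; the iris dataset, k = 5 and the Euclidean metric are illustrative choices, not recommendations:

```python
# A minimal K-NN sketch using scikit-learn; the iris dataset, k = 5
# and the Euclidean metric are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# k is the number of neighbors consulted for each prediction.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)  # "training" just stores the data
print("Test accuracy:", knn.score(X_test, y_test))
```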

Pros of K-NN

  • It’s easy to implement: only two parameters are required, i.e. the value of ‘k’ and the distance function.
  • Its training phase is much faster than that of algorithms such as SVM or linear regression, since there is nothing to fit.
  • No training period: K-NN is called a lazy learner (instance-based learning). It does not learn anything during a training period and does not derive any discriminative function from the training data.
  • Since K-NN requires no training before making predictions, new data can be added seamlessly without retraining the model.

Cons of K-NN

  • Not suitable for large datasets, as the cost of calculating the distance from a new point to every stored point is high and degrades performance.
  • Does not work well in high dimensions, as distances between points become less meaningful and more expensive to compute.
  • Needs feature scaling: we must scale features (standardization or normalization) before applying K-NN to any dataset; otherwise K-NN may generate wrong predictions (see the sketch after this list).
  • Sensitive to noisy data, missing values and outliers; we have to handle them manually for K-NN to work well.
  • Does not build an explicit model; it simply memorizes the training data.
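
To illustrate the feature-scaling point from the list above, here is a sketch that standardizes features before K-NN using scikit-learn's StandardScaler; the wine dataset is just a convenient example with features on very different scales:

```python
# Sketch: scaling features before K-NN so no single feature dominates
# the distance computation. Dataset and parameters are illustrative.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

unscaled = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(),
                       KNeighborsClassifier(n_neighbors=5)).fit(X_train, y_train)

print("Without scaling:", unscaled.score(X_test, y_test))
print("With scaling:   ", scaled.score(X_test, y_test))
```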

2. Support Vector Machine (SVM)

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. In two-dimensional space this hyperplane is a line dividing the plane into two parts, with each class lying on either side.

To put it simply: suppose we want to separate two classes, say dogs and cats. We draw a line (the hyperplane) that separates dogs from cats.
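
As a minimal sketch, here is a linear SVM separating two synthetic clusters (standing in for the cats and dogs); the data and the C value are illustrative:

```python
# Sketch: a linear SVM separating two synthetic clusters; the data
# and C value are illustrative.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters standing in for "cats" and "dogs".
X, y = make_blobs(n_samples=100, centers=2, random_state=6)

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# Only the support vectors determine the separating hyperplane.
print("Number of support vectors:", len(clf.support_vectors_))
print("Prediction for a new point:", clf.predict([[3.0, -4.0]]))
```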

Pros of SVM

  1. It is really effective in higher dimensions.
  2. Can be used for non-linear data sets with the help of kernel functions.
  3. Effective when the number of features is greater than the number of training examples.
  4. One of the best algorithms when the classes are separable.
  5. The hyperplane is determined only by the support vectors, so outliers have less impact.
  6. SVM is suited for extreme-case binary classification.
  7. Works equally well on small data sets.

Cons of SVM

  1. For larger datasets, it requires a large amount of time to train.
  2. Does not perform well when classes overlap.
  3. Selecting hyperparameters that give sufficient generalization performance can be difficult.
  4. Selecting the appropriate kernel function can be tricky.

Preparing data for SVM:

1. Numerical Conversion:

SVM assumes that inputs are numerical rather than categorical, so categorical features need to be converted using one of the commonly used techniques such as one-hot encoding or label encoding, as sketched below.
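
A quick sketch of both conversions, assuming a made-up ‘color’ column and using pandas and scikit-learn:

```python
# Sketch: converting a categorical column to numbers before SVM.
# The 'color' column and its values are made up for illustration.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category.
print(pd.get_dummies(df, columns=["color"]))

# Label encoding: one integer per category (implies an ordering).
print(LabelEncoder().fit_transform(df["color"]))
```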

2. Binary Conversion:

A basic SVM can separate only two classes, so for a multi-class dataset you need to reduce the problem to binary sub-problems using the one-vs-rest or one-vs-one method (sketched below).
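
Here is a sketch of the one-vs-rest strategy on a three-class dataset; note that scikit-learn's SVC also handles multi-class data internally (using one-vs-one), so the explicit wrapper is mainly to show the idea:

```python
# Sketch: reducing a 3-class problem to binary sub-problems with
# the one-vs-rest strategy. The dataset is illustrative.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Trains one binary SVM per class (that class vs. all the others).
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
print("Number of binary classifiers:", len(ovr.estimators_))  # 3
```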

Note: SVM is famous for the ‘kernel trick’. The kernel is a way of computing the dot product of two vectors x and y in some (very high-dimensional) feature space, which is why kernel functions are sometimes called “generalized dot products”.

Applying the kernel trick simply means replacing the dot product of two vectors with the kernel function.

Types of Kernel:
  1. Linear kernel
  2. Polynomial kernel
  3. Radial basis function (RBF) / Gaussian kernel
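
The kernel trick can be shown numerically with the polynomial kernel (the RBF kernel's feature space is infinite-dimensional, so it cannot be written out explicitly). The sketch below checks that a degree-2 polynomial kernel equals the dot product of an explicit quadratic feature map; the vectors are arbitrary examples:

```python
# Sketch of the kernel trick: a degree-2 polynomial kernel computes
# the dot product in a quadratic feature space without ever building
# that space explicitly. The vectors are arbitrary examples.
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

explicit = phi(x) @ phi(y)  # dot product in the feature space
kernel = (x @ y) ** 2       # polynomial kernel k(x, y) = (x.y)^2

print(explicit, kernel)     # both print 121.0
```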

3. Naive Bayes

Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. It is not a single algorithm but a family of algorithms where all of them share a common principle, i.e. every pair of features being classified is independent of each other.

The fundamental Naive Bayes assumption is that each feature makes an (a) independent and (b) equal contribution to the outcome. This is rarely true in the real world, hence the name ‘naive’ (innocent).

E.g. for weather conditions, the features might be humidity, temperature, wind speed, pressure, jet streams, etc. In the real world these features are related to one another, so they are neither independent nor equal contributors.
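
As a minimal sketch, here is Gaussian Naive Bayes (the variant for continuous features) on the iris dataset; the dataset and split are illustrative choices:

```python
# Sketch: Gaussian Naive Bayes, the variant for continuous features.
# The iris dataset and the split are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

gnb = GaussianNB().fit(X_train, y_train)
print("Test accuracy:", gnb.score(X_test, y_test))
```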

Pros of Naive Bayes
  1. It is easy to understand.
  2. It can also be trained on small datasets.
  3. Widely used in text classification, spam filtering in emails, sentiment analysis and recommender systems.

Cons of Naive Bayes
  1. Its strong independence assumption often does not hold for real-life problems.
  2. Zero conditional probability problem: if a feature value never occurs with a class in the training data, its conditional probability is zero and the total probability for that class also becomes zero. This can be corrected by smoothing techniques such as ‘Laplacian correction’, as sketched below.
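
To illustrate the zero-probability fix mentioned above, here is a sketch using scikit-learn's MultinomialNB, whose alpha parameter is exactly this additive (Laplace) smoothing; the tiny corpus is made up:

```python
# Sketch: Laplace smoothing in Naive Bayes text classification.
# alpha=1.0 is scikit-learn's default; the tiny corpus is made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["free prize now", "meeting at noon", "free meeting invite"]
labels = [1, 0, 0]  # 1 = spam, 0 = ham

X = CountVectorizer().fit_transform(texts)

# alpha > 0 adds a pseudo-count to every (word, class) pair, so a word
# never seen with a class no longer forces that class's probability to 0.
clf = MultinomialNB(alpha=1.0).fit(X, labels)
print(clf.predict(X))
```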

Hope you have understood these classification algorithms. We have more to learn. Keep learning!

Visit the YouTube videos for more details.
