Avyana
posted a blog.
Ever since a Harvard Business Review article called 'Data Scientist' the 'Sexiest Job of the 21st Century', interest in machine learning has risen dramatically. But if you are just starting out, it can be difficult to break in. That is why we've republished our popular article on machine learning algorithms that are good for beginners.
(This post was originally published on KDNuggets as The 10 Algorithms Machine Learning Engineers Need to Know. It has been republished with permission and was last updated in 2019.)
This post is for beginners. If you're not clear yet on the differences between "data science" and "machine learning," this article offers a good explanation: machine learning and data science -- what makes them different?
To learn more about Artificial Intelligence and Machine Learning, read the Artificial Intelligence tutorial. You can also enroll in the Post Graduate program of AI and ML courses by E&ICT, NIT Warangal, India to become proficient.
Machine learning algorithms learn from data and improve with experience. Learning tasks may include learning the function that maps the input to the output, learning the hidden structure in unlabeled data, or 'instance-based learning', where a class label is produced for a new instance (row) by comparing it with instances from the training data. Instance-based learning does not create an abstraction from specific instances.
Different types of Machine Learning Algorithms
There are three types of machine-learning (ML) algorithms.
Algorithms for Supervised Learning:
Supervised learning uses labeled training data to learn the mapping function that turns input variables (X) into the output variable (Y). In other words, it solves for f in the following equation:
Y = f(X)
This allows us to generate accurate outputs from new inputs.
We will discuss two types of supervised learning: regression and classification.
Classification is used to predict the outcome of a given sample when the output variable is in the form of categories. A classification model might look at the input data and try to predict labels such as "sick" or "healthy".
Regression is used to predict the outcome of a given sample when the output variable is in the form of real values. For example, a regression model might process input data to predict the amount of rainfall or the height of a person.
The first five algorithms covered in this post (Linear Regression, Logistic Regression, CART, Naive Bayes and KNN) are examples of supervised learning.
Ensembling is another type of supervised learning. It means combining the predictions of multiple machine learning models to produce a more accurate prediction on a new sample. Algorithms 9 and 10 of this article (Bagging with Random Forests and Boosting with AdaBoost) are examples of ensemble techniques.
Unsupervised Learning Algorithms
Unsupervised learning models are used when we only have the input variables (X) and no corresponding output variables. They use unlabeled training data to model the underlying structure of the data.
Three types of unsupervised learning will be discussed:
Association is used to discover the probability of the co-occurrence of items in a collection. It is extensively used in market-basket analysis. For example, an association model might be used to discover that if a customer purchases bread, that customer is 80% likely to also purchase eggs.
Clustering is used to group samples such that objects within the same cluster are more similar to each other than to the objects from another cluster.
Dimensionality Reduction is used to reduce the number of variables in a data set while ensuring that important information is still conveyed. It can be done using Feature Extraction methods or Feature Selection methods. Feature Selection selects a subset of the original variables. Feature Extraction transforms the data from a high-dimensional space to a low-dimensional space. For example, the PCA algorithm is a Feature Extraction approach.
We will be covering algorithms 6-8 -- Apriori and K-means as well as PCA -- which are examples of Unsupervised Learning.
Reinforcement learning:
Reinforcement learning, a type of machine-learning algorithm, allows agents to determine the best next action based upon their current state. It does this by learning behavior that maximizes a reward.
Reinforcement algorithms usually learn optimal actions through trial and error. For example, imagine a video game in which the player needs to move to certain places at certain times to earn points. A reinforcement algorithm playing that game would start by moving randomly but, over time and through trial and error, it would learn where and when to move the in-game character to earn the most points.
Quantifying Machine Learning Algorithms' Popularity
Where did we get these ten algorithms? Any such list is inherently subjective. Studies such as these have quantified the 10 most popular data mining algorithms, but they still rely on the subjective responses of survey participants, who are usually advanced academic practitioners. In the linked study, for example, the persons polled were the winners of the ACM KDD Innovation Award and the IEEE ICDM Research Contributions Award, the Program Committee members of KDD '06, ICDM '06 and SDM '06, and the 145 attendees of ICDM '06.
This post lists the top 10 algorithms for machine learning beginners. These are algorithms that I learned in the Data Warehousing and Mining (DWM) course at the University of Mumbai. I have included the last 2 algorithms (ensemble methods) particularly because they are frequently used to win Kaggle competitions.
These are the Top 10 Machine Learning Algorithms For Beginners.
1. Linear Regression
In machine learning, we have a set of input variables (x) that are used to determine an output variable (y). A relationship exists between the input variables and the output variable. The goal of ML is to quantify this relationship.
Figure 1: Linear regression is represented as a line of the form y = a + bx. Source
Linear regression expresses the relationship between the input variables (x) and the output variable (y) as an equation of the form y = a + bx. The goal of linear regression is therefore to find the values of the coefficients a and b. Here, a is the intercept and b is the slope of the line.
Figure 1 shows the plotted x and y values for a data set. The goal is to fit a line that is nearest to most of the points. This reduces the distance (the error) between the y value of a data point and the line.
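One common way to find the line that is "nearest to most of the points" is ordinary least squares, which picks a and b so that the sum of squared vertical distances is as small as possible. The post does not name a fitting method, so treat least squares as an assumption here. The sketch below uses only the Python standard library, and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates of the intercept a and slope b in y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    a = mean_y - b * mean_x
    return a, b


# Made-up data, roughly following y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

a, b = fit_line(xs, ys)
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
```

Once a and b are known, a prediction for a new x is simply a + b * x.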
2. Logistic Regression
Linear regression predictions are continuous values (e.g., rainfall in cm), while logistic regression predictions are discrete values (e.g., whether a student passed or failed) obtained after applying a transformation function.
Logistic regression is best suited for binary classification: data sets where y = 0 or 1, where 1 denotes the default class. For example, in predicting whether an event will occur or not, there are only two possibilities: that it occurs (which we denote as 1) or that it does not (0). So if we were predicting whether a patient was sick, we would label sick patients using the value of 1 in our data set.
Logistic regression is named after the transformation function it uses, the logistic function h(x) = 1 / (1 + e^(-x)). This forms an S-shaped curve.
In logistic regression, the output takes the form of probabilities of the default class (unlike linear regression, where the output is produced directly). Because it is a probability, the output lies in the range of 0-1. For example, if we are predicting whether patients will get sick, and sick patients are labeled 1, an output close to 1 means the model considers that patient likely to be sick.
This output (the y-value) is generated by transforming the x-value with the logistic function h(x) = 1 / (1 + e^(-x)). A threshold is then applied to force this probability into a binary classification.
Figure 2: Logistic regression to determine whether a tumor is malignant or benign. The tumor is classified as malignant if the probability h(x) >= 0.5.
In Figure 2, to determine whether a tumor is malignant or not, the default class is y = 1 (tumor = malignant). The x variable could be a measurement of the tumor, such as its size. As the figure shows, the logistic function transforms the x-values of the various instances in the data set into the range of 0 to 1. If the probability crosses the threshold of 0.5 (shown by the horizontal line), the tumor is classified as malignant.
The logistic regression equation p(x) = e^(b0 + b1x) / (1 + e^(b0 + b1x)) can be transformed into ln(p(x) / (1 - p(x))) = b0 + b1x.
The goal of logistic regression is to use the training data to find the values of the coefficients b0 and b1 that minimize the error between the predicted outcome and the actual outcome. These coefficients are estimated using the technique of Maximum Likelihood Estimation.
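To make the logistic function and the 0.5 threshold concrete, here is a minimal sketch. The coefficient values B0 and B1 are hypothetical and chosen only for illustration; in practice they would be estimated from training data, which this snippet does not attempt.

```python
import math


def logistic(z):
    """Logistic function h(z) = 1 / (1 + e^(-z)); the output is always in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, b0, b1):
    """Probability of the default class (y = 1) for a single input x."""
    return logistic(b0 + b1 * x)


def classify(x, b0, b1, threshold=0.5):
    """Turn the probability into a 0/1 label using the threshold."""
    return 1 if predict_proba(x, b0, b1) >= threshold else 0


# Hypothetical coefficients, not fitted to any real data.
B0, B1 = -4.0, 1.0

for tumor_size in (1.0, 4.0, 7.0):
    p = predict_proba(tumor_size, B0, B1)
    print(tumor_size, round(p, 3), classify(tumor_size, B0, B1))
```

Taking ln(p / (1 - p)) of any of these probabilities gives back b0 + b1x, which is the log-odds form of the equation.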
3. CART
Classification and Regression Trees (CART) are one implementation of Decision Trees.
The non-terminal nodes of Classification and Regression Trees are the root node and the internal nodes. The terminal nodes are the leaf nodes. Each non-terminal node represents a single input variable (x) and a splitting point on that variable; the leaf nodes represent the output variable (y). The model is used as follows to make predictions: walk the splits of the tree to arrive at a leaf node, then output the value present at that leaf node.
The decision tree in Figure 3 classifies whether a person will buy a sports car or a minivan depending on their age and marital status. If the person is over 30 years old and is not married, we walk the tree as follows: "Over 30 years?" -> yes -> "Married?" -> no. Hence, the model outputs a sports car.
Figure 3: The parts of a decision tree. Source
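Walking a tree to a leaf can be sketched in a few lines. Only one path is spelled out in the text (over 30 and not married gives a sports car); the other two leaves below are assumptions made to complete the example, and the dictionary layout is just one hypothetical way to represent a tree.

```python
# A node is either a leaf (a plain string) or a dict describing one split.
TREE = {
    "question": "Over 30 years?",
    "test": lambda person: person["age"] > 30,
    "yes": {
        "question": "Married?",
        "test": lambda person: person["married"],
        "yes": "minivan",       # assumed leaf, not stated in the text
        "no": "sports car",     # the path described in the text
    },
    "no": "sports car",         # assumed leaf, not stated in the text
}


def predict(tree, person):
    """Follow the splits from the root until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        node = node["yes"] if node["test"](person) else node["no"]
    return node


print(predict(TREE, {"age": 35, "married": False}))
```

Training a CART model means choosing these variables and splitting points from data; the sketch only covers prediction with a tree that already exists.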
4. Naive Bayes
Bayes' Theorem is used to calculate the probability that an event will occur, given that another event has already occurred. Here, we use it to calculate the probability that a hypothesis (h) is true, given our prior knowledge (d):
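P(h|d) = (P(d|h) * P(h)) / P(d)
Where:
P(h|d) = Posterior probability. The probability that hypothesis h is true, given the data d. When the data consists of several features d1, ..., dn, it is proportional to P(d1|h) * P(d2|h) * ... * P(dn|h) * P(h).
P(d|h) = Likelihood. The probability of the data d, given that hypothesis h is true.
P(h) = Class prior probability. The probability that hypothesis h is true, regardless of the data.
P(d) = Predictor prior probability. The probability of the data, regardless of the hypothesis.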
This algorithm is called "naive" because it assumes that all the variables are independent of each other, which is a naive assumption to make about real-world data.
Figure 4: Using Naive Bayes to predict the status of 'play' from the variable 'weather'.
Take Figure 4 as an example. What is the outcome if weather = 'sunny'?
To determine the outcome play = 'yes' or 'no' given weather = 'sunny', calculate P(yes|sunny) and P(no|sunny) and choose the outcome with the higher probability.
- P(yes|sunny) = (P(sunny|yes) * P(yes)) / P(sunny) = (3/9 * 9/14) / (5/14) = 0.60
- P(no|sunny) = (P(sunny|no) * P(no)) / P(sunny) = (2/5 * 5/14) / (5/14) = 0.40
If the weather is sunny, then play = "yes".
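The arithmetic in this worked example can be checked with exact fractions. The counts below are read off the numbers in the calculation (14 days in total, 9 "yes" days, 5 "no" days, 5 sunny days); the variable and function names are only illustrative.

```python
from fractions import Fraction

# Priors and likelihoods taken from the worked example.
p_yes = Fraction(9, 14)
p_no = Fraction(5, 14)
p_sunny = Fraction(5, 14)
p_sunny_given_yes = Fraction(3, 9)
p_sunny_given_no = Fraction(2, 5)


def posterior(likelihood, prior, evidence):
    """Bayes' Theorem: P(h|d) = P(d|h) * P(h) / P(d)."""
    return likelihood * prior / evidence


p_yes_given_sunny = posterior(p_sunny_given_yes, p_yes, p_sunny)
p_no_given_sunny = posterior(p_sunny_given_no, p_no, p_sunny)

# Choose the outcome with the higher posterior probability.
prediction = "yes" if p_yes_given_sunny > p_no_given_sunny else "no"
print(float(p_yes_given_sunny), float(p_no_given_sunny), prediction)
```

Because both posteriors share the same denominator P(sunny), comparing the numerators alone would lead to the same decision.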
5. KNN
The K-Nearest Neighbors algorithm uses the entire data set as the training set, rather than splitting the data set into a training set and a test set.
When an outcome is required for a new data instance, the KNN algorithm goes through the entire data set to find the k instances nearest to the new instance (that is, the k instances most similar to the new record). It then outputs the mean of their outcomes for a regression problem, or the mode (the most frequent class) for a classification problem. The value of k is specified by the user.
You can calculate the similarity of two instances using measures like Euclidean distance or Hamming distance.
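A minimal sketch of KNN with Euclidean distance follows, covering both the classification case (mode of the neighbours) and the regression case (mean of the neighbours). The tiny data sets and the function names are made up for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(train, query, k):
    """Return the k (features, outcome) pairs closest to the query."""
    return sorted(train, key=lambda row: euclidean(row[0], query))[:k]


def knn_classify(train, query, k):
    """Classification: output the most common class among the neighbours."""
    votes = Counter(outcome for _, outcome in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train, query, k):
    """Regression: output the mean outcome of the neighbours."""
    neighbours = k_nearest(train, query, k)
    return sum(outcome for _, outcome in neighbours) / len(neighbours)


# Made-up training data.
labeled = [((1, 1), "healthy"), ((1, 2), "healthy"), ((2, 1), "healthy"),
           ((6, 6), "sick"), ((7, 6), "sick"), ((6, 7), "sick")]
numeric = [((1,), 10.0), ((2,), 20.0), ((3,), 30.0), ((10,), 100.0)]

print(knn_classify(labeled, (2, 2), k=3))
print(knn_regress(numeric, (2,), k=3))
```

Note that there is no training step at all: every prediction scans the whole data set, which is why KNN becomes slow on large data.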
Unsupervised learning algorithms
6. Apriori
The Apriori algorithm is used in a transactional database to mine frequent item sets and then generate association rules. It is popularly used in market basket analysis, where one checks for combinations of products that frequently co-occur in the database. In general, we write the association rule for "if a person purchases item X, then they also purchase item Y" as X -> Y.
Example: if a person purchases milk and sugar, they are likely to purchase coffee powder. This could be written as the association rule {milk, sugar} -> coffee powder. Association rules are generated after crossing the thresholds for support and confidence.
Figure 5: Formulae for support, confidence and lift for the association rule X -> Y.
The support measure helps prune the number of candidate item sets to be considered during frequent item set generation. This support measure is guided by the Apriori principle, which states that if an item set is frequent, then all of its subsets must also be frequent.
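The three measures for a rule X -> Y can be computed directly from a list of transactions. The definitions below are the usual textbook ones (support as the fraction of transactions containing the items, confidence as support(X and Y) / support(X), lift as confidence / support(Y)); they are assumed to match the formulae in Figure 5, which is not reproduced here. The baskets are made up.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in transactions if items <= set(basket))
    return hits / len(transactions)


def confidence(transactions, x, y):
    """How often Y appears in the transactions that contain X."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


def lift(transactions, x, y):
    """Confidence of X -> Y relative to how common Y is overall."""
    return confidence(transactions, x, y) / support(transactions, y)


# Made-up market baskets.
baskets = [
    {"milk", "sugar", "coffee powder"},
    {"milk", "sugar", "coffee powder"},
    {"milk", "sugar"},
    {"milk", "bread"},
    {"bread", "eggs"},
]

x, y = {"milk", "sugar"}, {"coffee powder"}
print(support(baskets, x | y), confidence(baskets, x, y), lift(baskets, x, y))
```

The Apriori principle shows up here as a simple inequality: the support of {milk} can never be lower than the support of {milk, sugar}, so an infrequent item set lets us skip all of its supersets.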
7. K-means
K-means is an iterative algorithm that groups similar data into clusters. It calculates the centroids of k clusters and assigns each data point to the cluster whose centroid is closest to it.
Figure 6: The steps of the K-means algorithm.
Here's how it works.
Step 1: Choose a value of k; here, let k = 3. Randomly assign each data point to one of the 3 clusters, then compute the centroid of each cluster. The red, blue and green stars denote the centroids of the 3 clusters.
Step 2: Reassign each point to the closest cluster centroid. In the figure, the upper 5 points were assigned to the cluster with the blue centroid. Follow the same procedure to assign points to the clusters with the red and green centroids.
Step 3: Calculate the centroids of the new clusters. The gray stars are the old centroids; the new centroids are the red, green and blue stars.
Step 4: Repeat steps 2-3 until no points switch from one cluster to another. Once there is no switching for 2 consecutive steps, exit the K-means algorithm.
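The loop described above can be sketched with the standard library alone. One deliberate departure, flagged as an assumption: the initial assignment here is round-robin rather than random, so that the example gives the same result on every run. It also assumes there are at least k points. The data is made up.

```python
import math


def centroid(points):
    """Mean position of a non-empty list of equal-length tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def k_means(points, k, max_iter=100):
    # Initial assignment: round-robin instead of random (an assumption made
    # here for reproducibility). Requires len(points) >= k.
    labels = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Compute the centroid of every cluster from its current members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = centroid(members)
        # Reassign every point to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stop as soon as no point switches cluster.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Made-up data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = k_means(points, k=2)
print(labels, centroids)
```

K-means can settle on different clusterings depending on the starting assignment, which is why practical implementations usually run it several times with different random starts.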
8. PCA
Principal Component Analysis (PCA) is used to make data easy to explore and visualize by reducing the number of variables. This is done by capturing the maximum variance in the data in a new coordinate system with axes called "principal components".
Each component is a linear combination of the original variables, and the components are orthogonal to one another. Orthogonality between components indicates that the correlation between them is zero.
The first principal component captures the direction of maximum variability in the data. The second principal component captures the remaining variance in the data but is uncorrelated with the first component. Similarly, all successive principal components (PC3, PC4 and so on) capture the remaining variance while being uncorrelated with the previous components.
Figure 7: The 3 original variables (genes) are reduced to 2 new variables termed principal components (PCs).
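To see what "direction of maximum variability" means numerically, the sketch below finds only the first principal component of a small made-up 2-variable data set. It uses power iteration on the covariance matrix, which is one simple way to get the leading component without a linear algebra library; it is not how PCA is usually implemented in practice, where a library routine would be used instead.

```python
import math


def covariance_matrix(rows):
    """Center the data and return (covariance matrix, centered rows)."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    centered = [[r[d] - means[d] for d in range(dims)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    return cov, centered


def first_component(cov, iterations=200):
    """Power iteration: repeatedly multiply a vector by the covariance matrix
    and renormalise. It converges to the direction of maximum variance."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(row[j] * v[j] for j in range(len(v))) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Made-up data: two strongly correlated variables.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]

cov, centered = covariance_matrix(data)
pc1 = first_component(cov)
# Project every centered row onto PC1: two variables become one.
scores = [sum(c[d] * pc1[d] for d in range(len(pc1))) for c in centered]
print(pc1, scores)
```

Here the single new variable (the projection onto PC1) carries more variance than either original variable on its own, which is the sense in which PCA keeps the important information while dropping dimensions.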
Ensemble learning techniques
Ensembling means combining the results of multiple learners (classifiers) for improved results, by voting or averaging. Voting is used for classification and averaging is used for regression. The idea is that ensembles of learners perform better than single learners.
There are three types of ensembling algorithms available: Boosting, Bagging and Stacking. We are not going to cover 'stacking' here, but if you'd like a detailed explanation of it, here's a solid introduction from Kaggle.
9. Bagging with Random Forests
The first step in bagging is to create multiple models with data sets created using the Bootstrap Sampling method. In Bootstrap Sampling, each generated training set is composed of random subsamples from the original data set.
These training sets are the same size as the original data, with some records repeating multiple times and others not appearing at all. The entire original data set then becomes the test set. If the original data set size is N, then each generated training set size is also N. The number of unique records is about (2N/3); and the test set size is also N.
The second step in bagging is to create multiple models by using the same algorithm on the different generated training sets.
Random Forests are an example of this. They differ from a single decision tree, in which each node is split on the best feature that minimizes error. In Random Forests, a random selection of features is chosen at each split, and the best split is picked from among them. The reason for the randomness is that, even with bagging, decision trees that always choose the best feature end up with similar structures and correlated predictions. Bagging after splitting on a random subset of features means less correlation among the predictions from the subtrees.
As a parameter to Random Forest, you can specify the number of features that will be searched at each split point.
In bagging with Random Forest each tree is built using a random selection of records, and each split using a random sampling of predictors.
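Bootstrap sampling and the final vote are easy to try out on their own. The sketch below draws one bootstrap sample of size N from N records and measures how many distinct records it contains; the expected fraction is 1 - 1/e, roughly 0.63, which is the "about 2N/3" mentioned above. The seed and the helper names are arbitrary choices for illustration, and no actual trees are trained here.

```python
import random
from collections import Counter


def bootstrap_sample(records, rng):
    """Draw len(records) records at random, with replacement."""
    return [rng.choice(records) for _ in records]


def majority_vote(predictions):
    """Combine the class predictions of several models by simple voting."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)          # fixed seed so the run is repeatable
records = list(range(1000))     # stand-in for 1000 training rows

sample = bootstrap_sample(records, rng)
unique_fraction = len(set(sample)) / len(records)
print(len(sample), round(unique_fraction, 3))

# Three hypothetical models voting on one new record.
print(majority_vote(["sick", "healthy", "sick"]))
```

In a full bagging setup, each model would be trained on its own bootstrap sample and `majority_vote` would be applied to their predictions for each new record.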
10. Boosting with AdaBoost
AdaBoost stands for Adaptive Boosting. Bagging is a parallel ensemble because each model is built independently. Boosting, on the other hand, is a sequential ensemble where each model is built based on correcting the misclassifications of the previous model.
Bagging mostly involves 'simple voting', where each classifier votes to obtain a final outcome, one that is determined by the majority of the parallel models. Boosting involves 'weighted voting', where each classifier votes to obtain a final outcome determined by the majority, but the sequential models were built by assigning greater weights to the misclassified instances of the previous models.
Figure 9: AdaBoost for a decision tree.
In Figure 9, steps 1, 2 and 3 each involve a weak learner called a decision stump: a 1-level decision tree that makes a prediction based on the value of only one input feature, with its root connected directly to its leaves.
The process of constructing weak learners continues until a user-defined number of weak learners has been constructed or until there is no further improvement while training. Step 4 combines the 3 decision stumps of the previous models (and thus has 3 splitting rules in the decision tree).
First, start with one decision tree stump to make a decision on one input variable.
The size of the data points shows that we have applied equal weights to classify them as a circle or a triangle. The decision stump has generated a horizontal line in the top half to classify these points. We can see that two circles are incorrectly predicted as triangles. Hence, we will assign higher weights to these two circles and apply another decision stump.
Second, move to another decision tree stump to make a decision on another input variable.
The two misclassified circles from the previous step are now larger than the remaining points. The second decision stump will try to predict these two circles correctly.
As a result of assigning higher weights, these two circles have been correctly classified by the vertical line on the left. But this has now resulted in misclassifying the three circles at the top. Hence, we will assign higher weights to these three circles at the top and apply another decision stump.
Third, train another decision tree stump to make a decision on another input variable.
The three misclassified circles from the previous step are larger than the rest of the data points. Now, a vertical line to the right has been generated to classify the circles and triangles.
Fourth, combine the decision stumps.
The separators of the three previous models have been combined and we can see that this complex rule classifies data points more accurately than any of the weak learners.
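The same four-step idea can be sketched in code. This is a minimal version of AdaBoost with decision stumps, under a few assumptions: labels are coded as +1 and -1, the data is a made-up one-dimensional set rather than the circles and triangles in the figure, and each stump's vote weight uses the standard AdaBoost formula 0.5 * ln((1 - error) / error), which the post does not spell out.

```python
import math


def stump_predict(stump, point):
    """A stump tests one feature against one threshold."""
    feature, threshold, sign = stump
    return sign if point[feature] <= threshold else -sign


def best_stump(points, labels, weights):
    """Try every feature, threshold and orientation; keep the stump with
    the lowest weighted error."""
    best, best_err = None, float("inf")
    for feature in range(len(points[0])):
        for threshold in sorted({p[feature] for p in points}):
            for sign in (1, -1):
                stump = (feature, threshold, sign)
                err = sum(w for p, y, w in zip(points, labels, weights)
                          if stump_predict(stump, p) != y)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err


def adaboost(points, labels, rounds=3):
    n = len(points)
    weights = [1.0 / n] * n                      # start with equal weights
    ensemble = []
    for _ in range(rounds):
        stump, err = best_stump(points, labels, weights)
        err = max(err, 1e-10)                    # guard against log(1/0)
        alpha = 0.5 * math.log((1 - err) / err)  # this stump's vote weight
        ensemble.append((alpha, stump))
        # Misclassified points get heavier, correctly classified ones lighter.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, p))
                   for p, y, w in zip(points, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, point):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(s, point) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Made-up 1-D data that no single stump can classify perfectly.
points = [(x,) for x in range(10)]
labels = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost(points, labels, rounds=3)
print([predict(ensemble, p) for p in points])
```

On this data the best single stump still misclassifies 3 of the 10 points, while the weighted vote of three stumps classifies all of the training points correctly, which mirrors the combined rule in step 4.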