Complete Roadmap To Learn Machine Learning
In this beginner article, we shall look at key Machine Learning Terminology.
What is Machine Learning?
Machine Learning is a branch of AI in which algorithms improve automatically through experience and the use of data. Machine learning algorithms build a model from sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. Once the model is trained, its performance is checked on testing data. Machine learning algorithms are used in a wide variety of applications, such as medical analysis, text classification, audio augmentation, speech recognition, and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks.
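The train-then-test workflow described above can be sketched in a few lines. This is a minimal illustration assuming scikit-learn is installed; the Iris dataset and the Logistic Regression model are arbitrary choices just to show the pattern:

```python
# Sketch: learn a model from training data, then check it on unseen test data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                   # learn from the training data
test_accuracy = model.score(X_test, y_test)   # evaluate on data the model never saw
```

The key point is that `test_accuracy` is computed on data held out from training, which is what tells us whether the model generalizes.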
We shall now see key Machine Learning Terminology:
Label And Feature
Features are the input variables used to train the model. They can take many forms: a vector of image pixels, medical measurements, a sample text or paragraph for text classification, and so on. The Label (or Target) is the output variable the model predicts. A label could be the sales forecast for a future month, the class assigned to an image, the sentiment of a piece of text, and so on. We shall look at many more feature and label examples in the ongoing 100DaysOfML.
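As a concrete sketch, here is what features and labels might look like for a house-price prediction task. All values are made up for illustration:

```python
import numpy as np

# Features (X): the input variables, one row per example.
# Here each house is described by [square_feet, number_of_bedrooms].
X = np.array([[1400, 3],
              [1600, 3],
              [1700, 4]])

# Label (y): the output variable we want the model to predict,
# here the sale price of each house.
y = np.array([245000, 312000, 329000])
```

Each row of `X` pairs with one entry of `y`; together they form one labeled example.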
Machine learning is broadly divided into three types: Supervised Learning, Unsupervised Learning, and Reinforcement Learning.
Supervised learning is a machine learning approach defined by its use of labeled datasets. These datasets are used to train algorithms to classify data or predict outcomes accurately. Using labeled inputs and outputs, the model can measure its accuracy and learn over time. Supervised learning consists of Regression and Classification algorithms, which we shall look into in detail while discussing individual algorithms.
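A tiny regression example makes the "labeled inputs and outputs" idea concrete. This is only a sketch, assuming scikit-learn is installed; the hours-studied numbers are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Labeled data: feature = hours studied, label = exam score (made-up values).
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([52, 58, 66, 71, 78])

# The model learns the input-output relationship from the labeled pairs.
reg = LinearRegression().fit(X, y)

# Predict the label for an input it has not seen: 6 hours of study.
predicted = reg.predict([[6]])
```

Because the training data is labeled, the model can compare its predictions against the known scores and fit a line that minimizes the error.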
Unsupervised learning uses machine learning algorithms to analyze and cluster unlabeled datasets. These algorithms discover patterns in the data on their own. Unsupervised learning consists of Clustering and Dimensionality Reduction algorithms; these too will be covered in detail.
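Clustering can be sketched with k-means on a handful of unlabeled points. This assumes scikit-learn is installed; the two point groups are made up so the clusters are easy to see:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two visually separate groups, but no labels are provided.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.1]])

# KMeans discovers the two groups purely from the structure of the data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_  # cluster assignment for each point
```

Note that we never told the algorithm which point belongs to which group; it inferred the grouping from distances alone.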
Variance, Bias, Overfitting, and Underfitting
Bias: the difference between the model's average prediction and the correct value. Variance: how much the prediction would change if a different training dataset were used.
Low Bias models: k-Nearest Neighbors, Decision Trees, and Support Vector Machines. High Bias models: Linear Regression and Logistic Regression.
Low Variance models: Linear Regression and Logistic Regression. High Variance models: k-Nearest Neighbors (k=1), Decision Trees, and Support Vector Machines.
Overfitting: a Low Bias, High Variance model. Decision Trees are especially prone to overfitting. Overfitting essentially means learning the training dataset too closely, including its noise.
Underfitting: a High Bias, Low Variance model. Linear and Logistic Regression are prone to underfitting. Underfitting essentially means learning too little from the training dataset to capture the underlying pattern.
Note: We shall visualize how Overfitting and Underfitting take place while implementing Algorithms.
- High Bias – Low Variance (Underfitting): predictions are consistent, but inaccurate on average. This can happen when the model uses very few parameters.
- High Bias – High Variance: predictions are inconsistent and inaccurate on average.
- Low Bias – Low Variance: the ideal model, but one we cannot fully achieve in practice.
- Low Bias – High Variance (Overfitting): predictions are accurate on average, but inconsistent. This can happen when the model uses a large number of parameters.
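A quick way to see underfitting versus overfitting is to fit polynomials of different degrees to the same noisy data. This toy sketch uses only NumPy; the data and degrees are arbitrary choices for illustration:

```python
import numpy as np

# Noisy samples from a quadratic curve (the "true" pattern is degree 2).
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y = x**2 + rng.normal(0, 1, size=x.shape)

def train_error(degree):
    """Mean squared error on the training data for a polynomial fit."""
    coeffs = np.polyfit(x, y, degree)
    pred = np.polyval(coeffs, x)
    return np.mean((y - pred) ** 2)

err_underfit = train_error(1)   # too simple: high bias, misses the curve
err_good = train_error(2)       # matches the true pattern
err_overfit = train_error(9)    # too flexible: starts chasing the noise
```

Training error alone is misleading here: the degree-9 fit has the lowest training error precisely because it chases noise, which is why overfit models look great on training data and fail on new data.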
[Overfitting is like a topper who memorizes everything; underfitting is like a backbencher; an optimized model is like me, reading only what is required and still scoring good marks.]
How to handle High Variance or High Bias?
High Variance is caused by a model that tries to fit most of the training data points, making it overly complex. Consider the following to reduce High Variance:
- Reduce input features (because you are overfitting)
- Use a less complex model
- Include more training data
- Increase the Regularization term
High Bias is caused by a model too simple to capture the underlying pattern. Consider the following to reduce High Bias:
- Use a more complex model (e.g., add polynomial features)
- Increase input features
- Decrease the Regularization term
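The effect of the regularization term can be sketched with Ridge regression, where the `alpha` parameter plays that role. This assumes scikit-learn is installed; the synthetic data and alpha values are illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data: 20 samples, 5 features, with known true weights plus noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(0, 0.1, size=20)

# A larger regularization term (alpha) shrinks the learned coefficients,
# constraining the model and reducing variance (at the cost of some bias).
weak = Ridge(alpha=0.01).fit(X, y)
strong = Ridge(alpha=100.0).fit(X, y)
```

Comparing `weak.coef_` with `strong.coef_` shows the shrinkage: stronger regularization pulls the weights toward zero, which is exactly the lever the list above refers to.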
To increase prediction accuracy, we would want a model with both Low Variance and Low Bias. But we cannot fully achieve this because of the bias-variance trade-off: decreasing the variance increases the bias, and decreasing the bias increases the variance.
Training, Validation, and Testing Data Set
The Training dataset is used to train the model. The Validation dataset is smaller than the training set and is used to evaluate the performance of models with different hyperparameters. The Testing dataset is used to evaluate the final model on new data it has not seen before. Why we need three datasets will be covered in Cross-Validation. Stay updated with the 100DaysOfML repository.
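A common way to carve out all three sets is to split twice. This sketch assumes scikit-learn is installed and uses a 60/20/20 split, which is one common choice, not a fixed rule:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 50 samples with 2 features each.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# First carve off 20% of all samples as the test set...
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# ...then split the remainder into training and validation
# (0.25 of the remaining 80% = 20% of the original data).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)
```

The test set is held out from both splits, so it only ever sees the final model.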
There are many more terminologies, but we shall discuss them on this ongoing journey: 100DaysOfML- EkSauEk. For Roadmap check the same GitHub link.
This is it from Day9: Key Machine Learning Terminology.
Want to learn Machine Learning with a proper roadmap and resources? Then check this repository: https://github.com/lucifertrj/100DaysOfML