
---
title: Machine Learning
created_date: 2024-11-18
updated_date: 2024-11-18
aliases:
tags:
---

Machine Learning

Glossary

Learning Rate

The learning rate is the step size used during gradient descent: how far do we walk down the gradient in each step? If the learning rate is too small, training converges very slowly and may not reach the optimal weights in a practical number of steps; if it is too large, the loss jumps around noisily without a clear tendency to decrease; and if it is far too large, the weights may diverge towards infinity. Choosing an adequate learning rate is therefore important. A high learning rate tends to make the weights larger, the opposite effect of #Regularization.
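The effect of the step size can be sketched with a toy one-parameter loss (a hypothetical example, not tied to any library):

```python
# Minimal gradient-descent sketch on f(w) = (w - 3)^2, whose gradient
# is 2 * (w - 3); the learning rate scales every step down the gradient.
def gradient_descent(lr, steps=50, w=0.0):
    for _ in range(steps):
        grad = 2 * (w - 3)  # derivative of (w - 3)^2
        w -= lr * grad      # step = learning rate * gradient
    return w

# A moderate learning rate converges to the minimum at w = 3,
# while a learning rate that is far too large makes the updates diverge.
good = gradient_descent(lr=0.1)
diverged = gradient_descent(lr=1.1)
```

With `lr=0.1` the distance to the optimum shrinks by a constant factor per step; with `lr=1.1` each update overshoots so much that the error grows instead.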

Batch Size

The batch size determines how many data points are processed together before the model weights and bias are updated. The options are full-batch (all data points are included in each batch), mini-batch (a subset of more than one example) and a batch size of 1, which is used for #Stochastic Gradient Descent (SGD).

Stochastic Gradient Descent (SGD)

Is gradient descent with a batch size of 1, i.e. each weight update is based on a single, randomly chosen example. This usually works, but is very noisy, resulting in a loss curve that repeatedly goes up and down while still converging overall. The usual approach is therefore to use a larger batch size.

Mini-batch stochastic gradient descent (mini-batch SGD)

This is a compromise between full-batch gradient descent and #Stochastic Gradient Descent (SGD), where the batch size is between 2 and N for N data points. It chooses the examples in each batch at random and then averages their gradients to update the weights and bias once per iteration.
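A minimal mini-batch SGD sketch for a one-dimensional linear model, assuming hypothetical toy data on the line y = 2x + 1:

```python
import random

# Mini-batch SGD for the model y' = w * x + b with squared loss.
def minibatch_sgd(data, lr=0.05, batch_size=4, steps=1000):
    w, b = 0.0, 0.0
    for _ in range(steps):
        batch = random.sample(data, batch_size)  # random subset of examples
        # Average the gradients over the batch, then update the
        # weight and bias once per iteration.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / batch_size
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / batch_size
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

random.seed(0)
data = [(x / 10, 2 * (x / 10) + 1) for x in range(20)]  # points on y = 2x + 1
w, b = minibatch_sgd(data)  # approaches w = 2, b = 1
```

Because only a random subset is seen per step, individual updates are noisy, but the averaged gradients still pull the parameters towards the true line.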

Noise

In machine learning, noise refers to the variation during training that causes the loss to increase and decrease. It is sometimes a good thing, because it helps a model to generalize better (see #Generalization).

Epochs

An epoch is a time frame in which the model has processed every example in the training set once. E.g. given 1,000 examples and mini-batches of 100 samples per iteration, an epoch consists of 10 iterations. Training usually requires many epochs, meaning every example is processed multiple times. The number of epochs is a hyperparameter that is set before training, and it influences the tradeoff between training length and model quality. For the example above (1,000 data points, 20 epochs), the number of weight updates differs by batch size: 20 with full-batch, 200 with mini-batches of 100 and 20,000 with SGD.
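The bookkeeping above, using the numbers from the text:

```python
# Hypothetical numbers from the text: 1,000 training examples, 20 epochs.
examples, epochs = 1000, 20

def weight_updates(batch_size):
    # One weight update per batch; batches per epoch = examples / batch size.
    return epochs * (examples // batch_size)

full_batch = weight_updates(1000)  # 20 updates
mini_batch = weight_updates(100)   # 200 updates
sgd = weight_updates(1)            # 20,000 updates
```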

Loss Function

The loss function defines what is minimized during the training process. It should be small for good predictions and large for bad predictions. For linear regression the squared loss ($L_2$ loss) is used, which works well because the slope is constant: for a given $\Delta x$ the resulting $\Delta y$ is constant. In #Logistic Regression this is not the case: a small change $\Delta z$ near $z=0$ results in a large change in the output, while the same $\Delta z$ results in a very small change when $z$ is very large or very small. For this case we can use the log loss:

\begin{equation}
\text{Log Loss} = \sum_{(x,y)\in D} -y \, \log(y') - (1-y) \, \log(1-y')
\end{equation}
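A direct transcription of the log-loss sum in plain Python (with hypothetical labels and predictions; each prediction $y'$ must lie strictly between 0 and 1):

```python
import math

# Log loss over a dataset of (label, prediction) pairs, where the label y
# is 0 or 1 and the prediction y' is a probability in (0, 1).
def log_loss(pairs):
    return sum(-y * math.log(yp) - (1 - y) * math.log(1 - yp)
               for y, yp in pairs)

# Confident correct predictions give a small loss,
# confident wrong ones a large loss.
good = log_loss([(1, 0.9), (0, 0.1)])  # about 0.21
bad = log_loss([(1, 0.1), (0, 0.9)])   # about 4.61
```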

Generalization

Regularization

Is a mechanism for penalizing model complexity during training. It does this by adding a regularization term to the loss function, which results in less overfitting and better #Generalization. Example regularization methods:

  • $L_1$ Regularization (also called Lasso): adds the absolute values of the weights to the loss function, driving some weights to 0 and thus selecting only a subset of features: $\lambda \sum{|w_i|}$
  • $L_2$ Regularization (also called Ridge): adds the squares of the weights to the loss function, making the weights smaller: $\lambda \sum{w_i^2}$
  • Early stopping: a regularization method that simply ends training before the model fully converges. For example, after each epoch you evaluate the model on a validation set, and once the validation loss starts increasing (after having only decreased) you stop the training process, i.e. before the model overfits.

This results in a new loss function: $\text{Loss} = \text{Original Loss} + \lambda \cdot \text{Regularization Term}$

Regularization is very important for #Logistic Regression, which is very prone to overfitting.
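The penalized loss can be sketched directly, with hypothetical weights and regularization rate $\lambda$:

```python
# Regularized loss = original loss + lambda * penalty term.
def l1_penalty(weights):
    return sum(abs(w) for w in weights)  # Lasso: sum of absolute weights

def l2_penalty(weights):
    return sum(w * w for w in weights)   # Ridge: sum of squared weights

def regularized_loss(original_loss, weights, lam, penalty):
    return original_loss + lam * penalty(weights)

weights = [0.5, -2.0, 0.0]
lasso = regularized_loss(1.0, weights, lam=0.1, penalty=l1_penalty)  # 1.25
ridge = regularized_loss(1.0, weights, lam=0.1, penalty=l2_penalty)  # 1.425
```

Larger weights are punished more, so minimizing this sum trades prediction quality against model complexity.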

Feature

Is an input variable to an ML model, for example temperature, humidity or pressure.

Feature Cross

This is a synthetic new feature that combines two existing features (e.g. wind: still, light, windy and temperature: freezing, chilly, warm, hot, resulting in crossed values such as still-freezing or windy-hot). Feature crosses are mostly used with linear models and rarely with neural networks.
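A minimal sketch of crossing the two hypothetical categorical features from the example:

```python
import itertools

# Feature cross over two categorical features: every combination of
# wind and temperature becomes its own synthetic feature value.
wind = ["still", "light", "windy"]
temperature = ["freezing", "chilly", "warm", "hot"]

crossed = [f"{w}_x_{t}" for w, t in itertools.product(wind, temperature)]
# 3 * 4 = 12 synthetic feature values, e.g. "still_x_freezing"
```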

Label

Is the answer or result portion of an example, i.e. the thing we want to predict. A labeled example consists of several features plus a label (e.g. spam or not spam).

Normalization

The features should be scaled to roughly the same order of magnitude for the model to work properly. There are different methods to normalize and scale features, such as:

  • Min-Max Scaling: maps all features into the range [0,1]. This is important for models where scale drastically changes behaviour (e.g. neural networks).
  • Standardization: transforms a feature to have a mean of 0 and a standard deviation of 1. It is mostly used for algorithms that assume normally distributed inputs (e.g. Principal Component Analysis or logistic regression) and when the features have different ranges and units. It is especially important for distance-based algorithms such as k-means clustering, and it also helps gradient descent converge; it ensures that each feature contributes equally to the outcome.
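Both scaling methods can be sketched with the Python standard library (hypothetical temperature values):

```python
import statistics

# Two common feature-scaling methods applied to one feature column.
def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]  # maps into [0, 1]

def standardize(values):
    mean = statistics.mean(values)
    std = statistics.pstdev(values)                # population std dev
    return [(v - mean) / std for v in values]      # mean 0, std dev 1

temps = [10.0, 20.0, 30.0, 40.0]
scaled = min_max_scale(temps)
standardized = standardize(temps)
```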

Confusion Matrix

|                        | Actual Positive | Actual Negative |
| ---------------------- | --------------- | --------------- |
| **Predicted Positive** | True Positive (TP): a spam email correctly classified as spam | False Positive (FP): a not-spam email misclassified as spam |
| **Predicted Negative** | False Negative (FN): a spam email misclassified as not-spam | True Negative (TN): a not-spam email correctly classified as not-spam |

Note that when the total of actual positives is not close to the total of actual negatives, the dataset is imbalanced (see #Classification).
The confusion matrix is closely related to hypothesis testing and the p-value in scientific studies, where a false positive corresponds to a type I error (nicely illustrated in this YouTube video: Catching a Cheater with Math).

Key Metrics Derived:

  • Accuracy: rate of correct predictions among all predictions: $\frac{TP + TN}{TP + TN + FP + FN}$
  • Precision: rate of correctly positively classified samples over all positively classified samples: $\frac{TP}{TP + FP}$
  • Recall (Sensitivity): correctly positively classified samples over all actual positive samples: $\frac{TP}{TP + FN}$
  • Specificity: correctly negatively classified samples over all actual negative samples: $\frac{TN}{TN + FP}$
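The four metrics computed from hypothetical confusion-matrix counts:

```python
# Metrics derived from confusion-matrix counts (hypothetical example
# counts for a spam classifier evaluated on 100 emails).
tp, tn, fp, fn = 40, 50, 5, 5

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 90 / 100
precision = tp / (tp + fp)                  # 40 / 45
recall = tp / (tp + fn)                     # 40 / 45
specificity = tn / (tn + fp)                # 50 / 55
```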

ROC and AUC

![[Pasted image 20241118160705.png]]
The ROC curve plots the true positive rate (TPR, i.e. recall) against the false positive rate (FPR) across all classification thresholds; the AUC is the area under this curve. In the image above the better model is the one on the right, which maximizes the TPR and minimizes the FPR. If false positives and false negatives have different costs, choosing a point on the ROC curve closer to the y-axis (fewer false positives) or closer to the line $y=1$ (fewer false negatives) might be beneficial.
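Each point on an ROC curve is the (FPR, TPR) pair obtained at one classification threshold; a sketch with hypothetical labels and scores:

```python
# Sweep classification thresholds and compute the (FPR, TPR) points of an
# ROC curve from true labels and predicted scores (hypothetical data).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

def roc_point(threshold):
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    p = sum(labels)          # actual positives
    n = len(labels) - p      # actual negatives
    return fp / n, tp / p    # (FPR, TPR)

points = [roc_point(t) for t in (0.0, 0.3, 0.5, 1.1)]
# threshold 0.0 classifies everything positive -> (1.0, 1.0);
# threshold 1.1 classifies everything negative -> (0.0, 0.0)
```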

Prediction Bias

If the average outcome of the dataset (ground truth) is 5% (e.g. number of spam emails) then the prediction of a good model should also be close to 5%, else you have a prediction bias. Possible causes are:

  • Biased data: biased sampling for the training set
  • #Regularization that is too strong
  • Bugs in the model training pipeline
  • Features that are not good enough

Models

Linear Regression

A model that makes continuous numerical predictions and can be described by a linear function (possibly multivariate), resulting in a line, plane, etc.

Logistic Regression

Predicts the probability of a given outcome, e.g. is this email spam or not? Will it rain today? The image below shows why a linear regression does not work well for such a problem: ![[Pasted image 20241118142001.png]] Logistic regression uses the sigmoid function $f(x)=L/(1+e^{-k(x-x_0)})$, which is bounded between 0 and 1 and thus predicts a probability. This probability can be used directly or put into buckets, resulting in a classification model (using labels: is spam, is not spam).


\begin{equation}
y' = \frac{1}{1+e^{-z}}
\end{equation}

where $z=b+w_1x_1+w_2x_2 + \dotsb + w_nx_n$. This means that a linear regression can be converted into a logistic regression by applying the formula above. ![[Pasted image 20241118143053.png]]
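A minimal sketch of this conversion, with hypothetical weights and features:

```python
import math

# Sigmoid applied to the linear combination z = b + w1*x1 + ... + wn*xn,
# turning a linear model's output into a probability y' in (0, 1).
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def predict_proba(features, weights, bias):
    # z is the output of the underlying linear model.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

p = predict_proba([1.0, 2.0], weights=[0.5, -0.25], bias=0.0)  # z = 0 -> 0.5
```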

Classification

This model predicts which category an example belongs to. #Logistic Regression can be used to perform binary classification, i.e. distinguishing between two classes.

> [!important] Classification Threshold
> The classification threshold is important because it divides the predicted probabilities into classes. From the #Confusion Matrix it becomes clear that we want to minimize false positives and false negatives, but depending on the application one can be much more costly than the other, which leads to a different classification threshold that we should use. If a dataset has a large imbalance, the threshold might be close to 0 or 1.

Multi-class classification can be achieved with binary classification models used in series (classes A, B, C: 1st step A+B vs. C, 2nd step A vs. B).
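Applying a classification threshold to predicted probabilities can be sketched as follows (hypothetical spam scores):

```python
# Turn predicted probabilities into class labels with a threshold.
def classify(probs, threshold=0.5):
    return ["spam" if p >= threshold else "not spam" for p in probs]

probs = [0.2, 0.55, 0.9]
default = classify(probs)                # threshold 0.5
strict = classify(probs, threshold=0.8)  # fewer false positives
```

Raising the threshold trades false positives for false negatives: fewer legitimate emails are flagged, but more spam slips through.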

Libraries

Keras