vault backup: 2025-02-03 08:29:48
3 Knowledge/6 Business and Law/Patent.md
- [ ] #todo/b Find out what patents are called and what is important about patents.
- [ ] Do I have to fight to be listed as an author? It would probably also be good for my career.
- [ ] What is a grace period? --> [[Floris]] wants to do that and then take more time to actually file the patent.

| NAME    | Test |
| ------- | ---- |
| Maurin  | 6    |
| Sereina | 8    |
| Claudio | 5    |
| Luana   | 88   |
3 Knowledge/Machine Learning.md
---
title: Machine Learning
created_date: 2024-11-18
updated_date: 2024-11-18
aliases:
tags:
---
# Machine Learning

## Glossary

### Learning Rate
The learning rate is the step size used during gradient descent: how far do we step down the gradient? If the learning rate is too small, training is slow and may never reach the optimal weights; if it is too large, the loss becomes very noisy without a clear tendency to decrease; and if it is far too large, training may diverge towards infinity. Choosing an adequate learning rate is therefore important.

A higher learning rate tends to make the weights larger, which is the opposite effect of the [[#Regularization|regularization rate]].
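As a minimal sketch (assuming a simple one-dimensional quadratic loss; function and parameter names are illustrative), each gradient-descent step scales the gradient by the learning rate:

```python
def gradient_descent(grad, w0, learning_rate, steps):
    """Minimize a function via gradient descent, given its gradient `grad`."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad(w)  # step down the gradient
    return w

# Example: loss L(w) = (w - 3)^2 has gradient 2*(w - 3) and its minimum at w = 3.
w_opt = gradient_descent(lambda w: 2 * (w - 3), w0=0.0, learning_rate=0.1, steps=100)
```

With `learning_rate=0.1` the error shrinks by a constant factor per step; a rate above 1.0 would make this particular loss diverge.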
### Batch Size

The batch size determines how many data points are processed together before the model weights and bias are updated. The options are full-batch (all data points in one batch), mini-batch (a subset of examples larger than 1), and a batch size of 1 for [[#Stochastic Gradient Descent (SGD)|SGD]].
### Stochastic Gradient Descent (SGD)

Gradient descent with a batch size of 1: the weights are updated after every single example. This usually works, but it is very noisy, resulting in a loss curve that repeatedly goes up and down while still converging. The usual approach is to use a larger batch size.
### Mini-batch stochastic gradient descent (mini-batch SGD)

This is a compromise between full-batch and [[#Stochastic Gradient Descent (SGD)|SGD]], where the batch size is between 2 and N for N data points. The examples in each batch are chosen at random, and their gradients are averaged to update the weights and bias once per iteration.
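A mini-batch SGD sketch in pure Python (fitting y = 2x + 1 with squared loss; the data, batch size, and learning rate are made up for illustration):

```python
import random

def minibatch_sgd(data, batch_size, learning_rate, epochs):
    """Fit y = w*x + b with mini-batch SGD on squared loss."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        random.shuffle(data)  # choose batch members at random
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # Average the per-example gradients of (w*x + b - y)^2 over the batch.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w  # one update per mini-batch
            b -= learning_rate * grad_b
    return w, b

random.seed(0)
points = [(k / 10, 2 * (k / 10) + 1) for k in range(-20, 21)]
w, b = minibatch_sgd(points, batch_size=8, learning_rate=0.1, epochs=200)
```

Because the data here is noiseless, the run converges close to w = 2, b = 1.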
### Noise

In machine learning, noise refers to the variation during training that causes the loss to go up and down. It is sometimes a good thing, because it helps a model [[#Generalization|generalize]] better.
### Epochs

An epoch is the period in which the model has processed every example in the training set once. E.g., given 1000 examples and mini-batches of 100 examples per iteration, an epoch consists of 10 iterations. Training usually requires many epochs, meaning every example is processed multiple times. The number of epochs is a hyperparameter set before training; it trades off training time against model quality.

Training differences in the example above (1000 data points, 20 epochs):

- Full batch: the model runs 1000 examples and then updates the weights once --> 20 updates to the model
- [[#Stochastic Gradient Descent (SGD)|SGD]]: the model runs 1 example and then updates the weights --> 20000 updates to the model (1000 per epoch)
- [[#Mini-batch stochastic gradient descent (mini-batch SGD)|mini-batch SGD]]: the model runs 100 examples (the mini-batch size) and then updates the weights --> 200 updates to the model (10 per epoch)
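The update counts above follow from one formula (assuming the batch size divides the number of examples evenly):

```python
def num_updates(n_examples, batch_size, epochs):
    """Weight updates = (batches per epoch) * epochs."""
    return (n_examples // batch_size) * epochs

full_batch = num_updates(1000, 1000, 20)  # 20
sgd        = num_updates(1000, 1, 20)     # 20000
mini_batch = num_updates(1000, 100, 20)   # 200
```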
### Loss Function

The loss function defines what is minimized during training. It should be small for good predictions and large for bad predictions.

For linear regression the *squared loss ($L_2$ loss)* is used, which works well because the slope is constant: a given $\Delta x$ always produces the same $\Delta y$. In [[#Logistic Regression]] this is not the case: a small change $\Delta z$ near $z=0$ results in a large change in $y$, while the same $\Delta z$ when $z$ is very large or very small barely changes $y$.

For this case we can use the *log loss*:

$$\text{Log Loss} = \sum_{(x,y)\in D} -y \log(y') - (1-y)\log(1-y')$$
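A direct translation of the formula (assuming labels $y \in \{0, 1\}$ and predicted probabilities $y'$ strictly between 0 and 1):

```python
import math

def log_loss(labels, predictions):
    """Sum of -y*log(y') - (1-y)*log(1-y') over all (y, y') pairs."""
    return sum(-y * math.log(p) - (1 - y) * math.log(1 - p)
               for y, p in zip(labels, predictions))

# Confident correct predictions give a small loss ...
low = log_loss([1, 0], [0.99, 0.01])
# ... while confident wrong predictions give a large one.
high = log_loss([1, 0], [0.01, 0.99])
```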
### Generalization
### Regularization

A mechanism for penalizing model complexity during training. It adds a regularization term to the loss function, which results in less overfitting and better [[#Generalization]].

Example regularizations:

- $L_1$ regularization (also called Lasso) adds the absolute weights to the loss function, driving some weights to 0 and thus selecting only a subset of features: $\lambda \sum{|w_i|}$
- $L_2$ regularization (also called Ridge) adds the squared weights to the loss function, making the weights smaller: $\lambda \sum{w_i^2}$
- Early stopping ends training before the model fully converges. For example, after each epoch you evaluate the model on a validation set, and once the validation loss starts increasing (after only decreasing) you stop training, i.e. before the model overfits.

This results in a new loss function:

$\text{Loss} = \text{Original Loss} + \lambda \cdot \text{Regularization Term}$
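An $L_2$-regularized squared loss might look like this (a sketch; the errors, weights, and $\lambda$ values are made up):

```python
def l2_regularized_loss(errors, weights, lam):
    """Squared loss plus lambda times the sum of squared weights."""
    original = sum(e ** 2 for e in errors)
    penalty = lam * sum(w ** 2 for w in weights)
    return original + penalty

# A larger lambda penalizes the same weights more heavily.
small = l2_regularized_loss([0.5, -0.2], [1.0, 2.0], lam=0.01)  # 0.29 + 0.05
large = l2_regularized_loss([0.5, -0.2], [1.0, 2.0], lam=1.0)   # 0.29 + 5.0
```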
Regularization is very important for [[#Logistic Regression]], because logistic regression is very prone to overfitting.
### Feature

An input variable to an ML model, for example temperature, humidity, or pressure.
### Feature Cross

A synthetic feature formed by combining two existing features (e.g. wind: still, light, windy; temperature: freezing, chilly, warm, hot --> crossed categories such as freezing-windy). Feature crosses are mostly used with linear models and rarely with neural networks.
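The cross of the two categorical features above can be sketched as a Cartesian product (category names are from the example; the string encoding is illustrative):

```python
from itertools import product

wind = ["still", "light", "windy"]
temperature = ["freezing", "chilly", "warm", "hot"]

# The cross is the Cartesian product of the two features: 4 * 3 = 12 synthetic categories.
crossed = [f"{t}-{w}" for t, w in product(temperature, wind)]
```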
### Label

The answer or result portion of an example: the thing we want to predict (e.g. spam or not spam). A labeled example consists of several features together with a label.
### Normalization

Features should be scaled to roughly the same order of magnitude for the model to work properly. There are different methods to normalize and scale features, such as:

- Min-max scaling: maps all features into the range $[0,1]$. This is important for models whose behaviour changes drastically with scale (e.g. neural networks).
- Standardization: transforms a feature to mean 0 and standard deviation 1. This is mostly used for algorithms that assume normally distributed inputs (e.g. [[Principal Component Analysis|PCA]] or logistic regression) and when the features have different ranges and units.

Normalization is especially important in distance-based algorithms such as K-means clustering, and for [[Gradient Descent]]; it ensures that each feature contributes equally to the outcome.
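Both methods as a sketch (pure Python; the sample temperatures are made-up values):

```python
def min_max_scale(xs):
    """Map values linearly into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Shift to mean 0 and scale to standard deviation 1 (population std)."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]

temps = [10.0, 20.0, 30.0, 40.0]
scaled = min_max_scale(temps)   # endpoints become 0.0 and 1.0
zscores = standardize(temps)    # mean 0, standard deviation 1
```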
### Confusion Matrix

| | Actual Positive | Actual Negative |
| ------------------ | ------------------------------------------------------------------- | --------------------------------------------------------------------------- |
| Predicted Positive | **True Positive (TP)**<br>A spam email correctly classified as spam | **False Positive (FP)**<br>A not-spam email misclassified as spam |
| Predicted Negative | **False Negative (FN)**<br>A spam email misclassified as not-spam | **True Negative (TN)**<br>A not-spam email correctly classified as not-spam |

Note that when the total of actual positives is not close to the total of actual negatives, the dataset is imbalanced (see [[#Classification]]).

The confusion matrix is closely related to the [[p-value]] in scientific studies (nicely illustrated in this YouTube video: [Catching a Cheater with Math](https://youtu.be/XTcP4oo4JI4?si=t0bMXD4V3h01E0K1)).
**Key Metrics Derived**:

- **Accuracy**: rate of correct predictions among all predictions: $$\frac{TP + TN}{TP + TN + FP + FN}$$
- **Precision**: correctly positively classified samples over all positively classified samples: $$\frac{TP}{TP + FP}$$
- **Recall (Sensitivity)**: correctly positively classified samples over all actual positive samples: $$\frac{TP}{TP + FN}$$
- **Specificity**: correctly negatively classified samples over all actual negative samples: $$\frac{TN}{TN + FP}$$
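The four metrics computed from hypothetical counts (TP = 40, FP = 10, FN = 5, TN = 45 are made-up numbers):

```python
tp, fp, fn, tn = 40, 10, 5, 45  # hypothetical confusion-matrix counts

accuracy    = (tp + tn) / (tp + tn + fp + fn)  # 85 / 100
precision   = tp / (tp + fp)                   # 40 / 50
recall      = tp / (tp + fn)                   # 40 / 45
specificity = tn / (tn + fp)                   # 45 / 55
```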
### ROC and AUC

![[Pasted image 20241118160705.png]]

In the image above, the better model is the one on the right, which maximizes the TPR and minimizes the FPR.

If false positives and false negatives have different costs, choosing a point on the ROC curve closer to the y-axis or closer to the line $y=1$ may be beneficial.
### Prediction Bias

If the average outcome in the dataset (ground truth) is 5% (e.g. the share of spam emails), then the average prediction of a good model should also be close to 5%; otherwise the model has a prediction bias. Possible causes are:

- Biased data: biased sampling for the training set
- [[#Regularization]] that is too strong
- Bugs in the model training pipeline
- Features that are not good enough
## Models

### Linear Regression

A model that makes continuous numerical predictions and can be described by a linear function (possibly multivariate), resulting in a line, plane, etc.
### Logistic Regression

*Predicts the probability of a given outcome*: e.g. is this email spam or not? Will it rain today?

You can see in the image below why a linear regression does not work well for such a problem:

![[Pasted image 20241118142001.png]]

Logistic regression uses the *sigmoid function* $f(x)=L/(1+e^{-k(x-x_0)})$, which is bounded between 0 and 1 and therefore predicts a probability. The probability can be used as such, or put into buckets, resulting in a classification model (using labels: is spam, is not spam).

$$
\begin{equation}
y' = \frac{1}{1+e^{-z}}
\end{equation}
$$

where $z=b+w_1x_1+w_2x_2 + \dotsb + w_nx_n$.

This means that a linear regression can be converted into a logistic regression using the formula above.

![[Pasted image 20241118143053.png]]
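The conversion in code (a sketch with made-up weights, bias, and inputs):

```python
import math

def sigmoid(z):
    """Map the linear output z to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict_proba(features, weights, bias):
    """z = b + w1*x1 + ... + wn*xn, then squash z with the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

p = predict_proba([1.0, 2.0], weights=[0.5, -0.25], bias=0.0)  # z = 0 -> p = 0.5
```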
### Classification

This model predicts which category an example belongs to. [[#Logistic Regression]] can be used to perform a binary classification, i.e. to distinguish between two classes.

> [!important] Classification Threshold
> The classification threshold is important because it divides the predictions into classes. The [[#Confusion Matrix]] makes clear that we want to minimize false positives and false negatives, but depending on the application one may be much more costly than the other, which leads to a different classification threshold that we should use.
> If a dataset has a large imbalance, the threshold might be close to 0 or 1.

Multi-class classification can be achieved with binary classification models by using them in series (classes A, B, C: 1st step A+B vs. C, 2nd step A vs. B).
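Applying a classification threshold to predicted probabilities (the probabilities and threshold values are illustrative):

```python
def classify(probabilities, threshold=0.5):
    """Turn probabilities into binary labels: 1 (spam) if p >= threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probabilities]

probs = [0.1, 0.4, 0.6, 0.9]
default = classify(probs)                # threshold 0.5
# Raising the threshold trades false positives for false negatives.
strict = classify(probs, threshold=0.8)
```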
## Libraries

### Keras