In the world of math, most things can be represented as a function, or simply input -> equation -> output (e.g. y = 2x). Conceptually, many things in real life are also functions. Neural nets are universal approximators: by the universal approximation theorem, they can represent any continuous function, so neural nets can approximate many real-life functions. Here's a NN trying to approximate a curve (Hello world sketch), from Emergent Garden:
This is a neural net:
While that's oversimplified, a NN can be split into three separate parts: input, hidden, and output.
The input layer is basically a way to take in your input in a mathematical form, as numbers the network can work with.
The hidden layers, built with weights and biases (more on that later), are the main part of a neural net, the neural net magic: they take the inputs and transform them with the values of the weights and biases to eventually reach an output.
The output layer is the final layer; its activations (or probabilities) hold the answer the NN made.
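To make those three parts concrete, here is a minimal NumPy sketch of a single forward pass through a tiny network. The layer sizes (2 inputs, 3 hidden neurons, 1 output) and the ReLU activation are arbitrary choices for illustration, and the weights are random, so the output means nothing yet.
import numpy as np

rng = np.random.default_rng(0)

# Input layer: two numbers describing one example
x = np.array([1.0, 2.0])

# Hidden layer: 3 neurons, each with its own weights and a bias
W1 = rng.standard_normal((3, 2))
b1 = rng.standard_normal(3)
hidden = np.maximum(0, W1 @ x + b1)  # ReLU activation

# Output layer: 1 neuron turning the hidden activations into the answer
W2 = rng.standard_normal((1, 3))
b2 = rng.standard_normal(1)
output = W2 @ hidden + b2
print(output)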
For a single linear layer, the NN can be simplified to one function:
$$y=mx+b$$, or, in the case of neural nets, $$y=wx+b$$
Here, y is the output layer, or output of the function, while x is the input layer, or input to the function. The hidden-layer transformation happens through the values of w and b, the weights and biases.
Sample Python code, using the target function $$y=8x+3$$:
import numpy as np
import matplotlib.pyplot as plt
x_v = np.array([2, 4])
# target values from the function y = 8x + 3
y_v = 8 * x_v + 3
Now we have a simple dataset for the linear layer: [(2, 19), (4, 35)]. The linear layer now has to recover w = 8 and b = 3 from that data.
In the neural net, the parameter values start out random, then slowly get closer and closer to the perfect values as training goes on.
Python code for the linear layer:
np.random.seed(42)
w = np.random.randn()
b = np.random.randn()

def forward(x):
    return w * x + b

y_hat = [forward(x) for x in x_v]
print(f"w: {w:.4f} b: {b:.4f}")
# w: 0.4967 b: -0.1383
To get to the accurate values, we need two things: a way to measure how wrong the current predictions are, and a way to nudge w and b in the direction that reduces that error.
For regression problems like this one, we use MSE (Mean Squared Error) loss. MSE loss is defined as:
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
Using this, we take the cost function $$C = \frac{\text{MSE}}{2}$$; the division by 2 cancels the factor of 2 that appears when differentiating the squared error, which makes it very convenient when relating the parameters to the error in MSE.
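As a quick sketch of what that looks like in code, reusing x_v, y_v, and the y_hat list from above, we can measure how far off the random initialization is:
y_hat = np.array(y_hat)
mse = np.mean((y_v - y_hat) ** 2)  # average squared error over the dataset
cost = mse / 2                     # cost function C = MSE / 2
print(f"MSE: {mse:.2f}  Cost: {cost:.2f}")
# With the random w and b above, the error is huge; the predictions are nowhere near the targets 19 and 35.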
A graph from Qtnx:
We can find the right direction to move the parameters by looking at the loss and the directions in which it is increasing or decreasing.
We calculate those directions, the gradients, using the chain rule, the calculus behind backpropagation.
With cost C and activation a (here a = wx + b), the chain rule for the weights and biases is:
$$\frac{\partial C}{\partial w} = \frac{\partial C}{\partial a}\frac{\partial a}{\partial w}$$ $$\frac{\partial C}{\partial b} = \frac{\partial C}{\partial a}\frac{\partial a}{\partial b}$$
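Working that chain rule through for this specific model, where $$a = \hat{y} = wx + b$$ and $$C = \frac{\text{MSE}}{2}$$, gives:
$$\frac{\partial C}{\partial w} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)\,x_i \qquad \frac{\partial C}{\partial b} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)$$
These gradients point uphill, in the direction that increases the cost, which is why they get subtracted in the update below.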
We then update the parameters with the LR (learning rate), usually between 1e-3 and 1e-5:
w -= grad(w)*lr
b -= grad(b)*lr
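Putting the gradient formulas and the update rule together, a minimal sketch of the full training loop might look like this; the learning rate of 1e-3 and the 200,000 steps are arbitrary choices that happen to converge well on this tiny two-point dataset:
lr = 1e-3  # learning rate, inside the range mentioned above
for step in range(200_000):
    y_hat = w * x_v + b                    # forward pass on the whole dataset
    grad_w = np.mean((y_hat - y_v) * x_v)  # dC/dw from the chain rule above
    grad_b = np.mean(y_hat - y_v)          # dC/db from the chain rule above
    w -= lr * grad_w                       # gradient-descent update
    b -= lr * grad_b

print(f"w: {w:.4f} b: {b:.4f}")
# prints values very close to w: 8.0000 b: 3.0000
Since every step uses the whole (two-point) dataset, this is plain batch gradient descent.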
The LR matters, as shown in the Qtnx graphs:
Low LR, a proper, steady drop:
High LR, a wild, unstable drop:
Overall, with just these few steps (forward pass, loss, gradients, update) repeated over and over, the model trains and learns the exact regression.
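To see the learning-rate effect for yourself, the loop can be wrapped in a small helper (train below is just a hypothetical name for this sketch) and run with two different rates; 1e-3 and 5e-1 are arbitrary picks that show a steady drop versus a blow-up on this particular dataset:
def train(lr, steps=150):
    # hypothetical helper: re-run the training loop above from the same random start
    np.random.seed(42)
    w, b = np.random.randn(), np.random.randn()
    losses = []
    for _ in range(steps):
        y_hat = w * x_v + b
        losses.append(np.mean((y_v - y_hat) ** 2) / 2)  # cost C = MSE / 2
        w -= lr * np.mean((y_hat - y_v) * x_v)
        b -= lr * np.mean(y_hat - y_v)
    return losses

plt.plot(train(1e-3), label="low lr = 1e-3 (steady drop)")
plt.plot(train(5e-1), label="high lr = 5e-1 (blows up)")
plt.yscale("log")
plt.xlabel("step")
plt.ylabel("cost")
plt.legend()
plt.show()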