Ok, now all we have to do is combine the two basic ideas we've learned about so far. If we connect a series of perceptrons and feed good training data (embeddings) through them, we have a neural network. Most of you probably know that term - neural networks are among the most famous machine learning models because they have scaled extremely well and are the foundational architecture behind ChatGPT and its peers.

Let's first add in an image here so we have a general sense of what we're talking about.

[Image: a neural network diagram - an input layer, two hidden layers, and an output layer, with each node showing a summation followed by an activation function f]

I like this image, because it highlights how each node in the neural network is just a perceptron. It uses the same symbols as our perceptron article: first we sum the incoming vector, then we apply some sort of "activation function" f, which squeezes the output into the space between 0 and 1. One thing you may notice existed in our perceptron graphic but is missing here is the input weights. Those do exist in neural networks; they just aren't pictured.
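
If it helps to see that in code, here's a tiny sketch of a single node in plain Python (my own toy numbers, not anything taken from the image): weight each input, sum everything up, and squeeze the result between 0 and 1 with a sigmoid.

```python
import math

def node(inputs, weights, bias=0.0):
    # The weighted sum of the incoming vector (the weights aren't pictured, but they're here)
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    # The activation function f: a sigmoid squeezes the sum into the space between 0 and 1
    return 1 / (1 + math.exp(-total))

# Made-up inputs and weights, just to show the shape of the computation
print(node([0.5, -1.2, 3.0, 0.1], [0.4, 0.7, -0.2, 1.5]))
```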

The first hidden layer is intuitive based on our understanding of perceptrons. Each node takes in a vector [x1, x2, x3, x4] (you can see this denoted with the four incoming lines). Then, that node multiplies each item in the vector by an associated weight (not pictured above), sums them up, and produces an associated output. Those outputs then flow into another set of perceptrons, then into a final output layer that produces output probabilities. This architecture is flexible, and what's pictured above is just an example. You can vary the number of layers, number of input and output nodes, number of nodes inside each layer, etc.
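
For anyone who wants to see that flow end to end, here's a rough sketch of a forward pass shaped roughly like the picture (four inputs, two hidden layers with seven nodes between them, a small output layer - the exact output count is my guess). The weights are random placeholders, nothing trained, so the outputs are meaningless, but the plumbing is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# One weight matrix and bias vector per layer. Sizes roughly mirror the image:
# 4 inputs -> hidden layer of 4 -> hidden layer of 3 -> 2 outputs.
layer_sizes = [4, 4, 3, 2]
weights = [rng.normal(size=(m, n)) for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def forward(x):
    # Each layer: multiply by the weights, add the bias, squeeze with the activation function
    for W, b in zip(weights, biases):
        x = sigmoid(x @ W + b)
    return x  # the final layer's outputs, each landing between 0 and 1

print(forward(np.array([0.5, -1.2, 3.0, 0.1])))
```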

Now, explaining the input and output of a single perceptron is not terribly complicated. An incoming vector gets turned into a probability. Explaining the semantic significance of the numbers that get passed between layers of a neural network is significantly less tangible - it's one of the fundamental issues with neural networks. Many traditional machine learning classifiers offer transparency into their decision making. When we build a logistic regression classifier, we generate a line that separates one class from the other. When we build a decision tree, we generate a series of rules that separate one class from another. When we build a neural network, we build a complex series of "neurons" that send each other information. And while we can verify that our outputs are regularly correct, we can't always easily explain why. For that reason, it's often referred to as a "black box" architecture, implying that we often can't understand what's happening inside. Why the weight associated with the connection between node a and node b helps influence our output is often not obvious. This is especially true as these architectures scale up. The image above has four input parameters, and two hidden layers composed of seven total nodes. Modern Large Language Models make this look minuscule. The primary model I used to write my thesis (over four years ago) had 110 million parameters and 12 hidden layers, each composed of 768 nodes. Understanding the significance of the individual connections between nodes in a system that large is almost impossible.

In spite of this shortcoming in explainability, the larger logic of how such a model works is the same as what we discussed for the perceptron. We create a training task - like classifying positive vs. negative sentiment - and then guess and check, adjusting the weights as we go. Adding more nodes and more connectivity should, in theory, mimic our brains: by allowing for more connections, we allow the model to pick up increasing levels of nuance. In a perceptron, where each input weight is associated with exactly one item in the input vector, our model can learn over time that "dislike" is negative and "like" is positive, and adjust the weights accordingly. That approach might come up short when it comes across a new sentence like "I don't like dogs". With more complex connections, the model can build an understanding of how "like" interacts with "don't". The more layers and nodes we have, the more the model can capture the complexities present in natural language.
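
Here's a deliberately silly illustration of that limitation (the words and weights are invented). With exactly one weight per word, "don't" can only nudge the score down a little - it can't flip the meaning of "like". A hidden layer that gets to see both words at once can learn that the combination means something new.

```python
# One weight per word - the perceptron-style setup described above (numbers made up)
word_weights = {"i": 0.0, "like": 1.0, "dislike": -1.0, "don't": -0.5, "dogs": 0.0}

def single_layer_score(sentence):
    # Each word contributes its own weight, independently of every other word
    return sum(word_weights.get(word, 0.0) for word in sentence.lower().split())

print(single_layer_score("I like dogs"))        # 1.0  -> reads as positive (fine)
print(single_layer_score("I don't like dogs"))  # 0.5  -> still reads as positive (not fine)
```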

So, to zoom out a bit and recap - we send information forward through the neural network, then we check how we did, adjust the weights associated with each connection, and try again. (Just so you know, this process of working out how to adjust the weights is called backpropagation.) We keep going until we are satisfied with the results, then we save those weights. Now, any new vector that comes in can be sent through the architecture, and out pops a probability, a prediction. (When you hear talk of "closed source" vs. "open source" models, these weights are what people are referring to. Open source models, like Meta's Llama 3, publish their architectures and final weights, meaning that if you had enough computing power, you could recreate their model without having to train it. Closed source models, like ChatGPT, don't publish this information.)
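
If you want to see that whole loop in one place, here's a minimal sketch using PyTorch (my choice of library, with an invented dataset - the point is just the rhythm of forward, check, adjust, repeat).

```python
import torch
from torch import nn

# A made-up dataset: 100 examples with 4 features, labeled 1 if the features sum positive
X = torch.randn(100, 4)
y = (X.sum(dim=1) > 0).float().unsqueeze(1)

# A small network roughly the shape of the image above
model = nn.Sequential(
    nn.Linear(4, 4), nn.Sigmoid(),
    nn.Linear(4, 3), nn.Sigmoid(),
    nn.Linear(3, 1), nn.Sigmoid(),
)
loss_fn = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for epoch in range(200):
    predictions = model(X)          # send information forward
    loss = loss_fn(predictions, y)  # check how we did
    optimizer.zero_grad()
    loss.backward()                 # backpropagation: figure out how each weight should change
    optimizer.step()                # adjust the weights, then try again

# Save the weights - this (plus the architecture) is what open source models publish
torch.save(model.state_dict(), "weights.pt")
```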

Now here's the logical question - how do we go from a really good classifier to a really good generator? It's one thing to predict 1s vs. 0s, positive vs. negative, junk vs. relevant, but what ChatGPT does - actually generating text in response to our messages - is something else entirely.

I think that'd best wait for another week. I'm moving slower than planned, but I fear it's too much anyways. I talked to my Mama the other day, who has a good mind for data, and I think I was losing her a bit already. Bite sized chunks are best.


Thanks for reading! I'm not confident anyone is actually getting anything out of this, but I am. And that's the point I guess. It's nice to remind myself of the basics and try to build back up.