Notes on Intro to Large Language Models Video - Andrej Karpathy
Here are my notes from Andrej Karpathy’s excellent Intro to Large Language Models video.
An open weights model is one that has all of the parameters released to the public.
At its core, an LLM has just two files:
- The parameters file (each parameter is stored as a float16, so 2 bytes per parameter; see the quick size check after this list)
- The code file that runs the parameters
- Not a lot of code is needed to run the model.
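A quick back-of-the-envelope check on the parameters file size, assuming the 70B-parameter model Karpathy uses as his running example and 2 bytes (float16) per parameter:

```python
# Rough size of the parameters file for a 70B-parameter model stored as float16.
# (70B is the example model size from the video; the rest is just arithmetic.)
num_parameters = 70_000_000_000  # 70 billion parameters
bytes_per_parameter = 2          # float16 = 2 bytes

size_gb = num_parameters * bytes_per_parameter / 1e9
print(f"parameters file is roughly {size_gb:.0f} GB")  # ~140 GB
```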
The magic, and the heavy computational lift, is in the parameters and in how we obtain them.
How do we train the model? That part is much more involved and complicated than running it.
Think of training as taking a chunk of the internet and compressing it via GPU clusters. The parameters can be thought of as a zip file of the internet, but it isn't exactly a zip file, because the compression is lossy.
What is the neural network actually doing? Just trying to predict the next word in a sequence.
Next word prediction is actually really powerful, because it forces the neural network to learn a lot about the world. All of this knowledge gets compressed into the network's weights (its parameters).
- The network “dreams” internet pages
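A toy sketch of what "predict the next word" and "dreaming" pages mean in practice; the probability table below is made up purely for illustration and is nothing like a real LLM:

```python
import random

# Toy illustration: the "model" is just a function that maps the words so far
# to a probability distribution over the next word. The table is hypothetical.
def toy_next_word_distribution(context):
    table = {
        ("the",): {"cat": 0.5, "dog": 0.3, "internet": 0.2},
        ("the", "cat"): {"sat": 0.6, "ran": 0.4},
        ("the", "dog"): {"barked": 0.7, "slept": 0.3},
        ("the", "internet"): {"pages": 1.0},
    }
    return table.get(tuple(context[-2:]), {"<end>": 1.0})

# "Dreaming" a page is just sampling from that distribution over and over.
words = ["the"]
while words[-1] != "<end>" and len(words) < 10:
    dist = toy_next_word_distribution(words)
    next_word = random.choices(list(dist), weights=list(dist.values()))[0]
    words.append(next_word)

print(" ".join(words))
```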
How do they work?
- The pre-training stage is when we train the model on a large chunk of internet data.
- We know the architecture and the mathematical operations.
- We know how to iteratively adjust the parameters to improve the next word prediction, but we don't know exactly how all of these parameters interact with each other.
- They build and maintain some kind of knowledge store, but it seems to be one-dimensional. E.g. ChatGPT knows who Tom Cruise's mother is, but if you ask it who her son is, it doesn't know.
- Think of them as empirical artifacts.
- The above step, where we train on internet documents, is the first step.
- The next step is training the assistant (fine-tuning): we want to ask questions and get answers.
- We swap out the internet-documents dataset and replace it with responses that we have collected from people (see the sketch after this list).
- In this stage, we prefer quality over quantity.
- Somehow, the model in the fine-tuning stage is able to tap into the knowledge it got from the pre-training stage. We don't know exactly how this happens.
- Third step: comparisons. Instead of humans writing the responses themselves, they look at several candidate responses generated by the model and choose the one they prefer.
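A sketch of the "dataset swap" in the fine-tuning stage: the objective (predict the next word) stays the same, only the documents change from raw internet text to curated question/answer conversations. The tags and examples below are hypothetical, just to show the shape of the data:

```python
# Pre-training data: raw documents scraped from the internet (illustrative snippets).
pretraining_docs = [
    "Quantum mechanics is a fundamental theory in physics that describes ...",
    "def quicksort(arr): ...",
]

# Fine-tuning data: question/answer pairs written by human labelers (made up here).
finetuning_conversations = [
    {
        "question": "Can you explain what an open weights model is?",
        "answer": "An open weights model is one whose parameters are released to the public ...",
    },
]

def format_conversation(example):
    # Render a Q/A pair as a single text document, so the same
    # next-word-prediction training setup can be reused on it.
    return f"<user>{example['question']}</user>\n<assistant>{example['answer']}</assistant>"

finetuning_docs = [format_conversation(ex) for ex in finetuning_conversations]
print(finetuning_docs[0])
```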
LLM Scaling Laws
- Performance of LLMs is a smooth, well-defined function of (a sketch of the functional form follows this list):
- N - the number of parameters in the network
- D - the amount of text that the model is trained on
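The video shows this as a plot; one published way to write the functional form down (not from the video itself, but from the Chinchilla scaling-law paper, Hoffmann et al. 2022) is a power law in N and D, with constants that have to be fit empirically:

```python
def predicted_loss(N, D, E, A, B, alpha, beta):
    """Predicted next-word-prediction loss for a model with N parameters trained
    on D tokens; E, A, B, alpha, beta are empirically fitted constants."""
    return E + A / N**alpha + B / D**beta
```

Bigger N and bigger D both push the loss down smoothly toward the irreducible term E, which is why simply scaling up keeps working.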
Next iterations of these models:
- multimodality
- speech to text
- system 2 thinking