Built a GPT-2 Base Model from Scratch (a follow-along of Andrej Karpathy's video)
I recently implemented the GPT-2 124M-parameter model in PyTorch. I used this project to learn more about the training process of an LLM by implementing GPT-2 and training it from scratch, which was part of Andrej Karpathy sir's Zero to Hero series on YouTube. The video I followed was titled "Let's reproduce GPT-2 (124M)".
For anyone curious about my learning journey, I have added incremental Git commits, one for what I learned in each iteration, to this GitHub repository (Github_repo). For those interested, I have also uploaded the trained model weights to this Hugging Face repository (Model).
If you feel this is too long and would like me to walk you through it, please let me know in the comments or send me a DM. We can plan a call, a YouTube live, or something similar.
Model settings, training process and other info
I trained it for 1 epoch, which is the 10B tokens of the FineWeb-Edu dataset's sample-10BT split, i.e. 19,073 steps. I followed the exact hyperparameters used in Karpathy's nanoGPT, which in turn come from the GPT-3 paper. I used 3 A100s with 40GB of VRAM each and trained for almost 6 hours to complete the epoch. Validation loss went from 10.9 down to 4.2 over that one epoch. I also ran the HellaSwag eval, on which this model performs terribly: roughly 25%, which is about chance level for a 4-way multiple-choice benchmark. I rented the GPUs from Jarvis Labs (JarvisLabs.ai), which cost me around 2700 (around 4500 if I include the time I spent learning distributed model training on multiple GPUs :/). I also saved the model's state_dict every 50 steps, which I now feel is too much data (179GB). I did this to visualize the transformer's learning process, especially what happens in the attention layers. It will take me some time to visualize those and understand them in depth.
Summary of the flow of things I did, from the beginning
- For anyone who has watched Karpathy's video, this part will feel like a rewind.
- Started off by checking out Hugging Face's GPT-2 model, loading it with the pretrained weights released by OpenAI, generating some text, and playing with the model a bit.
- The initial idea was to replicate the Hugging Face implementation exactly and train it, so the weights could be loaded back into the HF model and compared. (This was not achieved in the end, though.)
- Wrote a GPT module using PyTorch's nn.Module, mirroring the HF implementation of GPT-2: it takes a config containing the hyperparameters, and the model parameters are given the same names as in HF (a minimal sketch is below).
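Here is a minimal sketch of what that module looks like, with the GPT-2 124M hyperparameter defaults and parameter names chosen to mirror the HF / nanoGPT layout (transformer.wte, transformer.h, lm_head, ...). This is a rough reconstruction for illustration, not the exact code from my repo; the attention here is still the naive masked version (Flash Attention comes later):

```python
from dataclasses import dataclass
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

@dataclass
class GPTConfig:
    block_size: int = 1024    # max sequence length
    vocab_size: int = 50257   # GPT-2 BPE vocabulary
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)  # q, k, v in one matmul
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.n_head, self.n_embd = config.n_head, config.n_embd
        mask = torch.tril(torch.ones(config.block_size, config.block_size))
        self.register_buffer("bias", mask.view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))      # (B, nh, T, T)
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
        y = F.softmax(att, dim=-1) @ v
        return self.c_proj(y.transpose(1, 2).contiguous().view(B, T, C))

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU(approximate='tanh')
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1, self.attn = nn.LayerNorm(config.n_embd), CausalSelfAttention(config)
        self.ln_2, self.mlp = nn.LayerNorm(config.n_embd), MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))   # pre-norm residual connections
        return x + self.mlp(self.ln_2(x))

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.transformer = nn.ModuleDict(dict(
            wte=nn.Embedding(config.vocab_size, config.n_embd),   # token embeddings
            wpe=nn.Embedding(config.block_size, config.n_embd),   # position embeddings
            h=nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f=nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

    def forward(self, idx, targets=None):
        B, T = idx.size()
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
        x = self.transformer.wte(idx) + self.transformer.wpe(pos)
        for block in self.transformer.h:
            x = block(x)
        logits = self.lm_head(self.transformer.ln_f(x))
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss
```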
- Implemented a simple 50-step training loop on the Tiny Shakespeare dataset, and added some code to measure how long each step takes and how many tokens are processed per step (sketch below).
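A rough sketch of that loop, assuming a simple `train_loader` helper with a `next_batch()` method (like nanoGPT's little data loader; mine may differ) and the `(logits, loss)`-returning model from the sketch above:

```python
import time
import torch

device = 'cuda'
model = GPT(GPTConfig()).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)   # starting lr, before the GPT-3 hyperparameters come in

model.train()
for step in range(50):
    t0 = time.time()
    x, y = train_loader.next_batch()        # (B, T) input tokens and shifted targets
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad(set_to_none=True)
    logits, loss = model(x, y)
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()                # wait for the GPU so the timing is honest
    dt = time.time() - t0
    tokens_per_sec = x.numel() / dt
    print(f"step {step}: loss {loss.item():.4f} | {dt*1000:.0f} ms | {tokens_per_sec:.0f} tok/s")
```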
- Made some small changes to the naive model: tied the weights of the token embeddings and the lm_head (the final layer of the transformer), and added a scaling factor to the residual pathways at initialization (sketch below).
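Roughly, those two changes look like this; the `NANOGPT_SCALE_INIT` name is the marker attribute nanoGPT sets on the residual-projection layers (attn.c_proj, mlp.c_proj), so treat the exact naming as an assumption:

```python
class GPT(nn.Module):                        # additions to the class sketched earlier
    def __init__(self, config):
        ...                                  # everything as before, then:
        # weight tying: wte and lm_head share a single (vocab_size, n_embd) matrix
        self.transformer.wte.weight = self.lm_head.weight
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            std = 0.02
            # residual-path scaling: projections that feed the residual stream get their
            # init std shrunk by 1/sqrt(2 * n_layer) so the stream's variance doesn't grow with depth
            if hasattr(module, 'NANOGPT_SCALE_INIT'):
                std *= (2 * self.config.n_layer) ** -0.5
            torch.nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
```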
- Then started the GPU gymnastics to improve performance:
- Enabled TF32 for the matmuls, which by default run in full float32. TF32 is basically a float with the mantissa cut down: we lose some precision, but computation gets a lot faster since there are fewer bits to process. We can afford this because we are not doing the kind of scientific computation where every last bit of the mantissa matters. (Think of the mantissa, roughly, as the part after the . in a float, e.g. .5234 in 12.5234.) See the one-liner below.
- To give some perspective, this improved the time per step from 1100ms -----> 400ms
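Concretely, this is a single line at the top of the training script:

```python
import torch

# let float32 matmuls run in TF32 on Ampere+ GPUs: the tensors stay float32,
# only the matmul internals drop mantissa bits in exchange for speed
torch.set_float32_matmul_precision('high')
```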
- Moved from TF32 to bfloat16 for the forward pass. Same idea, but with even more of the mantissa cut down: results are slightly less precise but close, in exchange for a bit more compute speed (sketch below).
- After this, 1 step jumped from 400ms -----> 340ms
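In code, only the forward pass changes: it gets wrapped in an autocast context. Since bfloat16 keeps float32's exponent range, no gradient scaler is needed:

```python
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    logits, loss = model(x, y)   # activations / matmuls run in bfloat16
loss.backward()                  # backward stays outside the autocast context
optimizer.step()
```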
- Added torch.compile. Up to this point the Python interpreter runs the model line by line, which causes many round trips between GPU memory and compute that could be avoided, since we already know the sequence of operations that will be performed. torch.compile(model) does exactly that: it compiles the whole model to run efficiently on the GPU with minimal round trips (see below).
- 1 step jumped from 340ms ---> 150ms
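Which is essentially a one-liner, paid for by an upfront compilation cost on the first step:

```python
model = GPT(GPTConfig()).to(device)
model = torch.compile(model)   # fuses kernels and cuts Python / memory round trips
```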
- Replaced the masked attention we had written by hand with Flash Attention. Flash Attention is an algorithmically improved implementation of the attention mechanism, designed with parallel computing and GPU memory access patterns in mind (sketch below).
- 1 step jumped from 150ms ---> 107ms
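In the attention forward, this means dropping the manual mask + softmax + matmul chain and calling PyTorch's fused kernel instead (q, k, v shaped (B, n_head, T, head_dim) as in the sketch above):

```python
import torch.nn.functional as F

# before: att = softmax(mask(q @ k.T / sqrt(d))) @ v, which materializes the (T, T) matrix
# after: one fused call that never materializes the full attention matrix
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```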
- Did some number gymnastics by rounding the numbers we can up to "nice" values (powers of 2 or multiples of them), since those process faster on GPUs. This is why our model now differs from the Hugging Face one: we padded the vocab_size from 50257 up to 50304, a number divisible by 128 (see below).
- 1 step jumped from 107ms ---> 100ms
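The vocabulary padding, for example, is just:

```python
# pad the vocab up to a "nice" number divisible by 128 so the embedding / lm_head
# matmuls hit efficient kernel tile sizes; the tokenizer never emits the extra ids,
# so the model simply learns to push their probabilities toward zero
config = GPTConfig(vocab_size=50304)   # 50257 rounded up; 50304 = 128 * 393
```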
- After the number gymnastics, we set some hyperparameters according to the GPT-3 paper, such as the betas and eps of the AdamW optimizer. We also introduced gradient clipping (clipping the global gradient norm to 1.0), which protects training from extreme outliers in the data.
- Added cosine learning-rate decay (with a linear warmup) to match the GPT-3 setup, and switched to the fused AdamW optimizer, a GPU-friendly way of running AdamW (sketch below).
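A sketch of the schedule and the clipping; the concrete numbers (max_lr = 6e-4, 715 warmup steps, 19073 total steps) are the ones used in the nanoGPT reproduction, so treat them as assumptions if you change the batch size or dataset:

```python
import math
import torch

max_lr = 6e-4
min_lr = max_lr * 0.1      # decay down to 10% of max, as in GPT-3
warmup_steps = 715
max_steps = 19073

def get_lr(step):
    # linear warmup, then cosine decay down to min_lr
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step > max_steps:
        return min_lr
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)

# inside the training loop, after loss.backward():
norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clipping
lr = get_lr(step)
for group in optimizer.param_groups:
    group['lr'] = lr
optimizer.step()
```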
- Also added L2 regularization in the form of weight decay (see below). And changed the dataset from Tiny Shakespeare to FineWeb-Edu sample-10BT, which has roughly 10B tokens. Long story short: we downloaded it, divided it into 100 shards, and the training process loads those shards in turn.
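The weight decay is applied selectively, roughly like this: only the 2D matmul/embedding weights are decayed, not biases or LayerNorm parameters, with 0.1 as the GPT-3 value. The `fused=True` flag is the GPU-friendly AdamW mentioned above:

```python
decay_params   = [p for p in model.parameters() if p.requires_grad and p.dim() >= 2]
nodecay_params = [p for p in model.parameters() if p.requires_grad and p.dim() < 2]
optim_groups = [
    {'params': decay_params,   'weight_decay': 0.1},   # matmul weights, embeddings
    {'params': nodecay_params, 'weight_decay': 0.0},   # biases, LayerNorm params
]
optimizer = torch.optim.AdamW(optim_groups, lr=6e-4,
                              betas=(0.9, 0.95), eps=1e-8, fused=True)
```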
- GPT-3 used roughly 0.5M tokens per training step (which we round to the nice number 524288 = 2**19). We cannot process that on these small GPUs all at once, so we implemented gradient accumulation instead. Long story short: rather than pushing all 0.5M tokens through in parallel, we run a micro-batch that fits in memory (say 32 sequences) and loop enough times that all 0.5M tokens get processed. The gradients accumulate across the loop; we normalize them and only then take the optimizer step on the accumulated gradient (sketch below).
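One optimizer step with accumulation looks roughly like this (the micro-batch size is illustrative; `train_loader` is the same assumed helper as before):

```python
total_batch_size = 524288                         # 2**19 tokens per optimizer step
B, T = 32, 1024                                   # micro-batch that actually fits on the GPU
grad_accum_steps = total_batch_size // (B * T)    # 16 micro-steps per optimizer step

optimizer.zero_grad(set_to_none=True)
loss_accum = 0.0
for micro_step in range(grad_accum_steps):
    x, y = train_loader.next_batch()
    x, y = x.to(device), y.to(device)
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        logits, loss = model(x, y)
    loss = loss / grad_accum_steps                # scale so the summed gradient is a mean
    loss_accum += loss.detach()
    loss.backward()                               # gradients keep accumulating across micro-steps
norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
print(f"loss {loss_accum.item():.4f} | grad norm {norm:.2f}")
```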
- Now comes the parallelization. Everything so far ran on a single GPU, but say you want to train on multiple GPUs in parallel: that is what PyTorch's DDP (DistributedDataParallel) gives you.
- But where exactly do multiple GPUs come into the picture? That is a fair question to have.
- Basically, we are doing gradient accumulation anyway. What if we accumulate gradients on multiple GPUs at the same time and optimize on the combined gradients from all of them?
- That is essentially what we do: each GPU runs its own forward and backward passes inside the grad-accum loop, and just before we exit the loop we synchronize the gradients across all GPUs and take the optimizer step.
- DDP lets us do all of this. The only change on our side is that we launch with torchrun instead of plain python (sketch below).
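Roughly, the DDP additions look like this; the script is launched with something like `torchrun --standalone --nproc_per_node=3 train_gpt2.py` (script name assumed), and the forward/backward inside the loop is the same as in the accumulation sketch:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend='nccl')            # torchrun sets the env vars read below
ddp_rank = int(os.environ['RANK'])
ddp_local_rank = int(os.environ['LOCAL_RANK'])
ddp_world_size = int(os.environ['WORLD_SIZE'])
device = f'cuda:{ddp_local_rank}'
torch.cuda.set_device(device)

model = GPT(GPTConfig(vocab_size=50304)).to(device)
model = DDP(model, device_ids=[ddp_local_rank])

# each GPU now chews through its own share of the 0.5M-token batch
grad_accum_steps = total_batch_size // (B * T * ddp_world_size)

for micro_step in range(grad_accum_steps):
    # only all-reduce the gradients on the last micro-step; until then every GPU
    # just accumulates locally, exactly like the single-GPU accumulation loop
    model.require_backward_grad_sync = (micro_step == grad_accum_steps - 1)
    # ... forward / backward / loss scaling as in the accumulation sketch above ...
```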
- Used the HellaSwag eval to evaluate the language-modelling capability of the model (a sketch of the scoring is below).
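HellaSwag is 4-way multiple choice: one context and four candidate endings. The model "answers" by scoring each ending with its own language-modelling loss and picking the one it finds most likely. A sketch of that scoring, assuming the token tensor and the mask marking the ending region are already prepared, and the model returns (logits, loss) as above:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def most_likely_ending(model, tokens, mask):
    # tokens: (4, T) context+ending for each candidate; mask: (4, T), 1 on the ending tokens
    logits, _ = model(tokens)                              # (4, T, vocab)
    shift_logits = logits[:, :-1, :]                       # position t predicts token t+1
    shift_tokens = tokens[:, 1:]
    shift_mask = mask[:, 1:]
    losses = F.cross_entropy(shift_logits.transpose(1, 2), shift_tokens, reduction='none')
    losses = losses * shift_mask                           # only count the ending region
    avg_loss = losses.sum(dim=1) / shift_mask.sum(dim=1)
    return avg_loss.argmin().item()                        # index of the predicted ending
```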
- After checking that everything was in place, I wanted to start the real pre-training of the model. The plan was to use 8 H100 GPUs that were available on Jarvis Labs, but in the end I trained on 3 A100s (all the other A100s were busy :/).
- I could not train on the H100s even though they were free, because for some reason Jarvis Labs has no feature to mount additional persistent storage to H100 instances (I assume they run the H100s separately from the cluster, since they provide them as a VM service). And I wanted to keep the state_dicts (roughly 180GB) on persistent storage, and also to download and process the FineWeb dataset once and store the shards (about 20GB) persistently.
How I used the GPUs:
- I checked out instances from Jarvis Labs, which let you SSH into the machine.
- I SSHed into a machine, cloned the repo there, played with the model and learned new things, then logged off and deleted the instance I had created.
- I also checked out machines with different GPUs to compare their performance.
- The most powerful machine Jarvis Labs has is the H100. For perspective: the best time per step on an A100 was about 100ms, and the H100 completed the same step with the same settings in about 50ms, so it is almost 2x faster. If I had trained on 3 H100s instead of 3 A100s, an epoch would have taken about 3 hours instead of 6. Of course, the H100's price is higher too; if you don't mind the longer training time, I feel the overall cost works out to be almost the same for an H100 and an A100, with the extra benefit that you can attach persistent storage to the A100s. I also tested some other GPUs like the A6000: with the settings I finally trained with, a step took about 3s per GPU on an A100 and about 7s per GPU on an A6000.
- To summarize my workflow: I checked out a not-so-powerful GPU instance (but with many CPUs) to download the dataset and save the shards to a persistent storage mount attached to it. I attached the same persistent storage while testing multiple GPUs, and finally while training on the GPUs I settled on (3 x A100). I saved the state_dict every 50 steps to that persistent storage, and once training was over I stored the trained model weights in a different directory on the same storage. In case anyone wants to play around with the model and the weights, here you go: Weights.
My Thoughts, Resolutions and Plans maybe:
- I am not super impressed by the model I got, but I also won't blame it, since it was trained for just one epoch. This is mostly a learning project for me rather than a production-grade product. Even though I wanted to run it for more epochs (I truly want to), the cost felt like too much for me. I will take this experience, learn more from the saved state_dicts, and if I feel I have to train more or want to play around with it again, I will burn my cash again.
- Having said that, I would not recommend doing this if your only goal is to have a language model, because there are much better open-source models with better-trained weights that you can download, run, and build a RAG architecture on top of. But if you are crazy like me and want to experiment, visualize the process, and mess with the architecture, go for it, it is super fun.
- I am also wondering what other use cases this could be trained for. I mean, it would be cool to see it as a forecasting model or something else. I have some ideas I want to test with transformers and transformer-based LLM architectures.
- I have also been introduced to the beautiful world of Hugging Face, and I can already feel that I am going to use it extensively. It would be super cool if I got to work in this direction in my full-time job.