
Built a GPT-2 Base Model from Scratch (a follow-along of an Andrej Karpathy video)

I recently implemented the GPT-2 124M-parameter model in PyTorch. I used this project to learn more about the training process of an LLM by implementing GPT-2 and training it from scratch, following Andrej Karpathy's Zero to Hero series on YouTube. The video I followed is titled "Let's reproduce GPT-2 (124M)".
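For context, "124M" refers to the standard GPT-2 small configuration. Here is a rough sketch of that config (field names follow nanoGPT conventions, not necessarily my exact code) together with a back-of-the-envelope parameter count:

```python
from dataclasses import dataclass

# Standard GPT-2 (124M) configuration, nanoGPT-style field names.
@dataclass
class GPTConfig:
    block_size: int = 1024   # maximum context length
    vocab_size: int = 50257  # GPT-2 BPE vocabulary
    n_layer: int = 12        # transformer blocks
    n_head: int = 12         # attention heads per block
    n_embd: int = 768        # embedding / hidden dimension

cfg = GPTConfig()

# Rough parameter count (assuming weight tying between the token
# embedding and the LM head, as in GPT-2; biases/LayerNorms ignored):
embed = cfg.vocab_size * cfg.n_embd + cfg.block_size * cfg.n_embd
per_block = 12 * cfg.n_embd ** 2  # attention (4*d^2) + MLP (8*d^2)
total = embed + cfg.n_layer * per_block
print(f"~{total / 1e6:.0f}M parameters")  # ≈ 124M
```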

For anyone curious about my learning journey, I have added incremental Git commits reflecting what I learned in each iteration to this GitHub repository (Github_repo). For those interested, I have also uploaded the trained model weights to this Hugging Face repository (Model).
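If you want to pull the weights down and poke around, something like the following should work, assuming the uploaded checkpoint is a plain PyTorch state_dict; the repo id and filename below are placeholders, not the real ones:

```python
import torch
from huggingface_hub import hf_hub_download

# NOTE: repo id and filename are placeholders -- substitute the actual
# values from the linked Hugging Face repository.
ckpt_path = hf_hub_download(
    repo_id="<username>/<gpt2-124m-from-scratch>",  # hypothetical repo id
    filename="model.pt",                            # hypothetical filename
)

# Load the state_dict into a model object with matching module names
# (e.g. a nanoGPT-style GPT class).
state_dict = torch.load(ckpt_path, map_location="cpu")
# model = GPT(GPTConfig())        # your model class
# model.load_state_dict(state_dict)
```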

If you feel this is too long and would like me to walk you through it, please let me know in the comments or send me a DM. We can plan a call, a YouTube live, or something similar.

Model settings, training process and other info

I trained it for 1 epoch, i.e. 10B tokens (the sample-10BT subset of the FineWeb-Edu dataset), which comes out to 19,073 optimizer steps. I followed the exact hyperparameters from Karpathy's nanoGPT, which in turn takes them from the GPT-3 paper. I used 3 A100s with 40 GB of VRAM each and trained for almost 6 hours to complete the epoch, bringing the validation loss down from 10.9 to 4.2. I also ran the HellaSwag eval, on which this model performs terribly, at roughly 25% accuracy (about random chance for a 4-way task). I rented the GPUs from Jarvis Labs (JarvisLabs.ai), which cost me around 2,700 (roughly 4,500 if I include the time I spent learning distributed model training with multiple GPUs :/).

I also saved the model's state_dict every 50 steps, which at 179 GB is probably too much data. I did this to visualize the transformer's learning process, especially what the attention layers learn. It will take me some time to visualize those and understand them in depth.
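For reference, the learning-rate schedule those hyperparameters imply is linear warmup followed by cosine decay. Below is a minimal sketch of it, using the nanoGPT / GPT-3 values as I remember them from the video (max LR 6e-4, 715 warmup steps, decay to 10% of the max over the 19,073 steps, with 524,288 = 2**19 tokens per optimizer step), so double-check against the repo before reusing:

```python
import math

# nanoGPT / GPT-3 style schedule; the exact numbers here are my best
# recollection from the video, so treat them as approximate.
max_lr = 6e-4
min_lr = 0.1 * max_lr
warmup_steps = 715
max_steps = 19073  # ~10B tokens / 524,288 (2**19) tokens per step

def get_lr(step: int) -> float:
    # 1) linear warmup for the first warmup_steps
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # 2) after max_steps, hold at the floor
    if step > max_steps:
        return min_lr
    # 3) cosine decay from max_lr down to min_lr in between
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)

# The checkpointing that produced the 179 GB of saved weights is just a
# periodic torch.save of the (unwrapped) model's state_dict, e.g.:
#   if step % 50 == 0:
#       torch.save(raw_model.state_dict(), f"ckpt/step_{step:06d}.pt")
```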

Summary of the flow of things I did from the beginning

How did I use the GPUs?
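In short: the training script follows the standard PyTorch DistributedDataParallel pattern and is launched with torchrun across the 3 GPUs. Here is a minimal sketch of that setup, assuming a nanoGPT-style train.py (not necessarily my exact script):

```python
# Launch on a single node with 3 GPUs, e.g.:
#   torchrun --standalone --nproc_per_node=3 train.py

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK / LOCAL_RANK / WORLD_SIZE in the environment.
ddp = int(os.environ.get("RANK", -1)) != -1
if ddp:
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    device = f"cuda:{local_rank}"
    torch.cuda.set_device(device)
else:
    device = "cuda" if torch.cuda.is_available() else "cpu"

# model = GPT(GPTConfig()).to(device)        # hypothetical model class
# if ddp:
#     model = DDP(model, device_ids=[local_rank])
# ... training loop: each rank reads a different shard of the data, and
# DDP all-reduces the gradients across the 3 GPUs on every step.
```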

My Thoughts, Resolutions, and Plans (maybe):

  1. I am not super impressed by the model I got, but I also won't blame it, since it was trained for just one epoch. This is mostly a learning project for me rather than a production-grade product. Even though I wanted to run it for more epochs (I truly want to), the cost was too much for me. I will take this experience, learn more from the saved state_dicts, and if I feel like training more or playing around with it again, I will burn my cash again.
  2. Having said that, I would not recommend doing this if your only goal is to have a working language model: there are much better open-source models with better-trained weights that you can download, run, and build a RAG architecture on top of. But if you are a crazy person like me who wants to experiment, visualize the process, and mess with the architecture, go for it; it is super fun.
  3. I am also wondering what other use cases this could be trained on. It would be cool to see it as a forecasting model or something similar. I have some ideas that I want to test with transformers and transformer-based LLM architectures.
  4. I have also been introduced to the beautiful world of Hugging Face, and I can already tell I am going to use it extensively. It would be super cool if I got to work in this direction in my full-time job.