Turning GPT-2 into a Conversational AI, with NanoGPT and NanoChatGPT

A couple of days ago, I built NanoChatGPT, a model fine-tuned from GPT-2-medium. When most people see GPT-2 they think of autocomplete, maybe a couple of sentences, but with fine-tuning and good data, GPT-2 can write dialogue, tell stories, and chat, all in 354 million parameters.

The number 354 million is really important. The ChatGPT we all know runs on a model with 175 billion parameters. We are attempting to chat with roughly 0.2% of the parameters of GPT-3.5. That's why this model is geared towards short responses, and it isn't as good at giving contextual responses or staying on topic. But it is great at speaking in English and producing natural sentences.

What does it sound like?

Here's a random chat excerpt from a conversation I had with the model:


            ...
            User: Talking about pythons, whats your favorite animal?
            Bot: Not knowing, but it is my favorite animal. I had been thinking about it for a long time, and I guess I
            liked the name.
            User: Which animal though, whats your favorite?
            Bot: I like it when they play but I love birds too.
            User: Birds huh? My favorite is the Peacock
            Bot: No other bird is my favorite though.
            User: What about the flamingo or the parrot
            Bot: Mm, they are both, but the parrot is only in my group.
            User: Do you like macaws or someother parrots
            Bot: Mm, I like them both.
            User: Really? What about Mammals like tigers
            Bot: Mammals like tigers. "I will have this dog today. It's actually quite good. "
            User: What do you like, Lions or tigers?
            Bot: I like when I hear a song, I like that it's interesting. Thanks for the reply. > I have a taste for
            Black Forest Wildcat
            User: Wildcats? What about PUMA
            Bot: So again, what is your opinion?
            User: Jaguar, the king of the amazon
            Bot: Mm, it is a fantastic looking animal, it's not as big as a jaguar, it's not as hard to spot as a
            jaguar, it's a feline like animal.
            User: What animal are you talking about?
            Bot: Mm. My friend in the car, who I wasn't seeing, just said he was a black bear, but he's not a bear. is
            pretty hard to spot (A bear was mentioned but I couldn
            User: He saw a black bear, I saw a Grizzly
            Bot: I haven't seen him yet, I'll call him tomorrow if I do.
            User: I grizzly bear????
            Bot: I wouldn't say that, I just thought that was interesting and was curious about it.
            User: Have you ever been to yellowstone national park??
            ...
        

As you can see, it's definitely nothing like a regular chatbot :) This is a small excerpt of a conversation with the bot, where we smoothly transition from Python the programming language to python the animal, and on to other animals. It gives competent responses and stays on topic for around 10 user-bot turns, so transitioning the topic every 10 turns or so leads to a better chat experience.

Something funny that used to happen with the AI is that it would decide "it's not worth it anymore" and just send blank gaps as messages. I have no idea where this came from, but when I sent it motivational messages for fun, it started talking again. After I added more data to the dataset, it mostly stopped the blank spacing, though it still occasionally pads messages with too many spaces. But now, let's get into the code.

Setup

This is all based off the Google Colab notebook, with slight changes to make sure we remain commercially friendly. Make sure you have a GPU.

To begin working with NanoChatGPT, clone the GitHub repo.


            !git clone https://github.com/VatsaDev/nanoChatGPT.git
            %cd /content/nanoChatGPT
        

The next step is optional, but if you use Google Colab as a cloud AI environment, then mounting Google Drive gives you a great place to store your ckpt.pt file.


            from google.colab import drive
            drive.mount('/content/drive')
        

Install all dependencies


            !pip install torch numpy transformers datasets tiktoken wandb tqdm # needed to run the model
        

Make slight edits to prepare.py. We are only using one input file from the dataset NanoChatGPT was trained on, which keeps this commercially friendly and fast to finetune. This input file comes from the Ubuntu Dialogue Corpus, a good-quality multi-turn conversational dataset.

How does this code work? Well, looking inside prepare.py:


            import os
            import requests
            import tiktoken
            import numpy as np

            train_ids = []
            val_ids = []
            enc = tiktoken.get_encoding("gpt2")

            def download_file(url):
                # download the dataset and write it to a local file called dataset.txt
                response = requests.get(url)
                if response.status_code == 200:
                    with open('dataset.txt', 'wb') as f:
                        f.write(response.content)
                    print("downloaded dataset, tokenizing")
                else:
                    print('Error downloading file:', response.status_code)

            download_file('https://raw.githubusercontent.com/VatsaDev/nanoChatGPT/main/data/Chat/input15.txt')

            def split_file(filename, output_dir, chunk_size):
                # split the downloaded file into chunks of chunk_size lines each
                if not os.path.exists(output_dir):
                    os.mkdir(output_dir)

                with open(filename, 'r') as f:
                    lines = f.readlines()

                n_chunks = len(lines) // chunk_size
                for i in range(n_chunks):
                    start = i * chunk_size
                    end = min((i + 1) * chunk_size, len(lines))

                    chunk_lines = lines[start:end]

                    output_filename = os.path.join(output_dir, f'{i}-dataset.txt')
                    with open(output_filename, 'w') as f:
                        f.writelines(chunk_lines)

            split_file('dataset.txt', 'output', 10000)

            def is_numbers(string):
                # True if the filename starts with a digit, i.e. it is one of our chunk files
                first_char = string[:1]
                try:
                    int(first_char)
                    return True
                except ValueError:
                    return False

            for filename in os.listdir('output'):
                if filename.endswith('.txt') and is_numbers(filename):
                    with open(f'output/{filename}', 'r') as f:
                        data = f.read()
                    if int(filename[:1]) <= 7:
                        # chunks 0-7 become training data
                        train_ids = train_ids + enc.encode_ordinary(data)
                    else:
                        # chunks 8 and 9 become validation data
                        val_ids = val_ids + enc.encode_ordinary(data)

            print(f"train has {len(train_ids):,} tokens")
            print(f"val has {len(val_ids):,} tokens")
            train_ids = np.array(train_ids, dtype=np.uint16)
            val_ids = np.array(val_ids, dtype=np.uint16)
            train_ids.tofile(os.path.join(os.path.dirname(__file__), 'train.bin'))
            val_ids.tofile(os.path.join(os.path.dirname(__file__), 'val.bin'))
        

Breaking this code down part by part, the imports are os, requests, tiktoken, and numpy. os is used to manipulate files and directories, requests downloads the dataset over HTTP, and tiktoken is the tokenizer, a way to turn the text we give GPT-2 into numbers it can work with.

Then we define the train_ids and val_ids lists, which store the tokenized content, along with the enc variable, which holds the tiktoken GPT-2 encoding.
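
As a quick, standalone illustration of what the tokenizer does (this snippet is just for intuition and is not part of prepare.py):


            import tiktoken

            enc = tiktoken.get_encoding("gpt2")   # the same BPE tokenizer GPT-2 was trained with

            tokens = enc.encode_ordinary("User: Hello, how are you?")
            print(tokens)              # a list of integer token ids
            print(len(tokens))         # a handful of tokens for this short sentence
            print(enc.decode(tokens))  # decodes back to the original string
        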

After that, we have the download_file function, which gets the dataset from our desired location. The function itself is rather simple: we request an online file, and if the request succeeds, write its contents to a local file called dataset.txt.

The next function, split_file, is rather useful, and practically necessary at large dataset sizes. It takes the file we downloaded and splits it into a bunch of smaller files in an output directory, chunked by line count. For example, here the dataset is 100,000 lines of text and chunk_size is 10,000, so the total number of output files is just 100,000 / 10,000 = 10 chunk files.

In this implementation, we loop through all the .txt files in the output directory, check that they are chunks with the is_numbers function, then send them to train if their filename begins with 7 or below, and to val if it begins with a number higher than 7. In our case, with 10 chunk files, 0–7 go to train, and 8 and 9 go to val: an 80/20 split between train and val, on parts of the same dataset.
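
To make the split concrete, here is a tiny sketch of the same filename-based routing, using a hypothetical list of chunk names instead of a real output directory (note that because only the first character is checked, this scheme only splits cleanly for up to 10 chunks):


            # hypothetical chunk filenames, as produced by split_file('dataset.txt', 'output', 10000)
            chunks = [f'{i}-dataset.txt' for i in range(10)]

            train_files = [c for c in chunks if int(c[:1]) <= 7]  # 0-dataset.txt ... 7-dataset.txt
            val_files = [c for c in chunks if int(c[:1]) > 7]     # 8-dataset.txt, 9-dataset.txt

            print(len(train_files), len(val_files))  # 8 2 -> the 80/20 split
        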

Another thing I would like to mention is that the way this was implemented matters a lot if you scale up dataset size. In the original NanoGPT repo, the finetuning setup downloads a dataset file off GitHub, splits it into 90% train and 10% val strings, loads those into memory, and tokenizes them. As the dataset grows, loading that much data into system RAM can crash the machine before you even get to tokenizing it. Chunking the data allows for a much smoother processing experience and can handle much more data, and tokenizing in chunks turned out to be faster too: when I switched to this process, my processing time went from around ~3 min 30 sec to ~2 min.

The last piece of code simply prints out your train and val token counts, then uses np.array with dtype=np.uint16 to store the token ids as unsigned 16-bit integers. GPT-2's vocabulary (50,257 tokens) fits comfortably in 16 bits, so this halves the memory and disk footprint compared to 32-bit integers, and the resulting train.bin and val.bin files are what train.py reads.
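
A quick sketch of why uint16 is enough here and what it saves (the byte counts in the comment follow directly from the dtypes):


            import numpy as np

            token_ids = [50256, 100, 2000, 30000]       # sample token ids, all below the uint16 max of 65,535
            as_uint16 = np.array(token_ids, dtype=np.uint16)
            as_int32 = np.array(token_ids, dtype=np.int32)

            print(as_uint16.nbytes, as_int32.nbytes)    # 8 vs 16 bytes: uint16 halves the storage
            as_uint16.tofile('sample.bin')              # the same tofile call prepare.py uses
        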

Looking into the finetune now:

            !python train.py config/finetune-gpt2.py
        

The config, finetune-gpt2.py, contains all the hyperparameters for the model. There are many hyperparameters, but the most important ones are eval_interval, max_iters, init_from, dataset, batch_size, and learning_rate.

eval_interval determines how many iterations pass between checks of the validation loss, after which the checkpoint is saved to the ckpt.pt file that represents our model. We also have a hyperparameter known as always_save_checkpoint, which is kept false; since it's false, the model only saves a checkpoint if the validation loss improves, which means the saved model only gets better over time.

In our case the eval_interval is set to 5. Setting it lower means more evaluations, and possibly a lower saved val loss, while training will be slower. If we set eval_interval higher, the model might train faster, but the saved val loss might be higher than it could have been. Changing hyperparameters is a tradeoff, so see what works best for you.
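
To see how always_save_checkpoint = False behaves, here is a toy, runnable demonstration of the save rule (the loss values are made up; the real logic lives in the training loop shown later):


            # toy demonstration of the "only save when val loss improves" rule
            always_save_checkpoint = False
            best_val_loss = 1e9

            # pretend these are the val losses measured every eval_interval iterations
            val_losses = [3.2, 2.9, 3.0, 2.7, 2.75]

            for step, val_loss in enumerate(val_losses):
                if val_loss < best_val_loss or always_save_checkpoint:
                    best_val_loss = val_loss
                    print(f"eval {step}: val loss {val_loss} improved, saving ckpt.pt")
                else:
                    print(f"eval {step}: val loss {val_loss} did not improve, keeping the old checkpoint")
        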

max_iters is the number of iterations the model trains for. More iterations are generally better, as the model has more time to train and learn, but they also mean a longer training time. For our model, we have 50 iters, and it takes ~30 mins to train. Too much training time might also not be useful, because over too many iters training levels off and the val loss might only change by inconsequential amounts. I have yet to train the model for more than 100 iters, but at 100 iters the val loss and model outputs were quite similar to 50 iters, so I left it at 50.

init_from is where the model weights come from. Here it's gpt2-medium, but you could use other options, like the other models of the GPT-2 series, or a model you already trained yourself on this project; you could even finetune a model you've already finetuned this way.

dataset is the location of your train.bin and val.bin files, which here is chat, because that's the directory they were saved in.

batch_size is very important, as it affects both your PC not crashing and your model's training time. Each item in a batch is one block of tokens, the most the model can take in at once; for GPT-2 this block_size is 1024 tokens, or roughly 4096 characters, and batch_size is how many of these blocks are processed per step. A higher batch size can make training a lot faster, as the model moves through the dataset in bigger steps, but it also makes memory usage jump, because you're loading and processing more at once. I chose a batch size of 4 because that was my GPU's limit.

learning_rate is the rate at which the model learns information. The larger the dataset, the lower the learning rate, to keep the model from overfitting, i.e. learning the wrong patterns in the training data. For an extremely large LLM like ChatGPT, the learning rate is around 1e-4, but for the full NanoChatGPT model the learning rate is 2e-5, because we have a much smaller dataset. For the purposes of this tutorial, with a dataset of only a couple MB, you could set the learning rate quite high, like 3e-4, and still have no overfitting issues.
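
Putting the values discussed above together, a finetune config might look roughly like the sketch below. This is only an illustration assembled from the numbers in this post, not the exact contents of config/finetune-gpt2.py, so check the file in the repo before training:


            # sketch of a finetune config; the real config/finetune-gpt2.py may differ
            out_dir = 'out'                 # where ckpt.pt gets written
            eval_interval = 5               # check val loss every 5 iterations
            always_save_checkpoint = False  # only save when val loss improves

            init_from = 'gpt2-medium'       # start from the 354M-parameter GPT-2 weights
            dataset = 'chat'                # train.bin and val.bin live in data/chat

            batch_size = 4                  # blocks per step; 4 was my GPU's limit
            block_size = 1024               # GPT-2's context length in tokens
            max_iters = 50                  # ~30 minutes of finetuning

            learning_rate = 2e-5            # small learning rate for finetuning
        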

The training loop

This comes from the original NanoGPT repo. You don't need to know this part to finetune or run the model, so skip over this section if that's not what you're interested in.


            import os
            import time
            import math
            import pickle
            from contextlib import nullcontext

            import numpy as np
            import torch
            from torch.nn.parallel import DistributedDataParallel as DDP
            from torch.distributed import init_process_group, destroy_process_group

            from model import GPTConfig, GPT

            # -----------------------------------------------------------------------------
            # default config values designed to train a gpt2 (124M) on OpenWebText
            # I/O
            out_dir = 'out'
            eval_interval = 2000
            log_interval = 1
            eval_iters = 200
            eval_only = False # if True, script exits right after the first eval
            always_save_checkpoint = True # if True, always save a checkpoint after each eval
            init_from = 'scratch' # 'scratch' or 'resume' or 'gpt2*'
            # wandb logging
            wandb_log = False # disabled by default
            wandb_project = 'owt'
            wandb_run_name = 'gpt2' # 'run' + str(time.time())
            # data
            dataset = 'openwebtext'
            gradient_accumulation_steps = 5 * 8 # used to simulate larger batch sizes
            batch_size = 12 # if gradient_accumulation_steps > 1, this is the micro-batch size
            block_size = 1024
            # model
            n_layer = 12
            n_head = 12
            n_embd = 768
            dropout = 0.0 # for pretraining 0 is good, for finetuning try 0.1+
            bias = False # do we use bias inside LayerNorm and Linear layers?
            # adamw optimizer
            learning_rate = 6e-4 # max learning rate
            max_iters = 600000 # total number of training iterations
            weight_decay = 1e-1
            beta1 = 0.9
            beta2 = 0.95
            grad_clip = 1.0 # clip gradients at this value, or disable if == 0.0
            # learning rate decay settings
            decay_lr = True # whether to decay the learning rate
            warmup_iters = 2000 # how many steps to warm up for
            lr_decay_iters = 600000 # should be ~= max_iters per Chinchilla
            min_lr = 6e-5 # minimum learning rate, should be ~= learning_rate/10 per Chinchilla
            # DDP settings
            backend = 'nccl' # 'nccl', 'gloo', etc.
            # system
            device = 'cuda' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1' etc., or try 'mps' on macbooks
            dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16' # 'float32', 'bfloat16', or 'float16', the latter will auto implement a GradScaler
            compile = True # use PyTorch 2.0 to compile the model to be faster
            # -----------------------------------------------------------------------------
            config_keys = [k for k,v in globals().items() if not k.startswith('_') and isinstance(v, (int, float, bool, str))]
            exec(open('configurator.py').read()) # overrides from command line or config file
            config = {k: globals()[k] for k in config_keys} # will be useful for logging
            # -----------------------------------------------------------------------------

            # various inits, derived attributes, I/O setup
            ddp = int(os.environ.get('RANK', -1)) != -1 # is this a ddp run?
            if ddp:
                init_process_group(backend=backend)
                ddp_rank = int(os.environ['RANK'])
                ddp_local_rank = int(os.environ['LOCAL_RANK'])
                ddp_world_size = int(os.environ['WORLD_SIZE'])
                device = f'cuda:{ddp_local_rank}'
                torch.cuda.set_device(device)
                master_process = ddp_rank == 0 # this process will do logging, checkpointing etc.
                seed_offset = ddp_rank # each process gets a different seed
                # world_size number of processes will be training simultaneously, so we can scale
                # down the desired gradient accumulation iterations per process proportionally
                assert gradient_accumulation_steps % ddp_world_size == 0
                gradient_accumulation_steps //= ddp_world_size
            else:
                # if not ddp, we are running on a single gpu, and one process
                master_process = True
                seed_offset = 0
                ddp_world_size = 1
            tokens_per_iter = gradient_accumulation_steps * ddp_world_size * batch_size * block_size
            print(f"tokens per iteration will be: {tokens_per_iter:,}")

            if master_process:
                os.makedirs(out_dir, exist_ok=True)
            torch.manual_seed(1337 + seed_offset)
            torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul
            torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn
            device_type = 'cuda' if 'cuda' in device else 'cpu' # for later use in torch.autocast
            # note: float16 data type will automatically use a GradScaler
            ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
            ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)

            # poor man's data loader
            data_dir = os.path.join('data', dataset)
            train_data = np.memmap(os.path.join(data_dir, 'train.bin'), dtype=np.uint16, mode='r')
            val_data = np.memmap(os.path.join(data_dir, 'val.bin'), dtype=np.uint16, mode='r')
            def get_batch(split):
                data = train_data if split == 'train' else val_data
                ix = torch.randint(len(data) - block_size, (batch_size,))
                x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])
                y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])
                if device_type == 'cuda':
                    # pin arrays x,y, which allows us to move them to GPU asynchronously (non_blocking=True)
                    x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)
                else:
                    x, y = x.to(device), y.to(device)
                return x, y

            # init these up here, can override if init_from='resume' (i.e. from a checkpoint)
            iter_num = 0
            best_val_loss = 1e9

            # attempt to derive vocab_size from the dataset
            meta_path = os.path.join(data_dir, 'meta.pkl')
            meta_vocab_size = None
            if os.path.exists(meta_path):
                with open(meta_path, 'rb') as f:
                    meta = pickle.load(f)
                meta_vocab_size = meta['vocab_size']
                print(f"found vocab_size = {meta_vocab_size} (inside {meta_path})")

            # model init
            model_args = dict(n_layer=n_layer, n_head=n_head, n_embd=n_embd, block_size=block_size,
                              bias=bias, vocab_size=None, dropout=dropout) # start with model_args from command line
            if init_from == 'scratch':
                # init a new model from scratch
                print("Initializing a new model from scratch")
                # determine the vocab size we'll use for from-scratch training
                if meta_vocab_size is None:
                    print("defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)")
                model_args['vocab_size'] = meta_vocab_size if meta_vocab_size is not None else 50304
                gptconf = GPTConfig(**model_args)
                model = GPT(gptconf)
            elif init_from == 'resume':
                print(f"Resuming training from {out_dir}")
                # resume training from a checkpoint.
                ckpt_path = os.path.join(out_dir, 'ckpt.pt')
                checkpoint = torch.load(ckpt_path, map_location=device)
                checkpoint_model_args = checkpoint['model_args']
                # force these config attributes to be equal otherwise we can't even resume training
                # the rest of the attributes (e.g. dropout) can stay as desired from command line
                for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:
                    model_args[k] = checkpoint_model_args[k]
                # create the model
                gptconf = GPTConfig(**model_args)
                model = GPT(gptconf)
                state_dict = checkpoint['model']
                # fix the keys of the state dictionary :(
                # honestly no idea how checkpoints sometimes get this prefix, have to debug more
                unwanted_prefix = '_orig_mod.'
                for k,v in list(state_dict.items()):
                    if k.startswith(unwanted_prefix):
                        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
                model.load_state_dict(state_dict)
                iter_num = checkpoint['iter_num']
                best_val_loss = checkpoint['best_val_loss']
            elif init_from.startswith('gpt2'):
                print(f"Initializing from OpenAI GPT-2 weights: {init_from}")
                # initialize from OpenAI GPT-2 weights
                override_args = dict(dropout=dropout)
                model = GPT.from_pretrained(init_from, override_args)
                # read off the created config params, so we can store them into checkpoint correctly
                for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:
                    model_args[k] = getattr(model.config, k)
            # crop down the model block size if desired, using model surgery
            if block_size < model.config.block_size:
                model.crop_block_size(block_size)
                model_args['block_size'] = block_size # so that the checkpoint will have the right value
            model.to(device)

            # initialize a GradScaler. If enabled=False scaler is a no-op
            scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))

            # optimizer
            optimizer = model.configure_optimizers(weight_decay, learning_rate, (beta1, beta2), device_type)
            if init_from == 'resume':
                optimizer.load_state_dict(checkpoint['optimizer'])
            checkpoint = None # free up memory

            # compile the model
            if compile:
                print("compiling the model... (takes a ~minute)")
                unoptimized_model = model
                model = torch.compile(model) # requires PyTorch 2.0

            # wrap model into DDP container
            if ddp:
                model = DDP(model, device_ids=[ddp_local_rank])

            # helps estimate an arbitrarily accurate loss over either split using many batches
            @torch.no_grad()
            def estimate_loss():
                out = {}
                model.eval()
                for split in ['train', 'val']:
                    losses = torch.zeros(eval_iters)
                    for k in range(eval_iters):
                        X, Y = get_batch(split)
                        with ctx:
                            logits, loss = model(X, Y)
                        losses[k] = loss.item()
                    out[split] = losses.mean()
                model.train()
                return out

            # learning rate decay scheduler (cosine with warmup)
            def get_lr(it):
                # 1) linear warmup for warmup_iters steps
                if it < warmup_iters:
                    return learning_rate * it / warmup_iters
                # 2) if it > lr_decay_iters, return min learning rate
                if it > lr_decay_iters:
                    return min_lr
                # 3) in between, use cosine decay down to min learning rate
                decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
                assert 0 <= decay_ratio <= 1
                coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio)) # coeff ranges 0..1
                return min_lr + coeff * (learning_rate - min_lr)

            # logging
            if wandb_log and master_process:
                import wandb
                wandb.init(project=wandb_project, name=wandb_run_name, config=config)

            # training loop
            X, Y = get_batch('train') # fetch the very first batch
            t0 = time.time()
            local_iter_num = 0 # number of iterations in the lifetime of this process
            raw_model = model.module if ddp else model # unwrap DDP container if needed
            running_mfu = -1.0
            while True:

                # determine and set the learning rate for this iteration
                lr = get_lr(iter_num) if decay_lr else learning_rate
                for param_group in optimizer.param_groups:
                    param_group['lr'] = lr

                # evaluate the loss on train/val sets and write checkpoints
                if iter_num % eval_interval == 0 and master_process:
                    losses = estimate_loss()
                    print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
                    if wandb_log:
                        wandb.log({
                            "iter": iter_num,
                            "train/loss": losses['train'],
                            "val/loss": losses['val'],
                            "lr": lr,
                            "mfu": running_mfu*100, # convert to percentage
                        })
                    if losses['val'] < best_val_loss or always_save_checkpoint:
                        best_val_loss = losses['val']
                        if iter_num > 0:
                            checkpoint = {
                                'model': raw_model.state_dict(),
                                'optimizer': optimizer.state_dict(),
                                'model_args': model_args,
                                'iter_num': iter_num,
                                'best_val_loss': best_val_loss,
                                'config': config,
                            }
                            print(f"saving checkpoint to {out_dir}")
                            torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))
                if iter_num == 0 and eval_only:
                    break

                # forward backward update, with optional gradient accumulation to simulate larger batch size
                # and using the GradScaler if data type is float16
                for micro_step in range(gradient_accumulation_steps):
                    if ddp:
                        # in DDP training we only need to sync gradients at the last micro step.
                        # the official way to do this is with model.no_sync() context manager, but
                        # I really dislike that this bloats the code and forces us to repeat code
                        # looking at the source of that context manager, it just toggles this variable
                        model.require_backward_grad_sync = (micro_step == gradient_accumulation_steps - 1)
                    with ctx:
                        logits, loss = model(X, Y)
                        loss = loss / gradient_accumulation_steps # scale the loss to account for gradient accumulation
                    # immediately async prefetch next batch while model is doing the forward pass on the GPU
                    X, Y = get_batch('train')
                    # backward pass, with gradient scaling if training in fp16
                    scaler.scale(loss).backward()
                # clip the gradient
                if grad_clip != 0.0:
                    scaler.unscale_(optimizer)
                    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
                # step the optimizer and scaler if training in fp16
                scaler.step(optimizer)
                scaler.update()
                # flush the gradients as soon as we can, no need for this memory anymore
                optimizer.zero_grad(set_to_none=True)

                # timing and logging
                t1 = time.time()
                dt = t1 - t0
                t0 = t1
                if iter_num % log_interval == 0 and master_process:
                    # get loss as float. note: this is a CPU-GPU sync point
                    # scale up to undo the division above, approximating the true total loss (exact would have been a sum)
                    lossf = loss.item() * gradient_accumulation_steps
                    if local_iter_num >= 5: # let the training loop settle a bit
                        mfu = raw_model.estimate_mfu(batch_size * gradient_accumulation_steps, dt)
                        running_mfu = mfu if running_mfu == -1.0 else 0.9*running_mfu + 0.1*mfu
                    print(f"iter {iter_num}: loss {lossf:.4f}, time {dt*1000:.2f}ms, mfu {running_mfu*100:.2f}%")
                iter_num += 1
                local_iter_num += 1

                # termination conditions
                if iter_num > max_iters:
                    break

            if ddp:
                destroy_process_group()
        

It starts by checking whether we are running with DDP (Distributed Data Parallel, a way to split training over multiple GPUs) or on a single GPU. After that, it prints the number of tokens it will go through in one iteration. For an LLM with massive amounts of data, you probably won't go through all the data, so don't worry if iterations times tokens-per-iteration doesn't match the total token count.
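
As a worked example of that tokens-per-iteration number, using the default values from the config block above on a single GPU (so ddp_world_size is 1):


            gradient_accumulation_steps = 5 * 8   # 40
            ddp_world_size = 1                    # single GPU
            batch_size = 12
            block_size = 1024

            tokens_per_iter = gradient_accumulation_steps * ddp_world_size * batch_size * block_size
            print(f"{tokens_per_iter:,}")         # 491,520 tokens per iteration
        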

The code then sets up the directory where our model's ckpt.pt goes. Next there is a PyTorch seed to make results reproducible; you could remove it for variation in answers when running the model multiple times. Then we allow TF32 for matrix multiplication and cuDNN, which helps performance, check whether CUDA and a GPU are available, and set up a dictionary mapping dtype strings to PyTorch datatypes. Then we build ctx, which is a nullcontext if the device is a CPU, but an AMP autocast context otherwise, another performance improver.

Then get_batch memory-maps the train.bin and val.bin files, samples batch_size random windows of block_size tokens, and converts the numpy slices into PyTorch tensors. It then moves them to the GPU if possible, and finally returns a batch of data as a tuple (x, y), where x is the input tokens and y is the target tokens, i.e. x shifted one position forward.
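
To see what that one-token shift looks like concretely, here is a tiny sketch with a made-up token sequence and a block_size of 4:


            import numpy as np
            import torch

            data = np.array([10, 20, 30, 40, 50, 60], dtype=np.uint16)  # pretend these are token ids
            block_size = 4
            i = 1  # a randomly chosen start index, like the ones torch.randint picks in get_batch

            x = torch.from_numpy(data[i:i+block_size].astype(np.int64))      # tensor([20, 30, 40, 50])
            y = torch.from_numpy(data[i+1:i+1+block_size].astype(np.int64))  # tensor([30, 40, 50, 60])
            # at each position, the model is trained to predict the next token, i.e. y[t] from x[:t+1]
        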

Next come the model arguments, which are used to initialize the model; they include the number of layers, the number of heads, the embedding dimension, the block size, the bias setting, and the vocabulary size. The code then checks the value of the init_from variable.
If init_from is set to scratch, the code creates a new GPT model from scratch. If init_from is set to resume, it resumes training from a checkpoint. If init_from starts with gpt2, it initializes the model from the pre-trained OpenAI GPT-2 weights. Then the model's block size is cropped down to block_size if it happens to be too large, and the model is moved to the GPU if available.

The estimate_loss function gives us the loss estimates we see during training. It is decorated with torch.no_grad(), which tells PyTorch not to compute gradients inside it. This matters because the function is only estimating the loss over eval_iters batches, and there is no need to compute gradients for that.

The code uses a cosine learning-rate schedule with linear warmup (get_lr), and then the real training loop starts.

It begins by fetching the very first batch and setting up the iteration counter, then it determines and sets the learning rate for the iteration. Next the code logs the train and val loss, and saves a checkpoint if the val loss is the best so far. The code then performs a forward and backward pass of the model: the forward pass sends the input through the model to get an output and a loss, and the backward pass computes the gradients of that loss.

The code then clips the gradients and steps the optimizer and the GradScaler, which scales the loss and gradients when training in fp16. Finally, it zeroes the gradients so that they are not accumulated into the next iteration.

Then the code prints the loss, the iteration time, and the model flops utilization (MFU). It increments the iteration counter and repeats the loop until max_iters is reached, and finally destroys the process group if the run was using distributed training.

Chatting with the Bot

Now that we have a dataset and a model, it's time to chat with, or run inference on, the model. For NanoChatGPT, we do this with:


            !python chat.py --out_dir=/content/drive/MyDrive/Model --init_from=resume --context=" human: Hello how are
            you? bot: Hello, I'm fine how about you?"
        
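
chat.py handles prompt formatting and the sampling loop for you, but under the hood, sampling from a NanoGPT checkpoint boils down to loading ckpt.pt and calling the GPT model's generate method. The snippet below is only a rough sketch of that idea, not the actual contents of chat.py (the real script also handles the human:/bot: turn format, command-line arguments, and other details):


            # rough sketch of sampling from a NanoGPT checkpoint; see chat.py in the repo for the real version
            import torch
            import tiktoken
            from model import GPTConfig, GPT

            device = 'cuda'
            enc = tiktoken.get_encoding("gpt2")

            checkpoint = torch.load('out/ckpt.pt', map_location=device)   # path depends on your --out_dir
            model = GPT(GPTConfig(**checkpoint['model_args']))

            state_dict = checkpoint['model']
            for k in list(state_dict):
                if k.startswith('_orig_mod.'):  # prefix left over from torch.compile during training
                    state_dict[k[len('_orig_mod.'):]] = state_dict.pop(k)
            model.load_state_dict(state_dict)
            model.eval().to(device)

            context = " human: Hello how are you? bot:"
            idx = torch.tensor([enc.encode_ordinary(context)], dtype=torch.long, device=device)

            with torch.no_grad():
                out = model.generate(idx, max_new_tokens=100, temperature=0.8, top_k=200)
            print(enc.decode(out[0].tolist()))
        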

Features and future improvements

This isn't the full NanoChatGPT; the full version has many things this version doesn't.

In terms of future improvements, these are things that could be done for this model, though I probably won't spend the time building them myself. If you're interested, make a PR.

Math and logical reasoning → While there are datasets for this, it's a lot to add and it's pretty different from the rest of the dataset, so I might add them, but this model would do worse at it than a model with more parameters.

Short-term memory → I haven't found a great dataset for this yet, but one could design a short-term memory format.

That's it for this tutorial. I hope you found it interesting, and that you build on NanoChatGPT for your own cool stuff.

GitHub → https://github.com/VatsaDev/nanoChatGPT