[PyTorch Lightning] GPU memory keeps increasing during training

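Symptom

After training started, GPU VRAM usage gradually increased as training progressed, and the run eventually hit an OOM (out-of-memory) error partway through.

Normally, VRAM usage is determined by hyperparameters set in the config, such as batch size, gradient accumulation, and number of GPUs, and it does not change during training (unless you deliberately change the batch size after a certain step). In my case, however, VRAM usage kept climbing during training.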
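
One way to confirm this kind of growth is to record the allocated CUDA memory once per step and check whether it keeps rising. The following is only a minimal sketch: the helper names `allocated_mib` and `grows_steadily` and the 1 MiB tolerance are assumptions for illustration and are not part of my training code.

```python
try:
    import torch
except ImportError:  # the growth check below also works without PyTorch
    torch = None


def allocated_mib():
    """Return CUDA memory currently held by live tensors, in MiB (0.0 without a GPU)."""
    if torch is None or not torch.cuda.is_available():
        return 0.0
    return torch.cuda.memory_allocated() / 2**20


def grows_steadily(readings, tolerance_mib=1.0):
    """True if every reading exceeds the previous one by more than tolerance_mib."""
    if len(readings) < 2:
        return False
    return all(b - a > tolerance_mib for a, b in zip(readings, readings[1:]))


# Hypothetical usage: append one reading per training step, then inspect the trend.
readings = [allocated_mib() for _ in range(3)]
print(grows_steadily(readings))
```

In a real run, the reading would be taken at the same point of every step (for example at the end of `training_step`), so that the values are comparable from step to step.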

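Cause analysis

In the training step, the dataset minibatch on the CPU is moved to the GPU with calls such as .to('cuda'), and my suspicion was that the minibatches uploaded this way kept accumulating on the GPU. The code in question is below.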

    def training_step(self, batch, batch_idx):
        
        # single dataset 
        texts, images = batch
       
        with torch.no_grad():
            # special tokens
            self.s_token = torch.tensor(self.tokenizer.encode("USER: "), dtype=torch.int64, device="cuda", requires_grad=False)
            self.e_token = torch.tensor(self.tokenizer.encode("ASSISTANT: "), dtype=torch.int64, device="cuda", requires_grad=False)
            self.sep = torch.tensor(self.tokenizer.encode("\n"), dtype=torch.int64, device="cuda", requires_grad=False)
            # TODO: hard-coded; adjust per task
            self.task_instruction = torch.tensor(self.tokenizer.encode(" Please generate an image."), dtype=torch.int64, device="cuda", requires_grad=False)

            for key, value in self.special_tokens.items():
                self.special_tokens[key] = torch.tensor([value], dtype=torch.int64, device="cuda", requires_grad=False)

        batch_sequences = []
        prefix = torch.cat([self.special_tokens['bos'], self.s_token, self.special_tokens['boi']])
    
        for i in range(self.config.experiment.train_bsz):
            
            # tokenize image
            image_token_1 = (self.visual_tokenizer.encode_image(image_torch=images[i][0].cuda()) + 32000)[0, :]
            image_token_2 = (self.visual_tokenizer.encode_image(image_torch=images[i][1].cuda()) + 32000)[0, :]
            
            # construct sequence per sample (captioning)
            # task_type = 1
            instruction = torch.cat([prefix, image_token_1, self.special_tokens['eoi'], texts[i].cuda(), self.task_instruction, self.sep, self.e_token, self.special_tokens['boi'], image_token_2, self.special_tokens['eoi'], self.special_tokens['eos']], dim=0)
            
            # token span over which the loss is computed
            target_start_idx = len(prefix) + len( image_token_1) + len(self.special_tokens['eoi']) + len(texts[i]) + len(self.task_instruction) + len(self.sep) + len(self.e_token) + len(self.special_tokens['boi'])
            target_end_idx = target_start_idx + len(image_token_2)   
          
            batch_sequences.append(instruction)

        # Stack batch sequences into a 2D tensor
        batch_sequences = torch.nn.utils.rnn.pad_sequence(batch_sequences, batch_first=True, padding_value=0)

        logits = self.forward(batch_sequences)

        # Extract the logits corresponding to {    }
        target_logits = logits[:, target_start_idx:target_end_idx, :]
        target_ids = batch_sequences[:, target_start_idx:target_end_idx]

        # Compute the loss for text_list[1]
        loss = self.loss_fn(
            target_logits.reshape(-1, self.total_vocab_size),
            target_ids.reshape(-1)
        )
        
        self.log_dict({"train/loss": loss.item(),
                       "train/lr": self.trainer.optimizers[0].param_groups[0]['lr'],
                       "train/global_step": self.global_step},
                      on_step=True, on_epoch=True, prog_bar=True, logger=True, batch_size=self.config.experiment.train_bsz)
        
        return {"loss": loss}

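In fact, this code differed very little from the code I had been using before, which trained without problems, so I could not pin down exactly which part caused the accumulation. However, after I changed how the batch is moved to the GPU, as in the code below, the problem no longer occurred.

Solution

Instead of moving the minibatch to the GPU sample by sample, I moved the whole minibatch to the GPU at once, as a single tensor (combined with torch.stack or torch.cat).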

    def sft_dataloader(self, batch, task_type, bsz):
        if task_type == 0 or task_type == 4:
            _, img, txt = batch
            img = img.to("cuda")
            txt_token = txt.to("cuda")
            
            ...
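
The idea of combining the samples first and transferring them in one call can be sketched as follows. This is a simplified illustration, not my actual training code: the function name `move_batch_at_once` and the random stand-in images are assumptions, and it presumes every sample has the same shape (variable-length text would need padding before it can be stacked).

```python
import torch


def move_batch_at_once(samples, device):
    """Stack same-shaped CPU tensors into one batch and copy it to `device` in a single call."""
    batch = torch.stack(samples, dim=0)  # shape: (batch_size, *sample_shape)
    return batch.to(device)


# Fall back to the CPU so the sketch also runs on a machine without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
images = [torch.rand(3, 8, 8) for _ in range(4)]  # stand-in for one minibatch of images
batch = move_batch_at_once(images, device)
print(batch.shape, batch.device)
```

Compared with calling `.cuda()` on each sample inside a loop, this performs one host-to-device copy per minibatch, which is the change that made the memory growth disappear in my case.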

References. (In fact, very few posts mention this phenomenon, and none of the ones I found were especially helpful.)

https://discuss.pytorch.org/t/gpu-memory-consumption-increases-while-training/2770/26

