Tuning deep learning pipelines is like finding the right gear combination (Image by Tim Mossholder on Unsplash)

Why should you read this post?

The training/inference processes of deep learning models involve lots of steps. The faster each experiment iteration is, the more we can optimize the whole model prediction performance given limited time and resources. I collected and organized several PyTorch tricks and tips to maximize the efficiency of memory usage and minimize the run time. To better leverage these tips, we also need to understand how and why they work.

I start by providing a full list and a combined code snippet in case you'd like to jump straight into optimizing your scripts. Then I dive into them one by one in detail afterward. For each tip, I also provide code snippets and annotate whether it is specific to certain device types (CPU/GPU) or model types.

1. Move the data used in active projects to the SSD (or the hard drive with better i/o)
2. DataLoader(dataset, num_workers=4*num_GPU)
4. Directly create vectors/matrices/tensors as torch.Tensor, on the device where they will run operations
5. Avoid unnecessary data transfer between CPU and GPU
7. Use tensor.to(non_blocking=True) when it's applicable, to overlap data transfers
8. Fuse pointwise (elementwise) operations into a single kernel with PyTorch JIT
9. Set the sizes of all different architecture designs as multiples of 8 (for FP16 of mixed precision)
10. Set the batch size as a multiple of 8 and maximize GPU memory usage
11. Use mixed precision for the forward pass (but not the backward pass)
12. Set gradients to None (e.g., model.zero_grad(set_to_none=True)) before the optimizer updates the weights
13. Gradient accumulation: update weights only every x batches to mimic a larger batch size

CNN (Convolutional Neural Network) specific

16. Use channels_last memory format for 4D NCHW tensors
17. Turn off bias for convolutional layers that are right before batch normalization
18. Use DistributedDataParallel instead of DataParallel

Code snippet combining the tips No. 7, 11, 12, 13: non_blocking, AMP, setting gradients as None, and larger effective batch size

```python
from torch.cuda.amp import autocast, GradScaler

# model, criterion, optimizer, and dataloader are assumed to be defined earlier
model.train()
# Reset the gradients to None
optimizer.zero_grad(set_to_none=True)
scaler = GradScaler()
for i, (features, target) in enumerate(dataloader):
    # these two calls are nonblocking and overlapping
    features = features.to('cuda:0', non_blocking=True)
    target = target.to('cuda:0', non_blocking=True)
    # Forward pass with mixed precision
    with autocast():  # autocast as a context manager
        output = model(features)
        loss = criterion(output, target)
    # Backward pass without mixed precision
    # It's not recommended to use mixed precision for the backward pass
    # because we need a more precise loss
    scaler.scale(loss).backward()
    # Only update weights every other 2 iterations
    # Effective batch size is doubled
    if (i + 1) % 2 == 0 or (i + 1) == len(dataloader):
        # scaler.step() first unscales the gradients.
        # If these gradients contain infs or NaNs,
        # optimizer.step() is skipped.
        scaler.step(optimizer)
        # If optimizer.step() was skipped, the scaling factor
        # is reduced by the backoff_factor in GradScaler()
        scaler.update()
        # Reset the gradients to None
        optimizer.zero_grad(set_to_none=True)
```

High-level concepts

Overall, you can optimize time and memory usage by 3 key points. First, reduce i/o (input/output) as much as possible so that the model pipeline is bound to the calculations (math-limited or math-bound) instead of bound to i/o (bandwidth-limited or memory-bound). This way, we can leverage GPUs and their specialization to accelerate those computations. Second, overlap the processes as much as possible to save time. Third, maximize the memory usage efficiency to save memory. Saving memory may then enable a larger batch size, which in turn saves more time. Having more time facilitates a faster model development cycle and leads to better model performance.

Some machines have different hard drives like HDD and SSD. It's recommended to move the data that will be used in the active projects to the SSD (or the hard drive with better i/o) for faster speed.

Asynchronous data loading and augmentation

num_workers=0 makes data loading execute only after training or the previous process is done. Setting num_workers > 0 is expected to accelerate the process, especially for the i/o and augmentation of large data. For GPU specifically, this experiment found that num_workers = 4*num_GPU had the best performance. That being said, you can also test the best num_workers for your machine.
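A minimal sketch of how one might test num_workers on a given machine, by timing one pass over the data for each setting. The synthetic dataset, batch size, and worker counts below are illustrative placeholders, not values from the original experiment.

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative only: a small synthetic dataset standing in for real data
dataset = TensorDataset(torch.randn(4096, 32), torch.randint(0, 10, (4096,)))

def time_one_epoch(num_workers: int) -> float:
    """Time one full pass over the dataset for a given num_workers setting."""
    loader = DataLoader(dataset, batch_size=64, num_workers=num_workers)
    start = time.perf_counter()
    for features, target in loader:
        pass  # a real loop would run the forward/backward pass here
    return time.perf_counter() - start

if __name__ == "__main__":  # guard needed when workers spawn subprocesses
    for n in (0, 2, 4):
        print(f"num_workers={n}: {time_one_epoch(n):.3f}s")
```

On a real workload, the sweep would include the 4*num_GPU setting and the fastest value would be kept.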
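As a sketch of the JIT-fusion tip (fusing a chain of pointwise operations into fewer kernels), here is a scripted GELU-like function; the function itself and its constants are illustrative, not from the original post.

```python
import torch

# A chain of pointwise ops: torch.jit.script lets PyTorch fuse them
# into fewer kernels on GPU instead of launching one kernel per op.
@torch.jit.script
def fused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    x = x + bias
    return x * 0.5 * (1.0 + torch.erf(x / 1.41421))

x = torch.randn(8, 16)
bias = torch.randn(16)
out = fused_bias_gelu(x, bias)
print(out.shape)  # torch.Size([8, 16])
```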
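A minimal sketch combining the two CNN-specific tips, channels_last memory format and dropping the bias of a convolution that feeds straight into batch normalization; the layer sizes here are illustrative.

```python
import torch
import torch.nn as nn

# A conv layer immediately followed by BatchNorm can drop its bias:
# BatchNorm subtracts the per-channel mean, cancelling any bias term.
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)

# channels_last stores 4D NCHW tensors in NHWC memory layout,
# which convolution kernels (especially under AMP) handle faster.
block = block.to(memory_format=torch.channels_last)
x = torch.randn(8, 3, 32, 32).to(memory_format=torch.channels_last)
out = block(x)
print(out.shape)  # torch.Size([8, 16, 32, 32])
```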