Am Neumarkt 😱

#ml

https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/

I find this post very useful. I have always wondered what happens after my dataloader has prepared everything for the GPU. I didn’t know that CUDA first has to copy the data from pageable host memory into a page-locked (pinned) staging buffer before it can transfer it to the GPU.

I used to set pin_memory=True in a PyTorch DataLoader and benchmark it. To be honest, I have only observed very small improvements in most of my experiments. So I stopped caring about pin_memory.

After some digging, I also realized that getting a performance gain from setting pin_memory=True in DataLoader is tricky. If we neither use multiprocessing nor reuse the page-locked memory, it is hard to expect any performance gain.
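
A rough way to check this on your own machine is a small timing sketch like the one below. It is only an illustration: the helper name `time_epoch`, the synthetic dataset, and the batch and worker sizes are all made up for this note, and on a machine without CUDA it simply falls back to the CPU and measures nothing interesting.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_epoch(pin_memory: bool, num_workers: int = 2) -> float:
    """Return the seconds needed to move one epoch of batches to the device.

    Hypothetical helper for this note, not part of PyTorch.
    """
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    # Synthetic data; the sizes are arbitrary placeholders.
    dataset = TensorDataset(
        torch.randn(2048, 3, 32, 32),
        torch.randint(0, 10, (2048,)),
    )
    loader = DataLoader(
        dataset,
        batch_size=256,
        num_workers=num_workers,
        # Pinning only makes sense when a CUDA device is present.
        pin_memory=pin_memory and use_cuda,
    )

    start = time.perf_counter()
    for images, labels in loader:
        # non_blocking=True can only run asynchronously if the source is pinned.
        images = images.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
    if use_cuda:
        # Wait for outstanding copies so the timing is meaningful.
        torch.cuda.synchronize()
    return time.perf_counter() - start


if __name__ == "__main__":
    for pin in (False, True):
        print(f"pin_memory={pin}: {time_epoch(pin):.3f} s")
```

With a tiny dataset like this the difference is likely to be within noise, which matches what I saw in my own experiments. Try num_workers=0 as well to see the case without multiprocessing.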

(some other notes: https://datumorphism.leima.is/cards/machine-learning/practice/cuda-memory/)

NVIDIA Technical Blog

How to Optimize Data Transfers in CUDA C/C++

In the previous three posts of this CUDA C & C++ series we laid the groundwork for the major thrust of the series: how to optimize CUDA C/C++ code. In this and the following post we begin our…