ML Engineering 101: A thorough explanation of the error “DataLoader worker (pid(s) xxx) exited unexpectedly” | By Mengliu Zhao | June 2024

torch.multiprocessing best practices

However, virtual memory is only one side of the story: what if adjusting your swap space doesn't solve the problem?

Another aspect is the behavior of the torch.multiprocessing module itself, whose official web page provides many best-practice recommendations.

But in addition to these, there are three further approaches to consider, especially when it comes to memory usage:

First, there is the shared memory leak. By leaking, we mean that memory is not properly released after each execution of the child workers. This can be observed by monitoring virtual memory usage at runtime: consumption keeps growing until it reaches an “out of memory” state, which is the classic signature of a memory leak.
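Such growth can be watched from inside the training loop. Here is a minimal, Linux-only sketch (it parses /proc, so it will not work on macOS or Windows):

```python
import os

def rss_mb() -> float:
    """Resident set size of the current process in MB (Linux-only: parses /proc)."""
    with open(f"/proc/{os.getpid()}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024  # the value is reported in kB
    return -1.0

# Log this once per epoch: if the number only ever goes up, you likely have a leak.
print(f"RSS: {rss_mb():.1f} MB")
```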

So what is causing the leak?

Let's take a look at the DataLoader class itself.

https://github.com/pytorch/pytorch/blob/main/torch/utils/data/dataloader.py

If we look inside DataLoader, we can see that _MultiProcessingDataLoaderIter is used when num_workers > 0. Inside _MultiProcessingDataLoaderIter, torch.multiprocessing sets up the worker queues. torch.multiprocessing supports two different strategies for memory sharing and caching: file_descriptor and file_system. The file_system strategy does not require file descriptor caching, but it is prone to shared memory leaks.
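You can verify which iterator class you get directly, although keep in mind these class names are private implementation details of PyTorch and may change between versions:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(8, dtype=torch.float32))

# With num_workers=0 the DataLoader iterates in the main process...
single = iter(DataLoader(ds, num_workers=0))
print(type(single).__name__)  # _SingleProcessDataLoaderIter

# ...with num_workers > 0 it hands off to the multiprocessing iterator.
multi = iter(DataLoader(ds, num_workers=1))
print(type(multi).__name__)   # _MultiProcessingDataLoaderIter
```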

To see which sharing strategy a machine is using, simply add the following to your script:

print(torch.multiprocessing.get_sharing_strategy())

To get the system file descriptor limit (Linux), run the following command in a terminal:

ulimit -n

To switch the sharing strategy to file_descriptor:

torch.multiprocessing.set_sharing_strategy("file_descriptor")

To count the number of open file descriptors, run the following command:

ls /proc/self/fd | wc -l
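The same checks can also be scripted from Python using only the standard library (the resource module is Unix-only, and /proc/self/fd exists only on Linux):

```python
import os
import resource

# Soft/hard limits on open file descriptors -- the Python equivalent of `ulimit -n`.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Number of file descriptors currently open by this process (Linux only).
open_fds = len(os.listdir("/proc/self/fd"))

print(f"fd limit: {soft} (soft) / {hard} (hard), open now: {open_fds}")
```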

As long as the system limits allow it, the file_descriptor strategy is recommended.

The second consideration is how the worker processes are started. In short, it is the debate over whether to use fork or spawn as the worker start method. fork is the default start method on Linux and is much faster, since child processes share the parent's memory pages copy-on-write instead of re-importing everything from scratch, but it can cause issues with libraries that are not fork-safe, such as CUDA tensors or OpenCV, when used with a DataLoader.

To use the spawn method, simply pass multiprocessing_context="spawn" to your DataLoader.
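A minimal sketch (the toy dataset and sizes are made up for illustration; the `if __name__ == "__main__"` guard is required with spawn, because each worker re-imports the main module):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """A tiny stand-in dataset: item i is the one-element tensor [i]."""
    def __len__(self):
        return 8
    def __getitem__(self, i):
        return torch.tensor([i], dtype=torch.float32)

if __name__ == "__main__":
    loader = DataLoader(
        ToyDataset(),
        batch_size=4,
        num_workers=2,
        multiprocessing_context="spawn",  # start workers with spawn instead of fork
    )
    shapes = [tuple(batch.shape) for batch in loader]
    print(shapes)  # two batches, each stacking four one-element tensors
```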

Third, make your Dataset objects picklable/serializable.

Here's a very good post that goes into more detail about the “copy-on-read” effect of process forking: https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multiprocess-DataLoader/

Simply put, it is no longer a good approach to keep the list of file names in a Python list and load files from it in the __getitem__ method. Instead, store the file names in a NumPy array or pandas DataFrame, so that the list serializes cleanly and avoids the copy-on-read problem. Also, if you are familiar with HuggingFace, I recommend using a CSV/DataFrame to load a local dataset: https://huggingface.co/docs/datasets/v2.19.0/en/package_reference/loading_methods#datasets.load_dataset.example-2
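A minimal sketch of the idea (the file names are made up; a real __getitem__ would open and decode the file):

```python
import pickle

import numpy as np
from torch.utils.data import Dataset

class FileListDataset(Dataset):
    """Keeps file names in one NumPy array instead of a Python list.

    A Python list of str objects gets its reference counts updated on every
    access, which dirties copy-on-write pages in forked workers (the
    "copy-on-read" effect); a fixed-width NumPy string array is a single
    buffer, and it also pickles cheaply for spawn-started workers.
    """
    def __init__(self, filenames):
        self.filenames = np.array(filenames)  # fixed-width unicode array
    def __len__(self):
        return len(self.filenames)
    def __getitem__(self, i):
        # Illustration only: a real dataset would load the file here.
        return str(self.filenames[i])

ds = FileListDataset([f"img_{k}.png" for k in range(1000)])
restored = pickle.loads(pickle.dumps(ds))  # round-trips without issue
print(len(restored), restored[2])
```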


