ML Engineering 101: A thorough explanation of the error “DataLoader worker (pid(s) xxx) exited unexpectedly” | By Mengliu Zhao | June 2024

torch.multiprocessing best practices

However, virtual memory is only one side of the story: what if adjusting your swap space doesn't solve the problem?

Another aspect is the behavior of the torch.multiprocessing module itself, whose official web page provides many best-practice recommendations.

But in addition to these, there are three further approaches to consider, especially when it comes to memory usage:

First, there is the shared memory leak. By leaking, we mean that memory is not properly released after each execution of the child workers. This can be observed by monitoring virtual memory usage at runtime: consumption keeps growing until it reaches an “out of memory” state, which is the classic signature of a memory leak.
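Such growth can be watched from inside the training loop. Here is a minimal, Linux-only sketch (it parses /proc, so it will not work on macOS or Windows):

```python
import os

def rss_mb() -> float:
    """Resident set size of the current process in MB (Linux-only: parses /proc)."""
    with open(f"/proc/{os.getpid()}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024  # the value is reported in kB
    return -1.0

# Log this once per epoch: if the number only ever goes up, you likely have a leak.
print(f"RSS: {rss_mb():.1f} MB")
```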

So what is causing the leak?

Let's take a look at the DataLoader class itself.

https://github.com/pytorch/pytorch/blob/main/torch/utils/data/dataloader.py

If we look inside DataLoader, we can see that _MultiProcessingDataLoaderIter is used when num_workers > 0. Inside _MultiProcessingDataLoaderIter, torch.multiprocessing sets up the worker queues. torch.multiprocessing supports two different strategies for memory sharing and caching: file_descriptor and file_system. The file_system strategy does not require file descriptor caching, but it is prone to shared memory leaks.
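You can verify which iterator class you get directly, although keep in mind these class names are private implementation details of PyTorch and may change between versions:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(8, dtype=torch.float32))

# With num_workers=0 the DataLoader iterates in the main process...
single = iter(DataLoader(ds, num_workers=0))
print(type(single).__name__)  # _SingleProcessDataLoaderIter

# ...with num_workers > 0 it hands off to the multiprocessing iterator.
multi = iter(DataLoader(ds, num_workers=1))
print(type(multi).__name__)   # _MultiProcessingDataLoaderIter
```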

To see which sharing strategy a machine is using, simply add the following to your script:

print(torch.multiprocessing.get_sharing_strategy())

To get the system file descriptor limit (Linux), run the following command in a terminal:

ulimit -n

To switch the sharing strategy to file_descriptor:

torch.multiprocessing.set_sharing_strategy("file_descriptor")

To count the number of open file descriptors, run the following command:

ls /proc/self/fd | wc -l
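The same checks can also be scripted from Python using only the standard library (the resource module is Unix-only, and /proc/self/fd exists only on Linux):

```python
import os
import resource

# Soft/hard limits on open file descriptors -- the Python equivalent of `ulimit -n`.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Number of file descriptors currently open by this process (Linux only).
open_fds = len(os.listdir("/proc/self/fd"))

print(f"fd limit: {soft} (soft) / {hard} (hard), open now: {open_fds}")
```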

As long as the system limits allow it, the file_descriptor strategy is recommended.

The second consideration is how the worker processes are started. In short, it is the debate over whether to use fork or spawn as the worker start method. fork is the default start method on Linux and is much faster, since child processes share the parent's memory pages copy-on-write instead of re-importing everything from scratch, but it can cause issues with libraries that are not fork-safe, such as CUDA tensors or OpenCV, when used with a DataLoader.

To use the spawn method, simply pass multiprocessing_context="spawn" to your DataLoader.
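A minimal sketch (the toy dataset and sizes are made up for illustration; the `if __name__ == "__main__"` guard is required with spawn, because each worker re-imports the main module):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """A tiny stand-in dataset: item i is the one-element tensor [i]."""
    def __len__(self):
        return 8
    def __getitem__(self, i):
        return torch.tensor([i], dtype=torch.float32)

if __name__ == "__main__":
    loader = DataLoader(
        ToyDataset(),
        batch_size=4,
        num_workers=2,
        multiprocessing_context="spawn",  # start workers with spawn instead of fork
    )
    shapes = [tuple(batch.shape) for batch in loader]
    print(shapes)  # two batches, each stacking four one-element tensors
```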

Third, make your Dataset objects picklable/serializable.

Here's a very good post that goes into more detail about the “copy-on-read” effect of process forking: https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multiprocess-DataLoader/

Simply put, it is no longer a good approach to keep the list of file names in a Python list and load files from it in the __getitem__ method. Instead, store the file names in a NumPy array or pandas DataFrame, so that the list serializes cleanly and avoids the copy-on-read problem. Also, if you are familiar with HuggingFace, I recommend using a CSV/DataFrame to load a local dataset: https://huggingface.co/docs/datasets/v2.19.0/en/package_reference/loading_methods#datasets.load_dataset.example-2
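A minimal sketch of the idea (the file names are made up; a real __getitem__ would open and decode the file):

```python
import pickle

import numpy as np
from torch.utils.data import Dataset

class FileListDataset(Dataset):
    """Keeps file names in one NumPy array instead of a Python list.

    A Python list of str objects gets its reference counts updated on every
    access, which dirties copy-on-write pages in forked workers (the
    "copy-on-read" effect); a fixed-width NumPy string array is a single
    buffer, and it also pickles cheaply for spawn-started workers.
    """
    def __init__(self, filenames):
        self.filenames = np.array(filenames)  # fixed-width unicode array
    def __len__(self):
        return len(self.filenames)
    def __getitem__(self, i):
        # Illustration only: a real dataset would load the file here.
        return str(self.filenames[i])

ds = FileListDataset([f"img_{k}.png" for k in range(1000)])
restored = pickle.loads(pickle.dumps(ds))  # round-trips without issue
print(len(restored), restored[2])
```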


