efficient model loading and usage across multiple thread on different devices (desktop, cuda, metal, cpu, etc.) #2326
-
hi! i'm using candle in https://github.com/louis030195/screen-pipe atm to transcribe audio using whisper (maybe later for OCR, embeddings, etc.). everything is mostly inefficient/non-optimised atm:
- we record a list of audio inputs & outputs
- each audio device recording + transcription runs in its own thread
- we do not batch
- we use metal if available, cuda if available, otherwise cpu
- we record audio 24/7 and run transcription every 5s, write to a wav file every 30s, and save things to local sqlite
- the model is loaded/unloaded every 5s (very inefficient I guess?)

trying to optimise a bit now, especially facing seg fault 11, killed 9, and bus 10 errors: mediar-ai/screenpipe#29

questions
thanks a lot! 🙏
Replies: 1 comment 2 replies
-
If you load the same model multiple times (same thread or multiple threads doesn't make much of a difference), this will duplicate the memory footprint of the weights.
Instead you probably want to load the model once and then clone it to get a separate KV cache (if it's a model with a KV cache): cloning shares the weights, so there's no duplicated memory, while each clone gets its own KV cache. Fwiw that's how we handle it to serve moshi.
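To make the pattern concrete, here's a minimal sketch in plain Rust (no candle APIs; the `Weights`, `KvCache`, and `Model` types are hypothetical stand-ins): the large weight buffer sits behind an `Arc`, so `clone()` only bumps a refcount, while the per-stream KV cache is genuinely copied so each thread can mutate its own.

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical stand-ins for illustration only.
struct Weights(Vec<f32>); // large, immutable after load
#[derive(Clone)]
struct KvCache(Vec<usize>); // small, mutable, per inference stream

#[derive(Clone)]
struct Model {
    weights: Arc<Weights>, // shared: cloning only bumps the refcount
    cache: KvCache,        // duplicated per clone
}

fn main() {
    // Load once...
    let model = Model {
        weights: Arc::new(Weights(vec![0.0; 1_000_000])),
        cache: KvCache(Vec::new()),
    };

    // ...then hand a cheap clone to each worker thread.
    let handles: Vec<_> = (0..4)
        .map(|i| {
            let mut m = model.clone(); // weights shared, cache fresh
            thread::spawn(move || {
                m.cache.0.push(i); // each thread mutates only its own cache
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }

    // After all clones are dropped, a single copy of the weights remains.
    assert_eq!(Arc::strong_count(&model.weights), 1);
    println!("weights refcount: {}", Arc::strong_count(&model.weights));
}
```

The same idea should apply to your 5s transcription loop: keep one loaded model alive for the lifetime of the process and clone per run, rather than loading/unloading the weights each cycle.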
And yes, your errors (segfault 11 / killed 9 / bus error 10) certainly look like the OS running out of memory and killing the process.