efficient model loading and usage across multiple thread on different devices (desktop, cuda, metal, cpu, etc.) #2326
-
hi! i'm using candle in https://github.com/louis030195/screen-pipe atm to transcribe audio using whisper (maybe later for OCR, embeddings, etc.). everything is mostly inefficient/non-optimised atm:
- we record a list of audio inputs & outputs
- each audio device recording + transcription runs in its own thread
- we do not batch
- we use metal if available, cuda if available, otherwise cpu
- we record audio 24/7 and run transcription every 5s, write to a wav file every 30s, and save things to local sqlite
- the model is loaded/unloaded every 5s (very inefficient I guess?)

trying to optimise a bit now, especially facing seg fault 11, killed 9, and bus 10 errors: mediar-ai/screenpipe#29

questions
thanks a lot! 🙏
Replies: 1 comment 2 replies
-
If you load the same model multiple times (same thread or multiple threads doesn't make much of a difference), this will duplicate the memory footprint of the weights.
Instead you probably want to load the model once and then clone it to get a separate KV cache (if it's a model with a KV cache): cloning shares the weights, so there's no duplicated memory, while each clone gets its own KV cache. Fwiw that's how we handle it to serve moshi.
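To make the pattern concrete, here's a minimal sketch in plain Rust (no candle APIs; the `Weights`, `KvCache`, and `Model` types are hypothetical stand-ins): the large weight buffer sits behind an `Arc`, so `clone()` only bumps a refcount, while the per-stream KV cache is genuinely copied so each thread can mutate its own.

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical stand-ins for illustration only.
struct Weights(Vec<f32>); // large, immutable after load
#[derive(Clone)]
struct KvCache(Vec<usize>); // small, mutable, per inference stream

#[derive(Clone)]
struct Model {
    weights: Arc<Weights>, // shared: cloning only bumps the refcount
    cache: KvCache,        // duplicated per clone
}

fn main() {
    // Load once...
    let model = Model {
        weights: Arc::new(Weights(vec![0.0; 1_000_000])),
        cache: KvCache(Vec::new()),
    };

    // ...then hand a cheap clone to each worker thread.
    let handles: Vec<_> = (0..4)
        .map(|i| {
            let mut m = model.clone(); // weights shared, cache fresh
            thread::spawn(move || {
                m.cache.0.push(i); // each thread mutates only its own cache
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }

    // After all clones are dropped, a single copy of the weights remains.
    assert_eq!(Arc::strong_count(&model.weights), 1);
    println!("weights refcount: {}", Arc::strong_count(&model.weights));
}
```

The same idea should apply to your 5s transcription loop: keep one loaded model alive for the lifetime of the process and clone per run, rather than loading/unloading the weights each cycle.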
And yes, your errors (segfault 11 / killed 9 / bus error 10) certainly look like the OS running out of memory and killing the process.