How do I serve multiple LLMs? I tried using the --port option with the latest MAX version, but the second serve command fails with an error saying the port is already in use, even though a different port is specified.
Could you elaborate on what you mean by multiple LLMs, and what your use case is? Are you trying to serve multiple LLMs on a single GPU or across multiple GPUs? If you have enough resources, what errors do you get? Please include as much info as possible.
Thanks. I was able to run "max serve --port xxxx" for multiple LLMs, using one port for Gemma and another for DeepSeek. If we set --device-memory-utilization=0.4 or 0.5, depending on available memory, multiple serves work. If we do not set it, the first serve uses the default of 0.9, i.e. it claims 90% of available GPU memory, leaving too little for a second server.
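For reference, a sketch of what the two commands could look like. Only --port and --device-memory-utilization are confirmed in this thread; the ports, memory fractions, model identifiers, and the --model-path flag are placeholders taken as assumptions, so check the MAX docs for the exact flags on your version:

```sh
# First server: a Gemma model on port 8000, capped at 40% of GPU memory
# (model name and --model-path flag are assumptions, not confirmed above)
max serve --model-path=google/gemma-3-1b-it \
  --port 8000 \
  --device-memory-utilization=0.4

# Second server: a DeepSeek model on a different port, also capped at 40%
max serve --model-path=deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --port 8001 \
  --device-memory-utilization=0.4
```

With both servers running, each model should be reachable on its own port. The key point is that the two memory fractions must together leave headroom on the GPU; with the 0.9 default, the first server's allocation starves the second.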