Model loading and executing - FlashAttention Issue

Do you have any pointers to resources that can help with installing and running models after installing MAX? After installing and loading the microsoft/Phi-3.5-vision-instruct model, I persistently get a FlashAttention error when trying to execute it: RuntimeError: FlashAttention only support fp16 and bf16 data type
Stack (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129, in _main
return self._bootstrap(parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/bsam/phi35_vision/.venv/lib/python3.10/site-packages/max/profiler/tracing.py", line 85, in wrapper
return func(*args, **kwargs)
File "/home/bsam/phi35_vision/.venv/lib/python3.10/site-packages/max/serve/pipelines/model_worker.py", line 204, in __call__
logger.exception(
[2025-06-28 12:37:06] ERROR queues.py:143: Model worker process is not healthy
Task completed with error. Stopping
Traceback (most recent call last):
File "/home/bsam/phi35_vision/.venv/lib/python3.10/site-packages/max/serve/queue/zmq_queue.py", line 261, in _pull_from_socket
msg = self.pull_socket.recv(**kwargs)
File "zmq/backend/cython/_zmq.py", line 1203, in zmq.backend.cython._zmq.Socket.recv
File "zmq/backend/cython/_zmq.py", line 1238, in zmq.backend.cython._zmq.Socket.recv
File "zmq/backend/cython/_zmq.py", line 1398, in zmq.backend.cython._zmq._recv_copy
File "zmq/backend/cython/_zmq.py", line 1393, in zmq.backend.cython._zmq._recv_copy
File "zmq/backend/cython/_zmq.py", line 183, in zmq.backend.cython._zmq._check_rc
zmq.error.Again: Resource temporarily unavailable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/bsam/phi35_vision/.venv/lib/python3.10/site-packages/max/serve/scheduler/queues.py", line 124, in response_worker
responses_list = self.response_pull_socket.get_nowait()
File "/home/bsam/phi35_vision/.venv/lib/python3.10/site-packages/max/profiler/tracing.py", line 85, in wrapper
return func(*args, **kwargs)
File "/home/bsam/phi35_vision/.venv/lib/python3.10/site-packages/max/serve/queue/zmq_queue.py", line 287, in get_nowait
return self.get(flags=zmq.NOBLOCK, **kwargs)
File "/home/bsam/phi35_vision/.venv/lib/python3.10/site-packages/max/serve/queue/zmq_queue.py", line 283, in get
return self._pull_from_socket(**kwargs)
File "/home/bsam/phi35_vision/.venv/lib/python3.10/site-packages/max/serve/queue/zmq_queue.py", line 264, in _pull_from_socket
raise queue.Empty()
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/bsam/phi35_vision/.venv/lib/python3.10/site-packages/max/serve/scheduler/queues.py", line 145, in response_worker
raise Exception("Worker failed!")
Exception: Worker failed!
[2025-06-28 12:37:06] INFO server.py:264: Shutting down
[2025-06-28 12:37:06] INFO server.py:299: Waiting for connections to close. (CTRL+C to force quit)

For reference, I did the following:
(1) This is the command I used to launch the model server:
unset MAX_SERVE_PORT
unset MAX_SERVE_USE_HEARTBEAT
max serve --model-path=microsoft/Phi-3.5-vision-instruct --trust-remote-code --torch-dtype=bfloat16 --disable-telemetry --port=9999

(2) This is the output when the server is launched:
[2025-06-28 12:35:13] WARNING hf_pipeline.py:89: eos_token_id provided in huggingface config (2), does not match provided eos_token_id (32000), using provided eos_token_id
12:35:13.355 WARNING: 7971 MainThread: max.pipelines: eos_token_id provided in huggingface config (2), does not match provided eos_token_id (32000), using provided eos_token_id
[2025-06-28 12:35:13] INFO api_server.py:153:
********** Server ready on http://0.0.0.0:9999 (Press CTRL+C to quit) **********
[2025-06-28 12:35:13] ERROR metrics.py:195: instrument maxserve.pipeline_load is not one of the supported sdk types
[2025-06-28 12:35:13] INFO on.py:62: Application startup complete.
[2025-06-28 12:35:13] INFO server.py:216: Uvicorn running on http://0.0.0.0:9999 (Press CTRL+C to quit)

(3) I tested the model with both text and image inputs, and both hit the FlashAttention issue due to the data-type mismatch. Below are the two test requests:

curl -X POST http://localhost:9999/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/Phi-3.5-vision-instruct",
    "messages": [{"role": "user", "content": [
      {"type": "text", "text": "What do you see in this image?"},
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,YOUR_BASE64_IMAGE"}}
    ]}],
    "max_tokens": 100
  }'

curl -X POST http://localhost:9999/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/Phi-3.5-vision-instruct",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 50
  }'

Both of the above give the following error and trace logs:

ERROR queues.py:143: Model worker process is not healthy
Task completed with error. Stopping
Traceback (most recent call last):
File "/home/bsam/phi35_vision/.venv/lib/python3.10/site-packages/max/serve/queue/zmq_queue.py", line 261, in _pull_from_socket
msg = self.pull_socket.recv(**kwargs)
File "zmq/backend/cython/_zmq.py", line 1203, in zmq.backend.cython._zmq.Socket.recv
File "zmq/backend/cython/_zmq.py", line 1238, in zmq.backend.cython._zmq.Socket.recv
File "zmq/backend/cython/_zmq.py", line 1398, in zmq.backend.cython._zmq._recv_copy
File "zmq/backend/cython/_zmq.py", line 1393, in zmq.backend.cython._zmq._recv_copy
File "zmq/backend/cython/_zmq.py", line 183, in zmq.backend.cython._zmq._check_rc
zmq.error.Again: Resource temporarily unavailable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/bsam/phi35_vision/.venv/lib/python3.10/site-packages/max/serve/scheduler/queues.py", line 124, in response_worker
responses_list = self.response_pull_socket.get_nowait()
File "/home/bsam/phi35_vision/.venv/lib/python3.10/site-packages/max/profiler/tracing.py", line 85, in wrapper
return func(*args, **kwargs)
File "/home/bsam/phi35_vision/.venv/lib/python3.10/site-packages/max/serve/queue/zmq_queue.py", line 287, in get_nowait
return self.get(flags=zmq.NOBLOCK, **kwargs)
File "/home/bsam/phi35_vision/.venv/lib/python3.10/site-packages/max/serve/queue/zmq_queue.py", line 283, in get
return self._pull_from_socket(**kwargs)
File "/home/bsam/phi35_vision/.venv/lib/python3.10/site-packages/max/serve/queue/zmq_queue.py", line 264, in _pull_from_socket
raise queue.Empty()
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/bsam/phi35_vision/.venv/lib/python3.10/site-packages/max/serve/scheduler/queues.py", line 145, in response_worker
raise Exception("Worker failed!")
Exception: Worker failed!
[2025-06-28 12:37:06] INFO server.py:264: Shutting down
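
As an aside, the base64 image payload in the first curl request can also be built programmatically. Here's a minimal Python sketch; the helper name and the stand-in JPEG bytes are illustrative, not part of the MAX API:

```python
import base64
import json

def build_vision_request(image_bytes: bytes, prompt: str, model: str) -> str:
    """Build an OpenAI-compatible chat payload with an inline base64 image.

    The image is embedded as a data: URL, as in the curl example above.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        "max_tokens": 100,
    }
    return json.dumps(payload)

# Stand-in bytes; in practice, read a real JPEG file here.
body = build_vision_request(b"\xff\xd8\xff",
                            "What do you see in this image?",
                            "microsoft/Phi-3.5-vision-instruct")
```

The resulting string can be POSTed to /v1/chat/completions with any HTTP client instead of hand-editing the curl command.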

I'm not sure what causes this issue, but I see it mentioned in other cases. See this discussion, for example: microsoft/Phi-3-small-8k-instruct · RuntimeError: FlashAttention only support fp16 and bf16 data type during fine tuning.

We don't currently support the Phi3VForCausalLM architecture used by this specific model in MAX, so MAX falls back to the PyTorch implementation, which appears to hit the error Josh links above.

We do support Phi3ForCausalLM, however, and models based on that architecture should not show this error. If you'd like to help us bring up Phi3VForCausalLM in MAX based on the already-existing architecture, that would be greatly appreciated.
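
In the meantime, a text-only Phi-3 checkpoint that uses Phi3ForCausalLM should serve without the PyTorch fallback. For example (the specific model name is just one Phi3ForCausalLM-based option; substitute any other):

```shell
# microsoft/Phi-3-mini-4k-instruct uses the Phi3ForCausalLM architecture,
# which MAX supports natively, so the FlashAttention dtype error should not occur.
max serve --model-path=microsoft/Phi-3-mini-4k-instruct --port=9999
```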