Gemma-3-1b-it having trouble with long prompts, unlike vLLM

I triggered the bug by replacing the vLLM Docker image with the MAX Docker image in my app, which works correctly with vLLM. It was supposed to be a drop-in replacement, but it generated “The. The. The. The. The…” instead of a coherent answer. I think I managed to reproduce it without using the serving API. What triggers the bug is a very long prompt.

docker run --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface -e HUGGING_FACE_HUB_TOKEN -it --entrypoint=python modular/max-nvidia-base:nightly -m max.entrypoints.pipelines generate --model-path=google/gemma-3-1b-it --prompt "ADD A VERY LONG PROMPT HERE"

Let me know if there is a way to fix the bug on my side or not.

What GPU and what prompt length are you trying? Is this with the latest nightly? We can see if we can reproduce and isolate internally.

In the meantime, can you try different values of --max-length to see if this occurs above or below a fixed size for the context window?
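One way to run that comparison is to generate the same command with several --max-length values and paste them into a terminal one at a time. The sketch below only builds and prints the commands. It reuses the docker invocation from the report; the `--max-length=N` spelling and the specific lengths are assumptions, and `build_command` / `sweep_commands` are hypothetical helper names.

```python
import os
import shlex


def build_command(prompt: str, max_length: int) -> list[str]:
    """Return the docker argv from the report with a --max-length value added."""
    # Expand ~ here because shlex.join() quotes it, which would stop the shell
    # from expanding it later.
    cache = os.path.expanduser("~/.cache/huggingface")
    return [
        "docker", "run", "--gpus", "all",
        "-v", f"{cache}:/root/.cache/huggingface",
        "-e", "HUGGING_FACE_HUB_TOKEN",
        "-it", "--entrypoint=python",
        "modular/max-nvidia-base:nightly",
        "-m", "max.entrypoints.pipelines", "generate",
        "--model-path=google/gemma-3-1b-it",
        f"--max-length={max_length}",  # assumed flag syntax
        "--prompt", prompt,
    ]


def sweep_commands(prompt: str, lengths=(512, 1024, 2048, 4096)) -> list[str]:
    """Return one shell-quoted command line per candidate max length."""
    return [shlex.join(build_command(prompt, n)) for n in lengths]


if __name__ == "__main__":
    for line in sweep_commands("ADD A VERY LONG PROMPT HERE"):
        print(line)
```

If the output degrades only above a certain value, that points at a context-length limit; if it degrades at every value, the cause is probably elsewhere.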

The MAX version is 25.4.0.dev2025061205.

The --max-length flag seems unnecessary, as I can trigger the bug without it. If my prompt is “Hello, it’s a good day today.” repeated ~20 times, I get the bug with these numbers:

Prompt size: 711
Output size: 99
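For anyone who wants to rebuild a prompt of this shape, here is a minimal sketch. It assumes the repetitions are joined by single spaces and that the apostrophe is a plain ASCII one, neither of which is stated in the report; `make_repeated_prompt` is a hypothetical helper name.

```python
import shlex


def make_repeated_prompt(sentence: str, repeats: int) -> str:
    """Join `repeats` copies of `sentence` with single spaces."""
    return " ".join([sentence] * repeats)


prompt = make_repeated_prompt("Hello, it's a good day today.", 20)

if __name__ == "__main__":
    # shlex.quote() escapes the apostrophe so the text can be pasted after --prompt.
    print(shlex.quote(prompt))
```

The printed string can be passed as the --prompt argument in place of the placeholder text.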

I’m pretty sure the context window is at least 711 tokens; otherwise it would be very small. But I’m not an expert in LLMs, so maybe I’m wrong.

The GPU is an RTX 2000 Ada Generation (laptop).

Can I trouble you to try this again with the latest nightly? We had at least one bug that was fixed in the last couple of days that may have impacted output quality in Gemma 3 models, and I want to see if that may have addressed this as well.

I retried. I’ll give you a sample to reproduce:

docker run --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface -e HUGGING_FACE_HUB_TOKEN -it --entrypoint=python modular/max-nvidia-base:nightly -m max.entrypoints.pipelines generate --model-path=google/gemma-3-1b-it --prompt "When in the Course of human events, it becomes necessary for one people to dissolve the political bands which have connected them with another, and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the separation. We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness.—That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed,—That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to alter or to abolish it, and to institute new Government, laying its foundation on such principles and organizing its powers in such form, as to them shall seem most likely to effect their Safety and Happiness. Prudence, indeed, will dictate that Governments long established should not be changed for light and transient causes; and accordingly all experience hath shewn, that mankind are more disposed to suffer, while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, pursuing invariably the same Object evinces a design to reduce them under absolute Despotism, it is their right, it is their duty, to throw off such Government, and to provide new Guards for their future security. Nor have We been wanting in attentions to our British brethren. 
We have warned them from time to time of attempts by their legislature to extend an unwarrantable jurisdiction over us. We have reminded them of the circumstances of our emigration and settlement here. We have appealed to their native justice and magnanimity, and we have conjured them by the ties of our common kindred to disavow these usurpations, which, would inevitably interrupt our connections and correspondence. They too have been deaf to the voice of justice and of consanguinity. Such has been the patient sufferance of these Colonies; and such is now the necessity which constrains them to alter their former Systems of Government. The history of the present King of Great Britain is a history of repeated injuries and usurpations, all having in direct object the establishment of an absolute Tyranny over these States. To prove this, let Facts be submitted to a candid world. He is at this time transporting large Armies of foreign Mercenaries to compleat the works of death, desolation, and tyranny, already begun with circumstances of Cruelty & Perfidy scarcely paralleled in the most barbarous ages, and totally unworthy the Head of a civilized nation. One of the first readings of the Declaration by the British is believed to have taken place at the Rose and Crown Tavern on Staten Island, New York in the presence of General Howe. The Spanish-American authorities banned the circulation of the Declaration, but it was widely transmitted and translated: by Venezuelan Manuel García de Sena, by Colombian Miguel de Pombo, by Ecuadorian Vicente Rocafuerte, and by New Englanders Richard Cleveland and William Shaler, who distributed the Declaration and the United States Constitution among Creoles in Chile and Indians in Mexico in 1821. The North Ministry did not give an official answer to the Declaration, but instead secretly commissioned pamphleteer John Lind to publish a response entitled Answer to the Declaration of the American Congress."

I get:

...
but instead secretly commissioned pamphleteer John Lind to publish a response entitled Answer to the Declaration 
of the American Congress. The, the 1, the 1, the 1, the 1, the 1, and the 1, the 1, the 1, the 1, the1, 1, and the 1, the1, and1, the1, and the1, has1, the1, the1, the1, and the1, and1, and1 and the1, the1 and1 and the1 and the1 and

Prompt size: 710
Output size: 99
Startup time: 69441.4381980896 ms
Time to first token: 958.5299491882324 ms
Prompt eval throughput (context-encoding): 740.7175963581425 tokens per second
Time per Output Token: -10.748965399605888 ms
Eval throughput (token-generation): -93.03220941028103 tokens per second
Total Latency: -94.86865997314453 ms
Total Throughput: -10.54088884867859 req/s

The version:

$ docker run --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface -e HUGGING_FACE_HUB_TOKEN -it --entrypoint=python modular/max-nvidia-base:nightly -m max.entrypoints.pipelines --version
{"levelname": "INFO", "process": 1, "threadName": "MainThread", "name": "root", "message": "Logging initialized: Console: INFO, File: None, Telemetry: None", "taskName": null, "timestamp": "2025-06-18T08:47:57.940674+00:00"}
{"levelname": "INFO", "process": 1, "threadName": "MainThread", "name": "root", "message": "Metrics initialized.", "taskName": null, "timestamp": "2025-06-18T08:47:57.941255+00:00"}
MAX 25.5.0.dev2025061805

Also, several of the timings are negative numbers (time per output token, eval throughput, total latency, total throughput). Isn’t that strange?
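To check this across several runs without reading each line by hand, a small script can parse the statistics and list the negative ones. This is only a sketch: it assumes the output keeps the `name: value unit` layout shown above, and `parse_stats` / `negative_stats` are hypothetical helper names.

```python
import re

# Matches lines such as "Total Latency: -94.86 ms" (layout assumed from the run above).
STAT_LINE = re.compile(r"^(?P<name>[^:]+):\s*(?P<value>-?\d+(?:\.\d+)?)\b")


def parse_stats(text: str) -> dict[str, float]:
    """Map each statistic name to its numeric value, skipping other lines."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            stats[match.group("name").strip()] = float(match.group("value"))
    return stats


def negative_stats(stats: dict[str, float]) -> list[str]:
    """Return the names of statistics whose value is below zero."""
    return [name for name, value in stats.items() if value < 0]


# Statistics copied from the run reported above.
SAMPLE = """\
Prompt size: 710
Output size: 99
Startup time: 69441.4381980896 ms
Time to first token: 958.5299491882324 ms
Prompt eval throughput (context-encoding): 740.7175963581425 tokens per second
Time per Output Token: -10.748965399605888 ms
Eval throughput (token-generation): -93.03220941028103 tokens per second
Total Latency: -94.86865997314453 ms
Total Throughput: -10.54088884867859 req/s
"""

if __name__ == "__main__":
    for name in negative_stats(parse_stats(SAMPLE)):
        print(name)
```

On the sample above it lists the four generation-side metrics; the startup time, time to first token, and prompt eval throughput are positive.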