Memory Requirement Estimation to run N concurrent requests of a specific model

I have a consumer laptop, and got the log below when running `max serve`, followed by my device's hardware specs.

  1. The only component I can upgrade is RAM. How can I calculate the minimum additional RAM required to run a specific model smoothly on my device?
  2. If I want to purchase (or custom-build) a device that can handle 10 concurrent requests for a specific model, how should I estimate its requirements? (I would like to build a home AI server for my family.)
```
21:42:32.839 INFO: 6901 MainThread: max.pipelines: No GPUs available, falling back to CPU
21:42:32.839 INFO: 6901 MainThread: max.pipelines: No GPUs available, falling back to CPU  
21:42:40.095 WARNING: 6933 MainThread: max.pipelines: Insufficient cache memory to support a batch containing one request at the max sequence length of 131072 tokens. Need to allocate at least 1024 pages (32.00 GiB), but only have enough memory for 280 pages (8.75 GiB). 
21:42:41.442 INFO: 6933 MainThread: max.pipelines: Paged KVCache Manager allocated 280 device pages using 32.00 MiB per page.  

Estimated memory consumption:     
Weights:                4.58 GiB     
KVCache allocation:     8.75 GiB     
Total estimated:        13.33 GiB used / 14.81 GiB free 
Auto-inferred max sequence length: 131072 
Auto-inferred max batch size: 1  

21:42:35.823 INFO: 6901 MainThread: max.entrypoints: Starting server using modularai/Llama-3.1-8B-Instruct-GGUF
21:42:35.823 INFO: 6901 MainThread: max.pipelines:  Loading TextTokenizer and TextGenerationPipeline(Llama3Model) factory for:         
architecture:           LlamaForCausalLM         
devices:                cpu[0]         
model_path:             modularai/Llama-3.1-8B-Instruct-GGUF
huggingface_revision:   main
quantization_encoding:  SupportedEncoding.q4_k         
cache_strategy:         KVCacheStrategy.PAGED         
weight_path:            [llama-3.1-8b-instruct-q4_k_m.gguf]
```

Below are my laptop specs:

```
[abuka@archlinux ~]$ sudo lshw -C display
  *-display                 
       description: VGA compatible controller
       product: Cezanne [Radeon Vega Series / Radeon Vega Mobile Series]
       vendor: Advanced Micro Devices, Inc. [AMD/ATI]
       physical id: 0
       bus info: pci@0000:03:00.0
       logical name: /dev/fb0
       version: c5
       width: 64 bits
       clock: 33MHz
       capabilities: pm pciexpress msi msix vga_controller bus_master cap_list fb
       configuration: depth=32 driver=amdgpu latency=0 resolution=1920,1080
       resources: irq:49 memory:d0000000-dfffffff memory:e0000000-e01fffff ioport:e000(size=256) memory:fcd00000-fcd7ffff
[abuka@archlinux ~]$ lscpu
Architecture:                x86_64
  CPU op-mode(s):            32-bit, 64-bit
  Address sizes:             48 bits physical, 48 bits virtual
  Byte Order:                Little Endian
CPU(s):                      16
  On-line CPU(s) list:       0-15
Vendor ID:                   AuthenticAMD
  Model name:                AMD Ryzen 7 5800HS with Radeon Graphics
    CPU family:              25
    Model:                   80
    Thread(s) per core:      2
    Core(s) per socket:      8
    Socket(s):               1
    Stepping:                0
    Frequency boost:         enabled
    CPU(s) scaling MHz:      53%
    CPU max MHz:             4465.2612
    CPU min MHz:             403.4880
    BogoMIPS:                6388.15
    Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid ext
                             d_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs sk
                             init wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflu
                             shopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_c
                             lean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
Virtualization features:     
  Virtualization:            AMD-V
Caches (sum of all):         
  L1d:                       256 KiB (8 instances)
  L1i:                       256 KiB (8 instances)
  L2:                        4 MiB (8 instances)
  L3:                        16 MiB (1 instance)
NUMA:                        
  NUMA node(s):              1
  NUMA node0 CPU(s):         0-15
Vulnerabilities:             
  Gather data sampling:      Not affected
  Ghostwrite:                Not affected
  Indirect target selection: Not affected
  Itlb multihit:             Not affected
  L1tf:                      Not affected
  Mds:                       Not affected
  Meltdown:                  Not affected
  Mmio stale data:           Not affected
  Reg file data sampling:    Not affected
  Retbleed:                  Not affected
  Spec rstack overflow:      Mitigation; Safe RET
  Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:                Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                     Not affected
  Tsa:                       Mitigation; Clear CPU buffers
  Tsx async abort:           Not affected
[abuka@archlinux ~]$ 
[abuka@archlinux ~]$ free -h
               total        used        free      shared  buff/cache   available
Mem:            15Gi        11Gi       561Mi       108Mi       2.9Gi       3.1Gi
Swap:          4.0Gi       3.7Gi       345Mi
[abuka@archlinux ~]$ 
```

I'm not sure I fully understand your question, but: MAX Serve can handle multiple requests through batching.

Inevitably, a higher batch size will require more memory to contain the activations.

A start would be looking at this message in the output that was posted:

```
21:42:40.095 WARNING: 6933 MainThread: max.pipelines: Insufficient cache memory to support a batch containing one request at the max sequence length of 131072 tokens. Need to allocate at least 1024 pages (32.00 GiB), but only have enough memory for 280 pages (8.75 GiB).
```

The device does not have enough memory to run a single request at the full sequence length as-is. To run this model with its intended capabilities on a single request at a time, per this message, you need at least 32 GiB for the KV cache alone, plus roughly 4.6 GiB for the weights. If you are OK with shorter maximum sequence lengths, you can get by with less (as MAX implicitly did in this case).
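As a rough sketch of the arithmetic, using only numbers from the log above (32.00 MiB per page, 1024 pages covering 131072 tokens → 128 tokens per page, 4.58 GiB of weights). This is an estimate, not an official MAX sizing formula: real usage adds activations, runtime overhead, and whatever the OS needs, so treat the result as a floor.

```python
# Rough RAM floor for N concurrent requests, from the numbers in the log.
PAGE_BYTES = 32 * 1024**2            # 32.00 MiB per KV cache page (from the log)
TOKENS_PER_PAGE = 131072 // 1024     # 1024 pages cover 131072 tokens -> 128
WEIGHTS_BYTES = int(4.58 * 1024**3)  # q4_k weights (from the log)

def kv_cache_bytes(seq_len: int, concurrent_requests: int = 1) -> int:
    """KV cache memory for N concurrent requests, each up to seq_len tokens."""
    pages_per_request = -(-seq_len // TOKENS_PER_PAGE)  # ceiling division
    return pages_per_request * PAGE_BYTES * concurrent_requests

def total_estimate_gib(seq_len: int, concurrent_requests: int = 1) -> float:
    """Weights + KV cache, in GiB. Excludes activations and OS overhead."""
    return (WEIGHTS_BYTES + kv_cache_bytes(seq_len, concurrent_requests)) / 1024**3

# One request at the full 131072-token context needs ~32 GiB of cache alone,
# matching the warning in the log.
print(round(kv_cache_bytes(131072) / 1024**3, 2))   # 32.0
# Ten concurrent requests at full length: ~320 GiB cache + ~4.6 GiB weights.
print(round(total_estimate_gib(131072, 10), 1))     # 324.6
```

For the home-server question, this shows why sizing for 10 concurrent requests at the *full* context is impractical; capping the per-request sequence length (e.g. 4K-8K tokens) brings the requirement down proportionally.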

More RAM is unlikely to improve the smoothness of an individual response in this case. Performance improvements from batching tend to be more evident on GPU than on CPU, and the model is currently running on CPU (the question excludes GPU from consideration).

MAX also uses paged attention, which means that if all sequences are short, it can still serve a high batch size from the same memory pool. However, it may need to reduce the batch size for longer sequences.
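To illustrate with the pool MAX actually allocated in this log (280 pages, 8.75 GiB), here is a back-of-the-envelope upper bound on how many sequences of a given length the page pool can hold at once. The tokens-per-page figure is inferred from the warning (1024 pages for 131072 tokens); this ignores scheduling details, so it is a ceiling, not a throughput promise.

```python
# Upper bound on concurrent sequences in the existing paged KV cache pool.
POOL_PAGES = 280        # pages MAX allocated with the RAM available (from the log)
TOKENS_PER_PAGE = 128   # 131072 tokens / 1024 pages (from the warning)

def max_concurrent(seq_len: int) -> int:
    """How many sequences of seq_len tokens fit in the page pool at once."""
    pages_per_seq = -(-seq_len // TOKENS_PER_PAGE)  # ceiling division
    return POOL_PAGES // pages_per_seq

print(max_concurrent(512))     # 70 -- many short chat turns fit
print(max_concurrent(4096))    # 8  -- fewer medium-length sequences
print(max_concurrent(131072))  # 0  -- not even one full-length request
```

This is why the same pool supports a high batch of short requests but cannot hold a single full-context one.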

In short: with the amount of RAM currently available on your device, MAX cannot run the model at its full sequence length; for that you need at least 32 GiB for the KV cache on top of the weights. More RAM beyond that may allow batching longer sequences together, which may lead to a minor performance improvement, but additional RAM is not likely to substantially increase smoothness. If this were on a GPU, the performance difference would be more substantial.
