
What GPU Power is Needed to Host an AI/LLM Model Locally?

Hosting a large language model (LLM) locally depends primarily on the capabilities of the graphics card (GPU). Here are the key factors to consider when choosing one:

Key Factors Influencing the Choice

  • VRAM: The larger the model, the more VRAM it requires.
  • GPU Architecture: Recent architectures (Ampere, Ada Lovelace, Hopper, Blackwell) offer better performance.
  • Task Type:
    • Inference: Running an existing model; consumes fewer resources.
    • Training: Requires significantly more VRAM and computational power.
  • Numerical Precision: FP32 (precise but memory-heavy), FP16 and INT8 (lighter, optimized); see the estimation sketch after this list.
  • Optimization Techniques: Quantization, pruning, distillation.
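A rough rule of thumb for inference: VRAM needed is the parameter count times the bytes per parameter at the chosen precision, plus overhead for activations and the KV cache. The sketch below illustrates this heuristic; the 20% overhead figure is an assumption for illustration, not a measured value.

```python
# Rough inference-VRAM heuristic: parameters x bytes per parameter,
# plus overhead for activations and the KV cache (the 20% is an assumption).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_billions: float, precision: str = "fp16",
                     overhead: float = 0.20) -> float:
    weights_gb = params_billions * BYTES_PER_PARAM[precision]
    return weights_gb * (1.0 + overhead)

print(f"{estimate_vram_gb(7, 'fp16'):.1f} GB")   # ~16.8 GB: tight on a 16 GB card
print(f"{estimate_vram_gb(7, 'int4'):.1f} GB")   # ~4.2 GB: fits on an 8 GB card
print(f"{estimate_vram_gb(70, 'fp16'):.1f} GB")  # ~168 GB: needs multi-GPU or an H200
```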

NVIDIA Graphics Cards and Compatible Model Sizes

| Graphics Card      | VRAM           | Estimated Model Size | Model Examples           |
|--------------------|----------------|----------------------|--------------------------|
| RTX 4060 Ti        | 8/16 GB        | 7B to 13B            | LLaMA 2 7B, Mistral 7B   |
| RTX 5070 / 5070 Ti | 12/16 GB       | 13B to 20B           | LLaMA 2 13B              |
| RTX 5080           | 16 GB          | 20B to 34B           | LLaMA 2 34B              |
| RTX 5090           | 32 GB          | 34B to 70B           | LLaMA 2 70B, Falcon 40B  |
| RTX 6000 Ada       | 48 GB          | Up to 180B           | Fine-tuning large models |
| H100 / H200        | 80 GB / 141 GB | 175B+                | Running the largest models |
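To see which row of this table your machine matches, you can query the installed GPU's VRAM. A minimal sketch using PyTorch (assuming torch is installed with CUDA support):

```python
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        # total_memory is reported in bytes; convert to GiB for comparison
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA-capable GPU detected.")
```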

Open-Source Model Examples

  • Gemma 3: Versions 1B, 4B, 12B, 27B
  • QwQ: Advanced reasoning model, 32B version
  • DeepSeek-R1: Versions 1.5B, 7B, 8B, 14B, 32B, 70B, 671B
  • LLaMA 3.3: 70B version
  • Phi-4: Microsoft's 14B model
  • Mistral: 7B version
  • Qwen 2.5: Versions 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B
  • Qwen 2.5 Coder: Versions 0.5B, 1.5B, 3B, 7B, 14B, 32B
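A common way to run one of these models locally is through Hugging Face transformers. Below is a minimal FP16 inference sketch; the model ID and prompt are illustrative assumptions, `device_map="auto"` requires the accelerate package, and some models are gated behind a license acceptance.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model ID: any of the 7B models listed above works similarly.
model_id = "mistralai/Mistral-7B-Instruct-v0.3"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # FP16 halves memory versus FP32
    device_map="auto",          # requires the accelerate package
)

prompt = "Explain VRAM in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```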

Conclusion

The choice of a graphics card for local LLM hosting comes down to the VRAM the model needs and the optimizations you can apply.

  • Light Models (7B to 13B): RTX 4060 Ti (16GB)
  • Intermediate Models (20B+): RTX 5080 or 5090
  • Large Models (70B+): RTX 6000 Ada or H200

Optimizations like quantization allow running larger models on more modest GPUs: by the heuristic above, a 4-bit quantized 13B model fits in roughly 8 GB of VRAM instead of the ~26 GB its weights alone would need in FP16.
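As an illustration, here is a hedged sketch of 4-bit loading via bitsandbytes through transformers. The model ID is an example (Llama 2 weights are gated and require accepting the license on Hugging Face), and the bitsandbytes and accelerate packages must be installed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization: weights stored in 4 bits, computation done in FP16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Example model ID (gated: requires license acceptance on Hugging Face).
model_id = "meta-llama/Llama-2-13b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # requires the accelerate package
)
# A 13B model needing ~26 GB in FP16 now fits in roughly 8 GB of VRAM.
```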