LLMs for On-Premises Deployment
Choosing the LLM
To identify the current state-of-the-art (SOTA) LLMs, a good starting point is to survey the available public LLM benchmarks.
There are numerous public LLM benchmarking efforts, two of which are especially influential. The Open LLM Leaderboard, hosted by Hugging Face, attracts roughly 40 million monthly visits and combines evaluation datasets covering a diverse range of challenges. LMSYS Chatbot Arena (also known as “LMArena”) ranks models by crowdsourced human preference, engages millions of monthly “players,” and runs pre-release tests for top-tier AI labs such as OpenAI, xAI, Meta, and Google.
These two benchmarks differ in their evaluation methods and the types of models tested. The Open LLM Leaderboard primarily focuses on models with open-source code and weights, whereas Chatbot Arena evaluates both proprietary and open-source models.
Below, we analyze the rankings from both initiatives.
Open LLM Leaderboard
The Open LLM Leaderboard uses the following metrics, aggregated into a weighted average (a sketch of such an aggregation follows the list):
- IFEval: Instruction-Following Evaluation for Large Language Models. This approach generates a large set of verifiable tasks, such as “write in more than 400 words” or “mention the keyword ‘AI’ at least three times.”
- Big Bench Hard (BBH): A set of diverse tasks with defined answers, including solving Boolean expressions, understanding dates, and causal judgment.
- Mathematics Aptitude Test of Heuristics (MATH), Level 5: A collection of high-school math competition problems with exact answers that the model must match perfectly.
- Graduate-Level Google-Proof Q&A (GPQA): A highly challenging dataset of 448 expert-written multiple-choice questions in biology, physics, and chemistry, designed to be extremely difficult even for domain experts. The dataset facilitates scalable oversight experiments to develop methods for human experts to supervise AI systems that surpass human capabilities in answering complex scientific questions.
- Multistep Soft Reasoning (MuSR): A dataset of 756 examples across three domains—murder mysteries, object placements, and team allocation—focused on hard reasoning tasks.
- Massive Multitask Language Understanding - Professional (MMLU-Pro): An advanced benchmark enhancing the original MMLU dataset with more challenging, reasoning-focused questions, offering ten answer choices per question to increase difficulty and reduce random guessing. It spans 14 diverse domains, demonstrates stability under prompt variations, and emphasizes complex reasoning tasks with improved performance using Chain of Thought reasoning.
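The tables below report this aggregate as “wAvg.” As a rough illustration of how such a score can be computed, the sketch below normalizes each benchmark score against a random-guessing baseline and then takes a weighted mean; the baselines, weights, and raw scores shown are placeholders, not the leaderboard’s exact procedure.

```python
# Illustrative normalize-then-average aggregation over benchmark scores.
# Baselines, weights, and raw scores are placeholders, not the leaderboard's real values.
def normalize(score, random_baseline):
    """Rescale a raw accuracy so random guessing maps to 0 and a perfect score to 1."""
    return max(0.0, (score - random_baseline) / (1.0 - random_baseline))

def weighted_average(scores, baselines, weights):
    total_weight = sum(weights.values())
    return sum(
        weights[name] * normalize(scores[name], baselines[name]) for name in scores
    ) / total_weight

scores = {"IFEval": 0.78, "BBH": 0.62, "MATH_Lvl5": 0.41}   # hypothetical raw accuracies
baselines = {"IFEval": 0.0, "BBH": 0.25, "MATH_Lvl5": 0.0}  # e.g. ~0.25 for 4-way multiple choice
weights = {"IFEval": 1.0, "BBH": 1.0, "MATH_Lvl5": 1.0}     # equal weights as a placeholder
print(round(weighted_average(scores, baselines, weights), 4))
```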
Since we are focusing on open-source LLMs suitable for on-premises deployment with fewer than 80 billion parameters, capable of running on 1x–2x A100 or 1x H200 GPUs, the following table synthesizes the key attributes of the top three models from official providers (as of February 2025):
Model | Parameters | Context Window | License | GPU VRAM* | wAvg** |
---|---|---|---|---|---|
Qwen/Qwen2.5-72B-Instruct | 72B | 128k tokens | Custom Qwen | 67 GB | 47.98 |
Qwen/Qwen2.5-32B-Instruct | 32B | 128k tokens | Custom Qwen | 29.8 GB | 46.60 |
meta-llama/Llama-3.3-70B-Instruct | 70B | 128k tokens | Llama3.3 | 64.7 GB | 44.85 |
*Estimated GPU memory for bfloat16 weights, excluding the KV-cache.
**Weighted average of Open LLM metrics.
The following table lists models from community providers:
Model | Parameters | Context Window | License | GPU VRAM* | wAvg** |
---|---|---|---|---|---|
MaziyarPanahi/calme-3.2-instruct-78b | 78B | 128k tokens | Custom Qwen | 71.5 GB | 52.08 |
MaziyarPanahi/calme-3.1-instruct-78b | 78B | 128k tokens | Custom Qwen | 71.5 GB | 51.29 |
dfurman/CalmeRys-78B-Orpo-v0.1 | 78B | 128k tokens | MIT | 71.5 GB | 51.23 |
*Estimated GPU memory for bfloat16 weights, excluding the KV-cache.
**Weighted average of Open LLM metrics.
A clear pattern emerges: most leading models are built on top of Qwen-2.5, with community-provided models achieving remarkable results across all benchmarks.
According to the technical report, Qwen-2.5 is a family of decoder-only Transformer models (the open-weight checkpoints are dense, while Mixture-of-Experts variants are offered via API) that incorporates enhancements such as Grouped Query Attention (GQA) and Rotary Positional Embeddings (RoPE) for efficient processing. Pre-training was scaled to 18 trillion tokens, and advanced post-training techniques, including supervised fine-tuning and multi-stage reinforcement learning, were applied to improve language understanding, reasoning, and long-context performance.
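For readers unfamiliar with RoPE, the snippet below is a minimal, illustrative implementation of rotary embeddings: pairs of channels are rotated by position-dependent angles so that attention scores depend on relative positions. It is a sketch of the general technique, not Qwen-2.5’s exact code.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Minimal RoPE sketch: rotate channel pairs of x (..., seq_len, dim) by
    position-dependent angles. dim must be even."""
    seq_len, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    # One frequency per channel pair, geometrically spaced as in the RoPE paper.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # 2D rotation applied to each (x1, x2) channel pair.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(2, 16, 64)  # (batch, seq_len, head_dim) example
q_rot = apply_rope(q)
```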
Chatbot Arena
Chatbot Arena adopts a different evaluation approach, utilizing crowdsourced, pairwise comparisons of model responses to user-generated prompts. Users vote on their preferred model response, and the platform employs statistical methods, such as the Bradley-Terry model, to rank models based on these preferences. Key metrics include win rates and confidence intervals for model rankings, ensuring robust and sample-efficient evaluations. The primary benefits of this approach are its real-world relevance (due to diverse, live user prompts), scalability, and transparency, providing an open platform for continuous, human-aligned evaluation of LLMs. The resulting rating is an Elo score, calculated based on all “games” played by the AIs.
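As an illustration of how pairwise votes turn into a rating, the following sketch applies Elo-style online updates, where the expected win probability has the same logistic form as the Bradley-Terry model. The Arena’s published rankings come from fitting Bradley-Terry over all recorded battles, so this is only a simplified stand-in; the K-factor and starting rating here are arbitrary.

```python
# Elo-style updates from pairwise votes; a simplified stand-in for the
# Bradley-Terry fit that Chatbot Arena applies over all recorded battles.
from collections import defaultdict

K = 4.0  # update step size (arbitrary here)

def expected_score(r_a: float, r_b: float) -> float:
    """P(model A beats model B) under the logistic (Bradley-Terry / Elo) model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def record_vote(ratings, model_a, model_b, winner):
    """Update ratings for one battle; winner is 'A', 'B', or 'tie'."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    s_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[winner]
    ratings[model_a] += K * (s_a - e_a)
    ratings[model_b] += K * ((1.0 - s_a) - (1.0 - e_a))

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same rating
record_vote(ratings, "model-x", "model-y", winner="A")
```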
The following table lists the top open-source models, excluding proprietary ones (the top model was xAI’s Grok 3):
Model | Parameters | Context Window | License | GPU VRAM* | Elo Score |
---|---|---|---|---|---|
DeepSeek R1 | 671B | 128k tokens | MIT | 1543 GB | 1361 |
DeepSeek V3 | 671B | 128k tokens | DeepSeek | 1543 GB | 1317 |
DeepSeek-V2.5-1210 | 237B | 164k tokens | DeepSeek | 640 GB | 1279 |
Athene-v2-Chat-72B | 72B | 128k tokens | NexusFlow | 67 GB | 1275 |
*Estimated GPU memory for bfloat16 weights, excluding the KV-cache.
In Chatbot Arena, smaller, top-rated models are scarce, with the leaderboard dominated by large DeepSeek models. However, Athene-v2, trained on the Qwen-2.5 base model using RLHF and designed to compete with GPT-4o, is a strong candidate for on-premises deployment due to its open weights.
Another alternative is the distilled DeepSeek R1 models, most of which are fine-tuned from Qwen-2.5 bases (one from Llama), as shown in the Open LLM Leaderboard results:
Model | Parameters | Context Window | License | GPU VRAM* | wAvg** |
---|---|---|---|---|---|
deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | 14B | 128k tokens | MIT | 15.64 GB | 38.22 |
deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 70B | 128k tokens | MIT | 77.7 GB | 27.81 |
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 32B | 128k tokens | MIT | 35.75 GB | 22.96 |
*Estimated GPU memory for bfloat16 weights, excluding the KV-cache.
**Weighted average of Open LLM metrics.
Surprisingly, the smallest of these, DeepSeek-R1-Distill-Qwen-14B (a Qwen-2.5 base fine-tuned on reasoning data generated by R1), achieves the best score among the distilled models, and its compact size also makes it the easiest to serve.
Licensing Considerations
License | Commercial Use | Attribution | Redistribution | Modifications |
---|---|---|---|---|
Qwen | ✓ (<100M MAU) | Required | Allowed | Allowed |
Llama3.3 | ✓ (<700M MAU) | Required | Allowed | Allowed |
DeepSeek | ✓ | Required | Allowed | Allowed |
NexusFlow | ✓ | Required | Allowed | Allowed |
MIT | ✓ | Required | Allowed | Allowed |
Meta’s Llama 3.3 custom license prohibits deployment on AWS or Azure marketplaces, while DeepSeek R1 and its distilled variants ship under the permissive MIT license, offering full flexibility.
Picking Candidates
Before assessing runtime environments, we select the following model families based on performance, size, and licensing for compatibility with modern deep learning frameworks and inference engines:
- MaziyarPanahi/calme-3.2-instruct-78b: The largest and most performant model according to the Open LLM Leaderboard, with a relaxed license.
- Qwen/Qwen2.5-32B-Instruct: Strong performance with a moderate parameter count.
- deepseek-ai/DeepSeek-R1-Distill-Qwen-14B: Excellent Open LLM score with the smallest parameter count and the most permissive MIT license.
- meta-llama/Llama-3.3-70B-Instruct: A strong alternative to Qwen-2.5-based models, with good performance and broad tooling support thanks to the Llama family’s wide adoption across the deep learning ecosystem.
Inference Engine
For running these models, we consider two primary options: vLLM and NVIDIA Triton Inference Server.
vLLM
vLLM is an optimized LLM runtime and inference server, widely regarded as the standard for cloud-based LLM deployment. It introduced Paged Attention, which applies a virtual memory model with paging to the KV-cache matrix, significantly increasing throughput by handling more simultaneous requests. Additional benefits include:
- Up to 24x higher throughput than naive Hugging Face Transformers serving.
- PagedAttention largely eliminates KV-cache fragmentation, which can waste 60–80% of cache memory in conventional allocators.
- High-throughput mode: processes roughly 2.4k tokens/sec on an A100 with max_num_batched_tokens=4096.
- Exposes an OpenAI-compatible Chat Completions API for easy integration with existing systems.
- Dynamically adjusts batch size based on real-time requirements.
- Allows immediate injection of new requests into ongoing batches.
- Maximizes GPU utilization by minimizing idle time.
- Reduces latency by processing requests as they arrive, rather than waiting for full batches.
- Adapts seamlessly to varying request sizes and arrival patterns.
The models selected (based on Qwen-2.5 or Llama) are supported by vLLM. However, for production use, a more mature technology stack may be beneficial for detailed monitoring, tooling, and additional interfaces.
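As a quick illustration, the snippet below queries a locally running vLLM server through its OpenAI-compatible endpoint; the launch command, port, and model name are assumptions for the sketch (older vLLM releases expose the same server via python -m vllm.entrypoints.openai.api_server).

```python
# Sketch: querying a local vLLM OpenAI-compatible server, assumed to have been
# started with e.g. `vllm serve Qwen/Qwen2.5-32B-Instruct --max-num-batched-tokens 4096`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct",
    messages=[{"role": "user", "content": "Summarize the benefits of PagedAttention."}],
    max_tokens=256,
    temperature=0.2,
)
print(response.choices[0].message.content)
```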
NVIDIA Triton Inference Server
Triton is a standard for deploying models in the ML engineering community, offering pluggable backends, fine-grained metrics, multi-GPU automatic memory management, and more.
GPU Memory Management
Triton employs the following strategies for optimal GPU utilization:
- Concurrent Model Execution: Parallelizes multiple models/instances through GPU hardware scheduling. A Llama 3.3 instance can coexist with Qwen-2.5 on the same A100 cluster without performance degradation.
- Instance Groups: Configures multiple parallel executions per model.
- Pipelines of Models: chained models can “live” on the same GPUs, avoiding network hops between pipeline stages.
- Dynamic Batching: Combines requests from multiple users into a single batch for processing.
Triton integrates easily with Kubernetes, supporting auto-scaling via Prometheus metrics.
Pipeline Optimization
Triton’s ensemble scheduler chains preprocessing (Python/C++), inference (TensorRT), and postprocessing into a single request. For example (model and tensor names are illustrative; input/output declarations are omitted for brevity):
# config.pbtxt of the ensemble model
name: "rag_ensemble"
platform: "ensemble"
ensemble_scheduling {
  step [
    {
      model_name: "text_helper"
      model_version: -1
      input_map { key: "text" value: "raw_input" }
      output_map { key: "embedding" value: "helper_embedding" }
    },
    {
      model_name: "llama3_retriever"
      model_version: -1
      input_map { key: "embedding" value: "helper_embedding" }
      output_map { key: "response" value: "final_output" }
    }
  ]
}
Hosting all models on the same virtual machine using the same GPU(s) significantly reduces latency compared to microservice architectures by eliminating slow network I/O.
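For completeness, a client can call the whole ensemble as a single model over Triton’s HTTP API; the sketch below assumes the ensemble name, tensor names, and string datatypes from the illustrative config above, with a server listening on localhost:8000.

```python
# Sketch: invoking the illustrative "rag_ensemble" via tritonclient's HTTP API.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# The ensemble's entry tensor is assumed to be a string named "raw_input".
text = np.array([["What is PagedAttention?"]], dtype=object)
infer_input = httpclient.InferInput("raw_input", text.shape, "BYTES")
infer_input.set_data_from_numpy(text)

result = client.infer(model_name="rag_ensemble", inputs=[infer_input])
print(result.as_numpy("final_output"))
```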
NVIDIA Triton Inference Server + vLLM Backend
Since Triton Server v23.10, a vLLM backend has been available, combining vLLM’s high-throughput techniques with Triton’s benefits. This is particularly convenient for models that do not require distributed inference and can fit on a few A100 GPUs mounted on the same machine. However, there are some drawbacks:
- Requires using the Triton interface instead of the convenient OpenAI Chat Completions API.
- Deployment requires additional configuration and image building.
Alternative: Ray + vLLM
Ray Serve is a Python-first, framework-agnostic alternative based on the actor model, enabling stateful and asynchronous computing in distributed systems. It extends Ray’s API from functions (tasks) to classes, creating stateful workers or services.
Pros:
- Allows specification of VM characteristics directly in Python code.
- Supports vLLM in multi-GPU/multi-tenant setups.
- Used by major LLM providers like OpenAI.
Cons:
- Requires running a Ray Cluster on top of Kubernetes or using Anyscale’s solution, increasing management overhead.
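To make this concrete, here is a minimal Ray Serve deployment that wraps vLLM; the model name, GPU count, and sampling settings are illustrative assumptions, and a production setup would typically stream tokens and use vLLM’s async engine instead of the blocking call shown here.

```python
# Minimal Ray Serve + vLLM sketch (assumes `ray[serve]` and `vllm` are installed;
# model, GPU count, and sampling settings are placeholders).
from ray import serve
from vllm import LLM, SamplingParams


@serve.deployment(ray_actor_options={"num_gpus": 1})
class LLMDeployment:
    def __init__(self):
        # vLLM loads the model once per replica and keeps it resident on the GPU.
        self.llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B")
        self.sampling = SamplingParams(temperature=0.7, max_tokens=512)

    async def __call__(self, request):
        payload = await request.json()
        # Blocking generate call for simplicity; real deployments would stream.
        outputs = self.llm.generate([payload["prompt"]], self.sampling)
        return {"text": outputs[0].outputs[0].text}


app = LLMDeployment.bind()
# Start locally with `serve.run(app)` or via `serve run module:app` on a Ray cluster.
```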
Conclusion
- For smaller companies, NVIDIA Triton paired with vLLM is recommended to run models like Qwen2.5-32B-Instruct or DeepSeek-R1-Distill-Qwen-14B on a single A100, scalable as a standard Kubernetes deployment without the overhead of managing an overlay cluster.
- For larger enterprises, a Ray Cluster with the vLLM runtime, either on in-house Kubernetes or via Anyscale, is ideal for serving larger models such as MaziyarPanahi/calme-3.2-instruct-78b (the current open-source leader on the Open LLM Leaderboard) or meta-llama/Llama-3.3-70B-Instruct.