Running a large language model (LLM) in production isn’t just about having a powerful server. It’s about building a system that can handle massive memory demands, deliver responses under 500 milliseconds, and scale smoothly when thousands of users hit it at once. If you’ve ever wondered why companies spend hundreds of thousands of dollars just to serve an AI model, the answer lies in the infrastructure - and it’s far more complex than most people realize.
Hardware: It’s All About GPU Memory
The heart of any LLM deployment is the GPU. Not just any GPU - you need one with enough VRAM to hold the entire model in memory. Models like Qwen3 235B require about 600 GB of VRAM to run at full capacity, according to Analytical Software’s December 2024 analysis. That’s not a typo. For context, a single NVIDIA H100 has 80 GB of VRAM. So you’d need at least eight of them just to load that one model. Smaller models like 7B-parameter versions might fit on one or two GPUs, but anything above 40GB of weights demands multi-GPU setups.
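As a back-of-envelope check, VRAM needs scale roughly with parameter count times bytes per parameter, plus headroom for the KV cache and activations. A minimal sketch (the 20% overhead figure is an assumption for illustration, not a number from this article):

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: float,
                     overhead: float = 0.2) -> float:
    """Rough VRAM estimate: weight size plus a fractional overhead
    for KV cache, activations, and runtime buffers."""
    weights_gb = params_billion * bytes_per_param  # 1B params at 1 byte ~= 1 GB
    return weights_gb * (1 + overhead)

# A 7B model in FP16 (2 bytes/param) lands around 17 GB with 20% overhead,
# comfortably inside a single 40 GB A100.
print(round(estimate_vram_gb(7, 2), 1))
```

Real-world usage varies with sequence length and batch size, so treat this as a lower bound and measure under load.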
Memory bandwidth matters just as much as capacity. The H100 offers 3.35 TB/s of bandwidth, while the older A100 only delivers 1.6 TB/s. That difference isn’t just a number - it directly impacts how fast the model can process each token. If your GPU can’t feed data to the cores fast enough, you’ll see delays even with plenty of memory. That’s why simply stacking more GPUs doesn’t always help. You need the right balance of memory, bandwidth, and interconnect speed.
Disk space isn’t an afterthought either. Model weights are stored on disk before loading into VRAM. For a 600 GB model, you need at least that much fast storage. NVMe SSDs are standard, and many teams use RAID arrays to ensure reliability. Some even keep duplicate copies of weights across multiple storage nodes to avoid downtime if one drive fails.
Software: Containerization, Orchestration, and Quantization
Putting a 10+ GB model into a container sounds simple - until you realize you need to match CUDA versions, GPU drivers, and system libraries exactly. A mismatch of even one minor version can crash inference. Teams that skip testing this end up with unpredictable failures in production. Tools like Docker and Podman help, but you still need to pin every dependency. Northflank’s February 2025 guide recommends using CI pipelines with Trivy to scan containers for vulnerabilities before deployment.
Once the model is containerized, you need to orchestrate it. Kubernetes is the default choice for most enterprises. But running LLMs on Kubernetes isn’t like running a web app. You need to configure Horizontal Pod Autoscalers (HPA) that respond to queue depth, not just CPU usage. Andrej Karpathy’s December 2024 blog makes this clear: “Horizontal scaling with Kubernetes HPA is essential for handling variable LLM workloads while maintaining cost efficiency.” If you don’t scale based on incoming requests, you’re either overpaying or dropping responses.
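Scaling on queue depth instead of CPU can be sketched with the same proportional formula the HPA uses; the target of ten queued requests per pod is an illustrative assumption, not a recommendation:

```python
import math

def desired_replicas(queue_depth: int, target_queue_per_pod: int,
                     min_replicas: int = 1, max_replicas: int = 16) -> int:
    """Mirror the HPA proportional formula: scale replica count to keep
    per-pod queue depth near the target, clamped to configured bounds."""
    wanted = math.ceil(queue_depth / target_queue_per_pod)
    return max(min_replicas, min(max_replicas, wanted))

print(desired_replicas(queue_depth=45, target_queue_per_pod=10))  # 5 pods
print(desired_replicas(queue_depth=0, target_queue_per_pod=10))   # floor of 1
```

In a real cluster you would expose queue depth as a custom or external metric and let the HPA controller run this loop for you; the clamping matters, because a traffic spike without a ceiling can scale you into a surprise bill.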
Quantization is where most teams save money. Converting a model from 16-bit to 8-bit precision halves memory use, and 4-bit cuts it to a quarter. That means a model that originally needed 600 GB can run in roughly 150 GB. The catch? You lose a little accuracy - usually 1% to 5%. For customer support bots, that’s fine. For legal or medical applications, it’s risky. Many teams test quantized versions in sandbox environments first. Neptune.ai’s November 2024 study found that 68% of enterprises now use quantization in production, mostly for high-volume, low-stakes tasks.
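The savings are pure arithmetic on bit widths. A quick sketch, ignoring the small per-group scale and zero-point overhead that real quantization schemes add:

```python
def quantized_size_gb(fp16_size_gb: float, target_bits: int) -> float:
    """Weight footprint after quantizing 16-bit weights to target_bits,
    ignoring the small per-group scale/zero-point overhead."""
    return fp16_size_gb * target_bits / 16

print(quantized_size_gb(600, 8))  # 300.0 GB at 8-bit
print(quantized_size_gb(600, 4))  # 150.0 GB at 4-bit
```

In practice formats like GPTQ or AWQ add a few percent of overhead for scales and zero points, so measure the actual on-disk and in-VRAM sizes rather than trusting the ratio alone.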
Storage and Networking: The Hidden Bottlenecks
Most people overlook storage architecture. You don’t need the same speed for everything. Bulk data like logs and training datasets go on cheap object storage - AWS S3 costs $0.023 per GB per month. But active model weights? Those need NVMe storage at $0.084 per GB per month. The real win comes from tiered caching. Keep frequently used model fragments in fast local SSDs, and fetch the rest from slower storage as needed. Logic Monitor’s January 2025 whitepaper shows this cuts storage costs by 30-50% without hurting performance.
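The tiering math is easy to sanity-check with the per-GB rates above; the 25% hot fraction below is a hypothetical split for illustration, not a recommendation:

```python
# Illustrative monthly cost of splitting 600 GB of weights between a hot
# NVMe tier and cold object storage, using the per-GB rates quoted above.
NVME_PER_GB = 0.084   # $/GB/month, fast tier
S3_PER_GB = 0.023     # $/GB/month, cold tier

def monthly_storage_cost(total_gb: float, hot_fraction: float) -> float:
    hot = total_gb * hot_fraction
    cold = total_gb - hot
    return hot * NVME_PER_GB + cold * S3_PER_GB

all_nvme = monthly_storage_cost(600, 1.0)   # everything on NVMe
tiered = monthly_storage_cost(600, 0.25)    # 25% hot, 75% cold
print(round(all_nvme, 2), round(tiered, 2))  # tiering is roughly half the cost
```

The right hot fraction depends on access patterns: the win only materializes if cold fragments really are cold, since fetching them mid-request adds latency.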
Networking is another silent killer. If your GPUs are spread across different servers, you need high-speed interconnects. InfiniBand or 400 Gbps Ethernet are common in data centers. For cloud deployments, you’re stuck with what the provider offers - AWS’s g5 instances use 200 Gbps networking, which is usually enough. But if you’re doing distributed inference across multiple regions, latency spikes become unavoidable. Dr. Emily Zhang from Stanford warns that “distributed GPU deployments across data centers often introduce latency that makes them unsuitable for interactive LLM applications requiring sub-500ms response times.” That’s why most real-time chatbots stay within a single region.
Deployment Models: Cloud, On-Prem, Hybrid
You have three main choices: cloud, on-prem, or hybrid. Each has trade-offs.
- Cloud (AWS SageMaker, Google Vertex AI): Easy to start. A single g5.xlarge instance costs about $1/hour. But scale up to eight H100s, and you’re looking at $100,000/month. Vendor lock-in is real. You can’t easily move models between providers.
- On-prem: You own the hardware. A single NVIDIA A100 server costs $20,000. A full 8-GPU cluster? Over $500,000. The upside? Total control over data and security. The downside? Average GPU utilization is only 35-45%. You’re paying for idle hardware most of the time.
- Hybrid: This is where the market is heading. 68% of enterprises now use hybrid setups, according to Logic Monitor’s January 2025 survey. Run sensitive workloads on-prem. Offload burst traffic to the cloud. Use edge nodes for low-latency applications. It’s more complex to manage, but it’s the only way to balance cost, speed, and compliance.
Third-party APIs like OpenAI or Anthropic are another option. At $0.005 per 1K tokens for GPT-3.5-turbo, they’re cheap for small projects. But you lose customization, can’t fine-tune models, and risk downtime if the API goes offline. For mission-critical systems, that’s a dealbreaker.
Key Challenges and Best Practices
Teams that deploy LLMs in production hit the same roadblocks again and again. According to Neptune.ai’s November 2024 survey:
- 78% struggle with GPU memory allocation - misconfiguring batch sizes or model parallelism leads to out-of-memory crashes.
- 65% battle latency - users expect responses in under a second. Every extra 100ms hurts engagement.
- 52% have no health checks - if a GPU fails, the whole service goes down.
The best practices are simple but often ignored:
- Test quantization and batching in a sandbox before touching production.
- Use health checks with automatic failover. If one GPU node dies, traffic should reroute instantly.
- Monitor token-per-second throughput, not just CPU. That’s the real metric of performance.
- Start small. Deploy one model, one region, one use case. Don’t try to build a full AI platform on day one.
The Future: Faster Chips, Smarter Pipelines
The next big leap is coming from hardware. NVIDIA’s Blackwell architecture, announced in March 2024, offers 4x the performance of H100s for LLM workloads. That means you could run a 600 GB model on half the number of GPUs. It’s not just about speed - it’s about power efficiency. Blackwell chips use 30% less energy per inference.
On the software side, frameworks like LangChain and LlamaIndex are becoming standard. They help manage complex workflows: fetching data from a vector database, prompting the model, post-processing the output. Adoption jumped from 15% to 62% between 2024 and 2025, according to Gun.io. Vector databases like Pinecone and Weaviate are no longer optional - they’re core components for any LLM that needs real-time knowledge.
By 2026, Gartner predicts 50% of enterprise LLMs will use quantization, and 70% will rely on dynamic scaling. The cost of serving models is falling - but only if you build smart infrastructure. As Dr. Sarah Chen from JFrog puts it: “Efficient resource allocation through dynamic scaling is non-negotiable.” The teams that win aren’t the ones with the most GPUs. They’re the ones who optimize every byte, every millisecond, and every dollar.
How much VRAM do I need to serve a 7B-parameter LLM in production?
A 7B-parameter model typically needs 14-20 GB of VRAM in full precision (FP16). With 8-bit quantization, that drops to around 5-7 GB. Most teams start with a single NVIDIA A100 or H100, which have 40-80 GB of VRAM. This gives room for batching, caching, and overhead. Never deploy without testing memory usage under real traffic.
Can I run LLMs on CPUs instead of GPUs?
Technically, yes - but it’s not practical. CPUs are 10-50x slower than GPUs for LLM inference. A model that takes 200ms on an H100 might take 10 seconds on a high-end CPU. That’s unacceptable for real-time apps. CPUs are fine for small models in low-traffic scenarios, but for anything beyond testing, GPUs are mandatory.
What’s the difference between serving a model and training it?
Training requires massive compute and memory to update model weights - often needing hundreds of GPUs working together for days. Serving (inference) only uses the trained model to generate outputs. It’s far less demanding. You need about 1/10th the hardware for serving compared to training. Most companies serve models they didn’t train - they download open-weight models and deploy them.
Is it better to use cloud services or build my own cluster?
It depends on your scale and control needs. If you’re testing or have low traffic, cloud services like AWS SageMaker are faster and cheaper. But if you’re serving thousands of requests per minute and need data privacy, building your own cluster saves money long-term. The break-even point is usually around $50,000/month in cloud spend. Beyond that, self-hosted becomes more economical - if you have the expertise.
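The break-even point can be estimated with simple arithmetic. The monthly operating cost below (power, cooling, staff) is a hypothetical figure; the other numbers come from this article:

```python
# Hedged sketch: months until a self-hosted 8-GPU cluster recoups its
# upfront cost versus a steady cloud bill.
cluster_capex = 500_000    # 8-GPU on-prem cluster, upfront (from the article)
ops_per_month = 15_000     # assumed power/cooling/staff cost (hypothetical)
cloud_per_month = 50_000   # the break-even cloud spend cited above

months = cluster_capex / (cloud_per_month - ops_per_month)
print(round(months, 1))  # roughly 14 months to recoup the hardware
```

The sensitivity to the operating-cost assumption is the real lesson: underestimate staffing and power, and the break-even horizon stretches well past the hardware’s useful life.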
How do I prevent my LLM service from crashing under load?
Use dynamic batching and autoscaling. Don’t let requests pile up. Tools like vLLM and Text Generation Inference automatically group similar requests to maximize GPU usage. Pair that with Kubernetes HPA to add more pods when queue length rises. Add health checks and circuit breakers - if a node fails, stop sending traffic to it. Test under simulated peak loads before going live.
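The failover behavior described above can be sketched as a node pool that round-robins requests but skips any node whose last health check failed. Real load balancers and Kubernetes readiness probes do the same thing with more machinery; the node names here are placeholders:

```python
class NodePool:
    """Toy failover: round-robin over GPU nodes, skipping any node
    whose last health check failed (a stand-in for a real load balancer)."""
    def __init__(self, nodes):
        self.healthy = {n: True for n in nodes}
        self._order = list(nodes)
        self._i = 0

    def mark(self, node, healthy):
        self.healthy[node] = healthy

    def pick(self):
        # Try each node at most once per call; fail loudly if none are up.
        for _ in range(len(self._order)):
            node = self._order[self._i % len(self._order)]
            self._i += 1
            if self.healthy[node]:
                return node
        raise RuntimeError("no healthy GPU nodes")

pool = NodePool(["gpu-0", "gpu-1", "gpu-2"])
pool.mark("gpu-1", False)          # health check failed: stop routing here
print([pool.pick() for _ in range(4)])  # gpu-1 never appears
```

The important property is the loud failure when every node is down: silently queueing requests against dead backends is exactly how the 52%-with-no-health-checks teams end up with full outages.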
7 Comments
VIRENDER KAUL
The sheer ignorance of people who think you can just throw a 235B model on a cloud instance and call it a day is staggering.
600 GB of VRAM? That's not a server-it's a data center with a side of delusion.
You don't need to be an engineer to understand that if your model weighs more than a Tesla battery pack, you're not deploying-you're hosting a museum exhibit.
And yet, startups are raising millions on 'LLM-as-a-service' pitches that wouldn't survive a 3am load test.
Quantization? Sure, if you're okay with your chatbot hallucinating that the moon is made of cheese and your HR bot tells applicants to 'go f*** themselves' because it misread 'terminate' as 'temperate'.
Health checks? Who needs them when you can just restart the pod and hope for the best?
This isn't AI infrastructure-it's a pyramid scheme built on GPU FOMO.
And don't get me started on hybrid clouds-trying to manage latency across regions is like trying to sync a Swiss watch with a sundial.
Every 'cost-effective' solution is just a time bomb with a Kubernetes label.
Real engineers don't scale horizontally-they optimize vertically, and they don't need a PowerPoint deck to explain why.
Stop treating inference like it's a SaaS startup pitch. It's a hardware problem dressed up as an AI revolution.
And if you're still using OpenAI for mission-critical workflows, you're not innovating-you're gambling with your company's reputation.
Next time someone says 'just use H100s', ask them how many they've actually racked and cooled before midnight.
Most of them haven't even seen a server room.
Infrastructure isn't sexy. But it's the only thing standing between you and a very public, very expensive meltdown.
Mbuyiselwa Cindi
I love how this post breaks it all down without fluff. Seriously, thank you.
Just wanted to add-when we started with our 7B model, we thought a single A100 would be overkill.
Turns out, it wasn’t enough once we added batching + caching + health checks.
We went from 3-second responses to under 400ms by tweaking quantization to 4-bit and using vLLM.
Also, don’t skip the storage tiering-keeping the hot weights on NVMe and the rest on S3 cut our monthly bill by almost half.
And yes, Kubernetes HPA on queue depth? Game changer.
Don’t use CPU usage. That’s like judging a car by how loud the engine is, not how fast it goes.
Start small, test hard, and never assume ‘it works on my laptop’ means it’ll work in prod.
You’ve got this.
Krzysztof Lasocki
Ohhh so THAT’S why my startup’s AI chatbot keeps telling users ‘I am a sentient toaster’?
Not because we’re close to AGI… but because we skipped quantization testing and used a 16-bit model on a 24GB card.
My CTO thought ‘more tokens = smarter AI’.
Turns out, more tokens = more existential crises for customers.
Also, can we talk about how ‘hybrid cloud’ is just corporate speak for ‘we can’t decide if we’re tech or finance’?
Meanwhile, I’m over here manually restarting pods at 3am like a digital janitor.
But hey-at least we’re not using CPUs. That’d be like trying to run a Ferrari on bicycle pedals.
Still, props to the guy who wrote this. Finally, someone didn’t use the word ‘leverage’ once.
Respect.
Henry Kelley
man i just want to say… i tried running a 13b model on a g5.xlarge and thought i was a genius
turns out i was just a guy with too much credit and no clue about memory bandwidth
the latency was like 1.2 seconds… which is basically a lifetime in chat time
then i found out about vLLM and quantization and holy cow it went from ‘user rage’ to ‘smooth as butter’
also, health checks? yes. if one gpu dies, you gotta kill the whole pod, not just hope it comes back
and dont even get me started on how aws charges you for idle gpus like they’re renting out yachts
we switched to hybrid and now we’re saving 60% and still hitting sub-500ms
tl;dr: dont be like me. read the docs. test everything. and maybe sleep sometimes
Rocky Wyatt
Wow. Just… wow.
You spent 2000 words explaining why your job exists.
Let me guess-you got promoted to ‘LLM Infrastructure Lead’ after your boss realized he had no idea what a GPU was.
‘Dynamic scaling’? ‘Tiered caching’? ‘Quantization’?
It’s all just jargon to justify why you’re paid six figures to babysit a bunch of GPUs that sit idle 80% of the time.
Meanwhile, real engineers are running LLMs on Raspberry Pis with 4GB RAM and calling it a day.
But no, we need H100s. Because nothing says ‘cutting edge’ like paying $100k/month to run a model that could’ve been compressed to fit on a thumb drive.
At least admit it: this isn’t innovation. It’s performance art for VCs.
Santhosh Santhosh
I’ve been working on LLM deployments for three years now, and I still feel like I’m learning every day.
What this post captures so well is how the real challenge isn’t the hardware or the software-it’s the human layer.
Everyone wants to move fast, but no one wants to do the boring work of testing edge cases.
I remember one time we deployed a quantized 70B model without validating the token-per-second throughput.
It looked fine on the dashboard-until we realized it was dropping 30% of requests during peak hours because the load balancer wasn’t aware of GPU saturation.
We had to rebuild the entire monitoring stack from scratch.
And yes, Kubernetes HPA on queue depth was the turning point.
But what really saved us was the team culture-we started having weekly ‘infrastructure blameless retros’.
No one got fired. We just fixed the gaps.
It’s not about having the most GPUs.
It’s about having the patience to understand what each byte is doing.
And sometimes, that means sitting quietly for hours watching logs, waiting for the one anomaly that reveals the whole system’s weakness.
It’s not glamorous.
But it’s necessary.
Veera Mavalwala
Oh honey, you think this is complex?
Let me tell you about the time we tried to deploy a 40B model on a hybrid setup with a 200GB cache and a 300GB swap file because ‘we didn’t have time to buy more NVMe’.
It was like trying to run a marathon in flip-flops while juggling flaming chainsaws.
Our latency? 1.8 seconds.
Our uptime? 63%.
Our CEO? Asking if we could ‘just make it faster like ChatGPT’.
We didn’t fix it with more GPUs.
We fixed it by throwing out the entire pipeline and starting from scratch-with a checklist, a coffee machine, and a very angry DevOps engineer who refused to sleep until the first request hit 380ms.
And now? We’re running on 4 H100s with 4-bit quantization, dynamic batching, and health checks that ping every 5 seconds.
Cost? 40% less than before.
Performance? Smooth as silk.
So yes, infrastructure is hard.
But it’s not magic.
It’s just discipline dressed in server racks.