Multi-GPU Inference Strategies for Large Language Models: Tensor Parallelism 101

Running a 70-billion-parameter language model on a single GPU? Impossible. Not because the model is too smart, but because it’s too big. Even the most powerful consumer GPUs today max out at 24GB of memory, and flagship data-center GPUs at 80GB. A model like Llama-2-70B needs about 140GB just to load its weights in half-precision. That’s where tensor parallelism comes in - the only practical way to split a massive model across multiple GPUs and still get fast, usable inference.

What Exactly Is Tensor Parallelism?

Tensor parallelism breaks down individual layers of a neural network and spreads their pieces across multiple GPUs. Think of it like cutting a cake into slices - each GPU gets a slice of every layer, not the whole layer. When the model runs, each GPU computes its slice, then quickly shares results with the others before moving to the next step.

This isn’t about running the same model on many GPUs (that’s data parallelism). It’s about splitting the model itself so that no single GPU has to hold everything. The technique was first formalized in NVIDIA’s Megatron-LM paper back in 2019, and today it’s the backbone of every major LLM inference system - from Hugging Face’s Text Generation Inference to NVIDIA’s TensorRT-LLM and vLLM.

The magic happens in the weight matrices. For example, in attention layers, the query, key, and value projections (q_proj, k_proj, v_proj) use column parallelism: each GPU gets a portion of the output columns, and inputs are copied to all devices. The output projection (out_proj) uses row parallelism: inputs are split across GPUs, and outputs are summed together. These two patterns are baked into every transformer layer, and frameworks handle the communication automatically.
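To make the two patterns concrete, here is a minimal PyTorch sketch of column- and row-parallel linear layers in the Megatron-LM style. It assumes torch.distributed has already been initialized with one process per GPU; the class names are illustrative, not the API of any particular framework.

```python
# Minimal sketch of Megatron-style column/row parallel linear layers.
# Assumes torch.distributed is initialized (e.g. via torchrun) with one
# process per GPU; class names are illustrative, not a framework's API.
import torch
import torch.nn as nn
import torch.distributed as dist

class ColumnParallelLinear(nn.Module):
    """Each rank holds a slice of the output columns; the input is replicated."""
    def __init__(self, in_features: int, out_features: int, world_size: int):
        super().__init__()
        assert out_features % world_size == 0
        self.weight = nn.Parameter(torch.empty(out_features // world_size, in_features))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, x):                      # x: [batch, in_features] on every rank
        return x @ self.weight.t()             # local slice of the output columns

class RowParallelLinear(nn.Module):
    """Each rank holds a slice of the input rows; outputs are summed via all-reduce."""
    def __init__(self, in_features: int, out_features: int, world_size: int):
        super().__init__()
        assert in_features % world_size == 0
        self.weight = nn.Parameter(torch.empty(out_features, in_features // world_size))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, x_shard):                # x_shard: this rank's slice of the input
        partial = x_shard @ self.weight.t()    # partial sums of the full output
        dist.all_reduce(partial)               # sum across ranks (defaults to SUM)
        return partial
```

Chained together (column-parallel into row-parallel, as in an attention or MLP block), the pair needs only a single all-reduce - which is exactly the communication the frameworks schedule for you.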

Why It’s the Only Way to Run Big Models Today

Without tensor parallelism, you simply can’t run models larger than what fits on one GPU. Even if a 120GB consumer GPU existed, the cost and power draw would be prohibitive. Tensor parallelism lets you run a 70B model on four 80GB A100s - or even, with aggressive quantization, on four 24GB RTX 4090s, as hobbyists have demonstrated on Reddit.

NVIDIA’s benchmarks show that tensor parallelism divides the weight memory per GPU roughly by the number of devices you use. Four GPUs? About a quarter of the weight memory per device (activations and the KV cache add some overhead on top). That’s why financial firms and healthcare providers - who need to run massive models for risk analysis or medical diagnostics - rely on it. According to Gartner, 75% of enterprise LLM deployments will use tensor parallelism by 2025; today it’s already in 45% of them.

It’s not just about memory. It’s about speed. A 13B model on four GPUs using tensor parallelism runs 3.2 times faster than on one. That’s not linear scaling - but it’s enough to make real-time chat, document summarization, or code generation feel instant.

How It Compares to Other Parallelism Methods

There are three main ways to spread LLMs across GPUs: data, pipeline, and tensor parallelism. Each has trade-offs.

  • Data parallelism copies the whole model to every GPU and runs different batches. Great for increasing throughput, useless if your model won’t fit on one GPU.
  • Pipeline parallelism splits the model by layers - like giving GPU 1 the first 10 layers, GPU 2 the next 10, and so on. Simple to set up, but creates "pipeline bubbles" where some GPUs sit idle waiting for data. Studies show this wastes 30-60% of compute time (the bubble arithmetic is sketched just after this list).
  • Tensor parallelism splits each layer. No idle GPUs. No bubbles. But every step requires heavy communication between devices.
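Where do those idle percentages come from? The standard bubble-fraction analysis from the GPipe and Megatron papers makes it easy to check; the stage and microbatch counts below are illustrative:

```python
# Pipeline-bubble fraction: with p stages and m microbatches,
# the idle fraction is roughly (p - 1) / (m + p - 1).
def bubble_fraction(stages: int, microbatches: int) -> float:
    return (stages - 1) / (microbatches + stages - 1)

print(f"{bubble_fraction(4, 1):.0%}")   # 75% idle - single-request inference is the worst case
print(f"{bubble_fraction(4, 8):.0%}")   # 27% idle - large batches amortize the bubble
```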
For inference, tensor parallelism wins because latency matters more than batch size. You want a fast reply, not a big batch of replies. That’s why it dominates single-node deployments. Pipeline parallelism is better for training across multiple machines, but it’s too slow for real-time apps.

[Image: four RTX 4090 GPUs linked over PCIe, with NVLink hardware in the background.]

Hardware Requirements: It’s Not Just About GPU Count

You can’t just plug four consumer GPUs into a desktop and expect tensor parallelism to work well. The bottleneck isn’t the GPU - it’s the connection between them.

NVIDIA’s NVLink provides 600 GB/s of total bidirectional bandwidth between A100-class GPUs. PCIe 4.0 x16? About 64 GB/s by the same measure - roughly a 10x gap. In real-world tests, using PCIe instead of NVLink adds 15-25% overhead to inference time, which means a 3.2x speedup can drop to roughly 2.6x. For production, you need NVLink or equivalent high-speed interconnects.

Even then, communication adds latency. AWS Neuron SDK notes that each synchronization point (where GPUs exchange data) adds 1.2-2.5ms over EFA networking. AWS’s NeuronLink cuts that to 0.3ms. That’s why cloud providers charge a premium for instances with NVLink - it’s not a luxury, it’s a requirement.
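A quick back-of-the-envelope using the sync latencies quoted above shows why this matters so much during token-by-token decoding. The layer and sync counts are illustrative (Llama-2-70B-like), and intra-node NVLink latencies are far lower than these cross-node figures:

```python
# Pure synchronization cost per generated token, ignoring compute and transfer time.
layers = 80            # transformer layers in a 70B-class model
syncs_per_layer = 2    # one all-reduce after attention, one after the MLP

for link, latency_s in [("~2.0 ms/sync (EFA-class)", 2.0e-3),
                        ("~0.3 ms/sync (NeuronLink-class)", 3.0e-4)]:
    per_token_ms = layers * syncs_per_layer * latency_s * 1000
    print(f"{link}: ~{per_token_ms:.0f} ms of sync latency per token")
# ~320 ms vs ~48 ms: the interconnect, not the GPU, sets the latency floor.
```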

Practical Setup: How to Actually Use It

Getting tensor parallelism running is easier than it sounds - if you use the right tools.

  • Hugging Face Text Generation Inference (TGI): pass --num-shard 4 when starting the server. It auto-detects your GPUs and splits the model. No code needed.
  • vLLM: the same idea via tensor_parallel_size - set it to match your GPU count (see the example after this list). Works with Llama, Mistral, Phi, and more.
  • TensorRT-LLM: NVIDIA’s optimized engine. Supports FP8 quantization to cut communication volume by half. Requires more setup but gives the best performance.
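For vLLM specifically, a minimal offline-inference script looks like this (the model id is illustrative - use whatever checkpoint you’re deploying):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # illustrative HF model id
    tensor_parallel_size=4,             # one shard per visible GPU
    dtype="float16",                    # FP16, per the key rules below
)
params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```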
Key rules:

  • Match tensor parallel size to your GPU count. Don’t use TP=2 on 4 GPUs - it wastes resources.
  • Use FP16 or BF16 precision. Avoid FP32 - it doubles memory and bandwidth use.
  • Combine with quantization (like INT4) if you’re tight on memory. A 70B model in INT4 can fit on 2x 24GB GPUs with TP=2 - the weights-only math is sketched below.
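The arithmetic behind that last rule is easy to verify. This counts weights only - the KV cache and activations add several GB on top, so treat these numbers as floors, not totals:

```python
params = 70e9  # Llama-2-70B parameter count
for precision, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    total_gb = params * bytes_per_param / 1e9
    for tp in (2, 4):
        print(f"{precision}, TP={tp}: {total_gb / tp:.1f} GB of weights per GPU")
# INT4 at TP=2 -> 17.5 GB per GPU: tight but workable on 24GB cards.
```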
Most users get it working in under an hour using TGI or vLLM. The hard part isn’t the setup - it’s debugging.

Common Problems and How to Fix Them

Even with good tools, things break. Here’s what goes wrong - and how to fix it.

  • Communication timeouts: If you see "allreduce timeout" errors (common in vLLM with TP>4), increase NCCL timeout settings. Set NCCL_ASYNC_ERROR_HANDLING=1 and NCCL_BLOCKING_WAIT=1 (see the sketch after this list).
  • Incorrect results: Rare, but happens if tensor splits aren’t aligned. Always use frameworks that handle splitting automatically - don’t try to manually partition layers.
  • Deadlocks: 32% of GitHub issues in vLLM are deadlocks. Usually caused by mismatched GPU topologies. Use nvidia-smi topo -m to check NVLink connections. Place processes so GPUs talking to each other are physically linked.
  • Scaling beyond 8 GPUs: Communication overhead grows fast. Most models hit diminishing returns after 8 GPUs. If you need more, switch to pipeline + tensor hybrid parallelism.
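For the timeout fix in the first bullet, the environment variables have to be in place before NCCL initializes. A hedged sketch:

```python
import os

# Set these before any process creates an NCCL communicator.
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"  # surface collective failures instead of hanging
os.environ["NCCL_BLOCKING_WAIT"] = "1"         # fail loudly on timeout rather than deadlocking
os.environ["NCCL_DEBUG"] = "WARN"              # extra NCCL logging for post-mortems

# ...then start the engine from this process so workers inherit them, e.g.:
# from vllm import LLM
# llm = LLM(model="...", tensor_parallel_size=8)
```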
The community is loud about this. Reddit users rave about running Llama-2-70B on 4x RTX 4090s. But developers complain that scaling beyond 8 GPUs is "not worth it." That’s not a flaw - it’s physics. More GPUs mean more communication. At some point, the wires become the bottleneck.

[Image: a neural network branching into tensor, pipeline, and expert parallelism paths that converge on a single answer.]

The Future: Hybrid Parallelism Is Coming

Tensor parallelism isn’t perfect. It’s the best tool we have for single-node inference, but it’s not the endgame.

NVIDIA’s recent TensorRT-LLM releases compress intermediate activations using FP8, cutting communication by 50%. That’s a big win. But the real shift is toward hybrid systems.

Stanford’s Center for Research on Foundation Models predicts that pure tensor parallelism will evolve into context-aware hybrids - where the system automatically switches between tensor, pipeline, and even expert parallelism based on the request. For example: a short prompt uses tensor parallelism for speed. A long document gets split into chunks and processed with pipeline parallelism across multiple nodes.

Mixture-of-Experts (MoE) models like Mixtral 8x7B are already pushing this forward. Instead of splitting every weight, expert parallelism assigns entire experts to different GPUs. That cuts cross-GPU traffic by 40-60%. Tensor parallelism still handles the dense layers, but the experts are distributed differently.
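A toy sketch of the placement idea - whole experts pinned to ranks, rather than every matrix sliced. All names and counts here are illustrative:

```python
num_experts, world_size = 8, 4   # a Mixtral-8x7B-like layout on 4 GPUs

# Round-robin assignment: each rank hosts two complete experts.
placement = {expert: expert % world_size for expert in range(num_experts)}
print(placement)  # {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 3}

# A routed token travels only to the rank(s) hosting its top-k experts
# (one all-to-all per MoE layer); the dense layers stay tensor-parallel.
```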

This is where the industry is headed. Not more GPUs in one box - smarter ways to use them across boxes.

Who Should Use Tensor Parallelism?

If you’re asking this question, here’s your answer:

  • Use it if: You need to run models larger than 13B parameters, you have multiple GPUs with NVLink (or equivalent), and you care about low latency.
  • Don’t use it if: You’re running models under 7B, you only have one GPU, or you’re using consumer-grade PCIe systems without high-speed interconnects.
For startups and researchers, vLLM and Hugging Face TGI make it free and easy. For enterprises, TensorRT-LLM offers enterprise support, SLAs, and optimization for production.

The bottom line: If you want to deploy a large language model today, tensor parallelism isn’t optional. It’s mandatory. The models are too big. The hardware is too limited. And this is the only method that lets you bridge the gap without sacrificing speed.

What’s Next?

If you’ve got a 13B+ model you’re trying to deploy, start with Hugging Face TGI. Set --num-shard to your GPU count (or tensor_parallel_size if you prefer vLLM). Use FP16. Add quantization if memory is tight. Monitor latency and memory usage. If it works - you’ve just unlocked a model that would’ve been impossible to run a year ago.

The next step? Learn how hybrid parallelism works. Once you’ve mastered tensor parallelism, pipeline and expert parallelism become the natural next layer. But for now - focus on getting one model running well across multiple GPUs. That’s the real win.

What’s the difference between tensor parallelism and data parallelism?

Tensor parallelism splits the model itself across GPUs - each GPU gets a piece of every layer. Data parallelism copies the full model to every GPU and runs different batches. Tensor parallelism lets you run bigger models. Data parallelism lets you process more requests at once - but only if the model fits on one GPU.

Can I use tensor parallelism with consumer GPUs like RTX 4090?

Yes, but performance suffers. Consumer GPUs use PCIe, not NVLink, so communication between them is slow. You can run a 70B model on 4x RTX 4090s, but you’ll lose 20-30% of the speed you’d get on NVLink-connected A100s. For hobbyists, it’s worth it. For production, it’s risky.

Why doesn’t tensor parallelism scale beyond 8 GPUs?

Because communication overhead grows while per-GPU compute shrinks. Every layer needs all GPUs to synchronize, and each additional participant adds steps and waiting to every all-reduce - while giving each GPU less work to hide that waiting behind. The time spent exchanging data eats up the gains from the extra GPUs. Beyond 8, it’s often faster to use pipeline parallelism across multiple nodes.
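The standard alpha-beta cost model for a ring all-reduce makes the scaling visible: the per-step latency term grows linearly with GPU count, while the tiny decode-time payload keeps the bandwidth term negligible. The alpha and bandwidth values below are illustrative, not measurements:

```python
def allreduce_time(n_gpus: int, payload_bytes: float,
                   alpha_s: float = 1e-5, bw: float = 300e9) -> float:
    latency = 2 * (n_gpus - 1) * alpha_s                       # grows with N
    transfer = 2 * (n_gpus - 1) / n_gpus * payload_bytes / bw  # saturates near 2x payload
    return latency + transfer

for n in (2, 4, 8, 16):
    print(n, f"{allreduce_time(n, 16 * 1024) * 1e6:.0f} us")   # 16 KB decode payload
```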

Which frameworks support tensor parallelism today?

All major LLM inference frameworks do: Hugging Face Text Generation Inference, vLLM, NVIDIA TensorRT-LLM, and DeepSpeed. Open-source tools like vLLM and TGI make it easy to use. NVIDIA’s TensorRT-LLM offers the best performance and enterprise support.

Do I need to write custom code to use tensor parallelism?

No. Frameworks like vLLM and Hugging Face TGI handle everything automatically. You just set a flag - vLLM’s --tensor-parallel-size 4 or TGI’s --num-shard 4 - and the system splits the model, manages communication, and runs inference. Custom code is only needed if you’re building your own inference engine or modifying model architecture.

Is tensor parallelism only for inference?

No - it was first developed for training massive models like Megatron-LM. But for inference, it’s even more critical, because latency matters more than throughput. Training can wait on big batches; a user waiting for a reply can’t.
