Posted on September 22, 2023
Did you know that there is a “BitTorrent” GPT? If you want to use multiple GPUs to serve a single LLM, this is now possible!
An open-source project called Petals allows a large language model to be served in a distributed way, just like BitTorrent.
The reason this is game-changing is that Petals allows you to combine a few “off the shelf” graphics processing units (GPUs) to collaboratively serve a single very large model, such as Meta’s 70-billion-parameter Llama 2.
This means that organisations of any size will very soon be able to protect their data by hosting their very own high-performance LLM, either in the cloud or locally, and at an affordable price.
Petals works by having each GPU in the network load a small part of the model, then join the other GPUs to run inference or even fine-tuning.
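To make the idea concrete, here is a toy sketch of splitting a model's layers across machines. It is not Petals code: the `Server` class, `partition_blocks`, and `run_pipeline` are hypothetical names made up for illustration, and the figure of 80 transformer blocks for Llama 2 (70B) is an assumption of this example. A real network also has to deal with peers joining, leaving, and overlapping, which this ignores.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """One GPU machine holding a contiguous span of transformer blocks."""
    name: str
    first_block: int  # inclusive
    last_block: int   # exclusive


def partition_blocks(num_blocks: int, num_servers: int) -> list[Server]:
    """Split num_blocks into contiguous, near-equal spans, one per server."""
    base, extra = divmod(num_blocks, num_servers)
    servers, start = [], 0
    for i in range(num_servers):
        size = base + (1 if i < extra else 0)  # spread any remainder
        servers.append(Server(f"gpu-{i}", start, start + size))
        start += size
    return servers


def run_pipeline(servers, hidden, apply_block):
    """Pass activations through every server's span, in order."""
    for server in servers:
        for block in range(server.first_block, server.last_block):
            hidden = apply_block(block, hidden)
    return hidden


# Assumed: 80 blocks shared by 8 machines, so 10 blocks each.
servers = partition_blocks(num_blocks=80, num_servers=8)

# Stand-in for a real transformer block: just record which block ran.
trace = run_pipeline(servers, [], lambda block, hidden: hidden + [block])
print(servers[0], len(trace))
```

Each machine only needs enough VRAM for its own span, which is why several small GPUs can stand in for one large one. The trade-off is that activations have to travel over the network between machines at every step.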
Single-batch inference runs at up to 6 steps/sec for Llama 2 (70B). This is up to 10x faster than offloading, and fast enough to build chatbots and other interactive apps. Parallel inference reaches hundreds of tokens/sec.
Let’s show the benefits. If we were to do this the old-fashioned way, using a single machine with an NVIDIA A100 (which has 80 GB of VRAM) to serve Llama 2, it would cost around $55,000 AUD.
How much would it cost to serve this same model using Petals?
Using some “back of the napkin” math, assuming:
We are using NVIDIA 3080s (10 GB VRAM each) to serve Llama 2
A reasonable computer to host an NVIDIA 3080 costs ~$3,000 AUD
Llama 2 requires 80 GB of VRAM
80 GB VRAM / 10 GB per NVIDIA 3080 = 8 computers
8 x $3,000 = ~$24,000 AUD
This means that using Petals is roughly 2.3x cheaper ($55,000 / $24,000).
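The same napkin math can be written as a short script, which makes it easy to plug in your own prices. All figures are the rough estimates from this post; the variable names are just for illustration.

```python
import math

# Rough estimates from the post (AUD); swap in your own numbers.
a100_machine_cost_aud = 55_000
model_vram_gb = 80
rtx3080_vram_gb = 10
rtx3080_machine_cost_aud = 3_000

# Round up: you cannot buy a fraction of a computer.
machines_needed = math.ceil(model_vram_gb / rtx3080_vram_gb)
petals_cost_aud = machines_needed * rtx3080_machine_cost_aud
savings_factor = a100_machine_cost_aud / petals_cost_aud

print(f"{machines_needed} machines, ${petals_cost_aud:,} AUD, "
      f"{savings_factor:.1f}x cheaper")
```

Note that this only compares hardware purchase prices. It leaves out power, networking, and the fact that VRAM is also needed for activations and caches, so treat the result as a ballpark figure.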
There is currently a worldwide shortage of GPUs, such as the A100 and H100, that are capable of serving 70B-parameter models.
Using Petals allows you to skip the queue and serve your model today, whilst saving a ton of money too!