Open Interpreter – create code locally and see if it works!

The world has mostly tried ChatGPT, and the “Dev world” has mostly tried GPT4, but to my surprise, very few people have tried OpenAi’s Code Interpreter.

Code Interpreter has been called “GPT4.5” and was introduced to the world with much critical acclaim by Greg Brockman of OpenAI at a TED talk on Apr 21, 2023.

Just like ChatGPT took GPT3 one step further, Code Interpreter takes GPT4 to the next level.

Why is Code Interpreter important?
Code Interpreter allows GPT to write & execute code. This is significant because it means that GPT can now:

Write code,

Run the code!,

See what happens,

If the output is not as desired – GPT can now correct it’s own mistakes!

When OpenAi’s Code Interpreter was first released it blew the tech world away, generating a ton of hype across twitter/x, reddit & Youtube.

Unfortunately several parties figured out how to exploit Code Interpreter & to address this issue OpenAI had to remove Code Interpreter’s access to the internet. This severely restricted what Code Interpreter was capable of & quickly killed the hype & potential of the product.

September 2023, enter “Open Interpreter”. Open Interpreter is an open source project that brings the Internet back & therefore restores the full potential of Code Interpreter.

Better yet – Open Interpreter runs on your own computer! If you are dealing with sensitive information, data no longer needs to leave your computer. The “AI reasoning” can be implemented by Open Interpreter and your data can remain local. Even better yet I managed to hack Open Interpreter to work with LM Studio, so data never has to go over the internet at all.

You can try Open Interpreter via this Link

How to run a LLM on your own computer – Never be without GPT ever again!

This week’s highlight from our research is LM Studio.

It is now easy to run a Local Large Language Model (LLM) on your computer.

Why would you want to do this?

Privacy. If you are dealing with sensitive data, most clients would not be comfortable with their data traversing the internet to get to say chatGPT.

Cost. If you are going to be making a repetitive set of calls to OpenAi – let’s say doing data manipulation – the costs quickly add up. Running an LLM on your own computer is free.

Internet Access. If you are in an area where you do not have internet, let’s say an overseas flight or out in the countryside – it might still be super useful to have access to an LLM. By having one on your computer you still have access to specialised knowledge that only an LLM can provide!

Unfortunately until recently it has been very difficult to run a Large Language Model on your own computer. You had to have basic dev skills & understand a considerable amount on how the models themselves worked to even get started.

Recently however I discovered LM Studio and simply put, LM studio abstracts away all the pain.

Unlike all of the other solutions out there at the moment – LM Studio is simply an app.

All you need to do is visit their website & with a couple of clicks it is painlessly installed.

Best of all it is 100% free and works on both Windows & Mac.

You can get started yourself by visiting LM Studio’s website here

BitTorrent for LLMs – getting multiple GPUs to serve one LLM

Did you know that there is a “Bittorrent” GPT? If you are looking to use multiple GPUs to serve a single LLM this is now possible!

An opensource project called Petals, allows a Large Language model to be served in a distributed way – just like Bittorrent.

The reason this is game changing is Petals allows you to combine a few “off the shelf” Graphical Processing Units (GPUs) to collaboratively serve a single very large model such as Meta’s 70 billion parameter model Llama 2.

This means that organisations of any size will very soon be able to protect their data by hosting their very own high performance LLM.

Either on the cloud or locally and at an affordable price.

How Petals works is that each GPU in the network loads a small part of the model, then joins with the others GPUs in a network to run inference or even fine tuning.

Single-batch inference runs at up to 6 steps/sec for LLaMA 2 (70B). This is up to 10x faster than offloading, enough to build chatbots and other interactive apps. Parallel inference reaches hundreds of tokens/sec.

Lets show the benefits. If we were to do this the old fashioned way, using a single machine with a NVIDIA A100 (which has 80 GB VRAM) to serve LLAMA 2, it would cost around ~ $55 000 AUD.

How much would it cost to serve this same model using Petals?

using some “back of the napkin math”, assuming:

We are using NVIDIA 3080’s (10GB VRAM) to server Llama 2

A reasonable computer to host an NVIDIA 3080 costs ~$3000 AUD

Llama 2, requires 80 GB VRAM

Compute required:

80GB VRAM / 10 (NVIDIA 3080’s) = 8 computers

8 x $3000 = ~$24000 AUD.

This means that using Petals provides a savings factor of x2.3

There is currently a world wide shortage of GPUs, such as A100/H100s capable of serving 70B parameter models.

Using Petals allows you to skip the queue to serve your model today, all whilst saving a ton of money too!

The GPT Quantum Realm – Getting LLMs on Laptops

The OpenAI API is expensive. But did you know it is possible to get, a ChatGPT level AI model for free on your OWN computer?

Quantisation is a process that allows a very big language model (like GPT3.5) to be shrunk to fit on consumer grade hardware. This means that you can run the model for free on your very own computer.

The LLM quantisation technique means that the GPT technology is soon to be ubiquitous & (almost) free. Imagine a world when you can be offline and have an LLM on your phone!

If you are looking to get started with Quantisation here are 3 projects, with pros & cons, try it out yourself:

Technique: GGML

Pros: Use GGML if you cannot fit the model entirely on VRAM

Cons: Slow

Technique: Bitsandbytes

Pros: Newest Framework, Ease of use

Cons: Slowest

Technique: GPTQ

Pros: Fast, If you can fit the model entirely on the GPU using VRAM, GPTQ is faster

Cons: ?

GPT Behaviour drift – Why are OpenAI’s models changing all the time?

Did you know that ChatGPT/ OpenAI’s models experience “behaviour drift”? If you’re not aware of this potential problem, it could become a real pain to your project.

Why does drift occur? OpenAI’s models are black boxes and OpenAI is constantly changing how these models work under the hood. The behaviour drift occurs when OpenAI makes updates or architectural changes to their base models without explicitly telling us what exactly the changes are.

This means that if you build an LLM application today and use OpenAI APIs as an endpoint, in a couple of months, it may not perform exactly the same as it did when you first ran it.

You can combat model drift by finetuning your own open source base model such as Llama 2, checkpointing it and deploying the checkpointed model to the cloud. That way you’ll have total control over your models behaviour and could even have version control over the series of models you develop over time which is essential for proper testing and behaviour tracking.

Evidence for shifting performance of the GPT is provided in this linked article. Specifically, the end of the article contains information on behaviour drift