Knowing when it is a good idea to make the switch from OpenAI to Open-Source (OS) models can be complex.
In our case, we used OpenAI's models to get started quickly when building a Retrieval Augmented Generation (RAG) pipeline for our company's knowledge base of (hundreds of) thousands of pages on Notion. Then, when Meta released the Llama-2 series, it became reasonably easy to fine-tune our own model.
It turns out that, although fine-tuning our own model gave us complete control over the pipeline and the evaluation process (see below), it wasn't such a good idea for our use case. After GPT-4 distillation, performance was decent, but maintenance and serving costs were too high to justify the change.
Following this experience, here is a breakdown of the respective advantages of OpenAI and OS models regarding security, evaluation, control and customization, costs and, of course, performance.
NB: We chose to compare against OpenAI specifically because we believe it currently has the best performance and the most competitive offering among closed-source LLM APIs.
Open-source models allow you to ensure data privacy
Using OS models allows you to ensure data privacy by keeping your data local or on your own servers, without any data transfer or internet connection. In some use cases, this level of data control and security is paramount. These include:
- Critical industrial applications where there is no internet connection
- Defense industry
- Healthcare data
- Classified information
In such cases, it makes sense to prioritize open-source alternatives, for this reason alone.
Azure OpenAI guarantees data privacy
The major reason people are concerned about data privacy when it comes to LLMs is that OpenAI has remained very vague on the topic. However, Microsoft guarantees data privacy in Azure OpenAI (which can be deployed in several regions around the world), so accessing the models this way is as safe as using any other cloud service.
Cybersecurity can also be a factor here, as using OpenAI's models means being exposed to any breach of their systems (such as adversarial attacks or other vulnerabilities). However, you also get the regular safety improvements and security fixes deployed by the OpenAI team, at no cost.
Evaluating LLMs is crucial but challenging
Evaluating your LLM is crucial to ensure it performs as expected and that your iterations actually lead to improvements. However, it is challenging because, in most cases, there are countless correct ways to phrase the output for a given prompt. Common methods like A/B testing with real users or paid testers, while effective, are manual, costly, slow and may negatively impact users. Recent attempts to automate evaluation using other LLMs have drawbacks and are probably only (somewhat) reliable when a better model, like GPT-4, assesses a worse one.
Fine-grained evaluation at the token level
OS models have a key advantage for evaluation: they allow a direct token-level comparison between the expected and predicted answers. Unlike OpenAI's API, which returns a single sampled token, OS LLMs give access to the full probability distribution over possible output tokens at each generation step. Metrics like cross-entropy can then be used to assess the alignment between expected and generated tokens, simplifying the evaluation process significantly.
PS: prior to GPT-3.5-turbo and GPT-4, OpenAI calls could return "logprobs" (probabilities for up to the top 5 tokens), which partially enabled this method. The feature may have been dropped because GPT-4 (and maybe GPT-3.5) is a mixture of models, and log probabilities from one model may not be comparable to another's.
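The token-level idea can be sketched in a few lines. The per-step distributions below are toy values standing in for real model logits; in a real pipeline they would come from the model's output at each generation step:

```python
import math

def token_cross_entropy(step_probs, expected_tokens):
    """Average negative log-likelihood of the expected tokens under the
    model's per-step output distributions (lower is better)."""
    nlls = []
    for probs, token in zip(step_probs, expected_tokens):
        p = probs.get(token, 1e-12)  # floor to avoid log(0) for unseen tokens
        nlls.append(-math.log(p))
    return sum(nlls) / len(nlls)

# Toy distributions over a tiny vocabulary at each generation step.
step_probs = [
    {"The": 0.7, "A": 0.2, "Our": 0.1},
    {"cat": 0.6, "dog": 0.3, "answer": 0.1},
]
expected = ["The", "cat"]
score = token_cross_entropy(step_probs, expected)  # ~0.434
```

A lower score means the model concentrated more probability mass on the reference answer, which is exactly the signal a sampled-token API hides.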
Control and customization
Open-source models give you control over the generated tokens
Here again, access to a probability distribution over the possible output tokens unlocks many customization possibilities beyond those OpenAI makes available (such as top_k, temperature, logit_bias, …). E.g.:
- Hugging Face's generation method provides the ability to use beam search, length penalties, …
- Efficient token healing in Guidance helps to better communicate the intent of the prompt to the model.
- Speculative execution using batch inference (best explained in this post from Andrej Karpathy) can significantly speed up the inference process.
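To illustrate one of these knobs, here is a minimal, self-contained beam-search decoder over a toy next-token model (tokens and probabilities are invented for illustration; with real models, Hugging Face's `generate` handles this via `num_beams`):

```python
import math

def beam_search(next_probs, start, beam_width=2, steps=2):
    """Tiny beam-search decoder: keep the `beam_width` most likely partial
    sequences at each step instead of committing greedily to one token."""
    beams = [([start], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            for tok, p in next_probs(seq).items():
                candidates.append((seq + [tok], logp + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

def next_probs(seq):
    """Toy next-token model conditioned only on the last token."""
    table = {
        "<s>": {"a": 0.6, "b": 0.4},
        "a": {"x": 0.5, "y": 0.5},
        "b": {"x": 0.9, "y": 0.1},
    }
    return table[seq[-1]]

# Greedy decoding would pick "a" first (p=0.6) and end with probability
# 0.6 * 0.5 = 0.30; beam search finds "b" -> "x" with 0.4 * 0.9 = 0.36.
best = beam_search(next_probs, "<s>")  # ["<s>", "b", "x"]
```

This is the kind of decoding strategy that is simply out of reach when an API only returns one sampled token per step.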
OpenAI Functions guarantee the schema of the generated text
OpenAI implements "Functions", a high-level feature that (most likely) uses constrained sampling in the backend to give greater control over the output schema (which is naturally also possible with OS models with a bit of engineering). Here, OpenAI really focuses on keeping its API widely understandable, implementing only what brings the most value to users and keeping it as simple as possible.
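The "bit of engineering" for OS models boils down to masking, at each step, the tokens the schema does not allow before picking one. A minimal sketch of the idea, with invented probabilities:

```python
def constrained_pick(probs, allowed):
    """Constrained-sampling sketch: drop tokens outside the schema's allowed
    set at this step, then pick the most likely survivor (a real decoder
    would renormalize and sample instead of taking the argmax)."""
    masked = {t: p for t, p in probs.items() if t in allowed}
    if not masked:
        raise ValueError("the schema admits no candidate token at this step")
    return max(masked, key=masked.get)

# The raw model prefers "yes", but the schema only allows JSON booleans.
probs = {"yes": 0.5, "true": 0.3, "no": 0.1, "false": 0.1}
choice = constrained_pick(probs, allowed={"true", "false"})  # "true"
```

Libraries like Guidance apply this step-by-step against a grammar, which is how a guaranteed output schema can be obtained from any OS model.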
Dev costs can be relatively large with open-source models
To achieve the best performance with an OS model, you probably need to fine-tune it on your own data. In our experience, it took 10 to 20 hours of development time (dataset creation by running past user inputs through GPT-4, understanding a training pipeline like this one, machine deployment, evaluation and iterations) and more than 10 training iterations of 3-6 hours each to get a Llama-2-7B proof of concept with satisfactory performance (comparable to GPT-4 for our specific use case). It would probably take significantly more time to make the model fully production-ready. On top of that, machine costs should be considered, although with QLoRA, fine-tuning small LLMs in the cloud only costs a few tens of dollars.
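For reference, a QLoRA setup looks roughly like the sketch below (it assumes the `transformers`, `peft` and `bitsandbytes` libraries and a GPU; the hyperparameter values are common defaults, not our exact configuration):

```python
# Illustrative QLoRA configuration: load the base model 4-bit quantized,
# then train only small low-rank adapter matrices on top of it.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # the "Q" (quantization) in QLoRA
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
)
lora_config = LoraConfig(
    r=16,                                    # adapter rank (the "LoRA" part)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],     # attach adapters to attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)   # only adapter weights are trainable
```

Because the frozen base model sits in 4-bit precision and only the adapters are trained, a 7B model fits on a single consumer-grade GPU, which is what keeps the cloud bill low.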
Serving open-source models is not cheap
Serving LLMs requires powerful hardware. E.g., this article recommends the following AWS instances: a g5.2xlarge (~$870/month on-demand) for Llama-2 7B and a p4d.24xlarge (~$23,590/month on-demand) for Llama-2 70B. Triton Inference Server (or the more recent HF text-generation-inference) are tools of choice to ease serving multiple parallel users. Luckily, the community has been working on memory-efficient tools to make OS LLMs cheaper to serve, mostly vLLM and llama.cpp (NB: still early-stage and possibly unstable in production). Indeed, this source indicates that a c5.2xlarge AWS instance (~$245/month on-demand) is enough to serve a 7B model with llama.cpp at 150 ms/token.
Using OpenAI is significantly cheaper for occasional usage
Open-source large language models (LLMs) hosted on private machines often sit idle in some use cases, causing wasted resources and cost inefficiency. In contrast, OpenAI optimizes serving costs by taking advantage of the fact that its models are called constantly, allowing it to offer very attractive pricing: assuming an average request of 2k tokens in the prompt and 1k in the output, serving a 7B model on AWS with llama.cpp only becomes cheaper than GPT-4 beyond roughly 2,040 requests per month…
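That break-even point follows from simple arithmetic, assuming GPT-4's pricing at the time of writing ($0.03 per 1k prompt tokens, $0.06 per 1k completion tokens) and the ~$245/month c5.2xlarge instance mentioned above:

```python
# Back-of-the-envelope break-even between GPT-4 and self-hosting a 7B
# model with llama.cpp on a c5.2xlarge instance.
prompt_tokens, output_tokens = 2_000, 1_000

# GPT-4 (8k context) pricing at the time of writing, per 1k tokens.
gpt4_cost_per_request = (prompt_tokens / 1000) * 0.03 + (output_tokens / 1000) * 0.06  # $0.12

instance_cost_per_month = 245.0  # c5.2xlarge on-demand, approximate

# Above this volume, the fixed instance cost beats GPT-4's per-request cost.
break_even_requests = instance_cost_per_month / gpt4_cost_per_request  # ~2,042/month
```

Below that volume, the fixed instance cost dominates and OpenAI wins; the smaller your traffic, the stronger the case for the API.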
Open-source models’ performance is improving quickly
In recent months, OS LLMs have improved significantly, most notably with the release of Llama-2. This trend is only expected to continue, e.g. with Mistral AI building OS LLMs to compete with OpenAI, as explained in their strategic memo (which also contains very insightful reasons to prefer OS models); they released their first model on 27/09!
Furthermore, the QLoRA (quantization + LoRA) method has made LLM fine-tuning significantly cheaper, and hence more accessible (cf. the section on costs). On top of that, the LIMA paper showed that as few as 1,000 well-curated prompts and answers can be enough to fine-tune an LLM.
GPT-4 models remain SOTA
GPT-4 is undoubtedly the most powerful LLM currently available and can be used for a wide range of applications without task-specific fine-tuning, granting you the flexibility to easily reuse it on different classes of tasks. Furthermore, some smaller OpenAI models (e.g. gpt-3.5-turbo) can be improved through fine-tuning without having to bother with training scripts, GPU configuration, etc.
As you can see, unless you have hard constraints one way or the other (strict data control points to open source; small scale and tight costs point to OpenAI), choosing between open-source models and the OpenAI model suite is not a simple question.
At the end of the day, it hinges on a general appreciation of what direction the LLM market will take and how you want to position yourself or your company for the future:
- On the one hand, you could believe that the lead OpenAI has with GPT-4 will only widen with more user-generated data and compute power, and that open-source LLMs will never catch up.
  - In this case, adopt OpenAI's GPT-4, as you will be ready for the next iterations of their SOTA models.
- On the other hand, you could bet on the dynamism and creativity of the open-source community.
  - In this case, build your custom LLM pipeline on open-source models and frameworks.