LM Studio
Downloadable from: https://lmstudio.ai/
Page contents
Ratings
Accuracy / Quality: ★★★☆☆
Flexibility / Features: ★★★★☆
Data security / Privacy: ★★★★☆
Open source model(s)
Pros/cons
Pros
- Customizable for many different (language) models.
- Easily updated to use newly released models.
- Data-secure and private.
- Low technical difficulty.
- Model parameters can be (partially) modified.
Cons
- Typically lower in quality than commercially hosted models.
- Slower than commercially hosted models due to lower hardware quality.
- Fewer features than commercial models.
- At least 5-10 GB local storage space required (smaller models).
Description
General
Using commercially hosted models (e.g. ChatGPT, Gemini or Claude) can have some significant drawbacks. Whenever you enter information into these models there is a risk that the information provided ends up in the training data of the model. This means that there can be no safe usage of these tools when working with either personal sensitive information (e.g. names, phone numbers, (mail) addresses, etc.), but also not with academically sensitive information (research questions, hypotheses, new methodologies, funding information, etc.). Unless a formal (written) agreement is made between the company in question and the WUR, using an open-source model on your own device can be a solution to this problem.
LM Studio (or alternatively: Ollama) is an example of software that allows the user to install language models on their own device. Doing so would allow the model to run on the hardware of the user, rather than the large servers of external companies. Any data you enter into models that are operating locally (on your own device) is not sent to any external servers, and is therefore data secure. Even if the internet connection would fall away, these models would continue to be usable.
Quantization
When installing models in LM Studio (or Ollama) you do so via GGUF-files. These are quantized versions of released open source models. The information in the model resulting from the training of the model is stored as large numbers, often in 32- or 64-bit lengths, such that the information is maintained with high precision. In the quantization process these lengths are reduced to e.g. 8-bit lengths. This shrinks down the size of the model to a more manageable volume for local devices, but comes at the cost of the accuracy of the model. This does not necessarily need to be a problem for the functioning of the model, as the accuracy provided by the high precision of the data is not always required. But if the precision is affected too much, the model will increasingly hallucinate or be unable to answer. In simple terms this process can be compared to the value for Pi: Having more decimals makes the calculation more precise, but not necessarily better.
The level of quantization can be found in the model name, with Q8 (8-bit) referring to a model in which a large degree of precision has been kept, whereas a Q2 (2-bit) version has lost a lot of precision but is significantly smaller in size. Conventionally the Q4 or Q5 quantized versions of models are an acceptable balance between size and accuracy for most tasks.
Model parameters
An advantage of using locally installed models is that parameters which are usually hidden in the background are available for modification. We will highlight the most relevant parameters.
Temperature
A core behavior of language models is determined via the 'Temperature' parameter. This parameter can vary from 0 to 1. A value of 0 represents a very strict and deterministic model, in which the prediction of the next token (word) is based on the largest statistical probability. It will result in the most likely answer being generated, and for the model to respond with the same answer consistently. A value of 1 instead results in more variation being introduced in the model, and less likely answers being generated. This can be beneficial when writing more creative texts such as poems, or when asking for more unique suggestions for the improvement of your text. The default temperature set for the model upon installing it is an indicative value of an acceptable balance between these two extremes. In LM Studio the Temperature setting can be found in the Advanced Configuration menu (top-right) under the 'Sampling' tab.
Context length
When reading the information you provide (your prompt and/or uploaded documents) and writing your answers the model is dependent on its context length. This parameter can be seen as the 'attention span' of the model. The greater the context length, the slower the model risks becoming, but the more information it can memorize as part of its answering. In LM Studio the context length can be found by clicking on the cogwheel next to the dropdown menu for the model selection.
When uploading documents to the model, bear in mind that the available context length can affect the quality of the answer. With a context length too short to fit the full document LM Studio uses the 'retrieval' method. This means that the language model looks for more exact matches of information in the file and is more of a 'searcher'. When the context length is large enough to fit the full size of the document (and prompt), LM Studio uses the 'full-injection' method, which means that the document is used as contextual information, allowing for more in-depth and related answers, rather than exact matches from the text.
System prompt
After loading a model, it can be used straight away. However, you can also insert a 'System prompt'. These are instructions which the model always needs to consider, and which are considered more important than the prompts you enter into your chat. These can be compared to the Custom Instructions feature available in ChatGPT. The system prompt can be used to specify the behavior of the model for all chats it is used in, until the setting is disabled again. For examples on how to write a system prompt, you can take inspiration from the system prompts of Claude, which are available via this link. You can find the System prompt feature in LM Studio under the Advanced Configuration menu on the top-right.
Finding models
How to find suitable models
To find GGUF-versions of open source models the best place to start is the Huggingface website. This website serves as a repository of open source models developed by both large companies as well as fine-tuned versions of those models released by individuals. To find suitable models for LM Studio you can look on the Models page. On this page, apply the 'Text generation' filter in the 'Tasks' tab, and the 'GGUF' filter in the 'Libraries' tab. What you are left with are all potentially usable open source language models.
You will rarely find GGUF-versions of models together with the originally released models. Hence we recommend looking at the versions released by Bartowski or the LM Studio Community. Inside the model page you can download the version of the model you desire via the 'Files' tab. Alternatively, you can also use the 'Discover' functionality within LM Studio itself, though this is a more limited list of available model versions.
Next, you will need to find a suitable model from the list of available ones. When comparing the available models, you will find that the larger the model (indicated by the B-value in the name, referring to the billions of parameters in the model) the higher the accuracy. However, this also means that the model is significantly larger to install and run. In this guide we will list a handful of versions, but will leave out the largest versions of models as these are unlikely to be usable on commonly used devices. We will only list 'Instruct' or 'It' versions of models, as these have been specifically fine-tuned to operate as a chatbot, making them more suitable for the application in LM Studio.
Each model has its own strengths and weaknesses, and their own sizes and quality of the training data. Luckily this can be assessed via independent benchmarks or via blind tests. Benchmarks are datasets containing questions (and answers) which can be posed to a language model to test its capabilities or accuracy on various tasks. These questions are (ideally) not present in the training data of the model, allowing the ranking of the models by the percentage of questions correctly answered. In this guide we will use the Livebench benchmark website. Alternatively, models can also be tested blindly by users by presenting users with the output of two or more models to a prompt they submitted, after which the user indicates which model (without knowing the name of the model) performed best. Based on this information a ranking can be established from real user experience. In this guide we use the LM Arena website for this ranking. Note that both the benchmark and blind test rankings were done on the full versions of the models, not the quantized versions. Hence a small decrease in quality compared to the listed ranking can be expected when using the model locally via LM Studio.
You can check if the model you wish to download will adequately run on your computer by looking it up in the Discover function of LM studio. This search will warn you when a model will (likely) use too much of your computer memory.
Overview of models and benchmarks
The table below contains a list of (potentially) relevant models, is by no means complete, and prioritizes the more commonly used models. Note that not all model versions have a listed benchmark score. In these cases, know that smaller models generally have a lower performance than larger models.
Model description | Livebench | LM Arena | Download links | |||||
---|---|---|---|---|---|---|---|---|
Model family | Developer | Version | Reasoning | Math | Language | Coding | Overall score | |
LlaMa | Meta | 3.1-8B-Instruct-Turbo | 13.33 | 18.31 | 17.71 | Generation: 29.49 Completion: 8 |
1176 | |
3.3-70B-Instruct | 50.75 | 42.24 | 39.2 | Generation: 37.18 Completion: 36 |
1256 | |||
Gemma | 2-2B | N/A | N/A | N/A | N/A | 1142 | ||
2-9B | 15.17 | 19.80 | 25.53 | Generation: 26.92 Completion: 18 |
1191 | |||
2-27B | 28.08 | 26.52 | 32.62 | Generation: 35.90 Completion: 36 |
1220 | |||
Qwen | Alibaba | 2.5-7B-Instruct-Turbo | 28.42 | 39.49 | 15.80 | Generation: 39.74 Completion: 37 | N/A | |
2.5-14B | N/A | N/A | N/A | N/A | N/A | |||
2.5-32B | 42.08 | 46.61 | 23.24 | Generation: 57.69 Completion: 56 | N/A | |||
2.5-72B-Instruct-Turbo | 45.42 | 54.29 | 34.99 | Generation: 51.28 Completion: 64 |
1258 | |||
Mistral | Mistral | Large 2411 | 41.67 | 44.69 | 39.52 | Generation: 46.15 Completion: 48 |
1244 | |
Small 2409 | 29.92 | 24.24 | 24.49 | Generation: 24.36 Completion: 18 |
N/A | |||
Nemo | N/A | N/A | N/A | N/A | N/A | |||
DeepSeek | DeepSeek | 3 | 56.75 | 60.54 | 47.48 | Generation: 61.54 Completion: 62 |
1315 |
About the developers
When using language models you should be aware of the biases inherent in them, and the limitations of the models. Some of these are tied to the developers of the models. Hence we will briefly go over each developer and provide some background information on them to provide context for the use of their respective models.
LLaMa
The Llama models are developed by Meta, an American technology company that is also known for social media platforms like Facebook, Instagram and WhatsApp. Apart from these social media apps and websites, Meta is also one of the biggest AI companies, most known for its online Meta AI model (not available in the Netherlands) and the offline Llama model. In total, all versions of the offline model are downloaded more than 350 million times. The newest version of the model is Llama 3.2. Usage of the LLaMa models is allowed under the LLaMa License for research and commercial purposes, but is more limited for commercial purposes when it concerns models of 'small' sizes (less than 8B parameters) due to potential large-scale use in commercial devices such as smartphones.
Qwen
At more than 40 million downloads over more than 100 different versions, Alibaba Cloud’s Tongyi Qianwen (Qwen) is another popular open source large language model. Alibaba Cloud is the digital technology part of the Chinese Alibaba group, also known for the web shops AliExpress and Alibaba. The model is very adequate at mathematics and programming tasks. Contrary to American and European model developers, the training data used for Qwen contains a proportionally larger body of data from Chinese origin. This may result in different model behavior compared to Western models, especially when social or political subjects are concerned.
Mistral
Mistral is an independent AI company from France founded in 2023. There are several Mistral applications like the online assistant LeChat, and the open source models Mistral, Codestral and Mathstral. Mistral made a name for itself by being the first commercial company to implement the 'Mixture of Experts' model strategy, in which various smaller models were each trained on a specific set of subjects, and then brought together to form a larger combined model. The resulting output was significantly more accurate than that of its competitors at the time.
Gemma
Another prominent AI company is Google DeepMind. Google is renowned for (among others) YouTube, Android and of course the search engine. Since the introduction of chat assistant Gemini (formerly known as Bard), Google also has also released a series of open source AI tools named Gemma. Apart from the Gemma 2 tool specialized in text generation, there are specialized model for code generation (CodeGemma) and images (PaliGemma).
DeepSeek
DeepSeek is an AI tool from High-Flyer, a Chinese hedge fund company. The newest model (Deepseek-V2.5) is praised for high benchmark results. Until recently the DeepSeek Coder was also considered the most accurate open source AI coding assistant. Similar to Qwen, the training data used to train the DeepSeek models is less American and Western Europe focused, and as such may give different results than Western models on social or political topics.
Alternatives
- Ollama: https://ollama.com/
- Installing the full model and operating it via programming language (commonly Python). This requires more technical knowledge and is less user-friendly.