MAGAN.AI is a powerful offline AI system for generating high-quality images and text. A key factor in getting the best performance out of it is understanding quantization and how it affects GPU, VRAM, and RAM usage. In this post, we will look at how quantization works and how to choose the right model for low-VRAM systems.
Quantization is the process of reducing the precision of the weights and activations in a neural network. Lower precision means the model takes up less memory, which directly reduces VRAM usage. For example, an 8-billion-parameter model stored at 16-bit precision needs roughly 16 GB for its weights alone; quantized to 4 bits, that drops to roughly 4 to 5 GB.
QX: The X is the number of bits used per weight; fewer bits mean a smaller file, but also lower quality.
QX_K: Filenames with K use grouped (block-wise) quantization, which is more accurate than the older _0 and _1 variants at a similar size.
QX_K_Y: The Y is usually S, M, or L, for Small/Medium/Large; larger variants keep more of the model at higher precision.
Take the IBM-Granite/Granite-3.3-8b-Instruct-GGUF repository as an example: it packages an 8-billion-parameter instruct model at several quantization levels.
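The exact file list varies by upload, but you can estimate the size of each quantization level from the parameter count alone. Here is a minimal Python sketch; the bits-per-weight figures are typical effective averages for common llama.cpp quant types, not exact values for any specific file.

```python
# Rough GGUF file-size estimate: parameters x bits-per-weight / 8.
# The bits-per-weight values are typical effective averages for
# llama.cpp quant types (K-quants mix precisions, and the file also
# stores metadata), so treat the results as ballpark figures.
QUANT_BITS = {
    "Q2_K": 2.6,
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}

def estimate_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate on-disk size in GB for a quantized model."""
    return params_billion * bits_per_weight / 8

for quant, bits in QUANT_BITS.items():
    print(f"{quant}: ~{estimate_size_gb(8.0, bits):.1f} GB")
```

For an 8-billion-parameter model this works out to roughly 2.6 GB at Q2_K up to about 8.5 GB at Q8_0, with the 5-bit level landing just under 6 GB.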
When selecting a model for your system, a good rule of thumb is to pick one whose file size is smaller than your available VRAM. Once that model is working, check your VRAM and GPU usage and move to a larger or smaller model as needed. File size scales with the number of parameters, so low-VRAM systems should stay under roughly 12 billion parameters; with more VRAM, larger models can be loaded.
Given a system with 6 GB of VRAM, you would need to choose a quantization level that fits within this limit. From the estimates above, the 5-bit quantization levels (such as Q5_K_S or Q5_K_M) are the most suitable choice for an 8B model.
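As a sketch of how to automate that check, the helper below reuses QUANT_BITS and estimate_size_gb from the snippet above. The 0.25 GB headroom default is an assumption, left for the KV cache and context buffers; raise it if you share VRAM with your desktop or hit out-of-memory errors.

```python
def pick_quant(vram_gb: float, params_billion: float,
               headroom_gb: float = 0.25) -> str | None:
    """Largest quant type whose estimated size fits in VRAM, leaving
    some headroom for the KV cache and context buffers."""
    budget = vram_gb - headroom_gb
    best = None
    for quant, bits in sorted(QUANT_BITS.items(), key=lambda kv: kv[1]):
        if estimate_size_gb(params_billion, bits) <= budget:
            best = quant  # keep upgrading while the estimate still fits
    return best

print(pick_quant(6.0, 8.0))  # -> Q5_K_M, a tight but workable fit
```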
MAGAN.AI uses GGUF files, the model format used by llama.cpp, for chat. You can download LLM models to use in MAGAN.AI from the Hugging Face Model Hub:
[Download LLM Models from Hugging Face](https://huggingface.co/models?pipeline_tag=text-generation&library=gguf&apps=llama.cpp&sort=downloads)
This link filters for GGUF files that work with llama.cpp for text generation, and you can narrow the results further by parameter count using the filters on the left. MAGAN.AI works best with general-purpose or Instruct models; reasoning or "thinking" models do not work with MAGAN.AI at this time.
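If you prefer to search from code, the huggingface_hub package exposes the same filters as the link above. A minimal sketch, assuming huggingface_hub is installed (pip install huggingface_hub); the library filter mirrors the library=gguf query parameter:

```python
from huggingface_hub import HfApi

api = HfApi()
# GGUF text-generation models, most-downloaded first; the same
# filters the Hugging Face link above applies in the browser.
models = api.list_models(
    pipeline_tag="text-generation",
    library="gguf",
    sort="downloads",
    limit=10,
)
for model in models:
    print(model.id)
```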
Click the model you are interested in. On the model page, the available quantization options are listed on the right. Select the size you want and click its version button; a panel opens on the right with a Download button at the top.
Once the file is downloaded, move it to the “models/llm” folder.
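If you would rather script the download, hf_hub_download can fetch the file straight into that folder. The repo id and filename below are examples based on the Granite model discussed earlier; confirm the exact GGUF filename on the model page before running this.

```python
from huggingface_hub import hf_hub_download

# Example repo id and filename (check the model page for the exact
# name of the quant you chose); local_dir drops the file into the
# folder MAGAN.AI reads models from.
path = hf_hub_download(
    repo_id="ibm-granite/granite-3.3-8b-instruct-GGUF",
    filename="granite-3.3-8b-instruct-Q5_K_M.gguf",
    local_dir="models/llm",
)
print(f"Saved to {path}")
```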
Understanding quantization and how it affects VRAM and GPU usage is crucial when running an offline AI system like MAGAN.AI. By selecting a quantization level and a model size that fit within your VRAM, you can manage memory effectively and keep your system running at peak performance.