Zeus Labs

4Bit QLoRA - 13B & 33B on 24GB VRAM or less?

Author: Elinas

Preface

I had heard about training quantized models, specifically 4bit/int4, for some time. I shrugged it off as most likely producing low-quality results, since you're taking a model that has already been quantized (for inference) and then training on top of it. I was pleasantly surprised, and I'll give you a quick overview of how I replicated Chronos-13B using a single 3090 with a ~22% speed increase over 8bit/int8.

To note: LLaMA 7B and 13B can be run well under 24GB VRAM; running 30/33B on a single 3090 was the original idea.

Bitsandbytes nf4 Format is Added to Transformers

Since I wanted to try int4 training, and I had a 3090 sitting around doing nothing, I decided to do a bit of research on how the process works and how to set it up. I won't go into the technical details, but you can read this blog post for more info. Now, if you look at that blog post, the title is "Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA", which is exactly what I am going to explain.
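
As a quick illustration, here is a minimal sketch of what loading a model in the nf4 format looks like with the transformers/bitsandbytes integration described in that post. The model name and compute dtype here are just examples, not a prescription from the post:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize the base weights to 4bit NormalFloat (nf4) while computing in bf16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "elinas/llama-13b-hf-transformers-4.29",
    quantization_config=bnb_config,
    device_map="auto",
)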

Some Technical Details & Getting to Training (Soon)

Previously, to train a LoRA model, you'd need to use the 8bit/int8 type (not to be confused with FP8 or other floating-point formats, which are more precise than integers). Nonetheless, the prospect of training LLaMA 33B on my 3090 was exciting, and the added speed was a bonus.

Since LLaMA and many other models have a default "memory," otherwise known as "context length," I always try to train my models at a length of 2048 tokens, which is what LLaMA was trained at. This includes getting a high quality dataset with many samples around the 2048-token range, so the model reaches its maximum potential.
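
If you want to sanity-check how close your samples get to that 2048-token limit before training, a rough sketch like the following works. The tokenizer name and sample text are illustrative:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("elinas/llama-13b-hf-transformers-4.29")

samples = [
    "### Instruction:\nSummarize this article...\n\n### Response:\n...",
    # ...the rest of your formatted training samples
]

lengths = [len(tokenizer(s).input_ids) for s in samples]
print(f"max: {max(lengths)}, mean: {sum(lengths) / len(lengths):.0f}")
print(f"samples over 2048 tokens (these get truncated): {sum(l > 2048 for l in lengths)}")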

Rough Tests & Findings (LLaMA)

Experiment - Replicating Chronos-13B in 4bit

I wanted to try this method, and since I did not want to train 30/33B (Chronos versions of that size were not out yet either), I decided to replicate the 8bit version of the model as closely as I could in 4bit. For this I used my own trainer, in which I implemented QLoRA. The trainer was originally adapted from the Stanford Alpaca LoRA repo but has since evolved to include numerous features.

Setting up your Environment

I used WSL as a PoC for this and I highly recommend you use Linux or WSL for simplicity.

You can use my trainer to accomplish training easily; it can be found here: Zeus LLM Trainer

I’ll provide the instructions just as I originally ran them myself:

  1. Create the venv - python -m venv venv
  2. Activate the venv - source venv/bin/activate
  3. Install the requirements - pip install -r requirements.txt

That should have you covered. Now you will need a dataset, of which there are many to choose from if you browse Hugging Face. For this demo we’ll use the GPT4 Alpaca LoRA Dataset, but you can use any dataset as long as it is formatted as JSON or JSONL.
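
For reference, the GPT4 Alpaca LoRA data follows the usual Alpaca instruction/input/output layout; a hypothetical sample (the content below is made up for illustration, not taken from the actual dataset) looks like this:

[
  {
    "instruction": "Explain what QLoRA is in one sentence.",
    "input": "",
    "output": "QLoRA fine-tunes small low-rank adapter weights on top of a 4bit quantized base model."
  },
  {
    "instruction": "Translate the following sentence to French.",
    "input": "The weather is nice today.",
    "output": "Il fait beau aujourd'hui."
  }
]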

Running the Model

Note: I changed the base model since it was originally pointing to the wrong one; the correct base model for this demo is elinas/llama-13b-hf-transformers-4.29!

Here is the run configuration I used:

python finetune.py \
    --base_model='elinas/llama-13b-hf-transformers-4.29' \
    --data_path='dataset.json' \
    --train_4bit \
    --num_train_epochs=3 \
    --cutoff_len=2048 \
    --val_set_size=0 \
    --output_dir='./13b-4bit-qlora-chronos' \
    --lora_target_modules='[q_proj,k_proj,v_proj,o_proj]' \
    --lora_r=128 \
    --lora_alpha=256 \
    --gradient_accumulation_steps=8 \
    --per_device_train_batch_size=2 \
    --save_and_eval_steps=500 \
    --warmup_ratio=0.04 \
    --group_by_length \
    --save_total_limit=2 \
    --use_xformers 

I won’t go through every parameter but there are some you should be familiar with.

--base_model='elinas/llama-13b-hf-transformers-4.29' tells the trainer to use the LLaMA 13B model I created a while ago, which will automatically be downloaded.

--train_4bit simply signifies that you are using QLoRA to train your model; this flag is required.

--num_train_epochs=3 generally we train for 3 epochs, sometimes more, but rarely fewer unless the model starts to overfit.

--cutoff_len=2048 is the token length at which we want samples to be cut off. Currently, without alternate methods, 2048 is the max.

--use_xformers is a nice “hack” to reduce VRAM usage quite significantly at nil cost to the end result.

--per_device_train_batch_size=2 should be kept at 2 for this demo, BUT you may be able to increase --gradient_accumulation_steps=8 to a higher value like 16, as I purposely under-provisioned VRAM to ensure there were no crashes.

With these settings I was using ~18GB of VRAM, and that includes 2 additional LoRA attention layers, so this might fit on a 16GB card like an A4000. Additionally, 7B should be trainable on a 12GB card.
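
For the curious, here is a rough sketch (not the trainer's actual code) of how flags like --train_4bit, --lora_r, --lora_alpha and --lora_target_modules map onto the transformers + peft QLoRA APIs. The dropout value is illustrative, since the command above does not set one:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the 4bit (nf4) quantized base model
model = AutoModelForCausalLM.from_pretrained(
    "elinas/llama-13b-hf-transformers-4.29",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # prep norms/embeddings for k-bit training

lora_config = LoraConfig(
    r=128,                                                    # --lora_r
    lora_alpha=256,                                           # --lora_alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # --lora_target_modules
    lora_dropout=0.05,                                        # illustrative, not set above
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the LoRA adapter weights are trainable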

Please read the rest of my documentation here for more information on the hyperparameters.

The Results

Compared to training the same configuration on 2x RTX A6000 GPUs, I estimated a 22%+ increase in speed alone. Now, how good is the model compared to the original? Not as good, but if I had never used the original model, or had done a completely blind test, it would be harder to ascertain the differences.

The quality drop is not significant enough to warrant actual comparisons here, as the models are simply different. I have not been able to figure out the exact reason behind the results, other than that the 4bit model follows instructions… not quite as well. Now, that does not mean 4bit QLoRA is bad, quite the opposite, actually. I’ve demonstrated that LoRAs are comparable to finetunes when given great datasets.

Conclusion

Should you bother? - It’s really up to you and what your goal is. I enjoy experimenting with bleeding-edge tech, and while this might not be as good as an FP16 model, neither is the quantized 4bit model that most of you use anyway. It’s just another method, and it should be seen as that, not simply “worse” because it utilizes lower precision. In the end, if you’re going to train, you should have a dataset prepared.

Will I personally do it again? - Yes! I am planning to try it on LLaMA 65B and see the results from a larger parameter model. Though, my current priority is extending the context length of the Chronos models.

If you are interested in LLMs, Deep Learning, ML, etc., please join the Zeus Labs Discord Server

Note that this post may be updated with a follow-up in the future. Thanks for reading.

Elinas