What is Bitsandbytes?

bitsandbytes is a lightweight wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions.

Benefits of using Bitsandbytes

  • Easy to use: Bitsandbytes remains the simplest method for model quantization because it does not require calibrating the quantized model with input data (an approach also known as zero-shot quantization). This enables out-of-the-box quantization of any model that contains torch.nn.Linear modules. Whenever a new architecture is introduced in the Transformers library, users can leverage bitsandbytes quantization immediately, as long as the architecture works with Accelerate's device_map set to "auto". Quantization happens directly at model load time, with no additional post-processing or preparatory steps, and with only a minor impact on performance.
  • No trade-offs from adapter merges: Training adapters on top of a quantized base model allows the adapters to be merged seamlessly with the base model for deployment, preserving optimal inference performance. It is also possible to merge the adapters into a dequantized version of the model (see the sketch after this list).
  • Supports AMD GPUs out of the box.
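
To illustrate the adapter point above, here is a minimal sketch of merging a LoRA adapter into a dequantized (full-precision) copy of the base model using the PEFT library. The adapter path "my-lora-adapter" is a hypothetical placeholder, not a real repository.

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model in full precision (the "dequantized" version).
base = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b", torch_dtype=torch.float16, device_map="auto"
)

# Attach a LoRA adapter that was trained on top of the quantized model.
# "my-lora-adapter" is a hypothetical adapter path.
model = PeftModel.from_pretrained(base, "my-lora-adapter")

# Fold the adapter weights into the base model for deployment.
model = model.merge_and_unload()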

Trade-offs from using Bitsandbytes

While memory usage is reduced, the trade-off of using Bitsandbytes is increased inference latency.

Installation

Since bitsandbytes requires a GPU, it will not run well in a pure CPU environment (although you can still install it with pip). For more details on how to set up a GPU environment for bitsandbytes, please refer to the articles linked below.
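
As a quick sanity check (a minimal sketch, assuming a CUDA-capable GPU and an install via "pip install bitsandbytes accelerate transformers"), the snippet below verifies that the key packages import and that a GPU is visible:

import torch
import bitsandbytes as bnb
import transformers

print(torch.cuda.is_available())   # should print True on a working GPU setup
print(bnb.__version__)             # confirms bitsandbytes imports correctly
print(transformers.__version__)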

Loading a Model in 4-bit for the Smallest Memory Footprint

Loading a model in 4-bit reduces memory usage to roughly 25%-30% of its original size. This can be done by setting the load_in_4bit=True argument, as the two examples below show.


Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> # Example of using pipeline
>>> import torch
>>> from transformers import pipeline
>>> from datetime import datetime
>>>
>>> print("torch.cuda.memory_allocated: %fGB"%(torch.cuda.memory_allocated(0)/1024/1024/1024))
torch.cuda.memory_allocated: 0.000000GB
>>> start = datetime.now()
>>>
>>> pipe = pipeline(model='facebook/opt-1.3b', device_map="auto", model_kwargs={"load_in_4bit": True})
>>> output = pipe("This is a cool example!", do_sample=True, top_p=0.95)
>>> print(output)
[{'generated_text': "This is a cool example! Thank you.\nNo problem, I'm glad to help."}]
>>>
>>> print("torch.cuda.memory_allocated: %fGB"%(torch.cuda.memory_allocated(0)/1024/1024/1024))
torch.cuda.memory_allocated: 0.841606GB
>>> end = datetime.now()
>>>
>>> delta = end - start
>>> print('Difference in seconds:', delta.total_seconds())
Difference in seconds: 5.135089
>>> exit()



Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> # Example of using AutoModel
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> from datetime import datetime
>>>
>>> model_name = 'facebook/opt-1.3b'
>>>
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>> model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True)
>>> inputs = tokenizer("This is a cool example!", return_tensors="pt")
>>>
>>> start = datetime.now()
>>> outputs = model.generate(**inputs)
>>> end = datetime.now()
>>>
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
["This is a cool example! I'm not sure if it's a bug or not, but"]
>>> print(f'Memory used by model: {round(model.get_memory_footprint()/1024/1024/1024, 2)} GB')
Memory used by model: 0.76 GB
>>>
>>> delta = end - start
>>> print('Difference in seconds:', delta.total_seconds())
Difference in seconds: 0.705603
>>> exit()

Loading a Model in 8-bit if 4-bit is not a must

Loading a model in 8-bit reduces memory usage to roughly 50% of its original size. Technically, the 8-bit model is less aggressively compressed and therefore regains some speed. The difference becomes more noticeable with bigger models.


Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> # Example of using AutoModel
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> from datetime import datetime
>>>
>>> model_name = 'facebook/opt-1.3b'
>>>
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>> model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
>>> inputs = tokenizer("This is a cool example!", return_tensors="pt")
>>>
>>> start = datetime.now()
>>> outputs = model.generate(**inputs)
>>> end = datetime.now()
>>>
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
["This is a cool example! I'm not sure if it's a good idea to have a"]
>>> print(f'Memory used by model: {round(model.get_memory_footprint()/1024/1024/1024, 2)} GB')
Memory used by model: 1.33 GB
>>>
>>> delta = end - start
>>> print('Difference in seconds:', delta.total_seconds())
Difference in seconds: 1.796418
>>> exit()

Improve Inference Speed by Overriding the Quantization Configuration

While a 4-bit model gives you the smallest memory footprint, it trades off the most performance. To gain back some speed, you can set bnb_4bit_compute_dtype to a different value, such as torch.float16 (used in the example below) or torch.bfloat16. This approach might not work for all models; please review the model card for details.

Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> # Example of using AutoModel
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
>>> from datetime import datetime
>>>
>>> model_name = 'facebook/opt-1.3b'
>>>
>>> bnb_config = BitsAndBytesConfig(
...     load_in_4bit=True,
...     bnb_4bit_compute_dtype=torch.float16,
... )
>>>
>>> start = datetime.now()
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>>
>>> model = AutoModelForCausalLM.from_pretrained(
...     model_name,
...     quantization_config=bnb_config,
...     device_map="auto"
... )
>>>
>>> inputs = tokenizer("This is a cool example!", return_tensors="pt")
>>>
>>> start = datetime.now()
>>> outputs = model.generate(**inputs)
>>> end = datetime.now()
>>>
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
["This is a cool example! I'm not sure if it's a bug or not, but"]
>>>
>>> print(f'Memory used by model: {round(model.get_memory_footprint()/1024/1024/1024, 2)} GB')
Memory used by model: 0.76 GB
>>>
>>>
>>> delta = end - start
>>> print('Difference in seconds:', delta.total_seconds())
Difference in seconds: 0.67244
>>>
>>> exit()

Using NF4 Data Type

Adding bnb_4bit_quant_type="nf4" to the quantization configuration switches to the NF4 (normalized float 4) data type, which is designed for weights that follow a normal distribution.

Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> # Example of using AutoModel
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
>>> from datetime import datetime
>>>
>>> model_name = 'facebook/opt-1.3b'
>>>
>>> bnb_config = BitsAndBytesConfig(
...      load_in_4bit=True,
...      bnb_4bit_quant_type="nf4",
...      bnb_4bit_compute_dtype=torch.float16,
... )
>>>
>>> start = datetime.now()
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>>
>>> model = AutoModelForCausalLM.from_pretrained(
...     model_name,
...     quantization_config=bnb_config,
...     device_map="auto"
... )
>>>
>>> inputs = tokenizer("This is a cool example!", return_tensors="pt")
>>>
>>> start = datetime.now()
>>> outputs = model.generate(**inputs)
>>> end = datetime.now()
>>>
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
["This is a cool example! I'm not sure if it's a good idea to have a"]
>>>
>>> print(f'Memory used by model: {round(model.get_memory_footprint()/1024/1024/1024, 2)} GB')
Memory used by model: 0.76 GB
>>>
>>> delta = end - start
>>> print('Difference in seconds:', delta.total_seconds())
Difference in seconds: 0.673312
>>>
>>> exit()

Use Double Quantization to Further Improve Memory Efficiency

Another proven technique is double quantization, which also quantizes the quantization constants themselves and therefore yields an even smaller memory footprint, especially for larger models.


Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> # Example of using AutoModel
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
>>> from datetime import datetime
>>>
>>> model_name = 'facebook/opt-1.3b'
>>>
>>> bnb_config = BitsAndBytesConfig(
...      load_in_4bit=True,
...      bnb_4bit_quant_type="nf4",
...      bnb_4bit_compute_dtype=torch.float16,
...      bnb_4bit_use_double_quant=True
... )
>>>
>>> start = datetime.now()
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>>
>>> model = AutoModelForCausalLM.from_pretrained(
...     model_name,
...     quantization_config=bnb_config,
...     device_map="auto"
... )
>>>
>>> inputs = tokenizer("This is a cool example!", return_tensors="pt")
>>>
>>> start = datetime.now()
>>> outputs = model.generate(**inputs)
>>> end = datetime.now()
>>>
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
["This is a cool example! I'm not sure if it's a good idea to have a"]
>>>
>>> print(f'Memory used by model: {round(model.get_memory_footprint()/1024/1024/1024, 2)} GB')
Memory used by model: 0.76 GB
>>>
>>> delta = end - start
>>> print('Difference in seconds:', delta.total_seconds())
Difference in seconds: 0.720076
>>> exit()

Offloading Between CPU and GPU

Another advantage of using bitsandbytes is that you can offload weights across the GPU and CPU. This is very helpful when loading a larger model with limited GPU memory. To enable CPU offloading, set llm_int8_enable_fp32_cpu_offload=True; the offloaded modules are kept on the CPU in float32.


Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> # Example of using AutoModel
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
>>> from datetime import datetime
>>>
>>> model_name = 'facebook/opt-6.7b'
>>>
>>> bnb_config = BitsAndBytesConfig(
...      load_in_4bit=True,
...      bnb_4bit_quant_type="nf4",
...      bnb_4bit_compute_dtype=torch.float32,
...      bnb_4bit_use_double_quant=True,
...      llm_int8_enable_fp32_cpu_offload=True
... )
>>>
>>> start = datetime.now()
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>>
>>> model = AutoModelForCausalLM.from_pretrained(
...     model_name,
...     quantization_config=bnb_config,
...     device_map="auto"
... )
>>>
>>> inputs = tokenizer("This is a cool example!", return_tensors="pt")
>>>
>>> start = datetime.now()
>>> outputs = model.generate(**inputs)
>>> end = datetime.now()
>>>
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
["This is a cool example!  I'm not sure if it's a bug or not,"]
>>>
>>> print(f'Memory used by model: {round(model.get_memory_footprint()/1024/1024/1024, 2)} GB')
Memory used by model: 3.4 GB
>>>
>>> delta = end - start
>>> print('Difference in seconds:', delta.total_seconds())
Difference in seconds: 1.841406
>>> exit()
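
For finer control over how much of the model stays on the GPU, you can also combine the offload flag with a max_memory budget when loading the model. This is a minimal sketch; the memory limits are arbitrary examples, not recommendations:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

# Modules that do not fit within the GPU budget are placed on the CPU in fp32.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b",
    quantization_config=bnb_config,
    device_map="auto",
    max_memory={0: "4GiB", "cpu": "16GiB"},  # example budgets; adjust to your hardware
)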

Play with Outlier Threshold

llm_int8_threshold lets you tune the outlier threshold used by LLM.int8(); a well-chosen value can improve inference speed. Please see the LLM.int8() paper (Dettmers et al., 2022) for the underlying theory.
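
The threshold is passed through BitsAndBytesConfig when loading in 8-bit. This is a minimal sketch; 6.0 is the default value and is shown here only to illustrate where the override goes:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # default outlier threshold; tune per model
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    quantization_config=bnb_config,
    device_map="auto",
)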

Skipping modules during conversion

llm_int8_skip_modules is another advanced parameter that lets you exclude specific modules from 8-bit conversion.
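
Keeping the output head in full precision is a common choice; a minimal sketch:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["lm_head"],  # leave the lm_head module unquantized
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    quantization_config=bnb_config,
    device_map="auto",
)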

Fine Tuning

Bitsandbytes quantization works well with fine-tuning: you can load a base model with any of the configurations above and train lightweight adapters on top of it. A minimal sketch follows.
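
The sketch below follows the common QLoRA-style recipe with the PEFT library; the LoRA hyperparameters are illustrative defaults, not tuned values:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training (gradient checkpointing, casting norms, etc.).
model = prepare_model_for_kbit_training(model)

# Attach small trainable LoRA adapters; the 4-bit base weights stay frozen.
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()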

Let's pause here. If you have any comments or suggestions, please leave a message.