What is Bitsandbytes?
bitsandbytes is a lightweight wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions, as the project's own documentation describes it.
Benefits of using Bitsandbytes
- Easy to use: Bitsandbytes remains the simplest method for model quantization, as it eliminates the need to calibrate the quantized model with input data (a process known as zero-shot quantization). It enables out-of-the-box quantization of any model that contains torch.nn.Linear modules. Whenever a new architecture is introduced in the Transformers library, users can leverage bitsandbytes quantization immediately, provided the architecture is compatible with Accelerate's device_map="auto". Quantization happens directly at model load time, with no additional post-processing or preparatory steps and only a minor impact on performance.
- No trade-offs from adapter merges: Training adapters on top of a quantized base model allows the adapters to be merged seamlessly into the base model for deployment, preserving optimal inference performance. The adapters can also be combined with a dequantized version of the model (see the sketch after this list).
- Supports AMD GPUs out of the box
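Here is a minimal sketch (not from the original article) of merging a LoRA adapter back into the full-precision base model with the peft library; the adapter repository name is hypothetical.

# Merge a LoRA adapter into the full-precision base model.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, "my-org/opt-1.3b-lora-adapter")  # hypothetical adapter repo
merged = model.merge_and_unload()  # folds the adapter weights into the base weights
merged.save_pretrained("opt-1.3b-merged")

The merged model behaves like a regular Transformers model, so it can be re-quantized at load time with any of the options shown below.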
Trade-offs from using Bitsandbytes
While memory usage goes down, the trade-off of using Bitsandbytes is increased inference latency.
Installation
Since bitsandbytes requires a GPU, it won't run well in a pure CPU environment (although pip will happily install it there). For more details on how to set up a GPU environment for bitsandbytes, please refer to the articles below.
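In most setups, a plain pip install is enough once the CUDA drivers are in place; accelerate and transformers are also assumed by the examples in this post:

pip install bitsandbytes accelerate transformers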
Loading a Model in 4-bit for the Smallest Memory Footprint
Loading a model in 4-bit reduces memory usage to roughly 25%-30% of the original size. This can be done by setting the load_in_4bit=True argument.
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> # Example of using pipeline
>>> import torch
>>> from transformers import pipeline
>>> from datetime import datetime
>>>
>>> print("torch.cuda.memory_allocated: %fGB"%(torch.cuda.memory_allocated(0)/1024/1024/1024))
torch.cuda.memory_allocated: 0.000000GB
>>> start = datetime.now()
>>>
>>> pipe = pipeline(model='facebook/opt-1.3b', device_map="auto", model_kwargs={"load_in_4bit": True})
>>> output = pipe("This is a cool example!", do_sample=True, top_p=0.95)
>>> print(output)
[{'generated_text': "This is a cool example! Thank you.\nNo problem, I'm glad to help."}]
>>>
>>> print("torch.cuda.memory_allocated: %fGB"%(torch.cuda.memory_allocated(0)/1024/1024/1024))
torch.cuda.memory_allocated: 0.841606GB
>>> end = datetime.now()
>>>
>>> delta = end - start
>>> print('Difference in seconds:', delta.total_seconds())
Difference in seconds: 5.135089
>>> exit()
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> # Example of using AutoModel
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> from datetime import datetime
>>>
>>> model_name = 'facebook/opt-1.3b'
>>>
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>> model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True)
>>> inputs = tokenizer("This is a cool example!", return_tensors="pt")
>>>
>>> start = datetime.now()
>>> outputs = model.generate(**inputs)
>>> end = datetime.now()
>>>
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
["This is a cool example! I'm not sure if it's a bug or not, but"]
>>> print(f'Memory used by model: {round(model.get_memory_footprint()/1024/1024/1024, 2)} GB')
Memory used by model: 0.76 GB
>>>
>>> delta = end - start
>>> print('Difference in seconds:', delta.total_seconds())
Difference in seconds: 0.705603
>>> exit()
Loading a Model in 8-bit if 4-bit is not a must
Loading a model in 8-bit reduces memory usage to roughly 50% of the original size. Technically, an 8-bit model applies less compression and therefore regains some speed. The difference becomes more noticeable with larger models.
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> # Example of using AutoModel
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> from datetime import datetime
>>>
>>> model_name = 'facebook/opt-1.3b'
>>>
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>> model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
>>> inputs = tokenizer("This is a cool example!", return_tensors="pt")
>>>
>>> start = datetime.now()
>>> outputs = model.generate(**inputs)
>>> end = datetime.now()
>>>
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
["This is a cool example! I'm not sure if it's a good idea to have a"]
>>> print(f'Memory used by model: {round(model.get_memory_footprint()/1024/1024/1024, 2)} GB')
Memory used by model: 1.33 GB
>>>
>>> delta = end - start
>>> print('Difference in seconds:', delta.total_seconds())
Difference in seconds: 1.796418
>>> exit()
Improve Inference Speed by Overriding the Quantization Configuration
While a 4-bit model gives you the smallest memory footprint, it also trades away the most speed. To regain some of it, you can set bnb_4bit_compute_dtype to a different value, such as torch.float16 (as in the example below) or torch.bfloat16. This approach might not work for all models; please review the model card for details.
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> # Example of using AutoModel
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
>>> from datetime import datetime
>>>
>>> model_name = 'facebook/opt-1.3b'
>>>
>>> bnb_config = BitsAndBytesConfig(
... load_in_4bit=True,
... bnb_4bit_compute_dtype=torch.float16,
... )
>>>
>>> start = datetime.now()
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>>
>>> model = AutoModelForCausalLM.from_pretrained(
... model_name,
... quantization_config=bnb_config,
... device_map="auto"
... )
>>>
>>> inputs = tokenizer("This is a cool example!", return_tensors="pt")
>>>
>>> start = datetime.now()
>>> outputs = model.generate(**inputs)
>>> end = datetime.now()
>>>
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
["This is a cool example! I'm not sure if it's a bug or not, but"]
>>>
>>> print(f'Memory used by model: {round(model.get_memory_footprint()/1024/1024/1024, 2)} GB')
Memory used by model: 0.76 GB
>>>
>>>
>>> delta = end - start
>>> print('Difference in seconds:', delta.total_seconds())
Difference in seconds: 0.67244
>>>
>>> exit()
Using NF4 Data Type
Adding bnb_4bit_quant_type="nf4" to the quantization configuration loads the weights in the NF4 (4-bit NormalFloat) data type, which is designed for weights that follow a normal distribution.
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> # Example of using AutoModel
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
>>> from datetime import datetime
>>>
>>> model_name = 'facebook/opt-1.3b'
>>>
>>> bnb_config = BitsAndBytesConfig(
... load_in_4bit=True,
... bnb_4bit_quant_type="nf4",
... bnb_4bit_compute_dtype=torch.float16,
... )
>>>
>>> start = datetime.now()
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>>
>>> model = AutoModelForCausalLM.from_pretrained(
... model_name,
... quantization_config=bnb_config,
... device_map="auto"
... )
>>>
>>> inputs = tokenizer("This is a cool example!", return_tensors="pt")
>>>
>>> start = datetime.now()
>>> outputs = model.generate(**inputs)
>>> end = datetime.now()
>>>
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
["This is a cool example! I'm not sure if it's a good idea to have a"]
>>>
>>> print(f'Memory used by model: {round(model.get_memory_footprint()/1024/1024/1024, 2)} GB')
Memory used by model: 0.76 GB
>>>
>>> delta = end - start
>>> print('Difference in seconds:', delta.total_seconds())
Difference in seconds: 0.673312
>>>
>>> exit()
Use Double Quantization to Further Improve Memory Efficiency
Double quantization quantizes the quantization constants themselves, yielding an even smaller memory footprint, which is especially useful when dealing with larger models.
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> # Example of using AutoModel
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
>>> from datetime import datetime
>>>
>>> model_name = 'facebook/opt-1.3b'
>>>
>>> bnb_config = BitsAndBytesConfig(
... load_in_4bit=True,
... bnb_4bit_quant_type="nf4",
... bnb_4bit_compute_dtype=torch.float16,
... bnb_4bit_use_double_quant=True
... )
>>>
>>> start = datetime.now()
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>>
>>> model = AutoModelForCausalLM.from_pretrained(
... model_name,
... quantization_config=bnb_config,
... device_map="auto"
... )
>>>
>>> inputs = tokenizer("This is a cool example!", return_tensors="pt")
>>>
>>> start = datetime.now()
>>> outputs = model.generate(**inputs)
>>> end = datetime.now()
>>>
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
["This is a cool example! I'm not sure if it's a good idea to have a"]
>>>
>>> print(f'Memory used by model: {round(model.get_memory_footprint()/1024/1024/1024, 2)} GB')
Memory used by model: 0.76 GB
>>>
>>> delta = end - start
>>> print('Difference in seconds:', delta.total_seconds())
Difference in seconds: 0.720076
>>> exit()
Offloading Between CPU and GPU
Another advantage of bitsandbytes is that you can offload weights between GPU and CPU. This is very helpful when loading a larger model with limited GPU memory. To enable CPU offloading, set llm_int8_enable_fp32_cpu_offload=True.
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> # Example of using AutoModel
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
>>> from datetime import datetime
>>>
>>> model_name = 'facebook/opt-6.7b'
>>>
>>> bnb_config = BitsAndBytesConfig(
... load_in_4bit=True,
... bnb_4bit_quant_type="nf4",
... bnb_4bit_compute_dtype=torch.float32,
... bnb_4bit_use_double_quant=True,
... llm_int8_enable_fp32_cpu_offload=True
... )
>>>
>>> start = datetime.now()
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>>
>>> model = AutoModelForCausalLM.from_pretrained(
... model_name,
... quantization_config=bnb_config,
... device_map="auto"
... )
>>>
>>> inputs = tokenizer("This is a cool example!", return_tensors="pt")
>>>
>>> start = datetime.now()
>>> outputs = model.generate(**inputs)
>>> end = datetime.now()
>>>
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
["This is a cool example! I'm not sure if it's a bug or not,"]
>>>
>>> print(f'Memory used by model: {round(model.get_memory_footprint()/1024/1024/1024, 2)} GB')
Memory used by model: 3.4 GB
>>>
>>> delta = end - start
>>> print('Difference in seconds:', delta.total_seconds())
Difference in seconds: 1.841406
>>> exit()
Play with the Outlier Threshold
llm_int8_threshold lets you experiment with the outlier threshold used by LLM.int8(). A well-chosen value can improve inference speed. Please see the LLM.int8() paper for the underlying theory.
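As a rough sketch (not from the original transcripts), the threshold is passed through BitsAndBytesConfig; 6.0 is the library default, and lower values route more activation dimensions through the fp16 outlier path.

# Tune the LLM.int8() outlier threshold (6.0 is the default value).
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    quantization_config=bnb_config,
    device_map="auto",
)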
Skipping Modules During Conversion
llm_int8_skip_modules is another advanced parameter, which lets you keep specific modules from being converted to 8-bit.
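A minimal sketch of keeping a module in full precision; "lm_head" is used here because it is the output head's module name in OPT-style models.

# Keep the output head out of the 8-bit conversion.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["lm_head"],
)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    quantization_config=bnb_config,
    device_map="auto",
)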
Fine Tuning
Bitsandbytes quantization supports fine-tuning well. You can combine the techniques above with adapter-based fine-tuning on top of a quantized base model for better performance.
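Below is a minimal sketch of adapter-based (QLoRA-style) fine-tuning on a 4-bit base model using the peft library; the LoRA hyperparameters and target module names are illustrative, not tuned.

# Fine-tune LoRA adapters on top of a 4-bit quantized base model.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # casts norms, enables gradient checkpointing

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections in OPT
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Train with the regular Trainer or your own training loop; only the adapter
# weights receive gradients, so memory usage stays close to inference levels.

After training, the adapter can be merged as shown in the Benefits section above.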
Let's pause here. Should you have any comments or suggestions, please leave a message.