As mentioned in the previous article, llama.cpp might not be the fastest of the various LLM inference stacks Intel provides. This article continues with a few more tests.

Llama.cpp

First, we run llama.cpp again with the same prompt and record the data.

main -m Meta-Llama-3-70B-Instruct-Q4_K_M.gguf -n 128 --prompt "OpenVINO is" -t 8 -e -ngl 999 --color
Peak power consumption is 56 watts
1.44 tokens/second

🤗 Hugging Face Transformers + IPEX-LLM

Although the llama.cpp build used here also relies on IPEX-LLM to accelerate computation on the Intel iGPU, we will still try IPEX-LLM directly from Python to see the difference.

The code, ipex-llm-llama3.py, is as follows:

# Modified by Wei Lu(mailwlu@gmail.com)
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import torch
import time
import argparse

from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

# You can tune the prompt for your own model;
# the prompt format here follows https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3
DEFAULT_SYSTEM_PROMPT = """\
"""

def get_prompt(user_input: str, chat_history: list[tuple[str, str]],
               system_prompt: str) -> str:
    prompt_texts = [f'<|begin_of_text|>']

    if system_prompt != '':
        prompt_texts.append(f'<|start_header_id|>system<|end_header_id|>\n\n{system_prompt}<|eot_id|>')

    for history_input, history_response in chat_history:
        prompt_texts.append(f'<|start_header_id|>user<|end_header_id|>\n\n{history_input.strip()}<|eot_id|>')
        prompt_texts.append(f'<|start_header_id|>assistant<|end_header_id|>\n\n{history_response.strip()}<|eot_id|>')

    prompt_texts.append(f'<|start_header_id|>user<|end_header_id|>\n\n{user_input.strip()}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n')
    return ''.join(prompt_texts)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for Llama3 model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="meta-llama/Meta-Llama-3-70B-Instruct",
                        help='The huggingface repo id for the Llama3 (e.g. `meta-llama/Meta-Llama-3-70B-Instruct`) to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="OpenVINO is",
                        help='Prompt to infer')
    parser.add_argument('--n-predict', type=int, default=128,
                        help='Max tokens to predict')
    parser.add_argument('--bit', type=int, default=4,
                        help='load in 4,5 or 8 bit')

    args = parser.parse_args()
    model_path = args.repo_id_or_model_path

    if args.bit == 4:
        # Load the model in 4 bit,
        # which converts the relevant layers in the model into INT4 format.
        # For LLMs on Intel iGPUs under Windows, setting `cpu_embedding=True` in from_pretrained
        # lets the memory-intensive embedding layer run on the CPU instead of the iGPU; it did not help in this case.
        model = AutoModelForCausalLM.from_pretrained(model_path,
                                                    load_in_4bit=True,
                                                    optimize_model=True,
                                                    trust_remote_code=True,
                                                    use_cache=True)
    else:
        model = AutoModelForCausalLM.from_pretrained(model_path,
                                                    load_in_low_bit=f"sym_int{args.bit}",  # see https://github.com/intel-analytics/ipex-llm/blob/10e480ee96f1746392ef9baca93585435f6f9523/python/llm/src/ipex_llm/transformers/model.py#L139
                                                    # cpu_embedding=True did not save VRAM here, so it is left out
                                                    optimize_model=True,
                                                    trust_remote_code=True,
                                                    use_cache=True)

    model = model.half().to('xpu')

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # here the terminators refer to https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct#transformers-automodelforcausallm
    terminators = [
        tokenizer.eos_token_id,
        tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    ]

    # Generate predicted tokens
    with torch.inference_mode():
        prompt = get_prompt(args.prompt, [], system_prompt=DEFAULT_SYSTEM_PROMPT)
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
        # ipex_llm model needs a warmup, then inference time can be accurate
        output = model.generate(input_ids,
                                eos_token_id=terminators,
                                max_new_tokens=20)

        # start inference
        st = time.time()
        output = model.generate(input_ids,
                                eos_token_id=terminators,
                                max_new_tokens=args.n_predict)
        torch.xpu.synchronize()
        end = time.time()
        output = output.cpu()
        output_str = tokenizer.decode(output[0], skip_special_tokens=False)
        print(f'Inference time: {end-st} s, tokens: {len(output[0])}, t/s:{len(output[0])/(end-st)}')
        print('-'*20, 'Prompt', '-'*20)
        print(prompt)
        print('-'*20, 'Output (skip_special_tokens=False)', '-'*20)
        print(output_str)

Run the script (it defaults to the 70B model):

python ipex-llm-llama3.py
Peak power consumption is 54 watts
1.91 tokens/second

Even with the warmup generate() call commented out, it still reaches 1.86 tokens/second, noticeably higher than llama.cpp's 1.44.
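
One caveat on these numbers: the script computes tokens/second over len(output[0]), which also counts the prompt tokens. A variant counting only the newly generated tokens would look like the sketch below, reusing the script's variables.

# sketch: measure throughput over generated tokens only
generated = output.shape[1] - input_ids.shape[1]
print(f'Generated tokens/second: {generated / (end - st):.2f}')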

OpenVINO

The third option is OpenVINO, which can be used with 🤗 Transformers through Optimum.

I tried two pre-converted models from other users, https://huggingface.co/nsbendre25/llama-3-70B-Instruct-ov-fp16-int4-asym/ and https://huggingface.co/fakezeta/Meta-Llama-3-70B-Instruct-ov-int4, but both failed due to insufficient memory.
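
For reference, loading such a pre-converted model through Optimum follows the same pattern as the 8B code shown later; a minimal sketch of the attempt (using one of the two repo ids above), which on this machine ends in an out-of-memory error:

from optimum.intel.openvino import OVModelForCausalLM

# try to place the pre-converted INT4 70B model on the iGPU;
# this is where the load runs out of memory on this machine
model = OVModelForCausalLM.from_pretrained(
    "fakezeta/Meta-Llama-3-70B-Instruct-ov-int4",
    device="GPU",
)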


To compare all three methods, we will switch to the 8B Llama-3 model.

Llama.cpp

The model is from https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF.

main -m Meta-Llama-3-8B-Instruct-Q8_0.gguf -n 128 --prompt "OpenVINO is" -t 8 -e -ngl 999 --color
Peak power consumption is 44 watts
8.35 tokens/second

🤗 Hugging Face Transformers + IPEX-LLM

Run the same script against the 8B model with sym_int8 quantization:

python ipex-llm-llama3.py --repo-id-or-model-path=meta-llama/Meta-Llama-3-8B-Instruct --bit=8
sym_int8 quantization
Peak power consumption is 47 watts
9.0 tokens/second

OpenVINO

The model is from https://huggingface.co/fakezeta/llama-3-8b-instruct-ov-int8.

The code is as follows.

from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline
import time

# Load the pre-converted OpenVINO INT8 model;
# the id here points to the local folder downloaded from Hugging Face
model_id = "fakezeta_llama-3-8b-instruct-ov-int8"
model = OVModelForCausalLM.from_pretrained(model_id, device="GPU")

# Inference
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
phrase = "OpenVINO is"
# short first run before timing (lets the pipeline and the GPU warm up)
results = pipe(phrase, max_length=20)
print(results)

# timed run; note that max_length also counts the prompt tokens
tokens = 128
tt = time.time()
results = pipe(phrase, max_length=tokens)
ts = tokens / (time.time() - tt)
print(results)
print(f"{ts:2.2f} t/s")
Peak power consumption is 42 watts
7.48 tokens/second vs llama.cpp's 8.35

In summary, 🤗 Transformers + IPEX-LLM delivers the best performance here, which might be related to the quantization format. Further comparisons are needed to evaluate the precision loss of the various quantization formats.
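
One way to quantify that precision loss, at least on the GGUF side, could be llama.cpp's perplexity tool; a rough sketch, assuming a local WikiText-2 test file named wiki.test.raw:

perplexity -m Meta-Llama-3-8B-Instruct-Q8_0.gguf -f wiki.test.raw -ngl 999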


Llama.cpp offers flexibility in allocating layers between the CPU and the GPU. With half of the CPU memory often remaining free, this leaves room to experiment with 5-bit or higher quantization, and with models larger than the 70B one.
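
For example, a hypothetical split that offloads only 40 layers to the iGPU and keeps the rest in CPU memory would look like this (the Q5_K_M file name is purely an illustration):

main -m Meta-Llama-3-70B-Instruct-Q5_K_M.gguf -n 128 --prompt "OpenVINO is" -t 8 -e -ngl 40 --color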

Additionally, Intel provides hardware acceleration for ONNX + DirectML. This will be included in the next experiment. Stay tuned…