Intel ARC Support #1575

Closed
linus378 opened this issue Apr 26, 2023 · 80 comments
Labels
enhancement (New feature or request), stale

Comments

@linus378

I was wondering whether Intel Arc GPUs work with this. I could not find anything about it here.

@linus378 linus378 added the enhancement New feature or request label Apr 26, 2023
@linus378
Author

Also, I wonder if this could support two GPUs, such as an Arc A770 and an RX 6600, so you don't have to offload anything into RAM.

@dan9070

dan9070 commented Apr 27, 2023

To my knowledge, it doesn't currently support oneAPI or OpenVINO. I own an Intel Arc GPU myself.

@mmccool

mmccool commented May 2, 2023

It doesn't, unfortunately. I really wish it did though as I have a dual A770 system myself (and these cards have a lot of VRAM for the price, and also good low-precision AI accelerators, etc). For now I'm running on CPU which is, of course, horribly slow.

However, one issue is that Intel's support for pytorch on its GPUs needs a special version based on pytorch 1.10 (see https://www.intel.com/content/www/us/en/developer/articles/technical/introducing-intel-extension-for-pytorch-for-gpus.html) but this system uses pytorch 2.0.0. As soon as Intel gpu support for pytorch 2.0.0 comes out though I'm hoping support can be extended in this system (if I can find time maybe I'll even be able to contribute some patches). For CPU pytorch 2.0.0 is already supported: https://intel.github.io/intel-extension-for-pytorch/latest/tutorials/releases.html

In the meantime, it would be great if the readme could at least be updated to say WHAT GPUs are supported.

BTW The one-click installer also fails if you don't have an NVIDIA GPU, even if you select "None". I had to go the git clone route.

@mmccool

mmccool commented May 2, 2023

Multi-GPU support for multiple Intel GPUs would, of course, also be nice. Multi-GPU is supported for other cards, so it should not (in theory) be a problem. I personally don't really care about mixing GPUs from different vendors, though :)

A bonus would be the ability to use Intel integrated graphics. They have limited VRAM, but may be good enough for some simple things.

@rattlecanblack

Would love to see this as well. With its power and amount of VRAM, the Arc is a great little card for those of us who do more compute work than gaming, especially considering the price.

@miraged3

Intel has released torch 2.0 support for Arc GPUs: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.0.110%2Bxpu

@itlackey

Does the release of pytorch 2 support move things forward for Arc support?

@oobabooga
Owner

I have created a pinned thread for Intel Arc discussion and welcome you to move the discussion there: #3761

To my knowledge, llama-cpp-python should work with GPU acceleration on Intel Arc as long as you compile it with CLBLAST. See https://github.com/oobabooga/text-generation-webui#amd-metal-intel-arc-and-cpus-without-avcx2

@itlackey

itlackey commented Sep 1, 2023

You rock! Thank you for all the hard work on this project!

@abhilash1910
Contributor

@oobabooga Intel Arc GPU support is in the pipeline; I will start the integration work myself in 2-3 weeks' time. There are some other items in the pipeline at Intel that we are covering, and we plan to add this for our GPUs soon.

@oobabooga
Owner

@abhilash1910 thanks for the info. For XPU inference on transformers, is it currently enough to do

model.to(torch.device('xpu'))

or similar, like here?

Does any special pytorch import command have to be made?

@itlackey

I found this while researching how this all works.

https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/examples.html

It looks like there shouldn't be much to change, but I'm new to LLM/AI development. So I may be missing something.

@oobabooga
Owner

Thanks @itlackey. I guess it should be a few changed lines (for the transformers loader):

  1. model = model.to("xpu") in modules.models.huggingface_loader
  2. return input_ids.to(torch.device('xpu')) in modules.text_generation.encode.

It would be nice if someone could test this.
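
The two changes above amount to choosing a device once and moving both the model and the input IDs to it. Below is a minimal sketch of that selection logic. The helper name pick_device_name is hypothetical and not part of text-generation-webui, and the CPU fallback is an assumption of this sketch (the existing code falls back to CUDA). It takes the torch module as an argument so it can be tried with a stand-in object on a machine without PyTorch:

```python
from types import SimpleNamespace


def pick_device_name(torch_module):
    """Return 'xpu', 'mps', 'cuda' or 'cpu' depending on what the module reports.

    Hypothetical helper for illustration; not part of text-generation-webui.
    """
    # torch.xpu may be missing entirely, so guard the attribute access
    # before calling is_available().
    xpu = getattr(torch_module, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"

    backends = getattr(torch_module, "backends", None)
    mps = getattr(backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    cuda = getattr(torch_module, "cuda", None)
    if cuda is not None and cuda.is_available():
        return "cuda"

    # Assumption of this sketch: fall back to CPU rather than calling .cuda().
    return "cpu"


# Stand-in for a torch build that exposes only an XPU device.
fake_torch = SimpleNamespace(
    xpu=SimpleNamespace(is_available=lambda: True),
    cuda=SimpleNamespace(is_available=lambda: False),
)
print(pick_device_name(fake_torch))
```

With the real library, the result would presumably be used as model.to(pick_device_name(torch)) and input_ids.to(pick_device_name(torch)), after Intel's extension has been imported.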

@itlackey

I'll have time in a few days and will give it a shot. We may also need to make some changes to the installer and/or docker image to load the Intel libs and driver and recompile llama.cpp to get xpu to work.
I was able to do this with a docker image for FastChat and llama.cpp. We should be able to do the same for textgen.

@abhilash1910
Contributor

Good to know there is interest; thanks @oobabooga @itlackey (it helps to determine priority). I will start adding the changes tomorrow (25th Sept), and they can then be tested.

@oobabooga
Owner

Awesome @abhilash1910 :)

@Yorizuka

Yorizuka commented Sep 30, 2023

Hello, I just purchased an Intel Arc A770 16GB. When it arrives (in a week), I will be willing to help test stuff on Linux.
In general, if Arc GPUs become usable, they could be a really nice option, especially if multi-GPU is possible.

@Yorizuka

Yorizuka commented Oct 8, 2023

Small update: the GPU has arrived, and I will install it into my PC when I have time. I am excited to start playing around with LLMs on my own PC.

@Th-Underscore
Contributor

> Thanks @itlackey. I guess it should be a few changed lines (for the transformers loader):
>
>   1. model = model.to("xpu") in modules.models.huggingface_loader
>   2. return input_ids.to(torch.device('xpu')) in modules.text_generation.encode.
>
> It would be nice if someone could test this.

Doesn't change anything (yet). Using an Intel Iris Xe Graphics (not very good, I know) on WSL2. I'll test some more stuff out.

@Yorizuka

Not sure if this is user error (I'm new to this) or an actual issue, but I'm getting errors mentioning CUDA while trying to load a model. I find this really odd, especially because I chose the IPEX option during the first-time install with ./start_linux.sh.

2023-10-15 16:30:34 INFO:Loading HuggingFaceH4_zephyr-7b-alpha...
Loading checkpoint shards: 100%|██████████████████| 2/2 [02:04<00:00, 62.41s/it]
2023-10-15 16:32:39 ERROR:Failed to load the model.
Traceback (most recent call last):
  File "/home/yori/mnt/8tb/yori_home_big/text-generation-webui-1.7/modules/ui_model_menu.py", line 201, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name, loader)
  File "/home/yori/mnt/8tb/yori_home_big/text-generation-webui-1.7/modules/models.py", line 79, in load_model
    output = load_func_map[loader](model_name)
  File "/home/yori/mnt/8tb/yori_home_big/text-generation-webui-1.7/modules/models.py", line 141, in huggingface_loader
    model = model.cuda()
  File "/home/yori/mnt/8tb/yori_home_big/text-generation-webui-1.7/installer_files/env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2168, in cuda
    return super().cuda(*args, **kwargs)
  File "/home/yori/mnt/8tb/yori_home_big/text-generation-webui-1.7/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 918, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/yori/mnt/8tb/yori_home_big/text-generation-webui-1.7/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/home/yori/mnt/8tb/yori_home_big/text-generation-webui-1.7/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/home/yori/mnt/8tb/yori_home_big/text-generation-webui-1.7/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
    param_applied = fn(param)
  File "/home/yori/mnt/8tb/yori_home_big/text-generation-webui-1.7/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 918, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/home/yori/mnt/8tb/yori_home_big/text-generation-webui-1.7/installer_files/env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 298, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

@oobabooga
Owner

@Yorizuka can you try making these changes to modules/models.py and modules/text_generation.py?

diff --git a/modules/models.py b/modules/models.py
index 5bd9db74..c376c808 100644
--- a/modules/models.py
+++ b/modules/models.py
@@ -137,6 +137,8 @@ def huggingface_loader(model_name):
         if torch.backends.mps.is_available():
             device = torch.device('mps')
             model = model.to(device)
+        elif hasattr(torch, 'xpu') and torch.xpu.is_available():
+            model = model.to('xpu')
         else:
             model = model.cuda()
 
diff --git a/modules/text_generation.py b/modules/text_generation.py
index 0f24dc58..295c7cdd 100644
--- a/modules/text_generation.py
+++ b/modules/text_generation.py
@@ -132,6 +132,8 @@ def encode(prompt, add_special_tokens=True, add_bos_token=True, truncation_lengt
     elif torch.backends.mps.is_available():
         device = torch.device('mps')
         return input_ids.to(device)
+    elif hasattr(torch, 'xpu') and torch.xpu.is_available():
+        return input_ids.to('xpu')
     else:
         return input_ids.cuda()

@Yorizuka

I applied the patch, same issue.

2023-10-16 02:25:12 ERROR:Failed to load the model.
Traceback (most recent call last):
  File "/home/yori/mnt/8tb/yori_home_big/text-generation-webui/modules/ui_model_menu.py", line 201, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name, loader)
  File "/home/yori/mnt/8tb/yori_home_big/text-generation-webui/modules/models.py", line 79, in load_model
    output = load_func_map[loader](model_name)
  File "/home/yori/mnt/8tb/yori_home_big/text-generation-webui/modules/models.py", line 143, in huggingface_loader
    model = model.cuda()
  File "/home/yori/mnt/8tb/yori_home_big/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2168, in cuda
    return super().cuda(*args, **kwargs)
  File "/home/yori/mnt/8tb/yori_home_big/text-generation-webui/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 918, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/yori/mnt/8tb/yori_home_big/text-generation-webui/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/home/yori/mnt/8tb/yori_home_big/text-generation-webui/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/home/yori/mnt/8tb/yori_home_big/text-generation-webui/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
    param_applied = fn(param)
  File "/home/yori/mnt/8tb/yori_home_big/text-generation-webui/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 918, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/home/yori/mnt/8tb/yori_home_big/text-generation-webui/installer_files/env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 298, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

To confirm I applied the patch correctly, here is the git status:

On branch main
Your branch is up to date with 'origin/main'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   modules/models.py
	modified:   modules/text_generation.py

no changes added to commit (use "git add" and/or "git commit -a")

and my git rev-parse HEAD output is d331501ebc83e80c5d8f49c3e7c547730afff5c2

@Th-Underscore
Contributor

Th-Underscore commented Oct 16, 2023

print(f"generations: input_ids set! model class: {shared.model.__class__.__name__} | has xpu {hasattr(torch, 'xpu')}") in modules/text_generation.py prints: image (using a GGUF model; I'm trying to get CBLAS set up right now, which is probably why llama.cpp is messing up)

So I uninstalled the torch and torchvision installed by the one-click installer and reinstalled IPEX, resulting in an unidentified .so error. Putting export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/anaconda3/envs/tg/lib in ~/.bashrc fixes that.
But I still get the same message: image (I changed the message slightly, my apologies)

And to add to what @Yorizuka mentioned, trying to run a GPTQ model in Transformers also gives this error: RuntimeError: GPU is required to quantize or run quantize model. alongside WARNING:torch.cuda.is_available() returned False. This means that no GPU has been detected. Falling back to CPU mode.

@Yorizuka

Yorizuka commented Oct 16, 2023

I think the issue described in this comment #3761 (comment) is likely related to the issue we are having here.

@oobabooga
Owner

@TheRealUnderscore about the transformers error, can you check if it works after this commit?

8ea554b

@Th-Underscore
Contributor

Th-Underscore commented Oct 17, 2023

@oobabooga image
It seems the error is something to do with what Yorizuka said. hasattr(torch, 'xpu') returned false in my previous message, so it's not detecting PyTorch XPU whatsoever.
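
The patched branch is only taken when torch.xpu exists and reports a device, so it can help to check that directly in the same environment the web UI runs in. Below is a minimal diagnostic sketch. It assumes, as Intel's documentation describes for these releases, that importing intel_extension_for_pytorch is what makes torch.xpu appear. The function name probe_backends is hypothetical:

```python
import importlib


def probe_backends():
    """Report which accelerator backends this Python environment can see."""
    report = {
        "torch": None,
        "ipex": None,
        "has_xpu_attr": False,
        "xpu_available": False,
        "cuda_available": False,
    }
    try:
        torch = importlib.import_module("torch")
    except Exception:
        # PyTorch is missing or failed to load (for example a DLL/.so error).
        return report
    report["torch"] = torch.__version__

    try:
        # Assumption: on IPEX XPU builds this import registers torch.xpu.
        ipex = importlib.import_module("intel_extension_for_pytorch")
        report["ipex"] = ipex.__version__
    except Exception:
        pass  # IPEX is not installed or could not be loaded

    report["has_xpu_attr"] = hasattr(torch, "xpu")
    if report["has_xpu_attr"]:
        report["xpu_available"] = bool(torch.xpu.is_available())
    report["cuda_available"] = bool(torch.cuda.is_available())
    return report


for key, value in probe_backends().items():
    print(f"{key}: {value}")
```

If ipex stays None, or has_xpu_attr is False, the elif branch from the patch can never run and the loader falls through to model.cuda().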

These were my PyTorch settings (via print(torch.__config__.show())) before reinstalling 2.0.1a0:

PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

Image for more readable build settings

And these are my 2.0.1a0 settings. Now lots of things have changed:

PyTorch built with:
  - GCC 11.2
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2023.2-Product Build 20230613 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CXX_COMPILER=/opt/rh/gcc-toolset-11/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=1 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.1, USE_CUDA=OFF, USE_CUDNN=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

Image for more readable build settings
All the differences (I've no clue what could help and what couldn't, so I'm just listing them all):

  • GCC ver upgraded 9.3 -> 11.2
  • MKL ver upgraded
  • MKL-DNN ver downgraded
  • CPU extensions downgraded AVX512 -> AVX2
  • CUDA and CUDNN vers removed
  • CXX_COMPILER devtoolset-9 -> gcc-toolset-11
  • In CXX_FLAGS:
    • D_GLIBCXX_USE_CXX11_ABI state 0 -> 1
    • fabi-version removed
    • DLIBKINETO_NOCUPTI added
    • Werror=range-loop-construct added
    • Wunused-local-typedefs added
    • Wno-error=deprecated-declarations added
    • Wno-invalid-partial-specialization removed
    • Wno-unused-private-field removed
    • Wno-aligned-allocation-unavailable removed
    • Wno-error=redundant-decls added
  • TORCH_VERSION 2.1.0 -> 2.0.1
  • USE_CUDA ON -> OFF
  • USE_CUDNN ON -> OFF
  • USE_NCCL 1 -> OFF

Are any of these settings relevant to the GPU?

I'll keep looking into it on my own; I wouldn't be surprised if it was an installation error on my part.
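
Comparing two builds by eye is error-prone, so here is a small sketch that diffs the "Build settings" line of two torch.__config__.show() outputs. The function names are hypothetical, the sample strings are shortened excerpts of the two outputs above, and the parsing assumes the settings are separated by ", " as they are in those outputs:

```python
def parse_build_settings(config_text):
    """Extract the 'Build settings' KEY=VALUE pairs from torch.__config__.show() text."""
    marker = "Build settings:"
    start = config_text.find(marker)
    if start == -1:
        return {}
    rest = config_text[start + len(marker):]
    line = (rest.splitlines() or [""])[0]
    settings = {}
    for entry in line.split(", "):
        # Split on the first '=' only, so values such as CXX_FLAGS keep their own '='.
        key, sep, value = entry.strip().rstrip(",").partition("=")
        if sep and key:
            settings[key] = value.strip()
    return settings


def diff_settings(old, new):
    """Return {key: (old_value, new_value)} for every setting that differs."""
    changes = {}
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


# Shortened excerpts of the two outputs quoted above.
old_text = "  - Build settings: BLAS_INFO=mkl, CUDA_VERSION=12.1, TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_NCCL=1, "
new_text = "  - Build settings: BLAS_INFO=mkl, TORCH_VERSION=2.0.1, USE_CUDA=OFF, USE_NCCL=OFF, "

changes = diff_settings(parse_build_settings(old_text), parse_build_settings(new_text))
for key, (before, after) in changes.items():
    print(f"{key}: {before} -> {after}")
```

A value of None on one side means the setting is absent from that build. Note that these settings describe how PyTorch itself was compiled; whether an XPU device is usable depends on the extension being importable, which this diff does not show.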

@i6od

i6od commented Oct 20, 2023

I managed to get 0 tokens output with ipex lol
Anyway, I'm sleepy; I've been at this all day.

@abhilash1910
Contributor

Some updates regarding failures to build or compile on our systems (FYI):

  • For gbnf/ggml based compiler patterns, support is in progress, so there might be failures with older oneAPI/dpct (if you are using a previous release).
  • For build issues related to IPEX XPU, I would recommend switching to the latest public IPEX. Also tag me in case you are having difficulties building or using IPEX on your Arc systems.
  • This support is in progress, and I will post updates periodically, as some follow-up work needs to be merged before this can be used fully.

cc @oobabooga and others who are using our devices. Thank you for your continued support and interest in Arc.

@cyrillebeauchamp

Sorry in advance for the long post.

Unfortunately the above is only a part of the solution: other requirements install a more recent version of PyTorch not compatible with Intel GPUs.

So I did a manual install from scratch:

Install Intel drivers:

# download the key to system keyring
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \
sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg

# add signed entry to apt sources and configure the APT client to use Intel repository:
echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy client" | \
sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list

# update the repositories and install the GPU driver packages:
sudo apt update
sudo apt install -y \
  intel-opencl-icd intel-level-zero-gpu level-zero \
  intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \
  libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
  libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
  mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo

Install Intel® oneAPI Base Toolkit:

# download the key to system keyring
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null

# add signed entry to apt sources and configure the APT client to use Intel repository:
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list

# update the repositories and install the base kit:
sudo apt update
sudo apt install intel-basekit

Install some missing libraries:

wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
sudo apt install intel-oneapi-runtime-openmp=2023.2.2-47 intel-oneapi-runtime-dpcpp-cpp=2023.2.2-47 intel-oneapi-runtime-mkl=2023.2.0-49495

Install Miniconda 3:

mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh

Create a new conda environment:

conda create -n textgen python=3.9
conda activate textgen

Install the WebUI:

git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements_nowheels.txt

Install PyTorch:

python -m pip install torch==2.0.1a0 torchvision==0.15.2a0 intel-extension-for-pytorch==2.0.110+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu-idp/us/

Set up the oneAPI environment (needed by PyTorch with IPEX):

source /opt/intel/oneapi/compiler/latest/env/vars.sh
source /opt/intel/oneapi/mkl/latest/env/vars.sh

Test it is working:

python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"

Install llama-cpp-python:

sudo apt-get install --reinstall pkg-config cmake-data
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-python

Start the server:

python server.py

Download your model and enjoy :)

My tests were done on a dedicated install of Ubuntu running on Win 11 WSL2 on a Samsung Galaxy Book2 (Intel i7 processor with integrated graphics and 16GB of RAM): llama2 13B (32 layers to GPU) loads fast and runs at above 2 tokens per second, which is acceptable for personal use.
NB: don't forget to increase the maximum amount of RAM WSL2 can use.

Hope it helps,
Cyrille

@oobabooga
Owner

oobabooga commented Dec 7, 2023

@cyrillebeauchamp thanks for the guide, that's extremely valuable. I think that with your commands it may be possible to automatically compile llama-cpp-python for Intel GPUs using GitHub Actions. Then we could add these to dedicated requirements_intel.txt and requirements_intel_noavx2.txt files.

If you are interested, you may be able to create a new .yml in this repository (where all the wheels here are compiled) and PR it to @jllllll:

https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/tree/main/.github/workflows

@NineMeowICT

NineMeowICT commented Dec 9, 2023

@cyrillebeauchamp Thanks for your detailed instructions! But I still have trouble loading the LLM.
2023-12-09_10-49

At first, I thought that the problem might be caused by the Intel iGPU, which can also be used as a Level-Zero XPU device. So I exported the environment variable "ONEAPI_DEVICE_SELECTOR=level_zero:0". But the error still happened.
2023-12-09_10-55

By the way, the GPU IMC usage and the RAM usage increased quickly after I clicked the "Load" button.
2023-12-09_11-02

Any ideas? Thank you!

@itlackey

itlackey commented Dec 11, 2023

Using llama.cpp for Intel Arc support will result in most operations running on the CPU currently.

PyTorch extensions work well on Intel Arc. FastChat uses it and it's significantly faster than running llama.cpp.

There is work being done to support better drivers in llama.cpp but as of right now the OpenCL implementation runs primarily on the CPU. This is why we see 2 t/s using llama.cpp and around 20 t/s using FastChat on Arc GPUs.
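
Figures such as 2 t/s and 20 t/s are only comparable when measured the same way. Below is a minimal sketch of such a measurement. The generate callable is a stand-in (hypothetical) for whichever backend is being tested, and it is assumed to return the list of generated tokens:

```python
import time


def measure_tokens_per_second(generate, prompt, clock=time.perf_counter):
    """Time one call to `generate` and return (token_count, tokens_per_second)."""
    start = clock()
    tokens = generate(prompt)
    elapsed = clock() - start
    count = len(tokens)
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, rate


def fake_generate(prompt):
    # Stand-in backend: pretends each whitespace-separated word is one token.
    time.sleep(0.01)
    return prompt.split()


count, rate = measure_tokens_per_second(fake_generate, "a short test prompt")
print(f"{count} tokens at {rate:.1f} tokens/s")
```

This times the whole call, so prompt processing is included in the rate. For a fair comparison between backends, the same model, prompt and output length should be used.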

@Jacoby1218

https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.1.10%2Bxpu A new version of IPEX is out; native Windows support has been added.

@Jacoby1218

  File "D:\oobabooga_windows\oobabooga_windows\text-generation-webui\modules\ui_model_menu.py", line 209, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name, loader)
  File "D:\oobabooga_windows\oobabooga_windows\text-generation-webui\modules\models.py", line 88, in load_model
    output = load_func_map[loader](model_name)
  File "D:\oobabooga_windows\oobabooga_windows\text-generation-webui\modules\models.py", line 238, in huggingface_loader
    model = LoaderClass.from_pretrained(path_to_model, **params)
  File "D:\oobabooga_windows\oobabooga_windows\text-generation-webui\installer_files\env\lib\site-packages\transformers\models\auto\auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "D:\oobabooga_windows\oobabooga_windows\text-generation-webui\installer_files\env\lib\site-packages\transformers\modeling_utils.py", line 3480, in from_pretrained
    ) = cls._load_pretrained_model(
  File "D:\oobabooga_windows\oobabooga_windows\text-generation-webui\installer_files\env\lib\site-packages\transformers\modeling_utils.py", line 3870, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "D:\oobabooga_windows\oobabooga_windows\text-generation-webui\installer_files\env\lib\site-packages\transformers\modeling_utils.py", line 743, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "D:\oobabooga_windows\oobabooga_windows\text-generation-webui\installer_files\env\lib\site-packages\accelerate\utils\modeling.py", line 317, in set_module_tensor_to_device
    new_value = value.to(device)
  File "D:\oobabooga_windows\oobabooga_windows\text-generation-webui\installer_files\env\lib\site-packages\torch\cuda\__init__.py", line 289, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

@ghost

ghost commented Dec 30, 2023

I'm not sure if this is the right place to post this. I receive the below error after installing OobaBooga using the default Arc install option on Windows. The install seemed to go well but running it results in the below DLL load error. Other threads that mentioned this loading error suggested it might be a PATH issue. I tried adding a few paths to the OS environment but couldn't resolve it. Any suggestions?

It's an Arc A770 on Windows 10.

Traceback (most recent call last)
  C:\text-generation-webui\server.py:6
       5
    >  6 import accelerate  # This early import makes Intel GPUs happy
       7

  C:\text-generation-webui\installer_files\env\Lib\site-packages\accelerate\__init__.py:3
       2
    >  3 from .accelerator import Accelerator
       4 from .big_modeling import (

  C:\text-generation-webui\installer_files\env\Lib\site-packages\accelerate\accelerator.py:32
      31
    > 32 import torch
      33 import torch.utils.hooks as hooks

  C:\text-generation-webui\installer_files\env\Lib\site-packages\torch\__init__.py:139
     138 err.strerror += f' Error loading "{dll}" or one of its dependencies.'
   > 139 raise err
     140

OSError: [WinError 126] The specified module could not be found. Error loading
"C:\text-generation-webui\installer_files\env\Lib\site-packages\torch\lib\backend_with_compiler.dll" or one of its
dependencies.
Press any key to continue . . .

@oobabooga
Owner

I think that the new loader below (gpt-fast) should work on Intel Arc, but I have no way of testing it. It has 4-bit and 8-bit support.

#5180

Any tests would be welcome.

@ksullivan86

I really don't know much, as I am super new to AI. I am looking to get more into it, mainly for Home Assistant and hopefully someday an assistant similar to Iron Man's Jarvis (Jarvis isn't possible today, right?). I have a home server that I just built (14700K), with Plex as the main use case. I stumbled upon LocalAI last week (I didn't even realize AI could be hosted at home) and got excited about the above possibility; hosting AI would also help justify the cost of my server upgrade and of running it 24/7.
I didn't put a GPU in my build because I wanted to keep idle power as low as possible, and I just found out that Intel Arc can idle at less than 1 watt on new Intel systems. So I am really interested in whether Intel Arc works and, if it does, what would need to be changed in the docker compose file. It would also be helpful to have a CPU-only docker compose file, as I had to do a lot of searching to find out how to install via Docker with only a CPU.
Is there any way to use the integrated GPU in the 14700K?
Is an Intel Arc worth buying for AI right now? I was able to use my CPU with Home Assistant, but it was super slow (probably over a minute to respond), while inside the web GUI the response was almost instant; I guess it's doing a lot more when the request comes from Home Assistant. Do you think an Intel Arc would solve the problem? I really want to stay with Arc for the low idle power. I know Battlemage is probably around 11 months away, if it even gets released given the recent rumors, so I don't think waiting for the v2 is a great idea, since it's possible it never gets released (hopefully it does, and they fully compete with NVIDIA).

@Leo512bit

It looks like llama.cpp now supports SYCL for Intel GPUs. Is Arc support now possible?

ggerganov/llama.cpp#2690

@ElliottDyson

This is brilliant news! Could I get a ping when it's been implemented into this repo please?

@oobabooga
Owner

I am aware of SYCL and wanted to ask about this. There is also the Vulkan option now that may work on Intel Arc.

Can someone try these and see which one works with GPU offloading and is the fastest?

pip uninstall -y llama_cpp_python llama_cpp_python_cuda llama_cpp_python_cuda_tensorcores

CMAKE_ARGS="-DLLAMA_SYCL=on" pip install llama-cpp-python  # Option 1: SYCL
CMAKE_ARGS="-DLLAMA_VULKAN=on" pip install llama-cpp-python  # Option 2: Vulkan
CMAKE_ARGS="-DLLAMA_KOMPUTE=on" pip install llama-cpp-python  # Option 3: Kompute 
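
A minimal Python sketch of how the three options above could be tried one after another. The flag values come from the commands above; the `--force-reinstall` and `--no-cache-dir` pip options are an addition here (on the assumption that a cached wheel from a previously tested backend might otherwise be reused), and `build_install` is a hypothetical helper name.

```python
import os
import sys

# CMake flags taken from the three commands above.
BACKEND_FLAGS = {
    "sycl": "-DLLAMA_SYCL=on",
    "vulkan": "-DLLAMA_VULKAN=on",
    "kompute": "-DLLAMA_KOMPUTE=on",
}


def build_install(backend, base_env=None):
    """Return (argv, env) for rebuilding llama-cpp-python with one backend.

    Nothing is executed here; the caller decides when to run the command.
    """
    if backend not in BACKEND_FLAGS:
        raise ValueError(f"unknown backend: {backend!r}")
    # Copy the environment so the caller's mapping is never modified.
    env = dict(os.environ if base_env is None else base_env)
    env["CMAKE_ARGS"] = BACKEND_FLAGS[backend]
    argv = [
        sys.executable, "-m", "pip", "install",
        # Assumption: force a rebuild so a wheel built for another
        # backend is not silently reused from pip's cache.
        "--force-reinstall", "--no-cache-dir",
        "llama-cpp-python",
    ]
    return argv, env
```

For example, `argv, env = build_install("vulkan")` followed by `subprocess.run(argv, env=env, check=True)` would run the second option.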

@Leo512bit

Stupid question: since there is no CMake in TGWUI's codebase, I assume you are talking about building llama.cpp. Once I build it, do I "plug in" the compiled output into TGWUI, or do I run a model on llama.cpp somehow? (My only understanding of llama.cpp is that it is some kind of backend.) Whichever of those things I have to do, how do I do it?

@ElliottDyson

ElliottDyson commented Feb 8, 2024

Unfortunately I have no idea how to build this myself.

However, if someone wants to send me one that I can download I am more than happy to test.

All I know is that the current installation procedure for Intel ARC doesn't utilise my Intel Arc A770 16GB, even when it downloads all the correct packages and says it's offloaded the layers to GPU (which I can tell it hasn't because I have no VRAM being used up).

Edit for Clarification: Doesn't work unless using Transformers and no quantisation.
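
One way to cross-check a report like this is to ask PyTorch directly which XPU devices it can see. The sketch below assumes the Intel Extension for PyTorch (IPEX) XPU build is installed and exposes the `torch.xpu` namespace; `list_xpu_devices` is a hypothetical helper name, and it returns an empty list rather than failing when that stack is missing.

```python
def list_xpu_devices():
    """Return the names of the XPU devices PyTorch can see, or []."""
    try:
        import torch
    except Exception:
        # torch is missing or failed to load its native libraries.
        return []
    try:
        # Importing IPEX registers the XPU backend on torch builds that
        # need it; some torch builds may provide torch.xpu on their own.
        import intel_extension_for_pytorch  # noqa: F401
    except Exception:
        pass
    xpu = getattr(torch, "xpu", None)
    if xpu is None or not xpu.is_available():
        return []
    return [xpu.get_device_name(i) for i in range(xpu.device_count())]
```

An empty list would point at the driver or PyTorch setup rather than at the web UI. Note that this only covers the Transformers path; llama-cpp-python does not go through PyTorch.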

@ElliottDyson

There's also the following library, which should allow us to use quantised versions of models built for Hugging Face's Transformers library on Intel GPUs. Currently this may be one approach to getting quantised models running on these GPUs other than through llama.cpp:

https://github.com/huggingface/optimum-intel

@ElliottDyson

ElliottDyson commented Feb 15, 2024

I hope this isn't disruptive to your work, but are there any updates on some wheels for us to test with? I've tried building them myself, but without success. @oobabooga

@NineMeowICT

@oobabooga Thanks!
I have tried all the backends you mentioned above, and this is my conclusion:
Vulkan is almost as efficient as SYCL at the present stage. llama-cpp-python with Kompute cannot be built, for the following reason:
[Screenshot: kompute_build_error]

@NineMeowICT

NineMeowICT commented Mar 10, 2024

Here is my solution using llama-cpp-python:

Test environment: Linux Mint 21.3 Cinnamon with Linux kernel 6.5.0-25

Python version: 3.10.* or 3.11.*

Before all steps, install the Intel driver (thanks @cyrillebeauchamp):

# download the key to system keyring
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \
sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg

# add signed entry to apt sources and configure the APT client to use Intel repository:
echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy client" | \
sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list

# update the repositories and install the base kit:
sudo apt update
sudo apt install -y \
  intel-opencl-icd intel-level-zero-gpu level-zero \
  intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \
  libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
  libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
  mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo
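
A small sanity check that could be run after the driver install, sketched in Python. It only checks that the diagnostic tools named in the apt command above ended up on `PATH`; it does not confirm that the GPU itself is detected. `find_tools` and `missing_tools` are hypothetical helper names.

```python
import shutil

# Diagnostic tools installed by the apt command above.
DEFAULT_TOOLS = ("clinfo", "vainfo", "hwinfo")


def find_tools(names=DEFAULT_TOOLS):
    """Map each tool name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=DEFAULT_TOOLS):
    """Return the sorted names of tools that could not be found."""
    return sorted(n for n, path in find_tools(names).items() if path is None)
```

If `missing_tools()` comes back empty, running `clinfo` by hand is a reasonable next step to see whether the Arc card is listed as an OpenCL device.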

Method 1: Vulkan

Step 1: Install Vulkan SDK

wget -qO - https://packages.lunarg.com/lunarg-signing-key-pub.asc | sudo apt-key add -
sudo wget -qO /etc/apt/sources.list.d/lunarg-vulkan-jammy.list https://packages.lunarg.com/vulkan/lunarg-vulkan-jammy.list
sudo apt update -y
sudo apt-get install -y vulkan-sdk
# To verify the installation, use the command below:
vulkaninfo

Step 2: Install necessary python packages

pip install -r requirements_nowheels.txt

Step 3: Build and install the wheel

CMAKE_ARGS="-DLLAMA_VULKAN=on" pip install llama-cpp-python

Step 4: Launch and enjoy it!

python server.py

Method 2: SYCL

Step 1: Install Intel® oneAPI Base Toolkit
Please select a way that suits you and follow the instructions: https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html

Step 2: Install necessary python packages

pip install -r requirements_nowheels.txt

Step 3: Install pytorch and IPEX

pip install torch==2.1.0a0 torchvision==0.16.0a0 intel-extension-for-pytorch==2.1.10+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

REMINDER: If you are using oneAPI 2024.1.0 or newer, please run the following command instead:

python -m pip install torch==2.1.0.post0 torchvision==0.16.0.post0 torchaudio==2.1.0.post0 intel-extension-for-pytorch==2.1.20+xpu oneccl_bind_pt==2.1.200+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

Step 4: Build and install the wheel

source /opt/intel/oneapi/setvars.sh   
CMAKE_ARGS="-DLLAMA_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_SYCL_F16=ON" pip install llama-cpp-python

Step 5: Launch and enjoy it!

python server.py
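
Before launching the web UI, it may be worth confirming that the freshly built wheel can offload to the GPU at all. The following is a sketch, not part of the steps above: it assumes the installed llama-cpp-python exposes `llama_supports_gpu_offload` (it falls back to `None` if not), and the helper names are hypothetical.

```python
import importlib.util


def gpu_offload_supported():
    """Return True/False as reported by llama-cpp-python, or None if unknown."""
    if importlib.util.find_spec("llama_cpp") is None:
        return None  # llama-cpp-python is not installed
    try:
        import llama_cpp
    except Exception:
        return None  # installed, but its native library failed to load
    # Assumption: this binding exists in the installed version.
    probe = getattr(llama_cpp, "llama_supports_gpu_offload", None)
    return bool(probe()) if callable(probe) else None


def load_fully_offloaded(model_path):
    """Load a GGUF model and request that every layer goes to the GPU."""
    from llama_cpp import Llama

    # n_gpu_layers=-1 asks for all layers to be offloaded; verbose=True
    # keeps the llama.cpp load log visible so the offload can be checked.
    return Llama(model_path=model_path, n_gpu_layers=-1, verbose=True)
```

If `gpu_offload_supported()` returns `False`, the wheel was most likely built without the SYCL or Vulkan flag and needs to be rebuilt.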

@NineMeowICT

NineMeowICT commented Mar 10, 2024

GPU: Intel ARC A770 16GB
Model: LLaMA2-13B-Tiefighter.Q8_0.gguf

Test Result:
SYCL: [screenshot: 2024-02-24_17-38] [video: output.mp4]

Vulkan: [screenshots: vulkan_log, vulkan]
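
The results above were posted as images and a video. For anyone who wants to reproduce the comparison as numbers, a tokens-per-second helper along these lines might do; it is a generic sketch and `tokens_per_second` is a hypothetical name. The `generate` callable is supplied by the caller and must return how many tokens it actually produced.

```python
import time


def tokens_per_second(generate, n_tokens, clock=time.perf_counter):
    """Time one generate(n_tokens) call and return tokens per second.

    `generate` must return the number of tokens it produced, which may
    be lower than n_tokens if generation stopped early.
    """
    start = clock()
    produced = generate(n_tokens)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return produced / elapsed
```

With llama-cpp-python, `generate` could wrap a call such as `llm(prompt, max_tokens=n)` and return `out["usage"]["completion_tokens"]`. Running the same prompt and model against the SYCL and Vulkan builds then gives two directly comparable figures.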

@ksullivan86

@cyrillebeauchamp Thanks for your detailed instructions! But I still have trouble loading the LLM. [Screenshot: 2023-12-09_10-49]

At first, I thought that the problem might be caused by the Intel iGPU, which can also be used as a Level-Zero XPU device, so I exported the environment variable "ONEAPI_DEVICE_SELECTOR=level_zero:0". But the error still happened. [Screenshot: 2023-12-09_10-55]

By the way, the GPU IMC usage and the RAM usage increased quickly after I clicked the "Load" button. [Screenshot: 2023-12-09_11-02]

Any ideas? Thank you!

idk what GPU you are using, but do you know what your Arc card is drawing at idle? I have been trying to see if it's even possible to get Intel Arc idle power down to under 1 watt with ASPM, like in Windows. I'm using Unraid and had to use a custom kernel, so idk if that is the reason why I can't get the low idle to work. My A770 is constantly drawing 40 W, and I only went with Arc because of the extremely low idle. Have you had any luck?

@NineMeowICT

@ksullivan86 I use an A770 too, and its idle power is about 40 W as well. I tried to use ASPM to lower it, but it didn't work on either Windows or Linux. I suspect that idle power management is not implemented at present for some Arc graphics cards. There may also be differences between cards from different AIC manufacturers.

@opticblu

opticblu commented Mar 16, 2024

FYI this works in WSL2 on Windows 11 with an A770 16GB. Make sure to disable the iGPU (I did it in the BIOS instead of Device Manager to be sure).

Did it with SYCL

Truncated quote from @NineMeowICT above, for brevity

Here is my solution using llama-cpp-python:
....

Step 4: Launch and enjoy it!

python server.py

With the iGPU enabled it didn't work; it works great with the A770 only.

Thanks for putting this together @NineMeowICT and others
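
For setups where the iGPU cannot be disabled in the BIOS, restricting the oneAPI runtime to one device with `ONEAPI_DEVICE_SELECTOR` is another thing to try. The value `level_zero:0` is the one quoted earlier in this thread, where it was reported as not sufficient in one case, so treat this as an experiment rather than a fix. Which index maps to the Arc card is an assumption and depends on how the devices are enumerated; `device_selector_env` is a hypothetical helper name.

```python
import os


def device_selector_env(selector="level_zero:0", base=None):
    """Return a copy of the environment with ONEAPI_DEVICE_SELECTOR set.

    The caller's mapping (or os.environ) is left untouched.
    """
    env = dict(os.environ if base is None else base)
    env["ONEAPI_DEVICE_SELECTOR"] = selector
    return env


# Example (not executed here): launch the web UI with only that device visible.
#   import subprocess, sys
#   subprocess.run([sys.executable, "server.py"],
#                  env=device_selector_env(), check=True)
```

If the wrong device ends up selected, trying `level_zero:1` is the obvious next step.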

@sambartik

Thanks for the write-up @NineMeowICT, nice one!

Something has probably changed since then, and the SYCL method did not entirely work for me. I received the error described in this issue: pytorch/pytorch#123097

Installing these versions of packages instead solved the issue:

python -m pip install torch==2.1.0.post0 torchvision==0.16.0.post0 torchaudio==2.1.0.post0 intel-extension-for-pytorch==2.1.20+xpu oneccl_bind_pt==2.1.200+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
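
To confirm that an environment really ended up on these versions, something like the following Python sketch could be used. The pins are copied from the command above; the helper names are hypothetical, and the check tolerates an extra local suffix after `+` on the installed version, on the assumption that some wheels report one.

```python
from importlib import metadata

# Version pins copied from the pip command above.
PINS = {
    "torch": "2.1.0.post0",
    "torchvision": "0.16.0.post0",
    "torchaudio": "2.1.0.post0",
    "intel-extension-for-pytorch": "2.1.20+xpu",
    "oneccl_bind_pt": "2.1.200+xpu",
}


def installed_version(name):
    """Return the installed version of a distribution, or None."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def matches(want, have):
    """True if `have` equals `want`, optionally with a '+local' suffix."""
    return have is not None and (have == want or have.startswith(want + "+"))


def find_mismatches(pins=PINS, lookup=installed_version):
    """Return {name: (wanted, installed)} for every pin that is not met."""
    result = {}
    for name, want in pins.items():
        have = lookup(name)
        if not matches(want, have):
            result[name] = (want, have)
    return result
```

An empty dict from `find_mismatches()` means every pinned package is present at the expected version.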

@NineMeowICT

@sambartik Yes, you're right. I also encountered this, so I downgraded oneMKL. I will edit my post to keep it up to date.

Thanks!

This issue has been closed due to inactivity for 2 months. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.

@LovelyA72

LovelyA72 commented Jun 24, 2024

sudo apt-get install --reinstall pkg-config cmake-data
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-python

If this method no longer works after updating to Intel toolchain 202x.x (your installed version), you may need to manually install intel-oneapi-compiler-dpcpp-cpp-202x.x.
