[AI] LLaMA INT8 Inference guide

뮤리찌 2023. 3. 8. 18:02


DOWNLOAD THE CONVERTED WEIGHTS

Some generous anon converted all the weights. Grab them here: https://rentry.org/LLaMA-8GB-Edition and https://rentry.org/llama-tard-v2

Huggingface implementation is available now!

Clone the repo first with git clone https://github.com/huggingface/transformers and then run gh pr checkout 21955 inside the transformers directory.

llamanon here.
This guide is supposed to be understandable to the average /aicg/ user (possibly retarded). This is for Linux obviously - I don't know how to run bitsandbytes on Windows, and I don't have a Windows machine to test it on.

If you're on Windows, I recommend using Oobabooga. It now supports LLaMA with 8bit.

Why don't I recommend using oobabooga otherwise? It's terrible at memory management and, according to my tests, you'll use less VRAM with Meta's own inference code than with ooba's.


Download LLaMA weights

magnet:?xt=urn:btih:b8287ebfa04f879b048d4d4404108cf3e8014352&dn=LLaMA&tr=udp%3a%2f%2ftracker.opentrackr.org%3a1337%2fannounce
Get the .torrent

Please download and seed all the model weights if you can. If you want to run a single model, don't forget to download the tokenizer.model file too.

Set up Conda and create an environment for LLaMA

I hate conda too, but it's the official method recommended by Meta for some reason, and I don't want to deviate.

Set up Conda

1. Open a terminal and run:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

2. Run

chmod +x Miniconda3-latest-Linux-x86_64.sh

3. Run

./Miniconda3-latest-Linux-x86_64.sh

4. Go with the default options. When it shows you the license, hit q to continue the installation.

5. Refresh your shell by logging out and logging back in.
I think closing and reopening the terminal works too, but I don't remember. Try both.

Create env and install dependencies

Create an env:
conda create -n llama
Activate the env:
conda activate llama
Install the dependencies:
NVIDIA:
conda install torchvision torchaudio pytorch-cuda=11.7 git -c pytorch -c nvidia
AMD:
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/rocm5.2
Clone the INT8 repo by tloen:
git clone https://github.com/tloen/llama-int8 && cd llama-int8
Install the requirements:
pip install -r requirements.txt && pip install -e .

Create a swapfile

Loading the weights for 13B and higher models needs a considerable amount of RAM. IIRC it takes about 50GB for 13B, and over 100GB for 30B. You'll need a swapfile to take care of the excess memory usage. This is only used for the loading process; inference is unaffected (as long as you meet the VRAM requirements).

Create a swapfile:
sudo dd if=/dev/zero of=/swapfile bs=4M count=13000 status=progress
This will create a swapfile of about 50GB. Edit the count to your preference. 13000 means 4MB x 13000.
Mark it as swap:
sudo mkswap /swapfile
Activate it:
sudo swapon /swapfile

If you want to delete it, simply run sudo swapoff /swapfile and then sudo rm /swapfile.
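
If you'd rather pick the count from a target size than guess, the arithmetic is easy to script. This is just a sketch: dd_count and swapfile_gib are names I made up, and it assumes dd reads bs=4M as 4 MiB blocks (1024 x 1024 bytes each), which is how GNU dd treats the M suffix.

```python
import math

MIB_PER_GIB = 1024


def dd_count(target_gib, block_mib=4):
    """Number of dd blocks needed for a swapfile of at least target_gib GiB."""
    return math.ceil(target_gib * MIB_PER_GIB / block_mib)


def swapfile_gib(count, block_mib=4):
    """Size in GiB of a swapfile written with bs=<block_mib>M count=<count>."""
    return count * block_mib / MIB_PER_GIB


if __name__ == "__main__":
    # The count used in this guide
    print(swapfile_gib(13000))  # 50.78125
    # A count for a ~100 GiB swapfile, e.g. for loading 30B
    print(dd_count(100))  # 25600
```

Plug the result into the count= part of the dd command. Whether 100 GiB is actually enough for 30B is not something this checks; the sizes quoted in this guide are from memory.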

Run the models

I'll assume your LLaMA models are in ~/Downloads/LLaMA.

Open a terminal in your llama-int8 folder (the one you cloned).
Run:
python example.py --ckpt_dir ~/Downloads/LLaMA/7B --tokenizer_path ~/Downloads/LLaMA/tokenizer.model --max_batch_size=1
You're done. Wait for the model to finish loading and it'll generate a completion for the built-in prompt.
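
Since loading takes a while, it can save time to check the paths before launching. Here's a rough pre-flight sketch. The helper name missing_model_files is made up, and the expected layout (a params.json plus one or more .pth shards inside the model folder) is my assumption about how the downloaded weights are organized, so adjust it if your folder looks different.

```python
from pathlib import Path


def missing_model_files(ckpt_dir, tokenizer_path):
    """Return a list of human-readable problems; an empty list means all good."""
    ckpt_dir = Path(ckpt_dir).expanduser()
    tokenizer_path = Path(tokenizer_path).expanduser()
    problems = []

    if not ckpt_dir.is_dir():
        problems.append(f"checkpoint directory not found: {ckpt_dir}")
    else:
        # Assumed layout: params.json next to at least one *.pth shard
        if not (ckpt_dir / "params.json").is_file():
            problems.append(f"params.json missing in {ckpt_dir}")
        if not any(ckpt_dir.glob("*.pth")):
            problems.append(f"no .pth weight shards in {ckpt_dir}")

    if not tokenizer_path.is_file():
        problems.append(f"tokenizer not found: {tokenizer_path}")

    return problems


if __name__ == "__main__":
    issues = missing_model_files(
        "~/Downloads/LLaMA/7B", "~/Downloads/LLaMA/tokenizer.model"
    )
    for issue in issues:
        print("PROBLEM:", issue)
    if not issues:
        print("Looks fine, go ahead and run example.py")
```

It only checks that the files exist. It doesn't verify checksums or that the download finished.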

Add custom prompts

By default, the llama-int8 repo has a short prompt baked into example.py.

Open the "example.py" file in the "llama-int8" directory.
Navigate to line 136. It starts with triple quotation marks, """.
Replace the current prompt with whatever you have in mind.
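
If you get tired of editing the script every time, one option is to keep the prompt in a text file and read it in. This is only a sketch: load_prompts is a made-up helper, and it assumes example.py hands a list of prompt strings to the generator, which you should confirm by looking at the code around line 136 before wiring it in.

```python
from pathlib import Path


def load_prompts(path):
    """Read a single prompt from a UTF-8 text file and return it as a one-item list."""
    text = Path(path).expanduser().read_text(encoding="utf-8").strip()
    if not text:
        raise ValueError(f"prompt file is empty: {path}")
    return [text]
```

You would then swap the hardcoded prompt for something like load_prompts("prompt.txt"), where prompt.txt is whatever file name you choose.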

I'm getting shitty results!

The inference code sucks for LLaMA. It only supports Temperature and Top_P. We'll have to wait until HF implements support for it (already in the works) so that it can properly show its true potential.

https://rentry.org/llama-tard


3. CUDA out of memory errors

The example.py file pre-allocates a cache according to these settings:

model_args: ModelArgs = ModelArgs(max_seq_len=max_seq_len, max_batch_size=max_batch_size, **params)

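Accounting for 14GB of memory for the model weights (7B model), this leaves 16GB available for the decoding cache, which stores 2 * 2 * n_layers * max_batch_size * max_seq_len * n_heads * head_dim bytes.

With the default parameters, this cache was about 17GB (2 * 2 * 32 * 32 * 1024 * 32 * 128) for the 7B model.

Command line options have been added to example.py and the default max_seq_len has been changed to 512, which should allow decoding on 30GB GPUs.

Feel free to lower these settings according to your hardware.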
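
To see how much the cache shrinks when you lower the settings, you can plug numbers into the formula above. A small sketch: kv_cache_bytes is a made-up name, and reading the two factors of 2 as "keys plus values" and "2 bytes per fp16 value" is my interpretation of the FAQ, not something it spells out.

```python
def kv_cache_bytes(n_layers, max_batch_size, max_seq_len, n_heads, head_dim,
                   bytes_per_value=2):
    """Decoding cache size in bytes, following the formula quoted from the FAQ."""
    # First factor of 2: one cache for keys and one for values (assumed reading)
    return (2 * bytes_per_value * n_layers * max_batch_size
            * max_seq_len * n_heads * head_dim)


if __name__ == "__main__":
    # 7B model with the old defaults from the FAQ: batch 32, sequence length 1024
    old_default = kv_cache_bytes(32, 32, 1024, 32, 128)
    print(f"{old_default / 1e9:.1f} GB")  # 17.2 GB

    # Same model with max_batch_size=1 and max_seq_len=512
    small = kv_cache_bytes(32, 1, 512, 32, 128)
    print(f"{small / 1e9:.2f} GB")  # 0.27 GB
```

The cache scales linearly with both max_batch_size and max_seq_len, so dropping the batch size to 1 (as in the run command in this guide) is the cheapest way to free VRAM.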

https://github.com/tloen/llama-int8/blob/main/FAQ.md#3

Eddy Lab