Installing and Running vLLM w/macOS

망고v 2025. 12. 29. 17:01

I've gotten reasonably comfortable with Ollama, but for Kubernetes or production environments vLLM seems to be the recommended choice. This time, let's install and run vLLM and get familiar with it.

 

Installation

Both a Python-based installation and a Docker-based approach are provided. Let's install and verify with Python first, then try Docker as well.

 

https://docs.vllm.ai/en/v0.12.0/getting_started/installation/cpu/#apple-silicon

 

Prerequisites

Install Python 3 and uv.
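If Homebrew is available, both can be installed with it; a minimal sketch, assuming Homebrew is already set up:

# assumes Homebrew is installed; formula names may vary by setup
brew install python@3.12
brew install uv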

mango@mac llm % uv venv --python 3.12 --seed
Using CPython 3.12.10 interpreter at: /usr/local/bin/python3.12
Creating virtual environment with seed packages at: .venv
 + pip==25.3
Activate with: source .venv/bin/activate

mango@mac llm % source .venv/bin/activate
(llm) mango@mac llm %

 

Clone the GitHub repository, change into that directory, and run the commands below.

mango@mac llm % git clone https://github.com/vllm-project/vllm.git
Cloning into 'vllm'...
remote: Enumerating objects: 153771, done.
remote: Counting objects: 100% (17/17), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 153771 (delta 9), reused 3 (delta 3), pack-reused 153754 (from 3)
Receiving objects: 100% (153771/153771), 130.00 MiB | 16.53 MiB/s, done.
Resolving deltas: 100% (121124/121124), done.

(llm) mango@mac llm % cd vllm

(llm) mango@mac vllm % uv pip install -r requirements/cpu.txt --index-strategy unsafe-best-match
Using Python 3.12.10 environment at: /Users/${USER}/workspace/architect/llm/.venv
Resolved 135 packages in 459ms
Prepared 3 packages in 611ms
Installed 135 packages in 690ms
..omitted..

(llm) mango@mac vllm % uv pip install -e .
Using Python 3.12.10 environment at: /Users/${USER}/workspace/architect/llm/.venv
Resolved 136 packages in 9.43s
      Built vllm @ file:///Users/${USER}/workspace/architect/llm/vllm
Prepared 1 package in 26.58s
Installed 1 package in 2ms
 + vllm==0.14.0rc1.dev156+g17347daaa (from file:///Users/${USER}/workspace/architect/llm/vllm)
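To sanity-check the editable install before moving on, importing the package and printing its version should be enough (a quick check added here, not part of the original run):

python -c "import vllm; print(vllm.__version__)"
# should print the installed version, e.g. 0.14.0rc1.dev156+g17347daaa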

 

Running a Model

Let's try running the Qwen2.5-Coder-7B model. In my case, it hung while loading the model, but when I ran the command again the startup messages came up normally.
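If the weight download stalls like it did for me, pre-fetching the model into the local Hugging Face cache first may help. This is just a workaround sketch (it assumes the huggingface-cli that ships with huggingface_hub is on the PATH), not something from the run below:

# optional: download the weights into the local HF cache before serving
huggingface-cli download Qwen/Qwen2.5-Coder-7B-Instruct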

(llm) mango@mac llm % vllm serve Qwen/Qwen2.5-Coder-7B-Instruct
INFO 12-29 16:13:58 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
(APIServer pid=74446) INFO 12-29 16:14:20 [api_server.py:1274] vLLM API server version 0.14.0rc1.dev156+g17347daaa
(APIServer pid=74446) INFO 12-29 16:14:20 [utils.py:253] non-default args: {'model_tag': 'Qwen/Qwen2.5-Coder-7B-Instruct', 'model': 'Qwen/Qwen2.5-Coder-7B-Instruct'}
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 663/663 [00:00<00:00, 1.07MB/s]
(APIServer pid=74446) INFO 12-29 16:14:26 [model.py:517] Resolved architecture: Qwen2ForCausalLM
(APIServer pid=74446) INFO 12-29 16:14:26 [model.py:1688] Using max model len 32768
(APIServer pid=74446) WARNING 12-29 16:14:26 [cpu.py:157] VLLM_CPU_KVCACHE_SPACE not set. Using 12.00 GiB for KV cache.
(APIServer pid=74446) INFO 12-29 16:14:26 [scheduler.py:230] Chunked prefill is enabled with max_num_batched_tokens=2048.
tokenizer_config.json: 7.30kB [00:00, 8.86MB/s]
vocab.json: 2.78MB [00:00, 9.14MB/s]
merges.txt: 1.67MB [00:00, 7.37MB/s]
tokenizer.json: 7.03MB [00:00, 38.2MB/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 242/242 [00:00<00:00, 647kB/s]
INFO 12-29 16:14:33 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
(EngineCore_DP0 pid=74578) INFO 12-29 16:14:35 [core.py:95] Initializing a V1 LLM engine (v0.14.0rc1.dev156+g17347daaa) with config: model='Qwen/Qwen2.5-Coder-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-Coder-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False), seed=0, served_model_name=Qwen/Qwen2.5-Coder-7B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.DYNAMO_TRACE_ONCE: 2>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': None, 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'dce': True, 'size_asserts': False, 'nan_asserts': False, 'epilogue_fusion': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
(EngineCore_DP0 pid=74578) INFO 12-29 16:14:38 [cpu_worker.py:86] Warning: NUMA is not enabled in this build. `init_cpu_threads_env` has no effect to setup thread affinity.
(EngineCore_DP0 pid=74578) INFO 12-29 16:14:38 [parallel_state.py:1210] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.0.0.2:55091 backend=gloo
(EngineCore_DP0 pid=74578) INFO 12-29 16:14:38 [parallel_state.py:1418] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=74578) INFO 12-29 16:14:38 [cpu_model_runner.py:55] Starting to load model Qwen/Qwen2.5-Coder-7B-Instruct...
model.safetensors.index.json: 27.8kB [00:00, 30.6MB/s]
model-00002-of-00004.safetensors:  15%|████████████████████▏                                                                                                                  | 739M/4.93G [01:04<06:59, 9.99MB/s]
model-00001-of-00004.safetensors:   0%|▎                                                                                                                                   | 10.9M/4.88G [01:04<11:40:50, 116kB/s]
model-00004-of-00004.safetensors:   3%|████▋                                                                                                                                  | 37.6M/1.09G [00:50<21:07, 830kB/s]
model-00003-of-00004.safetensors:   0%|                                                                                                                                               | 0.00/4.33G [00:00<?, ?B/s]
(EngineCore_DP0 pid=76150) INFO 12-29 16:31:25 [weight_utils.py:510] Time spent downloading weights for Qwen/Qwen2.5-Coder-7B-Instruct: 143.532489 seconds
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.32it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:04<00:05,  2.69s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:10<00:04,  4.18s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:18<00:00,  5.67s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:18<00:00,  4.68s/it]
(EngineCore_DP0 pid=76150)
(EngineCore_DP0 pid=76150) INFO 12-29 16:31:44 [default_loader.py:308] Loading weights took 18.74 seconds
(EngineCore_DP0 pid=76150) INFO 12-29 16:31:44 [kv_cache_utils.py:1305] GPU KV cache size: 224,640 tokens
(EngineCore_DP0 pid=76150) INFO 12-29 16:31:44 [kv_cache_utils.py:1310] Maximum concurrency for 32,768 tokens per request: 6.86x
(EngineCore_DP0 pid=76150) INFO 12-29 16:31:46 [cpu_model_runner.py:65] Warming up model for the compilation...
(EngineCore_DP0 pid=76150) INFO 12-29 16:32:14 [cpu_model_runner.py:75] Warming up done.
(EngineCore_DP0 pid=76150) INFO 12-29 16:32:14 [core.py:272] init engine (profile, create kv cache, warmup model) took 30.12 seconds
..omitted..
(APIServer pid=76136) INFO:     Started server process [76136]
(APIServer pid=76136) INFO:     Waiting for application startup.
(APIServer pid=76136) INFO:     Application startup complete.

 

Testing

vLLM listens on port 8000 by default. Referring to the route information in the startup log, I opened the Swagger UI (/docs), where you can also confirm that the model loaded correctly.
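Besides the Swagger UI, the OpenAI-compatible endpoints can be exercised directly with curl; a minimal sketch, assuming the server is running on localhost with the default port:

# list the models the server has loaded
curl http://localhost:8000/v1/models

# simple chat completion against the served model
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-Coder-7B-Instruct",
        "messages": [{"role": "user", "content": "Write hello world in Python."}]
      }'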

 

Resource Usage

Perhaps because I didn't do any detailed configuration, memory usage in particular was quite high. For running without any tuning, the Ollama native app still seems like the easiest option. A couple of knobs worth trying are sketched after the screenshots below.

CPU

Memory
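If the memory footprint is a concern, two knobs from the startup log above can be tightened. The values here are illustrative assumptions rather than tuned settings:

# cap the CPU KV cache in GiB (the log defaulted to 12 GiB when this was unset)
export VLLM_CPU_KVCACHE_SPACE=4

# and/or serve with a shorter context than the default 32768 tokens
vllm serve Qwen/Qwen2.5-Coder-7B-Instruct --max-model-len 8192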

 

Miscellaneous

The Dockerfile provided by vLLM (https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.cpu) currently seems to have a minor issue around installing uv. I wanted to see how it behaves in a container environment, but it probably wouldn't be much different anyway. Next, let's install it directly on a Kubernetes environment.
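For reference, building and running the CPU image from the repository root would look roughly like this. This is an untested sketch given the uv issue above, and it assumes the image's entrypoint is the OpenAI-compatible server, as with the standard vLLM images:

# build the CPU image from the Dockerfile shipped in the repo
docker build -f docker/Dockerfile.cpu -t vllm-cpu-env .

# run it and expose the API on port 8000
docker run --rm -p 8000:8000 vllm-cpu-env --model Qwen/Qwen2.5-Coder-7B-Instruct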
