
[KVCache (Vineyard)] Does the PodAffinity for model name in the KVCache CRD need proper reconciliation? #876

Open
gangmuk opened this issue Mar 17, 2025 · 2 comments

Comments

@gangmuk
Collaborator

gangmuk commented Mar 17, 2025

🐛 Describe the bug

Originally, the kvcache pod and the vllm-engine pod were scheduled on the same node thanks to the PodAffinity in the KVCache CRD.

[Image: kvcache pod and engine pod colocated on the same node]

However, after restarting llama-3-8b-instruct, the KV cache pod and the engine pod end up on different nodes: the new engine pod is scheduled on a new node, while the KV cache pod stays on the original node.

[Image: kvcache pod and engine pod on different nodes after the restart]

I am not sure what the best way is to keep the KV cache pod and the engine pod colocated.
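
One option, sketched below under the assumption that the KVCache controller labels its pods with `kvcache.orchestration.aibrix.ai/name` (the actual label key may differ), would be a symmetric podAffinity on the engine Deployment so that a rescheduled engine pod is pulled back to the node hosting the KVCache pod:

```yaml
# Hypothetical sketch (not verified against the AIBrix controller): add a podAffinity
# term to the engine Deployment's pod template so the engine follows the KVCache pod.
# The label key/value below are assumptions about how the controller labels KVCache pods.
spec:
  template:
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                kvcache.orchestration.aibrix.ai/name: aibrix-kvcache-llama-3-8b-instruct  # assumed label
            topologyKey: kubernetes.io/hostname
```

The drawback is that a hard requirement can leave the new engine pod Pending if the KVCache pod's node has no free GPU, so `preferredDuringSchedulingIgnoredDuringExecution` might be the safer variant.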

Steps to Reproduce

Deploy the manifests below, then restart the llama-3-8b-instruct Deployment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-3-8b-instruct
  labels:
    model.aibrix.ai/name: llama-3-8b-instruct
    model.aibrix.ai/port: "8000"
spec:
  replicas: 1
  strategy:
    rollingUpdate:
      maxSurge: 100%
      maxUnavailable: 100%
    type: RollingUpdate
  selector:
    matchLabels:
      model.aibrix.ai/name: llama-3-8b-instruct
  template:
    metadata:
      labels:
        model.aibrix.ai/name: llama-3-8b-instruct
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: machine.cluster.vke.volcengine.com/gpu-name
                operator: In
                values:
                - NVIDIA-L20
      containers:
        - command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --port
            - "8000"
            - --uvicorn-log-level
            - warning
            - --model
            - /models/llama-3.1-8b-instruct/
            - --served-model-name
            - llama-3-8b-instruct
            - --trust-remote-code
            - --enable-chunked-prefill
            - "false"
            - --max-model-len
            - "100000"
            - --dtype
            - bfloat16
            - --disable-log-requests
            - --swap-space
            - "0"
            # - --enable-prefix-caching
          env:
            - name: VLLM_USE_VINEYARD_CACHE
              value: "0"
            - name: VINEYARD_CACHE_CPU_MEM_LIMIT_GB
              value: "72"
            - name: AIBRIX_LLM_KV_CACHE
              value: "0"
            - name: AIBRIX_LLM_KV_CACHE_KV_CACHE_NS
              value: "aibrix"
            - name: AIBRIX_LLM_KV_CACHE_CHUNK_SIZE
              value: "16"
            - name: AIBRIX_LLM_KV_CACHE_SOCKET
              value: /var/run/vineyard.sock
            - name: AIBRIX_LLM_KV_CACHE_RPC_ENDPOINT
              value: "aibrix-kvcache-llama-3-8b-instruct-rpc:9600"
            - name: VINEYARD_CACHE_ENABLE_ASYNC_UPDATE
              value: "1"
            - name: "VINEYARD_CACHE_METRICS_ENABLED"
              value: "1"
            - name: FLAGS_metrics
              value: "1"
            - name: GLOG_logtostderr
              value: "1"
          image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/vllm-openai:v0.6.1-edb07092-20250118
          imagePullPolicy: IfNotPresent
          lifecycle:
            preStop:
              exec:
                command:
                - /bin/sh
                - -c
                - |
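                  # Drain: poll vLLM metrics until no requests are running or waiting, then exit so the pod can terminate.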
                  while true; do
                    RUNNING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_running' | grep -v '#' | awk '{print $2}')
                    WAITING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_waiting' | grep -v '#' | awk '{print $2}')
                    if [ "$RUNNING" = "0.0" ] && [ "$WAITING" = "0.0" ]; then
                      echo "Terminating: No active or waiting requests, safe to terminate" >> /proc/1/fd/1
                      exit 0
                    else
                      echo "Terminating: Running: $RUNNING, Waiting: $WAITING" >> /proc/1/fd/1
                      sleep 5
                    fi
                  done
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /health
              port: 8000
              scheme: HTTP
            initialDelaySeconds: 90
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          name: vllm-openai
          ports:
          - containerPort: 8000
            protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /health
              port: 8000
              scheme: HTTP
            initialDelaySeconds: 90
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              cpu: 1
              nvidia.com/gpu: "1"
            requests:
              cpu: 1
              nvidia.com/gpu: "1"
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /models
            name: model-hostpath
          - mountPath: /dev/shm
            name: dshm
          - mountPath: /var/run
            name: kvcache-socket
        - command:
          - aibrix_runtime
          - --port
          - "8080"
          env:
          - name: INFERENCE_ENGINE
            value: vllm
          - name: INFERENCE_ENGINE_ENDPOINT
            value: http://localhost:8000
          image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.2.0
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /healthz
              port: 8080
              scheme: HTTP
            initialDelaySeconds: 3
            periodSeconds: 2
            successThreshold: 1
            timeoutSeconds: 1
          name: aibrix-runtime
          ports:
          - containerPort: 8080
            protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /ready
              port: 8080
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - aibrix_download
        - --model-uri
        - tos://aibrix-artifact-testing/models/llama-3.1-8b-instruct/
        - --local-dir
        - /models/
        env:
        - name: DOWNLOADER_MODEL_NAME
          value: llama-3.1-8b-instruct
        - name: DOWNLOADER_NUM_THREADS
          value: "16"
        - name: DOWNLOADER_ALLOW_FILE_SUFFIX
          value: json, safetensors
        - name: TOS_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              key: TOS_ACCESS_KEY
              name: tos-credential
        - name: TOS_SECRET_KEY
          valueFrom:
            secretKeyRef:
              key: TOS_SECRET_KEY
              name: tos-credential
        - name: TOS_ENDPOINT
          value: tos-cn-beijing.ivolces.com
        - name: TOS_REGION
          value: cn-beijing
        image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.1.0
        imagePullPolicy: IfNotPresent
        name: init-model
        resources: {}
        volumeMounts:
        - mountPath: /models
          name: model-hostpath
      terminationGracePeriodSeconds: 60
      volumes:
        - name: model-hostpath
          hostPath:
            path: /root/models
            type: DirectoryOrCreate
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: "4Gi"
        - name: kvcache-socket
          hostPath:
            path: /var/run/vineyard-kubernetes/default/aibrix-kvcache-llama-3-8b-instruct

---
apiVersion: v1
kind: Service
metadata:
  labels:
    model.aibrix.ai/name: llama-3-8b-instruct
    prometheus-discovery: "true"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
  name: llama-3-8b-instruct # Note: The Service name must match the label value `model.aibrix.ai/name` in the Deployment
  namespace: default
spec:
  ports:
    - name: serve
      port: 8000
      protocol: TCP
      targetPort: 8000
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    model.aibrix.ai/name: llama-3-8b-instruct
  type: ClusterIP

---

apiVersion: orchestration.aibrix.ai/v1alpha1
kind: KVCache
metadata:
  name: aibrix-kvcache-llama-3-8b-instruct
  namespace: default
  annotations:
    kvcache.orchestration.aibrix.ai/pod-affinity-workload: llama-3-8b-instruct
    kvcache.orchestration.aibrix.ai/pod-anti-affinity: "true"
    kvcache.orchestration.aibrix.ai/node-affinity-gpu-type: NVIDIA-L20
spec:
  replicas: 1
  service:
    type: ClusterIP
    port: 9600
  cacheSpec:
    image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/vineyardd:20241120
    imagePullPolicy: IfNotPresent
    cpu: "4000m"
    memory: 72Gi
```
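
For context, my understanding (not copied from the controller code, so this rendering is an assumption) is that the `pod-affinity-workload` annotation gets translated into a podAffinity term on the KVCache pod roughly like the following:

```yaml
# Assumed rendering of kvcache.orchestration.aibrix.ai/pod-affinity-workload on the
# KVCache pod spec (illustrative only; the term the controller actually generates may differ):
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          model.aibrix.ai/name: llama-3-8b-instruct
      topologyKey: kubernetes.io/hostname
```

Since `requiredDuringSchedulingIgnoredDuringExecution` is only evaluated when the KVCache pod itself is scheduled, the existing KVCache pod stays where it is after the engine rollout, which matches the behavior above.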

Expected behavior

The KV cache pod and the engine pod remain on the same node even after the engine pod is restarted.

Environment

main branch

@Jeffwan
Collaborator

Jeffwan commented Mar 17, 2025

@gangmuk Please check your service naming; I encountered a similar issue in #696.

@gangmuk
Collaborator Author

gangmuk commented Mar 17, 2025

@Jeffwan I think it is a different issue. Did you mean issue #874?
