Detecting Kubernetes OOMKilled events in GKE logs

I'd like to set up instrumentation for OOMKilled events. When I describe a pod, such an event looks like this:

Name:   pnovotnak-manhole-123456789-82l2h
Namespace:  test
Node:   test-cluster-cja8smaK-oQSR/10.x.x.x
Start Time: Fri, 03 Feb 2017 14:34:57 -0800
Labels:   pod-template-hash=123456789
    run=pnovotnak-manhole
Status:   Running
IP:   10.x.x.x
Controllers:  ReplicaSet/pnovotnak-manhole-123456789
Containers:
  pnovotnak-manhole:
    Container ID: docker://...
    Image:    pnovotnak/it
    Image ID:   docker://sha256:...
    Port:
    Limits:
      cpu:  2
      memory: 3Gi
    Requests:
      cpu:    200m
      memory:   256Mi
    State:    Running
      Started:    Fri, 03 Feb 2017 14:41:12 -0800
    Last State:   Terminated
      Reason:   OOMKilled
      Exit Code:  137
      Started:    Fri, 03 Feb 2017 14:35:08 -0800
      Finished:   Fri, 03 Feb 2017 14:41:11 -0800
    Ready:    True
    Restart Count:  1
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-tder (ro)
    Environment Variables:  <none>
Conditions:
  Type    Status
  Initialized   True
  Ready   True
  PodScheduled  True
Volumes:
  default-token-46euo:
    Type: Secret (a volume populated by a Secret)
    SecretName: default-token-tder
QoS Class:  Burstable
Tolerations:  <none>
Events:
  FirstSeen LastSeen  Count From                SubObjectPath       Type    Reason    Message
  --------- --------  ----- ----                -------------       --------  ------    -------
  11m   11m   1 {default-scheduler }                      Normal    Scheduled Successfully assigned pnovotnak-manhole-123456789-82l2h to test-cluster-cja8smaK-oQSR
  10m   10m   1 {kubelet test-cluster-cja8smaK-oQSR} spec.containers{pnovotnak-manhole}  Normal    Created   Created container with docker id xxxxxxxxxxxx; Security:[seccomp=unconfined]
  10m   10m   1 {kubelet test-cluster-cja8smaK-oQSR} spec.containers{pnovotnak-manhole}  Normal    Started   Started container with docker id xxxxxxxxxxxx
  11m   4m    2 {kubelet test-cluster-cja8smaK-oQSR} spec.containers{pnovotnak-manhole}  Normal    Pulling   pulling image "pnovotnak/it"
  10m   4m    2 {kubelet test-cluster-cja8smaK-oQSR} spec.containers{pnovotnak-manhole}  Normal    Pulled    Successfully pulled image "pnovotnak/it"
  4m    4m    1 {kubelet test-cluster-cja8smaK-oQSR} spec.containers{pnovotnak-manhole}  Normal    Created   Created container with docker id yyyyyyyyyyyy; Security:[seccomp=unconfined]
  4m    4m    1 {kubelet test-cluster-cja8smaK-oQSR} spec.containers{pnovotnak-manhole}  Normal    Started   Started container with docker id yyyyyyyyyyyy

All I can get from the pod logs is:

{
 textPayload: "shutting down, got signal: Terminated
"
 insertId: "aaaaaaaaaaaaaaaa"
 resource: {
  type: "container"
  labels: {
   pod_id: "pnovotnak-manhole-123456789-82l2h"
   ...
  }
 }
 timestamp: "2017-02-03T22:34:48Z"
 severity: "ERROR"
 labels: {
  container.googleapis.com/container_name: "POD"
  ...
 }
 logName: "projects/myproj/logs/POD"
}

And from the kubelet logs:

{
 insertId: "bbbbbbbbbbbbbb"   
 jsonPayload: {
  _BOOT_ID: "ffffffffffffffffffffffffffffffff"    
  MESSAGE: "I0203 22:41:11.925928    1843 kubelet.go:1816] SyncLoop (PLEG): "pnovotnak-manhole-123456789-82l2h_test(a-uuid)", event: &pleg.PodLifecycleEvent{ID:"another-uuid", Type:"ContainerDied", Data:"..."}"
 ...

Neither entry seems like enough to uniquely identify this as an OOM event. Any other ideas?

9
pnovotnak

OOMKilled events don't appear in the logs, but if you can detect that a pod was killed, you can use kubectl get pod -o go-template=... <pod-id> to determine the reason. An example from the docs:

[13:59:01] $ ./cluster/kubectl.sh  get pod -o go-template='{{range.status.containerStatuses}}{{"Container Name: "}}{{.name}}{{"\r\nLastState: "}}{{.lastState}}{{end}}'  simmemleak-60xbc
Container Name: simmemleak
LastState: map[terminated:map[exitCode:137 reason:OOM Killed startedAt:2015-07-07T20:58:43Z finishedAt:2015-07-07T20:58:43Z containerID:docker://0e4095bba1feccdfe7ef9fb6ebffe972b4b14285d5acdec6f0d3ae8a22fad8b2]]

If you are doing this programmatically, a better alternative to relying on kubectl output is to call the Kubernetes REST API method GET /api/v1/pods. The ways of accessing the API are also covered in the documentation.
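As a rough sketch of the programmatic route, the Python below scans an already-decoded pod list (the JSON body returned by GET /api/v1/pods) for containers whose last termination reason is an OOM kill. The function name find_oom_killed and the sample data are made up for illustration. How the JSON is fetched (for example through kubectl proxy, or with a service account token) is left out. It accepts both spellings seen in this thread, "OOMKilled" from the describe output and "OOM Killed" from the older docs example.

```python
import json

# Reason strings seen in this thread: "OOMKilled" (kubectl describe output)
# and "OOM Killed" (the older docs example).
OOM_REASONS = {"OOMKilled", "OOM Killed"}


def find_oom_killed(pod_list):
    """Return one record per container whose last termination was an OOM kill.

    pod_list is the decoded JSON body of GET /api/v1/pods.
    (find_oom_killed is a hypothetical helper, not part of any Kubernetes client.)
    """
    hits = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        # Pods that have not started yet may have no containerStatuses at all.
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for cs in statuses:
            term = (cs.get("lastState") or {}).get("terminated")
            if term and term.get("reason") in OOM_REASONS:
                hits.append({
                    "namespace": meta.get("namespace"),
                    "pod": meta.get("name"),
                    "container": cs.get("name"),
                    "exit_code": term.get("exitCode"),
                    "finished_at": term.get("finishedAt"),
                })
    return hits


if __name__ == "__main__":
    # Hand-written sample shaped like a PodList response, not real API output.
    sample = json.loads("""
    {"items": [{
      "metadata": {"name": "pnovotnak-manhole-123456789-82l2h",
                   "namespace": "test"},
      "status": {"containerStatuses": [{
        "name": "pnovotnak-manhole",
        "restartCount": 1,
        "lastState": {"terminated": {
          "reason": "OOMKilled",
          "exitCode": 137,
          "finishedAt": "2017-02-03T22:41:11Z"}}
      }]}
    }]}
    """)
    for hit in find_oom_killed(sample):
        print(hit)
```

Run on a schedule, or fed from a watch on the same endpoint, this yields one record per OOM-killed container that can be pushed to whatever alerting you use. Note that lastState only describes the most recent termination, so earlier OOM kills of the same container are not visible this way.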

5
Adam