Chaos Engineering 2025: Making Resilience a Science

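What Is Chaos Engineering?

Chaos Engineering is a practice for scientifically verifying how well production systems tolerate failure. Pioneered by Netflix in the 2010s, it is now (as of 2025) established as an essential skill for SRE (Site Reliability Engineering) in cloud-native environments.

It rests on the idea that you cannot know whether a system withstands a failure until you actually cause one: you inject failures deliberately and observe how the system behaves.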

The Chaos Engineering Market in 2025

Metric	Value
Market size (2025)	$2.8B
Annual growth rate (CAGR)	18.5%
Enterprise adoption rate	67%
Adoption rate in Kubernetes environments	82%

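The Five Principles of Chaos Engineering

The principles defined by the Netflix Chaos Engineering team remain valid in 2025.

1. Build a hypothesis around the steady state

# Example steady-state definition
steady_state_hypothesis:
  name: "API response time is within the normal range"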
  probes:
    - type: http
      name: "api-response-time"
      provider:
        type: http
        url: "https://api.example.com/health"
        timeout: 3
      tolerance:
        - status_code: 200
        - response_time_ms: "<500"

    - type: prometheus
      name: "error-rate-below-threshold"
      provider:
        type: prometheus
        url: "http://prometheus:9090"
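        # sum() on both sides: without it the division matches series label by label,
        # so each 5xx series is divided by itself and the result is always 1
        query: "sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))"
      tolerance:
        - value: "<0.01"  # error rate below 1%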
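
The tolerances in the hypothesis above boil down to a small yes/no check. The following is a minimal sketch of that check, assuming the raw numbers have already been collected; the `Sample` type and its field names are illustrative and do not belong to any chaos tool.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One observation window: request counts plus a health probe result."""
    total_requests: int
    server_errors: int       # responses with a 5xx status
    status_code: int         # status returned by the health endpoint
    response_time_ms: float


def error_rate(sample: Sample) -> float:
    """Fraction of requests that failed with 5xx; 0.0 when there was no traffic."""
    if sample.total_requests == 0:
        return 0.0
    return sample.server_errors / sample.total_requests


def steady_state_holds(sample: Sample,
                       max_error_rate: float = 0.01,
                       max_response_ms: float = 500.0) -> bool:
    """Apply the tolerances from the hypothesis: 200 OK, under 500 ms, under 1% errors."""
    return (sample.status_code == 200
            and sample.response_time_ms < max_response_ms
            and error_rate(sample) < max_error_rate)


if __name__ == "__main__":
    healthy = Sample(total_requests=10_000, server_errors=40,
                     status_code=200, response_time_ms=120.0)
    degraded = Sample(total_requests=10_000, server_errors=250,
                      status_code=200, response_time_ms=120.0)
    print(steady_state_holds(healthy))   # error rate is 0.4%
    print(steady_state_holds(degraded))  # error rate is 2.5%
```

Running the same check before and after an experiment gives a like-for-like comparison of the steady state.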

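2. Reflect real-world events

# Realistic failure scenarios
real_world_events:
  - name: "Data center outage"
    type: zone_outage
    probability: "low frequency, high impact"

  - name: "Network latency"
    type: latency_injection
    probability: "high frequency, medium impact"

  - name: "Service dependency failure"
    type: dependency_failure
    probability: "medium frequency, high impact"

  - name: "Resource exhaustion"
    type: resource_exhaustion
    probability: "medium frequency, medium impact"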

3. Run experiments in production

4. Run experiments continuously

5. Minimize the blast radius
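
One common way to keep the blast radius small is to target only a fraction of the instances and widen that fraction over successive runs. The sketch below illustrates the idea under simple assumptions: `pick_targets` and `widening_stages` are hypothetical helpers, and the round-down-but-at-least-one rule is a choice made here, not the behavior of any specific tool.

```python
import math
import random


def pick_targets(pods, percent, seed=None):
    """Return a random subset covering `percent` of `pods`, but at least one pod."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    if not pods:
        return []
    count = max(1, math.floor(len(pods) * percent / 100))
    rng = random.Random(seed)
    return sorted(rng.sample(list(pods), count))


def widening_stages(start=5, factor=2, cap=100):
    """Percentages for successive runs: start small, grow, never exceed `cap`."""
    stages, percent = [], start
    while percent < cap:
        stages.append(percent)
        percent *= factor
    stages.append(cap)
    return stages


if __name__ == "__main__":
    pods = [f"api-server-{i}" for i in range(10)]
    print(pick_targets(pods, 50, seed=1))  # five of the ten pods
    print(widening_stages())
```

Moving to the next stage only after the previous one left the steady state intact keeps each step's worst case bounded.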

Comparison of Major Tools

LitmusChaos (CNCF)

A Kubernetes-native open-source Chaos Engineering platform and a CNCF incubating project.

# LitmusChaos ChaosExperiment definition
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-cpu-hog
  namespace: litmus
spec:
  definition:
    scope: Namespaced
    permissions:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["create", "delete", "get", "list", "patch", "update"]
    image: litmuschaos/go-runner:latest
    imagePullPolicy: Always
    args:
      - -c
      - ./experiments -name pod-cpu-hog
    command:
      - /bin/bash
    env:
      - name: TARGET_CONTAINER
        value: ""
      - name: CPU_CORES
        value: "1"
      - name: TOTAL_CHAOS_DURATION
        value: "60"
      - name: PODS_AFFECTED_PERC
        value: "100"
      - name: RAMP_TIME
        value: ""
    labels:
      name: pod-cpu-hog
      app.kubernetes.io/part-of: litmus
---
# ChaosEngine (runs the experiment)
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: "app=nginx"
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-cpu-hog
      spec:
        components:
          env:
            - name: CPU_CORES
              value: "2"
            - name: TOTAL_CHAOS_DURATION
              value: "120"
            - name: PODS_AFFECTED_PERC
              value: "50"

Gremlin

A commercial Chaos Engineering platform aimed at enterprises.

# Gremlin attack definition (YAML format)
attack:
  type: cpu
  args:
    - "--length"
    - "300"
    - "--cores"
    - "2"
    - "--percent"
    - "80"
  target:
    type: container
    strategy:
      type: Random
      attrs:
        containerLabels:
          - app: api-server
        limit: 2
  scheduling:
    type: once
    startTime: "2025-01-15T14:00:00Z"

# Gremlin Scenario (combining multiple attacks)
scenario:
  name: "Multi-region failure test"
  description: "Simulates a service failure in the Tokyo region"
  steps:
    - attack:
        type: blackhole
        args:
          - "--length"
          - "180"
          - "--hostnames"
          - "db-tokyo.example.com"
        target:
          type: container
          strategy:
            type: Exact
            attrs:
              containerLabels:
                - region: tokyo
        halt_conditions:
          - type: prometheus
            query: "sum(rate(http_requests_total{status=~'5..'}[1m])) / sum(rate(http_requests_total[1m])) > 0.1"
            description: "Stop when the error rate exceeds 10%"

Chaos Monkey (Netflix OSS)

The original Chaos Engineering tool, developed by Netflix.

# Chaos Monkey configuration
chaos_monkey:
  enabled: true

  # Schedule settings
  schedule:
    calendar:
      - "09:00-17:00"  # business hours only
    weekdays:
      - Monday
      - Tuesday
      - Wednesday
      - Thursday
      - Friday

  # Targeting
  targeting:
    app: "my-application"
    account: "production"
    region: "ap-northeast-1"

    # Exclusions
    exclusions:
      - app: "critical-payment-service"
      - cluster: "database-cluster"

  # Termination probability (per day)
  probability: 0.1  # 10% chance of terminating one instance

  # Notifications
  notifications:
    slack:
      webhook: "https://hooks.slack.com/services/..."
      channel: "#chaos-alerts"
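
A daily termination probability is easier to reason about once it is converted into what to expect over a week. The sketch below works through that arithmetic for the 0.1 setting above, assuming each weekday is an independent trial with at most one termination; the function names are illustrative.

```python
def p_at_least_one(daily_probability: float, days: int) -> float:
    """Probability of one or more terminations across independent days."""
    return 1 - (1 - daily_probability) ** days


def expected_terminations(daily_probability: float, days: int) -> float:
    """Expected number of terminations (at most one per day) over `days` days."""
    return daily_probability * days


if __name__ == "__main__":
    # One working week (Monday to Friday) at a 10% daily probability
    print(round(p_at_least_one(0.1, 5), 5))
    print(expected_terminations(0.1, 5))
```

With these settings an instance group sees at least one termination in roughly 41% of working weeks (1 - 0.9^5), or about one termination every two weeks on average.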

Chaos Mesh (PingCAP)

A Chaos Engineering platform for Kubernetes from the developers of TiDB.

# Chaos Mesh NetworkChaos
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      "app": "web"
  delay:
    latency: "100ms"
    correlation: "25"
    jitter: "10ms"
  duration: "5m"
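  # Inline `scheduler` is Chaos Mesh 1.x syntax; 2.x defines recurring runs with a separate Schedule resource
  scheduler:
    cron: "@every 2h"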
---
# Chaos Mesh PodChaos
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure
  namespace: chaos-testing
spec:
  action: pod-failure
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      "app": "api-server"
  duration: "30s"
---
# Chaos Mesh IOChaos
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-delay
  namespace: chaos-testing
spec:
  action: latency
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      "app": "database"
  volumePath: /var/lib/postgresql/data
  path: "*"
  delay: "100ms"
  percent: 50
  duration: "3m"

Tool Comparison Table

Feature	LitmusChaos	Gremlin	Chaos Monkey	Chaos Mesh
License	Apache 2.0	Commercial	Apache 2.0	Apache 2.0
K8s native	Yes	Yes	No	Yes
GUI	Yes	Yes	No	Yes
Workflows	Yes	Yes	No	Yes
Observability integration	Strong	Strong	Basic	Strong
Enterprise support	Paid	Standard	None	Paid
CNCF	Incubating	-	-	Incubating

Experiment Design Patterns

Pattern 1: Service Dependency Test

# Litmus Workflow: microservice dependency test
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: microservice-dependency-test
  namespace: litmus
spec:
  entrypoint: dependency-chaos
  serviceAccountName: argo-chaos
  templates:
    - name: dependency-chaos
      steps:
        # Step 1: verify the steady state
        - - name: verify-steady-state
            template: steady-state-check

        # Step 2: inject a failure into a dependency
        - - name: inject-payment-failure
            template: service-failure
            arguments:
              parameters:
                - name: service
                  value: "payment-service"
                - name: duration
                  value: "120s"

        # Step 3: verify resilience
        - - name: verify-graceful-degradation
            template: degradation-check

        # Step 4: confirm recovery
        - - name: verify-recovery
            template: recovery-check

    - name: steady-state-check
      container:
        image: curlimages/curl:latest
        command: ["/bin/sh", "-c"]
        args:
          - |
            response=$(curl -s -o /dev/null -w "%{http_code}" http://api-gateway/health)
            if [ "$response" != "200" ]; then
              echo "Steady state could not be confirmed"
              exit 1
            fi
            echo "Steady state confirmed"

    - name: service-failure
      inputs:
        parameters:
          - name: service
          - name: duration
      container:
        image: litmuschaos/k8s:latest
        command: ["/bin/sh", "-c"]
        args:
          - |
            kubectl scale deployment {{inputs.parameters.service}} --replicas=0
            sleep {{inputs.parameters.duration}}
            kubectl scale deployment {{inputs.parameters.service}} --replicas=3

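    - name: degradation-check
      container:
        image: curlimages/curl:latest
        command: ["/bin/sh", "-c"]
        args:
          - |
            # Confirm that the circuit breaker is working
            response=$(curl -s http://api-gateway/checkout)
            if echo "$response" | grep -q "fallback"; then
              echo "Graceful degradation confirmed"
              exit 0
            fi
            echo "Fallback is not working"
            exit 1

    # Template for Step 4: passes once the health endpoint responds again
    - name: recovery-check
      container:
        image: curlimages/curl:latest
        command: ["/bin/sh", "-c"]
        args:
          - |
            curl -sf http://api-gateway/health || exit 1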
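
The degradation check above assumes that the caller of payment-service sits behind a circuit breaker that returns a fallback response. For readers unfamiliar with the pattern, here is a minimal sketch of one; the `CircuitBreaker` class, its thresholds, and the injectable `clock` are illustrative assumptions, not taken from any particular library.

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; retries after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        # While open, skip the dependency entirely until the cool-down has passed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


if __name__ == "__main__":
    def charge():
        raise ConnectionError("payment-service unavailable")

    breaker = CircuitBreaker(max_failures=2, reset_after=5.0)
    for _ in range(3):
        print(breaker.call(charge, lambda: {"status": "fallback"}))
```

Once the breaker is open, the failing dependency is no longer called at all, which is what lets the checkout endpoint answer quickly with its fallback while payment-service is scaled to zero.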

Pattern 2: Resource Exhaustion Test

# Chaos Mesh: memory and CPU exhaustion test
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: resource-exhaustion-test
  namespace: chaos-testing
spec:
  entry: resource-test
  templates:
    - name: resource-test
      templateType: Serial
      deadline: 30m
      children:
        - cpu-stress
        - memory-stress
        - disk-stress
        - recovery-verification

    - name: cpu-stress
      templateType: StressChaos
      deadline: 10m
      stressChaos:
        mode: all
        selector:
          namespaces:
            - production
          labelSelectors:
            "app": "compute-intensive"
        stressors:
          cpu:
            workers: 4
            load: 80
        duration: "5m"

    - name: memory-stress
      templateType: StressChaos
      deadline: 10m
      stressChaos:
        mode: one
        selector:
          namespaces:
            - production
          labelSelectors:
            "app": "api-server"
        stressors:
          memory:
            workers: 2
            size: "512MB"
        duration: "5m"

    - name: disk-stress
      templateType: IOChaos
      deadline: 10m
      ioChaos:
        action: fault
        errno: 5  # EIO; the fault action needs an error number to return
        mode: one
        selector:
          namespaces:
            - production
          labelSelectors:
            "app": "database"
        volumePath: /var/lib/data
        percent: 50
        duration: "3m"

    - name: recovery-verification
      templateType: Task
      deadline: 5m
      task:
        container:
          name: verify
          image: curlimages/curl:latest
          command:
            - /bin/sh
            - -c
            - |
              for i in $(seq 1 30); do
                if curl -sf http://api-gateway/health; then
                  echo "Recovery confirmed"
                  exit 0
                fi
                sleep 10
              done
              exit 1

Pattern 3: Network Partition Test

# Network partition test
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition
  namespace: chaos-testing
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      "zone": "zone-a"
  direction: both
  target:
    mode: all
    selector:
      namespaces:
        - production
      labelSelectors:
        "zone": "zone-b"
  duration: "5m"
---
# DNS failure test
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: dns-failure
  namespace: chaos-testing
spec:
  action: error
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      "app": "web"
  patterns:
    - "external-api.example.com"
    - "*.third-party.io"
  duration: "3m"

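GameDay Implementation Guide

What Is a GameDay?

A GameDay is an event in which the whole team runs planned chaos experiments. Its goals are to sharpen incident response skills, uncover system weaknesses, and validate operational processes.

GameDay Plan Template

# GameDay plan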
gameday:
  metadata:
    name: "2025 Q1 GameDay"
    date: "2025-02-15"
    duration: "4 hours"
    participants:
      - "SRE team"
      - "Platform team"
      - "Application development team"
      - "Security team"

  objectives:
    - "Verify availability during a multi-AZ failure"
    - "Confirm database failover behavior"
    - "Validate the on-call response process"
    - "Measure recovery time against the recovery time objective (RTO)"

  scenarios:
    - name: "Scenario 1: AZ failure"
      type: zone_failure
      target: "ap-northeast-1a"
      expected_behavior:
        - "Traffic fails over automatically to other AZs"
        - "RTO: within 5 minutes"
        - "No data loss"
      success_criteria:
        - "99.9% of requests succeed"
        - "Latency increase within 20%"

    - name: "Scenario 2: Database failure"
      type: database_failure
      target: "primary-database"
      expected_behavior:
        - "Automatic promotion of a read replica"
        - "Automatic reconnection by the application"
      success_criteria:
        - "Failover completes within 30 seconds"
        - "No lost transactions"

    - name: "Scenario 3: External API failure"
      type: dependency_failure
      target: "payment-gateway"
      expected_behavior:
        - "Circuit breaker trips"
        - "Fallback handling runs"
      success_criteria:
        - "User impact is minimized"
        - "Appropriate error messages are shown"

  rollback_plan:
    automatic:
      - condition: "error rate > 5%"
        action: "Stop the experiment immediately"
      - condition: "latency > 10 seconds"
        action: "Stop the experiment immediately"
    manual:
      - "Stop decision by the GameDay lead"
      - "Notify emergency contacts"

  communication:
    channels:
      primary: "#gameday-war-room"
      escalation: "#incident-response"
    stakeholder_notifications:
      - timing: "1 hour before start"
        message: "Notice of the scheduled GameDay start"
      - timing: "At start"
        message: "Notice that the GameDay has started"
      - timing: "At end"
        message: "GameDay completion report"
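
The automatic rollback conditions and the Scenario 1 success criteria in the plan above are simple threshold checks, so they can be written down precisely before the GameDay starts. The following is a minimal sketch; the `Metrics` type and function names are hypothetical, and how the numbers are collected is left open.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    latency_seconds: float   # observed request latency


def should_abort(m: Metrics,
                 max_error_rate: float = 0.05,
                 max_latency_seconds: float = 10.0) -> list[str]:
    """Return the reasons to stop the experiment; an empty list means continue."""
    reasons = []
    if m.error_rate > max_error_rate:
        reasons.append(f"error rate {m.error_rate:.1%} > {max_error_rate:.0%}")
    if m.latency_seconds > max_latency_seconds:
        reasons.append(f"latency {m.latency_seconds:.1f}s > {max_latency_seconds:.0f}s")
    return reasons


def scenario_passed(success_ratio: float, baseline_latency: float,
                    observed_latency: float) -> bool:
    """Scenario 1 criteria: 99.9% success and at most a 20% latency increase."""
    return (success_ratio >= 0.999
            and observed_latency <= baseline_latency * 1.20)


if __name__ == "__main__":
    print(should_abort(Metrics(error_rate=0.02, latency_seconds=1.5)))
    print(should_abort(Metrics(error_rate=0.08, latency_seconds=12.0)))
    print(scenario_passed(0.9995, baseline_latency=0.200, observed_latency=0.230))
```

Returning the reasons rather than a bare boolean makes it easy to post the exact trigger to the war-room channel when an experiment is halted.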

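GameDay Execution Checklist

# GameDay execution checklist
checklist:
  pre_gameday:
    - "[ ] Send calendar invitations to all participants"
    - "[ ] Confirm backups of the systems under test"
    - "[ ] Prepare monitoring dashboards"
    - "[ ] Review rollback procedures"
    - "[ ] Notify stakeholders in advance"
    - "[ ] Confirm on-call coverage"
    - "[ ] Test communication channels"

  during_gameday:
    - "[ ] Record the steady-state baseline"
    - "[ ] Run and record each scenario"
    - "[ ] Observe the incident response"
    - "[ ] Monitor metrics continuously"
    - "[ ] Respond immediately to unexpected problems"

  post_gameday:
    - "[ ] Hold a retrospective"
    - "[ ] Document findings"
    - "[ ] Create action items"
    - "[ ] File improvement tickets"
    - "[ ] Plan the next GameDay"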

Chaos Engineering in Kubernetes Environments

A Taxonomy of Kubernetes Chaos Experiments

# Chaos experiment catalog for Kubernetes environments
chaos_experiments:
  pod_level:
    - name: pod-delete
      description: "Forcibly delete a Pod"
      validates: "Automatic recovery by the ReplicaSet/Deployment"

    - name: pod-cpu-hog
      description: "CPU load inside a Pod"
      validates: "Resource limits and HPA behavior"

    - name: pod-memory-hog
      description: "Memory load inside a Pod"
      validates: "OOM killer and Pod restart"

    - name: container-kill
      description: "Terminate the container process"
      validates: "livenessProbe and restart"

  node_level:
    - name: node-drain
      description: "Drain a node"
      validates: "Pod rescheduling"

    - name: node-taint
      description: "Add a taint to a node"
      validates: "Pod eviction and relocation"

    - name: kubelet-kill
      description: "Stop the kubelet process"
      validates: "Node failure detection and response"

  network_level:
    - name: pod-network-loss
      description: "Cut off network traffic to a Pod"
      validates: "Service discovery and failover"

    - name: pod-network-latency
      description: "Inject network latency"
      validates: "Timeout settings and retry logic"

    - name: pod-network-corruption
      description: "Packet corruption"
      validates: "Data integrity checks"

Litmus Workflow in Practice

# Chaos workflow for production
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosWorkflow
metadata:
  name: production-resilience-test
  namespace: litmus
spec:
  schedules:
    - name: weekly-chaos
      cron: "0 10 * * 3"  # every Wednesday at 10:00
      timezone: "Asia/Tokyo"

  workflow:
    spec:
      entrypoint: resilience-test
      templates:
        - name: resilience-test
          steps:
            # Phase 1: pre-chaos health check
            - - name: pre-chaos-check
                template: health-check

            # Phase 2: staged chaos injection
            - - name: phase-1-light
                template: light-chaos

            - - name: phase-1-verify
                template: verify-recovery

            # verify-recovery exposes its verdict as the output parameter `result`
            - - name: phase-2-medium
                template: medium-chaos
                when: "{{steps.phase-1-verify.outputs.parameters.result}} == 'pass'"

            - - name: phase-2-verify
                template: verify-recovery

            - - name: phase-3-heavy
                template: heavy-chaos
                when: "{{steps.phase-2-verify.outputs.parameters.result}} == 'pass'"

            # Phase 3: final verification
            - - name: final-verification
                template: comprehensive-check

        - name: health-check
          container:
            image: litmuschaos/litmus-checker:latest
            command: ["./check"]
            args:
              - "--endpoint=http://api-gateway/health"
              - "--expected-status=200"
              - "--timeout=30s"

        - name: light-chaos
          suspend: {}  # define the chaos experiment here

        - name: medium-chaos
          suspend: {}

        - name: heavy-chaos
          suspend: {}

        - name: verify-recovery
          outputs:
            parameters:
              - name: result
                valueFrom:
                  path: /tmp/result.txt
          container:
            image: litmuschaos/litmus-checker:latest
            command: ["/bin/sh", "-c"]
            args:
              - |
                sleep 60
                if curl -sf http://api-gateway/health; then
                  echo "pass" > /tmp/result.txt
                else
                  echo "fail" > /tmp/result.txt
                fi

        - name: comprehensive-check
          container:
            image: litmuschaos/litmus-checker:latest
            command: ["./comprehensive-check"]
            args:
              - "--services=api,web,worker"
              - "--timeout=300s"
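
The control flow of the staged workflow above (inject, verify, and escalate only while verification passes) can be stated compactly in ordinary code. The sketch below is an illustration of that logic only; `run_staged` and its `inject` and `verify` callbacks are hypothetical and are not part of Litmus or Argo.

```python
def run_staged(phases, inject, verify):
    """Run phases in order; stop escalating at the first failed verification.

    Returns a list of (phase, outcome) pairs, where outcome is
    "pass", "fail", or "skipped".
    """
    results = []
    halted = False
    for phase in phases:
        if halted:
            results.append((phase, "skipped"))
            continue
        inject(phase)
        if verify(phase):
            results.append((phase, "pass"))
        else:
            results.append((phase, "fail"))
            halted = True
    return results


if __name__ == "__main__":
    injected = []
    outcome = run_staged(
        ["light", "medium", "heavy"],
        inject=injected.append,
        verify=lambda phase: phase != "medium",  # pretend recovery fails after "medium"
    )
    print(outcome)
    print(injected)
```

The important property is that a heavier phase never runs after a lighter one has failed to recover, which is what the `when` conditions in the workflow express.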

Observability Integration for Chaos Experiments

# Prometheus + Grafana dashboard configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaos-dashboard
  namespace: monitoring
data:
  chaos-experiment-dashboard.json: |
    {
      "title": "Chaos Experiment Dashboard",
      "panels": [
        {
          "title": "Experiment status",
          "type": "stat",
          "targets": [
            {
              "expr": "litmuschaos_experiment_status"
            }
          ]
        },
        {
          "title": "Error rate (during experiment)",
          "type": "graph",
          "targets": [
            {
              "expr": "sum(rate(http_requests_total{status=~'5..'}[1m])) / sum(rate(http_requests_total[1m]))"
            }
          ]
        },
        {
          "title": "P99 latency (during experiment)",
          "type": "graph",
          "targets": [
            {
              "expr": "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[1m])))"
            }
          ]
        },
        {
          "title": "Pod restart count",
          "type": "graph",
          "targets": [
            {
              "expr": "increase(kube_pod_container_status_restarts_total[5m])"
            }
          ]
        }
      ]
    }
---
# Alert rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: chaos-experiment-alerts
  namespace: monitoring
spec:
  groups:
    - name: chaos-safety
      rules:
        - alert: ChaosExperimentCriticalImpact
          expr: |
            litmuschaos_experiment_status == 1
            and on()
            sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m])) > 0.05
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "A chaos experiment is causing critical impact"
            description: "The error rate has exceeded 5%. Stop the experiment immediately."

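Latest Trends in 2025

AI-Driven Chaos Engineering

# AI-assisted chaos experiments (2025 trend)
ai_chaos_features:
  intelligent_targeting:
    description: "Automatic detection of weak points using ML models"
    capabilities:
      - "Identifying critical services through dependency graph analysis"
      - "Learning from past incident patterns"
      - "Prioritizing experiments by risk score"

  adaptive_experiments:
    description: "Adjusting experiments based on real-time feedback"
    capabilities:
      - "Automatic load adjustment based on system response"
      - "Immediate rollback when an anomaly is detected"
      - "Automatic search for optimal experiment parameters"

  predictive_analysis:
    description: "Failure prediction and preventive testing"
    capabilities:
      - "Early warning of performance degradation"
      - "Forecasting capacity shortfalls"
      - "Visualizing dependency risk"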
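
Identifying critical services through dependency graph analysis does not require machine learning to get started. As a rough, non-ML baseline, the sketch below ranks services by how many other services depend on them, directly or indirectly; the service names and the `depends_on` mapping are made up for illustration.

```python
from collections import defaultdict


def transitive_dependents(depends_on):
    """Map each service to the set of services that directly or indirectly depend on it."""
    reverse = defaultdict(set)
    services = set(depends_on)
    for service, deps in depends_on.items():
        for dep in deps:
            reverse[dep].add(service)
            services.add(dep)

    result = {}
    for service in services:
        seen, stack = set(), [service]
        while stack:
            for caller in reverse[stack.pop()]:
                if caller not in seen:
                    seen.add(caller)
                    stack.append(caller)
        seen.discard(service)  # ignore self in case of a dependency cycle
        result[service] = seen
    return result


def rank_by_blast_radius(depends_on):
    """Services ordered by how many others would be affected if they failed."""
    dependents = transitive_dependents(depends_on)
    return sorted(dependents, key=lambda s: (-len(dependents[s]), s))


if __name__ == "__main__":
    graph = {
        "web": ["api"],
        "api": ["payment", "db"],
        "worker": ["db"],
        "payment": ["db"],
    }
    print(rank_by_blast_radius(graph))
```

In this example graph the database comes first because every other service reaches it, which makes it a natural early candidate for experiments once safeguards are in place.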

Integration with Platform Engineering

# Internal Developer Platform integration
platform_integration:
  backstage:
    plugin: "backstage-plugin-chaos"
    features:
      - "Launching experiments directly from the service catalog"
      - "Visualizing experiment history"
      - "Per-team resilience scores"

  golden_paths:
    chaos_templates:
      - name: "Resilience test when onboarding a new service"
        mandatory: true
        experiments:
          - pod-failure
          - network-latency
          - dependency-failure

      - name: "Regression test before production deployment"
        mandatory: true
        experiments:
          - resource-stress
          - pod-delete

Continuous Chaos

# CI/CD pipeline integration
continuous_chaos:
  github_actions:
    name: "Chaos Regression Test"
    on:
      push:
        branches: [main]
    jobs:
      chaos-test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4

          - name: Deploy to Staging
            run: kubectl apply -f k8s/

          - name: Run Chaos Experiments
            uses: litmuschaos/github-chaos-actions@v1
            with:
              experiment: "pod-delete"
              target-namespace: "staging"
              chaos-duration: "30s"

          - name: Verify Recovery
            run: |
              sleep 60
              curl -sf http://staging-api/health || exit 1

          - name: Generate Report
            run: litmus report --format=markdown > chaos-report.md  # placeholder; substitute the reporting command your setup provides
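
The "sleep 60, then curl once" recovery check in the pipeline above either waits too long or gives up too early. A polling loop is a common alternative. The sketch below is one way to write it using only the standard library; `wait_for_recovery` and `http_ok` are hypothetical helpers, and the staging URL is the one assumed in the pipeline.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=3.0):
    """True when `url` answers with a 2xx status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        return False


def wait_for_recovery(check, attempts=30, interval=10.0, sleep=time.sleep):
    """Poll `check` until it succeeds; return the attempt number, or None on timeout."""
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(interval)
    return None


if __name__ == "__main__":
    # Needs network access to the staging endpoint assumed by the pipeline above.
    attempt = wait_for_recovery(lambda: http_ok("http://staging-api/health"),
                                attempts=6, interval=10.0)
    raise SystemExit(0 if attempt is not None else 1)
```

Because the attempt number is returned, the same loop also yields a rough recovery time (attempt count times the interval) that can be recorded alongside the experiment result.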

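Best Practices

# Chaos Engineering maturity model
maturity_model:
  level_1_initial:
    practices:
      - "Manual failure testing"
      - "Run only in non-production environments"
    next_steps:
      - "Adopt tooling"
      - "Automate basic experiments"

  level_2_managed:
    practices:
      - "Regular GameDays"
      - "Experiment coverage of key services"
      - "Basic observability integration"
    next_steps:
      - "Staged experiments in production"
      - "CI/CD pipeline integration"

  level_3_defined:
    practices:
      - "Continuous chaos in production"
      - "Automated safety nets"
      - "Experiment coverage of all services"
    next_steps:
      - "AI-assisted experiment design"
      - "Platform integration"

  level_4_optimizing:
    practices:
      - "AI-driven vulnerability detection"
      - "Predictive Chaos Engineering"
      - "An organization-wide resilience culture"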

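Summary

In 2025, Chaos Engineering has evolved from simple failure testing into a scientific approach to continuously improving system resilience.

The maturing of LitmusChaos and Chaos Mesh (both CNCF projects), the arrival of AI-assisted features, and integration with platform engineering are accelerating enterprise adoption.

In Kubernetes environments in particular, what matters is multi-layered experimentation at the Pod, node, and network levels, together with deep observability integration. Running GameDays regularly and building experiments into CI/CD pipelines, so that resilience is continuously verified and improved, has become a best practice of the cloud-native era.

References: Principles of Chaos Engineering, LitmusChaos Documentation, Chaos Mesh Documentation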
