引言:声明式运维的魅力

传统部署:编写 10+ 个 YAML 文件,手动管理 StatefulSet、Service、ConfigMap…

Operator 部署:一个 CRD 搞定所有配置,自动化管理生命周期。

# 传统方式:10+ 个 YAML 文件
kubectl apply -f namespace.yaml
kubectl apply -f configmap.yaml
kubectl apply -f nameserver-statefulset.yaml
kubectl apply -f nameserver-service.yaml
kubectl apply -f broker-statefulset.yaml
...

# Operator 方式:1 个 CRD
kubectl apply -f rocketmq-cluster.yaml  # Done!

本文目标

  • 理解 Operator 工作原理
  • 安装 RocketMQ Operator
  • 使用 CRD 部署集群
  • 实现自动化运维

一、Operator 基础

1.1 什么是 Operator?

Operator = CRD + Controller

┌──────────────────────────────────────────┐
│              Operator                    │
│                                          │
│  ┌────────────┐      ┌───────────────┐  │
│  │    CRD     │      │  Controller   │  │
│  │  (定义)     │─────>│  (执行逻辑)    │  │
│  └────────────┘      └───────┬───────┘  │
│                              │          │
└──────────────────────────────┼──────────┘
                               │
                    ┌──────────▼──────────┐
                    │   K8s API Server    │
                    └──────────┬──────────┘
                               │
        ┌──────────────────────┼──────────────────────┐
        │                      │                      │
  ┌─────▼─────┐          ┌─────▼─────┐         ┌─────▼─────┐
  │StatefulSet│          │ Service   │         │ ConfigMap │
  └───────────┘          └───────────┘         └───────────┘

Operator 功能

  • 自动创建资源(StatefulSet、Service)
  • 监控集群状态
  • 自动扩缩容
  • 自动故障恢复
  • 滚动升级

二、安装 Operator

2.1 安装 CRD

# 安装 RocketMQ Operator CRD
kubectl apply -f https://raw.githubusercontent.com/apache/rocketmq-operator/master/deploy/crds/rocketmq.apache.org_brokers_crd.yaml
kubectl apply -f https://raw.githubusercontent.com/apache/rocketmq-operator/master/deploy/crds/rocketmq.apache.org_consoles_crd.yaml
kubectl apply -f https://raw.githubusercontent.com/apache/rocketmq-operator/master/deploy/crds/rocketmq.apache.org_nameservices_crd.yaml
kubectl apply -f https://raw.githubusercontent.com/apache/rocketmq-operator/master/deploy/crds/rocketmq.apache.org_topictransfers_crd.yaml

# 验证 CRD 安装
kubectl get crd | grep rocketmq

2.2 部署 Operator

# operator.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rocketmq-operator
  namespace: rocketmq
spec:
  replicas: 1
  selector:
    matchLabels:
      name: rocketmq-operator
  template:
    metadata:
      labels:
        name: rocketmq-operator
    spec:
      serviceAccountName: rocketmq-operator
      containers:
      - name: rocketmq-operator
        image: apache/rocketmq-operator:latest
        imagePullPolicy: IfNotPresent
        env:
        - name: WATCH_NAMESPACE
          value: ""
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: OPERATOR_NAME
          value: "rocketmq-operator"
kubectl apply -f operator.yaml

# 查看 Operator 状态
kubectl get pods -n rocketmq -l name=rocketmq-operator

三、使用 CRD 部署集群

3.1 完整集群配置

# rocketmq-cluster.yaml
apiVersion: rocketmq.apache.org/v1alpha1
kind: Broker
metadata:
  name: rocketmq-cluster
  namespace: rocketmq
spec:
  # 集群大小
  size: 3

  # NameServer 配置
  nameServers: "rocketmq-nameserver-0.rocketmq-nameserver.rocketmq.svc.cluster.local:9876"

  # Broker 配置
  brokerImage: apache/rocketmq:5.0.0
  imagePullPolicy: IfNotPresent

  # 资源配置
  resources:
    requests:
      memory: "2Gi"
      cpu: "1"
    limits:
      memory: "4Gi"
      cpu: "2"

  # 存储配置
  storageMode: StorageClass
  hostPath: ""
  scalePodName: broker-0
  volumeClaimTemplates:
    - metadata:
        name: broker-storage
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: rocketmq-storage
        resources:
          requests:
            storage: 100Gi

  # Broker 配置
  properties:
    brokerClusterName: DefaultCluster
    brokerRole: ASYNC_MASTER
    flushDiskType: ASYNC_FLUSH
    deleteWhen: "04"
    fileReservedTime: "48"

  # 环境变量
  env:
    - name: JAVA_OPT_EXT
      value: "-Xms2g -Xmx2g -Xmn1g"
kubectl apply -f rocketmq-cluster.yaml

# 查看 Broker 状态
kubectl get broker -n rocketmq

# 查看自动创建的资源
kubectl get statefulset,svc,pvc -n rocketmq

3.2 NameServer 配置

# nameserver.yaml
apiVersion: rocketmq.apache.org/v1alpha1
kind: NameService
metadata:
  name: rocketmq-nameserver
  namespace: rocketmq
spec:
  # NameServer 数量
  size: 2

  # 镜像配置
  nameServiceImage: apache/rocketmq:5.0.0
  imagePullPolicy: IfNotPresent

  # 资源配置
  resources:
    requests:
      memory: "512Mi"
      cpu: "250m"
    limits:
      memory: "1Gi"
      cpu: "500m"

  # 主机网络(可选)
  hostNetwork: false

  # DNS 策略
  dnsPolicy: ClusterFirst
kubectl apply -f nameserver.yaml

# 查看 NameServer
kubectl get nameservice -n rocketmq
kubectl get pods -n rocketmq -l app=rocketmq-nameserver

四、运维操作

4.1 扩缩容

# 方式1:直接修改 CRD
kubectl edit broker rocketmq-cluster -n rocketmq
# 修改 spec.size: 3 → 5

# 方式2:使用 kubectl patch
kubectl patch broker rocketmq-cluster -n rocketmq \
  --type='json' \
  -p='[{"op": "replace", "path": "/spec/size", "value": 5}]'

# 查看扩容进度
kubectl get pods -n rocketmq -l app=rocketmq-broker -w

4.2 升级版本

# 修改 CRD,更新镜像版本
spec:
  brokerImage: apache/rocketmq:5.1.0  # 旧版本: 5.0.0
kubectl apply -f rocketmq-cluster.yaml

# Operator 自动执行滚动升级
# 查看升级进度
kubectl rollout status statefulset rocketmq-broker -n rocketmq

4.3 配置变更

# 修改 Broker 配置
spec:
  properties:
    brokerClusterName: DefaultCluster
    brokerRole: SYNC_MASTER  # 改为同步复制
    flushDiskType: SYNC_FLUSH  # 改为同步刷盘
kubectl apply -f rocketmq-cluster.yaml

# Operator 自动重启 Pod 使配置生效

五、监控与告警

5.1 Operator 指标

# Operator 暴露 Prometheus 指标
curl http://rocketmq-operator:8383/metrics

# 关键指标:
# - rocketmq_broker_reconcile_total
# - rocketmq_broker_reconcile_errors_total
# - rocketmq_nameserver_reconcile_total

5.2 集成 Prometheus

# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rocketmq-operator
  namespace: rocketmq
spec:
  selector:
    matchLabels:
      name: rocketmq-operator
  endpoints:
  - port: metrics
    interval: 30s

六、高级功能

6.1 主从自动切换

# 配置 Dledger 模式
spec:
  replicaPerGroup: 3
  replicationMode: DLEDGER

Operator 自动配置 Dledger 参数,实现自动主从切换。


6.2 多集群管理

# 集群1
apiVersion: rocketmq.apache.org/v1alpha1
kind: Broker
metadata:
  name: cluster-prod
  namespace: rocketmq-prod
spec:
  size: 5
  ...

# 集群2
apiVersion: rocketmq.apache.org/v1alpha1
kind: Broker
metadata:
  name: cluster-test
  namespace: rocketmq-test
spec:
  size: 2
  ...

一个 Operator 管理多个集群。


七、最佳实践

7.1 版本管理

# 指定 Operator 版本
image: apache/rocketmq-operator:0.3.0  # 明确版本号

# 避免使用 latest
image: apache/rocketmq-operator:latest  # 不推荐

7.2 资源隔离

# 不同业务使用不同 Namespace
kubectl create namespace rocketmq-order
kubectl create namespace rocketmq-payment
kubectl create namespace rocketmq-inventory

7.3 备份恢复

# 备份 CRD 配置
kubectl get broker rocketmq-cluster -n rocketmq -o yaml > backup.yaml

# 恢复
kubectl apply -f backup.yaml

八、故障排查

8.1 Operator 日志

# 查看 Operator 日志
kubectl logs -f rocketmq-operator-xxx -n rocketmq

# 查看 Controller 事件
kubectl get events -n rocketmq --sort-by='.lastTimestamp'

8.2 CRD 状态

# 查看 Broker CRD 状态
kubectl describe broker rocketmq-cluster -n rocketmq

# 输出示例:
# Status:
#   Condition:
#     Status: True
#     Type:   Ready
#   Size:     3

九、总结

Operator vs 手动部署

对比项手动部署Operator
配置复杂度高(10+ YAML)低(1个 CRD)
扩缩容手动修改自动执行
升级手动滚动自动滚动
故障恢复手动介入自动恢复
学习成本

核心优势

1. 声明式配置(What,而非 How)
2. 自动化运维(减少人工干预)
3. 最佳实践内置(避免配置错误)
4. 简化管理(一个 CRD 管理所有资源)

下一篇预告:《Serverless 消息队列 - 按需使用的新模式》,探索无服务器架构下的 RocketMQ。

本文关键词Operator CRD 声明式运维 自动化管理