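ETCD Host Recovery
Background: with no master backup available, one master node in the cluster had its operating system reinstalled directly. That node was not the ETCD leader (master) node, so the cluster stayed available, but the number of masters dropped from three to two.
OpenShift version: 3.11.170 (v170)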
- ETCD backup
# Run on an etcd master node
yum install -y etcd
systemctl disable etcd.service
systemctl mask etcd.service
export ETCDCTL_API=3
mkdir -p /backup/etcd-config-$(date +%Y%m%d)/
cp -R /etc/etcd/ /backup/etcd-config-$(date +%Y%m%d)/
oc get nodes -o wide|grep master |awk '{print $6":2379"}'|xargs|tr ' ' ','
ETCD_ENDPOINTS=$(oc get nodes -o wide|grep master |awk '{print $6":2379"}'|xargs|tr ' ' ',')
etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt --endpoints=$ETCD_ENDPOINTS snapshot save /var/lib/etcd/snapshot.db
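The endpoint list built above includes every master, including one that is NotReady (such as the reinstalled host). A minimal sketch of a filter that skips NotReady masters, assuming the internal IP is column 6 of `oc get nodes -o wide` as in the original command. The function name `endpoints_from_nodes` is hypothetical and not part of the original procedure.

```shell
# Hypothetical helper: turn `oc get nodes -o wide` output (read from stdin)
# into a comma-separated list of etcd client endpoints.
# Assumption: column 6 is INTERNAL-IP, as in the original awk command.
endpoints_from_nodes() {
  # Keep master rows, drop NotReady ones, append the etcd client port,
  # then join all lines with commas.
  awk '/master/ && !/NotReady/ {print $6 ":2379"}' | paste -sd, -
}
```

It could be used as `ETCD_ENDPOINTS=$(oc get nodes -o wide | endpoints_from_nodes)` in place of the grep/awk/xargs/tr pipeline.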
# Delete the invalid node first; otherwise the certificate redeploy hangs at the "remove console" step
#oc delete nodes [UNKNOWN_NODE]
oc get nodes|grep master|grep NotReady|awk '{print $1}'|xargs -I {} oc delete nodes {}
cnsz92vl12816.chenzhijun.cn
- Redeploy the ETCD CA
export TOKEN=eyJhbGciOiJIUzUxMiJ9.eyJ1c2VybmFtZSI6ImFkbWluIn0.5DWDErsUzcBYK-KD_j5tjemwPIrLMU3Xle5lDaoj-3HkYBeMQ2WTvF7wvkIj4Kint_XABxT7MgInCp9Z-gklyw
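- Redeploy the ETCD certificates
# Copy the etcd certificate redeploy playbook
- Add the master host
# Redeploy the master certificates first; otherwise the later "Wait for /apis/metrics.k8s.io/v1beta1 when registered" task fails. It is best to redeploy the certificates once more.
- Restore the master pod components
### Restore the master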
Check the hosts file and make sure the earlier new_masters configuration has been removed.
Confirm that the failed member has been removed from ETCD.
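The commands for this check did not survive in the post. One way to do it, as a sketch only: list the members, find the ID of the failed one, and remove it if it is still present. The helper name `member_id_by_name` is hypothetical, and the sketch assumes the default (simple) `etcdctl member list` output, where the first three comma-separated fields are ID, status, and name.

```shell
# Hypothetical helper: print the ID of the etcd member whose name equals $1.
# Reads `etcdctl member list` output on stdin; assumed line format:
#   ID, status, name, peer URLs, client URLs
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name {print $1}'
}
```

With `ETCD_ENDPOINTS` set as in the backup step, `etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt --endpoints=$ETCD_ENDPOINTS member list | member_id_by_name <failed-node-name>` prints the ID if the member still exists; an empty result means it is already gone. A leftover member can then be dropped with `etcdctl ... member remove <ID>`.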

- Restore ETCD
# Add the new_etcd group. Only the new_etcd group should exist; do not keep new_masters or new_nodes
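The group requirement can be checked mechanically before running the scale-up. A minimal sketch, assuming an INI-style inventory file in which groups appear as `[group]` section headers; the function name `only_new_etcd_group` is hypothetical, and the check does not look at `[OSEv3:children]` entries.

```shell
# Hypothetical check: succeed only if the inventory file ($1) has a
# [new_etcd] section and no [new_masters] or [new_nodes] section.
only_new_etcd_group() {
  inventory="$1"
  # The new_etcd group must be present.
  grep -q '^\[new_etcd\]' "$inventory" || return 1
  # Leftover scale-up groups from earlier steps must be absent.
  ! grep -Eq '^\[(new_masters|new_nodes)\]' "$inventory"
}
```

For example, `only_new_etcd_group ./hosts && echo "inventory ok"`.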
- Verify
etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt --endpoints="100.75.46.76:2379,100.75.46.77:2379,100.75.46.78:2379" --write-out=table endpoint status
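Besides reading the status table by eye, the result can be checked in a script with `etcdctl endpoint health`. A minimal sketch, assuming each healthy endpoint is reported on a line containing `is healthy` (the exact wording may differ between etcd versions); the function name `all_endpoints_healthy` is hypothetical.

```shell
# Hypothetical check: read `etcdctl endpoint health` output on stdin and
# succeed only if exactly $1 endpoints report as healthy.
# Assumption: healthy lines look like "<endpoint> is healthy: ...".
all_endpoints_healthy() {
  expected="$1"
  # grep -c prints 0 and exits non-zero when nothing matches; ignore that exit code.
  healthy=$(grep -c ' is healthy' || true)
  [ "$healthy" -eq "$expected" ]
}
```

Usage would be to run `etcdctl` with the same certificate and `--endpoints` flags as above, then `endpoint health 2>&1 | all_endpoints_healthy 3`. The `2>&1` is there because some etcdctl versions print the health lines to stderr.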
Appendix:
- Common bootstrap commands and parameters
# Import the TOKEN
export TOKEN=eyJhbGciOiJIUzUxMiJ9.eyJ1c2VybmFtZSI6ImFkbWluIn0.5DWDErsUzcBYK-KD_j5tjemwPIrLMU3Xle5lDaoj-3HkYBeMQ2WTvF7wvkIj4Kint_XABxT7MgInCp9Z-gklyw
# Deploy the cluster
curl -k -i -H "Content-Type: application/json" -H "Authorization: $TOKEN" --data '{}' https://localhost:5001/api/v1/playbooks/deploy_cluster.yml -X POST
# Uninstall the cluster
curl -k -i -H "Content-Type: application/json" -H "Authorization: $TOKEN" --data '{}' https://localhost:5001/api/v1/playbooks/uninstall.yml -X POST
# Delete groups
curl -k -i -H "Authorization: $TOKEN" https://localhost:5001/api/v1/groups/new_etcd -X DELETE
curl -k -i -H "Authorization: $TOKEN" https://localhost:5001/api/v1/groups/new_masters -X DELETE
curl -k -i -H "Authorization: $TOKEN" https://localhost:5001/api/v1/groups/new_nodes -X DELETE
# Live log
curl -k -i -H "Authorization: $TOKEN" https://localhost:5001/api/v1/jobs/$UUID/stdout -X GET
- Stuck on an ansible task
fatal: [cnsz92vl10442.chenzhijun.cn -> cnsz92vl10441.chenzhijun.cn]: FAILED! => {"changed": true, "cmd": ["oc", "adm", "create-api-client-config", "--certificate-authority=/etc/origin/master/ca.crt", "--client-dir=/tmp/openshift-ansible-O3mFaX", "--groups=system:masters,system:openshift-master", "--master=https://cnsz92vl10441:8443", "--public-master=https://cnsz92vl10441:8443", "--signer-cert=/etc/origin/master/ca.crt", "--signer-key=/etc/origin/master/ca.key", "--signer-serial=/etc/origin/master/ca.serial.txt", "--user=system:openshift-master", "--basename=openshift-master", "--expire-days=730"], "delta": "0:00:00.193549", "end": "2021-01-11 15:59:54.383491", "msg": "non-zero return code", "rc": 1, "start": "2021-01-11 15:59:54.189942", "stderr": "error: --signer-serial, \"/etc/origin/master/ca.serial.txt\" must be a valid file", "stderr_lines": ["error: --signer-serial, \"/etc/origin/master/ca.serial.txt\" must be a valid file"], "stdout": "", "stdout_lines": []}
Check whether the failed node is still registered in the cluster (that is, it was never deleted).
- Errors
TASK [openshift_ca : Install the base package for admin tooling] ***************
FAILED - RETRYING: Install the base package for admin tooling (3 retries left).
FAILED - RETRYING: Install the base package for admin tooling (2 retries left).
FAILED - RETRYING: Install the base package for admin tooling (1 retries left).
fatal: [cnsz92vl10442.chenzhijun.cn -> cnsz92vl10441.chenzhijun.cn]: FAILED! => {"attempts": 3, "changed": false, "msg": "No package matching 'atomic-openshift-3.11.170' found available, installed or updated", "rc": 126, "results": ["No package matching 'atomic-openshift-3.11.170' found available, installed or updated"]}
ansible masters -i hosts -m shell -a "yum clean all"
# A package may be missing from the repository; in that case install it manually
scp atomic-openshift-3.11.170-1.git.0.00cac56.el7.x86_64.rpm cnsz92vl10441:/tmp/
rpm -Uvh atomic-openshift-3.11.170-1.git.0.00cac56.el7.x86_64.rpm
ansible-playbook -i ./inventory ./project/openshift-master/redeploy-openshift-ca.yml
ansible-playbook -i ./inventory ./project/redeploy-certificates.yml
ansible-playbook -i ./inventory ./project/openshift-master/redeploy-certificates.yml
osm_etcd_image=harbor.uat.chenzhijun.top/rhel7/etcd:3.2.22
openshift_pkg_version=-3.11.170
openshift_is_atomic=true