Can't create private worker nodes in an existing VPC with eksctl

I wanted to find out why nodegroup creation gets stuck waiting forever, like this.

eksctl create nodegroup \
--region ap-northeast-1 \
--cluster ekstest \
--name ng0 \
--node-type t3.medium \
--nodes 1 --nodes-min 1 --nodes-max 1 --node-ami auto --node-volume-size 10 \
--node-private-networking \
--node-security-groups sg-0d021a3fc762eed89 \
--node-labels "usage=client" \
--ssh-access \
--ssh-public-key .ssh/mymachine.pub
[ℹ]  using region ap-northeast-1
[ℹ]  will use version 1.14 for new nodegroup(s) based on control plane version
[ℹ]  nodegroup "ng0" will use "ami-055d09694b6e5591a" [AmazonLinux2/1.14]
[ℹ]  using SSH public key ".ssh/mymachine.pem.pub" as "eksctl-ekstest-nodegroup-ng0-86:8d:7f:00:97:c2:c8:19:af:94:61:03:72:c8:31:51"
[ℹ]  1 nodegroup (ng0) was included
[ℹ]  will create a CloudFormation stack for each of 1 nodegroups in cluster "ekstest"
[ℹ]  1 task: { create nodegroup "ng0" }
[ℹ]  building nodegroup stack "eksctl-ekstest-nodegroup-ng0"
[ℹ]  deploying stack "eksctl-ekstest-nodegroup-ng0"
[ℹ]  adding role "arn:aws:iam::1234567890:role/eksctl-ekstest-nodegroup-ng0-NodeInstanceRole-1F01MBBM84FH5" to auth ConfigMap
[ℹ]  nodegroup "ng0" has 0 node(s)
[ℹ]  waiting for at least 1 node(s) to become ready in "ng0"

Once it was stuck waiting, I logged in to the newly created node over ssh, checked a few things, and found the reason.

First, I confirmed that kubelet.service is stuck in a crash loop. It appears to be failing to initialize the aws cloud provider.

journalctl -u kubelet.service
Sep 24 00:41:30 ip-192-168-10-221.ap-northeast-1.compute.internal systemd[1]: Starting Kubernetes Kubelet...
Sep 24 00:41:30 ip-192-168-10-221.ap-northeast-1.compute.internal systemd[1]: Started Kubernetes Kubelet.
Sep 24 00:41:30 ip-192-168-10-221.ap-northeast-1.compute.internal kubelet[3619]: Flag --max-pods has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://ku
Sep 24 00:41:30 ip-192-168-10-221.ap-northeast-1.compute.internal kubelet[3619]: Flag --allow-privileged has been deprecated, will be removed in a future version
Sep 24 00:41:30 ip-192-168-10-221.ap-northeast-1.compute.internal kubelet[3619]: Flag --max-pods has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://ku
Sep 24 00:41:30 ip-192-168-10-221.ap-northeast-1.compute.internal kubelet[3619]: Flag --allow-privileged has been deprecated, will be removed in a future version
Sep 24 00:41:30 ip-192-168-10-221.ap-northeast-1.compute.internal kubelet[3619]: I0924 00:41:30.498199    3619 server.go:418] Version: v1.14.6-eks-5047ed
Sep 24 00:41:30 ip-192-168-10-221.ap-northeast-1.compute.internal kubelet[3619]: W0924 00:41:30.499460    3619 plugins.go:118] WARNING: aws built-in cloud provider is now deprecated. The AWS provider is deprecated and will
Sep 24 00:41:30 ip-192-168-10-221.ap-northeast-1.compute.internal kubelet[3619]: I0924 00:41:30.501778    3619 aws.go:1137] Zone not specified in configuration file; querying AWS metadata service
Sep 24 00:41:30 ip-192-168-10-221.ap-northeast-1.compute.internal kubelet[3619]: I0924 00:41:30.507431    3619 aws.go:1171] Building AWS cloudprovider
Sep 24 00:43:30 ip-192-168-10-221.ap-northeast-1.compute.internal kubelet[3619]: F0924 00:43:30.852164    3619 server.go:266] failed to run Kubelet: could not init cloud provider "aws": error finding instance i-01c729a3a7fe
Sep 24 00:43:30 ip-192-168-10-221.ap-northeast-1.compute.internal systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
Sep 24 00:43:30 ip-192-168-10-221.ap-northeast-1.compute.internal systemd[1]: Unit kubelet.service entered failed state.
Sep 24 00:43:30 ip-192-168-10-221.ap-northeast-1.compute.internal systemd[1]: kubelet.service failed.
Sep 24 00:43:36 ip-192-168-10-221.ap-northeast-1.compute.internal systemd[1]: kubelet.service holdoff time over, scheduling restart.
Sep 24 00:43:36 ip-192-168-10-221.ap-northeast-1.compute.internal systemd[1]: Starting Kubernetes Kubelet...
Sep 24 00:43:36 ip-192-168-10-221.ap-northeast-1.compute.internal systemd[1]: Started Kubernetes Kubelet.
Sep 24 00:43:36 ip-192-168-10-221.ap-northeast-1.compute.internal kubelet[3836]: Flag --max-pods has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://ku
Sep 24 00:43:36 ip-192-168-10-221.ap-northeast-1.compute.internal kubelet[3836]: Flag --allow-privileged has been deprecated, will be removed in a future version
Sep 24 00:43:36 ip-192-168-10-221.ap-northeast-1.compute.internal kubelet[3836]: Flag --max-pods has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://ku
Sep 24 00:43:36 ip-192-168-10-221.ap-northeast-1.compute.internal kubelet[3836]: Flag --allow-privileged has been deprecated, will be removed in a future version
Sep 24 00:43:36 ip-192-168-10-221.ap-northeast-1.compute.internal kubelet[3836]: I0924 00:43:36.082885    3836 server.go:418] Version: v1.14.6-eks-5047ed
Sep 24 00:43:36 ip-192-168-10-221.ap-northeast-1.compute.internal kubelet[3836]: W0924 00:43:36.083072    3836 plugins.go:118] WARNING: aws built-in cloud provider is now deprecated. The AWS provider is deprecated and will
Sep 24 00:43:36 ip-192-168-10-221.ap-northeast-1.compute.internal kubelet[3836]: I0924 00:43:36.083155    3836 aws.go:1137] Zone not specified in configuration file; querying AWS metadata service
Sep 24 00:43:36 ip-192-168-10-221.ap-northeast-1.compute.internal kubelet[3836]: I0924 00:43:36.084224    3836 aws.go:1171] Building AWS cloudprovider
Sep 24 00:45:36 ip-192-168-10-221.ap-northeast-1.compute.internal kubelet[3836]: F0924 00:45:36.401506    3836 server.go:266] failed to run Kubelet: could not init cloud provider "aws": error finding instance i-01c729a3a7fe
Sep 24 00:45:36 ip-192-168-10-221.ap-northeast-1.compute.internal systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
Sep 24 00:45:36 ip-192-168-10-221.ap-northeast-1.compute.internal systemd[1]: Unit kubelet.service entered failed state.
Sep 24 00:45:36 ip-192-168-10-221.ap-northeast-1.compute.internal systemd[1]: kubelet.service failed.
Sep 24 00:45:41 ip-192-168-10-221.ap-northeast-1.compute.internal systemd[1]: kubelet.service holdoff time over, scheduling restart.
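
The kubelet log above is mostly deprecation noise, and the one line that matters in each cycle is the fatal one. As a rough sketch, a small filter like the following could pull out only those lines. The function name `kubelet_fatal_lines` is made up for this note, and the pattern assumes the klog-style prefix seen above (`F` followed by the month and day).

```shell
# Hypothetical helper: keep only fatal kubelet lines from journal text on stdin.
# Assumes the klog prefix seen in the log above, e.g. "kubelet[3619]: F0924 ...".
kubelet_fatal_lines() {
  grep -E 'kubelet\[[0-9]+\]: F[0-9]{4} '
}

# Example use on the node (not run here):
#   journalctl -u kubelet.service --no-pager | kubelet_fatal_lines
```

With the log above as input, this should leave only the two "failed to run Kubelet" lines, one per restart.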

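I also confirmed that cloud-config.service (one of the cloud-init units) had failed. It cannot reach the yum repository. The cause is that I had created neither a NAT gateway nor a NAT instance, so the private subnet has no way out to the internet. That would presumably explain the kubelet failure as well, since initializing the aws cloud provider needs to call the EC2 API.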

I see :thinking_face: I hadn't created one because it starts costing money right away...

systemctl list-units | grep cloud-config
● cloud-config.service                                                      loaded failed failed    Apply the settings specified in cloud-config
journalctl -u cloud-config.service
Sep 24 00:40:52 ip-192-168-10-221.ap-northeast-1.compute.internal systemd[1]: Starting Apply the settings specified in cloud-config...
Sep 24 00:40:52 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: Cloud-init v. 18.2-72.amzn2.0.7 running 'modules:config' at Tue, 24 Sep 2019 00:40:52 +0000. Up 12.19 seconds.
Sep 24 00:40:52 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: Loaded plugins: priorities, update-motd
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: One of the configured repositories failed (Unknown),
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: and yum doesn't have enough cached data to continue. At this point the only
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: safe thing yum can do is fail. There are a few ways to work "fix" this:
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: 1. Contact the upstream for the repository and get them to fix the problem.
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: 2. Reconfigure the baseurl/etc. for the repository, to point to a working
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: upstream. This is most often useful if you are using a newer
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: distribution release than is supported by the repository (and the
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: packages for the previous distribution release still work).
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: 3. Run the command with the repository temporarily disabled
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: yum --disablerepo=<repoid> ...
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: 4. Disable the repository permanently, so yum won't use it by default. Yum
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: will then just ignore the repository until you permanently enable it
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: again or use --enablerepo for temporary usage:
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: yum-config-manager --disable <repoid>
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: or
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: subscription-manager repos --disable=<repoid>
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: 5. Configure the failing repository to be skipped, if it is unavailable.
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: Note that yum will try to contact the repo. when it runs most commands,
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: so will have to try and fail each time (and thus. yum will be be much
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: slower). If it is a very temporary problem though, this is often a nice
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: compromise:
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: yum-config-manager --save --setopt=<repoid>.skip_if_unavailable=true
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: Cannot find a valid baseurl for repo: amzn2-core/2/x86_64
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal systemd[1]: cloud-config.service: main process exited, code=exited, status=1/FAILURE
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: Could not retrieve mirrorlist http://amazonlinux.ap-northeast-1.amazonaws.com/2/core/latest/x86_64/mirror.list error was
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: 12: Timeout on http://amazonlinux.ap-northeast-1.amazonaws.com/2/core/latest/x86_64/mirror.list: (28, 'Connection timed out after 5000 mill
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: Sep 24 00:41:29 cloud-init[3052]: util.py[WARNING]: Package upgrade failed
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: Sep 24 00:41:29 cloud-init[3052]: cc_package_update_upgrade_install.py[WARNING]: 1 failed with exceptions, re-raising the last one
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal cloud-init[3052]: Sep 24 00:41:29 cloud-init[3052]: util.py[WARNING]: Running module package-update-upgrade-install (<module 'cloudinit.config.cc_package_upd
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal systemd[1]: Failed to start Apply the settings specified in cloud-config.
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal systemd[1]: Unit cloud-config.service entered failed state.
Sep 24 00:41:29 ip-192-168-10-221.ap-northeast-1.compute.internal systemd[1]: cloud-config.service failed.
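
To double-check the NAT theory without logging in to the node, one could look at the route table of the subnet the node landed in. The sketch below is only that: `has_default_nat_route` is a hypothetical helper name, `subnet-xxxxxxxx` is a placeholder, and it assumes the route table is explicitly associated with the subnet (a subnet that falls back to the VPC's main route table would not match this filter).

```shell
# Hypothetical helper: read route rows from stdin, one route per line, as
#   <destination> <NatGatewayId> <InstanceId> <GatewayId>
# and succeed only if 0.0.0.0/0 points at a NAT gateway or a NAT instance.
has_default_nat_route() {
  awk '$1 == "0.0.0.0/0" && ($2 ~ /^nat-/ || $3 ~ /^i-/) { found = 1 }
       END { exit !found }'
}

# Example use (not run here; subnet-xxxxxxxx is a placeholder):
#   aws ec2 describe-route-tables \
#     --filters Name=association.subnet-id,Values=subnet-xxxxxxxx \
#     --query 'RouteTables[].Routes[].[DestinationCidrBlock,NatGatewayId,InstanceId,GatewayId]' \
#     --output text | has_default_nat_route && echo "NAT route found" || echo "no NAT route"
```

In my case this would have reported no NAT route, which matches the yum timeout above.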