Sunday, June 5, 2022

[lunar.lab] Cannot Resolve ".local" Domain from TKGm Workload Cluster

Problem Statement

  • The Kubernetes pod status is ImagePullBackOff.
  • kubectl describe pod shows this error message:

dial tcp: lookup harbor-01a.corp.local: Temporary failure in name resolution

  • The container image is pulled from a local container registry whose FQDN has the ".local" domain suffix.
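If you have many pods to triage, a small shell filter can flag this specific failure in kubectl describe pod output. This is a minimal sketch: the function name is hypothetical, and the pattern only matches the error text shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl describe pod` output on stdin and
# succeeds (exit 0) only if an image pull failed because a ".local" name
# could not be resolved, as in the error message above.
is_local_dns_pull_failure() {
  grep -qE 'lookup [^[:space:]:]+\.local: Temporary failure in name resolution'
}
```

Usage (my-pod is a placeholder): kubectl describe pod my-pod | is_local_dns_pull_failure && echo "image pull failed on a .local DNS lookup"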

Cause

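By default, systemd-resolved does not route lookups for names with the ".local" suffix to unicast DNS servers, because ".local" is reserved for Multicast DNS (mDNS). This is expected behavior of the systemd-resolved service running on the cluster nodes.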
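To make the cause concrete, here is a deliberately simplified model of where a lookup ends up, based on the systemd-resolved documentation: static /etc/hosts entries are answered first (systemd-resolved itself also reads /etc/hosts), multi-label names ending in ".local" go to Multicast DNS, other multi-label names go to the unicast DNS servers, and single-label names go to LLMNR. It is a sketch under those assumptions, not resolved's real logic: routing/search domains and other special cases are ignored, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Simplified model (an assumption, not systemd-resolved's actual logic) of
# where a hostname lookup is answered on a node.
# Prints one of: hosts, mdns, dns, llmnr.
# Usage: lookup_route NAME [HOSTS_FILE]
lookup_route() {
  local name="$1" hosts_file="${2:-/etc/hosts}"
  # Static hosts entries are answered first: field 1 is the IP address,
  # the remaining fields are names and aliases; comment lines are skipped.
  if [ -r "$hosts_file" ] && awk -v n="$name" '
      $1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
      END { exit !found }' "$hosts_file"; then
    echo hosts
    return 0
  fi
  case "$name" in
    *.local) echo mdns ;;   # multi-label ".local" name: Multicast DNS only
    *.*)     echo dns ;;    # any other multi-label name: unicast DNS servers
    *)       echo llmnr ;;  # single-label name: LLMNR
  esac
}
```

With the registry line present in the hosts file, lookup_route harbor-01a.corp.local prints hosts; without it, the name falls into the mdns branch and never reaches the DNS servers, which matches the failure above.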

Reference

  • VMware KB83623

Workaround

As explained in VMware KB83623 above, only a workaround is currently available for TKGm; there is no resolution for this issue yet. I followed the steps in the official documentation referred to above to make the ".local" domain resolvable:
  • I created vsphere-overlay-dns-control-plane.yaml and vsphere-overlay-dns-workers.yaml.
  • I stored both files in the directory ~/.config/tanzu/tkg/providers/infrastructure-vsphere/ytt/.
  • I deployed a new workload cluster and tested it.
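The first two steps amount to copying the overlay files into the directory the Tanzu CLI reads ytt overlays from. A minimal sketch, assuming both files sit in a source directory of your choice; the function name and the YTT_DIR override are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy the two DNS overlay files into the Tanzu CLI's
# ytt overlay directory (default path taken from the steps above).
YTT_DIR="${YTT_DIR:-$HOME/.config/tanzu/tkg/providers/infrastructure-vsphere/ytt}"

install_dns_overlays() {
  local src="$1" dest="${2:-$YTT_DIR}" f
  mkdir -p "$dest" || return 1
  for f in vsphere-overlay-dns-control-plane.yaml vsphere-overlay-dns-workers.yaml; do
    # Stop on a missing file instead of silently deploying without it.
    if [ ! -f "$src/$f" ]; then
      echo "missing overlay: $src/$f" >&2
      return 1
    fi
    cp "$src/$f" "$dest/" || return 1
  done
}
```

After running install_dns_overlays ./overlays, a new cluster can be created with tanzu cluster create my-cluster --file my-cluster-config.yaml, where the cluster name and config file are placeholders. The overlays only apply to clusters created after they are in place.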
Unfortunately, the result was still the same. :'(

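So this is what I did instead: I added a static /etc/hosts entry for my Harbor registry to the control plane overlay, as shown below.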
#@ load("@ytt:overlay", "overlay")
#@ load("@ytt:data", "data")

#@overlay/match by=overlay.subset({"kind":"KubeadmControlPlane"})
---
spec:
  kubeadmConfigSpec:
    preKubeadmCommands:
    #! disable dns from being emitted by dhcp client
    #@overlay/append
    - echo '[DHCPv4]' >> /etc/systemd/network/10-id0.network
    #@overlay/append
    - echo 'UseDNS=no' >> /etc/systemd/network/10-id0.network
    #! add a static hosts entry so the registry FQDN resolves without DNS
    #@overlay/append
    - echo "192.168.110.101     harbor-01a.corp.local" >> /etc/hosts
    #@overlay/append
    - '/usr/bin/systemctl restart systemd-networkd 2>/dev/null'
  • Add the lines below to vsphere-overlay-dhcp-workers.yaml, including the /etc/hosts entry for the registry.
#@ load("@ytt:overlay", "overlay")
#@ load("@ytt:data", "data")

#@overlay/match by=overlay.subset({"kind":"KubeadmConfigTemplate"}),expects="1+"
---
spec:
  template:
    spec:
      #@overlay/match missing_ok=True
      preKubeadmCommands:
      #! disable dns from being emitted by dhcp client
      #@overlay/append
      - echo '[DHCPv4]' >> /etc/systemd/network/10-id0.network
      #@overlay/append
      - echo 'UseDNS=no' >> /etc/systemd/network/10-id0.network
      #! add a static hosts entry so the registry FQDN resolves without DNS
      #@overlay/append
      - echo "192.168.110.101   harbor-01a.corp.local" >> /etc/hosts
      #@overlay/append
      - '/usr/bin/systemctl restart systemd-networkd 2>/dev/null'
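Both overlays append the same shell commands to preKubeadmCommands, so each node runs them once at bootstrap, before kubeadm. To see what they leave behind, here is a sketch that replays them against a scratch directory instead of the real /etc. The function name and ROOT argument are hypothetical, and the systemd-networkd restart is left out because there is no real network to reconfigure.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replay the overlays' preKubeadmCommands under ROOT
# (a scratch directory) so the resulting files can be inspected safely.
# Usage: replay_prekubeadm ROOT IP FQDN
replay_prekubeadm() {
  local root="$1" ip="$2" fqdn="$3"
  local netfile="$root/etc/systemd/network/10-id0.network"
  mkdir -p "$root/etc/systemd/network" || return 1
  # Same as the overlay: ignore DNS servers handed out by the DHCP lease.
  echo '[DHCPv4]' >> "$netfile"
  echo 'UseDNS=no' >> "$netfile"
  # Same as the overlay: pin the registry FQDN to its IP in the hosts file.
  echo "$ip     $fqdn" >> "$root/etc/hosts"
  # On a real node the overlay then runs:
  #   /usr/bin/systemctl restart systemd-networkd
}
```

Because the commands append with >>, they are not idempotent; that is acceptable here since preKubeadmCommands are meant to run once when the node is bootstrapped.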

  • Deploy a new workload cluster and test it by creating a deployment that pulls an image from the local Harbor registry. Now it works!
  • I also ran into another ImagePullBackOff issue, this time with a certificate-related error caused by configuring Harbor with a self-signed certificate. I explained it in this post: https://dy.si/spDBk.
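The deployment test can be scripted with kubectl. A minimal sketch: the function name, deployment name, and image path are hypothetical placeholders for an image you have pushed to your own registry.

```shell
#!/usr/bin/env bash
# Hypothetical smoke test: create a deployment whose image lives in the
# local Harbor registry and wait for it to roll out. If the pull ends in
# ImagePullBackOff, the rollout times out and the function returns non-zero.
# Usage: registry_smoke_test NAME IMAGE [TIMEOUT]
registry_smoke_test() {
  local name="$1" image="$2" timeout="${3:-120s}"
  kubectl create deployment "$name" --image="$image" || return 1
  kubectl rollout status "deployment/$name" --timeout="$timeout"
}
```

For example: registry_smoke_test harbor-test harbor-01a.corp.local/library/nginx:latest, then clean up with kubectl delete deployment harbor-test.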

Please note that my container registry FQDN is harbor-01a.corp.local with IP address 192.168.110.101. These steps add a specific record to the /etc/hosts file on each workload cluster node deployed by TKG; they do NOT make every ".local" domain resolvable. That is sufficient for my requirements, so I did not dig further into why the steps in the official documentation did not work.

Also note that any workload cluster that needs to resolve my local Harbor registry must be redeployed: the steps above do not affect workload clusters that are already deployed.
