LetsEncrypt Troubleshooting

About

This post describes some troubleshooting scenarios for TLS certificate generation by LetsEncrypt.

It is specific to an environment based on:

  • Kubernetes cluster
  • Nginx Ingress Controller
  • Cert-Manager

Such a system can be built, for example, with this Terraform module (specific to Azure cloud): https://gitlab.com/softmill/cloud/azure/aks

Cert-Manager comes with great troubleshooting guides in its official documentation.

So here I describe cases not covered by those guides.

Kubernetes resources

Cert-Manager creates several resources in Kubernetes.
Debugging comes down to the usual kubectl get and kubectl describe commands, applied to each resource in the chain below.

  (  +---------+  )
  (  | Ingress |  ) Optional                                              ACME Only!
  (  +---------+  )
         |                                                     |
         |   +-------------+      +--------------------+       |  +-------+       +-----------+
         |-> | Certificate |----> | CertificateRequest | ----> |  | Order | ----> | Challenge | 
             +-------------+      +--------------------+       |  +-------+       +-----------+
                                                               |

The diagram is taken from the Cert-Manager troubleshooting guides.
There are also issuer (namespace-scoped) and/or clusterissuer (cluster-scoped) resources, which in this setup hold the ACME configuration.
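As a rough illustration of walking that chain, the sketch below only builds the kubectl command lines in the order the resources are created. The helper name `debug_commands` and the default namespace are assumptions for illustration; nothing here talks to a cluster.

```python
import shlex

# Resource kinds in the order cert-manager creates them for an ACME issuer
# (Certificate -> CertificateRequest -> Order -> Challenge, as in the diagram).
CHAIN = ["certificate", "certificaterequest", "order", "challenge"]


def debug_commands(namespace: str = "default") -> list:
    """Return one `kubectl get` argv per resource kind, in chain order."""
    return [
        ["kubectl", "get", kind, "-n", namespace, "-o", "wide"]
        for kind in CHAIN
    ]


if __name__ == "__main__":
    # Print the commands so they can be copied into a shell.
    for argv in debug_commands("default"):
        print(shlex.join(argv))
```

Working left to right usually helps: the first resource that is not ready, or is missing entirely, is the one worth a kubectl describe.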

Troubleshooting

Problem 1: Secondary validators failing

Symptoms

The order resource shows failing challenges from some of the validators.
Example:

»  kubectl get certificate -o wide
NAME        READY   SECRET      ISSUER        STATUS                                         AGE
admin-api   False   admin-api   letsencrypt   Issuing certificate as Secret does not exist   44m

»  kubectl get cr -o wide
NAME              READY   ISSUER        STATUS                                                                                                          AGE
admin-api-g6tfk   False   letsencrypt   Failed to wait for order resource "admin-api-g6tfk-1028245422" to become ready: order is in "invalid" state:    47m

»  kubectl describe order admin-api-g6tfk-1028245422
...
Status:
  Authorizations:
    Challenges:
      Token:        27PpsLdmmUtpQ7rJSi2_cSAjRPOArdLz8Aeps9LgLLw
      Type:         http-01
      URL:          https://acme-v02.api.letsencrypt.org/acme/chall-v3/8048352360/lKUiBQ
      Token:        27PpsLdmmUtpQ7rJSi2_cSAjRPOArdLz8Aeps9LgLLw
      Type:         dns-01
      URL:          https://acme-v02.api.letsencrypt.org/acme/chall-v3/8048352360/jd6cuw
      Token:        27PpsLdmmUtpQ7rJSi2_cSAjRPOArdLz8Aeps9LgLLw
      Type:         tls-alpn-01
      URL:          https://acme-v02.api.letsencrypt.org/acme/chall-v3/8048352360/Kievvg
    Identifier:     admin-api.prod.dev.mapal-os.com
    Initial State:  pending
    URL:            https://acme-v02.api.letsencrypt.org/acme/authz-v3/8048352360
    Wildcard:       false
  Failure Time:     2020-10-21T16:11:27Z
  Finalize URL:     https://acme-v02.api.letsencrypt.org/acme/finalize/99553541/5802379825
  State:            invalid
  URL:              https://acme-v02.api.letsencrypt.org/acme/order/99553541/5802379825
...
Events:
  Type    Reason   Age   From          Message
  ----    ------   ----  ----          -------
  Normal  Created  48m   cert-manager  Created Challenge resource "admin-api-g6tfk-1028245422-1337313414" for domain "admin-api.prod.dev.mapal-os.com"

Open the URLs listed under Challenges to find out the actual problem, for example:

»  curl https://acme-v02.api.letsencrypt.org/acme/chall-v3/8048352360/lKUiBQ
{
...
  "error": {
    "type": "urn:ietf:params:acme:error:dns",
    "detail": "During secondary validation: DNS problem: query timed out looking up CAA for prod.dev.mapal-os.com",
    "status": 400
  },
...
}
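The same check can be scripted. The following is a minimal sketch that mirrors the curl call above: it assumes the challenge URL still answers a plain GET, and the function names are hypothetical. Treating any error without the phrase "secondary validation" as a primary-validator failure is a heuristic, not something documented by LetsEncrypt.

```python
import json
import urllib.request


def fetch_challenge(url: str) -> dict:
    """GET a challenge URL (taken from the order status) and decode the JSON."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def summarize_challenge(challenge: dict) -> str:
    """Reduce a challenge object to a one-line diagnosis."""
    kind = challenge.get("type", "?")
    error = challenge.get("error")
    if not error:
        return f"{kind}: no error reported"
    detail = error.get("detail", "")
    # Heuristic: the error detail names secondary validation explicitly.
    where = "secondary" if "secondary validation" in detail.lower() else "primary"
    return f"{kind}: {where} validator failed: {detail}"


if __name__ == "__main__":
    # Sample data copied from the error shown above; no network access needed.
    sample = {
        "type": "http-01",
        "error": {
            "type": "urn:ietf:params:acme:error:dns",
            "detail": "During secondary validation: DNS problem: query timed out "
                      "looking up CAA for prod.dev.mapal-os.com",
            "status": 400,
        },
    }
    print(summarize_challenge(sample))
```

Against a live order, `summarize_challenge(fetch_challenge(url))` could be called once per challenge URL from the kubectl describe order output.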

To be completely sure, you can check whether you actually have CAA records:

»  dig dev.mapal-os.com -t caa +short 
0 issue "letsencrypt.org"

CAA checks climb the domain tree, so if a CAA record does not exist for a name, validators check its parent domains. That is why the error above mentions prod.dev.mapal-os.com even though the record lives on dev.mapal-os.com. See more about CAA records here: https://letsencrypt.org/docs/caa/
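The tree climbing can be sketched as follows. This is a simplified model, assuming the lookup tries each parent name in turn and stops at the first one with records; it ignores CNAME handling, and the `records_by_name` mapping stands in for real DNS answers. Both function names are hypothetical.

```python
def caa_lookup_chain(name: str) -> list:
    """List the names checked for CAA records, from the host up to the TLD."""
    labels = name.rstrip(".").lower().split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


def relevant_caa(name: str, records_by_name: dict):
    """Return (owner, records) for the first name in the chain that has CAA
    records, or None if no name in the chain has any."""
    for candidate in caa_lookup_chain(name):
        records = records_by_name.get(candidate)
        if records:
            return candidate, records
    return None


if __name__ == "__main__":
    # Stand-in for DNS: only dev.mapal-os.com has a CAA record, as in the dig output.
    dns = {"dev.mapal-os.com": ['0 issue "letsencrypt.org"']}
    for checked in caa_lookup_chain("admin-api.prod.dev.mapal-os.com"):
        print(checked)
    print(relevant_caa("admin-api.prod.dev.mapal-os.com", dns))
```

Each name printed by the loop is a separate DNS query, so a timeout on any of them before the record is found is enough to fail validation.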

The phrase “secondary validation” means validation by additional LetsEncrypt servers.
In 2020 they introduced this multi-perspective validation, in which the check is repeated from servers located in different AWS zones.
This was done to improve security against “man-in-the-middle” attacks.

But there are still quite a lot of complaints from users, the usual problems being:

  • occasional timeouts from secondary validators, which seem to resolve after some time
  • firewalls around DNS servers blocking access from some of the regions

Solution

See also the section “Tips and tricks” below.

  1. Use sites like https://dnschecker.org to check if your DNS server is accessible from all regions.

  2. Wait a little and re-trigger certificate generation by deleting the TLS secret.

Problem 2: LetsEncrypt Rate Limits

Symptoms

»  kubectl get clusterissuer
NAME          READY   AGE
letsencrypt   True    19m

»  kubectl describe order test-5crvd-1385914580
...

Status:
Events:  <none>

Note the empty list of Events, which would normally contain messages about creating Challenge resources, like this:

  Type    Reason   Age   From          Message
  ----    ------   ----  ----          -------
  Normal  Created  1m    cert-manager  Created Challenge resource "example-com-2745722290-439160286-0" for domain "test1.example.com"
  Normal  Created  1m    cert-manager  Created Challenge resource "example-com-2745722290-439160286-1" for domain "test2.example.com"

I couldn't find an exact error message anywhere.
But I found that this symptom quite often correlates with hitting the LetsEncrypt rate limits described in: https://letsencrypt.org/docs/rate-limits/

Solutions

  1. Use a “staging” clusterissuer: https://letsencrypt.org/docs/staging-environment/
    This is especially useful for frequently running CI/CD pipelines.
    It doesn't generate trusted certificates, so it can only be used for pre-Prod environments.
    Hence you will need to configure 2 issuers and carefully use the correct one for each environment.

  2. Change the DNS name of the host in the Ingress resource.
    Quote from the Rate Limits page above:

A certificate is considered a renewal (or a duplicate) of an earlier certificate if it contains the exact same set of hostnames, ignoring capitalization and ordering of hostnames. For instance, if you requested a certificate for the names [www.example.com, example.com], you could request four more certificates for [www.example.com, example.com] during the week. If you changed the set of hostnames by adding [blog.example.com], you would be able to request additional certificates.
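The rule in that quote can be expressed in a few lines. This is only a sketch of the comparison as the quote describes it (same set of hostnames, ignoring capitalization and ordering); the function names are hypothetical, and it does not count requests or model the limits themselves.

```python
def hostname_set(hostnames) -> frozenset:
    """Normalize a list of hostnames: case and order do not matter."""
    return frozenset(h.strip().lower() for h in hostnames)


def is_duplicate(requested, issued) -> bool:
    """True if both requests cover exactly the same set of hostnames."""
    return hostname_set(requested) == hostname_set(issued)


if __name__ == "__main__":
    issued = ["www.example.com", "example.com"]
    # Same names, different order and capitalization: counts as a duplicate.
    print(is_duplicate(["Example.com", "WWW.example.com"], issued))
    # Adding a hostname changes the set, so it is a different certificate.
    print(is_duplicate(issued + ["blog.example.com"], issued))
```

This is why changing or adding a host in the Ingress helps: the resulting set of hostnames no longer matches the certificates already counted against the limit.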

Tips and tricks

  1. Websites like https://dnschecker.org let you run various DNS queries, including from multiple regions.

  2. The website https://letsdebug.net lets you test up front whether LetsEncrypt will accept a certificate request for a specific domain.

  3. The status page https://letsencrypt.status.io/ shows whether there were any outages of LetsEncrypt services.

  4. To re-trigger certificate generation, the best way seems to be deleting the secret:

  • to list only TLS secrets: kubectl get secret --field-selector type=kubernetes.io/tls
  • to delete a secret: kubectl delete secret <SECRET_NAME>
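When several certificates are stuck, picking the secrets by hand gets tedious. The sketch below assumes the output of kubectl get certificate -o json has already been loaded into a dict, and relies on the `spec.secretName` field and the `Ready` condition of the Certificate resource. The helper names are hypothetical, and the code only builds the delete commands without running them.

```python
def not_ready_secrets(cert_list: dict) -> list:
    """Return secret names of certificates whose Ready condition is not True."""
    names = []
    for cert in cert_list.get("items", []):
        conditions = cert.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            names.append(cert["spec"]["secretName"])
    return names


def delete_commands(cert_list: dict, namespace: str = "default") -> list:
    """Build one `kubectl delete secret` argv per not-ready certificate."""
    return [
        # --ignore-not-found avoids an error if the secret was never created.
        ["kubectl", "delete", "secret", name, "-n", namespace, "--ignore-not-found"]
        for name in not_ready_secrets(cert_list)
    ]


if __name__ == "__main__":
    # Made-up sample shaped like `kubectl get certificate -o json` output.
    sample = {"items": [
        {"spec": {"secretName": "admin-api"},
         "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
        {"spec": {"secretName": "web"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    ]}
    for argv in delete_commands(sample):
        print(" ".join(argv))
```

Reviewing the printed commands before running them is deliberate: deleting a secret that is still serving traffic forces a new issuance, which counts against the rate limits discussed above.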