LetsEncrypt Troubleshooting
About
This post describes some troubleshooting scenarios for TLS certificate generation by LetsEncrypt.
It is specific to an environment based on:
- Kubernetes cluster
- Nginx Ingress Controller
- Cert-Manager
Such a system can be set up, for example, with this Terraform module (specific to Azure cloud): https://gitlab.com/optimisen/cloud/azure/aks
Cert-Manager comes with great troubleshooting guides, so here I describe cases not covered by those guides.
Kubernetes resources
Cert-Manager creates several resources in Kubernetes.
Debugging comes down to the usual kubectl get and kubectl describe commands.
  (  +---------+  )
  (  | Ingress |  ) Optional                                      ACME Only!
  (  +---------+  )
         |                                                     |
         |   +-------------+      +--------------------+       |  +-------+       +-----------+
         |-> | Certificate |----> | CertificateRequest | ----> |  | Order | ----> | Challenge |
             +-------------+      +--------------------+       |  +-------+       +-----------+
                                                               |
Diagram taken from the Cert-Manager troubleshooting guide.
There are also issuer (namespace-scoped) and/or clusterissuer (cluster-scoped) resources, which hold the ACME configuration.
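When following the chain from Certificate to Challenge, it can help to list every resource kind in issuance order for one namespace. This is a minimal sketch, assuming the short kind names certificate, certificaterequest, order and challenge resolve to the Cert-Manager resources in your cluster; the function name cm_chain is hypothetical.

```shell
# Hypothetical helper: list each cert-manager resource kind in issuance order.
# Usage: cm_chain [NAMESPACE]   (defaults to "default")
cm_chain() {
  ns="${1:-default}"
  for kind in certificate certificaterequest order challenge; do
    echo "== $kind =="
    # -o wide prints extra columns (e.g. STATUS for certificates)
    kubectl get "$kind" -n "$ns" -o wide
  done
}
```

Calling cm_chain prod prints four sections. The first resource that is not ready is usually the one worth passing to kubectl describe.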
Troubleshooting
Problem 1: Secondary validators failing
Symptoms
The order resource shows failing challenges from some of the validators.
Example:
» kubectl get certificate -o wide
NAME READY SECRET ISSUER STATUS AGE
admin-api False admin-api letsencrypt Issuing certificate as Secret does not exist 44m
» kubectl get cr -o wide
NAME READY ISSUER STATUS AGE
admin-api-g6tfk False letsencrypt Failed to wait for order resource "admin-api-g6tfk-1028245422" to become ready: order is in "invalid" state: 47m
» kubectl describe order admin-api-g6tfk-1028245422
...
Status:
  Authorizations:
    Challenges:
      Token: 27PpsLdmmUtpQ7rJSi2_cSAjRPOArdLz8Aeps9LgLLw
      Type: http-01
      URL: https://acme-v02.api.letsencrypt.org/acme/chall-v3/8048352360/lKUiBQ
      Token: 27PpsLdmmUtpQ7rJSi2_cSAjRPOArdLz8Aeps9LgLLw
      Type: dns-01
      URL: https://acme-v02.api.letsencrypt.org/acme/chall-v3/8048352360/jd6cuw
      Token: 27PpsLdmmUtpQ7rJSi2_cSAjRPOArdLz8Aeps9LgLLw
      Type: tls-alpn-01
      URL: https://acme-v02.api.letsencrypt.org/acme/chall-v3/8048352360/Kievvg
    Identifier: admin-api.prod.dev.mapal-os.com
    Initial State: pending
    URL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/8048352360
    Wildcard: false
  Failure Time: 2020-10-21T16:11:27Z
  Finalize URL: https://acme-v02.api.letsencrypt.org/acme/finalize/99553541/5802379825
  State: invalid
  URL: https://acme-v02.api.letsencrypt.org/acme/order/99553541/5802379825
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Created 48m cert-manager Created Challenge resource "admin-api-g6tfk-1028245422-1337313414" for domain "admin-api.prod.dev.mapal-os.com"
Open the URLs listed under Challenges to find the actual problem, for example:
» curl https://acme-v02.api.letsencrypt.org/acme/chall-v3/8048352360/lKUiBQ
{
...
"error": {
"type": "urn:ietf:params:acme:error:dns",
"detail": "During secondary validation: DNS problem: query timed out looking up CAA for prod.dev.mapal-os.com",
"status": 400
},
...
}
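To pull only the error message out of such a response, the JSON can be filtered on the command line. This is a minimal sketch using sed, assuming the "detail" value is a plain string without escaped double quotes; a real JSON parser would be more robust. The function name acme_error_detail is hypothetical.

```shell
# Hypothetical helper: read a challenge JSON document on stdin and print
# the "detail" field of its error, if there is one.
# Assumes "detail" holds a plain string without escaped double quotes.
acme_error_detail() {
  sed -n 's/.*"detail"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Example (needs network access):
#   curl -s https://acme-v02.api.letsencrypt.org/acme/chall-v3/8048352360/lKUiBQ | acme_error_detail
```

The function prints nothing when the response carries no error, so it can be run over all three challenge URLs of an order.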
To be completely sure, you can check whether you do have CAA records:
» dig dev.mapal-os.com -t caa +short
0 issue "letsencrypt.org"
CAA checks climb up the domain tree: if a CAA record doesn’t exist for a name, validators check its parent domains. See more about CAA records here: https://letsencrypt.org/docs/caa/
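The tree climbing can be reproduced by hand to see which name actually answers with a CAA record (or times out). This is a minimal sketch, assuming a plain label-by-label walk toward the TLD; the function name caa_candidates is hypothetical.

```shell
# Hypothetical helper: print the names a CAA lookup may visit, starting at
# the requested name and climbing one label at a time toward the TLD.
caa_candidates() {
  name="${1%.}"                   # drop a trailing dot, if any
  while [ -n "$name" ]; do
    echo "$name"
    case "$name" in
      *.*) name="${name#*.}" ;;   # strip the leftmost label
      *)   name="" ;;             # single label left: stop
    esac
  done
}

# Example (needs dig and network access): show the first name with CAA records.
#   for n in $(caa_candidates admin-api.prod.dev.mapal-os.com); do
#     rec="$(dig "$n" -t caa +short)"
#     [ -n "$rec" ] && { echo "$n: $rec"; break; }
#   done
```

A timeout at any level of this walk is enough to fail the validation, which is why the error above names prod.dev.mapal-os.com rather than the full host name.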
The “secondary validation” part means validation by additional LetsEncrypt servers.
In 2020 they implemented this multi-perspective validation, using servers located in different AWS regions.
This was done to improve security against “man-in-the-middle” attacks.
See:
- https://community.letsencrypt.org/t/acme-v1-v2-validating-challenges-from-multiple-network-vantage-points/112253
- https://letsencrypt.org/2020/02/19/multi-perspective-validation.html
But there are still quite a lot of complaints from users, the usual problems being:
- occasional timeouts from secondary validators, which seem to resolve after some time
- firewalls around DNS servers blocking access from some of the regions
Solution
See section “Tips and Tricks” below.
- Use sites like https://dnschecker.org to check if your DNS server is accessible from all regions.
- Wait a little and re-trigger certificate generation by deleting the secret.
Problem 2: LetsEncrypt Rate Limits
Symptoms
» kubectl get clusterissuer
NAME READY AGE
letsencrypt True 19m
» kubectl describe order test-5crvd-1385914580
...
Status:
Events: <none>
Note the empty list of Events, which should normally contain messages about creating Challenge resources, like this:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Created 1m cert-manager Created Challenge resource "example-com-2745722290-439160286-0" for domain "test1.example.com"
Normal Created 1m cert-manager Created Challenge resource "example-com-2745722290-439160286-1" for domain "test2.example.com"
I couldn’t find an exact error message anywhere.
But I found that this quite often correlates with hitting the LetsEncrypt rate limits, described in: https://letsencrypt.org/docs/rate-limits/
Solutions
- Use a “staging” clusterissuer: https://letsencrypt.org/docs/staging-environment/
  This helps especially with frequently running CI/CD pipelines.
  It doesn’t generate trusted certificates, so it can only be used for pre-Prod environments.
  Hence you will need to configure 2 issuers and carefully use the correct one depending on the environment.
- Change the DNS name of the host in the Ingress resource.
Quote from the Rate Limits page above:
A certificate is considered a renewal (or a duplicate) of an earlier certificate if it contains the exact same set of hostnames, ignoring capitalization and ordering of hostnames. For instance, if you requested a certificate for the names [www.example.com, example.com], you could request four more certificates for [www.example.com, example.com] during the week. If you changed the set of hostnames by adding [blog.example.com], you would be able to request additional certificates.
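The "same set of hostnames" rule from the quote can be checked locally before requesting a certificate. This is a minimal sketch that mirrors only the quoted wording (case and order are ignored); it is not LetsEncrypt's own implementation, and the function name hostset_key is hypothetical.

```shell
# Hypothetical helper: build a canonical key for a set of hostnames by
# lowercasing, sorting and de-duplicating them. Two requests with the same
# key would count as duplicates under the quoted rule.
hostset_key() {
  printf '%s\n' "$@" | tr '[:upper:]' '[:lower:]' | LC_ALL=C sort -u | paste -s -d, -
}

# Example:
#   [ "$(hostset_key www.example.com example.com)" = "$(hostset_key Example.com WWW.example.com)" ] \
#     && echo "duplicate request"
```

Adding blog.example.com to either list changes the key, which matches the way out described in the quote.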
Tips and tricks
- Websites like https://dnschecker.org allow you to run various DNS queries, including from multiple regions.
- The website https://letsdebug.net lets you test up front whether LetsEncrypt will be happy with a TLS certificate request for a specific domain.
- The status page https://letsencrypt.status.io/ shows whether there were any outages of LetsEncrypt services.
- To re-trigger certificate generation, the best way seems to be deleting the secret.
  - to list only TLS secrets:
    kubectl get secret --field-selector type=kubernetes.io/tls
  - to delete a secret:
    kubectl delete secret <SECRET_NAME>
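The two steps can be combined so that the secret name is read from the Certificate instead of being typed by hand. This is a minimal sketch, assuming the Certificate stores the name of its Secret in spec.secretName; the function name retrigger_cert is hypothetical.

```shell
# Hypothetical helper: delete the Secret that backs a Certificate so that
# cert-manager issues it again.
# Usage: retrigger_cert CERTIFICATE [NAMESPACE]   (namespace defaults to "default")
retrigger_cert() {
  cert="$1"
  ns="${2:-default}"
  # spec.secretName is the field on the Certificate that names its Secret
  secret="$(kubectl get certificate "$cert" -n "$ns" -o jsonpath='{.spec.secretName}')" || return 1
  if [ -z "$secret" ]; then
    echo "no secretName found for certificate $cert" >&2
    return 1
  fi
  kubectl delete secret "$secret" -n "$ns"
}
```

For the example from Problem 1 this would be called as retrigger_cert admin-api, followed by kubectl get certificate -o wide to watch the new attempt.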