DNS resolution failed: dns server error: 3 name error #13011

Closed
1 task done
jyc5120 opened this issue May 10, 2024 · 14 comments
@jyc5120

jyc5120 commented May 10, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Kong version ($ kong version)

3.6.1

Current Behavior

Kong 3.6.1 is running in a Kubernetes cluster, and a plugin calls a backend endpoint, https://service-name/v1/session,
which is used to verify the bearer token.
Requests fail sporadically with the following log (about 5% of requests are rejected with a 503 error):

2024/05/09 17:52:28 [error] 43#0: *112534 [lua] init.lua:371: execute(): DNS resolution failed: dns server error: 3 name error. Tried: ["(short)service-name:(na) - cache-hit/stale","service-name.default.svc.cluster.local:33 - cache-hit/stale/scheduled/dereferencing SRV","(short)6233306365613731.service-name.default.svc.cluster.local:(na) - cache-hit/stale","6233306365613731.service-name.default.svc.cluster.local:1 - cache-hit/stale/scheduled/dns server error: 3 name error","6233306365613731.service-name.default.svc.cluster.local.default.svc.cluster.local:1 - cache-hit/stale/scheduled/dns server error: 3 name error","6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local:1 - cache-hit/stale/scheduled/dns server error: 3 name error","6233306365613731.service-name.default.svc.cluster.local.cluster.local:1 - cache-hit/stale/scheduled/dns server error: 3 name error","6233306365613731.service-name.default.svc.cluster.local.ec2.internal:1 - cache-hit/stale/scheduled/dns server error: 3 name error","6233306365613731.service-name.default.svc.cluster.local:33 - cache-hit/stale/scheduled/recursion detected","6233306365613731.service-name.default.svc.cluster.local.default.svc.cluster.local:33 - cache-hit/stale/scheduled/dns server error: 3 name error","6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local:33 - cache-hit/stale/scheduled/dns server error: 3 name error","6233306365613731.service-name.default.svc.cluster.local.cluster.local:33 - cache-hit/stale/scheduled/dns server error: 3 name error","6233306365613731.service-name.default.svc.cluster.local.ec2.internal:33 - cache-hit/stale/scheduled/dns server error: 3 name error","6233306365613731.service-name.default.svc.cluster.local:5 - cache-hit/stale/scheduled/dns client error: 101 empty record received","6233306365613731.service-name.default.svc.cluster.local.default.svc.cluster.local:5 - cache-hit/stale/scheduled/dns server error: 3 name error","6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local:5 - cache-hit/stale/scheduled/dns server error: 3 name error","6233306365613731.service-name.default.svc.cluster.local.cluster.local:5 - cache-hit/stale/scheduled/dns server error: 3 name error","6233306365613731.service-name.default.svc.cluster.local.ec2.internal:5 - cache-hit/stale/scheduled/dns server error: 3 name error"], client: 100.118.0.0, server: kong, request: "GET /v1/check HTTP/1.1", host: "xxxx.com", referrer: "http:/xxxx/", request_id: "6e2b4652d6ac6713802c4a4fe87b0b53"

Expected Behavior

Requests should go through Kong as expected

Steps To Reproduce

  1. Run Kubernetes 1.22 with Kong 3.6.1 as the API gateway.
  2. Enable a token verifier plugin that verifies the bearer token against a backend endpoint (Kong needs to resolve the domain https://service-name/).
  3. Tail the Kong error log.
  4. Many "DNS resolution failed" errors appear.

Anything else?

No response

@StarlightIbuki
Contributor

@chobits Could you take a look?

@jyc5120
Author

jyc5120 commented May 11, 2024

After downgrading Kong to 3.4.2, the error is very rare, but it still happens.

@chobits
Contributor

chobits commented May 13, 2024

It seems that Kong attempted many domain:type queries in the query sequence but could not get any available records; see the Tried ... attempts in the log. A similar troubleshooting case is in #12890 (reply in thread), which contains a detailed explanation of Kong's Tried ... log.

If Kong reports this error sporadically, it means your local DNS server occasionally replied NXDOMAIN to all of the domain:type queries.

@chobits
Contributor

chobits commented May 13, 2024

After downgrading Kong to 3.4.2, the error is very rare, but it still happens.

Yes, you can increase dns_stale_ttl or set the option dns_no_sync=off to mitigate this problem, but you still need to check your local DNS server: at some point it failed to reply with available records for the query.

@chobits chobits self-assigned this May 13, 2024
@jyc5120
Author

jyc5120 commented May 13, 2024

With 3.6.1 the error is frequent; we increased dns_stale_ttl, but that did not mitigate it. After downgrading to 3.4.2 it is much better.
So I still suspect something in Kong. We are running Kong inside our K8S cluster, and the DNS server is CoreDNS. I did not find any errors when debugging kube-dns.
The issue is critical because every error means one failed user request, even if it is rare.

@chobits
Contributor

chobits commented May 13, 2024

With 3.6.1 the error is frequent; we increased dns_stale_ttl, but that did not mitigate it. After downgrading to 3.4.2 it is much better. So I still suspect something in Kong. We are running Kong inside our K8S cluster, and the DNS server is CoreDNS. I did not find any errors when debugging kube-dns. The issue is critical because every error means one failed user request, even if it is rare.

If you can easily reproduce this problem, it's not hard to debug. You need to follow the query chain shown in the error log and check whether you can get each DNS result from your local DNS server. We can tell you how to debug, but we cannot debug it for you if you cannot provide reproduction steps.

"(short)service-name:(na) - cache-hit/stale",
 "service-name.default.svc.cluster.local:33 - cache-hit/stale/scheduled/dereferencing SRV",
"(short)6233306365613731.service-name.default.svc.cluster.local:(na) - cache-hit/stale",
"6233306365613731.service-name.default.svc.cluster.local:1 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.default.svc.cluster.local:1 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local:1 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.cluster.local:1 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.ec2.internal:1 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local:33 - cache-hit/stale/scheduled/recursion detected",
"6233306365613731.service-name.default.svc.cluster.local.default.svc.cluster.local:33 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local:33 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.cluster.local:33 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.ec2.internal:33 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local:5 - cache-hit/stale/scheduled/dns client error: 101 empty record received",
"6233306365613731.service-name.default.svc.cluster.local.default.svc.cluster.local:5 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local:5 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.cluster.local:5 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.ec2.internal:5 - cache-hit/stale/scheduled/dns server error: 3 name error"

For this chain, Kong tried every domain:type combination but failed, so I think you could also check this manually when the error occurs, using a DNS client. For example, run $ dig @<local_dns_server_ip> 6233306365613731.service-name.default.svc.cluster.local.ec2.internal CNAME for the entry "6233306365613731.service-name.default.svc.cluster.local.ec2.internal:5 - cache-hit/stale/scheduled/dns server error: 3 name error".

DNS record type numbers:

33 - SRV
5 - CNAME
1 - A
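
To make the Tried list easier to check by hand, here is a minimal Python sketch that splits each entry into name, record type, and status steps, and builds the matching dig command. The helper names (parse_tried_entry, dig_command) are hypothetical and are not part of Kong; the entry format is only inferred from the log quoted in this thread.

```python
import re

# DNS record type numbers that appear in the "Tried" list of this thread.
TYPE_NAMES = {1: "A", 5: "CNAME", 33: "SRV"}

# One entry looks like "<name>:<type> - <status>/<status>/...".
# The name itself never contains a colon, so stop at the first one.
ENTRY_RE = re.compile(r"^(?P<name>[^:]+):(?P<qtype>\d+|\(na\)) - (?P<status>.*)$")


def parse_tried_entry(entry):
    """Split one "Tried" entry into name, record type, and status steps."""
    match = ENTRY_RE.match(entry.strip())
    if match is None:
        raise ValueError(f"unrecognized entry: {entry!r}")
    qtype = match["qtype"]
    if qtype == "(na)":
        type_name = None  # "(short)" cache lookups carry no record type
    else:
        type_name = TYPE_NAMES.get(int(qtype), qtype)
    return {
        "name": match["name"],
        "type": type_name,
        "status": match["status"].split("/"),
    }


def dig_command(parsed, server):
    """Build the dig command for a parsed entry, or None if it has no type."""
    if parsed["type"] is None:
        return None
    return f"dig @{server} {parsed['name']} {parsed['type']}"


if __name__ == "__main__":
    sample = (
        "6233306365613731.service-name.default.svc.cluster.local.ec2.internal:5"
        " - cache-hit/stale/scheduled/dns server error: 3 name error"
    )
    parsed = parse_tried_entry(sample)
    print(parsed["type"], parsed["status"][-1])
    print(dig_command(parsed, "<local_dns_server_ip>"))
```

The last status step is the one that explains why that attempt failed, so it is usually the first thing to compare against the dig output.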

@chobits
Contributor

chobits commented May 13, 2024

And for this entry: 6233306365613731.service-name.default.svc.cluster.local:33 - cache-hit/stale/scheduled/recursion detected

Could you provide the output of $ dig @<dns ip> 6233306365613731.service-name.default.svc.cluster.local SRV? It could tell us why the recursion detected error was reported by the DNS client. This message from Kong's DNS client means that there is some recursion loop in the SRV result.

@jyc5120
Author

jyc5120 commented May 13, 2024

Thank you for helping. We tested dig on our cluster.
Yes, we occasionally see a one-line warning: ;; WARNING: recursion requested but not available. But nslookup 6233306365613731.xxx.xx always fails.
I don't know why we have the 6233306365613731.xxx.xx domain. The service-name domain can always be resolved by DNS.

kubectl exec -i -t dnsutils -- dig 100.64.0.10 6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local SRV

Got Answer:

; <<>> DiG 9.9.5-9+deb8u19-Debian <<>> 6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local SRV
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 14191
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local. IN SRV

;; AUTHORITY SECTION:
cluster.local.		30	IN	SOA	ns.dns.cluster.local. hostmaster.cluster.local. 1715361132 7200 1800 86400 30

;; Query time: 2 msec
;; SERVER: 100.64.0.10#53(100.64.0.10)
;; WHEN: Mon May 13 16:17:36 UTC 2024
;; MSG SIZE  rcvd: 207

Sometimes the warning does not appear:

; <<>> DiG 9.9.5-9+deb8u19-Debian <<>> 100.64.0.10 6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local SRV
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 54383
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;100.64.0.10.			IN	A

;; AUTHORITY SECTION:
.			20	IN	SOA	a.root-servers.net. nstld.verisign-grs.com. 2024051300 1800 900 604800 86400

;; Query time: 1 msec
;; SERVER: 100.64.0.10#53(100.64.0.10)
;; WHEN: Mon May 13 16:28:26 UTC 2024
;; MSG SIZE  rcvd: 115

;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 63690
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local. IN SRV

;; AUTHORITY SECTION:
cluster.local.		60	IN	SOA	ns.dns.cluster.local. hostmaster.cluster.local. 1715616000 28800 7200 604800 60

;; Query time: 2 msec
;; SERVER: 100.64.0.10#53(100.64.0.10)
;; WHEN: Mon May 13 16:28:26 UTC 2024
;; MSG SIZE  rcvd: 196

@jyc5120
Author

jyc5120 commented May 13, 2024

I tested the domain:
kubectl exec -i -t dnsutils -- nslookup 6233306365613731.service-name.default.svc.cluster.local
This intermittently fails with: ** server can't find 6233306365613731.service-name.default.svc.cluster.local: NXDOMAIN

kubectl exec -i -t dnsutils -- nslookup service-name.default.svc.cluster.local
This domain always resolves successfully.

Do you think this could be the cause? How can we fix it?

@jyc5120
Author

jyc5120 commented May 13, 2024

I don't know what is happening in CoreDNS. The logs are as follows:

[INFO] 100.97.128.6:52380 - 33150 "SRV IN service-name.svc.cluster.local. udp 68 false 512" NOERROR qr,aa,rd 254 0.000204145s
[INFO] 100.97.128.6:52766 - 10810 "SRV IN service-name.svc.cluster.local. udp 60 false 512" NXDOMAIN qr,aa,rd 153 0.000145161s
[INFO] 100.97.128.6:37468 - 45120 "SRV IN service-name.cluster.local. udp 56 false 512" NXDOMAIN qr,aa,rd 149 0.000114833s
[INFO] 100.97.128.6:43556 - 11260 "SRV IN service-name. udp 42 false 512" NXDOMAIN qr,aa,rd,ra 117 0.000069646s
[INFO] 100.113.0.2:36230 - 54693 "SRV IN service-name.default.svc.cluster.local. udp 68 false 512" NOERROR qr,aa,rd 254 0.000164341s
[INFO] 100.110.0.4:37313 - 58428 "SRV IN service-name.cluster.local. udp 56 false 512" NXDOMAIN qr,aa,rd 149 0.000196046s
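
When there are many such lines, a small tally makes the NXDOMAIN pattern easier to see. The Python sketch below counts responses per query name, record type, and response code. The regular expression is only inferred from the log lines shown above, and the function name tally_rcodes is hypothetical.

```python
import re
from collections import Counter

# Matches the quoted query part and the response code that follows it, e.g.
# "SRV IN service-name.cluster.local. udp 56 false 512" NXDOMAIN
LOG_RE = re.compile(
    r'"(?P<qtype>\S+) IN (?P<name>\S+) [^"]*" (?P<rcode>[A-Z]+)'
)


def tally_rcodes(lines):
    """Count log lines per (query name, record type, response code)."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.search(line)
        if match:  # skip lines that are not query logs
            counts[(match["name"], match["qtype"], match["rcode"])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '[INFO] 100.97.128.6:37468 - 45120 "SRV IN service-name.cluster.local. '
        'udp 56 false 512" NXDOMAIN qr,aa,rd 149 0.000114833s',
        '[INFO] 100.113.0.2:36230 - 54693 "SRV IN service-name.default.svc.cluster.local. '
        'udp 68 false 512" NOERROR qr,aa,rd 254 0.000164341s',
    ]
    for key, count in tally_rcodes(sample).items():
        print(count, *key)
```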

@chobits
Contributor

chobits commented May 14, 2024

Thank you for helping. We tested dig on our cluster. Yes, we occasionally see a one-line warning: ;; WARNING: recursion requested but not available. But nslookup 6233306365613731.xxx.xx always fails. I don't know why we have the 6233306365613731.xxx.xx domain. The service-name domain can always be resolved by DNS.

I feel that it's related to the k8s/DNS configuration, but it's beyond my understanding 😢

From Kong's output, it seems service-name.default.svc.cluster.local.svc.cluster.local: SRV returns SRV records pointing to 6233306365613731.service-name.default.svc.cluster.local; Kong then tries to dereference and resolve 6233306365613731.service-name.default.svc.cluster.local:A, but gets NXDOMAIN. So you can check Kong's attempts list for every domain and type, pick the one that you expect to return IP addresses, and configure your local DNS server to return an IP address for that domain and type (usually type A). Then Kong's DNS client can return the IP address to the caller.

@chobits
Contributor

chobits commented May 14, 2024

I tested the domain: kubectl exec -i -t dnsutils -- nslookup 6233306365613731.service-name.default.svc.cluster.local intermittently ** server can't find 6233306365613731.service-name.default.svc.cluster.local: NXDOMAIN

kubectl exec -i -t dnsutils -- nslookup service-name.default.svc.cluster.local this domain always resolved successfully.

If you are sure that you can use type A for service-name.default.svc.cluster.local, you can remove SRV from the dns_order=... option in kong.conf, which is LAST,SRV,A,CNAME by default.

Do you think it could be the cause? how can we fix that?

@jyc5120
Author

jyc5120 commented May 14, 2024

Thank you again!
I tested removing SRV (dns_order=LAST,A,CNAME), and the errors have not appeared since.
I thought Kong would try all four DNS types and only report an error if they all failed. Now it looks like it ends up trying only SRV records?

@chobits
Contributor

chobits commented May 15, 2024

If you remove SRV from dns_order, Kong will not try SRV.

Kong queries the domain:type combinations for the requested domain until it gets a usable result, such as an IP address or an SRV target. If it gets an IP address during this phase, it returns it directly. If it gets an SRV target, it re-queries the domain that the SRV target points to.

The query sequence of these domain:type combinations is generated from the domain/search options in resolv.conf and the dns_order option in kong.conf. For example, you can check this test case to see how Kong's DNS client generates the query sequence: https://github.com/Kong/kong/blob/master/spec/01-unit/21-dns-client/02-client_spec.lua#L190
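
As a rough illustration of how such a sequence can be built, here is a simplified Python sketch. It is not Kong's implementation: it ignores the LAST entry, the ndots option, and caching, and it assumes the bare name is tried before the search suffixes, which matches the order seen for the dereferenced name in the Tried log earlier in this thread. The function name query_sequence is hypothetical.

```python
def query_sequence(name, search_domains, dns_order):
    """Return (fqdn, record_type) pairs in the order they would be tried.

    Simplified model: for each record type in dns_order, try the bare name
    first and then the name with each search suffix appended.
    """
    candidates = [name] + [f"{name}.{suffix}" for suffix in search_domains]
    # "LAST" means "the type that succeeded last time"; this sketch skips it.
    types = [rtype for rtype in dns_order if rtype != "LAST"]
    return [(fqdn, rtype) for rtype in types for fqdn in candidates]


if __name__ == "__main__":
    # Search suffixes as they appear in the Tried log of this issue.
    search = [
        "default.svc.cluster.local",
        "svc.cluster.local",
        "cluster.local",
        "ec2.internal",
    ]
    # With SRV removed from dns_order, no SRV query is generated at all.
    for fqdn, rtype in query_sequence("service-name", search, ["LAST", "A", "CNAME"]):
        print(f"{fqdn}:{rtype}")
```

With dns_order=LAST,A,CNAME the sequence contains only A and CNAME queries, which is consistent with the SRV dereferencing step no longer being reached.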

@jyc5120 jyc5120 closed this as completed May 15, 2024