Nginx - 56-day-old reverse proxy suddenly unable to connect upstream.
I have nginx configured as a reverse proxy to Amazon's AWS IoT MQTT service.
This worked well for almost 2 months, until 20 of the 32 instances
suddenly stopped being able to connect upstream. We first saw sporadic
upstream SSL connection errors, then sporadic upstream connection-refused
errors, and finally mostly connection timeouts to upstream. Nothing short
of a restart or reload of nginx fixes this. Debug logging is not enabled,
and enabling it replaces the worker processes, which effectively clears
the issue. Over the next 3 days, the remaining nodes started exhibiting
the same problem. Rather than restarting nginx on those nodes, I isolated
them for study and stood up new nodes to replace them.
But in studying them, I cannot find any indication of why this is
happening. Now that they have been removed from client traffic, I can
test with curl: I can hit one of them 5 times, and by the 5th call I
get a repro. Connection timeout to the upstream, resulting in a timeout to
Here is the version information for nginx, as it comes from Ubuntu 18.04:
nginx version: nginx/1.14.0 (Ubuntu)
built with OpenSSL 1.1.1 11 Sep 2018
TLS SNI support enabled
configure arguments: --with-cc-opt='-g -O2
-fstack-protector-strong -Wformat -Werror=format-security -fPIC -Wdate-time
-D_FORTIFY_SOURCE=2' --with-ld-opt='-Wl,-Bsymbolic-functions -Wl,-z,relro
-Wl,-z,now -fPIC' --prefix=/usr/share/nginx
--http-uwsgi-temp-path=/var/lib/nginx/uwsgi --with-debug --with-pcre-jit
--with-http_v2_module --with-http_dav_module --with-http_slice_module
--with-threads --with-http_addition_module --with-http_geoip_module=dynamic
--with-stream_ssl_module --with-mail=dynamic --with-mail_ssl_module
I doubt this is related to your issue, but any reason you have 'valid'
set to only 3 seconds for your resolver conf? Seems like you could be
doing a lot of unnecessary repetitive lookups because that is set so
This is a nice pointer, thank you. From the nginx.org documentation: "One
megabyte of the cache contains about 4000 sessions." This should definitely
be higher. I don't know if it played into the outage I had, but it is good
to get this set properly going forward.
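
For anyone sizing this later, here is a rough sketch of the arithmetic
behind the "about 4000 sessions per megabyte" figure quoted above. It is
only an estimate: the helper name cache_size_mb and the headroom factor are
my own illustrative assumptions, not nginx settings, and the zone name
"SSL" in the printed value is just the conventional placeholder for a
shared session cache.

```python
import math

# Approximate figure quoted from the nginx.org documentation.
SESSIONS_PER_MB = 4000


def cache_size_mb(expected_sessions: int, headroom: float = 1.5) -> int:
    """Estimate a whole-megabyte size for a shared SSL session cache.

    `headroom` is a hypothetical safety multiplier chosen for this sketch;
    it is not something nginx itself knows about.
    """
    if expected_sessions < 0:
        raise ValueError("expected_sessions must be non-negative")
    needed = expected_sessions * headroom / SESSIONS_PER_MB
    # Round up to a whole megabyte and never suggest less than 1 MB.
    return max(1, math.ceil(needed))


if __name__ == "__main__":
    for sessions in (4_000, 50_000, 200_000):
        size = cache_size_mb(sessions)
        print(f"{sessions} sessions -> shared:SSL:{size}m")
```

The session counts in the loop are made-up inputs; the number that matters
is how many distinct sessions you expect to be live within the session
timeout on a single node.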