Nginx - 56 day old reverse-proxy suddenly unable to connect upstream.

Olaf van der Spek
I have nginx configured as a reverse proxy to Amazon's AWS IoT MQTT service.
This was functioning well for almost 2 months, when suddenly 20 out of 32
instances stopped being able to connect upstream.  We started seeing
sporadic upstream SSL connection errors, followed by sporadic upstream
connection refused errors, and then finally, mostly connection timeouts to
upstream.  Nothing short of a restart or reload of nginx fixes this.  Debug
logging is not enabled, and enabling it replaces the worker processes, which
effectively ends the issue.  Over the next 3 days, the remaining nodes
started exhibiting this problem as well.  Rather than restarting nginx on
these remaining nodes, I isolated them for study and stood up new nodes to
replace them.

But in studying these, I cannot find any indicator as to why this is
happening.  Now that these nodes have been removed from client traffic, I
can test them with curl.  I can hit one of them 5 times, and by the 5th call
I get a repro: a connection timeout to the upstream, resulting in a timeout
to me.

==========================================================
Here is the version information for nginx, as it comes from Ubuntu 18.04:
nginx version: nginx/1.14.0 (Ubuntu)
built with OpenSSL 1.1.1  11 Sep 2018
TLS SNI support enabled
configure arguments: --with-cc-opt='-g -O2
-fdebug-prefix-map=/build/nginx-GkiujU/nginx-1.14.0=.
-fstack-protector-strong -Wformat -Werror=format-security -fPIC -Wdate-time
-D_FORTIFY_SOURCE=2' --with-ld-opt='-Wl,-Bsymbolic-functions -Wl,-z,relro
-Wl,-z,now -fPIC' --prefix=/usr/share/nginx
--conf-path=/etc/nginx/nginx.conf --http-log-path=/var/log/nginx/access.log
--error-log-path=/var/log/nginx/error.log --lock-path=/var/lock/nginx.lock
--pid-path=/run/nginx.pid --modules-path=/usr/lib/nginx/modules
--http-client-body-temp-path=/var/lib/nginx/body
--http-fastcgi-temp-path=/var/lib/nginx/fastcgi
--http-proxy-temp-path=/var/lib/nginx/proxy
--http-scgi-temp-path=/var/lib/nginx/scgi
--http-uwsgi-temp-path=/var/lib/nginx/uwsgi --with-debug --with-pcre-jit
--with-http_ssl_module --with-http_stub_status_module
--with-http_realip_module --with-http_auth_request_module
--with-http_v2_module --with-http_dav_module --with-http_slice_module
--with-threads --with-http_addition_module --with-http_geoip_module=dynamic
--with-http_gunzip_module --with-http_gzip_static_module
--with-http_image_filter_module=dynamic --with-http_sub_module
--with-http_xslt_module=dynamic --with-stream=dynamic
--with-stream_ssl_module --with-mail=dynamic --with-mail_ssl_module

==========================================================
nginx.conf:
user www-data;
worker_processes auto;
pid /run/nginx.pid;
include /etc/nginx/modules-enabled/*.conf;
worker_rlimit_nofile 30500;

events {
        worker_connections 10000;
        # multi_accept on;
}

http {
        sendfile on;
        tcp_nopush on;
        tcp_nodelay on;
        keepalive_timeout 65;
        types_hash_max_size 2048;

        include /etc/nginx/mime.types;
        default_type application/octet-stream;

    #IPV6 also disabled via kernel boot option and sysctl, too.
    #Couldn't get nginx to stop AAAA lookups without doing that.
    resolver 8.8.8.8 8.8.4.4 valid=3s ipv6=off;
    resolver_timeout 10;
    # enable reverse proxy
    proxy_redirect              off;
    proxy_set_header            Host            CENSORED.amazonaws.com;
    proxy_set_header            X-Real-IP       $remote_addr;
    proxy_set_header            X-Forwarded-For $proxy_add_x_forwarded_for;

        ssl_protocols TLSv1 TLSv1.1 TLSv1.2; # Dropping SSLv3, ref: POODLE
        ssl_prefer_server_ciphers on;

        access_log /var/log/nginx/access.log;
        error_log /var/log/nginx/error.log error;

        gzip on;

        # Nginx-lua-prometheus
        # Prometheus metric library for Nginx
        lua_shared_dict prometheus_metrics 10M;
        lua_package_path "/etc/nginx/nginx-lua-prometheus/?.lua";
        init_by_lua '
          prometheus = require("prometheus").init("prometheus_metrics")
          metric_requests = prometheus:counter(
            "nginx_http_requests_total", "Number of HTTP requests",
            {"host", "status"})
          metric_latency = prometheus:histogram(
            "nginx_http_request_duration_seconds", "HTTP request latency",
            {"host"})
          metric_connections = prometheus:gauge(
            "nginx_http_connections", "Number of HTTP connections", {"state"})
        ';
        log_by_lua '
          metric_requests:inc(1, {ngx.var.server_name, ngx.var.status})
          metric_latency:observe(tonumber(ngx.var.request_time),
            {ngx.var.server_name})
        ';

        include /etc/nginx/conf.d/*.conf;
        include /etc/nginx/sites-enabled/*;
}

==========================================================
iot-proxy config file:
    # Define group of backend / upstream servers:
    upstream iot-backend
    {
          server CENSORED.amazonaws.com:443;
    }

    server
    {
        #listen      443 default ssl;
        listen      443 ssl;
        server_name CENSORED.something.com;

        ssl_session_cache    shared:SSL:1m;
        ssl_session_timeout  86400;
        ssl_certificate /etc/nginx/ssl/CENSORED.crt;
        ssl_certificate_key /etc/nginx/ssl/CENSORED.key;
        ssl_verify_client off;
        ssl_protocols        SSLv3 TLSv1 TLSv1.1 TLSv1.2;
        ssl_ciphers RC4:HIGH:!aNULL:!MD5;
        ssl_prefer_server_ciphers on;

        location /
        {
            proxy_pass  https://iot-backend;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_set_header Host "CENSORED.amazonaws.com:443";
            proxy_read_timeout 86400;
            proxy_ssl_session_reuse off;
        }
    }

==========================================================
nginx-lua-prometheus config file:
server {
  listen 9145;
  allow 0.0.0.0/0;
  allow 127.0.0.1/32;
  deny all;
  location /metrics {
    content_by_lua '
      metric_connections:set(ngx.var.connections_reading, {"reading"})
      metric_connections:set(ngx.var.connections_waiting, {"waiting"})
      metric_connections:set(ngx.var.connections_writing, {"writing"})
      prometheus:collect()
    ';
  }
}

Posted at Nginx Forum: https://forum.nginx.org/read.php?2,287081,287081#msg-287081

_______________________________________________
nginx mailing list
[hidden email]
http://mailman.nginx.org/mailman/listinfo/nginx

Re: Nginx - 56 day old reverse-proxy suddenly unable to connect upstream.

Sergey A. Osokin-2
Hi there,

thanks for the report!

Is there any third-party module in use?
Could you explain the reason for using SSLv3 in this case?

Thanks.

--
Sergey Osokin

On Fri, Feb 21, 2020 at 04:19:46PM -0500, bdarbro wrote:

> [original post quoted in full; snipped]

Re: Nginx - 56 day old reverse-proxy suddenly unable to connect upstream.

Olaf van der Spek
Yes: nginx-lua-prometheus.

It is installed in /etc/nginx/nginx-lua-prometheus and referenced from the
nginx-lua-prometheus config file included above.

Posted at Nginx Forum: https://forum.nginx.org/read.php?2,287081,287083#msg-287083


Re: Nginx - 56 day old reverse-proxy suddenly unable to connect upstream.

Olaf van der Spek
In reply to this post by Sergey A. Osokin-2
Oh, and SSLv3 is enabled because the client firmware uses an old stack,
something I can do nothing about.

Posted at Nginx Forum: https://forum.nginx.org/read.php?2,287081,287084#msg-287084


Re: Nginx - 56 day old reverse-proxy suddenly unable to connect upstream.

J.R.
In reply to this post by Olaf van der Spek
> resolver 8.8.8.8 8.8.4.4 valid=3s ipv6=off;

I doubt this is related to your issue, but any reason you have 'valid'
set to only 3 seconds for your resolver conf? Seems like you could be
doing a lot of unnecessary repetitive lookups because that is set so
low.
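
For illustration, a relaxed version of that directive might look like the
following.  The 30-second value is an assumption chosen only as an example,
not a figure from this thread; dropping 'valid' entirely would make nginx
cache each answer for the TTL returned in the DNS response.

```nginx
# Sketch only: valid=30s is a hypothetical value, not one suggested in the thread.
# Omitting "valid" makes nginx honor the TTL of each DNS answer instead.
resolver 8.8.8.8 8.8.4.4 valid=30s ipv6=off;
resolver_timeout 10s;
```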

> ssl_session_cache    shared:SSL:1m;
> ssl_session_timeout  86400;

This also seems dubious. SSL session timeout is set to 24 hours, but
your session cache is only 1MB.

Re: Nginx - 56 day old reverse-proxy suddenly unable to connect upstream.

Olaf van der Spek
Sure.  "valid=3s" can probably be relaxed, yes.  The intent was to prevent
grouping too many connections on the same 8 or so IPs we get in a single
DNS query.

> ssl_session_cache shared:SSL:1m;
> ssl_session_timeout 86400;

This is a nice pointer, thank you.  From the nginx.org documentation: "One
megabyte of the cache contains about 4000 sessions."  This should definitely
be higher.  I don't know if it played into the outage I had, but it is good
to get this set properly going forward.
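
As a rough sizing sketch based on that figure: at about 4000 sessions per
megabyte, a 10 MB shared zone would hold roughly 40000 sessions.  The 10m
size below is an assumed example value; the right number depends on how many
distinct clients reconnect within the 24-hour session timeout.

```nginx
# Sketch only: 10m is an assumed example size (about 40000 sessions at
# roughly 4000 sessions per megabyte), not a value taken from the thread.
ssl_session_cache    shared:SSL:10m;
ssl_session_timeout  86400;
```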

Posted at Nginx Forum: https://forum.nginx.org/read.php?2,287081,287097#msg-287097
