Issue with AWS NLB and nginx

Issue with AWS NLB and nginx

DreamWerx
Hi all,

I was hoping someone might have an idea here. I have a number of nginx instances doing load balancing behind AWS network load balancers (TCP), which seem to support only TCP checks.

Recently a few have stopped working and frozen. They still seem to accept a TCP connection from the NLB, so the health check does not fail. But they cannot process requests internally, and you cannot even ssh into the machine. A reboot is required, and it takes longer than normal.

I think the failure is related to a disk issue, since the only errors in the entire logs were regarding the disk (error logs below).

Ideally if nginx or the O/S fails it would be better if the port just closed.  I've considered writing a small daemon that monitors via http locally and keeps a port open if everything is ok.
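
A minimal sketch of such a daemon, in Python, might look like the following. It is only an illustration: the function names, the check URL and the gate port are hypothetical, and the only assumption about nginx is that it answers a local HTTP request with status 200 when healthy. The NLB health check would then target the gate port instead of the nginx port.

```python
import socket
import time
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True only if the local service answers with HTTP 200 in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False


def open_gate(port):
    """Open the listening socket that the load balancer's TCP check targets."""
    gate = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    gate.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    gate.bind(("0.0.0.0", port))
    gate.listen(16)
    return gate


def step(gate, healthy, port):
    """Open the gate when healthy, close it when not; return the new gate or None."""
    if healthy and gate is None:
        return open_gate(port)
    if not healthy and gate is not None:
        gate.close()
        return None
    return gate


def drain(gate):
    """Accept and close pending health-check connections so the backlog stays empty."""
    gate.setblocking(False)
    while True:
        try:
            conn, _ = gate.accept()
        except BlockingIOError:
            return
        conn.close()


def run(check_url, gate_port, interval=5.0):
    """Main loop (hypothetical values), e.g. run("http://127.0.0.1/", 8081)."""
    gate = None
    while True:
        gate = step(gate, http_ok(check_url), gate_port)
        if gate is not None:
            drain(gate)
        time.sleep(interval)
```

One caveat with this approach: if the daemon itself blocks on disk I/O, it cannot close the gate either, and the kernel will keep completing TCP handshakes on that port until the backlog fills.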

These machines have been running for months now without any issues until now.

Anyone have an idea?

Thanks!

----
[4161960.544106] INFO: task jbd2/xvda1-8:271 blocked for more than 120 seconds.
[4161960.551035]       Not tainted 4.4.0-1022-aws #31-Ubuntu
[4161960.556118] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4161960.562846] INFO: task monit:13224 blocked for more than 120 seconds.
[4161960.567394]       Not tainted 4.4.0-1022-aws #31-Ubuntu
[4161960.571120] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4162080.576076] INFO: task dhclient:696 blocked for more than 120 seconds.
[4162080.579596]       Not tainted 4.4.0-1022-aws #31-Ubuntu
[4162080.582355] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4162080.586470] INFO: task monit:13224 blocked for more than 120 seconds.
[4162080.589847]       Not tainted 4.4.0-1022-aws #31-Ubuntu
[4162080.592654] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4162200.596100] INFO: task jbd2/xvda1-8:271 blocked for more than 120 seconds.
[4162200.599646]       Not tainted 4.4.0-1022-aws #31-Ubuntu
[4162200.602422] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4162200.606423] INFO: task dhclient:696 blocked for more than 120 seconds.
[4162200.610118]       Not tainted 4.4.0-1022-aws #31-Ubuntu
[4162200.613093] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4162200.617889] INFO: task monit:13224 blocked for more than 120 seconds.
[4162200.621641]       Not tainted 4.4.0-1022-aws #31-Ubuntu
[4162200.624506] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4162244.551431] systemd[1]: Failed to start Journal Service.
[4162320.628099] INFO: task jbd2/xvda1-8:271 blocked for more than 120 seconds.
[4162320.631942]       Not tainted 4.4.0-1022-aws #31-Ubuntu
[4162320.635012] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4162320.639647] INFO: task dhclient:696 blocked for more than 120 seconds.
[4162320.643241]       Not tainted 4.4.0-1022-aws #31-Ubuntu
[4162320.646233] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4162320.650712] INFO: task monit:13224 blocked for more than 120 seconds.
[4162320.654190]       Not tainted 4.4.0-1022-aws #31-Ubuntu
[4162320.657183] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4162334.801390] systemd[1]: Failed to start Journal Service.
[4162425.051503] systemd[1]: Failed to start Journal Service.
[4162515.301393] systemd[1]: Failed to start Journal Service.

_______________________________________________
nginx mailing list
[hidden email]
http://mailman.nginx.org/mailman/listinfo/nginx
Re: Issue with AWS NLB and nginx

Maxim Dounin
Hello!

On Mon, Nov 20, 2017 at 12:31:59PM +0100, DreamWerx wrote:

> I was hoping someone might have an idea here..  I have a number of nginx
> doing load balancing sitting behind AWS's network load balancers (TCP) -
> which seem to only support TCP checks.
>
> Recently a few have stopped working / frozen - they still seem to accept a
> tcp connection from the NLB - which leads the health check not to fail.
> But they cannot internally process the request and you cannot even ssh into
> the machine.  A reboot is required and that takes longer than normal.
>
> I think the failure is related to a disk issue since the only errors in the
> entire logs were regarding the disk. (error logs below)
>
> Ideally if nginx or the O/S fails it would be better if the port just
> closed.  I've considered writing a small daemon that monitors via http
> locally and keeps a port open if everything is ok.
>
> These machines have been running for months now without any issues until
> now.
>
> Anyone have an idea?

Once nginx is blocked on disk, it likely won't be able to do
anything else, including closing ports or accepting connections.
Native TCP checks will still see it as alive for some time
though, as they only check that the port is still open.
Such a check will probably recognize that the service is down
only when the listen queue overflows.
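
The effect can be reproduced with a small Python sketch (the function name is made up for illustration). The server socket below listens but never calls accept(), much like a process stuck on disk I/O, yet the client's connect still succeeds because the kernel completes the handshake while the listen queue has room.

```python
import socket


def tcp_check_passes_without_accept(backlog=1):
    """Connect to a listener that never accepts; return True if connect succeeds."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(backlog)
    port = server.getsockname()[1]
    try:
        # The "server" never calls accept(); the kernel alone answers the handshake.
        client = socket.create_connection(("127.0.0.1", port), timeout=2)
        client.close()
        return True
    except OSError:
        return False
    finally:
        server.close()
```

A plain TCP check behaves like the client here: it learns that the port is open, not that anything is processing requests behind it.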

Given the above, it is generally a good idea to monitor not just
ports, but some meaningful answers from a service.  You should be
able to configure such checks in AWS.
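
As a rough sketch only, the parameters for a target group that forwards TCP but health-checks over HTTP could be built like this in Python. The helper name, the "/healthz" path and the example identifiers are assumptions for illustration, and whether a given NLB accepts these settings should be checked against the current AWS documentation.

```python
def nlb_target_group_params(name, vpc_id, port=80, health_path="/healthz"):
    """Build keyword arguments for an elbv2 create_target_group call.

    Traffic is forwarded as plain TCP, while the health check issues an HTTP
    request, so a target that accepts connections but cannot answer is
    marked unhealthy.
    """
    return {
        "Name": name,
        "Protocol": "TCP",
        "Port": port,
        "VpcId": vpc_id,
        "HealthCheckProtocol": "HTTP",
        "HealthCheckPort": "traffic-port",  # check the same port that receives traffic
        "HealthCheckPath": health_path,     # hypothetical endpoint served by nginx
    }


# Usage (requires boto3 and AWS credentials; not executed here):
#   import boto3
#   params = nlb_target_group_params("nginx-lb", "vpc-EXAMPLE")
#   boto3.client("elbv2").create_target_group(**params)
```

The nginx side would need a location that returns a cheap but real response on that path.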

--
Maxim Dounin
http://mdounin.ru/