Cache doesn't expire as should

Cache doesn't expire as should

Руслан Закиров
Hi,

I'm stuck analyzing this and cannot reproduce it in a test environment.

We see the '/' location "freeze" from time to time as nginx stops updating the cache. This lasts from 30 minutes to an hour.

Regular request flow:
HIT
...
HIT
UPDATING
UPDATING
EXPIRED (every 30 seconds)
HIT
...

So under normal conditions a page is served from the cache; the cache record expires; the next request goes to the upstream, locking the cache record; requests are still served from the cache with the UPDATING cache status while the updating request is in flight; when the upstream returns, the cache is updated and unlocked, and the EXPIRED cache status shows up in the log. Then back to step one - serving from cache.
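For reference, the relevant part of the setup looks roughly like the sketch below; the directives are standard nginx, but the zone name, paths, upstream address and the 30-second validity are placeholders rather than our exact config:

    proxy_cache_path /var/nginx/cache keys_zone=main:64m inactive=10m;

    upstream backend {
        server 127.0.0.1:8080;   # placeholder
    }

    server {
        location / {
            proxy_pass http://backend;
            proxy_cache main;
            # entry is fresh for 30s, hence the EXPIRED status every 30 seconds
            proxy_cache_valid 200 30s;
            # only one request at a time populates a new cache element
            proxy_cache_lock on;
            # while an expired entry is being refreshed, other clients get the
            # stale copy and are logged with the UPDATING status
            proxy_cache_use_stale updating;
        }
    }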

The abnormal flow we see under nginx 1.6.x:
HIT
...
HIT
UPDATING
STALE
UPDATING
UPDATING
STALE
UPDATING
STALE
STALE
....
continues for a while
...
HIT

We first noticed this when we attempted an upgrade to 1.8.0, where the situation was much worse:
HIT
...
HIT
UPDATING
UPDATING
UPDATING
UPDATING
....
continues until the cache is cleaned
...

We downgraded nginx back to 1.6 and after some time realized that this situation also happens on 1.6, but the page doesn't freeze forever.

I'm not saying it only happens with '/'; it's just the busiest page, so we probably don't notice other pages. On 1.8 this also happened on top-level sections like /football/. I feel it's related to the amount of pressure on the page.

Our error log has "ignore long locked inactive cache entry" alerts, but I really couldn't match them to the "defreeze" event. The access log has STALE/UPDATING requests between the alert and the EXPIRED (cache updating) request.

Any help on hunting it down would be appreciated.

--
Руслан Закиров
Head of Web Services Development
+7(916) 597-92-69, ruz @ 


Re: Cache doesn't expire as should

vedranf
Your (pastebin) config?
Cache situation? (hard disks, cache configuration, etc.)

Posted at Nginx Forum: https://forum.nginx.org/read.php?2,263625,263626#msg-263626


Re: Cache doesn't expire as should

Руслан Закиров

On Tue, Dec 22, 2015 at 3:42 PM, itpp2012 <[hidden email]> wrote:
Your (pastebin) config? 
Cache situation? (hard disks, cache configuration, etc.)
 
https://gist.github.com/ruz/05456767750715f6b54e

Pasted the relevant configuration. I can add more on request; our complete config is 4k lines split into 200 files.

Hosts are FreeBSD 8.4-RELEASE-p4 with the cache on a memory disk with UFS.

nginx/1.6.3

--
Руслан Закиров
Head of Web Services Development
+7(916) 597-92-69, ruz @ 


Re: Cache doesn't expire as should

vedranf
Have you tried without locking?

Posted at Nginx Forum: https://forum.nginx.org/read.php?2,263625,263629#msg-263629


Re: Cache doesn't expire as should

Maxim Dounin
In reply to this post by Руслан Закиров
Hello!

On Tue, Dec 22, 2015 at 02:43:11PM +0300, Руслан Закиров wrote:

[...]

> Our error log has "ignore long locked inactive cache entry" alerts, but I
> really couldn't match it to "defreeze" event. Access log has STALE/UPDATING
> requests between the alert and EXPIRED (cache updating) request.

The "ignore long locked inactive cache entry" alerts indicate that
a cache entry was locked by some request, and wasn't unlocked for
a long time.  The alert is expected to appear if a cache node is
locked for cache inactive time (as set by proxy_cache_path
inactive=, 10 minutes by default).
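For example, the relevant parameter lives on the proxy_cache_path directive (the path and zone name below are placeholders; only inactive= is the point here):

    proxy_cache_path /var/nginx/cache keys_zone=main:64m inactive=10m;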

Most likely reason is that a worker died or was killed
while holding a lock on a cache node (i.e., while a request was
waiting for a new response from a backend).

Trivial things to consider:

- check logs for segmentation faults;

- if you are using 3rd party modules / patches, try without them;

- make sure you don't kill worker processes yourself or using some
  automation scripts (in particular, don't try to terminate old
  worker processes after a configuration reload).

--
Maxim Dounin
http://nginx.org/


Re: Cache doesn't expire as should

Руслан Закиров
In reply to this post by vedranf

On Tue, Dec 22, 2015 at 4:39 PM, itpp2012 <[hidden email]> wrote:
Have you tried without locking?

No, as it's an unacceptable load increase on the backends when a cache record expires.


--
Руслан Закиров
Head of Web Services Development
+7(916) 597-92-69, ruz @ 


Re: Cache doesn't expire as should

Руслан Закиров
In reply to this post by Maxim Dounin


On Tue, Dec 22, 2015 at 4:49 PM, Maxim Dounin <[hidden email]> wrote:
Hello!

On Tue, Dec 22, 2015 at 02:43:11PM +0300, Руслан Закиров wrote:

[...]

> Our error log has "ignore long locked inactive cache entry" alerts, but I
> really couldn't match it to "defreeze" event. Access log has STALE/UPDATING
> requests between the alert and EXPIRED (cache updating) request.

The "ignore long locked inactive cache entry" alerts indicate that
a cache entry was locked by some request, and wasn't unlocked for
a long time.  The alert is expected to appear if a cache node is
locked for cache inactive time (as set by proxy_cache_path
inactive=, 10 minutes by default). 

Inactive is defined in the config, but it's set to the default of 10m.

What happens with requests after this time? Do they hit the backend and update the cache? Do they use the stale version?

I'm going to check the "long locked" messages in the log to see how many were for the "/" location.
The hash should be the same if we didn't change the cache key, right?

Most likely reason is that a worker died or was killed
while holding a lock on a cache node (i.e., while a request was
waiting for a new response from a backend).

Shouldn't there be a record in the error log? The error log level is at warn.
 

Trivial things to consider:

- check logs for segmentation faults;

no segfaults in the logs

- if you are using 3rd party modules / patches, try without them;

from the FreeBSD port; updated the gist [1] with `nginx -V` output
[1] https://gist.github.com/ruz/05456767750715f6b54e
 
- make sure you don't kill worker processes yourself or using some
  automation scripts (in particular, don't try to terminate old
  worker processes after a configuration reload).

One recent appearance of the problem was at 1:30AM and I checked
logs for crazy midnight deploys - nothing.

Also, we don't use anything custom to restart nginx, just regular service
management tools.


 

--
Maxim Dounin
http://nginx.org/




--
Руслан Закиров
Head of Web Services Development
+7(916) 597-92-69, ruz @ 


Re: Cache doesn't expire as should

Maxim Dounin
Hello!

On Wed, Dec 23, 2015 at 11:57:53AM +0300, Руслан Закиров wrote:

> On Tue, Dec 22, 2015 at 4:49 PM, Maxim Dounin <[hidden email]> wrote:
>
> > Hello!
> >
> > On Tue, Dec 22, 2015 at 02:43:11PM +0300, Руслан Закиров wrote:
> >
> > [...]
> >
> > > Our error log has "ignore long locked inactive cache entry" alerts, but I
> > > really couldn't match it to "defreeze" event. Access log has
> > STALE/UPDATING
> > > requests between the alert and EXPIRED (cache updating) request.
> >
> > The "ignore long locked inactive cache entry" alerts indicate that
> > a cache entry was locked by some request, and wasn't unlocked for
> > a long time.  The alert is expected to appear if a cache node is
> > locked for cache inactive time (as set by proxy_cache_path
> > inactive=, 10 minutes by default).
>
>
> Inactive is defined in the config, but it's set to default 10m.
>
> What happens with requests after this time? Do they hit backend and update
> cache? Do they use stale version?

As long as an entry is locked and not accessed for the inactive
time, it will cause the alert in question to be logged by cache
manager.  Nothing else happens - that is, the entry is still
locked, and if it contains information that the entry is currently
being updated, requests to this entry will return stale content
with UPDATING status as per proxy_cache_use_stale.

This alert may in theory happen in normal situation if a backend
response takes longer than the inactive time set.  But in general
it indicates that there is a problem.

(Note well that "locked" here should not be confused with
request-level locks as in proxy_cache_lock.)

> I'm going to check "long locked" messages in the log to see how many was
> for "/" location.
> The hash should be the same if we didn't change cache key, right?

Any message indicates that there is a problem.  If the "/" location
is requested often, it likely won't be mentioned in the "ignore long
locked" messages.

> > Most likely reason is that a worker died or was killed
> > while holding a lock on a cache node (i.e., while a request was
> > waiting for a new response from a backend).
> >
>
> Shouldn't be there a record in error log? Error log level at warn.

There should be something like "worker process exited" messages,
though at least some of them may be on the "notice" level (e.g.,
if you force nginx worker to quit by sending SIGTERM to it
directly).

> > Trivial things to consider:
> >
> > - check logs for segmentation faults;
>
> no seg faults in logs

Any other alerts, crit or emerg messages?  Any alerts about
socket leaks on configuration reloads ("open socket left..."),
or sockets in the CLOSED state?  How long do relevant requests to the
upstream server take (i.e., make sure that the behaviour observed isn't
a result of the upstream server just being very slow)?

> - if you are using 3rd party modules / patches, try without them;
> >
>
> from freebsd port, updated gist [1] with `nginx -V` output
> [1] https://gist.github.com/ruz/05456767750715f6b54e

The only suspicious thing I see there is "ssi on" and
"proxy_cache_lock" at the same time.  There were some fixes
related to proxy_cache_lock with subrequests in more recent
versions.
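Roughly, the combination in question is both directives enabled for the same location, e.g. (the names below are placeholders; with "ssi on" each include is processed as a subrequest, and it is the interaction of proxy_cache_lock with subrequests that was fixed in later versions):

    location / {
        proxy_pass http://backend;
        proxy_cache main;
        ssi on;                  # SSI includes are handled as subrequests
        proxy_cache_lock on;     # request-level lock interacting with those subrequests
        proxy_cache_use_stale updating;
    }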

> > - make sure you don't kill worker processes yourself or using some
> >   automation scripts (in particular, don't try to terminate old
> >   worker processes after a configuration reload).
> >
>
> One recent appearance of the problem was at 1:30AM and I checked
> logs for crazy midnight deploys - nothing.
>
> Also, we don't use anything custom to restart nginx, just regular services
> management tools.

Just a regular kill will be enough to screw things up.  On the other
hand, it may be an nginx bug as well.

Note though, that the version you are using is quite old.  
Debugging anything but 1.9.x hardly makes sense.  You may want to
try upgrading to the latest nginx-devel port.

Note well that if a problem happens, it's a good idea to either
completely restart nginx or do binary upgrade procedure ("service
nginx upgrade") to make sure shared memory structures are intact.

--
Maxim Dounin
http://nginx.org/


Re: Cache doesn't expire as should

Руслан Закиров


On Wed, Dec 23, 2015 at 6:35 PM, Maxim Dounin <[hidden email]> wrote:
Hello!

On Wed, Dec 23, 2015 at 11:57:53AM +0300, Руслан Закиров wrote:

> On Tue, Dec 22, 2015 at 4:49 PM, Maxim Dounin <[hidden email]> wrote:
>
> > Hello!
> >
> > On Tue, Dec 22, 2015 at 02:43:11PM +0300, Руслан Закиров wrote:
> >
> > [...]
> >
> > > Our error log has "ignore long locked inactive cache entry" alerts, but I
> > > really couldn't match it to "defreeze" event. Access log has
> > STALE/UPDATING
> > > requests between the alert and EXPIRED (cache updating) request.
> >
> > The "ignore long locked inactive cache entry" alerts indicate that
> > a cache entry was locked by some request, and wasn't unlocked for
> > a long time.  The alert is expected to appear if a cache node is
> > locked for cache inactive time (as set by proxy_cache_path
> > inactive=, 10 minutes by default).
>
>
> Inactive is defined in the config, but it's set to default 10m.
>
> What happens with requests after this time? Do they hit backend and update
> cache? Do they use stale version?

As long as an entry is locked and not accessed for the inactive
time, it will cause the alert in question to be logged by cache
manager.  Nothing else happens - that is, the entry is still
locked, and if it contains information that the entry is currently
being updated, requests to this entry will return stale content
with UPDATING status as per proxy_cache_use_stale.

This alert may in theory happen in normal situation if a backend
response takes longer than the inactive time set.  But in general
it indicates that there is a problem.

uwsgi on the backends has a harakiri setting of 9 seconds, and nginx has a
10 second upstream timeout. The index page takes 0.2-0.7 seconds to render.
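(In nginx terms the 10-second side would be something like the following; the exact directives here are an assumption on my part, only the 10s value is what we actually use:)

    proxy_connect_timeout 5s;    # placeholder
    proxy_read_timeout   10s;    # upstream must answer within 10s; uwsgi harakiri kills the worker at 9s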

(Note well that "locked" here should not be confused with
request-level locks as in proxy_cache_lock.)

> I'm going to check "long locked" messages in the log to see how many was
> for "/" location.
> The hash should be the same if we didn't change cache key, right?

Any message indicates that there is a problem.  If the "/" location
is requested often, it likely won't be mentioned in the "ignore long
locked" messages.

> > Most likely reason is that a worker died or was killed
> > while holding a lock on a cache node (i.e., while a request was
> > waiting for a new response from a backend).
> >
>
> Shouldn't be there a record in error log? Error log level at warn.

There should be something like "worker process exited" messages,
though at least some of them may be on the "notice" level (e.g.,
if you force nginx worker to quit by sending SIGTERM to it
directly).

> > Trivial things to consider:
> >
> > - check logs for segmentation faults;
>
> no seg faults in logs

Any other alerts, crit or emerg messages?  Any alerts about
socket leaks on configuration reloads ("open socket left..."),
or sockets in the CLOSED state?  How long do relevant requests to the
upstream server take (i.e., make sure that the behaviour observed isn't
a result of the upstream server just being very slow)?

Looks like we have some misconfiguration and I get the following:

2015/12/17 01:49:13 [crit] 5232#0: *599273664 rename() "/var/nginx/cache/0080683483" to "/var/www/storage/img/ru-cyber/answers/10-5.jpg" failed (13: Permission denied) while reading upstream, client: 188.115.151.67, server: www.sports.ru, request: "GET /storage/img/ru-cyber/answers/10-5.jpg HTTP/1.1", upstream: "http://192.168.1.10:80/storage/img/ru-cyber/answers/10-5.jpg", host: "www.sports.ru", referrer: "http://cyber.sports.ru/dota2/1034151992.html"

2015/12/17 01:50:17 [crit] 5229#0: *599313395 open() "/var/www/storage/img/fantasy/shirts/ule/slo.png" failed (13: Permission denied), client: 192.168.1.10, server: www.sports.ru, request: "GET /storage/img/fantasy/shirts/ule/slo.png HTTP/1.0", host: "www.sports.ru", referrer: "http://fantasy-h2h.ru/analytics/fantasy_statistics/liga_europa_2015"

 
> - if you are using 3rd party modules / patches, try without them;
> >
>
> from freebsd port, updated gist [1] with `nginx -V` output
> [1] https://gist.github.com/ruz/05456767750715f6b54e

The only suspicious thing I see there is "ssi on" and
"proxy_cache_lock" at the same time.  There were some fixes
related to proxy_cache_lock with subrequests in more recent
versions.

> > - make sure you don't kill worker processes yourself or using some
> >   automation scripts (in particular, don't try to terminate old
> >   worker processes after a configuration reload).
> >
>
> One recent appearance of the problem was at 1:30AM and I checked
> logs for crazy midnight deploys - nothing.
>
> Also, we don't use anything custom to restart nginx, just regular services
> management tools.

Just a regular kill will be enough to screw things up.  On the other
hand, it may be an nginx bug as well.

Note though, that the version you are using is quite old.
Debugging anything but 1.9.x hardly makes sense.  You may want to
try upgrading to the latest nginx-devel port.

As I said before, 1.8.0 made things worse. Maybe we will try nginx 1.9.x on one of the frontends and see how it goes.
 

Note well that if a problem happens, it's a good idea to either
completely restart nginx or do binary upgrade procedure ("service
nginx upgrade") to make sure shared memory structures are intact.

--
Maxim Dounin
http://nginx.org/




--
Руслан Закиров
Head of Web Services Development
+7(916) 597-92-69, ruz @ 
