Nginx has the ngx_http_limit_req_module, which can be used to limit the request processing rate, but most people seem to use only its basic feature: limiting the request rate by remote address, like this:
http {
    limit_req_zone $binary_remote_addr zone=one:10m rate=1r/s;
    ...
    server {
        ...
        location /search/ {
            limit_req zone=one burst=5;
        }
    }
}
This is the example configuration from Nginx’s official documentation. The limit_req_zone directive takes the variable $binary_remote_addr as the key for limiting incoming requests. The keys are stored in a zone named one, defined by the zone parameter, which can use up to 10 MB of memory. The rate parameter says that the maximum request rate for each $binary_remote_addr is 1 request per second. In the /search/ location block, the limit_req directive refers to the one zone and allows bursts of no more than 5 requests.
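Two optional knobs are worth knowing about here. The nodelay parameter makes requests within the burst be served immediately instead of being spaced out, and limit_req_status (Nginx 1.3.15+) changes the rejection code from the default 503. Here is a minimal sketch of the same location with both added; these are standard directives, but they are my additions, not part of the official example:

http {
    limit_req_zone $binary_remote_addr zone=one:10m rate=1r/s;
    server {
        location /search/ {
            # Serve up to 5 burst requests immediately, reject anything beyond with 429
            limit_req zone=one burst=5 nodelay;
            limit_req_status 429;
        }
    }
}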
Everything seems great; we have configured Nginx against rogue bots and spiders, right?
No! That configuration won’t work in real life and should never be used in a production environment. Take the following circumstances as examples:
- When users access your website from behind a NAT, they share the same public IP, so Nginx sees only one $binary_remote_addr for all of them. Hundreds of users combined would be able to access your website only once per second!
- A botnet crawling your website uses a different IP address for every request. In this situation, limiting by $binary_remote_addr is again totally useless.
So what configuration should we use instead? We need a different variable as the key, or even several variables combined (since version 1.7.6, limit_req_zone’s key can contain multiple variables). Instead of the remote address, it is better to use request headers to tell users apart, such as User-Agent, Referer, and Cookie. These headers are easy to access in Nginx; they are exposed as built-in variables like $http_user_agent, $http_referer, $cookie_name, etc.
For example, this is a better way to define a zone:
http {
    limit_req_zone $binary_remote_addr$http_user_agent zone=two:10m rate=90r/m;
}
It combines $binary_remote_addr and $http_user_agent, so users behind a NAT who run different user agents can be told apart. But it is still not perfect: multiple users could run the same browser at the same version and therefore send identical User-Agent headers! Another problem is that the length of $http_user_agent is not fixed (unlike $binary_remote_addr); long headers consume a lot of the zone’s memory and could even exhaust it.
To solve the first problem, we can add more variables to the key. Cookies are a good choice, since each user sends their own unique cookie, e.g. $cookie_userid, but this still leaves us with the second problem. The answer is to use hashes of the variables instead.
There is a third-party module called set-misc-nginx-module that we can use to generate hashes from variables. If you are using OpenResty, this module is already included. The configuration looks like this:
http {
    ...
    limit_req_zone $binary_remote_addr$cookie_hash$ua_hash zone=three:10m rate=90r/m;
    ...
    server {
        ...
        set_md5 $cookie_hash $cookie_userid;
        set_md5 $ua_hash $http_user_agent;
        ...
    }
}
It is fine that $cookie_hash and $ua_hash are used in the http block before they are defined in the server block, because Nginx variables are evaluated at request time, not while the configuration is being parsed. This configuration is great now.
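One thing the snippet above leaves out is the limit_req directive itself; the zone still has to be referenced from a location. A minimal sketch, assuming the same /search/ location as in the first example (the burst value here is just a placeholder):

location /search/ {
    limit_req zone=three burst=10;
}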
Now let’s tackle the distributed botnet problem. We need to take $binary_remote_addr out of the key, and since those bots usually don’t send a Referer header (if they do, you can look for whatever else is unique about them), we can take advantage of that. This configuration should take care of it:
http {
    ...
    limit_req_zone $cookie_hash$referer_hash$ua_hash zone=three:10m rate=90r/m;
    ...
    server {
        ...
        set_md5 $cookie_hash $cookie_userid;
        set_md5 $referer_hash $http_referer;
        set_md5 $ua_hash $http_user_agent;
        ...
    }
}
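If you also want to keep the earlier per-IP zone for NATed users, both limits can be applied at the same time: a location may contain several limit_req directives, each pointing at its own zone, and a request is rejected when any of them is exceeded. A rough sketch under the assumption that the NAT-aware zone is renamed (zone=nat and the burst values are hypothetical, since two zones cannot both be called three):

http {
    ...
    limit_req_zone $binary_remote_addr$cookie_hash$ua_hash zone=nat:10m rate=90r/m;
    limit_req_zone $cookie_hash$referer_hash$ua_hash zone=three:10m rate=90r/m;
    ...
    server {
        ...
        set_md5 $cookie_hash $cookie_userid;
        set_md5 $referer_hash $http_referer;
        set_md5 $ua_hash $http_user_agent;
        location / {
            # Both zones are checked; exceeding either one rejects the request
            limit_req zone=nat burst=10;
            limit_req zone=three burst=10;
        }
        ...
    }
}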