Let's imagine we have a web server that uses multi-threading as its concurrency approach with a fixed thread pool size of 100. Supposing that the server's request handler sleeps for 2 seconds for every 10th request, how many requests can this server handle every second?
This question persisted at the back of my mind as I was reading Deepu's series of posts on concurrency in modern programming languages. In the series, Deepu ran several web server benchmarks with the same 2-second delay on every 10th request, and the resulting requests-per-second figures appeared to be capped at a fixed ceiling. This post explores the factors that determine the maximum requests per second a multi-threaded web server can handle.
To work out an answer to the opening question I ran several benchmark experiments on a simple web server that uses a fixed-size thread pool for handling requests; one thread processes one request at a time. Using Apache Bench (`ab`), I load tested the server with 5000 requests at concurrency levels of 1, 10, 100, and 500, where the concurrency level refers to the number of requests being made at a time.
For the web server I used Deepu's Rust web server:
```rust
use rustws::ThreadPool;
use std::fs;
use std::io::prelude::*;
use std::net::TcpListener;
use std::net::TcpStream;
use std::thread;
use std::time::Duration;

fn main() {
    let listener = TcpListener::bind("127.0.0.1:8080").unwrap();
    // create a thread pool with a fixed size of 100
    let pool = ThreadPool::new(100);
    let mut count = 0;
    for stream in listener.incoming() {
        let stream = stream.unwrap();
        count = count + 1;
        pool.execute(move || {
            // handle each connection on a thread from the pool
            handle_connection(stream, count);
        });
    }
}

fn handle_connection(mut stream: TcpStream, count: i64) {
    let mut buffer = [0; 1024];
    stream.read(&mut buffer).unwrap();
    // add a 2 second delay to every 10th request
    if (count % 10) == 0 {
        println!("Adding delay. Count: {}", count);
        thread::sleep(Duration::from_secs(2));
    }
    let header = "
HTTP/1.0 200 OK
Connection: keep-alive
Content-Length: 174
Content-Type: text/html; charset=utf-8
";
    let contents = fs::read_to_string("hello.html").unwrap();
    let response = format!("{}\r\n\r\n{}", header, contents);
    // write the response back to the client
    stream.write(response.as_bytes()).unwrap();
    stream.flush().unwrap();
}
```
Here are the results of the benchmarks. We'll analyze them in the next section.
```
$ ab -n 5000 -c 1 http://localhost:8080/
...
Concurrency Level: 1
Time taken for tests: 1003.114 seconds
Complete requests: 5000
Failed requests: 0
Total transferred: 1415000 bytes
HTML transferred: 880000 bytes
Requests per second: 4.98 [#/sec] (mean)
Time per request: 200.623 [ms] (mean)
Time per request: 200.623 [ms] (mean, across all concurrent requests)
Transfer rate: 1.38 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 1
Processing: 0 200 600.1 0 2001
Waiting: 0 200 600.1 0 2001
Total: 0 200 600.1 1 2002
Percentage of the requests served within a certain time (ms)
50% 1
66% 1
75% 1
80% 1
90% 2000
95% 2001
98% 2001
99% 2001
100% 2002 (longest request)
```
```
$ ab -n 5000 -c 10 http://localhost:8080/
...
Concurrency Level: 10
Time taken for tests: 100.329 seconds
Complete requests: 5000
Failed requests: 0
Total transferred: 1415000 bytes
HTML transferred: 880000 bytes
Requests per second: 49.84 [#/sec] (mean)
Time per request: 200.657 [ms] (mean)
Time per request: 20.066 [ms] (mean, across all concurrent requests)
Transfer rate: 13.77 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 1
Processing: 0 200 600.1 0 2002
Waiting: 0 200 600.1 0 2002
Total: 0 201 600.1 0 2002
Percentage of the requests served within a certain time (ms)
50% 0
66% 1
75% 1
80% 1
90% 2000
95% 2001
98% 2001
99% 2001
100% 2002 (longest request)
```
```
$ ab -n 5000 -c 100 http://localhost:8080/
...
Concurrency Level: 100
Time taken for tests: 10.328 seconds
Complete requests: 5000
Failed requests: 0
Total transferred: 1415000 bytes
HTML transferred: 880000 bytes
Requests per second: 484.10 [#/sec] (mean)
Time per request: 206.567 [ms] (mean)
Time per request: 2.066 [ms] (mean, across all concurrent requests)
Transfer rate: 133.79 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 2 2.2 0 10
Processing: 0 202 600.1 1 2013
Waiting: 0 201 600.2 1 2011
Total: 0 203 600.1 2 2013
Percentage of the requests served within a certain time (ms)
50% 2
66% 5
75% 7
80% 9
90% 2000
95% 2003
98% 2006
99% 2009
100% 2013 (longest request)
```
```
$ ab -n 5000 -c 500 http://localhost:8080/
...
Concurrency Level: 500
Time taken for tests: 10.243 seconds
Complete requests: 5000
Failed requests: 0
Total transferred: 1415000 bytes
HTML transferred: 880000 bytes
Requests per second: 488.11 [#/sec] (mean)
Time per request: 1024.349 [ms] (mean)
Time per request: 2.049 [ms] (mean, across all concurrent requests)
Transfer rate: 134.90 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 28 81.6 17 1012
Processing: 10 830 1032.3 87 3909
Waiting: 0 820 1033.9 81 3901
Total: 32 858 1034.2 107 4075
Percentage of the requests served within a certain time (ms)
50% 107
66% 1863
75% 1868
80% 1872
90% 2050
95% 2098
98% 3859
99% 3883
100% 4075 (longest request)
```
We'll focus on the following summary of concurrent requests vs requests per second (RPS):
Concurrent requests | 1 | 10 | 100 | 500 |
---|---|---|---|---|
Requests per second | 4.98 | 49.84 | 484.10 | 488.11 |
Hmmm...
- Why did we get these specific RPS figures?
- Why did the RPS increase when we increased the concurrent requests from 1 to 10, and from 10 to 100?
- The RPS seemed to grow linearly with the number of concurrent requests, but this stopped holding when the concurrent requests went from 100 to 500. Why?
Let's figure it out...
When the server receives only one request at a time, only one server thread is actively processing a request at any moment. For every 10 requests received, 9 complete nearly instantaneously (assume 0 seconds), while the 10th takes 2 seconds because the thread sleeps before returning the response. In this scenario a single thread handles 10 requests in 2 seconds, which equates to ~5 RPS and roughly matches the measured value of 4.98.
When `ab` makes 10 concurrent requests to the server, the server handles the first 9 requests immediately and sleeps on the 10th request for 2 seconds. Since the first 9 requests have completed, `ab` sends the next 9 requests (the 11th to 19th), which are also handled instantly. Again, `ab` sends another batch of 9 requests. This time the 20th request is delayed for 2 seconds, while the remaining 8 requests complete. The server now has 2 threads waiting on the 10th and 20th requests to complete.
If we project what happens within a split second of starting the `ab` load test, we find the server blocked for 2 seconds handling 10 delayed requests with 10 threads: the 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, 90th, and 100th requests. After those 2 seconds have elapsed the server finishes processing the 10 delayed requests, and `ab` is ready to send more requests.

The server processed 100 requests within those 2 seconds: 90 instantly and 10 delayed. That works out to 50 RPS, which is quite close to the measured value of 49.84.
Given the first two benchmarks, we can model the expected RPS with the following equation:
Requests handled per second = Concurrent requests * Average requests handled per thread per second
The equation tells us that the RPS is determined by how many requests hit the server at a time, and that the relationship between the two is linear.
Applying the above equation to estimate the RPS for 100 concurrent requests gives us 500 RPS (100 concurrent requests * 5 RPS per thread). That is still quite close to the actual 484.10 RPS figure we obtained from the benchmark, so our mental model still works.
For 500 concurrent requests, if we naively apply the earlier equation we might expect the server to achieve 2500 RPS (500 concurrent requests * 5 RPS). However, the benchmark reveals 488.11 RPS, which is similar to the result when 100 concurrent requests are made. The reason for this behavior is that the server isn't handling 500 requests at a time. Instead, the server only processes 100 requests concurrently because it has a limited thread pool size of 100 threads. The server queues the remaining 400 requests and waits until a thread in the thread pool is available before assigning the request to it.
Now that we understand the constraints imposed by a fixed thread pool size we can update our previous equation to the following:
Requests handled per second = MIN(Concurrent requests, Thread pool size) * Average requests handled per thread per second
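To make the model concrete, here is a minimal sketch (mine, not part of the server code; the function name `expected_rps` and the ~5 requests per thread per second figure are assumptions derived from the 2-second delay on every 10th request) that plugs the benchmark parameters into the equation:

```rust
// Back-of-the-envelope model of the benchmark results above (illustrative only).
fn expected_rps(concurrent_requests: u64, thread_pool_size: u64) -> f64 {
    // Every 10th request sleeps for 2 seconds, so one thread works through
    // roughly 10 requests per 2 seconds, i.e. ~5 requests per second.
    let per_thread_rps = 10.0 / 2.0;
    // The server can only work on MIN(concurrent requests, pool size) requests at once.
    concurrent_requests.min(thread_pool_size) as f64 * per_thread_rps
}

fn main() {
    for c in [1u64, 10, 100, 500] {
        // Prints ~5, ~50, ~500 and ~500: close to the measured 4.98, 49.84, 484.10 and 488.11.
        println!("{} concurrent requests -> ~{} RPS expected", c, expected_rps(c, 100));
    }
}
```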
For our server to achieve a greater RPS, we have a few possible solutions.
We could increase the thread pool size. However, each thread we create consumes additional memory and the server incurs a higher context switching overhead. Consequently, continuously increasing the thread pool size will eventually lead to diminishing returns or an out-of-memory situation.
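As a sketch of this first option (assuming the same `rustws::ThreadPool` as in the server above), the change is a single line, though each extra OS thread brings its own stack and scheduling cost:

```rust
// Raises the MIN(concurrent requests, pool size) ceiling from 100 to 500,
// at the cost of roughly 400 more OS threads (memory plus context switching).
let pool = ThreadPool::new(500);
```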
Several programming language runtimes support a different concurrency model using asynchronous processing. Node.js is one such runtime that can process non-blocking requests in a single-threaded Event Loop instead of spawning a new thread for every request. Handling requests asynchronously in a single thread mitigates the context switching overheads and out-of-memory situations that a multi-threaded approach might present.
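The same idea can also be expressed in Rust with async/await; the sketch below is my own (using the `tokio` crate, not code from Deepu's series) and shows the 2-second delay suspending a lightweight task instead of blocking an OS thread, so a small number of worker threads can keep serving other requests in the meantime:

```rust
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpListener;
use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:8080").await?;
    let mut count: u64 = 0;
    loop {
        let (mut stream, _) = listener.accept().await?;
        count += 1;
        let delayed = count % 10 == 0;
        // Each connection becomes a lightweight task, not a dedicated OS thread.
        tokio::spawn(async move {
            let mut buffer = [0u8; 1024];
            let _ = stream.read(&mut buffer).await;
            if delayed {
                // Suspends only this task; the worker thread moves on to other tasks.
                sleep(Duration::from_secs(2)).await;
            }
            let body = "<h1>hello</h1>";
            let response = format!(
                "HTTP/1.0 200 OK\r\nContent-Length: {}\r\n\r\n{}",
                body.len(),
                body
            );
            let _ = stream.write_all(response.as_bytes()).await;
        });
    }
}
```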
Other runtimes provide green threads. Green threads can be thought of as "virtual threads", which are created and managed by the runtime instead of the operating system. Such a runtime is responsible for scheduling green threads on native threads and deciding when to switch between green threads. The Go programming language is one popular language that provides green threads in the form of goroutines.
Suppose 500 concurrent requests are made to a server. A server with 8 CPU cores and support for green threads can spawn 500 green threads to handle the requests, yet schedule them on only 8 native threads. Because green threads are cheaper to create, quicker to switch between, and consume less memory than native threads, they can handle a greater number of concurrent requests.