D/Thrift: Performance and other random things
This week, I will try to keep the post short while still informative – I have already spent way too much time being unproductive due to hard-to-track-down bugs to be in the mood for writing up extensive ramblings. So, on to the meat of the recent changes (besides the usual little cleanup commits here and there):
- Async client design: Yes, even though it took me quite some time to come up with the original one, I had completely missed the fact that it would be unreasonably difficult to extend the support code with resource types other than sockets. Long story short, `TAsyncSocketManager` now inherits from `TAsyncManager` instead of being a part of it. Also, I split `TFuture` into two parts: a `TFuture` interface for accessing the result, and a `TPromise` implementation for actually setting/storing it, and only the `TFuture` part is returned from the async client methods (a rough sketch of the idea follows below the benchmark table). The `thrift.async` docs are actually useful now.
- Async socket timeouts: Correctly handling the state of the connection after a `read`/`write` timeout turned out to be a surprisingly tough problem to solve (allowing other requests to be executed on the same connection after a timeout could lead to strange results). In the end, I settled for just closing the connection, which is a simple yet effective solution. To correctly implement this, I also had to finally kill the `TTransport.isOpen`-related contracts and replace them with exceptions in the right places, leading to modified/clarified `isOpen` semantics.
- The non-blocking server now handles one-way calls correctly, and modifying the task pool after it is running no longer leads to undefined results. In the process, I have also turned the static `event` struct allocations into dynamic ones, since this should have no measurable performance impact, but removes the dependence on the (unstable, per the `libevent` docs) struct layout.
- D now also has a `TPipedTransport`, which forwards a copy of all data read/written to another transport, useful e.g. for logging requests/responses to disk.
- The biggest chunk of time was actually spent on performance investigations: while I was already pretty certain that the D serialization code should not perform any worse than its C++ counterpart, the difference in speed merely being compiler-dependent, I wanted to prove this so that I could cross the item off the list. This involved updating LDC to the 2.054 frontend (only to discover that Alexey Prokhin had decided to start work on it at the same time as I did; the related commits in the main repository are his now), fixing some LDC-specific druntime bugs, etc.¹ Unfortunately, I couldn't test GDC because of issue 6411, but without further ado, here are the results:
| | Writing / kHz | Reading / kHz |
|---|---|---|
| DMD v2.054, `-O -release -inline` | 2 051 | 1 030 |
| GCC 4.6.1 (C++), `-O2`, templates | 5 667 | 1 050 |
| LDC, `-O3 -release` | 2 300 | 1 077 |
| LDC, `-output-ll` / `opt -O3` | 5 500 | 3 150 |
| LDC, `-output-ll` / `opt -std-compile-opts` | 6 700 | 1 950 |
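As an aside on the `TFuture`/`TPromise` split mentioned in the first item above: the following is only a minimal sketch of the underlying idea, using made-up `Future`/`Promise` names and members rather than the actual `thrift.async` declarations. The point is that the interface handed out to callers can only read the result, while the implementation type kept by the async machinery is the only thing that can set it.

```d
import core.sync.condition : Condition;
import core.sync.mutex : Mutex;

/// Read-only view handed out to client code (hypothetical, simplified).
interface Future(T) {
    /// Blocks until the producer has set the result.
    T waitGet();
}

/// Producer side: the async machinery would keep the Promise around and
/// return it to callers only as a Future, so clients cannot set the result.
class Promise(T) : Future!T {
    this() {
        mutex_ = new Mutex;
        cond_ = new Condition(mutex_);
    }

    T waitGet() {
        synchronized (mutex_) {
            while (!done_) cond_.wait();
            return value_;
        }
    }

    /// Called by the owner once the result is available.
    void succeed(T value) {
        synchronized (mutex_) {
            value_ = value;
            done_ = true;
            cond_.notifyAll();
        }
    }

private:
    Mutex mutex_;
    Condition cond_;
    bool done_;
    T value_;
}

void main() {
    import core.thread : Thread;

    auto promise = new Promise!int;
    Future!int future = promise; // client code only ever sees the Future
    auto worker = new Thread({ promise.succeed(42); });
    worker.start();
    assert(future.waitGet() == 42);
    worker.join();
}
```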
At this point, I will disregard my earlier resolution and again get into the nitty-gritty details – the rest of this post can easily be summarized as *the D version is indeed up to par with C++ when it is equally well optimized*, but if you are curious about the details, read on.
If you compare this with the performance figures from my last post, the first thing you will probably notice is that the C++ reading figure is about four times lower now. This isn't a mistake: noting the comparatively slim advantage of the C++ version, I made a change to it quite some time ago which avoids allocating a new `TMemoryBuffer` instance on every loop iteration (the D version reuses its buffer as well). Without really considering the implications, though, I also moved the construction of the `OneOfEach` struct out of the loop. This seemed like a minor detail to me, but in fact it enabled reuse of the `std::string`-internal buffers for the string members of the struct, which is unrealistic (e.g. in the pretty similar situation in the non-blocking server, no such buffer reuse is possible either).
In a situation where a big part of the time is spent actually allocating memory and copying it around, this makes a big difference. To test this assumption about the influence of memory allocations, I compiled a version of the D benchmark that uses a static buffer for the strings instead of reallocating them every time, and indeed, the reading performance was more than twice as high.
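To make the assumption concrete, here is a sketch (not the actual benchmark or generated deserialization code) of the difference between the two approaches to reading a string value: allocating a fresh buffer for every value versus copying into one reused scratch buffer.

```d
// Sketch only – names and signatures are made up for illustration.

/// The straightforward approach: every deserialized string gets a fresh
/// GC allocation, as safe generated code has to do.
void readStringAllocating(const(ubyte)[] wire, ref string dest) {
    dest = cast(string) wire.idup;
}

ubyte[] scratch; // reused scratch buffer (thread-local, hypothetical)

/// The "static buffer" approach used for the test: reuse one buffer for all
/// strings, which is only valid because the benchmark discards the previous
/// contents before the next read.
void readStringReusing(const(ubyte)[] wire, ref const(ubyte)[] dest) {
    if (scratch.length < wire.length)
        scratch.length = wire.length;
    scratch[0 .. wire.length] = wire[];
    dest = scratch[0 .. wire.length];
}

void main() {
    auto wire = cast(const(ubyte)[]) "hello";
    string a;
    readStringAllocating(wire, a);
    const(ubyte)[] b;
    readStringReusing(wire, b);
    assert(a == "hello" && b == wire);
}
```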
The `std::string` implementation of the GCC STL seems to be fairly inefficient in this case, because the best D result (which uses GC-allocated memory) is almost three times faster than it for the reading part. It is possible that there are further optimizations which could improve performance (`-O3` didn't change things for the better, in case you are wondering), but as my goal wasn't to squeeze every last bit of performance out of this synthetic benchmark, I didn't investigate the issue any further.
But now to the D results: simply switching to LDC 2 instead of DMD didn't give any great speedups, because `readAll()` wasn't inlined by it either, thus leaving all the memory copying unoptimized, as discussed in the last post. To see how much of a difference this would really make, I compiled the D code to LLVM IR files and manually ran the optimizer/code generator/linker on them, the plan being to manually add the `alwaysinline` attribute to the relevant pieces of IR.
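Such a pipeline looks roughly as follows; the file names are placeholders, and the final link against druntime/Phobos is omitted since the exact invocation depends on the LDC installation.

```sh
ldc2 -release -output-ll benchmark.d   # emit LLVM IR (benchmark.ll)
opt -O3 benchmark.ll -o benchmark.bc   # or: opt -std-compile-opts benchmark.ll -o benchmark.bc
llc benchmark.bc -o benchmark.s        # native code generation
# ...then assemble benchmark.s and link it against druntime/Phobos as usual.
```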
I then discovered that the method calls in question were properly inlined by the stand-alone `opt` without any manual intervention anyway. I am not really sure why this happens: the inliner cost limits could be more liberal in this case, the optimization passes could be scheduled differently than inside LDC, or maybe it is connected to the fact that `TMemoryBuffer` and the caller are in different modules (to my understanding, LTO shouldn't be required for this optimization, but I may well be mistaken here).
The LDC `-output-ll` rows in the above table correspond to the benchmark compiled this way, with the `-O3` and `-std-compile-opts` flags passed to `opt`, respectively. This is a nice example of how important compiler optimizations really are for this, again, synthetic benchmark: for the reading part, `-O3` gives a nice speed boost because of its more aggressive inlining (`-std-compile-opts` doesn't touch `TBinaryProtocol.readFieldBegin()`, which is called 15 times per loop iteration and contains some code that can be optimized out completely), but for the writing part, its result is actually slower, presumably because of locality effects (the call graphs are identical).
The only change related to benchmark performance I made since the last post was an LDC-specific workaround to stop manifest constants from incorrectly being leaked from the CTFE codegen process into the writing functions. I think the above results are justification enough to stop worrying about raw serialization performance – the results when using the Compact instead of the Binary protocol are similar – and to move on to more important topics².
¹ If you are curious about LDC 2, you can get the source I used from the official hg repo, and the LDC-specific druntime and Phobos sources from my clones on GitHub. LDC is officially on GitHub now.

² Such as performance-testing the actual server implementations, but I don't expect any big surprises there, and I am not sure how to reliably benchmark the network-related code – running server and clients on the same machine is probably a bad idea?