Another week of my Google Summer of Code project passed by, and so you are reading another status update. I am not including any core D development-related news this time, first because I didn’t do much DMD/Phobos work last week, and second because it gets tedious to list everything here – feel free to see my GitHub activity stream for more information. But still, thanks to Sean Kelly for quickly fixing the OS X threading/GC race condition I encountered the week before.
One of my targets last week was to do some preliminary performance investigations and use the insights gained to modify the protocol interface accordingly before I implement additional protocols. For this, I used the DebugProtoTest.thrift-based serialization performance test already implemented for C++ and Java (see the D version at GitHub; a more intensive look at performance, including the creation of more extensive benchmarks, is planned for later).
Ironically, the change with the biggest impact on the writing performance didn’t have anything to do with the protocol interface at all: When first writing TMemoryBuffer, I simply implemented write() as a D array appending operation, because I didn’t want to spend much time on optimizing it yet, and I figured that as long as there were not too many reallocations, it should be reasonably fast for testing purposes. Array appending translates to a non-inlined and not really cheap D runtime call, however, and TMemoryBuffer.write() unsurprisingly happens to be the single most called function in the whole writing part of the benchmark. After changing TMemoryBuffer to manual free-style memory management, the writing part finished in less than 30% of the time.
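To illustrate the idea, here is a minimal sketch of a buffer that grows a C heap block by hand instead of appending to a D array. It is not the actual TMemoryBuffer code: the name ManualBuffer, the initial capacity of 64 bytes and the doubling growth policy are assumptions made up for this example.

```d
import core.stdc.stdlib : free, realloc;
import core.stdc.string : memcpy;

/// Hypothetical stand-in for a memory buffer transport (not the Thrift class).
struct ManualBuffer {
    private ubyte* data_;     // start of the C heap block, null while empty
    private size_t length_;   // bytes written so far
    private size_t capacity_; // bytes currently allocated

    @disable this(this); // a copy would free the same block twice

    ~this() {
        free(data_); // free(null) is a no-op
    }

    /// Appends buf, growing the block geometrically when it is too small.
    void write(const(ubyte)[] buf) {
        if (buf.length == 0) return;
        immutable needed = length_ + buf.length;
        if (needed > capacity_) {
            size_t newCap = capacity_ ? capacity_ : 64; // assumed start size
            while (newCap < needed) newCap *= 2;
            auto p = cast(ubyte*) realloc(data_, newCap); // realloc(null, n) allocates
            if (p is null) assert(0, "out of memory");
            data_ = p;
            capacity_ = newCap;
        }
        memcpy(data_ + length_, buf.ptr, buf.length);
        length_ = needed;
    }

    /// The returned slice is only valid until the next write() or destruction.
    const(ubyte)[] contents() const {
        return data_[0 .. length_];
    }
}
```

The catch is visible in the comment on contents(): because the block is freed manually, a slice handed out to the caller can dangle, which is the safety concern that a GC-backed variant would avoid.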
I tried to switch to GC.malloc instead of manual freeing afterwards because it would make getting a buffer content slice safe and the small memory allocation overhead should not really be a problem for typical TMemoryBuffer use cases (it does not matter at all in this benchmark because the required amount of memory is pre-allocated), but I encountered some strange data corruption issues in the other, larger test cases, which I have yet to track down. Most probably, I just missed some subtleties when treating GC.realloc as a drop-in free replacement, but I have not found a way to pinpoint the issue so far.
For the next step, I tackled the design of the
TProtocol interface: When building the first prototype for the library, I had the ad-hoc idea of passing in delegates to the aggregate reading/writing functions for processing their members. I figured that this would make the interface nicer as all the
*Begin()/*End() pairs could be collapsed into a single call, the struct member reading loop could be moved into the protocol itself instead of being duplicated over and over again (although this is not a real benefit besides a slight code size reduction because it is generated code anyway), and implementing protocols like JSON would be easier since the structural information would not have been completely lost compared to a »flat« interface.
I was, however, aware of the fact that this could pose a performance problem, and indeed some experimenting showed that DMD generated suboptimal code for delegate literals and was not really able to inline them, even for scoped delegates. From a compiler point of view, this is not really surprising, as generating better code would require a fair bit of analysis, but I still decided to switch to a more simplistic protocol design for the time being – even more so as I realized that my design idea would not really simplify implementing JSON-like protocols anyway. I chose to go with the C++/Java interface verbatim, as it is proven to work (and having a similar interface across multiple languages has its own merits as well), and with the changes in, I measured a 20% speedup, even though no inlining was possible yet due to virtual calls all over the place. (In hindsight, it might have been better to implement the template mechanism first, so that the actual impact of the protocol API change would have been more visible. Maybe I’ll revert the binary protocol back to the old interface and re-run the test to get precise numbers at some point in the future.)
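The difference between the two interface styles can be sketched roughly as follows. This is an illustration only: TraceProtocol, writeStruct and the two free functions are hypothetical names, and the real TProtocol interface has many more methods. The sketch merely records the calls it receives, so that both styles can be compared.

```d
import std.conv : to;

/// Hypothetical, heavily simplified protocol that records calls as text.
final class TraceProtocol {
    string[] log;

    // Flat style, as in the C++/Java libraries: explicit begin/end pairs.
    void writeStructBegin(string name) { log ~= "begin " ~ name; }
    void writeStructEnd() { log ~= "end"; }
    void writeI32(int value) { log ~= "i32 " ~ value.to!string; }

    // Delegate style: the protocol brackets a member-writing callback itself.
    void writeStruct(string name, scope void delegate() writeMembers) {
        writeStructBegin(name);
        writeMembers(); // indirect call, the part that proved hard to inline
        writeStructEnd();
    }
}

/// What generated code looks like with the flat interface.
void writeFlat(TraceProtocol p, int x) {
    p.writeStructBegin("Point");
    p.writeI32(x);
    p.writeStructEnd();
}

/// What generated code looks like with the delegate-based interface.
void writeWithDelegate(TraceProtocol p, int x) {
    p.writeStruct("Point", () { p.writeI32(x); });
}
```

Both functions produce the same sequence of protocol events; the delegate version is shorter at the call site, but pays for an extra indirect call per aggregate.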
Finally, I implemented a way to specify the concrete transport/protocol types used in the application at compile-time using templates (similar to C++ and the templates Thrift compiler argument), thus eliminating most virtual calls and enabling the compiler to inline calls all over the library. I expected to see a dramatic speedup here as well (when not specifying the protocol/transport type, the writing loop in the C++ benchmark is only half as fast), but instead I saw »only« a 40% speedup overall, with the C++ version still being significantly faster.
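A rough sketch of how such a compile-time protocol parameter can look in D is given below. The names Protocol, CountingProtocol and Point are invented for this example and do not correspond to the actual library or generated code; the point is only that the same write() template can be instantiated either with the interface type (virtual calls) or with a final concrete class (calls the compiler can in principle bind statically and inline).

```d
/// Hypothetical minimal protocol interface.
interface Protocol {
    void writeI32(int value);
}

/// Hypothetical concrete protocol; final, so no further overriding is possible.
final class CountingProtocol : Protocol {
    size_t bytesWritten;
    void writeI32(int value) { bytesWritten += 4; }
}

/// Stand-in for a generated struct.
struct Point {
    int x;
    int y;

    // P is deduced from the argument: passing a CountingProtocol instantiates
    // a version specialized for that class, passing a Protocol reference
    // falls back to dispatch through the interface.
    void write(P)(P proto) const if (is(P : Protocol)) {
        proto.writeI32(x);
        proto.writeI32(y);
    }
}
```

Calling Point(1, 2).write(concreteProtocol) picks the specialized instantiation, while code that only holds a Protocol reference keeps working unchanged.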
When comparing profiling data for the optimized C++ and D versions, I noticed that in the D version
_memcpy gets called ten times as often as in the C++ version – GCC, being able to inline the
write() calls, is able to replace the calls with optimized routines for shorter lengths, and since both versions spend most of their time actually copying data around at this point, this yields a huge advantage.
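The kind of special-casing meant here can be imitated by hand, as in the following sketch. It is an assumption-laden illustration, not what GCC actually emits: the function name copyBytes and the threshold of 8 bytes are made up, and whether this helps at all under DMD would have to be measured.

```d
import core.stdc.string : memcpy;

/// Copies len bytes from src to dst; the ranges must not overlap.
/// Short copies use a plain loop to avoid the overhead of a library call.
void copyBytes(ubyte* dst, const(ubyte)* src, size_t len) {
    enum smallLimit = 8; // assumed cut-off, not a measured value
    if (len <= smallLimit) {
        foreach (i; 0 .. len) dst[i] = src[i];
    } else {
        memcpy(dst, src, len);
    }
}
```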
After that, I did not make any further attempts at optimizing the D version, since performance was not my primary goal at this stage anyway – the basic design seems to be solid, and what is left are micro-optimizations. When focussing on performance later in the term, I will certainly create more benchmarks, and also try to optimize the implementations for the languages I will compare D to (C++ and Java, most likely). For example, the current C++ serialization benchmark from the official HEAD does a lot of unneeded work in the reading loop; moving out the initialization code makes it run twice as fast. I will also have a look at using GDC and LDC instead of DMD for their more sophisticated backends, and document the exact performance findings on various platforms.
Even though I am not going to write that much about it, I spent the bigger part of my time on non-performance related work: Generated structs now have an appropriate opEquals() implementation, the D ThriftTest client actually checks the data it sends/receives instead of just flooding the console with messages (no idea why this hasn’t already been implemented for C++ and Java), and last but not least, I implemented the Compact and JSON protocols for D. This completes the protocol section, as I do not plan to implement the Dense protocol unless there is a lot of time left towards the end of the term (as previously discussed).
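For readers unfamiliar with D, a member-wise opEquals() for a struct looks roughly like the sketch below. The Inner and Outer types are invented for this example, and the code the Thrift compiler actually generates may well differ.

```d
/// Hypothetical nested struct.
struct Inner {
    int id;
    string name;
}

/// Hypothetical stand-in for a generated struct.
struct Outer {
    Inner inner;
    int[] values;

    // Compares member by member; arrays are compared by content, not identity.
    bool opEquals(const Outer other) const {
        return inner == other.inner && values == other.values;
    }
}
```

With this in place, two Outer values holding equal data in different arrays compare equal with the ordinary == operator.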
During the next (or rather: this) week, I am going to work on documentation, integrate a number of test cases I already have lying around into the repository/build system, and implement a simple multithreaded server.