Java and Kernel Bugs
Java Concurrency Bug
This is the first bug we found in a system outside VoltDB. We used LinkedBlockingDeque from Java’s java.util.concurrent package to manage concurrent access in VoltDB’s network subsystem. Once in a while, we noticed cluster lockups when the system was under pressure. The stack traces of the VoltDB instances on the servers showed threads waiting for the same lock, which no thread seemed to own. This caused the whole system to lock up.
After writing a small program that reproduced the problem without VoltDB, we determined that this was a bug in the Java runtime system. We submitted the bug to Sun.
The fix was introduced in Sun JDK 1.6u18. In essence, the bug was caused by missing memory barriers in various code paths, which could result in lost wake-ups on locks and, in turn, hangs.
Kernel TCP Bug
A much deeper bug was revealed when we were testing VoltDB with our implementation of the TPC-C benchmark. We noticed consistent lockups of the system which rendered the whole cluster unusable. It only happened when VoltDB was run on CentOS 5.4. We tried hard to find the cause of the problem in VoltDB, but the stack traces of the cluster instances did not shed any light.
We could easily reproduce the problem by running three nodes with two clients firehosing them. After the servers had finished loading, the lockup usually happened within 20 seconds of the start of actual transaction processing. When it happened, the cluster stopped processing transactions entirely, while the VoltDB instance on each node showed 30% to 50% CPU usage. The stack traces always showed that two of the three server instances were waiting for data from each other, but the data never came.
The Linux command ‘netstat’ showed that on node A, the TCP connection to node B had an empty receive queue, while the same connection on node B had a full send queue. This is not how TCP normally behaves, so we investigated further with ‘tcpdump’. It turned out that the two nodes were constantly sending TCP ACK packets to each other with non-zero window sizes. Interestingly, the packets were all duplicate ACKs, and they were the only packets exchanged after the lockup started. No application data was sent. In other words, the server on node B pushed data into the TCP connection, asking the kernel to deliver it to node A, but the kernel refused to do so even though node A was asking for more data.
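The queue check described above can be scripted. The following is only a sketch: it assumes the usual column layout of `netstat -tn` on Linux (Recv-Q in the second column, Send-Q in the third), and the function name `connections_with_backlog` is made up for illustration.

```shell
# Print TCP connections whose send queue is not empty.
# Reads `netstat -tn` output on stdin.
# Assumes column 2 is Recv-Q and column 3 is Send-Q.
connections_with_backlog() {
    awk '$1 ~ /^tcp/ && $3 > 0 { print $4, "->", $5, "Send-Q=" $3, "Recv-Q=" $2 }'
}

# Usage, on each node while the cluster is hung:
#   netstat -tn | connections_with_backlog
#
# To watch the packets on the wire (ethX and nodeA are placeholders):
#   tcpdump -n -i ethX tcp and host nodeA
```

A connection that keeps a large Send-Q on one side while the peer shows an empty Recv-Q is the pattern we saw on nodes B and A.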
The TCP packet dumps finally convinced us that this was a bug in the Linux kernel, so we started digging through the kernel bug reports and patches. Surprisingly, the bug turned out to be not in a kernel driver but in the kernel TCP stack. Under certain conditions the kernel calculated the window size incorrectly, and the bug was triggered when VoltDB stressed the network.
If you are not familiar with TCP, it is a connection-oriented, transport-layer protocol. Each end of a connection advertises a window size to tell the other end how much data it can send without overrunning the receiver. The window size field in the TCP header is only 16 bits wide and was not designed with today’s bandwidth in mind, now that a 1 Gigabit link in a LAN is normal. An extension called window scaling was introduced to solve this problem. It uses the same field in the TCP header to represent a much larger window: the two machines agree on a scale factor when the connection is set up, and the value in the field is scaled up by that factor. This way, much more data can be in flight on a fast link, which increases efficiency.
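To make the arithmetic concrete, here is a small sketch. It relies on two facts from the TCP specification rather than from our testing: the scale factor is a left-shift count, and the largest allowed shift is 14. The function name `scaled_window` is made up for illustration.

```shell
# Effective window in bytes: the 16-bit header value shifted left
# by the scale factor negotiated at connection setup.
scaled_window() {
    echo $(( $1 << $2 ))
}

scaled_window 65535 0    # no scaling: 65535 bytes
scaled_window 65535 7    # an example shift of 7: 8388480 bytes
scaled_window 65535 14   # the largest shift allowed: 1073725440 bytes
```

For comparison, a 1 Gigabit link with an assumed round-trip time of 1 ms can hold about 125,000 bytes in flight, which already exceeds the unscaled maximum of 65,535 bytes.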
This bug only occurs when TCP window scaling is used. To work around it, we turned off the window scaling feature in the kernel. Although this reduces the TCP window size, it does not hurt cluster performance, since the packets VoltDB sends are generally small. To turn off TCP window scaling at runtime (no reboot needed), run the following command as root:
/sbin/sysctl -w net.ipv4.tcp_window_scaling=0
You can also make the setting persistent by appending the following line to /etc/sysctl.conf:
net.ipv4.tcp_window_scaling = 0
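To confirm that the setting took effect, you can read it back from /proc. This is a minimal sketch: the function name `check_window_scaling` is made up, and the optional file argument exists only so the function can be tried against a test file.

```shell
# Report whether TCP window scaling is enabled.
# Optional argument: file to read (defaults to the live kernel setting).
check_window_scaling() {
    file="${1:-/proc/sys/net/ipv4/tcp_window_scaling}"
    if [ ! -r "$file" ]; then
        echo "unknown"
        return 1
    fi
    if [ "$(cat "$file")" = "0" ]; then
        echo "disabled"
    else
        echo "enabled"
    fi
}
```

Calling `check_window_scaling` with no argument should print "disabled" once the workaround is in place. Running `/sbin/sysctl -n net.ipv4.tcp_window_scaling` reads the same value.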
The fix for this bug was introduced in kernel version 2.6.25. CentOS 5.4, which was our testing platform at the time, used kernel 2.6.18 without this patch. We have submitted a request to Red Hat to backport the patch to Red Hat Enterprise Linux 5.4.
Kernel Driver Bug
Besides the TCP bug in the kernel, we found another bug in the Intel Gigabit Ethernet Network Driver version 1.3.16-k2 running on Intel’s 82576 Gigabit Ethernet controller. With this driver, the TCP connection between two nodes died shortly after our benchmark started.
The specific behavior was that node A had 0 bytes in its TCP receive queue, yet it told node B that the TCP receive window size was 0, indicating that it could not accept more data. So node B could not send anything to node A, and the cluster died.
The workaround is to turn off both TCP segmentation offload and generic receive offload in the Ethernet card driver. To do this, execute the following commands as root, where X is the number of the Ethernet card:
ethtool -K ethX tso off
ethtool -K ethX gro off
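To verify the change, `ethtool -k` (lowercase k) prints the current offload settings. The sketch below filters that output down to the two settings of interest. The function name `offload_status` is made up, and the pattern accepts both the hyphenated and the space-separated spellings because the exact labels vary between ethtool versions.

```shell
# Show only the TSO and GRO lines from `ethtool -k` output read on stdin.
# Accepts "tcp-segmentation-offload" as well as "tcp segmentation offload".
offload_status() {
    grep -Ei '^[[:space:]]*(tcp[- ]segmentation[- ]offload|generic[- ]receive[- ]offload):'
}

# Usage (X is the number of the Ethernet card):
#   ethtool -k ethX | offload_status
```

Both lines should report "off" after the commands above have been run.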