Did you get caught by the leap-second bug?

I happened to be one of the many thousands of systems engineers who got a call at 9:00 p.m. on Saturday, June 30, 2012. All of our Linux servers running Java applications were under heavy load. Restarting the Java applications did nothing to reduce the system load. After several hours of troubleshooting, we finally decided to reboot the servers. Much to my surprise, this fixed the problem.

Some have reported that this was not a Java-related problem; rather, it was a Linux problem. See: http://www.pcworld.com/businesscenter/article/258688/linux_is_culprit_in_leapsecond_lapses_cassandra_exec.html

Well, our Linux servers running Apache, MySQL, Erlang, Python, PHP, C or C++ applications did not have any problems. It was only our Java applications that had problems (e.g., Neo4j, HBase, Kafka and so on). The following is what I discovered in my troubleshooting.

Java stopped closing threads efficiently, and threads remained open in a “wait” or a “sleep” state for a long time. As a result, Java was using the maximum amount of memory allowed, and garbage collection was not working efficiently. All of our Java applications were affected. None of our other applications were affected.
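For anyone who wants to check for the same symptom, here is a minimal sketch that counts the live threads in a JVM by state. It is only an illustration, not what we ran that night: the class name ThreadStateCount and the method countByState are made up for this example, and it only sees threads inside the JVM it runs in. For an application that is already running, a thread dump taken with jstack gives the same kind of information.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper: groups the live threads of the current JVM by state.
class ThreadStateCount {

    // Returns how many live threads are in each Thread.State
    // (RUNNABLE, WAITING, TIMED_WAITING, BLOCKED, ...).
    static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // getAllStackTraces() returns one entry per live thread in this JVM.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(t.getState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Print one line per state that has at least one thread.
        countByState().forEach((state, n) -> System.out.println(state + ": " + n));
    }
}
```

Sleeping threads show up as TIMED_WAITING and threads parked on a monitor or lock show up as WAITING, so an unusually large and growing count in those two states would match what I saw.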

The Java applications were what was loading the systems. I agree that the only thing that fixed the problem was resetting the Linux clock (or timer). So, the problem was certainly related to the way Linux updated its clocks during the leap second. However, Oracle needs to step up and take some responsibility. The problem may have started with the Linux clock, but the result was that Java freaked out.

I should have titled this blog entry: Java Is Culprit in Leap-second Lapses.
