Cloudera Blog · Support Posts

Meet the Engineer: Kathleen Ting

In this installment of “Meet the Engineer”, get to know Customer Operations Engineering Manager/Apache Sqoop committer Kathleen Ting (@kate_ting).

What do you do at Cloudera, and in what open-source projects are you involved?
I’m a support manager at Cloudera, and an Apache Sqoop committer and PMC member. I also contribute to the Apache Flume and Apache ZooKeeper mailing lists, organize and present at meetups, and speak at conferences about those projects.

My role follows a hybrid “player/coach” model: in addition to managerial duties such as leading a team and addressing customer escalations, I also answer customer support cases directly, which is an unusual combination. The approach is effective: it gives me direct insight into customer concerns that I otherwise wouldn’t get, helps me stay grounded, and ensures that I appreciate the team’s work first-hand.

Secrets of Cloudera Support: The Champagne Strategy

At Cloudera, we take great pride in drinking our own champagne. That pride extends to our support team in particular.

Cloudera Manager, our end-to-end management platform for CDH (Cloudera’s open-source, enterprise-ready distribution of Apache Hadoop and related projects), has a feature that allows subscription customers to send a snapshot of their cluster to us. When these cluster snapshots come to us from customers, they end up in a CDH cluster at Cloudera where various forms of data processing and aggregation can be performed. 

Today, the system provides real-time support via an application we call CSI. When a support employee looks at a ticket, they can use CSI to examine the customer’s latest snapshot and see cluster stats such as version information, number of nodes in service, which services are used, and so on. CSI also visualizes different aggregations and groupings, such as versions, which allows us to detect misconfigured clusters, or issues caused during upgrade or installation.
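As a rough illustration of the kind of grouping described above, the sketch below counts the versions reported by each cluster's hosts and flags clusters that report more than one. It is not CSI's implementation: the record layout, the field names (`cluster`, `host`, `version`), and the sample values are all hypothetical.

```python
from collections import Counter

# Hypothetical snapshot records; the field names and values are
# illustrative only and do not reflect CSI's actual schema.
snapshots = [
    {"cluster": "a", "host": "n1", "version": "v2"},
    {"cluster": "a", "host": "n2", "version": "v2"},
    {"cluster": "a", "host": "n3", "version": "v1"},
    {"cluster": "b", "host": "n1", "version": "v2"},
    {"cluster": "b", "host": "n2", "version": "v2"},
]

def mixed_version_clusters(records):
    """Return {cluster: {version: host_count}} for clusters whose hosts
    report more than one version."""
    per_cluster = {}
    for rec in records:
        counts = per_cluster.setdefault(rec["cluster"], Counter())
        counts[rec["version"]] += 1
    return {c: dict(v) for c, v in per_cluster.items() if len(v) > 1}

print(mixed_version_clusters(snapshots))
```

A cluster whose hosts disagree on version is one plausible signal of an incomplete upgrade, which is the sort of issue the grouping is meant to surface.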

Watching the Clock: Cloudera’s Response to Leap Second Troubles

At 5 pm PDT on June 30, a leap second was added to Coordinated Universal Time (UTC). Within an hour, Cloudera Support started receiving reports of systems running at 100% CPU utilization. The Support Team worked quickly to understand and diagnose the problem and soon published a solution. Bugs due to the leap second coupled with the Amazon Web Services outage would make this Cloudera’s busiest support weekend to date.

Since Hadoop is written in Java and closely interoperates with the underlying OS, Cloudera Support troubleshoots not only all 17 components in the Hadoop ecosystem, but also any underlying Linux and Java bugs. Last weekend, many of our customers were affected by the now infamous “leap second” bugs. Initially, many assumed that Java and Linux would process the leap second gracefully. However, we soon discovered that this wasn’t the case, and depending on the version of Linux being used, several distinct issues were observed.

Background

Leap seconds are added to UTC to correct for Earth’s slowing rotation. The latest leap second was added last Saturday (6/30) at 23:59:60 UTC (5 pm PDT). Due to a missed function call in the Linux timekeeping code, the leap second was not accounted for properly. As a result, after the leap second, timers expired one second earlier than requested. Many applications use a recurring timer of 1 second or less; such timers expired immediately, causing the application to set another timer right away, ad infinitum. This infinite loop led to CPU load spikes that generated 21 separate support tickets.
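The spin described above can be illustrated with a toy model. The sketch below is not the kernel code and does not reproduce the actual bug; it only simulates a recurring timer whose expiry is shifted early by a fixed skew. The function name, the parameters, and the 1 ms cost charged to each immediate wakeup are assumptions made for illustration.

```python
def count_wakeups(interval_ms, skew_ms, duration_ms, spin_cost_ms=1):
    """Count wakeups of a recurring timer over a simulated time span.

    interval_ms:  period the application asks for
    skew_ms:      how much earlier than requested the timer fires
    spin_cost_ms: simulated time one loop iteration takes when the
                  timer fires immediately (an assumed value)
    """
    now = 0
    wakeups = 0
    while now < duration_ms:
        # The effective wait is the requested period minus the skew,
        # and can never be negative.
        wait = max(interval_ms - skew_ms, 0)
        # An immediate expiry still consumes a little time per iteration.
        now += wait if wait > 0 else spin_cost_ms
        wakeups += 1
    return wakeups

# A 500 ms recurring timer over 10 simulated seconds.
print(count_wakeups(500, 0, 10_000))      # no skew
print(count_wakeups(500, 1_000, 10_000))  # fires one second early
```

In this model, a timer with a period longer than the skew merely fires early, while any period of one second or less collapses to a zero wait, so the loop runs as fast as the CPU allows.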