Third-Party Libraries in C6

Cloudera has put a significant amount of work into upgrading the third-party libraries used in our just-released C6 version. A major release gave us the opportunity to upgrade many of the libraries we use. These upgrades allow us to avoid security vulnerabilities, use modern versions of libraries, and standardize library versions across CDH.

Modern software development relies on reusing other people’s code. Code reused in this fashion is called a “third-party library.” There are many, many examples of this: engineers who need a web server don’t start writing code for a web server—they use a popular third-party library like Jetty instead. When file compression is needed, Apache Commons-Compress does the job. To serialize data, Jackson-Databind is a good bet. The list goes on and on.

There are many benefits to using third-party libraries. Developers don’t have to reimplement a perfectly good wheel. The most popular libraries are high quality: they’re well tested, widely used, and have good governance. The popular ones also have permissive open source licenses. In almost all cases, they do a better job at a particular task than anything most developers could write on their own. Things are easier and better when someone else’s already-written code can be used for a task and the software developer can focus on building something new and interesting. Modern software tools recognize this. For example, in Java, the Maven ecosystem makes interactions with third-party libraries quite easy.

Although they can save a lot of time, third-party libraries are not entirely hassle-free. They do require some maintenance. In particular, regular upgrades are required for the following reasons:

  • If the third-party library has a security vulnerability, we must quickly upgrade to a version of that library that has addressed the security vulnerability.
  • Sometimes third-party libraries reach end-of-life, become obsolete, or change names. If we don’t migrate to a newer version or a replacement, security vulnerabilities and other issues found later might never be fixed.
  • To make it easier to use different CDH components together, it’s best to use the same version of each third-party library in all the projects we support.
  • Upgrading across major, backwards-compatibility-breaking versions can be challenging. It’s best to take on that challenge as part of the normal software release process and not as part of a patch fixing a security vulnerability.

The most important reason to upgrade is security vulnerabilities. A memorable example is the Equifax breach. Equifax used Apache Struts, but didn’t upgrade it after a major security vulnerability was discovered. Attackers used this vulnerability to steal the data of 143 million Americans. We don’t ever want to put our customers in a position where something similar could happen.

The maintenance of third-party libraries is complicated by the fact that they might not be used directly by a project. For example, we might use library X, and library X might use library Y. We call library X a “direct” dependency, and library Y an “indirect” or “transitive” dependency. It’s generally straightforward to upgrade direct dependencies, but indirect dependencies are specified by the library that includes them, and as a result are more difficult to change. For example, let’s say that we use library X, and library X uses a version of library Y that has a security vulnerability. At this point, we have to wait for library X to upgrade to a version of library Y without the security vulnerability, which is likely out of our control. It can get worse with more levels—library X can include library Y, which includes library Z, for example.
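To make the direct/transitive distinction concrete, here is a minimal Python sketch. The graph, the library names, and the `classify_dependencies` helper are all hypothetical and exist only for illustration. They are not part of any Cloudera tool or of Maven itself.

```python
from collections import deque

# Hypothetical dependency graph: each entry lists what that library declares directly.
DEPENDS_ON = {
    "our-project": ["library-x"],
    "library-x": ["library-y"],
    "library-y": ["library-z"],
    "library-z": [],
}


def classify_dependencies(graph, project):
    """Split everything reachable from `project` into direct and transitive sets."""
    direct = set(graph.get(project, []))
    transitive = set()
    seen = set(direct) | {project}
    queue = deque(direct)
    # Breadth-first walk: anything reached through another library is transitive.
    while queue:
        lib = queue.popleft()
        for dep in graph.get(lib, []):
            if dep not in seen:
                seen.add(dep)
                transitive.add(dep)
                queue.append(dep)
    return direct, transitive


direct, transitive = classify_dependencies(DEPENDS_ON, "our-project")
print("direct:", sorted(direct))
print("transitive:", sorted(transitive))
```

In this toy graph, library Z never appears in our project's own dependency list, yet it still ends up in what we ship.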

Toward the end of C5, many of the third-party libraries that we used were looking rather old. We wanted to upgrade them, but couldn’t without risking backwards compatibility. When changing a major version number (like from C5 to C6), however, it’s permissible to break backwards compatibility, and we were therefore able to make a significant amount of progress.

The backwards compatibility-related changes that came about from this project are noticeable, but should not be a cause for alarm for developers. Users will have to recompile jobs, but shouldn’t have to rewrite them. Users of services that expose SQL interfaces like Apache Impala should see no difference.

Some statistics should communicate the scale of the problem. Across all the software we support, we have over 600 unique direct dependencies. Including indirect dependencies, that number climbs to over 1500! Several of our larger projects, such as Apache Hive, Cloudera Manager, and Apache Hadoop, have more than 100 direct dependencies and over 300 indirect dependencies.

To track and measure our usage of third-party libraries, we built a tool called “Dependency Report,” which takes the following input:

  • The third-party libraries that each project uses, according to Maven
  • The most recent version of every library, according to Maven Central
  • The age of every library
  • Whether or not each library has a security vulnerability, according to OWASP’s “Dependency-Check”

Using this wealth of data, we were able to build a dashboard for each project. That dashboard can say things like, “Here are all the libraries with security vulnerabilities,” “These libraries are more than one major version behind and should be upgraded,” and “These libraries are over 10 years old.” Each library is assigned a score that helps developers prioritize which libraries to upgrade. There’s an All-Projects view that shows how the projects compare to one another. One view shows every single library in use across all our projects, and another highlights discrepancies in versions of third-party libraries across all projects.
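The scoring formula used by Dependency Report is not described here, so the following Python sketch is only an assumption about how such a score might combine the inputs listed above. The `LibraryInfo` fields, the weights, and the sample libraries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LibraryInfo:
    name: str
    used_major: int                # major version the project uses
    latest_major: int              # latest major version available
    age_years: float               # age of the version in use
    has_known_vulnerability: bool


def upgrade_priority(lib: LibraryInfo) -> int:
    """Higher score means upgrade sooner. The weights are illustrative assumptions."""
    score = 0
    if lib.has_known_vulnerability:
        score += 100               # security issues dominate everything else
    majors_behind = max(0, lib.latest_major - lib.used_major)
    if majors_behind > 1:
        score += 10 * majors_behind
    if lib.age_years > 10:
        score += 20
    return score


# Hypothetical sample data, not real measurements.
libraries = [
    LibraryInfo("lib-a", used_major=1, latest_major=4, age_years=12.0, has_known_vulnerability=False),
    LibraryInfo("lib-b", used_major=2, latest_major=2, age_years=1.0, has_known_vulnerability=True),
    LibraryInfo("lib-c", used_major=3, latest_major=3, age_years=0.5, has_known_vulnerability=False),
]

for lib in sorted(libraries, key=upgrade_priority, reverse=True):
    print(lib.name, upgrade_priority(lib))
```

Weighting a known vulnerability far above age or version lag reflects the priority stated earlier: security is the most important reason to upgrade.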

This is the view for Apache Hadoop dependencies:

[Figure: Dependencies for Hadoop]

This view shows where multiple versions of the same library are used in CDH:

[Figure: Duplicate dependencies for Hadoop]
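A view of version discrepancies like this one comes down to grouping usage records by library and flagging any library that appears at more than one version. The Python sketch below shows that idea under stated assumptions: the record format, project names, library names, and the `version_discrepancies` helper are hypothetical, not the tool’s actual implementation.

```python
from collections import defaultdict

# Hypothetical (project, library, version) records.
USAGE = [
    ("project-a", "lib-json", "2.1.0"),
    ("project-b", "lib-json", "2.9.5"),
    ("project-c", "lib-json", "2.9.5"),
    ("project-a", "lib-web", "9.3.0"),
    ("project-b", "lib-web", "9.3.0"),
]


def version_discrepancies(usage):
    """Return {library: {version: [projects]}} for libraries used at more than one version."""
    by_library = defaultdict(lambda: defaultdict(list))
    for project, library, version in usage:
        by_library[library][version].append(project)
    return {
        library: {version: sorted(projects) for version, projects in versions.items()}
        for library, versions in by_library.items()
        if len(versions) > 1
    }


for library, versions in version_discrepancies(USAGE).items():
    print(library, versions)
```

Here only `lib-json` is reported, because `lib-web` is used at a single version everywhere.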

We’re happy to say that this effort has been successful. Over the course of C6 development, we have addressed over 400 issues! Here are some of the more notable accomplishments:

  • In C5, we used two different types of web server—Jetty and Tomcat. The Jetty version was over 8 years old, and the Tomcat version was past its end-of-life. In C6, we have standardized on a modern version of Jetty across all our projects.
  • In C5, we used at least six different old and insecure versions of the jackson-databind library. In C6, we have standardized on one modern version of jackson-databind that has no known security vulnerabilities.

These achievements required changes in nearly every project, along with a significant amount of coordination and effort from all teams.

Going forward, we have constructed a dashboard to track metrics of our third-party libraries over time. We’ll be able to monitor new security vulnerabilities as they come in, alert teams, and address issues more quickly.

Thanks to these changes, C6 delivers a safer, more modern set of third-party libraries, and Cloudera will be able to maintain this standard of quality as we continue to invest in the future security of our products.

Michael Yoder