How Cloudera Uses Open Source

This article was originally posted by Tom Smith, Research Analyst and Business Strategist at DZone, Inc., on their website and is being shared here with permission.

Doug Cutting, Chief Architect at Cloudera, shares how the company uses open-source software to help companies use data to improve their business.

What does your company use open-source software to accomplish?

Everything we do. Cloudera is an open-source company. Most of our development efforts are spent creating and enhancing open-source software. Our platform (CDH) is an entirely open-source stack. Nearly every component is developed at The Apache Software Foundation. Our platform is collaboratively developed, with a diverse community. We contribute mightily to established projects like Apache Hadoop, Hive, Spark, and Solr, and have also instigated substantial new projects at Apache like Impala, Kudu, and Sentry. Our company’s central challenge is to make this complex suite of open-source software work seamlessly together to help institutions get more value from their data.

What open-source software do you use?

Like the rest of the world, we critically depend on open-source software. Open source now forms an essential part of the air that computing breathes. Most of our servers and many of our phones run Linux. The services our business uses, from Google, Apple, Workday, Salesforce, etc., are largely built on open-source foundations. Our developers use open-source tools like git, emacs, Jenkins, etc.

Our core product is an open-source platform (CDH) composed of a collection of mostly Apache projects: Hadoop, HBase, Spark, Impala, Kudu, Solr, etc. We also sell commercial software to help manage CDH, software that installs and configures all this open-source software.

Internally, we use our own open-source software stack to better understand our customers. We collect data from their use of our software, including installation and configuration logs, crash reports, service calls, etc., and then analyze these to improve both the software and our service.

What do you consider to be the most important elements of the open-source ecosystem? 

Open source is a better development model for platform software. Folks don’t want to build hard dependencies on potentially fragile companies into their businesses. They’d rather depend on robust open-source communities. It’s partly about lower costs, but more about long-term risk.

I’ve worked on software for about 30 years now. In that time, I’ve been an employee of seven different companies. Most of those companies are not around anymore. Most of the commercial software I wrote for them is not supported anymore. But for the past 18 years and five companies, I’ve worked on open source. While not all five companies are still around, nearly all of the open-source software I’ve worked on is still actively maintained, with commercial support available from multiple parties. So, in my personal experience, open-source software is a much better bet for long-term dependencies.

We see this validated more broadly in the market. New successful platforms are overwhelmingly open-source. Linux is the winning operating system. The Apache Hadoop stack is dominant in big data. Kubernetes, Docker, and others are vying to become the standard for containers and virtualization. The leading machine learning libraries are all open-source. Open source is now table stakes for aspiring platform technologies.

Who are the most important players in the open-source ecosystem?

Developers. They are the primary actors. But while some developers are independent and self-supporting, most are paid by employers to work on open-source software. So their employers are also important players. There’s a balance of power. Developers’ involvement with open source is often longer than their current employment and is a matter of public record. A developer’s reputation in an open-source community is thus a critical part of their long-term career. So while they are obligated to act consistently with their employer’s goals, they are wise to also act in the long-term interests of the open-source project. Fortunately, these are rarely at odds: employers also want the open-source projects they invest in to be long-lived and have happy, healthy developer communities.

What have been the most significant changes to the open-source ecosystem in the past year? 

The ecosystem continues to mature and grow. Older, established components get stronger and more featured. Hadoop 3.0 is currently making its way to users. Its hallmark change is the addition of erasure-coding based storage in HDFS. This doesn’t fundamentally change the sorts of applications that can be built, but it does let folks store 50% more data — which is significant.
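The storage saving from erasure coding comes down to simple arithmetic. The sketch below is one way to work it through, assuming a baseline of 3x replication and a Reed-Solomon layout of 6 data blocks plus 3 parity blocks. Neither figure is stated in the interview, and the function name is made up for illustration. The exact gain depends on the policy chosen.

```python
from fractions import Fraction


def storage_overhead(data_units: int, redundancy_units: int) -> Fraction:
    """Raw bytes written per byte of user data."""
    return Fraction(data_units + redundancy_units, data_units)


# Assumption: the baseline is 3x replication (1 original + 2 extra copies).
replication = storage_overhead(1, 2)

# Assumption: Reed-Solomon with 6 data blocks and 3 parity blocks.
erasure_coded = storage_overhead(6, 3)

# Fraction of raw disk space saved for the same amount of user data.
raw_saving = 1 - erasure_coded / replication

print(f"replication overhead:   {float(replication):.1f}x")
print(f"erasure-coded overhead: {float(erasure_coded):.1f}x")
print(f"raw storage saved:      {float(raw_saving):.0%}")
```

Under these assumptions the same data needs half the raw disk space, while still tolerating the loss of several blocks.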

I’m very excited about Apache Kudu, a relational storage engine for the Hadoop ecosystem. This lets folks more easily build applications whose data is rapidly updated and mutable. Combined with Kafka and Impala, folks can quickly develop large-scale real-time systems, queryable with SQL. This works well for IoT applications where lots of devices might be streaming data to the cloud and real-time analytics are needed.

As we hear every day, AI is having a renaissance. This has resulted in a number of very popular and useful machine learning libraries, nearly all open-source. This newfound success of machine learning methods in so many new areas has in large part been fueled by open-source collaboration.

What are real-world problems being solved by open-source software today?

Open source is no longer only used by web companies. It’s now mainstream. It’s hard to find an industry that’s not solving problems with open-source big data technology. Open-source tools help companies efficiently capture, store, process, and analyze vast amounts of data and transform that data into clear and actionable insights. Banks and telcos around the world turn to the Hadoop stack to better understand their customers, reducing fraud and customer churn and improving product quality. Retailers are optimizing inventory, pricing, and advertising. Manufacturers are monitoring and improving production. Hospitals are minimizing costs, curing diseases, and providing more humane care. Governments are stopping money laundering and making services more efficient. Farmers are monitoring their fields. Open-source data technology turns up almost everywhere.

What are the most common problems with the open-source ecosystem today?

A perennial tension in the open-source ecosystem is fragmentation versus experimentation. The ecosystem is more efficient when everyone agrees on a single software project to address a given problem. The development and maintenance of that project can thus be shared by a larger community, resulting in greater software quality as well as faster progress on new features. A single standard solution accelerates the entire ecosystem since other elements need only integrate with that one system, rather than multiple, incompatible implementations. Sometimes, however, not everyone can agree on how to approach a particular technical problem, and multiple solutions are created as different software projects. Sometimes, this is productive, as the different projects may serve different needs. When they’re truly duplicative efforts, usually one ends up capturing the majority of mindshare and the ecosystem consolidates around it, with its alternatives fading away. However, in some cases, multiple solutions persist longer than is optimal for the ecosystem. These are often kept alive by commercial rivalries. They become a tax on the ecosystem, slowing its progress and adoption. That, however, is the exception. The vast majority of successful projects provide unique value.

What’s the future of open-source software, from your perspective?

Open source has become the standard development model for platform software, those elements of applications that are shared building blocks. It is a more effective and efficient mechanism to create and maintain such software, as the market has now shown on multiple occasions. In the past decades, Linux has become the most popular operating system and Apache the most popular web server. More recently, with the Hadoop ecosystem, open-source now dominates big data technologies. Now we’re seeing it as the standard in the next technology wave of machine learning. Open source is now expected for platform technologies. Few even try to establish new commercial software platforms, and, when they do, they tend to fail.

Since open source is a more effective development model, I hope it continues its rise up the application stack into more verticals. We’re already seeing how open-source genomics platforms can accelerate progress in that area. Similarly, we need standard open-source tools for precision medicine, education, climate science, etc. so that we may make more rapid progress in these areas. And if we can establish open standards in other industries, then we can improve their productivity. Manufacturing, telecommunications, banking, transportation, healthcare, and most other industries all have common data problems that can be more efficiently addressed through the use of shared open-source software.

What’s your biggest concern with the current state of the open-source ecosystem?

The skills gap is one of the primary factors gating the rate of platform change, but it’s also a sign that innovation is at hand. The skills gap in big data will remain relatively constant in the next year and may deter people from adopting open-source technologies. When new technologies are created and vie for users, they are known by a few. Only once a particular type of software is a mature standard do we begin to have a substantial number of folks skilled in its use — but even then, the skills gap can persist. It will disappear only when we stop seeing big improvements to the stack, which we don’t want.

How do you ensure the security of open-source software?

It’s hard to catch new attacks with the classic approach, in which filters scan for particular kinds of behavior that someone has manually coded based on prior attacks. If one instead builds models that define usual behavior, one can catch anomalies. A standard format for network data lets different firms build different applications that detect intrusions, providing a cybersecurity ecosystem built on an open data model.
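The idea of modeling usual behavior and flagging departures from it can be sketched in a few lines. This is a minimal illustration, not the approach used by any particular product: it assumes a single numeric signal (connections per minute from one host), a simple mean and standard deviation as the "model", and a threshold of three standard deviations. All names and data are hypothetical.

```python
from statistics import mean, stdev


def fit_baseline(samples):
    """Summarize usual behavior as (mean, sample standard deviation)."""
    return mean(samples), stdev(samples)


def is_anomalous(value, baseline, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # With no observed variation, any different value is unusual.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical history: connections per minute during normal operation.
normal_traffic = [52, 48, 50, 47, 53, 49, 51, 50, 46, 54]
baseline = fit_baseline(normal_traffic)

# Hypothetical new observations; no rule for the spike was coded by hand.
observations = [51, 49, 240, 50]
flagged = [v for v in observations if is_anomalous(v, baseline)]
print(flagged)
```

Unlike a hand-written filter, nothing here encodes what a specific attack looks like; the spike is flagged only because it departs from the learned baseline.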

What do developers need to keep in mind when working with open-source software?

Focus not on individual technologies but on understanding the best uses of each component of the open-source data ecosystem and how they can be connected to solve problems. That high-level architectural understanding is the most valuable skill, along with understanding how new technologies fit in, what they might replace, and what they might enable.

