A World-Class EDW Requires a World-Class Hadoop Team
Persado is the global leader in persuasion marketing technology, a new category in digital marketing. Our revolutionary technology maps the genome of marketing language and generates the messages that work best for any customer and any product at any time. To assure the highest quality experience for both our clients and end-users, our engineering team collaborates with Ph.D. statisticians and data analysts to develop new ways to segment audiences, discover content, and deliver the most relevant and effective marketing messages in real time.
Given the challenge of creating a market based on ongoing data collection and massive query ability, the data warehouse organization ultimately plays the most important role in the persuasion marketing value chain, assuring a steady and unobstructed multidirectional flow of information. My team continuously ensures that Persado’s infrastructure is aligned with the needs of our data scientists. That work includes regularly generating KPI reports, managing data from heterogeneous sources, preparing customized analyses, and even implementing specific statistical algorithms in Java based on reference implementations in R.
As a senior engineer in the data operations organization in Athens, Greece, and the first to sit on both Persado’s data warehouse team and the data reporting team at Upstream Systems (which incubated and spun off Persado in 2012), I was responsible for the technical recommendation to step away from RDBMS and take strides towards the magical realm of NoSQL several years ago. This decision became a primary enabler for Persado’s transition from being a useful attribute of Upstream’s platform to becoming a full-on software company, delivering real value to clients.
Since 2010, my team has designed and implemented a variety of NoSQL systems. Although our initial experiments were frustrating at times, we eventually succeeded in creating a world-class online transaction processing (OLTP) system based on MongoDB to handle ad interactions with customers. The database’s internal MapReduce mechanism was able to generate the required reports, the aggregation framework introduced new features to our reporting platform, and we applied machine learning algorithms to help process the data. As our analytics and reporting needs became more sophisticated, we eventually needed to decouple online analytical processing (OLAP) into a technology stack of its own.
We quickly identified Apache Hadoop as the perfect solution to help us pick up, aggregate, and process the data from heterogeneous sources like MongoDB, MySQL config servers, and Apache logs that were being populated in documents within AWS S3 buckets and consumed by Apache Kafka and Apache ZooKeeper. However, like many organizations that mature into sophisticated systems and analytics, we faced a fundamental problem: we had too few experienced Big Data engineers on staff to grow our capabilities and scale out our systems. Given the strategic priority of developing the best data warehouse platform to fulfill our customers’ needs, we decided that Hadoop training was the most immediately actionable solution and would help us choose the right vendor to support our long-term Big Data objectives.
We evaluated the three best-known Hadoop companies: two supporting an open-source platform and one selling a proprietary distribution. We ultimately chose Cloudera because of the experience of its instructors, its vast partner ecosystem, its role as the innovator driving Hadoop advancement, its fundamental commitment to open source, and its reputation as an amazing company with which to work. In the end, there was evidence in the market that Cloudera would be able to support our use case and growth from the first step of implementation, while the claims of the other two companies could not be as readily and independently validated.
Fear Not the Pig
We worked with Cloudera University’s expert curriculum team to tailor a private weeklong training that would meet our immediate and long-term needs. We started benefiting from our decision to work with Cloudera almost right away since no other company offers a full Data Analyst training targeted at both developers and analysts, which was one of our biggest priorities. The intensive workshop also included the full Cloudera Developer Training for Apache Hadoop with the option of testing for the sought-after CCDH certification following the class.
Having an instructor onsite allowed the team to ask questions based on our actual experiences and to explain our architecture and goals, so we could validate that we were moving in the right direction. Working with Cloudera, which has the only (and best) full-time Hadoop trainers in mainland Europe, helped us save time and money and preserve productivity that would otherwise have been lost to travel, jet lag, and the stresses of being away from family and colleagues.
Our takeaways were significant in both the general technical and software engineering domains. The team learned to “embrace the Pig,” and we are now able to combine Hive and Pig jobs as appropriate to our use case. Throughout the excellent labs, we saw the importance of custom partitioning and how it can affect our MapReduce jobs’ performance.
One of the greatest values of the live training was learning pointers for handling common issues and useful tricks for the more complex challenges. For example, we had run into an imbalanced user attribute that was resulting in a few reducers taking the bulk of the load. A custom partitioner remedied this by distributing the load evenly across reducers, which shortened the execution time of the MapReduce job flow.
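To make the skew problem concrete, here is a minimal sketch in plain Python that simulates how records are assigned to reducers. It is not our production partitioner, which would be a Java class plugged into the MapReduce job. The attribute value "unknown", the segment names, the reducer count, and every function name are hypothetical and chosen only for illustration.

```python
import zlib
from collections import Counter

NUM_REDUCERS = 4
HOT_KEYS = {"unknown"}  # hypothetical attribute value that dominates the data


def stable_hash(value):
    # crc32 is deterministic across runs, unlike Python's built-in str hash.
    return zlib.crc32(value.encode("utf-8"))


def default_partition(key, record_id, num_reducers=NUM_REDUCERS):
    # Hash partitioning: every record with the same key goes to one reducer.
    return stable_hash(key) % num_reducers


def skew_aware_partition(key, record_id, num_reducers=NUM_REDUCERS):
    # Known hot keys are spread by record id; all other keys hash as usual.
    if key in HOT_KEYS:
        return record_id % num_reducers
    return stable_hash(key) % num_reducers


def reducer_load(records, partition_fn):
    # Count how many records each reducer would receive.
    load = Counter(partition_fn(key, record_id) for key, record_id in records)
    return [load[i] for i in range(NUM_REDUCERS)]


# 90% of the records share one attribute value (assumed sample data).
records = [("unknown", i) for i in range(900)]
records += [("segment-%d" % (i % 10), i) for i in range(100)]

default_load = reducer_load(records, default_partition)
balanced_load = reducer_load(records, skew_aware_partition)
print("default partitioning:   ", default_load)
print("skew-aware partitioning:", balanced_load)
```

One design consequence is worth noting: once a hot key is spread across several reducers, each reducer sees only part of that key's records, so any per-key aggregate needs a second, much smaller merge step.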
By learning more about the HDFS internals, we realized the need to balance writing to and reading from files further down our data pipeline. Our Kafka system was previously getting messages as JSON documents and dumping them onto S3. This can work, but it imposes severe overhead because processing has to search through a huge number of files. Also, if we were to use HDFS on local EC2 nodes, the same file count could strain the NameNode, which keeps the metadata for every file and block in memory, and we would have to resort to federating our HDFS namespace or scaling our NameNode capacity.
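One common way to reduce the file count is to batch many messages into fewer, larger files before writing them out. The sketch below shows the idea in plain Python under stated assumptions: the message shape, the 4 KB batch limit, and the function name are hypothetical, and a real pipeline would pick a much larger target size and write each batch to S3 or HDFS rather than keep it in memory.

```python
import json


def batch_messages(messages, max_batch_bytes):
    """Group messages into newline-delimited JSON batches.

    Each batch stays within max_batch_bytes, unless a single message is
    larger than the limit, in which case it forms a batch of its own.
    """
    batches = []
    current_lines = []
    current_size = 0
    for message in messages:
        line = json.dumps(message, sort_keys=True) + "\n"
        size = len(line.encode("utf-8"))
        # Close the current batch before it would exceed the size limit.
        if current_lines and current_size + size > max_batch_bytes:
            batches.append("".join(current_lines))
            current_lines = []
            current_size = 0
        current_lines.append(line)
        current_size += size
    if current_lines:
        batches.append("".join(current_lines))
    return batches


# Hypothetical sample: 1,000 small click events.
messages = [{"id": i, "event": "click"} for i in range(1000)]
batches = batch_messages(messages, max_batch_bytes=4096)
print("messages:", len(messages), "files after batching:", len(batches))
```

Each batch is a newline-delimited JSON file, so a downstream job can still read it one record per line.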
As engineers, we are naturally inclined to research new technologies, work on our own pet projects, and join interesting communities every day. However, Hadoop has proven to be an exceptional case in that true expertise, particularly tied to a specific use case, is very difficult to achieve, even with intense study. Cloudera training expedited our way through the learning curve, helped us answer our specific questions, and offered best practices derived from the engineers who built the platform. Cloudera also offers the unique added value of incorporating insights into its courses from the use cases that are driving the Hadoop market. Moreover, everyone on my team is confident that they learned Hadoop on the world’s most relevant and up-to-date open-source distribution in CDH4.
Trained to Persuade
At Persado, we collect data from a wide variety of sources, convert it to a base reference, and finally perform aggregations to derive meaningful reports for our internal teams and clients. An array of libraries, from R to Mahout to Java, enables a wide range of functions. Processes such as clustering, classification, and recommendations feed our objective to constantly identify the best message to serve each specific audience. Needless to say, without the right solution, we could have a Big Data problem on our hands.
After the Cloudera training, it was evident to all participants that the strategy to move towards Hadoop was the right solution for achieving the company’s vision and goals. Cloudera is helping us discover useful insights from our data and allowing our employees around the world to better analyze both ad hoc queries and precalculated aggregates. We were able to quickly implement Hadoop as a key component of our data warehouse, which removed the burden we anticipated for our other systems and coordinated diverse projects for deeper, more relevant queries and greater speed to insight.
Cloudera’s tailored private training was a perfect fit for our objectives and was the cost-effective option for our needs. With the right training to get the team up to speed and working towards our Hadoop strategy, we are now well along on our journey to deliver unmatched value to our customers from the most sophisticated persuasion marketing platform in the industry. Cloudera has not only prepared us for success today, but has also trained us to face and prevail over our Big Data challenges in the future.