In our previous blog post in this series, we explored the benefits of using GPUs for data science workflows and demonstrated how to set up sessions in Cloudera Machine Learning (CML) to access NVIDIA GPUs for accelerating machine learning projects. While the time-saving potential of using GPUs for complex and large tasks is massive, […]
When working on complex or rigorous enterprise machine learning projects, data scientists and machine learning engineers experience varying degrees of processing lag when training models at scale. While model training on small data typically takes minutes, doing the same on large volumes of data can take hours or even weeks. To overcome this, practitioners often […]
Today’s enterprise data science teams have one of the most challenging, yet most important, roles to play in your business’s ML strategy. In our current landscape, businesses that have adopted a successful ML strategy are outperforming their competitors by over 9%. The implications of ML for the future of business are clear. However, only 4% […]
In this last installment, we’ll discuss a demo application that uses PySpark.ML to build a classification model from training data stored in both Cloudera’s Operational Database (powered by Apache HBase) and Apache HDFS. The model is then scored and served through a simple web application. For more context, this demo is based […]
2020 was a year of immense change and disruption. Despite the challenges, 2020 also provided opportunities for leaps forward in digital transformation. At Cloudera, an example of this leap is our first virtual Data Impact Awards, which were held in November last year. One of our standout […]
In this installment, we’ll discuss how to do Get/Scan Operations and utilize PySpark SQL. Afterward, we’ll talk about Bulk Operations and then troubleshoot some errors you may come across while trying this yourself. Read the first blog here. Get/Scan Operations Using Catalogs In this example, let’s load the table ‘tblEmployee’ that we made in the […]
Python is used extensively among Data Engineers and Data Scientists to solve all sorts of problems, from ETL/ELT pipelines to building machine learning models. Apache HBase is an effective data storage system for many workflows, but accessing this data specifically through Python can be a struggle. For data professionals who want to make use […]
In this blog we will take you through a persona-based data adventure, with short demos attached, to show you the A-Z data worker workflow expedited and made easier through self-service, seamless integration, and cloud-native technologies. You will learn all the parts of Cloudera’s Data Platform that together will accelerate your everyday Data Worker tasks. This […]
COVID-19 has forced virtually every industry to accelerate its digital capabilities. While it can be argued that digital transformation was already underway, it’s hard to dispute that it has accelerated in recent months. A recent McKinsey survey, cited in CRN, shows that worldwide, 58 percent of customer interactions were digital as of July […]
Apache Hadoop Distributed File System (HDFS) is the most popular file system in the big data world. The Apache Hadoop File System interface provides integration with many other popular storage systems, such as Apache Ozone, S3, and Azure Data Lake Storage. Some HDFS users want to extend the HDFS Namenode capacity by configuring Federation of […]