Category Archives: Graph Processing

How-to: Do Scalable Graph Analytics with Apache Spark

Categories: Data Science Graph Processing How-to Spark

Get started with scalable graph analysis via simple examples that utilize GraphFrames and Spark SQL on HDFS.

Graphs—also known as “networks”—are ubiquitous across web applications. As a refresher, a graph consists of nodes and edges. A node can be any object, such as a person or an airport, and an edge is a relation between two nodes, such as a friendship or an airline connection between two cities.

Read More

How-to: Build Advanced Time-Series Pipelines in Apache Crunch

Categories: Graph Processing How-to

Learn how creating dataflow pipelines for time-series analysis is a lot easier with Apache Crunch.

In a previous blog post, I described a data-driven market study based on Wikipedia access data and content. I explained how useful it is to combine several public data sources, and how this approach sheds light onto the hidden correlations across Wikipedia pages.

One major task in the above was to apply structural analysis to networks reconstructed by time-series analysis techniques.

Read More

How-to: Manage Time-Dependent Multilayer Networks in Apache Hadoop

Categories: Graph Processing Hadoop Use Case

Using an appropriate network representation and the right tool set are the key factors in successfully merging structured and time-series data for analysis.

In Part 1 of this series, you took your first steps for using Apache Giraph, the highly scalable graph-processing system, alongside Apache Hadoop. In this installment, you’ll explore a general use case for analyzing time-dependent, Big Data graphs using data from multiple sources.

Read More

How-to: Write and Run Apache Giraph Jobs on Apache Hadoop

Categories: General Graph Processing Hadoop

Create a test environment for writing and testing Giraph jobs, or just for playing around with Giraph and small sample datasets.

Apache Giraph is a scalable, fault-tolerant implementation of graph-processing algorithms in Apache Hadoop clusters of up to thousands of computing nodes. Giraph is in use at companies like Facebook and PayPal, for example, to help represent and analyze the billions (or even trillions) of connections across massive datasets.

Read More