Introduction to Apache Beam Using Java

2022-08-20 11:04:39 By : Ms. Rensolin Service

Attend QCon San Francisco (Oct 24-28) and find practical inspiration from software leaders. Register

Facilitating the Spread of Knowledge and Innovation in Professional Software Development

The panelists discuss ways to improve as developers. Are better tools the solution, or can simple changes in mindset help? And what practices are already here, but not yet universally adopted?

Legacy applications actually benefit the most from concepts like a Minimum Viable Product (MVP) and its related Minimum Viable Architecture (MVA). Once you realize that every release is an experiment in value in which the release either improves the value that customers experience or doesn’t, you realize that every release, even one of a legacy application, can be thought of in terms of an MVP.

In this annual report, the InfoQ editors discuss the current state of AI, ML, and data engineering and what emerging trends you as a software engineer, architect, or data scientist should watch. We curate our discussions into a technology adoption curve with supporting commentary to help you understand how things are evolving.

In this podcast Shane Hastie, Lead Editor for Culture & Methods spoke to Arpit Mohan about the importance and value of interpersonal skills in teamwork

Erin Schnabel discusses how application metrics align with other observability and monitoring methods, from profiling to tracing, and the limits of aggregation.

Learn how cloud architectures help organizations take care of application and cloud security, observability, availability and elasticity. Register Now.

Understand the emerging software trends you should pay attention to. Attend in-person on Oct 24-28, 2022.

Make the right decisions by uncovering how senior software developers at early adopter companies are adopting emerging trends. Register Now.

InfoQ Homepage Articles Introduction to Apache Beam Using Java

Lire ce contenu en français

In this article, we are going to introduce Apache Beam, a powerful batch and streaming processing open source project, used by big companies like eBay to integrate its streaming pipelines and by Mozilla to move data safely between its systems.

Apache Beam is a programming model for processing data, supporting batch and streaming.

Using the provided SDKs for Java, Python and Go, you can develop pipelines and then choose a backend that will run the pipeline.

Beam Model (Frances Perry & Tyler Akidau)

The key concepts in the Beam programming model are:

A basic pipeline operation consists of 3 steps: reading, processing and writing the transformation result. Each one of those steps is defined programmatically using one of the Apache Beam SDKs. 

In this section, we will create pipelines using the Java SDK. You can choose between creating a local application (using Gradle or Maven) or you can use the Online Playground. The examples will use the local runner as it will be easier to verify the result using JUnit Assertions.

In this first example, the pipeline will receive an array of numbers and will map each element multiplied by 2.

The first step is creating the pipeline instance that will receive the input array and run the transform function.  As we're using JUnit to run Apache Beam, we can easily create a TestPipeline as a test class attribute. If you prefer running on your main application instead, you'll need to set the pipeline configuration options, 

Now we can create the PCollection that will be used as input to the pipeline. It'll be an array instantiated directly from memory but it could be read from anywhere supported by Apache Beam:

Then we apply our transform function that will multiply each dataset element by two:

To verify the results we can write an assertion:

Note the results are not supposed to be sorted as the input, because Apache Beam processes each item independently and in parallel.

The test at this point is done, and we run the pipeline by calling:

The reduce operation is the combination of multiple input elements that results in a smaller collection, usually containing a single element.

MapReduce (Frances Perry & Tyler Akidau)

Now let's extend the example above to sum up all the items multiplied by two, resulting in a MapReduce transform.

Each PCollection transform results in a new PCollection instance, which means we can chain transformations using the apply method. In this case, the Sum operation will be used after multiplying each input by 2:

FlatMap is an operation that first applies a map on each input element that usually returns a new collection, resulting in a collection of collections. A flat operation is then applied to merge all the nested collections, resulting in a single one.

The next example will be transforming arrays of strings into a unique array containing each word. 

First, we declare our list of words that will be used as the pipeline input:

Then we create the input PCollection using the list above:

Now we apply the flatmap transformation, which will split the words in each nested array and merge the results in a single list:

A common job in data processing is aggregating or counting by a specific key. We'll demonstrate it by counting the number of occurrences of each word from the previous example.

After having the flat array of string, we can chain another PTransform:

One of the principles of Apache Beam is reading data from anywhere, so let's see in practice how to use a text file as a datasource.

The following example will read the content of a "words.txt"  with the content "An advanced unified programming model". Then the transform function will return a PCollection containing each word from the text.

As seen in the previous example for the input, Apache Beam has multiple built-in output connectors. In the following example, we will count the number of each word present in the text file "words.txt" that contains only a single sentence ("An advanced unified programming model") and the output will be persisted in a text file format.

Even the file writing is optimized for parallelism by default, which means Beam will determine the best number of shards (files) to persist the result. The files will be located on folder src/main/resources and will have the prefix "wordcount", the shard number and the total number of shards  as defined in the last output transformation.

When running it on my laptop, four shards were generated:

First shard (file name: wordscount-00001-of-00003):

Second shard (file name: wordscount-00002-of-00003):

Third shard (file name: wordscount-00003-of-00003):

The last shard was created but in the end was empty, because all words were already processed.

We can take advantage of Beam extensibility by writing a custom transform function. A custom transformer will improve code maintainability as will remove duplication.

Basically we'd need to create a subclass of PTransform, stating the type of the input and output as Java Generics. Then we override the expand method and inside its content we place the duplicated logic, that receives a single string and returns a PCollection containing each word.

The test scenario refactored to use WordsFileParser now become:

The result is a clearer and more modular pipeline.

Windowing in Apache Beam (Frances Perry & Tyler Akidau)

A common problem in streaming processing is grouping the incoming data by a certain time interval, specially when handling large amounts of data. In this case, the analysis of the aggregated data per hour or per day is more relevant than analyzing each element of the dataset.

In the following example, let's suppose we're working in a fintech and we are receiving transactions events containing the amount and the instant the transaction happened and we want to retrieve the total amount transacted per day.

Beam provides a way to decorate each PCollection element with a timestamp. We can use this to create a PCollection representing 5 money transactions:

Next, we'll apply two transform functions:

In the first window (2022-02-01) it's expected the total amount of 30 (10+20), while in the second window (2022-02-05) we should see 120 (30+40+50)  in the total amount.

Each IntervalWindow instance needs to match the exact beginning and end timestamps of the chosen duration, so the chosen time has to be "00:00:00".

Apache Beam is a powerful battle-tested data framework, allowing both batching and streaming processing. We have used the Java SDK to build map, reduce, group, windowing and other operations.

Apache Beam can be well suited for developers who works with embarrassingly parallel tasks to simplify the mechanics of large-scale data processing.

Its connectors, SDKs and support for various runners bring flexibility and by choosing a cloud native runner like Google Cloud Dataflow, you get automated management of computational resources.

Becoming an editor for InfoQ was one of the best decisions of my career. It has challenged me and helped me grow in so many ways. We'd love to have more people join our team.

Uncover emerging trends and practices from domain experts. Attend in-person at QCon San Francisco (October 24-28, 2022).

A round-up of last week’s content on InfoQ sent out every Tuesday. Join a community of over 250,000 senior developers. View an example

You need to Register an InfoQ account or Login or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

A round-up of last week’s content on InfoQ sent out every Tuesday. Join a community of over 250,000 senior developers. View an example

Real-world technical talks. No product pitches. Practical ideas to inspire you and your team. QCon San Francisco - Oct 24-28, In-person. QCon San Francisco brings together the world's most innovative senior software engineers across multiple domains to share their real-world implementation of emerging trends and practices. Uncover emerging software trends and practices to solve your complex engineering challenges, without the product pitches.Save your spot now

InfoQ.com and all content copyright © 2006-2022 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with. Privacy Notice, Terms And Conditions, Cookie Policy