When it comes to complex transformations, we want to keep our flows organized and clean. Yes, I’m talking about functions and classes.

How should we create and use them in Apache Beam?


ParDo

ParDo is a Beam transform that runs a given class against every element of a PCollection to apply our transformation logic. To implement ParDo, we can start from a syntax like the one below.
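Here is a minimal sketch of that syntax. MyDoFn and input_collection are hypothetical names standing in for any DoFn subclass and any existing PCollection.

```python
import apache_beam as beam

# MyDoFn is a hypothetical DoFn subclass (defined in the next section);
# input_collection stands for any existing PCollection.
output = input_collection | beam.ParDo(MyDoFn())
```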

Custom class

We start by defining a class that inherits from beam.DoFn. The one required method is process(). We implement it and Beam will call it automatically for each element.
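A minimal skeleton might look like this; the body of process() here is just a placeholder that emits the element unchanged.

```python
import apache_beam as beam

class MyDoFn(beam.DoFn):
    def process(self, element):
        # Beam calls process() once per element of the input PCollection.
        # Use yield to emit zero, one, or many output elements.
        yield element
```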

There are some other functions we will see later.

Main script

We can just import the custom class in the main script and run it with beam.ParDo(<class name>()).
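For example, assuming the class above lives in a module called my_dofn (a hypothetical name), the main script could look like this:

```python
import apache_beam as beam

from my_dofn import MyDoFn  # hypothetical module and class names

with beam.Pipeline() as p:
    (
        p
        | "Create input" >> beam.Create(["a", "b", "c"])
        | "Apply DoFn" >> beam.ParDo(MyDoFn())
        | "Print" >> beam.Map(print)
    )
```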


Example

Let’s say we are running Beam to transform a CSV file, just like in the last part, but this time we want to use our own logic, wrapped in a class.

Here it is: a class named CSVToDictFn that inherits from beam.DoFn and transforms a row of CSV into a dict.
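The original code is not reproduced here, so this is a minimal sketch of what CSVToDictFn might look like, assuming each element is one comma-separated line and the field names below are hypothetical.

```python
import apache_beam as beam

# Hypothetical field names; the real ones depend on the CSV from the last part.
FIELD_NAMES = ["id", "name", "age"]

class CSVToDictFn(beam.DoFn):
    def process(self, element):
        # element is one CSV line, e.g. "1,Alice,30"
        values = element.split(",")
        # Emit the row as a dictionary, e.g. {"id": "1", "name": "Alice", "age": "30"}
        yield dict(zip(FIELD_NAMES, values))
```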

Then we go back to main and call the class, like this.
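Assuming the class is saved in a module named csv_to_dict (a hypothetical name) and the input file is input.csv, the main script might look like this:

```python
import apache_beam as beam

from csv_to_dict import CSVToDictFn  # hypothetical module name

with beam.Pipeline() as p:
    (
        p
        | "Read CSV" >> beam.io.ReadFromText("input.csv", skip_header_lines=1)
        | "To dict" >> beam.ParDo(CSVToDictFn())
        | "Print" >> beam.Map(print)
    )
```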

Beam will call the process() method for us.

Running this prints one dictionary per CSV row.

Note that beam.ParDo expects process() to hand back an iterable of output elements. Using yield turns process() into a generator, so each dictionary is emitted as a single element. If we return the dict instead, Beam iterates over the dict itself, and iterating a dict produces its keys.
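To see the difference, here is a sketch of the same process() written with return instead of yield. The class name is hypothetical, and FIELD_NAMES is the list from the sketch above.

```python
class CSVToDictReturnFn(beam.DoFn):
    def process(self, element):
        values = element.split(",")
        row = dict(zip(FIELD_NAMES, values))
        # Beam iterates over the returned value, and iterating a dict
        # produces its keys, so each key becomes a separate output element.
        # To emit the whole dict with return, wrap it in a list: return [row]
        return row
```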

This is what we get if we use return: the output contains only the keys of the dictionary.


Example with parameters

Let's move to another level.

If we want to pass parameters to the class, we can implement __init__() to receive and store them.

Here we have the CSVToDictFn class again. This version has an __init__() method that receives a schema variable, which will be used to cast the CSV fields to their proper types.
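Again, the original code is not shown here; a sketch of this version might look like the following, assuming schema is a dict mapping field names to Python types.

```python
import apache_beam as beam

class CSVToDictFn(beam.DoFn):
    def __init__(self, schema):
        super().__init__()
        # schema is assumed to be a dict such as {"id": int, "name": str, "age": int}
        self.schema = schema

    def process(self, element):
        values = element.split(",")
        field_names = list(self.schema.keys())
        field_types = list(self.schema.values())
        # Pair each field name and type with the CSV value,
        # casting the value to its declared type.
        yield {
            name: cast(value)
            for name, cast, value in zip(field_names, field_types, values)
        }
```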

And we can call the class like this.
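A sketch of the main script for this version, with a hypothetical schema, module name, and file name:

```python
import apache_beam as beam

from csv_to_dict import CSVToDictFn  # hypothetical module name

# Hypothetical schema; the real field names and types depend on the CSV.
schema = {"id": int, "name": str, "age": int}

with beam.Pipeline() as p:
    (
        p
        | "Read CSV" >> beam.io.ReadFromText("input.csv", skip_header_lines=1)
        | "To dict" >> beam.ParDo(CSVToDictFn(schema))
        | "Print" >> beam.Map(print)
    )
```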

We prepare schema first and then construct the class inside beam.ParDo(), passing schema in the parentheses.

The logic of this version of CSVToDictFn is to take the field names, field types, and values, map them together with zip(), and construct a dictionary. This can be illustrated with the short sketch below.
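The names, types, and values here are hypothetical; the sketch only shows how zip() lines them up.

```python
# Hypothetical field names, types, and values for one CSV row
field_names = ["id", "name", "age"]
field_types = [int, str, int]
values = ["1", "Alice", "30"]

row = {
    name: cast(value)
    for name, cast, value in zip(field_names, field_types, values)
}
# row == {"id": 1, "name": "Alice", "age": 30}
```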