When it comes to complex transformations, we want to keep our flows organized and clean. Yes, I’m talking about functions and classes.

How should we create and use them in Apache Beam?


ParDo

ParDo is a Beam transform that runs a given class against every element of a PCollection to apply our transformation logic. To implement ParDo, we can start from a syntax like the one below.
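Here is a minimal sketch of that syntax. MyDoFn and input_collection are hypothetical names standing in for any DoFn subclass and any existing PCollection.

```python
import apache_beam as beam

# MyDoFn is a hypothetical DoFn subclass (defined in the next section);
# input_collection stands for any existing PCollection.
output = input_collection | beam.ParDo(MyDoFn())
```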

Custom class

We start by defining a class that inherits from beam.DoFn. The one required method is process(). We implement it and Beam will call it automatically for each element.
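A minimal skeleton might look like this; the body of process() here is just a placeholder that emits the element unchanged.

```python
import apache_beam as beam

class MyDoFn(beam.DoFn):
    def process(self, element):
        # Beam calls process() once per element of the input PCollection.
        # Use yield to emit zero, one, or many output elements.
        yield element
```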

There are some other functions we will see later.

Main script

We can just import the custom class in the main script and run it with beam.ParDo(<class name>()).
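For example, assuming the class above lives in a module called my_dofn (a hypothetical name), the main script could look like this:

```python
import apache_beam as beam

from my_dofn import MyDoFn  # hypothetical module and class names

with beam.Pipeline() as p:
    (
        p
        | "Create input" >> beam.Create(["a", "b", "c"])
        | "Apply DoFn" >> beam.ParDo(MyDoFn())
        | "Print" >> beam.Map(print)
    )
```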


Example

Let’s say we are running Beam to transform a CSV file, just like in the last part, but this time we want to use our own logic, wrapped in a class.

Here it is: a class named CSVToDictFn that inherits from beam.DoFn and transforms a row of CSV into a dict.
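The original code is not reproduced here, so this is a minimal sketch of what CSVToDictFn might look like, assuming each element is one comma-separated line and the field names below are hypothetical.

```python
import apache_beam as beam

# Hypothetical field names; the real ones depend on the CSV from the last part.
FIELD_NAMES = ["id", "name", "age"]

class CSVToDictFn(beam.DoFn):
    def process(self, element):
        # element is one CSV line, e.g. "1,Alice,30"
        values = element.split(",")
        # Emit the row as a dictionary, e.g. {"id": "1", "name": "Alice", "age": "30"}
        yield dict(zip(FIELD_NAMES, values))
```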

Then we go back to main and call the class, like this.
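Assuming the class is saved in a module named csv_to_dict (a hypothetical name) and the input file is input.csv, the main script might look like this:

```python
import apache_beam as beam

from csv_to_dict import CSVToDictFn  # hypothetical module name

with beam.Pipeline() as p:
    (
        p
        | "Read CSV" >> beam.io.ReadFromText("input.csv", skip_header_lines=1)
        | "To dict" >> beam.ParDo(CSVToDictFn())
        | "Print" >> beam.Map(print)
    )
```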

Beam will call the process() method for us.

Running this prints one dictionary per CSV row.

Note that beam.ParDo expects process() to hand back an iterable of output elements. Using yield turns process() into a generator, so each dictionary is emitted as a single element. If we return the dict instead, Beam iterates over the dict itself, and iterating a dict produces its keys.
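To see the difference, here is a sketch of the same process() written with return instead of yield. The class name is hypothetical, and FIELD_NAMES is the list from the sketch above.

```python
class CSVToDictReturnFn(beam.DoFn):
    def process(self, element):
        values = element.split(",")
        row = dict(zip(FIELD_NAMES, values))
        # Beam iterates over the returned value, and iterating a dict
        # produces its keys, so each key becomes a separate output element.
        # To emit the whole dict with return, wrap it in a list: return [row]
        return row
```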

This is what we get if we use return: the output contains only the keys of the dictionary.


Example with parameters

Let's move to another level.

If we want to pass parameters to the class, we can implement __init__() to receive and store them.

Here we have the CSVToDictFn class again. This version has an __init__() method that receives a schema variable, which will be used to cast the CSV fields to their proper types.
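Again, the original code is not shown here; a sketch of this version might look like the following, assuming schema is a dict mapping field names to Python types.

```python
import apache_beam as beam

class CSVToDictFn(beam.DoFn):
    def __init__(self, schema):
        super().__init__()
        # schema is assumed to be a dict such as {"id": int, "name": str, "age": int}
        self.schema = schema

    def process(self, element):
        values = element.split(",")
        field_names = list(self.schema.keys())
        field_types = list(self.schema.values())
        # Pair each field name and type with the CSV value,
        # casting the value to its declared type.
        yield {
            name: cast(value)
            for name, cast, value in zip(field_names, field_types, values)
        }
```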

And we can call the class like this.
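A sketch of the main script for this version, with a hypothetical schema, module name, and file name:

```python
import apache_beam as beam

from csv_to_dict import CSVToDictFn  # hypothetical module name

# Hypothetical schema; the real field names and types depend on the CSV.
schema = {"id": int, "name": str, "age": int}

with beam.Pipeline() as p:
    (
        p
        | "Read CSV" >> beam.io.ReadFromText("input.csv", skip_header_lines=1)
        | "To dict" >> beam.ParDo(CSVToDictFn(schema))
        | "Print" >> beam.Map(print)
    )
```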

We prepare schema first and then construct the class inside beam.ParDo(), passing schema in the parentheses.

The logic of this version of CSVToDictFn is to take the field names, field types, and values, map them together with zip(), and construct a dictionary. This can be illustrated with the short sketch below.
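The names, types, and values here are hypothetical; the sketch only shows how zip() lines them up.

```python
# Hypothetical field names, types, and values for one CSV row
field_names = ["id", "name", "age"]
field_types = [int, str, int]
values = ["1", "Alice", "30"]

row = {
    name: cast(value)
    for name, cast, value in zip(field_names, field_types, values)
}
# row == {"id": 1, "name": "Alice", "age": 30}
```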