Let's try: Apache Beam part 3 - my own functions
When it comes to complex transformations, we would design our flows to be more organized and clean.
When it comes to complex transformations, we would design our flows to be more organized and clean. Yes, I’m talking about functions and classes.
How should we create and use them in Apache Beam?
ParDo
ParDo
is a function of Apache Beam to execute a given class to do some transformations. To implement ParDo
, we can start from this syntax.
Custom class
We start from defining as class and inherits beam.DoFn
. A required function here is process()
. We need to implement this and Beam will call the function automatically.
There are some other functions we will see later.
Main script
We can just import it and execute the custom class by beam.ParDo(<class name>())
.
Example
Let’s say we are running Beam to transform a CSV as same as the last part but we want to use our own functions, also wrapped in a class.
Here it is. We have a class CSVToDictFn
. It inherits beam.DoFn
here. This class transforms from a row of CSV to a dict.
Then we go back to main and call the class, like this.
It will call the function process
itself.
Here is the output.
Because beam.ParDo
will accumulate the result in the end, we should use yield
here where yield
produces an iterator of each element while return
returns an iterator of all elements.
This is what we will get if we use return
. It is only the keys of the dictionary.
Example with parameters
Let’s move to another level.
If we want to parse parameters to the class, we can implement __init__()
to set them.
Here we have CSVToDictFn
class. This version has __init__()
to receive variable schema
. The variable will be used to cast CSV fields to different types.
And we can call the class like this.
We prepared schema
at line 8 and call the class at line 23. When calling the class, we also parse schema
in the parentheses.
The logic of this version of CSVToDictFn
is to get field names, field types, and value, map with zip
and construct a dictionary. This can be illustrated as the following figure.