Let's try: Apache Beam part 6 - instant IO
Apache Beam provides inputs and outputs for PCollection in many packages. We just import them, call them properly, and get the job done.
In this post we will look at 3 IO (input/output) modules that I usually work with.
1. Text (Google Cloud Storage)
A very basic one.
Beam has the beam.io library, which provides ReadFromText() and WriteToText() to read and write a text file respectively.
We also use them to work with files in Google Cloud Storage, as those are text files too.
- line 14: ReadFromText() reads a file at input_file, which comes from argparse.
- line 16: WriteToText() creates a file at output_file.
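The full snippet from the original post isn't embedded here, so below is only a minimal sketch of such a pipeline. The input_file and output_file arguments match the list above, but the uppercase transform is just a placeholder, so the line numbers won't line up with the original gist.

```python
# Minimal sketch (not the original gist): read a text file, transform, write out.
import argparse

import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.options.pipeline_options import PipelineOptions

parser = argparse.ArgumentParser()
parser.add_argument("--input_file", help="e.g. gs://my-bucket/input.txt")
parser.add_argument("--output_file", help="e.g. gs://my-bucket/output")
args, beam_args = parser.parse_known_args()

with beam.Pipeline(options=PipelineOptions(beam_args)) as pipeline:
    (
        pipeline
        | "Read" >> ReadFromText(args.input_file)    # works with local paths or gs:// URIs
        | "Upper" >> beam.Map(str.upper)             # placeholder transform
        | "Write" >> WriteToText(args.output_file)   # writes sharded output files
    )
```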
This pipeline can be drawn as a diagram like this.
sequenceDiagram
autonumber
participant i as input text file
actor b as Apache Beam
participant o as output text file
i->>b: read file
note over b: transform
b->>o: write file
2. Database (Google BigQuery)
Another Google Cloud service that I use so often.
- line 14: ReadFromBigQuery() with query= to run the query.
- line 18: supply temp_dataset= to allow Beam to use the given dataset to store the temporary data it generates.
If we don’t supply temp_dataset=, Beam will automatically create a new temporary dataset every time it runs.
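Again, the original gist isn't shown here, so this is only a rough sketch under assumptions: the project, dataset, and table names are made up, and the destination-table write is added just to match the diagram below.

```python
# Rough sketch only: all project/dataset/table names below are placeholders.
import apache_beam as beam
from apache_beam.io.gcp.bigquery import ReadFromBigQuery, WriteToBigQuery
from apache_beam.io.gcp.internal.clients import bigquery
from apache_beam.options.pipeline_options import PipelineOptions

query = "SELECT word, word_count FROM `bigquery-public-data.samples.shakespeare` LIMIT 10"

# An existing dataset for Beam's temporary query results;
# without temp_dataset= Beam creates a new temporary dataset on every run.
temp_dataset = bigquery.DatasetReference(projectId="my-project", datasetId="beam_temp")

with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (
        pipeline
        | "Query" >> ReadFromBigQuery(query=query,
                                      use_standard_sql=True,
                                      temp_dataset=temp_dataset)
        | "Write" >> WriteToBigQuery(
            "my-project:my_dataset.destination_table",
            schema="word:STRING,word_count:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```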
This pipeline is shown in the diagram below:
sequenceDiagram
autonumber
participant s as BigQuery<br/>source table
actor b as Apache Beam
participant d as BigQuery<br/>destination table
note over b: read query file
b->>s: request query job
activate s
s->>b: return query result
deactivate s
b->>d: write to destination
3. Messaging (Google Cloud Pub/Sub)
Now we come to real-time things.
Google Cloud Pub/Sub is one of the sources and sinks integrated with Beam. We can set up Beam to listen to either a topic or a subscription by design.
This time, I set up Beam to read data from a subscription, transform it, then send it to another topic.
- line 10: set the option streaming=True to allow Beam to run as a streaming pipeline.
- line 15: ReadFromPubSub() reads from a specific subscription.
- line 26: after transforming, publish the result to a topic through WriteToPubSub().
Make sure the PCollection contains byte strings by using .encode() before throwing them to WriteToPubSub().
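As before, this is only a sketch rather than the original gist: the subscription and topic paths are placeholders, and the transform simply uppercases the payload instead of doing the real parsing from the post.

```python
# Streaming sketch with placeholder resource names; not the original code.
import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub, WriteToPubSub
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

INPUT_SUBSCRIPTION = "projects/my-project/subscriptions/first-subscription"
OUTPUT_TOPIC = "projects/my-project/topics/second-topic"

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # run as a streaming pipeline

def transform(message: bytes) -> bytes:
    # Placeholder transform; WriteToPubSub() expects bytes, hence .encode().
    return message.decode("utf-8").upper().encode("utf-8")

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> ReadFromPubSub(subscription=INPUT_SUBSCRIPTION)  # messages arrive as bytes
        | "Transform" >> beam.Map(transform)
        | "Write" >> WriteToPubSub(topic=OUTPUT_TOPIC)
    )
```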
We can test by publishing something to the topic and then pulling from the subscription.
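If you prefer to script that check instead of clicking through the console, a small script with the google-cloud-pubsub client library (placeholder project and resource names, not part of the Beam pipeline) might look like this:

```python
# Quick manual test outside Beam, using the google-cloud-pubsub client library.
from google.cloud import pubsub_v1

project = "my-project"  # placeholder

# Publish a test message to the first topic.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project, "first-topic")
print("published:", publisher.publish(topic_path, b"hello beam").result())

# Pull the transformed result from the second subscription.
subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(project, "second-subscription")
response = subscriber.pull(request={"subscription": sub_path, "max_messages": 1})
for received in response.received_messages:
    print("got:", received.message.data)
    subscriber.acknowledge(request={"subscription": sub_path, "ack_ids": [received.ack_id]})
```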
originMessageId is parsed from the source topic in the transformation step.
This diagram describes the flow of this Beam pipeline.
sequenceDiagram
autonumber
actor u as User
participant p1 as First<br/>publisher
participant s1 as First<br/>subscriber (pull)
actor b as Apache Beam
participant p2 as Second<br/>publisher
participant s2 as Second<br/>subscriber (pull)
u->>p1: publish a message
p1->>s1: send a message
b->>s1: pull a message
activate s1
s1->>b: return a message
deactivate s1
note over b: transform
b->>p2: publish a message
p2->>s2: send a message
Repo
Feel free to review the full code here.