Let's try: Apache Beam part 6 - instant IO
Apache Beam provides inputs and outputs for PCollections in many packages. We just import and call them properly, and the job gets done.
In this blog, we will look at 3 IO (input/output) modules that I usually work with.
1. Text (Google Cloud Storage)
A very basic one.
Beam's beam.io library provides ReadFromText() and WriteToText() to read and write a text file respectively.
We can also use them to work with files in Google Cloud Storage, since those are read and written as text files.
- ReadFromText() reads the file at input_file, which comes from argparse (see the sketch after this list).
- WriteToText() creates a file at output_file.
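Here is a minimal sketch of such a pipeline. The argument names mirror the bullets above, and the upper-casing transform is my own placeholder, not the blog's actual code:

```python
import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_file")   # e.g. gs://my-bucket/input.txt
    parser.add_argument("--output_file")  # e.g. gs://my-bucket/output
    args, beam_args = parser.parse_known_args()

    with beam.Pipeline(options=PipelineOptions(beam_args)) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText(args.input_file)   # one element per line
            | "Transform" >> beam.Map(str.upper)                 # placeholder transform
            | "Write" >> beam.io.WriteToText(args.output_file)   # writes sharded text files
        )


if __name__ == "__main__":
    main()
```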
This pipeline can be drawn as a diagram like this.
sequenceDiagram
  autonumber
  participant i as input text file
  actor b as Apache Beam
  participant o as output text file
  
  i->>b: read file
  note over b: transform
  b->>o: write file
2. Database (Google BigQuery)
Another Google Cloud service that I use very often.
- ReadFromBigQuery() with query= runs the given query.
- Supply temp_dataset= so that Beam can use the given dataset to store the temporary data it generates.
If we don't supply temp_dataset=, Beam will automatically create a new temporary dataset every time it runs.
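A minimal sketch of this kind of pipeline, assuming placeholder project, dataset, and table names (my-project, my_dataset, beam_temp, and so on) and a placeholder transform:

```python
import apache_beam as beam
from apache_beam.io.gcp.internal.clients import bigquery
from apache_beam.options.pipeline_options import PipelineOptions

query = "SELECT name, score FROM `my-project.my_dataset.source_table`"

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "Read" >> beam.io.ReadFromBigQuery(
            query=query,
            use_standard_sql=True,
            # Reuse this dataset for temporary data instead of creating a new one per run.
            temp_dataset=bigquery.DatasetReference(
                projectId="my-project", datasetId="beam_temp"
            ),
        )
        # Placeholder transform: double every score.
        | "Transform" >> beam.Map(lambda row: {**row, "score": row["score"] * 2})
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.destination_table",
            schema="name:STRING,score:FLOAT",
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
        )
    )
```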
This pipeline is shown in the diagram below:
sequenceDiagram
  autonumber
  participant s as BigQuery<br/>source table
  actor b as Apache Beam
  participant d as BigQuery<br/>destination table
  
  note over b: read query file
  b->>s: request query job
  activate s
  s->>b: return query result
  deactivate s
  b->>d: write to destination
3. Messaging (Google Cloud Pub/Sub)
Now we come to real-time things.
Google Cloud Pub/Sub is one of the sources and sinks integrated with Beam. We can set up Beam to listen to a topic or a subscription by design.
This time, I set up Beam to read data from a subscription, transform it, then send the result to another topic.
- Set the pipeline option streaming=True to allow Beam to run as a streaming pipeline.
- ReadFromPubSub() reads from a specific subscription.
- After transforming, share the result to a topic through WriteToPubSub(). Make sure the PCollection contains byte strings by using .encode() before throwing them to WriteToPubSub(). (A sketch follows this list.)
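A minimal streaming sketch, assuming placeholder subscription and topic paths and an upper-casing transform:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # run as a streaming pipeline

subscription = "projects/my-project/subscriptions/first-subscription"
output_topic = "projects/my-project/topics/second-topic"

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(subscription=subscription)
        | "Decode" >> beam.Map(lambda data: data.decode("utf-8"))
        | "Transform" >> beam.Map(str.upper)  # placeholder transform
        # WriteToPubSub() expects byte strings, so encode before writing.
        | "Encode" >> beam.Map(lambda text: text.encode("utf-8"))
        | "Write" >> beam.io.WriteToPubSub(topic=output_topic)
    )
```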
We can test by publishing something on the topic and pulling from the subscription, like this.
originMessageId is parsed from the source topic's message in the transformation step.
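If you want to script that test instead of clicking through the console, a sketch with the google-cloud-pubsub client could look like this; the project, topic, and subscription names are placeholders:

```python
from google.cloud import pubsub_v1

project = "my-project"

# Publish a test message to the first topic.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project, "first-topic")
message_id = publisher.publish(topic_path, b"hello beam").result()
print("published message:", message_id)

# Pull the transformed message from the second subscription.
subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(project, "second-subscription")
response = subscriber.pull(request={"subscription": sub_path, "max_messages": 1})
for received in response.received_messages:
    print("pulled:", received.message.data, dict(received.message.attributes))
    subscriber.acknowledge(
        request={"subscription": sub_path, "ack_ids": [received.ack_id]}
    )
```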
This diagram describes the flow of this Beam pipeline.
sequenceDiagram
  autonumber
  
  actor u as User
  participant p1 as First<br/>publisher
  participant s1 as First<br/>subscriber (pull)
  actor b as Apache Beam
  participant p2 as Second<br/>publisher
  participant s2 as Second<br/>subscriber (pull)
 
  u->>p1: publish a message
  p1->>s1: send a message
  b->>s1: pull a message
  activate s1
  s1->>b: return a message
  deactivate s1
  note over b: transform
  b->>p2: publish a message
  p2->>s2: send a message
Repo
Feel free to review the full code here.
