Let's try: Apache Beam part 6 - instant IO
Apache Beam provides inputs and outputs for PCollection in many packages. We just import them, call them properly, and get the job done.
In this post we will look at 3 IO (input/output) modules that I usually work with.
1. Text (Google Cloud Storage)
A very basic one.
Beam has the beam.io library, which provides ReadFromText() and WriteToText() to read and write a text file respectively.
We also use them to work with files in Google Cloud Storage, as those are text files too.
- line 14: ReadFromText() reads a file at input_file, which comes from argparse.
- line 16: WriteToText() creates a file at output_file.
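The full snippet from the original post isn't embedded here, so below is only a minimal sketch of such a pipeline. The input_file and output_file arguments match the list above, but the uppercase transform is just a placeholder, so the line numbers won't line up with the original gist.

```python
# Minimal sketch (not the original gist): read a text file, transform, write out.
import argparse

import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.options.pipeline_options import PipelineOptions

parser = argparse.ArgumentParser()
parser.add_argument("--input_file", help="e.g. gs://my-bucket/input.txt")
parser.add_argument("--output_file", help="e.g. gs://my-bucket/output")
args, beam_args = parser.parse_known_args()

with beam.Pipeline(options=PipelineOptions(beam_args)) as pipeline:
    (
        pipeline
        | "Read" >> ReadFromText(args.input_file)    # works with local paths or gs:// URIs
        | "Upper" >> beam.Map(str.upper)             # placeholder transform
        | "Write" >> WriteToText(args.output_file)   # writes sharded output files
    )
```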
This pipeline can be drawn as a diagram like this.
sequenceDiagram
autonumber
participant i as input text file
actor b as Apache Beam
participant o as output text file
i->>b: read file
note over b: transform
b->>o: write file
2. Database (Google BigQuery)
Another Google Cloud service that I use so often.
- line 14: ReadFromBigQuery() with query= to run the query.
- line 18: supply temp_dataset= to allow Beam to use the given dataset to store the temporary data it generates.
If we don’t supply temp_dataset=, Beam will automatically create a new temporary dataset every time it runs.
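Again, the original gist isn't shown here, so this is only a rough sketch under assumptions: the project, dataset, and table names are made up, and the destination-table write is added just to match the diagram below.

```python
# Rough sketch only: all project/dataset/table names below are placeholders.
import apache_beam as beam
from apache_beam.io.gcp.bigquery import ReadFromBigQuery, WriteToBigQuery
from apache_beam.io.gcp.internal.clients import bigquery
from apache_beam.options.pipeline_options import PipelineOptions

query = "SELECT word, word_count FROM `bigquery-public-data.samples.shakespeare` LIMIT 10"

# An existing dataset for Beam's temporary query results;
# without temp_dataset= Beam creates a new temporary dataset on every run.
temp_dataset = bigquery.DatasetReference(projectId="my-project", datasetId="beam_temp")

with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (
        pipeline
        | "Query" >> ReadFromBigQuery(query=query,
                                      use_standard_sql=True,
                                      temp_dataset=temp_dataset)
        | "Write" >> WriteToBigQuery(
            "my-project:my_dataset.destination_table",
            schema="word:STRING,word_count:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```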
This pipeline is shown in the diagram below:
sequenceDiagram
autonumber
participant s as BigQuery<br/>source table
actor b as Apache Beam
participant d as BigQuery<br/>destination table
note over b: read query file
b->>s: request query job
activate s
s->>b: return query result
deactivate s
b->>d: write to destination
3. Messaging (Google Cloud Pub/Sub)
Now we come to real-time things.
Google Cloud Pub/Sub is one of the sources and sinks integrated with Beam. We can set up Beam to listen to either a topic or a subscription by design.
This time, I set up Beam to read data from a subscription, transform it, then send it to another topic.
- line 10: set the option streaming=True to allow Beam to run as a streaming pipeline.
- line 15: ReadFromPubSub() reads from a specific subscription.
- line 26: after transforming, publish the result to a topic through WriteToPubSub().
Make sure the PCollection contains byte strings by using .encode() before throwing them to WriteToPubSub().
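As before, this is only a sketch rather than the original gist: the subscription and topic paths are placeholders, and the transform simply uppercases the payload instead of doing the real parsing from the post.

```python
# Streaming sketch with placeholder resource names; not the original code.
import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub, WriteToPubSub
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

INPUT_SUBSCRIPTION = "projects/my-project/subscriptions/first-subscription"
OUTPUT_TOPIC = "projects/my-project/topics/second-topic"

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # run as a streaming pipeline

def transform(message: bytes) -> bytes:
    # Placeholder transform; WriteToPubSub() expects bytes, hence .encode().
    return message.decode("utf-8").upper().encode("utf-8")

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> ReadFromPubSub(subscription=INPUT_SUBSCRIPTION)  # messages arrive as bytes
        | "Transform" >> beam.Map(transform)
        | "Write" >> WriteToPubSub(topic=OUTPUT_TOPIC)
    )
```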
We can test by publishing something to the topic and then pulling from the subscription.
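If you prefer to script that check instead of clicking through the console, a small script with the google-cloud-pubsub client library (placeholder project and resource names, not part of the Beam pipeline) might look like this:

```python
# Quick manual test outside Beam, using the google-cloud-pubsub client library.
from google.cloud import pubsub_v1

project = "my-project"  # placeholder

# Publish a test message to the first topic.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project, "first-topic")
print("published:", publisher.publish(topic_path, b"hello beam").result())

# Pull the transformed result from the second subscription.
subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(project, "second-subscription")
response = subscriber.pull(request={"subscription": sub_path, "max_messages": 1})
for received in response.received_messages:
    print("got:", received.message.data)
    subscriber.acknowledge(request={"subscription": sub_path, "ack_ids": [received.ack_id]})
```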
originMessageId is parsed from the source topic in the transformation step.
This diagram describes the flow of this Beam pipeline.
sequenceDiagram
autonumber
actor u as User
participant p1 as First<br/>publisher
participant s1 as First<br/>subscriber (pull)
actor b as Apache Beam
participant p2 as Second<br/>publisher
participant s2 as Second<br/>subscriber (pull)
u->>p1: publish a message
p1->>s1: send a message
b->>s1: pull a message
activate s1
s1->>b: return a message
deactivate s1
note over b: transform
b->>p2: publish a message
p2->>s2: send a message
Repo
Feel free to review the full code here.