Data pipelines sometimes have to handle very large files, so we need an approach that gets the job done with fewer frustrations, such as running out of resources or glitches in the files.

Then how?

Typically we write files

In Python, the usual way is to open a target file in w (write) mode and then put some content in it.
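A minimal sketch of that plain write could look like this. The path `target.txt` in the temp directory and the two lines of content are placeholders chosen for illustration, not something from a real pipeline.

```python
import os
import tempfile

# Hypothetical target path, used only for this sketch.
target_path = os.path.join(tempfile.gettempdir(), "target.txt")

# "w" creates the file, or empties it first if it already exists.
with open(target_path, "w", encoding="utf-8") as target:
    target.write("first line\n")
    target.write("second line\n")
```

This works well while the whole content fits comfortably in memory.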

Then we meet chunks

Let's say we want to read a large file and write it to a destination, but we can't read it all at once. This is where chunk processing helps. A chunk is a small piece of something big, so we split the big thing into pieces and transfer them one by one until we are finished. We read (r) a fixed size from the source file each time, then write that piece into another file.
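The chunked copy described above might be sketched as follows. The function name `copy_in_chunks` and the default `chunk_size` of 1024 characters are assumptions for this example; pick a size that suits your memory budget.

```python
def copy_in_chunks(source_path, target_path, chunk_size=1024):
    """Copy a text file piece by piece and return the number of chunks."""
    chunks = 0
    with open(source_path, "r", encoding="utf-8") as source, \
         open(target_path, "w", encoding="utf-8") as target:
        while True:
            # Read at most chunk_size characters from the current position.
            chunk = source.read(chunk_size)
            if not chunk:
                # An empty string means we reached the end of the file.
                break
            target.write(chunk)
            chunks += 1
    return chunks
```

Only one chunk is held in memory at a time, so the size of the source file no longer matters much.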

Chunk and decompression

If we need some extra operation, we can apply it to each chunk. For example, we can decompress a gzip file with chunk processing. This time we use rb to read the gzip source file as binary and wb to write binary into the target text file. The decompression itself is done by the zlib library.
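One way to sketch this with zlib is a streaming decompressor object, which keeps its state between chunks. The function name `gunzip_in_chunks` and the 64 KiB default are assumptions for this example, and the sketch assumes a single-member gzip file (the common case).

```python
import zlib


def gunzip_in_chunks(source_path, target_path, chunk_size=64 * 1024):
    """Decompress a gzip file chunk by chunk into target_path."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    with open(source_path, "rb") as source, open(target_path, "wb") as target:
        while True:
            chunk = source.read(chunk_size)  # compressed bytes
            if not chunk:
                break
            # The decompressor object carries state across calls.
            target.write(decompressor.decompress(chunk))
        # Write out anything still buffered inside the decompressor.
        target.write(decompressor.flush())
```

Calling `zlib.decompress()` on each chunk separately would fail, because a compressed chunk cut at an arbitrary byte is not a valid stream on its own.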

Chunk from database connectors

Another case is simply dumping database data. First we connect to the database and execute a query, then call fetchmany() to get each chunk so we can process it as we like. This time we append each chunk to the file in a (append) mode and use the csv library for CSV formatting.
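The text does not name a database, so this sketch assumes sqlite3 from the standard library; most Python database connectors expose the same `cursor.execute()` and `cursor.fetchmany()` calls. The function name `dump_query_to_csv` and the default of 1000 rows per chunk are also assumptions.

```python
import csv
import sqlite3


def dump_query_to_csv(connection, query, target_path, chunk_size=1000):
    """Append query results to a CSV file chunk by chunk; return row count."""
    cursor = connection.cursor()
    cursor.execute(query)
    total = 0
    while True:
        # Fetch at most chunk_size rows; an empty list means no rows are left.
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        # "a" appends to the file; newline="" lets csv control line endings.
        with open(target_path, "a", newline="", encoding="utf-8") as target:
            csv.writer(target).writerows(rows)
        total += len(rows)
    cursor.close()
    return total
```

A call could look like `dump_query_to_csv(sqlite3.connect("my.db"), "SELECT * FROM items", "items.csv")`, where the database file and table name are hypothetical. Because the file is opened in append mode, remove any old output first if you want a fresh dump.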

These are code samples you can try and adapt to your own work for performance optimization.
