Data Integration (EP 2) - Take it out

Hello to myself and to everyone around the world!!

As discussed in the last episode, we will now build a simple workflow to integrate data in Talend.

Let’s say we need to combine the data from 10 files into one. How can we do that?


Talend’s interface

[Figure: parts of the Talend workspace]

The left panel is called the Repository. It stores all our jobs and their building blocks, e.g. custom source code and metadata.

The right panel is the Palette, which displays a list of available components.

The bottom panel holds the Property tabs:

  • Property – Job shows the job description: creation date and version
  • Property – Context holds the job variables
  • Property – Component displays the settings and configuration of the selected component
  • Property – Run is for running the program manually

Let’s do the exercise

[Figure: the Talend exercise]

There are 2 boxes of operations. This means we need at least 2 components: one for reading the 10 files and one for writing 1 file.

[Figure: adding components in Talend]

Those are tFileInputDelimited for reading one CSV file and tFileOutputDelimited for writing one CSV file.

[Figure: adding a row link in Talend]

Next, we connect the two: right-click and select Row > Main. This way, each row from tFileInputDelimited becomes a row in tFileOutputDelimited.

Right now we have a design for a single file. Then we check whether the data needs to be transformed and, luckily, it doesn’t.
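
To make the single-file design concrete, here is a minimal Python sketch of what a Row > Main link between a reader and a writer amounts to: every input row is passed on unchanged as one output row. This is only an illustration, not the code Talend generates, and the function name copy_rows is made up for this example.

```python
import csv


def copy_rows(source_path, target_path):
    """Read every row from one CSV file and write it unchanged to another.

    Hypothetical helper for illustration; returns the number of rows copied.
    """
    count = 0
    with open(source_path, newline="", encoding="utf-8") as src, \
         open(target_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        for row in reader:        # one row from the input ...
            writer.writerow(row)  # ... becomes one row in the output
            count += 1
    return count
```

No transformation happens between reading and writing, which matches the exercise: the rows only need to be moved, not changed.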

Next, we have to define the schema. For example, the file contains 2 columns: first name and last name.

[Figure: the Talend schema editor]

Click ... and a schema box will appear.

  • first_name as String (text)
  • last_name also as String

Don’t forget to set the same schema on both tFileInputDelimited and tFileOutputDelimited (we can click Sync Columns to copy the schema immediately).
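
The idea of a schema can also be sketched outside Talend. The Python snippet below assumes the files carry a header line with exactly the two columns above and treats every value as text; the names EXPECTED_COLUMNS and read_with_schema are invented for this sketch.

```python
import csv

# Assumed schema for this sketch: two text columns, in this order.
EXPECTED_COLUMNS = ["first_name", "last_name"]


def read_with_schema(text_stream, expected=EXPECTED_COLUMNS):
    """Read CSV text and fail early if the header does not match the schema."""
    reader = csv.DictReader(text_stream)
    if reader.fieldnames != list(expected):
        raise ValueError(
            f"schema mismatch: got {reader.fieldnames!r}, expected {list(expected)!r}"
        )
    # Every value stays a string, like the String type chosen in the schema box.
    return [dict(row) for row in reader]
```

Checking the header before reading the rows plays a role similar to keeping the input and output schemas in sync: a mismatch shows up immediately instead of producing a shifted or broken output file.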

tFileInputDelimited supports only a single file, so we need another component, tFileList, to retrieve the list of files in a specific folder.

[Figure: Talend components linked by a row]

On tFileList, we fill in the path of the folder that contains those 10 files in the “Directory” box. The figure below shows my 10 files prepared in that folder.

[Figure: the prepared files]

After that, we connect tFileList to tFileInputDelimited with Row > Iterate to read each file in the folder.

[Figure: the Iterate link in Talend]

tFileList lists all files that meet our conditions and exposes the current one as the variable CURRENT_FILEPATH, so tFileInputDelimited reads them one by one.
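
In Talend, this variable is usually referenced in the “File name” field of tFileInputDelimited with the expression ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH")), assuming the component is named tFileList_1. As a rough Python analogy of the iterate link, the sketch below yields one file path at a time; the function name list_files and the *.csv pattern are assumptions made for this example.

```python
from pathlib import Path


def list_files(directory, pattern="*.csv"):
    """Yield the path of each matching file, one at a time, in name order.

    Each yielded value plays the role of CURRENT_FILEPATH for one iteration.
    """
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():  # skip sub-folders that happen to match the pattern
            yield str(path)
```

A loop such as for current_filepath in list_files(folder) then handles one file per pass, which is the same shape as tFileList driving tFileInputDelimited.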

One more thing: CSV files normally use a comma as the separator, so don’t forget to check that the Field Separator setting on tFileInputDelimited matches the files.
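
Why the separator matters can be shown with a tiny Python sketch. The helper parse_line is hypothetical and simply splits one line of text with a given separator.

```python
import csv
import io


def parse_line(line, separator):
    """Split a single delimited line into its columns using the given separator."""
    return next(csv.reader(io.StringIO(line), delimiter=separator))
```

With the right separator, parse_line("Ada,Lovelace", ",") gives two columns, ['Ada', 'Lovelace']. With the wrong one, parse_line("Ada,Lovelace", ";") gives a single column, ['Ada,Lovelace'], so the whole line would land in first_name.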

[Figure: the file name setting for the iteration in Talend]

On tFileOutputDelimited, we want one destination file to store all the data from the 10 source files, so check “Append” to let the program add data at the end of the file. Check “Include Header” when we need headers.
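
Here is a small Python sketch of the append idea. It opens the destination in append mode and writes the header only when the file is new or empty; that header rule is a choice made for this sketch, not a statement about how Talend handles “Include Header”. The name append_rows is made up as well.

```python
import csv
import os


def append_rows(target_path, header, rows, include_header=True):
    """Add rows at the end of target_path, writing the header only once."""
    # Sketch assumption: the header belongs only at the top of a new or empty file.
    is_new = not os.path.exists(target_path) or os.path.getsize(target_path) == 0
    with open(target_path, "a", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        if include_header and is_new:
            writer.writerow(header)
        writer.writerows(rows)
```

Calling it once per source file keeps growing the same destination file. Without append mode, each call would overwrite the previous one and only the last file’s data would survive.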

[Figure: checking Append in Talend]

Set the destination filename as desired, and finally run the program as shown below.

[Figure: Talend test run]

And here is the result.

[Figure: output files]

See, the program works properly. Next time we will find out how to schedule the program to run at the time we want.


See you again

This post is licensed under CC BY 4.0 by the author.