Data Integration (EP 2) - Take it out
Hello to myself and everyone around the world!!
As discussed in the last episode, we will now build a simple workflow to integrate data in Talend.
Let’s say we need to combine the data from 10 files into one. How can we do it?
Talend’s interface
The left panel is called the Repository. It stores all our jobs and their building blocks, e.g. custom source code and metadata.
The right panel is the Palette, which lists the available components.
The bottom panel is Property, which has several tabs:
- Property – Job: the job description, such as creation date and version
- Property – Context: the job's variables
- Property – Component: the settings and configuration of the selected component
- Property – Run: for running the program manually
Let’s do the exercise
There are 2 boxes of operations, which means we need at least 2 components: one for reading the 10 files and one for writing 1 file.
Those are tFileInputDelimited for reading one CSV file and tFileOutputDelimited for writing one CSV file.
Next, we connect the two by right-clicking and selecting Row > Main. As a result, each row from tFileInputDelimited becomes a row in tFileOutputDelimited.
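To make the idea of Row > Main concrete, here is a minimal Python sketch of the same flow outside Talend. It is not what Talend generates; the function name copy_rows and its parameters are my own, chosen only for illustration.

```python
import csv


def copy_rows(source_path, target_path, separator=","):
    """Copy a delimited file row by row and return the number of rows copied."""
    count = 0
    with open(source_path, newline="", encoding="utf-8") as src, \
            open(target_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter=separator)   # plays the role of the input component
        writer = csv.writer(dst, delimiter=separator)   # plays the role of the output component
        for row in reader:
            # Each row read from the source becomes exactly one row in the target.
            writer.writerow(row)
            count += 1
    return count
```

Nothing is changed along the way: a row goes in and the same row comes out, which is all the Main link does in this job.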
We now have a design for a single file. Next we check whether the data needs to be transformed and, luckily, it does not.
Next, we have to define the schema. For example, the file contains 2 columns: first name and last name.
Click ... and a schema box will appear.
- first_name as String (text)
- last_name also as String
Don’t forget to set exactly the same schema on both tFileInputDelimited and tFileOutputDelimited (we can click Sync Columns to copy the schema immediately).
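The two-column schema can be pictured with a small Python sketch. The names SCHEMA and read_with_schema are hypothetical, and the strict column-count check is my own choice for illustration; how Talend treats a row that does not fit the schema depends on the component settings.

```python
import csv
import io

# The schema from this exercise: two columns, both plain text.
SCHEMA = ["first_name", "last_name"]


def read_with_schema(text, schema=SCHEMA, separator=","):
    """Parse delimited text and label each value with its column name."""
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    for line_number, values in enumerate(reader, start=1):
        if len(values) != len(schema):
            # Illustrative rule only: reject rows that do not match the schema.
            raise ValueError(
                f"line {line_number}: expected {len(schema)} columns, got {len(values)}"
            )
        rows.append(dict(zip(schema, values)))
    return rows
```

Using the same schema list on the reading side and the writing side is the code equivalent of Sync Columns: both ends agree on the column names and their order.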
tFileInputDelimited supports only a single file, so we need another component, tFileList, to retrieve the list of files in a specific folder.
On tFileList, we fill in the “Directory” box with the path of the folder that contains those 10 files. The figure below shows my 10 files prepared in that folder.
After that, we connect tFileList to tFileInputDelimited with Row > Iterate to read each file in the folder.
tFileList lists all files that meet our conditions and exposes the current one as the variable CURRENT_FILEPATH. We use that variable as the file name on tFileInputDelimited, so the files are read one by one.
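The iterate step can be sketched in Python as a loop over the files of a folder. This is only an analogy under my own assumptions: the function names, the "*.csv" pattern and the sorted order are hypothetical, and the loop variable current_filepath merely mirrors the role of CURRENT_FILEPATH.

```python
import csv
from pathlib import Path


def iterate_files(directory, pattern="*.csv"):
    """Yield the path of each matching file in the folder, one per iteration."""
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            yield str(path)


def read_all(directory, separator=","):
    """Read every matching file in turn and collect all rows in one list."""
    rows = []
    for current_filepath in iterate_files(directory):
        # The reader only ever sees one file: the one for the current iteration.
        with open(current_filepath, newline="", encoding="utf-8") as handle:
            rows.extend(csv.reader(handle, delimiter=separator))
    return rows
```

The reading code itself stays a single-file reader. Only the path it receives changes on each pass, which is why one tFileInputDelimited is enough for all 10 files.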
One more thing: CSV files normally use a comma as the separator, so don’t forget to check that the field separator setting on the components matches the files.
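Why the separator matters is easy to see with a tiny Python sketch. The helper split_line is a hypothetical name used only here.

```python
import csv
import io


def split_line(text, separator):
    """Split one delimited line into its fields using the given separator."""
    return next(csv.reader(io.StringIO(text), delimiter=separator))


# With the right separator the line splits into two fields.
print(split_line("Ada,Lovelace", ","))
# With the wrong separator the whole line stays in one field.
print(split_line("Ada,Lovelace", ";"))
```

With a wrong separator nothing fails loudly. The whole line simply lands in the first column, so the mistake is easy to miss until we look at the output.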
On tFileOutputDelimited, we want one destination file to store all the data from the 10 source files, so check “Append” to let the program add data at the end of the file. Check “Include Header” when we need headers.
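Here is a minimal Python sketch of what appending means for the destination file. The function append_rows is hypothetical, and writing the header only when the file is new or empty is my assumption for this sketch, not a statement about how Talend handles it.

```python
import csv
import os


def append_rows(target_path, rows, header=None, separator=","):
    """Add rows at the end of the target file, keeping what is already there."""
    # Assumption for this sketch: the header is written only once, on a new or empty file.
    is_new = not os.path.exists(target_path) or os.path.getsize(target_path) == 0
    with open(target_path, "a", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst, delimiter=separator)
        if header and is_new:
            writer.writerow(header)
        writer.writerows(rows)
```

Calling it once per source file builds up the merged file step by step. Note that in this sketch a second full run would append the same rows again, so the destination file should be removed before starting over.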
Set the destination filename as desired, and finally run the program as shown below.
And here is the result.
See, the program works properly. Next time we will look at how to schedule the program to run at a time we choose.
See you again