Creating your Talend jobs
It would take me too much time to write a step-by-step guide for this part, so I’ll share a few resources to get you started with Talend, along with a zip file containing a copy of the jobs I used for my demo. You just need to import them into your Talend Open Studio and adjust the context or the components.
Resources:
- If you’re totally new to Talend: Getting Started with Talend Open Studio for Data Integration
- If you know how to use Talend: How to go serverless with Talend & AWS Lambda
Download and Import the project files
Download the files from here, then right-click on “Job Designs” and use the Import Items functionality:
Adjust the DF19 Context variables with your own database details and your Case Safe Org ID (18 characters).
Talend job description
The two demo jobs are built from identical components; they differ only in the fields that are parsed and mapped to the database schema.
At the beginning (top-left), we open a shared connection to the database.
We then obtain the raw message from a tFixedFlowInput component using a single input field containing the context variable AWS_SQSmessage (which will be passed as an argument to the executable JAR).
The “Parse Accounts” component is a tExtractXMLField that reads the XML contained in the input field and extracts one row for each loop element (in our case, each Notification tag):
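To make the extraction concrete, here is a plain-Java sketch of the same idea, emitting one row per Notification element. The sample payload and field names are invented for illustration and are not the demo’s exact message format:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ParseAccountsSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for context.AWS_SQSmessage; the real payload carries more fields.
        String awsSqsMessage =
            "<notifications><OrgId>00D000000000001AAA</OrgId>"
          + "<Notification><Id>001</Id><Name>Acme</Name></Notification>"
          + "<Notification><Id>002</Id><Name>Globex</Name></Notification>"
          + "</notifications>";
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(
                awsSqsMessage.getBytes(StandardCharsets.UTF_8)));
        // "Notification" is the loop element: one output row per occurrence.
        NodeList loops = doc.getElementsByTagName("Notification");
        for (int i = 0; i < loops.getLength(); i++) {
            System.out.println("row " + i + ": " + loops.item(i).getTextContent());
        }
    }
}
```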
All successfully parsed rows are sent forward to the “Filter OrgId” component, where we check whether the orgId contained in the messages matches the one we expect for this Context:
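The condition itself boils down to a simple equality check; expressed as plain Java (the names are illustrative, not taken from the demo project), it is something like:

```java
// Java equivalent of the "Filter OrgId" check: keep a row only when
// its orgId matches the Org ID configured in the current Context.
static boolean matchesExpectedOrg(String rowOrgId, String expectedOrgId) {
    return rowOrgId != null && rowOrgId.equals(expectedOrgId);
}
```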
We can then map each field to the relevant column in the database table; for simplicity I kept identical names (warning: PostgreSQL is case-sensitive with quoted identifiers, so keep the casing of your column names consistent!).
Finally, we store the changed data in the database using the tDBOutput or tDBOutputBulkExec components.
NOTE:
Upsert operations are slower than plain “Insert” or “Update” operations, so I generally split them.
When you use tDBOutputBulkExec you can get up to 10x faster loads, because it uses PostgreSQL’s COPY command (see the sketch below).
tDBOutputBulkExec needs to create temporary CSV files; Lambda functions allow you to write up to 512 MB to the /tmp/ folder.
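For context, here is a minimal JDBC sketch of this kind of COPY-based load, using the PostgreSQL driver’s CopyManager. The connection string, table, and file names are placeholders, and I’m not claiming this is the component’s exact internal code:

```java
import java.io.FileReader;
import java.io.Reader;
import java.sql.Connection;
import java.sql.DriverManager;
import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

public class CopyLoadSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; the Talend job reads these
        // values from its Context variables instead.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/df19", "user", "password");
             // On Lambda the temporary CSV must live under /tmp,
             // where you can write up to 512 MB.
             Reader csv = new FileReader("/tmp/accounts.csv")) {
            CopyManager copy = new CopyManager((BaseConnection) conn);
            long rows = copy.copyIn(
                "COPY accounts FROM STDIN WITH (FORMAT csv)", csv);
            System.out.println("Loaded " + rows + " rows");
        }
    }
}
```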
How to test your jobs
You can test your Talend job by copying a message from the SQS queue and pasting it into the Context variable AWS_SQSmessage. I also use a lot of tLogRow components to see the data flow at every stage of the ETL process (you can deactivate the ones you don’t need, so their tables don’t clutter your CloudWatch logs).
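If you prefer testing outside the Studio, you can also invoke the built job from a scratch Java class, pasting a sample message in. Here demo_job.DemoJob stands in for whatever class Talend generates for your job, and the snippet assumes the runJob(String[]) entry point that exported Talend jobs expose:

```java
public class LocalJobTest {
    public static void main(String[] args) {
        // Paste a message copied from the SQS queue here.
        String sample = "<notifications>...</notifications>";
        // Override the AWS_SQSmessage context variable at run time.
        new demo_job.DemoJob().runJob(new String[] {
            "--context_param", "AWS_SQSmessage=" + sample
        });
    }
}
```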
Build your Talend jobs
Talend allows you to export your jobs as executable JARs.
Save the file to your preferred location:
Now we need to create a wrapper Java application that imports the Amazon SDK classes and runs the main method from our JAR.
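To give you an idea of its shape, here is a minimal sketch, assuming an SQS-triggered Lambda and the placeholder job class demo_job.DemoJob from above; it calls the job’s runJob(String[]) entry point rather than main:

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.SQSEvent;

public class TalendJobHandler implements RequestHandler<SQSEvent, Void> {
    @Override
    public Void handleRequest(SQSEvent event, Context context) {
        for (SQSEvent.SQSMessage msg : event.getRecords()) {
            // Placeholder for the class Talend generates for your job.
            demo_job.DemoJob job = new demo_job.DemoJob();
            // Hand the raw SQS body to the job through the
            // AWS_SQSmessage context variable.
            job.runJob(new String[] {
                "--context_param", "AWS_SQSmessage=" + msg.getBody()
            });
        }
        return null;
    }
}
```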
Go to the next page to see how to do that.