How to get data from Amazon DynamoDB and how to aggregate them in Talend Open Studio Big Data

How to get data from Amazon DynamoDB and how to aggregate them in Talend Big Data

English: Talend logo Français : Talend logo
English: Talend logo Français : Talend logo (Photo credit: Wikipedia)

Why Talend Open Studio for Big Data is good for Amazon DynamoDB?

Since NoSQL is getting popular to handle huge amount of data these days. However, there is no easy way to manipulate or aggregate them. If we pay some money, yes, there are some software to help us, but I am going to introduce a easy way by utilizing Talend Big Data software.

Talend Open Studio for Big Data Download

Talend provides Talend Open Studio for Big Data free version and enterprise edition, but we are going to use free version in this article.

Talend Open Studio for Big Data retrieves data from Amazon DynamoDB (Of course, we can do the same thing for other NoSQL databases such as ElasticSearch or Cassandra, MongoDB, etc). Once we configured database connection setting, it automatically retrieve data and give us an ability to manipulate or aggregate them.



Overview of a Talend Job we are going to create

By utilizing Talend Open Studio for Big Data, we retrieve data from Amazon DynamoDB. Then, we aggregate them. Talend job looks like as follows.

Most NoSQL databases don't support aggregate function, so we cannot use like "GROUP BY". However, we can do that in Talend Open Studio for Big Data easily.

Create a table and sample data in Amazon DynamoDB

Let's create the following test table in Amazon DynamoDB and insert the following data.

$ dql -r ap-northeast-1

ap-northeast-1> scan * from test ;
--------------------------
| order_id |  uuid | qty |
--------------------------
| 100      | '2'   | 8   |
| 200      | '1'   | 21  |
| 300      | '5'   | 3   |
| 200      | '4'   | 7   |
| 100      | '200' | 3   |
| 100      | '100' | 24  |
| 100      | '3'   | 2   |
--------------------------

Talend job will retrieve data from Amazon DynamoDB, then aggregate them by order_id and get sum of "qty".

Place Talend components

Talend Open Studio for Big Data has a component to retrieve data from Amazon DynamoDB. We need to place tDynamoDBInput component (tDynamoDBInput_1).

Also, in order to aggregate Amazon DynamoDB data, we need to place tAggregateRow component (tAggregateRow_1).

Lastly, in order to display aggregated data, we need to place tLogRow component.

Once we place the 3 components, each component needs to be connected as the above figure.

Setting up Amazon DynamoDB component

In order to get data from DynamoDB, let's configure tDynamoDBInput_1 component. tDynamoDBInput_1 needs to be configured as the following figure.

Once we typed all the fields as the above figure, we need to click "Edit schema" button and add 3 fields as follows. Also, since we are going to aggregate "qty" value, "Db Column" for "qty" must be "Integer"as the following figure.


Aggregate Amazon DynamoDB data

In order to aggregate Amazon DynamoDB data, let's configure tAggregateRow component. Since we are going to aggregate them by order_id, let's add order_id into "Group by" section. This is only for grouping information.

Also, we need to sum of "qty" value, we need to add qty into "Operations" section and "Function" has to be "sum" as the following figure.
Based on the configuration, the component aggregates data and calculate sum of qty. This is very easy!

Also, we need to click "Edit schema" button to display aggregated result. Once we aggregated data, output data is different from what we received from Amazon DynamoDB, therefore we only need "order_id" and "qty" in the "Output" section as the following figure.

Display aggregated data

In order to display aggregated data, we also need to configure tLogRow component to get aggregated data from tAggregateRow. Please make sure tLogRow's schema looks like as follows (Can see the schema definition by clicking "Edit schema").

Run Talend job

Since the job has been created, let's run the job.

As the following figure, data is retrieved from Amazon DynamoDB and are aggregated by order_id. Also, qty is summed per order_id.

Conclusion

As we saw, we could retrieve data from Amazon DynamoDB even though it's NoSQL database and could aggregate them in Talend Open Studio for Big Data. In addition, programming skill is not required, so it's very easy to manipulate NoSQL data.