bods misc


TRANSCRIPT

Page 1: bods misc

Page 2: bods misc

Page 3: bods misc

Page 4: bods misc

Page 5: bods misc

When you hear the term “Big Data”, there are usually multiple problems associated with it. The sheer size of the source data is one of course, but why is the data so large? Because it is not ERP data but something else, such as weblogs, social media data, and so on. This data is either completely unstructured or semi-structured, and it needs to be related back to the ERP data.

Let us have a look at three examples of what one might want to call Big Data.

Page 6: bods misc

You might have an extremely large billing table in the database that you want to sum up and show in a report for further analysis. This is an example of what Hadoop can do very well, but so can databases, HANA in particular. So unless the data is already in Hadoop, you would probably not use Hadoop for that purpose.

Page 7: bods misc

A weblog is a nice example of what Hadoop can do very well. It is a large dataset, it is not very well structured, and you do not want to run simple aggregation functions like “How often was the URL xyz called?” but more complex algorithms like “How much time passed between a single user opening one page and then another, i.e. the potential reading time of the first page?”. Expressing these calculations in SQL is almost impossible, and even where it can be done, it executes for a long time.
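To make the reading-time idea concrete, here is a minimal standalone Java sketch, not taken from the slides: it groups weblog hits per user, sorts them by timestamp, and treats the gap between two consecutive hits as the reading time of the earlier page. The record layout and sample values are assumptions for illustration; on a real cluster this logic would run inside a map-reduce job.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.*;

    public class ReadingTime {
        // One weblog entry: which user opened which URL at what time (assumed layout).
        record Hit(String user, String url, Instant time) {}

        public static void main(String[] args) {
            // Hypothetical sample data standing in for a parsed weblog.
            List<Hit> hits = List.of(
                new Hit("u1", "/home",    Instant.parse("2017-05-17T10:00:00Z")),
                new Hit("u1", "/product", Instant.parse("2017-05-17T10:02:30Z")),
                new Hit("u2", "/home",    Instant.parse("2017-05-17T10:01:00Z")));

            // Group the hits per user, then sort each user's hits by time.
            Map<String, List<Hit>> byUser = new HashMap<>();
            for (Hit h : hits) {
                byUser.computeIfAbsent(h.user(), k -> new ArrayList<>()).add(h);
            }
            for (List<Hit> userHits : byUser.values()) {
                userHits.sort(Comparator.comparing(Hit::time));
                // The gap between two consecutive hits approximates the
                // reading time of the earlier page.
                for (int i = 1; i < userHits.size(); i++) {
                    Duration gap = Duration.between(
                        userHits.get(i - 1).time(), userHits.get(i).time());
                    System.out.printf("%s read %s for ~%ds%n",
                        userHits.get(i - 1).user(),
                        userHits.get(i - 1).url(), gap.toSeconds());
                }
            }
        }
    }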

Page 8: bods misc

For simple statistical measures you might want to count the number of times a word occurs. But how do you break a long text apart into its individual words in SQL? And even if you can, the intermediate data grows so much that it cannot be processed anymore.

Page 9: bods misc

Hadoop by itself is very simple: it is a file storage system that can execute programs which read the stored files and write new ones. Such a program needs to be written in Java and can consist of only two methods, a map method and a reduce (aggregation) method.

What is unique about Hadoop, however, is that once both requirements are met, files as the source and map-reduce based programs, the Hadoop environment can execute them in a massively parallel way. So the input of your map method might be the Amazon product reviews, as suggested in Example 3, and the map phase will output the list of words. The reduce method will then take the words and count them across all reviews, so that per product you can see how often the word “issue” was found. This is just a first simple approach; of course we would use the Text Data Processing transform for that.
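As a minimal sketch of such a two-method program, here is the classic Hadoop word-count job in Java. The Mapper/Reducer API shown is standard Hadoop; the class name and the HDFS input/output paths are illustrative assumptions.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map: split each line of review text into words, emit (word, 1).
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts emitted for each word across all reviews.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/data/reviews"));      // assumed path
            FileOutputFormat.setOutputPath(job, new Path("/data/wordcounts")); // assumed path
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }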

Hadoop has two extensions that we use as well: Hive, which gives you SQL-like options to join and summarize data, and Pig, which lets you write map-reduce logic not in low-level Java code but in a higher-level scripting language that is compiled into map-reduce jobs at runtime.
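To show what the SQL-like access through Hive can look like, here is a small hedged Java sketch using the HiveServer2 JDBC driver. The host, port, credentials, and the review_word_counts table are assumptions for illustration.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveWordCountQuery {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC URL; adjust host/port/database for your cluster.
            // Requires the Hive JDBC driver on the classpath.
            String url = "jdbc:hive2://hadoop-node:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "etl_user", "");
                 Statement stmt = conn.createStatement();
                 // Summarize how often the word "issue" appears per product,
                 // assuming the word-count job wrote into review_word_counts.
                 ResultSet rs = stmt.executeQuery(
                     "SELECT product_id, SUM(cnt) AS total " +
                     "FROM review_word_counts " +
                     "WHERE word = 'issue' " +
                     "GROUP BY product_id")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + ": " + rs.getLong(2));
                }
            }
        }
    }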

Page 10: bods misc

It does not make sense to fill up a database with terabytes of text data if this data cannot be analyzed there because the logic cannot be expressed in SQL or other database languages, or if the source data is just raw data that has to be processed first to derive a measure from it. In that case you use Hadoop for the cheap storage and processing of the data, and the results are then loaded into HANA or any other system to enrich the analysis possible there.

Page 11: bods misc

And why would we use weblogs, text information entered by millions of users? Because it can help us to find out the Why.

Page 12: bods misc

Combining the ERP numbers with the Hadoop-produced statistics for each product will allow you to relate the two, for example relating revenue numbers with the hit rates and the things said about the product, to get a big picture of the sales situation.

Page 13: bods misc

Page 14: bods misc

It is quite possible that you have many tools from different vendors supporting your EIM and/or data management initiatives for ETL, data quality, data profiling, metadata management, and text analytics.

What if you could simplify your IT environment with a single foundation that delivers data services across your entire enterprise, for current as well as new projects? Such a foundation needs to be open, to support all data sources and targets; scalable, to handle small and extreme data volumes; and reusable, so that you can build once and reuse for other projects.

Such a Data Services foundation can be your standard for delivering all of the critical information management capabilities across your enterprise. As a result, you gain significantly greater IT efficiency and deliver maximum business effectiveness.

Page 15: bods misc

For us as Data Services users, Hadoop is just yet another source and target. We can read and write files in Hadoop, and we can push logic down into Hadoop so we do not have to execute it in the engine.

Page 16: bods misc

The internal flow logic when Data Services uses Hadoop is quite simple, because Hadoop itself is quite simple: we prepare the file, we generate the processing logic, and we let the resulting script be executed in Hadoop.

One important limitation at the moment is that we cannot load a remote Hadoop system, so a job server needs to be installed inside the Hadoop system.

Page 17: bods misc

Page 18: bods misc

The “CDC” checkbox and the CDC type can only be set when creating the datastore; they cannot be modified later on.

Once a CDC datastore is created, it will show only those tables for which CDC or change tracking is activated.

Page 19: bods misc

Change Data Capture (CDC) and Change Tracking enable applications to determine the Insert, Update, and Delete operations that were made to user tables in a database.

Page 20: bods misc

After each successful execution (at the end of the job), the watermark is written to a status table; on the next execution, only changes after this watermark are extracted.

You can use the CDC subscription name (for change tracking only) when multiple “subscribers” need to access the delta independently.

For the type “CDC”, low and high watermarks are set to ensure data consistency. This applies to all dataflows in the job.

Additional CDC columns are available in the input schema.
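To illustrate the watermark mechanics, here is a hand-rolled Java/JDBC sketch, not what Data Services generates internally: it reads the last synced version from a status table, extracts the delta via SQL Server's change tracking functions (CHANGETABLE, CHANGE_TRACKING_CURRENT_VERSION), and persists the new watermark. The etl_status table, dbo.ORDERS, and the ORDER_ID key are assumptions for illustration.

    import java.sql.*;

    public class DeltaExtract {
        public static void main(String[] args) throws SQLException {
            String url = "jdbc:sqlserver://dbhost;databaseName=erp;encrypt=false";
            try (Connection conn = DriverManager.getConnection(url, "etl", "secret")) {
                // 1. Read the watermark left behind by the previous run.
                long lastVersion = 0;
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery(
                         "SELECT last_sync_version FROM etl_status WHERE tbl = 'dbo.ORDERS'")) {
                    if (rs.next()) lastVersion = rs.getLong(1);
                }

                // 2. Capture the current version as the high watermark before
                //    extracting, so rows changed during the run are picked up next time.
                long currentVersion;
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery(
                         "SELECT CHANGE_TRACKING_CURRENT_VERSION()")) {
                    rs.next();
                    currentVersion = rs.getLong(1);
                }

                // 3. Extract only the rows changed since the last watermark.
                //    CT.SYS_CHANGE_OPERATION is 'I', 'U' or 'D'; the version is a
                //    number under our control, so inlining it is safe here.
                String delta =
                    "SELECT CT.ORDER_ID, CT.SYS_CHANGE_OPERATION, O.* " +
                    "FROM CHANGETABLE(CHANGES dbo.ORDERS, " + lastVersion + ") AS CT " +
                    "LEFT JOIN dbo.ORDERS O ON O.ORDER_ID = CT.ORDER_ID";
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery(delta)) {
                    while (rs.next()) {
                        // ... load the changed row into the target here ...
                    }
                }

                // 4. Persist the new watermark only after a successful load.
                try (PreparedStatement ps = conn.prepareStatement(
                        "UPDATE etl_status SET last_sync_version = ? WHERE tbl = 'dbo.ORDERS'")) {
                    ps.setLong(1, currentVersion);
                    ps.executeUpdate();
                }
            }
        }
    }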

Page 21: bods misc

Page 22: bods misc
