Using Big Data for the analysis of historic context information (FIWARE)
TRANSCRIPT
Using Big Data for the analysis of
historic context information
Francisco Romero Bueno
Technological Specialist. FIWARE data engineer
Big Data:
What is it and how much data is there
What is big data?
> small data
What is big data?
> big data
http://commons.wikimedia.org/wiki/File:Interior_view_of_Stockholm_Public_Library.jpg
Not a matter of thresholds
If both the data used by your app and the processing capabilities your app logic needs fit the available infrastructure, then you are not dealing with a big data problem.

If either the data used by your app or the processing capabilities your app logic needs don't fit the available infrastructure, then you are facing a big data problem, and you need specialized services.
How much data is there?
Data growing forecast
http://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/vni-hyperconnectivity-wp.html
Two (three) approaches for dealing with Big Data:
Batch and stream processing (and Lambda architectures)
Batch processing
• It is about joining a lot of data (batching)
  – A lot may mean Terabytes or more…
  – Most probably, the data cannot be stored in a single server
• Once joined, it is analyzed
  – Most probably, the data cannot be analyzed using a single process
• Time is not a problem
  – Batching can last for days or even months
  – Processing can last for hours or even days
• Analysis can be reproduced
Stream processing
• It is about not storing the data and analyzing it on the fly
  – Most probably, the data cannot be analyzed by a single process
• Time is important
  – Since the data is not stored, it must be analyzed as it is received
  – The results are expected to be available in near real-time
• Analysis cannot be reproduced
Lambda architectures
• A Big Data architecture is Lambda compliant if it produces near-real-time data insights based on the latest data only, while large batches are accumulated and processed for robust insights
  – Data must feed both batch-based and stream-based sub-systems
  – Real-time insights are cached
  – Batch insights are cached
  – Queries to the whole system combine both kinds of insights
• http://lambda-architecture.net/
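The query-merging idea above can be sketched in a few lines of plain Java. This is a minimal illustration, not a real Lambda implementation: the batch view and the key names are made up for the example, and in practice each view would live in its own serving store.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a Lambda-style query layer: a batch view holds robust
// counts computed over the full history, a real-time view holds counts for
// data arrived since the last batch run, and every query merges both.
public class LambdaQuery {
    private final Map<String, Long> batchView = new HashMap<>();    // cached batch insights
    private final Map<String, Long> realtimeView = new HashMap<>(); // cached real-time insights

    // Called when a batch job finishes: overwrite the robust insight
    public void loadBatchInsight(String key, long count) {
        batchView.put(key, count);
    }

    // Called as each new tuple arrives on the stream
    public void streamIncrement(String key) {
        realtimeView.merge(key, 1L, Long::sum);
    }

    // A query combines both kinds of insights, as a Lambda-compliant system must
    public long query(String key) {
        return batchView.getOrDefault(key, 0L) + realtimeView.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        LambdaQuery q = new LambdaQuery();
        q.loadBatchInsight("temperature_readings", 1000000L); // from the batch layer
        q.streamIncrement("temperature_readings");            // from the stream layer
        q.streamIncrement("temperature_readings");
        System.out.println(q.query("temperature_readings"));  // prints 1000002
    }
}
```

When the next batch run completes, it would absorb the streamed data and the real-time view would be reset; that bookkeeping is omitted here.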
Distributed storage:
The Hadoop reference (HDFS)
What happens if one shelf is not enough?
You buy more shelves…
… then you create an index
“The Avengers”, 1-100, shelf 1
“The Avengers”, 101-125, shelf 2
“Superman”, 1-50, shelf 2
“X-Men”, 1-100, shelf 3
“X-Men”, 101-200, shelf 4
“X-Men”, 201-225, shelf 5
[Figure: the index above points at comic-book volumes of “The Avengers”, “Superman” and “X-Men” distributed across five shelves]
Hadoop Distributed File System (HDFS)
• Based on the Google File System
• Large files are stored across multiple machines (Datanodes) by splitting them into blocks that are distributed
• Metadata is managed by the Namenode
• Scalable by simply adding more Datanodes
• Fault-tolerant, since HDFS replicates each block (3 replicas by default)
• Security based on authentication (Kerberos) and authorization (permissions, ACLs)
• It is managed like a Unix-like file system
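The splitting-and-replication behaviour can be rehearsed with a toy placement function. This is only an illustration of the idea, not HDFS's real policy: actual HDFS placement is rack-aware, while this sketch simply assigns replicas round-robin over hypothetical datanode numbers.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of what HDFS does on write: split a file into fixed-size
// blocks and place each block's replicas on distinct datanodes. Round-robin
// placement stands in for HDFS's real rack-aware policy.
public class BlockPlacement {

    // Returns, per block, the list of datanode ids holding a replica
    public static List<List<Integer>> place(long fileSize, long blockSize,
                                            int replicas, int datanodes) {
        int blocks = (int) ((fileSize + blockSize - 1) / blockSize); // ceiling division
        List<List<Integer>> placement = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            List<Integer> nodes = new ArrayList<>();
            for (int r = 0; r < replicas; r++) {
                // Distinct nodes per block as long as replicas <= datanodes
                nodes.add((b * replicas + r) % datanodes);
            }
            placement.add(nodes);
        }
        return placement;
    }

    public static void main(String[] args) {
        // A 256 MB file with 64 MB blocks, 3 replicas, 8 datanodes -> 4 blocks
        List<List<Integer>> p = place(256L << 20, 64L << 20, 3, 8);
        System.out.println(p.size() + " blocks placed as " + p);
    }
}
```

Losing one datanode leaves at least two surviving replicas of every block, which is what makes the re-replication shown on the following slides possible.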
Splitting, replication and distribution
[Figure: large_file.txt (4 blocks) is split, replicated and distributed; each of blocks 1-4 is stored three times across rack 1 (datanodes 1 to 4) and rack 2 (datanodes 5 to 8)]
Namenode metadata
Path                             Replicas  Block IDs
/user/user1/data/large_file.txt  3         1 {dn1,dn2,dn5}, 2 {dn3,dn5,dn8}, 3 {dn3,dn6,dn8}, 4 {dn1,dn4,dn7}
/user/user1/data/other_file.txt  2         5 {…}, 6 {…}, 7 {…}
…                                …         …
Datanode failure recovery
[Figure: same deployment as before; when a datanode fails, the blocks it held (here blocks 2 and 3) are re-replicated to other datanodes from the surviving copies]
Namenode failure recovery
Path                             Replicas  Block IDs
/user/user1/data/large_file.txt  3         1 {dn1,dn2,dn5}, 2 {dn2,dn5,dn8}, 3 {dn4,dn6,dn8}, 4 {dn1,dn4,dn7}
/user/user1/data/other_file.txt  2         5 {…}, 6 {…}, 7 {…}
…                                …         …
Managing HDFS
[Figure: HDFS management paths. From a client machine, a browser reaches HUE over HTTP, custom apps call the WebHDFS/HttpFS REST APIs, and an ssh client connects to the ssh daemon on the services node to run Hadoop commands against HDFS]
Managing HDFS: HTTP REST API
• The HTTP REST API supports the complete File System interface for HDFS
  – Other Hadoop commands are not available through a REST API
• It relies on the webhdfs schema for URIs:
  webhdfs://<HOST>:<HTTP_PORT>/<PATH>
• HTTP URLs are built as:
  http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=…
• Full API specification: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
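The URL pattern above is simple enough to capture in a small helper. This is a sketch for building operation URLs only; the host, port and user below are placeholders, not a real endpoint, and real code would also URL-encode the path and parameters.

```java
// Small helper that builds WebHDFS operation URLs following the
// http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=... pattern.
public class WebHdfsUrl {

    // path is given without a leading slash; op is e.g. "open", "mkdirs",
    // "liststatus" or "delete"; user fills the user.name query parameter
    public static String build(String host, int port, String path,
                               String op, String user) {
        return "http://" + host + ":" + port + "/webhdfs/v1/" + path
                + "?op=" + op + "&user.name=" + user;
    }

    public static void main(String[] args) {
        // Placeholder host and user, mirroring the curl examples that follow
        System.out.println(build("example.org", 14000,
                "user/frb/webinar/afolder", "mkdirs", "frb"));
        // -> http://example.org:14000/webhdfs/v1/user/frb/webinar/afolder?op=mkdirs&user.name=frb
    }
}
```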
Managing HDFS: HTTP REST API
examples
$ curl -X GET "http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar/abriefhistoryoftime_page1?op=open&user.name=frb"
CHAPTER 1
OUR PICTURE OF THE UNIVERSE
A well-known scientist (some say it was Bertrand Russell) once gave a public lecture on
astronomy. He described how the earth orbits around the sun and how the sun, in turn,
orbits around the center of a vast
$ curl -X PUT "http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar/afolder?op=mkdirs&user.name=frb"
{"boolean":true}
$ curl -X GET "http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar?op=liststatus&user.name=frb"
{"FileStatuses":{"FileStatus":[{"pathSuffix":"abriefhistoryoftime_page1","type":"FILE",
"length":3431,"owner":"frb","group":"cosmos","permission":"644","accessTime":1425995831
489,"modificationTime":1418216412441,"blockSize":67108864,"replication":3},{"pathSuffix
":"abriefhistoryoftime_page2","type":"FILE","length":1604,"owner":"frb","group":"cosmos
","permission":"644","accessTime":1418216412460,"modificationTime":1418216412500,"block
Size":67108864,"replication":3},{"pathSuffix":"abriefhistoryoftime_page3","type":"FILE"
,"length":5257,"owner":"frb","group":"cosmos","permission":"644","accessTime":141821641
2515,"modificationTime":1418216412551,"blockSize":67108864,"replication":3},{"pathSuffi
x":"afolder","type":"DIRECTORY","length":0,"owner":"frb","group":"cosmos","permission":
"755","accessTime":0,"modificationTime":1425995941361,"blockSize":0,"replication":0}]}}
$ curl -X DELETE "http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar/afolder?op=delete&user.name=frb"
{"boolean":true}
Distributed batch computing:
The Hadoop reference
(MapReduce)
What happens if you cannot read all
your books?
Hadoop was created by Doug Cutting at
Yahoo!...
… based on the MapReduce patent by Google
Well, MapReduce was really invented
by Julius Caesar
Divide et impera*
* Divide and conquer
An example
How many pages are written in Latin among the books in the Ancient Library of Alexandria?
[Figure: eight books are queued (LATIN REF1 P45, GREEK REF2 P128, EGYPTIAN REF3 P12, LATIN REF4 P73, LATIN REF5 P34, EGYPTIAN REF6 P10, GREEK REF7 P20, GREEK REF8 P230); a first Mapper reports "LATIN, pages 45" while the others are still reading, and the Reducer notes down 45 (ref 1)]
An example
[Figure: the Mappers keep reading; GREEK and EGYPTIAN books are set aside for this query, and the Reducer still holds 45 (ref 1)]
How many pages are written in Latin among the books in the Ancient Library of Alexandria?
An example
[Figure: two Mappers report "LATIN, pages 73" and "LATIN, pages 34"; the Reducer accumulates 45 (ref 1) + 73 (ref 4) + 34 (ref 5)]
How many pages are written in Latin among the books in the Ancient Library of Alexandria?
An example
[Figure: only GREEK books remain, so they are set aside and the Mappers go idle; the Reducer still holds 45 (ref 1) + 73 (ref 4) + 34 (ref 5)]
How many pages are written in Latin among the books in the Ancient Library of Alexandria?
An example
[Figure: all Mappers idle; the Reducer outputs the result: 45 (ref 1) + 73 (ref 4) + 34 (ref 5) = 152 TOTAL]
How many pages are written in Latin among the books in the Ancient Library of Alexandria?
Another example
How many pages are written in all the languages among the books in the Ancient Library of Alexandria?
[Figure: the same eight books; the Mappers now emit (language, pages) pairs such as (lat,45) and (egy,12); one Reducer collects the "lat" pairs and another the "egy" and "gre" pairs]
Another example
[Figure: reading continues; (egy,10) and (gre,128) are emitted; the first Reducer holds lat,45 and the second holds egy,12 egy,10 gre,128]
How many pages are written in all the languages among the books in the Ancient Library of Alexandria?
Another example
[Figure: (lat,73), (lat,34) and (gre,230) are emitted; the first Reducer holds lat,45 lat,73 lat,34 and the second holds egy,12 egy,10 gre,128 gre,230]
How many pages are written in all the languages among the books in the Ancient Library of Alexandria?
Another example
[Figure: the last book emits (gre,20) and the Mappers go idle]
How many pages are written in all the languages among the books in the Ancient Library of Alexandria?
Another example
[Figure: all Mappers idle; the first Reducer outputs lat,152 (45+73+34) and the second outputs egy,22 (12+10) and gre,378 (128+230+20)]
How many pages are written in all the languages among the books in the Ancient Library of Alexandria?
Writing MapReduce applications
• MapReduce applications are commonly written in Java
  – They can be written in other languages through Hadoop Streaming
• A MapReduce job consists of:
  – A driver, a piece of software where inputs, outputs, formats, etc. are defined, and the entry point for launching the job
  – A set of Mappers, given by a piece of software defining their behaviour
  – A set of Reducers, given by a piece of software defining their behaviour
• https://hadoop.apache.org/docs/current/api/ (MapReduce section)
Implementing the example
• The input will be a single big file containing:
  symbolae botanicae,latin,230
  mathematica,greek,95
  physica,greek,109
  ptolomaics,egyptian,120
  terra,latin,541
  iustitia est vincit,latin,134
• The mappers will receive pieces of the above file, which will be read line by line
  – Each line is represented as a (key,value) pair: the offset within the file and the real data of the line, respectively
  – For each input pair, a (key,value) pair is output: a common "num_pages" key and the third field in the line
• The reducers will receive arrays of pairs produced by the mappers, all having the same key ("num_pages")
  – For each array of pairs, the sum of the values is output as a (key,value) pair, in this case a "total_pages" key and the sum as value
Implementing the example: JCMapper.class
public static class JCMapper extends
        Mapper<Object, Text, Text, IntWritable> {

    private final Text globalKey = new Text("num_pages");
    private final IntWritable bookPages = new IntWritable();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        System.out.println("Processing " + fields[0]);

        if (fields[1].equals("latin")) {
            bookPages.set(Integer.parseInt(fields[2]));
            context.write(globalKey, bookPages);
        } // if
    } // map
} // JCMapper
Implementing the example: JCReducer.class
public static class JCReducer extends
        Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable totalPages = new IntWritable();

    @Override
    public void reduce(Text globalKey, Iterable<IntWritable> bookPages,
            Context context) throws IOException, InterruptedException {
        int sum = 0;

        for (IntWritable val : bookPages) {
            sum += val.get();
        } // for

        totalPages.set(sum);
        context.write(globalKey, totalPages);
    } // reduce
} // JCReducer
Implementing the example: JC.class
public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(
        new Configuration(), new JC(), args);
    System.exit(res);
} // main

@Override
public int run(String[] args) throws Exception {
    Configuration conf = this.getConf();
    Job job = Job.getInstance(conf, "julius caesar");
    job.setJarByClass(JC.class);
    job.setMapperClass(JCMapper.class);
    job.setCombinerClass(JCReducer.class);
    job.setReducerClass(JCReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
} // run
Simplifying the batch analysis:
Querying tools
Querying tools
• The MapReduce paradigm may be hard to understand and, worse, to use
• Indeed, many data analysts just need to query the data
  – If possible, by using already well-known languages
• For that reason, some querying tools appeared in the Hadoop ecosystem
  – Hive and its HiveQL language, quite similar to SQL
  – Pig and its Pig Latin language, a new language
Hive and HiveQL
• HiveQL reference
  – https://cwiki.apache.org/confluence/display/Hive/LanguageManual
• All the data is loaded into Hive tables
  – Not real tables (they don't contain the real data) but metadata pointing to the real data at HDFS
• The best thing is that Hive runs pre-defined MapReduce jobs behind the scenes!
  – Column selection
  – Field grouping
  – Table joining
  – Value filtering
  – …
• Important remark: since Hive uses MapReduce, queries may take some time to produce a result
Hive CLI
$ hive
hive history
file=/tmp/myuser/hive_job_log_opendata_XXX_XXX.txt
hive>select column1,column2,otherColumns from mytable where
column1='whatever' and columns2 like '%whatever%';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Starting Job = job_201308280930_0953, Tracking URL =
http://cosmosmaster-
gi:50030/jobdetails.jsp?jobid=job_201308280930_0953
Kill Command = /usr/lib/hadoop/bin/hadoop job -
Dmapred.job.tracker=cosmosmaster-gi:8021 -kill
job_201308280930_0953
2013-10-03 09:15:34,519 Stage-1 map = 0%, reduce = 0%
2013-10-03 09:15:36,545 Stage-1 map = 67%, reduce = 0%
2013-10-03 09:15:37,554 Stage-1 map = 100%, reduce = 0%
2013-10-03 09:15:44,609 Stage-1 map = 100%, reduce = 33%
Hive Java API
• Hive CLI and Hue are OK for human-driven testing purposes
  – But they are not usable by remote applications
• Hive has no REST API
• Hive has several drivers and libraries
  – JDBC for Java
  – Python
  – PHP
  – ODBC for C/C++
  – Thrift for Java and C++
  – https://cwiki.apache.org/confluence/display/Hive/HiveClient
• A remote Hive client usually performs:
  – A connection to the Hive server (TCP/10000)
  – The query execution
Hive Java API: get a connection
private static Connection getConnection(String ip, String port,
        String user, String password) {
    try {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    } catch (ClassNotFoundException e) {
        System.out.println(e.getMessage());
        return null;
    } // try catch

    try {
        return DriverManager.getConnection("jdbc:hive://" + ip
            + ":" + port + "/default?user=" + user + "&password="
            + password);
    } catch (SQLException e) {
        System.out.println(e.getMessage());
        return null;
    } // try catch
} // getConnection
Hive Java API: do the query
private static void doQuery() {
    try {
        Statement stmt = con.createStatement();
        ResultSet res = stmt.executeQuery(
            "select column1,column2,"
            + "otherColumns from mytable where "
            + "column1='whatever' and "
            + "columns2 like '%whatever%'");

        while (res.next()) {
            String column1 = res.getString(1);
            int column2 = res.getInt(2);
        } // while

        res.close(); stmt.close(); con.close();
    } catch (SQLException e) {
        System.out.println(e.getMessage());
    } // try catch
} // doQuery
Hive tables creation
• Both locally using the CLI, or remotely using the Java API, use this command:
  create [external] table...
• CSV-like HDFS files:
  create external table <table_name> (<field1_name> <field1_type>, ..., <fieldN_name> <fieldN_type>) row format delimited fields terminated by '<separator>' location '/user/<username>/<path>/<to>/<the>/<data>';
• JSON-like HDFS files:
  create external table <table_name> (<field1_name> <field1_type>, ..., <fieldN_name> <fieldN_type>) row format serde 'org.openx.data.jsonserde.JsonSerDe' location '/user/<username>/<path>/<to>/<the>/<data>';
Distributed streaming computing:
The Storm reference
Storm project
• Created by Nathan Marz at BackType/Twitter
• Distributed realtime computation system
Storm basics
• Based on processing building blocks that can be composed in a topology
  – Spouts: blocks in charge of polling for data streams, producing data tuples
  – Bolts: blocks in charge of processing data tuples, performing basic operations
    • 1:1 operations: arithmetics, transformations…
    • N:1 operations: filtering, joining…
    • 1:N operations: splitting, replication…
• It is scalable and fault-tolerant
  – A basic operation can be replicated many times in a layer of bolts
  – If a bolt fails, there are several other bolts performing the same basic operation in the layer
• Guarantees the data will be processed
  – Storm performs an ACK mechanism for data tuples
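The spout/bolt composition can be mimicked in miniature with plain Java. This is a Storm-like sketch, NOT the actual Storm API: a spout produces tuples one at a time, a filtering bolt (an N:1 operation) drops some of them, and a transforming bolt (a 1:1 operation) maps each surviving tuple to a result. Real Storm would run the bolts in parallel across machines and ACK each tuple.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Storm-like topology in miniature: spout -> filter bolt -> transform bolt.
public class MiniTopology<T, R> {
    private final Iterator<T> spout;       // polls for data, producing tuples
    private final Predicate<T> filterBolt; // N:1 operation: filtering
    private final Function<T, R> mapBolt;  // 1:1 operation: transformation

    public MiniTopology(Iterator<T> spout, Predicate<T> filterBolt,
                        Function<T, R> mapBolt) {
        this.spout = spout;
        this.filterBolt = filterBolt;
        this.mapBolt = mapBolt;
    }

    public List<R> run() {
        List<R> out = new ArrayList<>();
        while (spout.hasNext()) {          // tuples are processed as they arrive
            T tuple = spout.next();
            if (filterBolt.test(tuple)) {
                out.add(mapBolt.apply(tuple));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sensor readings: keep values above 20, then double them
        List<Integer> readings = List.of(21, 45, 18, 50);
        List<Integer> doubledHighs = new MiniTopology<Integer, Integer>(
                readings.iterator(), r -> r > 20, r -> r * 2).run();
        System.out.println(doubledHighs); // [42, 90, 100]
    }
}
```

Scaling this out is a matter of running many copies of each bolt in a layer and routing tuples between them, which is exactly what the Storm runtime handles for you.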
Big Data in FIWARE Lab:
Cosmos and Sinfonier
Cosmos
• Cosmos is the name of the Hadoop-based global instance in FIWARE Lab
• Nothing has to be installed!
• There are two clusters exposing some services:
  – Storage (storage.cosmos.lab.fiware.org)
    • WebHDFS REST API (TCP/14000)
  – Computing (computing.cosmos.lab.fiware.org)
    • Tidoop REST API (TCP/12000)
    • Auth proxy (TCP/13000)
    • HiveServer2 (TCP/10000)
Feeding Cosmos with context data
• Cygnus tool – Apache Flume-based
• Standard NGSI connector for FIWARE
• Provides connectors for a wide variety of persistence backends– HDFS
– MySQL
– CKAN
– MongoDB
– STH Comet
– PostgreSQL
– Kafka
– DynamoDB
– Carto
Sinfonier
• Sinfonier will be the name of the Storm-based global instance in FIWARE Lab
• Nothing will have to be installed!
• There will be one cluster exposing streaming analysis services through an IDE
• Will be fed using Cygnus and Kafka queues
• Coming soon!
Thank you!
http://fiware.org
Follow @FIWARE on Twitter