FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO
Implementing Hadoop Distributed File System (HDFS) Cluster for BI Solution
Jorge Afonso Barandas Queirós
Mestrado Integrado em Engenharia Eletrotécnica e de Computadores
Company Supervisor: Engenheiro Francisco Capa
Supervisor: Professor João Moreira
February 24, 2021
Abstract
Currently, there is a large influx of information services online, where the amount of information that goes through each user of the system is gigantic. This data, like any other information, obeys a certain general behavior: storage, processing, and loading (the ETL concept). To this end, information storage systems have been created and have evolved technologically since then, with several implementation options and for different purposes. World-renowned services such as Facebook or Instagram rely on this type of system as a basis for storing information. However, each system has its advantages and disadvantages. The most important indicators for evaluating a storage system are cost-benefit and performance (speed of analysis and storage capacity) given the amount and flow of data presented. This work aims to implement a possible low-cost solution to safely store a great amount of data, based on a Hadoop Cluster, together with other frameworks that can create an efficient and viable Big Data solution.
Besides that, this work presents a study of other possible distributed solutions, where the comparison between different frameworks will be evaluated, as well as the distinction between solutions based on local versus cloud-based environments.
The responsible company in question is known for creating Business Intelligence solutions, that is, creating solutions and indicators that derive from conventional information analysis to present important results on a specific case study. The analysis, formatting, and simplicity of the data is a factor present in this concept; therefore its development refers to the request for a large-scale storage system, hence the great need to carry out this study in a business environment. Besides, to test the viability of the implemented solution, a Web Page extraction mechanism was created, more specifically related to the stock market, storing these values in tables according to the universal row-column format, to later analyze the stored data and present it on a data visualization tool. The reason why the stock market analysis was carried out is the high importance of using a large amount of data if the ultimate goal of the user is to study deeply some type of behavior related to this area. Another factor is the mutual interest on the side of the company in creating a BI solution based on stock market values, for future implementations and studies. If possible, the aim is also to create some predictive models, or models that give some future forecast of the behavior of the extracted values, to improve the quality of the final decision.
Resumo
Atualmente, há um grande fluxo de serviços de informação online, onde a quantidade de informação que passa por cada utilizador do sistema é gigantesca. Esses dados, como qualquer outra informação, obedecem a um determinado comportamento geral: armazenamento, processamento e carregamento (conceito ETL). Para tal, foram criados sistemas de armazenamento de informação, que evoluíram tecnologicamente desde então, com várias opções de implementação e para diferentes finalidades. Serviços de renome mundial, como Facebook ou Instagram, são baseados neste tipo de informação como base para o armazenamento de informação. No entanto, cada sistema tem suas vantagens e desvantagens. Os indicadores mais importantes para avaliar um sistema de armazenamento são: custo-benefício e desempenho (velocidade de análise e capacidade de armazenamento), considerando a quantidade e o fluxo de dados apresentados. Este trabalho visa implementar uma possível solução de baixo custo para armazenar com segurança uma grande quantidade de dados, baseada num Cluster Hadoop, com outras frameworks que juntas possam criar uma solução de Big Data eficiente e viável.
Além disso, este trabalho apresenta um estudo para outras possíveis soluções distribuídas, onde será avaliada a comparação entre diferentes frameworks, bem como a distinção entre soluções baseadas em ambientes locais e em nuvem.
A empresa responsável em questão é conhecida por criar soluções de Business Intelligence, ou seja, criar soluções e indicadores derivados da análise de informação convencional, para apresentar resultados importantes num estudo de caso específico. A análise, formatação e simplicidade dos dados é um fator presente neste conceito, pois o seu desenvolvimento refere-se à solicitação de um sistema de armazenamento em larga escala, daí a grande necessidade de realização deste estudo em ambiente empresarial. Além disso, para testar a viabilidade da solução implementada, foi criado um mecanismo de extração de páginas web, mais especificamente relacionadas ao mercado de ações, armazenando esses valores em tabelas, de acordo com o formato universal de linhas-colunas, para posteriormente analisar os dados armazenados e apresentá-los numa ferramenta de visualização de dados. O motivo pelo qual a análise do mercado de ações foi realizada deve-se à grande importância do uso de uma grande quantidade de dados se o objetivo final do utilizador for estudar profundamente algum tipo de comportamento relacionado com esta área. Outro fator é o interesse mútuo por parte da empresa em criar uma solução de Business Intelligence baseada em valores de bolsa, para futuras implementações e estudos. Se possível, criar também alguns modelos preditivos, ou que dêem alguma previsão futura do comportamento dos valores extraídos, para melhorar a qualidade da decisão final.
Acknowledgements
Firstly, I would like to thank B2F, the enterprise responsible for the project. They welcomed me and gave me all the material needed to perform this project. I would like to thank my supervisors from B2F and from the faculty, namely Engenheiro Francisco Capa, Engenheiro Jorge Amaral, and Professor João Moreira, for helping me through this period with the best patience and knowledge they could give. No less important, I also give many thanks to B2F's Pedro Roseira for helping me too, whether after hours or not, even though he was not directly involved in the dissertation project. His knowledge about the Stock Market and HDFS technologies improved the level of my work. I would like to express my gratitude to all the professors I had throughout this course, for their availability and patience in making me a better person, for sure.
Finally and most importantly, I need to emphasize my gratitude to my parents. Without you two, I would never have been able to fulfill all my childhood dreams, and one of them is to complete this work successfully.
Jorge Afonso Barandas Queirós
Contents
1 Introduction
 1.1 Business to Future (B2F) Presentation
 1.2 Context
 1.3 Motivation
 1.4 Objectives
 1.5 Dissertation's Document Structure
2 Hadoop Distributed File System (HDFS) Cluster for BI Solution: State of the Art
 2.1 Introduction
  2.1.1 Hadoop Cluster Base Architecture
  2.1.2 BI Tools
  2.1.3 Data format
  2.1.4 Stock Market
 2.2 Stock Market Behavior's Prediction Architectures: State of the Art
  2.2.1 Traditional time series prediction
  2.2.2 Artificial Intelligence: Neural Networks
  2.2.3 High-speed Learning Algorithm - Supplementary Learning
  2.2.4 Traditional time series prediction vs Neural Networks
 2.3 Related works
  2.3.1 Hadoop MapReduce vs Apache Spark
3 Implemented Solution
 3.1 Comparison between Hadoop and other Frameworks
  3.1.1 Hadoop: Modern Data Warehouse versus Traditional Data Warehouses
  3.1.2 Hadoop versus Azure Databricks
 3.2 Hadoop Cluster Solution: Presentation of all used frameworks
  3.2.1 Name Node
  3.2.2 Data Nodes
  3.2.3 YARN: a task/job scheduler and manager
  3.2.4 Spark Framework: Data loading framework
  3.2.5 Apache Hive: MySQL database and JDBC Driver
  3.2.6 Power BI
 3.3 Setup of Cluster: Installation of Hadoop / YARN on three machines
 3.4 Stock Market's Web Scraping: Extracting stock indicators
  3.4.1 Important stock market indicators to extract
  3.4.2 Python Script
 3.5 Spark Framework: Advantages over other solutions and configuration
  3.5.1 Apache Spark vs Hadoop MapReduce for running applications
  3.5.2 Apache Spark configuration over HDFS
 3.6 Apache Spark Script to store extracted data in HDFS
 3.7 Connection to Power BI with Apache Spark framework: Apache Hive and Spark Thrift Server Configuration
  3.7.1 Power BI: Data Load and Processing in real time
4 Result of Implementation and Tests
 4.1 HDFS architecture availability
 4.2 HDFS extraction mechanism
 4.3 HDFS Performance Results - Data Extracting and Load: Spark Jobs with CSV files vs Parquet
 4.4 Power BI Results
5 Conclusions and Future Work
 5.1 Conclusion
 5.2 Future Work
References
List of Figures
2.1 Hadoop Architecture. [1]
2.2 MapReduce Architecture. [2]
2.3 Spark Architecture. [3]
2.4 Power BI Desktop Report.
2.5 Pfizer maximum stock market value after first dose injected in a person (Google Finance).
2.6 Neural Network Architecture for learning the behaviour of the Stock Market based on key indicators. [4]
2.7 Prediction Simulation. [4]
2.8 Prediction results with collected data. [5]
2.9 Prediction Model Diagram. [6]
2.10 Time series Algorithm. [6]
2.11 Predictive Results. [6]
2.12 ARIMA model. [7]
2.13 ARIMA prediction results using data mining. [7]
2.14 ARIMA prediction results using data mining. [7]
2.15 MapReduce versus Apache Spark tests. [8]
3.1 General Architecture of a Data Warehouse. [9]
3.2 Internet usage between 1990 and 2016. [10]
3.3 Azure Databricks and Oracle RAC pricing. [11] [12]
3.4 Implemented HDFS Data Extracting Architecture.
3.5 Hive Architecture. [13]
3.6 Hosts file located at /etc folder.
3.7 Configuration of core-site file.
3.8 Configuration of hdfs-site file.
3.9 Memory allocation configurations on yarn-site xml file.
3.10 yarn-site xml file.
3.11 Daemons in master machine.
3.12 Daemons in slave machines.
3.13 HDFS Local Web Site.
3.14 HDFS Local Web Site.
3.15 YARN Job Manager Local Web Site.
3.16 Anaconda framework's interface.
3.17 GoogleFinance's code for extraction.
3.18 Inspector tool to find id of data tags.
3.19 MarketWatch's extraction algorithm.
3.20 Output list of stock data.
3.21 Apache Spark versus Hadoop MapReduce.
3.22 Apache Spark download page.
3.23 spark-default.conf file.
3.24 History Server Web Interface.
3.25 Bash Script to Extract and Load Data using Crontab.
3.26 Python Imports.
3.27 First part: Extraction of data.
3.28 HDFS folder with extracted files.
3.29 Final stage of loading data to HDFS, part 1.
3.30 Final stage of loading data to HDFS, part 2.
3.31 Apple January extracted stock market values' HDFS file (portion of the file).
3.32 Hive downloaded compressed file (version 2.3.7).
3.33 Hive's hive-conf.sh file.
3.34 MySQL schema metastore creation.
3.35 Permissions to new Hive and MySQL user.
3.36 Metastore server username.
3.37 Metastore server password.
3.38 Connection URL.
3.39 Driver Name.
3.40 Created Hive table for Apple's stock values in December 2020.
3.41 Created Hive table for Apple's stock values in December 2020.
3.42 Hive's table in MySQL domain.
3.43 ThriftServer configuration.
3.44 Connectors available at Power BI.
3.45 ThriftServer Connection.
3.46 Credentials to connect to HDFS.
3.47 Hive's table on Power BI: preview.
3.48 Power BI Fields toolbar.
3.49 Moving average of Google Finance's Close Price indicator, on Apple, in December.
3.50 Hive's table, with source font's column.
3.51 Power BI Implemented Line Charts and Tables.
4.1 HDFS availability check.
4.2 YARN scheduling and monitoring test.
4.3 Spark History Server test.
4.4 Hive server test with beeline.
4.5 Hadoop Folder of Extracted data.
4.6 Output of data in HDFS files.
4.7 CSV File sizes: 1, 10, 100 million rows.
4.8 1 Million row Parquet file.
4.9 10 Million row Parquet file.
4.10 100 Million row Parquet file.
4.11 Time performance test in Spark: CSV files.
4.12 Time performance test in Spark: Parquet files.
4.13 Power BI Final Dashboard.
5.1 MLlib library for data training. [14]
List of Tables
2.1 Correlation Function
Abbreviations and Symbols
AI Artificial Intelligence
BI Business Intelligence
CSV Comma-separated Values
DAX Data Analysis Expressions
ELT Extract, Load, Transform
ETL Extract, Transform, Load
HDFS Hadoop Distributed File System
JDBC Java Database Connectivity
MDA Multiple Discriminant Analysis
ML Machine Learning
NN Neural Network
SSH Secure Shell
URL Uniform Resource Locator
Chapter 1
Introduction
This document reflects the work performed in the final curricular unit of the Integrated Master in Electrical and Computer Engineering, Telecommunications major, in the current year of 2020/2021. This dissertation was done in a professional environment at B2F - Business to Future. My work was supervised by Professor João Moreira of the Faculty of Engineering of the University of Porto and co-supervised by Engineers Francisco Capa and Jorge Amaral of B2F.
1.1 Business to Future (B2F) Presentation
Business To Future (B2F) is the organization responsible for carrying out this project, providing all the material and help needed for its development. Based in Porto, it focuses on Business Intelligence (BI) solutions, with experience in large-scale projects for well-known organizations such as Amorim, HBFuller, Sonae, and STCP, among others.
1.2 Context
Currently, all companies need to have an information storage system with capacity for a high volume of data, where it is possible to guarantee integrity, ease, and speed of access to the stored information. In the current market there are several solutions, but in general they present factors that compromise their use. These reasons may be the high cost of the equipment, the difficulty in adapting the system to a previously implemented architecture (lack of compatibility between frameworks previously used in a business environment), security and data integrity, among others.
To create a customized solution that meets the requirements demanded by the entity responsible for the project, the implementation of a cluster based on the Hadoop Distributed File System (HDFS) was put to the test. Its implementation, together with tests and studies carried out on other identical solutions based on the same concept (Big Data), will be the main point explored in this project.
This concept is increasingly present in current times.
The main focus of the responsible organization is to build Business Intelligence solutions for its customers, where the entire storage and analysis process is done. So, there is curiosity to find out if a solution with these specifications would be useful for future implementations, or even as a possible internal product, if it proves to be an asset for other projects/organizations.
1.3 Motivation
In order to store a large volume of data in a secure, affordable, scalable, and customized way, a Hadoop Distributed File System (HDFS) cluster was created. In parallel with other widely used frameworks, such as Power BI, Apache Spark, and Apache Hive, it is possible to generate a visual report with the acquired data; in this case, indicators related to the stock exchange that are calculated from this information.
To test Big Data performance with a large amount of data, stock market quotes are extracted and stored in the HDFS system so that they can afterwards be analyzed and presented on a user-friendly interface.
The main reason for doing this work is to find out if a solution based on Hadoop, together with some tools already used in a business environment, is viable to implement as a Big Data and BI solution, replicating this idea in future works. Therefore, if the results are satisfactory, there would be the possibility of implementing this technology, replacing the traditional data storage technologies still in use, which present several limitations in terms of performance.
Another reason to implement this work is, after building a Big Data infrastructure, to try to implement some statistical models on the suggested extracted data (stock market quotes) in a data analysis framework, if possible applying linear regression models or machine learning algorithms, in order to create a predictive analysis of stock market quotes. This requirement is not as important as the first one presented, but if implemented, the value of the solution would increase significantly.
If this step is not carried out successfully, it is suggested that a small analysis could be made in a final reporting tool, based on some values and subsequent basic analytical calculations, followed by its presentation, for example, in the Power BI framework, widely used in the enterprise domain.
In order to carry out the implementation, the main advice is to use the tools best rated by the majority of the existing community around this concept, and to use, as much as possible, tools already present in the business domain, in order to reduce additional costs and to benefit from compatibility and prior knowledge of their use.
Another important factor required by the company would be to test different types of formats used to store data, more specifically a theoretical and practical analysis of two main formats for storing information: CSV, recognized worldwide, and Parquet (specially designed for solutions that use the Hadoop distribution). In this way, new solutions could be adopted, bypassing general problems that these files bring, such as a high amount of unnecessary information inside the files for the solution in question, which may be compromising in some aspects.
1.4 Objectives
The main objective of the project is to create an architecture for the extraction, storage, analysis, and presentation of a large amount of data, being able to load and query millions of records in a short period, using the least expensive software and hardware possible. There are several
paths to be taken, where each one can present different methods for the required processes. Therefore, with the progression of the work, it will also be useful to make a comparative analysis between the various tools and frameworks available, to understand what the advantages and disadvantages are for the work in question. Finally, the main objective will be to turn this project into a
product, where it will compete with other solutions on the market, standing out from the rest for
its specifications and performance.
1.5 Dissertation’s Document Structure
Firstly, the document presents a brief introduction to explain the purpose of the project, and why the enterprise is betting on it.
Next, the document has a state-of-the-art section, where the implemented architecture will be compared to other solutions in the market, with different variations in the components used. Also, a brief analysis is made of some models for stock market prediction, where methods like Machine Learning and Artificial Intelligence are essential to perform such calculations.
After that, it will be explained how this solution was actually created, making it clear why some technologies were used instead of other tools available.
To prove the idealized project, practical results are presented, in parallel with explanations for each procedure. Finally, conclusions are made, pointing to some aspects that can be further explored in the future, with more time and experience with all the technologies used.
Chapter 2
Hadoop Distributed File System (HDFS) Cluster for BI Solution: State of the Art
2.1 Introduction
Nowadays, some architectures can store and process a lot of information, but few bring to the table the best for the user. Aspects like price, efficiency, scalability, and robustness are essential to make a product reliable in the market. This chapter will introduce some technologies, comparing different works made around the world according to this concept.
The main idea of this work is to implement a Big Data platform to support a BI solution.
A Big Data solution consists of a collection of a large amount of data, which can keep growing, stored across multiple machines that work together to analyze it and prevent loss of information.
So, this technology demands a large allocation of space and fast, responsive behavior if the purpose is to create real-time results.
Business Intelligence is another concept directly related to Big Data, where data is processed to create better decision options for a certain implementation. Putting these two concepts together in an innovative architecture allows creating a strong decision-making tool.
2.1.1 Hadoop Cluster Base Architecture
Nowadays, there are several architectures for implementing a Big Data solution, but all of them present four main steps:
• Program to extract data and store it in HDFS;
• Framework to run Hadoop jobs in order to process data;
• Connector from stored data to presentation framework;
• Presentation of data in BI tool.
Hadoop (Hadoop, 2020) [15] is the open-source implementation of Google's MapReduce model. Hadoop is based on HDFS (Hadoop Distributed File System) (HDFS, 2020) [16], a system that tolerates faults from a given node and allows high-throughput data streaming and robustness. Hadoop provides storage across many nodes and parallel processing across the cluster using the MapReduce paradigm: a programming paradigm that enables scalability across the entire cluster.
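As a purely illustrative sketch of the MapReduce paradigm (plain Python, not the Hadoop API), a word count can be expressed as independent map, shuffle, and reduce phases; in a real cluster, the map and reduce calls run in parallel on different nodes:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in an input split."""
    return [(word, 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values per key."""
    return {key: sum(values) for key, values in groups.items()}

# Two "input splits", as HDFS would hand to two mapper tasks.
splits = ["hadoop stores data", "spark processes data"]
pairs = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle_phase(pairs))
print(counts)  # {'hadoop': 1, 'stores': 1, 'data': 2, 'spark': 1, 'processes': 1}
```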
Figure 2.1: Hadoop Architecture. [1]
In this work, MapReduce (Map Reduce Tutorial, 2020) [17] jobs are replaced by Spark jobs (Apache Spark™ is a unified analytics engine for large-scale data processing) [18], because the latter works in a different way, offering the possibility of running up to about 100 times faster, due to in-memory processing instead of the read-write disk cycle that characterizes MapReduce. So, Spark is a framework that runs over Hadoop and can be faster and more efficient. This comparison is detailed in the section Spark Framework: Advantages to other solutions and configuration.
Figure 2.2: MapReduce Architecture. [2]
Figure 2.3: Spark Architecture. [3]
The brain of Hadoop resides in the Resource Manager layer, provided by YARN (Apache Hadoop YARN, 2020). YARN is a manager that splits the functionalities of resource management and job scheduling/monitoring into separate processes. The main idea is to have a ResourceManager process to manage all the jobs and an ApplicationMaster to specify resources for each job, working along with each slave node's NodeManager to execute the different processes in a Hadoop cluster [19].
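These roles are wired together in the yarn-site.xml configuration file (shown for this cluster in Chapter 3). A minimal sketch of that file is given below; the property names are standard YARN properties, but the hostname and memory values are only illustrative placeholders, not this cluster's actual configuration:

```xml
<configuration>
  <!-- Where NodeManagers find the ResourceManager ("node-master" is a placeholder). -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>node-master</value>
  </property>
  <!-- Memory each NodeManager may hand out to containers (illustrative value). -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>1536</value>
  </property>
  <!-- Bounds on a single container's memory request (illustrative values). -->
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>1536</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
  </property>
  <!-- Auxiliary shuffle service required by MapReduce jobs. -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```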
If the system were only designed to store and process data, these frameworks would be enough. But to present the data on a Business Intelligence platform, Hadoop must have a universal connector driver. In this case, Apache Hive [20] is used, in parallel with a MySQL connector, to create a metastore database with the generated and processed data and to expose it through queries to the BI tool.
"The MetaStore serves as Hive's table catalog. In conjunction with SerDes, the MetaStore allows users to cleanly decouple their logical queries from the underlying physical details of the data they are processing. The MetaStore accomplishes this by maintaining lookup tables for several different types of catalog metadata." [21]
Some of the existing architectures have the same implemented frameworks, but they differ in the way they are linked, either in the language or in the task execution engine. The Hadoop architecture is the successor of the traditional data warehouse, used by several projects, which is outdated. In fact, Hadoop is widely used because of its commodity hardware and low-cost processing, and companies such as Facebook or Yahoo use this architecture to store and process their large amounts of data [22].
2.1.2 BI Tools
After processing data into the required format, there is the need to present it, in graphs or tables, to finish the entire Big Data process. There are some frameworks in the market to execute this final step. The most used applications are Power BI, by Microsoft, Tableau, Qlik, and Grafana, among others. Power BI is the solution implemented in this project because there is already a paid license for this application running on other of the enterprise's implementations. Other tools are
widely used in this type of project, but imply different costs, connection configurations and numbers of connectors, and different overall performance. Power BI is a framework that was created to perform Business Intelligence analytics.
This application allows connecting to multiple data sources and creating visual reports of the data. Power BI reports can be consumed on many platforms, like Windows, Linux, macOS, and even Android and iOS: the reports can be accessed from every platform where the user is logged in. Besides that, it is possible to implement calculations over the data using DAX (Data Analysis Expressions). The figure below shows an example of a report generated in this project, where it is possible to see data in the form of charts, according to the timestamp of each record.
Figure 2.4: Power BI Desktop Report.
2.1.3 Data format
The data to be analyzed has to be stored in the Hadoop Cluster, even if this data is temporary or will be discarded in the future. For that, most of the projects that use this concept adopt one of a couple of formats. The most common and well-known is the CSV format, due to its simplicity and high compatibility with every language or system. But, to create a more efficient way to store data, the Apache Foundation created the Apache Parquet format, which is basically "a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language" [23]. That is, Parquet is a compressed columnar-format file that stores the data in a more efficient form, with a compressed dictionary based on the record shredding and assembly algorithm. For example, a 1 TB CSV file will be reduced to approximately 130 GB, according to the Databricks platform [24]. The next chapter contains tests of these two formats, showing the performance battle between them.
2.1.4 Stock Market
Nowadays, many people try to work and make money with stock market variations. It is a difficult area to analyze, since stock volatility is high and the concept is complex, involving thousands of indicators and key values that define a market. For example, due to the recent pandemic, Pfizer's stock rose from a maximum of 38 dollars at the beginning of 2020 to 42 dollars in December 2020, reaching its all-time high.
To study the stock market's behavior, a great amount of data has to be processed, as well as well-known financial indicators. Since the enterprise focuses its solutions on products related to financial or business problems, and its interest in exploring new horizons, such as the stock market, is high, this theme was chosen as the core of the solution.
Besides that, it is the most suitable theme to test on a brand-new Hadoop architecture, just as mentioned in the previous paragraph: it requires millions upon millions of records about stock values to give a more precise prediction. Some indicators are common to all the sources presenting that market, but there are differences, small ones, that can change an entire vision of a stock's prediction. One percent of a dollar across a collection of 200 tracked stocks can lead to significant losses, or even to a bad correlation calculation between all of the scraped sources.
All online stock market sources provide indicators like Price, Open, Close, High, Low, Market Cap, PE Ratio, and Dividend. These values give an idea of how a stock has changed historically, but they are not sufficient for a prediction. For that, there are other important indicators, like the interest rate, vector curves, turnovers, or the foreign exchange rate. A strong correlation can be built with them using, for example, algorithms based on Machine Learning (Artificial Intelligence) and Deep Learning. It is also possible to create a prediction model based on financial news or earnings reports.
Figure 2.5: Pfizer maximum stock market value after first dose injected in a person (Google Finance).
2.2 Stock Market Behavior's Prediction Architectures: State of the Art
2.2.1 Traditional time series prediction
Traditional statistical models are widely used, even nowadays, due to their capability to model linear relationships with the key values that influence a certain stock market value. There are two types of time series models: simple regression and multivariate regression. Paul D. Yoo et al. studied these methods, presenting the advantages and disadvantages of each mathematical model [25].
The Box-Jenkins model is a widely used example of a univariate model, also known as simple regression. This model contains an equation with only one unknown. However, this model is not appropriate to use because it requires a lot of data to make the result precise. That is, if the amount of data is big and covers a low-interval period, the efficiency is high, but it is still not a great value compared to other architectures presented below (60 percent).
Multivariate models are univariate models with more complexity attached, i.e., they contain more than one variable. One example is regression analysis, which is often compared with neural networks. It estimates relationships between variables using a criterion, which is normally an equation. Nowadays, neural networks have largely replaced these algorithms, due to their performance.
2.2.2 Artificial Intelligence: Neural Networks
There are two main approaches to predict the stock market's values with a considerable trust rate. One is based on historical values, correlated with key indicators like the interest rate, price records over a long time interval, etc. The other consists of collecting text information related to an enterprise's behavior, or even about some current event important to the majority of people/organizations.
A neural network is a series of algorithms, inspired by the human nervous system, that creates relations within a big amount of data. It is useful for prediction systems because it has the power to perform many calculations and produce outputs with high correlation.
The figure below presents the general architecture of a neural network system. It consists of three layers: the input, hidden, and output layers. Each node in the network receives inputs from nodes at a lower level of the hierarchy and performs a weighted addition to calculate its output.
The general output function for this algorithm is the sigmoid function, with the expression:

S(x) = \frac{1}{1 + e^{-x}}, (2.1)

where e = Euler's number (approx. 2.71828).
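The sigmoid of Eq. (2.1) can be written directly in Python (a trivial sketch, for illustration only):

```python
import math

def sigmoid(x: float) -> float:
    """Sigmoid activation S(x) = 1 / (1 + e^(-x)), as in Eq. (2.1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Outputs are squashed into the open interval (0, 1), with S(0) = 0.5.
```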
Figure 2.6: Neural Network Architecture for learning the behaviour of the stock market based on key indicators. [4]
2.2.3 High-speed Learning Algorithm - Supplementary Learning
The first algorithm to be presented is the High-speed Learning Algorithm with Supplementary Learning, mentioned in [4]. It takes the error backpropagation proposed by Rumelhart [26] and improves it by automatically scheduling the pattern presentation and changing the learning constants. In this algorithm, weights are updated according to the sum of the errors after all data has been processed. During learning, errors are backpropagated only for the data whose output exceeds the maximum tolerance. In other words, only the major unit errors will be learned again. Besides that, supplementary
learning allows the automatic change of the learning constants, i.e., the parameters will be flexible for each type/volume of data. The resulting weight update is expressed by:
\Delta w(t) = -\frac{\varepsilon}{\text{learning patterns}} \cdot \frac{\partial E}{\partial w} + \alpha \, \Delta w(t-1), (2.2)

where \varepsilon = learning rate; \alpha = momentum; learning patterns = number of learning data items that require error backpropagation. The learning rate is automatically reduced when the amount of learning data increases. This allows the constant \varepsilon to be used regardless of the amount of data.
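A minimal sketch of the update rule in Eq. (2.2); the function name and the numeric arguments are illustrative (a real gradient value would come from backpropagation):

```python
def delta_w(grad, prev_delta, learning_rate, momentum, n_patterns):
    """Weight change: -(eps / learning patterns) * dE/dw + momentum * previous change."""
    return -(learning_rate / n_patterns) * grad + momentum * prev_delta
```

Note how dividing the learning rate by the number of learning patterns implements the automatic reduction described above.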
2.2.3.1 Teaching Data
In the mentioned article [4], the authors create space patterns based on time-space patterns of key indicators and convert them to analog values in the range [0,1] (due to the sigmoid function).
After that, the timing to sell or buy stocks is indicated, within the sigmoid range, by one output unit. This timing is used as teaching data and is the weighted sum of weekly returns. They study the TOPIX, which is a stock exchange index, like NASDAQ, defining the weekly return of the stock as:
r_t = \ln \frac{\mathrm{TOPIX}(t)}{\mathrm{TOPIX}(t-1)}, (2.3)

where TOPIX(t) is the TOPIX weekly average at week t;

r_n = \sum_i \varphi_i \, r_{t+i}, (2.4)

where \varphi_i are the weights.
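For illustration, the weekly log return of Eq. (2.3) in Python, with made-up index values (not real TOPIX data):

```python
import math

topix = [1400.0, 1414.0, 1399.9]  # hypothetical weekly averages
# r_t = ln(TOPIX(t) / TOPIX(t-1)) for each consecutive pair of weeks
returns = [math.log(topix[t] / topix[t - 1]) for t in range(1, len(topix))]
```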
The input indexes converted to space patterns and the teaching data usually come with variations. So, to eliminate this error, the data is pre-processed with log or error functions in order to make it as regular as possible.
After that, a normalization function is applied to map the data into the [0,1] range and correct the data distribution.
2.2.3.2 Prediction Simulation
Four different learning data types are used to simulate the neural network. The authors used a 33-month interval and calculated the values presented in the table below.
2.2.3.3 Buying and Selling Simulation
To demonstrate the practical behavior of the implemented algorithm, the authors create a simulation with the teaching data. The strategy to buy and sell stocks is called one-point, where all money is used
Table 2.1: Correlation coefficients

            Correlation Coefficient
Network 1   0.435
Network 2   0.458
Network 3   0.414
Network 4   0.457
System      0.527
to buy stocks and all stocks held are sold at the same time. An output above 0.5 in the analog system determines the buying of a stock, and a value below it means selling the stock.
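The one-point rule described above reduces to a simple threshold (a sketch; the 0.5 cutoff is the one stated in the text):

```python
def trade_signal(network_output: float) -> str:
    """Buy when the analog output is above 0.5, otherwise sell."""
    return "buy" if network_output > 0.5 else "sell"
```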
The figure below shows a plot illustrating the prediction of the TOPIX index.
Figure 2.7: Prediction Simulation.[4]
This simulation shows that it is possible to predict the expected behavior of a stock. The buy-and-hold line is the real price of TOPIX, and the line above represents the stock prediction of the modular neural network architecture. The two lines show identical evolution behaviors, differing only in the real value of the stock. But the more important factor is that, in almost all the studied periods, the prediction system can confirm the increase or decrease of the TOPIX stock price, giving a great trust index for buying and selling opportunities.
2.2.4 Traditional time series prediction vs Neural Networks
Paul D. Yoo et al. [25], mentioned in the subsection Traditional time series prediction, compared traditional prediction systems with recent architectures, namely neural network algorithms [27].
The first difference is related to speed and fault tolerance. Neural networks execute their tasks in parallel channels, giving the system reliability and a high response speed. Compared to traditional statistical models, like Box-Jenkins, they perform better and are more efficient.
Lawrence [27] used the JSE system, a system that combines neural networks with genetic algorithms. It is widely used due to its high efficiency. It takes 63 indicators in order to get an overall view of the market. This system normalizes all data into the range [-1,1],
which is a normal procedure in this type of system. Simpler systems only use indicators like the historical price and chart information. This method shows the capability of predicting the stock market correctly 92% of the time, while Box-Jenkins reaches only 60%.
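The [-1, 1] normalization step mentioned for the JSE system can be sketched as a min-max rescaling (the function name and sample values are illustrative, not from [27]):

```python
def normalize(values):
    """Min-max rescale a list of numbers into the range [-1, 1]."""
    lo, hi = min(values), max(values)
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]
```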
Paul D. Yoo et al. [28] studied the performance of Multiple Discriminant Analysis (MDA), a multivariate regression model, versus neural networks. Their NN-based stock prediction presents 91% accuracy, against only 74% for MDA methods. In fact, the conclusion is that neural networks outperform any traditional time series algorithm. They learn from the data itself rather than from an induced relationship. Another advantage is that NNs have non-linear and non-parametric learning properties that improve forecasting and prediction efficiency. So, NNs are compatible with stock market data, which is hard to model and non-linear.
2.3 Related works
Currently, there are several projects and studies regarding Big Data and Business Intelligence solutions. Some of them also present implementations based on complex data analysis, using sophisticated mathematical models.
Zhihao [5] developed an architecture for Big Data analysis that uses Hadoop, Spark, and Machine Learning in order to calculate stock market variations, based on one stock indicator: the return. He mentioned the power of HDFS to store a great amount of data, capable of failure detection and non-stop jobs, that is, handling system failures so as not to compromise the whole system. He used the system to try to predict US oil stocks, retrieving data from the Yahoo Finance website.
He used the MapReduce framework to write jobs and applications, and Spark over the MapReduce framework to process data, since Hadoop by itself cannot handle it in real time.
The data comes directly from the website, in CSV format, collecting the return of 13 oil stocks from 2006 until 2019. He used Flume to inject the data into HDFS. For processing the data, he used Spark's Python API, PySpark, creating data sets with the data. After that, he used the MLlib library available in Spark, creating a Linear Regression model to predict the next prices of US oil stocks. The evaluation metric for the data is the R-squared value (squared correlation).
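The regression step can be sketched without Spark: a minimal ordinary least squares fit in plain Python (MLlib's linear regression does this at scale; the helper and values below are illustrative only):

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx
```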
The results of the experiment are based on the regression model.
The figure shows the prediction of the data, calculated with the MLlib library.
Figure 2.8: Prediction results with collected data. [5]
It is possible to see that some values are negative, meaning that the oil stocks are not related among themselves. The model was built using a regularization parameter equal to 0.3. The Mean Average Error is 1.95%, suggesting that this model is not suitable for data with high dimensionality.
He concludes that it is hard to predict stock data, but this technology brings to the table the possibility of better performance if used with other tools, such as neural networks.
The project proposed by Lathika et al. [6] consists of using the MapReduce technique along with the HDFS architecture to develop a prediction model. They created a system that uses a historical stock dataset acquired from finance.yahoo.com to predict a company's next-quarter prices.
After that, they created a prediction model. The workflow of this prediction model is shown in the next figure.
Figure 2.9: Prediction Model Diagram. [6]
Also, they use a time series algorithm to calculate all past movements of a certain company's behaviour, based on several stock market analysis variables.
Figure 2.10: Time series Algorithm. [6]
In the end, using these two main concepts, they can present some results in a CSV file, calling it a test dataset. Using MapReduce, they can simplify the processed data, to present it at the end in a prediction model result table. MapReduce reduces the information by about 98.5
percent, and the final results are presented in the figure [6], relative to the predicted average stock held in three different companies.
Figure 2.11: Predictive Results. [6]
The results are good, but the method used to present the information is poor, because they did not create a BI interface. Besides that, the maximum trust interval calculated for these three companies is only 92 percent. This value is not high enough, because this market is very unstable and presents high variations: for example, eight percent of a stock position valued at 100,000 euros is 8,000 euros. This difference can easily lead an investor to invest, or not, in a certain company's stock. An acceptable value for this prediction must be around 98 percent. So, the Achilles' heel of this project is its low trust percentage.
Arkili [29] created an architecture using the Mahout and Pydoop technologies, along with high-performance computing tools, to try to predict stock movements over various periods. Mahout is a distributed algebra framework that provides mathematical models that can be used to predict the stock market. Pydoop is a Python interface that interacts with HDFS directly, allowing applications to be written on Hadoop. They use a linear regression model for the prediction of the stock market, based on the Python scikit-learn library. The results are based on a ten-year stock analysis of the Home Depot enterprise, giving an accuracy of 0.85, or 85%, based on comparisons between the actual and the calculated values.
M.D. Jaweed et al. [30] created an architecture for analyzing large datasets of stock market data using the HDFS architecture, along with QlikView, a Business Intelligence tool, to present the data, creating plots and graphics to illustrate stock market tendencies. In this project, there is no post-processing of the data to feed a predictive model or similar analytics, but QlikView's results can help the user find some patterns in a user-friendly interface.
Mahantesh C. Angadi et al. [7] created a Data Mining stock market analysis using a time series model: the auto-regressive integrated moving average (ARIMA). They wrangle the data, collected from the Google Finance website, and store it in an R-language data frame.
The ARIMA model is shown in the figure below, based on pre-processed historical data.
Figure 2.12: ARIMA model [7]
The ARIMA model uses the autocorrelation function and the partial autocorrelation function to identify p, d, and q, respectively the order of the autoregressive part, the degree of first differencing, and the order of the moving average part.
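The "I" (integrated) part of ARIMA, i.e. the degree-d differencing, can be illustrated in a few lines (a sketch only; real ARIMA fitting would use a statistics package):

```python
def difference(series, d=1):
    """Apply first differencing d times (the 'I' step of ARIMA)."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series
```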
The results are quite conclusive, as they present plots that illustrate the predicted values, compared to the actual and past ones. The figures below show the two plots of real versus predicted stock values.
Figure 2.13: ARIMA prediction results using datamining [7]
Figure 2.14: ARIMA prediction results using datamining [7]
The second plot presents the predicted data, and both represent INFY stock values between 2014 and 2015. Both are similar, showing that this algorithm is effective on a short-term basis. This work lacks distributed file storage, giving good responses only for short-term intervals, because no distributed file system was built to store a great amount of data. The stock market is a non-linear concept, and it requires a lot of data to forecast its behavior. This architecture can produce bad results for a larger number of stock predictions.
2.3.1 Hadoop MapReduce vs Apache Spark
Satish Gopalani et al. [8] compared the performance of the MapReduce and Spark frameworks for processing data in a Hadoop Cluster.
Both frameworks use the MapReduce paradigm to distribute files in the system, but the two present different architectures.
They use a data set that allows performing clustering with the K-means algorithm: "k-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centers, one for each cluster." [31]
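The quoted K-means procedure can be sketched in one dimension (the data points and initial centers below are illustrative, not the benchmark data set of [8]):

```python
def kmeans(points, centers, iters=10):
    """Plain 1-D K-means: alternate nearest-center assignment and mean update."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:  # assignment step: attach p to its nearest center
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # update step: move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers
```

Spark's advantage on this workload comes from keeping the points in memory across these iterations, while MapReduce re-reads them from disk each pass.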
Speed performance tests were made, and the results are in the figure below, where it is possible to see different tests with different file sizes, using the same clustering algorithm.
Figure 2.15: MapReduce versus Apache Spark tests [8]
These results show the significant difference between the two, making Spark the better choice for data streaming and machine learning in Big Data contexts.
They stated that the best way to create a Big Data architecture is to mix these two frameworks, where Hadoop is used to cluster and store information while Spark is used to process data. This combination brings to the table the advantages of both frameworks.
Chapter 3
Implemented Solution
This chapter presents the architecture created, explaining all the technologies and methodologies used to meet the required parameters.
The hardware was provided by the company, B2F, as well as the framework that presents the processed data, Power BI.
The first section compares different file system technologies, highlighting Hadoop's advantages related to multiple performance and security factors.
The second section describes the installation of the Hadoop framework and how it is distributed between all the nodes, that is, the configurations made to create the shared environment.
The third section explains the stock market's Web scraping script, explaining all the implemented code and its relation to the rest of the architecture.
The fourth section contains an explanation of the framework responsible for creating a schema for the data and loading it in the final stage.
The final section describes how data is loaded and presented, and the alterations made to plot and calculate the predictive behavior of the scraped stock values.
All sections present a brief comparison with other solutions existing in the market, and why the chosen ones are better for this solution.
The next couple of sections present the implemented solution in the developed Hadoop architecture, naming all the frameworks used, the technologies installed, and the configurations made to store the data. After that, all the algorithms created to extract, load, and transform the final data (ELT process) are presented.
3.1 Comparison between Hadoop and other Frameworks
There are some distributed file systems available on the market. Some of them are very expensive and present other disadvantages compared to the Hadoop framework. This architecture allows easily scaling the number of available nodes, while being flexible and resilient to failure, because configurations and files are replicated on every node of the system. This section presents advantages and disadvantages compared to other similar systems available in the market.
3.1.1 Hadoop : Modern Data Warehouse versus Traditional Data Warehouses
Traditional Data Warehouses are systems that are responsible to analyze data just like Hadoop
and another similar system. Hadoop is a Data Warehouse, but this technology presents some
upgrades related to ancient models, to make it more reliable. A Data Warehouse is a system that
stores, process, and retrieve data to support a decision. [9] The figure below represents the general
architecture of a Data Warehouse.
Figure 3.1: General Architecture of a Data Warehouse. [9]
This architecture is the base for every file-storage system, but traditional warehouses have been surpassed by newer systems. Nowadays, technology is in a state that was unthinkable twenty years ago, when the Internet did not have such an important role in our lives. Hadoop, in contrast to older Data Warehouse systems, can run ETL processes in parallel, that is, on multiple processes. It also presents a failure-handling system that prevents data or processes from being lost forever, creating backup files linked to the original one on every machine, while the master node always keeps track of the information in the system domain.
The figure below represents the growth of Internet usage, as a curiosity indicator, from 1990, when the first Data Warehouse was implemented, until recent years, when every person can access the Internet easily.
Figure 3.2: Internet usage between 1990 and 2016 [10]
With the big growth of the Internet across the world, traditional Data Warehouses began to present difficulties in keeping up with the requirements, because the technologies implemented in these systems are not capable of processing a great amount of data in a small time interval. This was a key factor in the creation of new storage mechanisms.
So, the principal difference between these two systems is related to scalability and fast response: a traditional Data Warehouse was created to run on a single machine, instead of deploying files, and respectively the jobs and tasks related to them, on multiple machines. So, the concept of Big Data cannot be applied to these older systems. For analyzing a big amount of data, Hadoop is the better mechanism, and this is why this technology was chosen for this project.
3.1.2 Hadoop versus Azure Databricks
Recently, Microsoft created a cloud-based file system: Azure Databricks. It is Spark-based, allowing languages like Python, R, and SQL to be used. It is a system that integrates easily with other frameworks, such as Power BI and the Azure SQL Database. The performance of this system can be more effective than Hadoop's if it is configured to work with other Microsoft frameworks to process and load data.
The objective of this work is to create a low-cost solution able to perform as well as high-cost solutions like Databricks, which requires a high monthly cost. Hadoop is a technology that brings good indicators for Big Data analysis on regular machines, not requiring the best software and hardware on the market. So, in a long-term analysis, opting for Hadoop brings the same performance results, but at a much smaller cost.
There are other similar technologies in the market, like Oracle, but, once again, these technologies do not meet one of the principal requirements of this work, namely being a low-cost solution.
Figure 3.3: Azure Databricks and Oracle RAC pricing [11] [12]
The figure presents the cost of owning an Oracle and an Azure cluster, and it is possible to see that these two solutions are very expensive compared to a Hadoop system, where the only real investment is in the physical machines. An Azure Databricks cluster with 32 GB of RAM costs about 708 dollars a month, while a good physical machine with 32 GB can be bought for 500 dollars or less. So, this is a high investment, and opting for physical instead of virtual machines can save an enterprise a lot of money.
3.2 Hadoop Cluster Solution: Presentation of all used frameworks
This section describes all the work done on the available machines to create a Hadoop Distributed File System (HDFS). The first thing discussed is the number of nodes needed to perform this task. After a lot of research and a couple of meetings with the enterprise, the conclusion reached the perfect number of nodes for the stated requirements: three. That is, the implemented solution has three working machines. One of them is the master node, also known as the "Name Node".
The two remaining machines are known as "Data Nodes". HDFS works with a master/slave architecture. The figure below presents the implemented architecture, showing all the components of the system.
3.2.1 Name Node
The Name Node is the engine responsible for managing the entire system and regulating access from the outside. This software is allocated to one machine, connected to all machines in the system, and
Figure 3.4: Implemented HDFS Data Extracting Architecture.
it ensures that data never flows through this machine, leaving that to the Data Nodes. The Name Node works with metadata information about all the stored files. In other words, this node maps all the files in the system, managing all the tasks executed on those files.
In parallel, it is possible to run a secondary Name Node to prevent a single point of failure. So, the system has a secondary manager ready to work if the primary one fails or goes down for some reason.
3.2.2 Data Nodes
Data Nodes are the machines that run the engine that stores and manages the HDFS files. These files are split into blocks with the same specifications. Their size can be configured, with a default of 128 MB in the latest versions and 64 MB in older ones. Data Nodes can also create, delete, and change blocks, as demanded by the Name Node. Usually, one instance of this software is allocated to each slave node. It is not impossible to run more than one, but this is not recommended in most cases. The best option is always to have one Data Node process per slave node.
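For reference, the block size mentioned above is controlled by the dfs.blocksize property in hdfs-site.xml (the value below is the 128 MB default of recent Hadoop versions):

```xml
<!-- hdfs-site.xml: HDFS block size in bytes (128 MB) -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
</property>
```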
3.2.3 YARN: a task/job scheduler and manager
YARN is the framework responsible for managing and scheduling all the jobs submitted to the system. YARN works on top of Hadoop and creates two distinct engines: the Resource Manager and the Node Manager. The first one is allocated on the Name Node, and the second one is created on all slave nodes.
The Resource Manager has two components: the Scheduler and the Applications Manager.
The Scheduler allocates resources for all applications running on top of Hadoop. This engine does not guarantee failure prevention, so the Applications Manager was created to monitor the
state of each job. By default, the jobs running on Hadoop come from the MapReduce framework. In this project, it was defined that no MapReduce jobs would be scheduled, only Spark jobs. The first reason is compatibility with the Web scraping algorithm, and the second is performance. The section Spark Framework: Advantages to other solutions and configuration presents the pros of using this system on top of Hadoop, showing and explaining the performance advantages compared to MapReduce jobs.
3.2.4 Spark Framework: Data loading framework
In order to take the input data from the Web scraping algorithm, presented below in the section Stock Market's Web Scraping: Extracting stock indicators, and load it into the final stage (where all data is processed and presented), the Spark framework was implemented.
Apache Spark substitutes the MapReduce jobs, as stated above, and takes advantage of in-memory processing instead of persisting temporary data to disk, which is a lot faster. The main requirement is a minimum memory specification on the machines that perform these tasks, so that Spark does not consume all the available memory. 8 GB of RAM is enough to perform data loading in Spark, and this amount is nowadays cheap and easy to have on a single machine.
3.2.5 Apache Hive: MySQL database and JDBC Driver
To connect all the data in real time to Power BI, Apache Hive was implemented, along with MySQL. Joining these two components makes it possible to expose the data to the outside: MySQL is used as the metastore database for the Hive-created tables, responsible for their schema. The Hive tables hold all the extracted data, using a live connection to preview the data in the final-stage tool.
Apache Hive ships with a JDBC driver and the Thriftserver script, allowing connections to the Hive tables from the outside over HTTP.
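As an illustration, a Hive table over HDFS-stored stock data could be declared as follows (the table name, columns, and path are hypothetical, not the project's actual schema):

```sql
-- Hypothetical Hive external table over scraped stock data in HDFS
CREATE EXTERNAL TABLE stocks (
  symbol STRING,
  price  DOUBLE,
  ts     TIMESTAMP
)
STORED AS PARQUET
LOCATION '/user/hadoop/stocks';
```

The metastore in MySQL holds only this schema and the location; the data itself stays in HDFS.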
Figure 3.5: Hive Architecture. [13]
3.2.6 Power BI
This framework is used to process and present data calculations from the HDFS files, connecting to the whole system via the JDBC (Java Database Connectivity) driver from the Apache Hive tool. Other tools available in the market, like Tableau or QlikView, presented in chapter two, also perform similar computations, but this framework is already deployed at the enterprise responsible for this project.
So, extra costs are avoided, and the internal support, if any problem arises, is much stronger. Power BI presents a user-friendly interface, with hundreds of connectors, like MySQL databases, CSV, Excel, Spark, among others. Besides that, this framework allows some Machine Learning calculations on the data, like linear regressions, correlations, moving averages, etc.
So, after a brief study and some discussions with B2F's experts in this area, this tool is the most suitable to bridge the local file system (HDFS) and live data analysis (Power BI).
3.3 Setup of Cluster: Installation of Hadoop / YARN on three machines
First of all, in order to store all the scraped data, a Hadoop Cluster was implemented. It consists of three working machines, connected by Ethernet, each with an Intel i5 core and 8 GB of RAM, running the Linux distribution Ubuntu 20.04 (it can be downloaded at https://ubuntu.com/download/desktop/thank-you?version=20.04.1&architecture=amd64), connected to each other and sharing information about the stored files. The first step is to connect them to the same network and test whether the three nodes can ping the whole architecture.
For that, the /etc/hosts file is edited on all machines. The picture below represents the actual configuration of this file on the three nodes.
Figure 3.6: Hosts file located at /etc folder.
The master node uses an SSH connection to talk to the other nodes so, for security and persistent-connection reasons, key-pair authentication is used. On the master node, the SSH key was generated with the following command in the terminal:
ssh-keygen -b 4096
When running this command, a password prompt appears. It was left blank so that Hadoop can communicate unprompted.
After that, the master.pub file, containing the public key of the master node, was copied to each worker node, creating a single valid authentication between the three machines. The command used is:
cat ~/.ssh/master.pub >> ~/.ssh/authorized_keys
With that configuration, every machine can communicate with all the nodes. The next step is to download the most recent Hadoop release, 2.10.1, released on September 21, 2020. It can be downloaded directly from the Hadoop site, at https://hadoop.apache.org/. From the terminal, the next commands are performed, to download and extract the files to the home folder:
cd
wget http://apache.cs.utah.edu/hadoop/common/current/hadoop-2.10.1.tar.gz
tar -xzf hadoop-2.10.1.tar.gz
mv hadoop-2.10.1 hadoop
The next step is to set the environment variables, that is, to give the system the location of all Hadoop files. For this, the line below is added to the .profile file located in the home folder:
PATH=/home/hadoop/hadoop/bin:/home/hadoop/hadoop/sbin:$PATH
In the .bashrc file, located in the same folder, the following lines were set, in order to give the path to the shell:
export HADOOP_HOME=/home/hadoop/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
The system can now locate the Hadoop directory. Hadoop runs on Java, so the next step is to install or locate Java and set its path in the Hadoop environment file, hadoop-env.sh. In this case, the installed Java version is 8, so the line included is:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
The next step is to define a location for the NameNode. In the file /hadoop/etc/hadoop/core-site.xml, the name and port of the NameNode are defined, as represented in the figure below.
Figure 3.7: Configuration of core-site file.
The port used is 9000, and the location is the localhost, also known as node-master in this case. The path for HDFS is the next step. The figure below shows the configuration made in the file located at /hadoop/etc/hadoop/hdfs-site.xml.
Figure 3.8: Configuration of hdfs-site file.
Hadoop configuration is now concluded. After that, the YARN configuration was made. YARN keeps track of all jobs executed on the system, and this engine allocates memory and processing resources to each task. The figure below presents the configuration created in the file yarn-site.xml, located at /hadoop/etc/hadoop, to give an address to YARN.
In the workers file, at /hadoop/etc/hadoop/workers, the names of the DataNodes are required, so that the NameNode recognizes them and initializes their daemons when the cluster is started.
For YARN to allocate all the necessary resources, memory allocation must be configured. For this architecture, a few calculations were made to find the configuration values. To choose the right values, the total RAM, the disks, and the CPU cores are considered. [32]
In the implemented solution, the number of disks is three, each machine has 8 GB of RAM, and each CPU has two cores. For 8 GB of RAM, it is recommended to reserve 2 GB for system memory and 1 GB for the HBase process, if used. Since HBase is not used here, only the first reservation matters; the remaining memory is available to run jobs.
number of containers = min(2 × CORES, 1.8 × DISKS, Total available RAM / MIN_CONTAINER_SIZE)    (3.1)
The documentation presents the value 512 MB for the minimum container size, for a total RAM per node between 4 and 8 GB, which is the case for this solution. So, using 8 GB of available RAM, 6 cores and 3 disks, the minimum of these values is 5.4, which is rounded down to 5. The final calculation determines the amount of RAM per container, presented in the next equation.
RAM per container = max(MIN_CONTAINER_SIZE, Total available RAM / number of containers)    (3.2)
The final result, using the parameters presented above, is 3.33 GB of RAM per container. These values (given and calculated) represent all the information needed to configure memory allocation. The figure below presents the entries that need to be configured in the file yarn-site.xml, located in the /etc/hadoop/ folder.
Figure 3.9: Memory allocation configurations on yarn-site xml file .
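As a quick check, equations (3.1) and (3.2) can be expressed as a short Python helper. The function name and the integer rounding are illustrative, not part of the thesis scripts; the RAM-per-container result depends on how much of the 8 GB is treated as available after the system reservation.

```python
def yarn_memory_config(cores, disks, total_ram_mb, min_container_mb=512):
    """Equations (3.1) and (3.2): container count and RAM per container.
    Illustrative helper; rounding down mirrors the text's 5.4 -> 5."""
    containers = int(min(2 * cores,
                         1.8 * disks,
                         total_ram_mb / min_container_mb))
    ram_per_container_mb = max(min_container_mb, total_ram_mb / containers)
    return containers, ram_per_container_mb

# Inputs from the text: 6 cores, 3 disks, 8 GB of RAM, 512 MB minimum container.
containers, ram_mb = yarn_memory_config(cores=6, disks=3, total_ram_mb=8192)
print(containers)  # 5
```

With these inputs, the container count comes out as 5, matching the rounded value computed above.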
The yarn-site.xml file is represented below, to give an idea of how this file is responsible for allocating all resources to future Hadoop jobs.
Figure 3.10: yarn-site xml file.
With YARN and Hadoop installed, all configuration files were copied to the worker nodes using the scp protocol, as stated below:
for node in node1 node2; do scp /hadoop/etc/hadoop/* $node:/home/hadoop/hadoop/etc/hadoop/; done
Now, as with any classic file system, HDFS needs to be formatted, so the command below was executed.
hdfs namenode -format
After that, Hadoop is ready to run and perform task scheduling. Starting the Hadoop script start-dfs.sh creates daemons on the NameNode machine named "NameNode" and "SecondaryNameNode", the latter for preventing a single point of failure. On the DataNodes, daemons named "DataNode" are also initialized. A web interface is also available, stating all information about the developed cluster: it indicates the number of active nodes, the files on HDFS, as well as all specifications (space available, heap memory, etc.).
Using the jps command, it is possible to see which daemons are running on each machine.
Figure 3.11: Daemons in the master machine
Figure 3.12: Daemons in the slave machines
Figure 3.13: HDFS Local Web Site.
Figure 3.14: HDFS Local Web Site.
It is now also possible to run the YARN ResourceManager and NodeManager daemons, by running the script start-yarn.sh. As stated before, a daemon called ResourceManager is created on the master node, responsible for monitoring and allocating resources. The NodeManagers are the workers of the ResourceManager, launching and managing containers on their nodes and reporting the results back to the ResourceManager.
This framework can also be accessed via a web user interface, to monitor all jobs and tasks created, running, or finished.
Figure 3.15: YARN Job Manager Local Web Site.
3.4 Stock Market’s Web Scraping: Extracting stock indicators
After configuring Hadoop, the next phase was to implement a system to extract data from a set of sources, namely stock market data from the Google Finance, Yahoo Finance, MarketWatch, and The Wall Street Journal websites. These sources present quotes for every stock listed on NASDAQ and NYSE, as well as company indicators and historical values. After a brief analysis, the chosen solution was to create a Python script that autonomously scrapes a set of current indicators, relevant for later study and data processing. The data is stored in a single list; an Apache Spark script was then implemented to store the data in Parquet format, ready to import into a Hive table and, later, into the analysis and presentation tool, Power BI.
3.4.1 Important stock market’s indicators to extract
To choose which indicators would be extracted, a brief study of the financial market was made. One of the most important indicators for modeling a prediction system is the moving average, which is not listed on these sources. The moving average is calculated by adding a stock's prices over a certain period and dividing the sum by the number of periods. Here the interval grows with every new input to the system (a cumulative moving average), giving an increasingly robust result. This value is calculated after the extraction of the data and before the presentation of the results.
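A minimal sketch of this cumulative moving average, where the running mean is recomputed as each new price arrives (the function name is illustrative):

```python
def cumulative_moving_average(prices):
    """Running mean: element n is the average of the first n prices,
    so the window grows with every new input."""
    averages, total = [], 0.0
    for n, price in enumerate(prices, start=1):
        total += price
        averages.append(total / n)
    return averages

print(cumulative_moving_average([10.0, 12.0, 11.0, 13.0]))  # [10.0, 11.0, 11.0, 11.5]
```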
Other indicators are important for studying the stock market, like the PE ratio, which relates a company's share price to its earnings per share, or market capitalization, which gives the market value of a company. The dividend is also a good indicator to study because it states the reward that a company gives to its shareholders, and it is a factor that can influence the behavior of a market.
Taking into account the information available on each source, a set of indicators was defined, to structure a general extraction model. This decision was based on Google Finance's indicators.
This source presents the following values for each listed stock, and all of them have good reasons to be included in the extraction model: Price, Open, High, Low, Market Capitalization, Price-Earnings Ratio, Dividend Yield, Previous Close, 52 Week High and 52 Week Close.
There are other important values that would strengthen a decision about a stock market's behavior, but these indicators are not listed on every source, and some of them are hidden or impossible to extract, due to lack of permission on the website's API or because the values do not exist on the website's free front end. It is important to state again the main goal of this project: a predictive stock market analysis on a low-cost system. So, the main objective is to create a solution able to support decision-making at the lowest possible cost. Along with the values listed above, a timestamp in the format "MM/DD/YYYY HH:MM" and the name of the source are also stored in the system, to help future analysis.
3.4.2 Python Script
As stated in the subsection above, a Python script was developed to extract this data from all four sources at the same time. The next chapter explains how these values are stored on HDFS; for now, only the extraction model is presented. Initially, Python 3.8, the latest version at the time, was installed, but in the next phase of the project a problem occurred with this version on the Apache Spark framework, and the solution was to downgrade the working version to 3.7. To prevent auto-updates of the Python version or changes to the libraries used, a virtual environment tool was installed on the NameNode machine, where extraction and load are executed: Anaconda. This tool offers an environment that is more resilient to failures and changes in the Python configuration. It can be downloaded at https://www.anaconda.com/products/individual#linux and installed with this command in the Linux terminal:
bash ~/Downloads/Anaconda3-2020.02-Linux-x86_64.sh
After installing, the command conda init starts the Python environment. It can be shut down, but in this project it is kept always active, for compatibility across the whole system. This process can also be started with the command anaconda-navigator, which opens an interface to launch the virtual environment.
Figure 3.16: Anaconda framework’s interface.
As the figure above shows, Anaconda keeps track of all installed libraries. It allows the current configuration to be saved to a file and easily restored if any external failure occurs.
Two main libraries are used to extract data from HTML websites: urllib and Beautiful Soup. The first is used to open a connection to the website, while the second reads and parses the page's content. The two libraries work together to perform the extraction task. To add them to the Python environment, pip3, a package installer for Python, was installed using the command sudo apt install python3-pip, followed by pip3 install beautifulsoup4 and pip3 install urllib3 to install the latest version of each library.
With all dependencies installed, the logic of the algorithm was implemented. The figure below shows the extraction code for Google Finance's website.
Figure 3.17: Google Finance's code for extraction.
The figure above represents the logic created for extracting data from the source. A urllib object is created to open the connection to the website, and a BeautifulSoup object to parse the website's content. Then, using the code inspector tool in Mozilla Firefox, the class of the element holding each value is located, to filter the data. In this case, the price is in a span HTML tag with class "IsqQVc NprOob XcVN5d" and the other indicators are stored in td tags with class "iyjjgb". A retry loop was also created because the request sometimes returns no data; the loop waits until the website provides the values. The BeautifulSoup object returns an ordered list with every element that matches the given tag and class.
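The BeautifulSoup code itself is shown in Figure 3.17. As a self-contained illustration of the same filtering idea using only the Python standard library, with a made-up HTML fragment standing in for the real page, the class-based lookup can be sketched as:

```python
from html.parser import HTMLParser

class ClassFilter(HTMLParser):
    """Collect the text of every tag whose class attribute matches,
    mimicking BeautifulSoup's find_all(tag, class_=...) filtering."""
    def __init__(self, tag, cls):
        super().__init__()
        self.tag, self.cls = tag, cls
        self.capture, self.values = False, []
    def handle_starttag(self, tag, attrs):
        if tag == self.tag and dict(attrs).get("class") == self.cls:
            self.capture = True
    def handle_data(self, data):
        if self.capture:
            self.values.append(data.strip())
            self.capture = False

# Hypothetical fragment; "iyjjgb" stands in for the class used on the real page.
html = '<td class="iyjjgb">135.37</td><td class="iyjjgb">134.08</td>'
parser = ClassFilter("td", "iyjjgb")
parser.feed(html)
print(parser.values)  # ['135.37', '134.08']
```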
All the websites use the same URL format and the same ticker abbreviation for each listed stock, which makes it easier to automate this kind of task for other markets.
Figure 3.18: Inspector tool to find id of data tags.
The final task is to store the values in a list, appending the timestamp and the abbreviation of the source. In this case, a record tagged "gf" is saved, to help future analysis.
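This record-building step can be sketched as follows; the function name is illustrative, but the timestamp format and the "gf" abbreviation are the ones described in the text:

```python
from datetime import datetime

def build_record(values, source_abbrev, now=None):
    """Append the extraction timestamp (MM/DD/YYYY HH:MM) and the source
    abbreviation ('gf' for Google Finance) to the scraped values."""
    now = now or datetime.now()
    return values + [now.strftime("%m/%d/%Y %H:%M"), source_abbrev]

record = build_record(["135.37", "133.75"], "gf", datetime(2021, 1, 15, 14, 30))
print(record)  # ['135.37', '133.75', '01/15/2021 14:30', 'gf']
```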
The other three algorithms apply the same approach, but differ in how the output record list is built, since each site stores the required indicators in different orders, tags, and identifiers. A small transformation of the output list was therefore made so that all the algorithms present the same formatted output. The figure below presents the algorithm created for the extraction of the MarketWatch website's stock values; the logic is analogous for the other sources.
Figure 3.19: MarketWatch’s extraction algorithm.
This source is more complete than Google Finance, as are The Wall Street Journal and Yahoo Finance, retrieving additional indicators, so filtering was applied to keep only the required ones. With a small transformation of the BeautifulSoup list, a formatted output can be created, ready to be stored as a single line of an HDFS file.
The figure below shows the output format of this list, that is, one record of a file that will be stored in HDFS and later analyzed.
Figure 3.20: Output list of stock data.
These algorithms are used as functions of the main algorithm, the one that automatically stores this data in HDFS files, appending each record to a file created per stock and per month. The next section presents this main function, showing also the files that are created automatically, with all the data organized and formatted to load in the final stage.
3.5 Spark Framework: Advantages to other solutions and configuration
This section presents the configuration of the framework used to store and read the HDFS files containing stock values, Apache Spark, as well as the algorithm implemented to load this data to the final stage of the architecture: data processing and presentation in the Power BI tool. A brief comparison with a similar tool is also made, to explain why this technology was chosen to perform this task.
3.5.1 Apache Spark vs Hadoop MapReduce for running applications
Several studies have compared the performance of these two frameworks on Big Data architectures.
The first advantage of choosing Apache Spark over MapReduce is that the extraction algorithm was written in Python, and Spark also provides a Python API, giving better compatibility for embedding the entire system. Using MapReduce with Python is more complicated and does not offer the same efficiency and library support for data loading and processing.
Taking both functionalities into account, the second advantage of using Spark is its speed. Spark can process data up to about 100 times faster than MapReduce, since it operates in RAM rather than on disk like MapReduce, creating a "near-real-time" working environment.
Another great advantage of Spark is its built-in libraries, like MLlib, a machine learning library; MapReduce requires a more complicated and less compatible way to use machine learning dependencies. Like MapReduce, Spark also supports parallel distributed operations, allowing multiple jobs to run at once, even on the same files.
In chapter 2, a study of Hadoop MapReduce versus Apache Spark was presented. Other similar studies have been made, and their results are also conclusive: the Spark framework processes large amounts of information faster, since the processing happens in memory instead of on disk. Random Access Memory (RAM) is much faster than conventional disks, allowing reads and writes roughly 50 to 200 times faster. This is an important factor for this project: stock values are volatile and constantly changing, so quick extraction and processing of this data is essential for a better decision-making solution. So, after this brief analysis of both frameworks, Spark was configured on top of Hadoop. The next subsection presents all the configurations needed to embed this subsystem in the main architecture.
3.5.2 Apache Spark configuration over HDFS
This subsection presents the configuration of the Apache Spark framework on HDFS. The first step is to download the Spark binaries from the Apache Spark download page: https://spark.apache.org/downloads.html. The recommended Spark version for Hadoop 2.10 is 3.0.1, with the package pre-built for Apache Hadoop 2.7 and later.
Figure 3.21: Apache Spark versus Hadoop MapReduce .
Figure 3.22: Apache Spark download page .
Next, the downloaded Spark package is uncompressed using the tar command, and all the files are moved to a folder named "spark" in the home folder:
tar -xvf spark-3.0.1-bin-hadoop2.7.tgz
mv spark-3.0.1-bin-hadoop2.7 spark
The next step is to include the Spark binaries in the PATH environment variable, enabling the system to locate the Spark files and configurations. In the file /home/hadoop/.profile, the following line is added:
PATH=/home/hadoop/spark/bin:$PATH
The entire system is now linked to the Spark framework.
The next phase consists of integrating Spark with YARN, so that YARN can run Spark jobs. By default, YARN is configured to run MapReduce tasks, so this change is required. For that, the following lines were added to the file /home/hadoop/.profile:
export HADOOP_CONF_DIR=/home/hadoop/hadoop/etc/hadoop
export SPARK_HOME=/home/hadoop/spark
export LD_LIBRARY_PATH=/home/hadoop/hadoop/lib/native:$LD_LIBRARY_PATH
Next, the Spark default configuration template was renamed, so that it becomes the default configuration when Spark starts:
mv $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
In the file $SPARK_HOME/conf/spark-defaults.conf, the following line was added to set YARN as Spark's job manager:
spark.master yarn
The next step is to choose the run mode for Spark jobs on YARN. There are two modes. In cluster mode, the Spark driver is encapsulated inside the YARN Application Master, so a job started from the master node keeps running on the cluster even if that machine goes offline. In client mode, the driver runs on the submitting machine; if the author of a job goes offline, the job fails, although the Spark executors still run on the cluster, managed by a small YARN Application Master. In this project, Spark jobs run for a long time, creating a continuous flow in the Spark executor system, so cluster mode is the more appropriate and is the one configured. The following configurations are made for this mode; they differ from client mode mainly in memory allocation (cluster mode requires allocating memory on the cluster, while client mode does not need this configuration).
In the spark-defaults.conf file, the next line sets the amount of memory for the Spark driver:
spark.driver.memory 2G
The default value is 1 GB, but it is calculated for 4 GB machines, so 2 GB is a better value for 8 GB machines like the master machine, where the Spark Application Master will be running. The next value to set is the Spark executor memory. In the same file, the following line was introduced:
spark.executor.memory 1024m
The default value is 512m, also calculated for 4 GB machines, so 1024 MB is a better value. The final configurations create a History Server interface that logs all jobs executed on the system and presents some statistics about their performance. In the same file, the following lines were included to enable the History Server interface:
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory hdfs://node-master:9000/spark-logs
spark.history.fs.update.interval 10s
spark.history.ui.port 18080
Figure 3.23: spark-default.conf file .
Figure 3.24: History Server Web Interface .
After these steps, Spark is ready to run applications on the Hadoop cluster, performing tasks on the Hadoop file system, scheduled and managed by YARN. The next step is to create an algorithm capable of using these resources to load, process, and send data to the final stage, where the data stored in HDFS is analyzed and presented. The next subsection presents the logic created, along with all the dependencies and additional Spark configurations needed to access HDFS files via PySpark, the API used to bridge these two technologies.
3.6 Apache Spark Script to store extracted data in HDFS
Firstly, before connecting these two frameworks with the implemented algorithm, some additional configurations need to be made, more precisely in the /home/spark/conf/spark-env.sh file, where the next lines are added to set the Python variables for Spark:
export PYSPARK_PYTHON=/home/hadoop/anaconda3/bin/python3.7
export PYSPARK_DRIVER_PYTHON=/home/hadoop/anaconda3/bin/python3.7
export SPARK_CLASSPATH=/home/hadoop/apache-hive-2.3.7-bin/lib/mysql-connector-java-8.0.22.jar
The last line was added for a future step, which sets up a connection from Spark to Hive tables, via the MySQL connector, to allow this system to connect to the Power BI tool. This step is discussed in more detail in the next section.
To put all records on HDFS, an Apache Spark script was created. This script includes all the extraction algorithms created before, working together to store these values in HDFS files, organized by month and by stock. Files are created in CSV format and then compressed into Parquet files. Both formats are allowed in this architecture, but Parquet is designed for the Hadoop environment and presents some advantages over the CSV format. This method increases the performance of the entire system, because Parquet files are created specifically for this type of architecture, capable of compressing large files into much smaller ones. A CSV with 100 million records can take, for example, 1 TB, and Parquet compression can decrease its size by approximately ten times, to around 100 GB. This is a simple transformation that allows the system to hold more data, and Spark will process such files much faster.
The next chapter presents some tests made with records saved in each format, to assess whether Parquet improves the performance of the system.
This script extracts and stores stock market data from all four sites in a ten-minute loop, using auxiliary software to run the job automatically: Crontab.
This software runs the script periodically, making requests to the four websites and storing the information in HDFS files. The command to install this technology is sudo apt-get install cron. After that, crontab -e is executed in the Linux shell, creating a new line in the Crontab configuration file:
*/10 9-17 * * 1-5 cd /home/hadoop/Desktop && ./script.sh
This entry runs the script every ten minutes, Monday to Friday, from 9 a.m. to 5 p.m. This time interval is when the stock markets are open, so the script only has to run in this period; outside it, the values do not change.
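The crontab expression above can be mirrored in Python to reason about when the script fires (a sketch only; the real scheduling is done by cron itself):

```python
from datetime import datetime

def matches_schedule(dt):
    """True when dt matches the crontab entry */10 9-17 * * 1-5:
    every 10 minutes, between 09:00 and 17:59, Monday to Friday."""
    return (dt.weekday() < 5          # 1-5 in cron: Monday..Friday
            and 9 <= dt.hour <= 17    # hour field 9-17
            and dt.minute % 10 == 0)  # minute field */10

print(matches_schedule(datetime(2021, 1, 4, 9, 0)))   # True  (a Monday, 09:00)
print(matches_schedule(datetime(2021, 1, 2, 10, 0)))  # False (a Saturday)
```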
The figure below represents the bash file created to invoke the implemented Spark Python script.
Figure 3.25: Bash Script to Extract and Load Data using Crontab.
It is possible to see that the Apple, McDonald's, and Nike stock values are requested every 10 minutes and stored in the HDFS directory.
Now, the actual Python script will be commented, to explain every step made.
Figure 3.26: Python Imports.
The first part of the code is the import section, where all libraries are imported, including the websites' extraction functions and the libraries that ensure the connection of Python with Spark and Hadoop, namely SparkSession, and subprocess, to run shell commands from Python.
Next, the algorithm runs each extraction function presented before in the Python Script subsection, sequentially, to extract a list of values representing each website's chosen stock indicators at that point in time, defined by the "date" variable.
Figure 3.27: First part: Extraction of data.
The "values" variable returns a row with every value extracted. The next step is to store it in an HDFS file, appending it to the rows already in the HDFS directory.
To store the extracted data, a directory named "scraping" was created in the HDFS file system. Executing the command hdfs dfs -ls scraping shows the HDFS files holding the extracted data. These files are created at the beginning of each month, according to the part of the algorithm presented below.
Figure 3.28: HDFS folder with extracted files
After the extraction is complete, the next step is to append the new record to the file that holds the records for that month and that stock. Appending information to Hadoop files with Python is difficult, so some intermediate steps are taken to create a union between all the values already in the HDFS file and the newest records. To perform this task, two DataFrames are created. A DataFrame is an object that represents a table with rows and columns; in this case, they represent the HDFS file and each new record arriving in the process. The script creates a third DataFrame with the union of the new and old data, storing it in a new HDFS file, updating each time the file that contains all the extracted data. The next figures present the final stage of the code, where new data is appended to the HDFS file already stored on Hadoop.
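This read-union-rewrite pattern can be illustrated with plain Python structures. The actual script operates on PySpark DataFrames, and the file-naming helper below is only a hypothetical sketch of the per-stock, per-month organization described above:

```python
from datetime import date

def monthly_hdfs_path(stock, day, base="scraping"):
    """Hypothetical naming scheme: one file per stock and per month
    under the 'scraping' HDFS directory (real names may differ)."""
    return f"{base}/{stock}_{day.strftime('%Y_%m')}"

def union_rows(existing_rows, new_rows):
    """The union step: rows already read from the monthly file plus the
    newly scraped record form the table that is written back to HDFS."""
    return existing_rows + new_rows

print(monthly_hdfs_path("AAPL", date(2021, 1, 15)))  # scraping/AAPL_2021_01
print(len(union_rows([["134.08", "gf"]], [["135.37", "gf"]])))  # 2
```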
Figure 3.29: Final stage of loading data to HDFS part 1.
Figure 3.30: Final stage of loading data to HDFS part 2.
With this algorithm implemented, it is now possible to store this data and load it into the last phase of the architecture. The file will be imported into a Hive table, which is responsible for the connection to the Power BI platform. The next section explains all the steps made to load HDFS data into the processing stage.
Figure 3.31: Apple January extracted stock market values’ HDFS file(Portion of the file).
3.7 Connection to Power BI with Apache Spark framework: Apache HIVE and Spark Thriftserver Configuration
This section presents all the steps taken to allow communication between HDFS and the processing and presentation tool, Power BI. All configurations of the system are presented, as well as explanations of all the choices and steps implemented. Apache Hive is a tool that provides a connector from HDFS to the exterior via the SQL language, using a JDBC (Java Database Connectivity) driver. Like Hadoop, Hive was created by the same organization, providing the best possible compatibility with any framework belonging to Apache. In this case, Hadoop with Hive fits perfectly to create a connection to other tools that need to consume its data.
For that, Hive provides warehousing on top of Hadoop, allowing the data to be queried through its database tables. These tables are composed of all the stock market data that HDFS stores. By default, Hive tables are created using a Derby metastore. The metastore is the repository that stores the metadata (information about a database) for Hive tables, including their schema and location, creating an environment shared with other frameworks. It can also be configured with a MySQL metastore, which is universal and more compatible with any Business Intelligence environment than Derby. Besides that, Derby databases only accept one connection at a time, limiting access to Hive's data and compromising real-time purposes.
The first time this tool was configured, the default metastore schema (Derby) was used, but only Tableau would accept it, denying any kind of communication with the Power BI tool. So, after a reconfiguration of the system, embedding MySQL created an immediate connection. As stated before, HDFS communicates with the outside via Spark's drivers, opening an internal server that makes the Hive database available online; this software is called ThriftServer. The configurations below are made exclusively with the MySQL metastore; the Derby database was discarded after the failed connection attempts and will not be explained.
3.7.0.1 Apache Hive configuration
The first step is to download the latest release from the Hive website (https://hive.apache.org/downloads.html). In this case, release 2.3.7 was downloaded, matching the installed Hadoop version (2.10).
Figure 3.32: Hive downloaded compressed file ( Version 2.3.7) .
After decompressing it, using the terminal command tar xzf apache-hive-2.3.7-bin.tar.gz, the next step is to configure the environment variables, so that the operating system locates the repository. The next lines were added to the .bashrc shell script, to create the Hive environment variables in Linux:
export HIVE_HOME=/home/hadoop/apache-hive-2.3.7-bin
export PATH=$PATH:$HIVE_HOME/bin
The Hadoop environment variables are located in the same file, as stated before in the Hadoop configuration section. After editing the environment script, executing source ~/.bashrc saves the changes. Now the machine knows about the Hive repository and can link it to other processes.
This configuration does not yet create a relationship between Hive and HDFS. For that, some changes are made in the Hive configuration file, hive-config.sh, located in the /home/hadoop/apache-hive-2.3.7-bin/bin folder. The following line was added to this file so that Hive locates the HDFS directory:
export HADOOP_HOME=/home/hadoop/hadoop
Figure 3.33: Hive’s hive-conf.sh file.
Another variable that can be declared is the jars path. This variable exists by default in the system and represents a path for plugin jars provided by the user. In this case, no additional jars are used, so this variable is left unset.
The next step is to create two Hive directories in the HDFS domain: a temporary folder to store intermediate Hive results, used by Hive processes when sending data if necessary, and a warehouse folder to store all Hive tables. The following commands were executed in the shell to perform these steps:
• hdfs dfs -mkdir /user/hive, to create the Hive user directory on the HDFS system;
• hdfs dfs -mkdir /tmp and hdfs dfs -chmod g+w /tmp, to create the temporary folder and grant group write permission on it;
• hdfs dfs -mkdir -p /user/hive/warehouse, to create the warehouse folder under the Hive user;
• hdfs dfs -chmod g+w /user/hive/warehouse, to grant group write permission on that folder.
These directories will be referenced in the ThriftServer configuration XML file in order to communicate with Power BI, allowing remote access to these databases and their tables. The tables contain all previously extracted data, which will be imported using Hive commands, explained below. The default Hive configuration already uses the directory names stated above, so it is easier to create the folders with those names, requiring fewer modifications to the configuration file.
Hive is now ready to use. The next steps are to configure MySQL and ThriftServer together in order to create a metastore server and schema for Hive's tables, allowing their data to be written, read, and connected to Power BI through the server.
3.7.0.2 MySQL and Thriftserver Configuration: Create Metastore Database
Every database has a schema, and before creating Hive's tables, the metastore schema has to be declared and initialized so that any allowed client can access the tables. To define the schema for the metastore database, the command $HIVE_HOME/bin/schematool -initSchema -dbType mysql was executed.
Figure 3.34: MySQL schema metastore creation .
Hive offers a server to run the metastore, which can be started at any time with the command hive --service metastore.
Hive is now ready for table creation on HDFS, and the next step is to configure Spark's ThriftServer to connect to the MySQL metastore schema and, through it, to Power BI or a similar tool via the JDBC driver. Once again, the default Derby metastore schema does not fit this architecture as well as MySQL does, so additional system configurations were made to link MySQL and Hive, explained below.
The first step is to install the MySQL database. It can be installed from the terminal with the command sudo apt-get install mysql-server; the latest version, 8.0, was used. Next, the Java connector is installed to connect to Hive, using the command sudo apt-get install libmysql-java.
So that the connector is linked to Hive, a soft link is created between Hive and MySQL with the command ln -s /usr/share/java/mysql-connector-java.jar $HIVE_HOME/lib/mysql-connector-java.jar.
Hive now has a linked connector for the MySQL schema type. Next, it is important to configure MySQL permissions to allow only internal access to the Hive tables. For that, a user is created in the MySQL environment, and its credentials are used to access the Hive tables via Power BI.
In the MySQL terminal, the following lines were entered:
mysql> CREATE USER 'afonso'@'%' IDENTIFIED BY 'afonso';
mysql> GRANT ALL ON *.* TO 'afonso'@'%';
mysql> FLUSH PRIVILEGES;
The first line creates a user called afonso, with afonso as the password. The second line ensures that afonso can access all tables in the system, and the last line applies the granted permissions in the MySQL database. The expression '@%' means that any machine on the private network can connect to these Hive tables, provided the username and password match those defined above.
The figure below presents the grants created for this user, showing that this user can access and modify any table in the Hive domain.
Figure 3.35: Permissions to new Hive and MySQL user .
The next step is to modify some lines in the hive-site.xml file, located in Hive's conf folder. This file is also replicated to the conf folder in the Spark directory, so that all repositories have a copy of the configuration.
The first modification sets the credentials to access the metastore server. These are only used if the user wants to access it directly; Power BI will connect with the MySQL credentials, so these changes exist only for data protection and will not be used frequently.
Figure 3.36: Metastore server username.
Figure 3.37: Metastore server password.
Figure 3.38: Connection URL.
This property defines the connection URL. It acts as the JDBC connection string and also represents the metastore location.
Figure 3.39: Driver Name.
This is the name of the JDBC driver, which refers to a class in the Java MySQL connector, linking it to Hive and the Spark ThriftServer.
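Taken together, the properties shown in Figures 3.36 to 3.39 correspond to the standard Hive metastore JDBC settings. A sketch of how this section of hive-site.xml typically looks; the property names are standard, but the host, database name, and credentials shown here are illustrative assumptions, not necessarily the exact values used:

```xml
<!-- Sketch of the metastore-related properties in hive-site.xml.
     Host, database name, and credentials are illustrative. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>afonso</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>afonso</value>
</property>
```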
After saving the configurations, it is time to create the Hive tables and test whether MySQL can reach them via the metastore server and expose them externally. The following figure presents the table that will contain all data extracted in December for the Apple stock. This table will be imported into Power BI via ThriftServer, using the MySQL user's credentials, and loaded into Hive through Hive's terminal. This is only one of the multiple Hive tables created: each market has one table per month, containing all extracted records.
Figure 3.40: Created Hive table for Apple’s stock values in December 2020.
If everything went well, MySQL can access this table. To test it, MySQL is opened again and the command use metastore; is executed to switch to the metastore database; then show tables; lists all of the Hive metastore's tables. As the figure below shows, MySQL can access Hive's content, so the process of linking these two technologies was a success.
Figure 3.41: Hive metastore tables listed in MySQL.
The metadata corresponding to these tables is stored in the TBLS table of the MySQL database, so executing select * from TBLS; will list Hive's tables if everything was done correctly.
Figure 3.42: Hive's table in MySQL domain.
Having confirmed that MySQL has connectivity with Hive, the next and final stage in connecting Power BI to HDFS is to configure ThriftServer. This configuration takes place in the same configuration file presented above, hive-site.xml, and the following lines were added to create an HTTP server that allows SQL queries between HDFS and Power BI via JDBC:
Figure 3.43: ThriftServer configuration.
By default, the ThriftServer address is the private address of the machine, in this case 192.168.1.50, with port 10001 as configured.
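The settings referred to here are standard HiveServer2 properties. A sketch of the lines added to hive-site.xml, assuming the HTTP transport described in the text and using the address and port stated above; the exact set of properties in the original file may differ:

```xml
<!-- Sketch of the ThriftServer (HiveServer2) HTTP settings in hive-site.xml.
     Bind host and port follow the values stated in the text. -->
<property>
  <name>hive.server2.transport.mode</name>
  <value>http</value>
</property>
<property>
  <name>hive.server2.thrift.http.port</name>
  <value>10001</value>
</property>
<property>
  <name>hive.server2.thrift.bind.host</name>
  <value>192.168.1.50</value>
</property>
```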
After saving these changes, the last step is to run the ThriftServer script, located in the spark/sbin folder:
./start-thriftserver.sh
This creates a log file in the logs folder, where any failure is reported. ThriftServer has to run in parallel with the command hive --service metastore in order to locate Hive's metadata.
3.7.0.3 Power BI Download and Connection
Finally, the last step needed to see Hive's tables in Power BI is to download the platform and connect to the server.
The download page for this tool is https://powerbi.microsoft.com/en-us/downloads/. The Power BI tool currently exists only for the Windows operating system, so an external computer on the same network is used to perform this last stage. Opening Power BI, in the "Data" toolbar, it is possible to see all connectors available in the system.
Figure 3.44: Connectors available at Power BI.
The "Spark" connector is the one used to connect to HDFS, and next the server credentials are entered.
Figure 3.45: ThriftServer Connection.
Figure 3.46: Credentials to connect to HDFS.
DirectQuery is chosen so that only recent data is refreshed after a change or append of new records. If Import mode were selected, a copy of the entire file would be imported into Power BI, and the objective is to be as quick as possible. After that, a window is prompted with Hive's table, ready to load and process the data.
Figure 3.47: Hive’s table on Power BI : preview
Now, the last phase of the project begins: with the connection to the information on HDFS established via Power BI, the data will be processed and presented to create calculations and plots from the extracted values, studying stock behavior over time.
The next section presents all calculations and modifications made to the extracted data to create a broad view of the evolution of the extracted information, in this case values about stock markets.
3.7.1 Power BI: Data Load and Processing in Real Time
This section represents the final stage of this work, where all extracted and stored data is taken, modified (since the values are in text format) to allow mathematical calculations, and presented in graphs according to the timestamp of the records. This tool is very useful for this kind of methodology, since it is a Business Intelligence framework that allows the creation of visual interfaces over external datasets from all kinds of data sources, such as Spark, MySQL, and local CSV files.
There are two ways of connecting Hive tables to Power BI, as explained before: Import mode and DirectQuery. The latter is simply a connection that only queries the data and does not import the tables into Power BI, so it requires an active connection to the Spark ThriftServer to present the data.
The main objective at this stage is to create dashboards with all extracted data, containing plot visuals and tables with historical values for each month and stock market. Each dashboard presents four plots, one per stock quote source, and a couple of tables previewing the table's content, ordered by the timestamp of each record.
The first step, after a successful connection to the tables from the Hadoop Cluster, is to modify the data types, also known as casting the columns, since the table's columns are expressed as text; this allows calculations to be performed over the values. Columns with price values are cast as currency, percentage columns as percentage values, and the record timestamp is already pre-formatted at file creation time, in Spark's script (MM/DD/YYYY hh:mm), as a valid date format that Power BI can cast. The Power BI tool presents a "Fields" toolbar where it is possible to create measures on queried tables.
Figure 3.48: Power BI Fields toolbar.
For example, to cast all Apple price values extracted from Google Finance, a DAX (Data Analysis Expressions) formula is created. This language is used in Power BI to perform specific queries on data, and it is the only method available to cast values and perform calculations on them. It is possible to cast, create moving averages, count distinct values, among other operations. In this project's case, the most important calculation is the moving average, tracing the evolution of stock values. This calculation can be executed with the help of a DAX function, namely AVERAGEX. The function can run over an entire column, or the information can first be filtered. Here the selection filter is important, because each calculation is made per source: each table holds records for one stock market across all sources, so different values must be measured for each source. The FILTER function performs this selection.
Figure 3.49: Moving average of Google Finance's close price indicator for Apple, in December.
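Conceptually, the AVERAGEX-with-FILTER measure averages one indicator column over only the rows of a single source. A minimal stand-in for that logic in plain Python, with hypothetical column and source names (the real measure is written in DAX over the Hive table):

```python
# Conceptual stand-in for the DAX pattern AVERAGEX(FILTER(table, condition), column):
# keep only the rows of one quote source, then average one indicator column.
# The column and source names ("source", "close_price", "GF") are illustrative.

def filtered_average(rows, source, column):
    """Average `column` over the rows whose source abbreviation matches."""
    values = [row[column] for row in rows if row["source"] == source]
    return sum(values) / len(values) if values else None

table = [
    {"source": "GF", "close_price": 120.0},  # Google Finance record
    {"source": "YF", "close_price": 121.0},  # Yahoo Finance record
    {"source": "GF", "close_price": 122.0},
]

print(filtered_average(table, "GF", "close_price"))  # → 121.0
```

Changing the source argument reproduces the "small modifications" described below: the same formula, pointed at a different source and column.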
Every row in the extracted tables presents a text value, appended to the stock's indicators, with an abbreviation of the record's source, to distinguish the information according to its origin. To create average calculations for every indicator and source, small modifications are made to this formula, changing the selected column and the source name.
Figure 3.50: Hive’s table, with source font’s column.
After performing the required calculations, the final step is to create plots and tables on the dashboard to present the data. For that, there is a "Visualizations" toolbar, where it is possible to create different types of charts, tables, and other presentation formats. For this type of project, since all information relates stock values to time, line charts and a simple table are used to present the data.
Figure 3.51: Line charts and tables implemented in Power BI.
The X-axis uses the time column and the Y-axis the calculated values for each source, so a temporal analysis can be done. The figure above represents the final state of the data in Power BI, where every record of each Hive table is successfully represented as charts or tables, giving a better overview of the results. In the next chapter, all created tables and plots are presented and discussed, commenting on the outcome of this final step.
To conclude, this phase represents the last stage of the implemented architecture, where it is possible to show all data extracted and loaded into the Hadoop Cluster.
Originally, the plan included further calculations on this data, more precisely using Machine Learning algorithms and Artificial Intelligence models, but due to lack of time this idea is left for possible future implementation. Since the extracted values, together with calculations like the moving average, give a good overview of each market's behavior, the Power BI calculations and data previews are enough to draw some conclusions from their study.
Chapter 4
Result of Implementation and Tests
4.1 HDFS architecture availability
This section presents some results to show that the Hadoop Cluster is available to perform every job, together with availability checks on the Spark framework and the Hive server.
Every framework provides a web user interface where it is possible to check whether the service is available. The following figures demonstrate successful communication and availability among all services, verifying the availability of the entire system. To test Hive's connection to the exterior, Hive provides a client for the purpose, namely beeline; it is possible to check whether the Hive JDBC endpoint is running by issuing a test command through it.
Figure 4.1: HDFS availability check
The figures illustrate the correct operation of all the architecture's frameworks, which together create an environment for data analysis, in Power BI or even in the Spark framework itself. Spark has libraries that allow training on data and learning patterns, such as MLlib, but due to lack of time this implementation is left for future work.
Figure 4.2: YARN scheduling and monitoring test.
Figure 4.3: Spark History Server test.
Figure 4.4: Hive server test with beeline.
4.2 HDFS extraction mechanism
The figures below present the result of the data extraction, where it is possible to see where the files are stored on the Hadoop Cluster, as well as a preview of the data in row-column format, ready to be imported into Hive and, consequently, into the Power BI dashboards.
Figure 4.5: Hadoop Folder of Extracted data
Figure 4.6: Output of data in HDFS files.
This mechanism appends new records to the older ones and stores them in the current HDFS file, using an auxiliary file to union the two data sets into one (creating a new file with the old data, creating an object with its contents, and finally writing a new file with the old plus new records). This method is necessary because the Spark framework does not have an embedded function to read and write the same file.
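A minimal sketch of this copy-union-rewrite idea in plain Python, standing in for the real mechanism (which operates on HDFS files through Spark; the file name and columns here are illustrative):

```python
import csv, os, shutil, tempfile

def append_records(path, new_rows, header):
    """Append new_rows to the CSV at `path` using the copy-union-rewrite
    pattern: copy the current file aside, read its rows back, then write
    a fresh file containing the old rows plus the new ones."""
    old_rows = []
    if os.path.exists(path):
        aux = path + ".old"              # auxiliary file with the previous data
        shutil.copy(path, aux)
        with open(aux, newline="") as f:
            reader = csv.reader(f)
            next(reader, None)           # skip the header row
            old_rows = list(reader)
        os.remove(aux)
    with open(path, "w", newline="") as f:   # rewrite: old + new records
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(old_rows + [list(r) for r in new_rows])

# demo: two successive "extractions" accumulate in the same file
demo_path = os.path.join(tempfile.mkdtemp(), "apple_dec.csv")
append_records(demo_path, [["12/01/2020", "120.0"]], ["date", "close"])
append_records(demo_path, [["12/02/2020", "121.0"]], ["date", "close"])
```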
4.3 HDFS Performance Results - Data Extraction and Load: Spark Jobs with CSV Files vs Parquet
To determine the best format to store data on Hadoop, performance tests were made using the Spark framework, to see which format, holding the same information, is faster to process and load into a Hive table. For that, a sample CSV file was created with real records of McDonald's stock prices, extracted from Yahoo Finance.
Then, the row’s files are replicated to create three different sizes of files: one with 1 million
files, another with 10 million, and lastly one file with 100 million rows. The objective is to see
if different sizes of data sets in these two presented formats will have different performance in
the Hadoop environment. The following figures represent the total size of tested files: 1, 10, and
100 Million rows ( it was used 100 different rows and replicated until a number of rows equal the
required amount.).
Figure 4.7: CSV file sizes: 1, 10, and 100 million rows.
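The replication step itself is simple: cycle through the sample rows and write them out until the target row count is reached. A plain-Python sketch of that idea, with illustrative sample data (the real files were generated from the stock records described above):

```python
import itertools, os, tempfile

def replicate_rows(sample_rows, target, out_path):
    """Write `target` rows to out_path, cycling through the sample rows."""
    cycle = itertools.cycle(sample_rows)
    with open(out_path, "w") as f:
        for _ in range(target):
            f.write(next(cycle) + "\n")

# demo with a tiny target instead of 1, 10, or 100 million rows
demo_path = os.path.join(tempfile.mkdtemp(), "mcd_sample.csv")
replicate_rows(["MCD,12/01/2020,210.00", "MCD,12/02/2020,211.50"], 10, demo_path)
```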
It is possible to see the size of each file, where:
• the 1 million row CSV file equals 57956253 bytes, or about 57.96 megabytes;
• the 10 million row CSV file equals 522000043 bytes, or approximately 522 megabytes;
• the 100 million row CSV file equals 5220000043 bytes, or 5.22 gigabytes.
The Spark framework partitions the Parquet files it creates. This method increases the parallelism of Hadoop's architecture, spreading chunks of data across the nodes and preventing single points of failure. As a result, the content of each file is located inside a folder, as the figures below present.
Figure 4.8: 1 Million row Parquet file
Figure 4.9: 10 Million row Parquet file
Figure 4.10: 100 Million row Parquet file
Summing the sizes of each chunk, the conclusion is:
• the 1 million row Parquet file equals 16152 bytes, or about 16.15 kilobytes;
• the 10 million row Parquet file equals 64700 bytes, or approximately 64.70 kilobytes;
• the 100 million row Parquet file equals 595800 bytes, or about 595.8 kilobytes.
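As a quick check of the stated ratio, dividing each Parquet size by the corresponding CSV size from the lists above gives values between roughly 0.01% and 0.03% of the CSV size:

```python
# Parquet-to-CSV size ratios, computed from the byte counts listed above.
sizes = {
    "1M rows":   (57_956_253, 16_152),
    "10M rows":  (522_000_043, 64_700),
    "100M rows": (5_220_000_043, 595_800),
}
for label, (csv_bytes, parquet_bytes) in sizes.items():
    ratio = parquet_bytes / csv_bytes
    print(f"{label}: Parquet is {ratio:.5%} of the CSV size")
```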
The compression ratio with Parquet files, for this example, is approximately 0.02% of the CSV format's size. This is a great compression value, because no data is lost, and Parquet files keep metadata (describing how the table is split into smaller pieces of data) and a default schema, so the values can be read in the same format as CSV: as rows and columns.
When compressing a large amount of data with distinct rows (no repeated rows), the compression rate can be worse (up to about 10% of the CSV's size), according to some studies on data compression with Parquet and CSV discussed in the previous chapter. In this project's case, it was not feasible to perform such tests, because the extracted data does not reach the quantity needed to test at that extreme level. Still, it is concluded that Parquet files offer better performance while remaining as easily readable and writable as CSV.
A couple of tests were also made using Spark's History Server, where it is possible to check job performance for each format type. For that, a small script was created to read each CSV and Parquet file, to see whether there are relevant differences between the formats in terms of processing time.
The following figures show Spark's History Server logs, in its web interface, where it is possible to compare each format's performance.
Figure 4.11: Time performance test in Spark: CSV files.
Figure 4.12: Time performance test in Spark: Parquet files.
It is possible to see that Parquet files are read faster than CSV, taking about 50% less time. This result shows once more that the Parquet format is a better solution than CSV for this project, where time is an important factor, since near real-time analysis is a must in any stock market analysis architecture. So, from these tests, the conclusion is to use the Parquet format for every file, minimizing the space and time used so that more data can be processed in a short time interval.
4.4 Power BI Results
The figure below presents the dashboards created in the Power BI tool to visualize all extracted data. With this, it is possible to obtain a better overview of the data and to derive patterns and calculations from it, helping to understand its evolution.
This tool is essential to empower end-user decisions, giving a good visual presentation of all extracted data and text through charts and other types of visualization. Besides that, as stated before, it is possible to perform calculations on the data, such as moving averages and distinct counts, which is a valid advantage in this project.
Figure 4.13: Power BI Final Dashboard.
Chapter 5
Conclusions and Future Work
5.1 Conclusion
The development of this dissertation enabled me to study a technology that can greatly support any enterprise's business decisions. Hadoop is one of the cheapest ways to create a Big Data solution, and this architecture is compatible with almost every data analysis tool.
In contrast with similar architectures, such as traditional warehouses or cloud-based clusters, HDFS achieves high performance at a small cost. The machines used to create the distributed system have regular specifications and do not require high-performance hardware. The frameworks used to support all communication and data management are free to use and have great online support to help resolve any problem.
The main objective of this work, besides creating a visual decision-making business product, was to study different paths to create an HDFS solution, understanding which methods are more efficient and viable, creating a valid product for the market, and satisfying internal enterprise interests.
Comparisons were also made between different types of HDFS output files, namely Parquet and CSV, which are the most widely used. In this way, it was possible to understand that saving the data in Parquet format is faster and more efficient, since the space required for each file is reduced significantly and its processing time is approximately half the original value. With this, it is possible to store more information in the system with the shortest possible reading/writing time.
Stock exchange market analysis, the topic analyzed here, fits perfectly into the conception of a cluster. It is the ideal architecture for this purpose, since such a study requires a large amount of data to build tables and evolutionary indicators of the stock market. Besides, it requires a fast and accurate solution, so that the results are conclusive and in real time, or nearly so, since the stock market is quite volatile, that is, its values change with high frequency.
This is just one example to test the validity of building a data cluster, but it is the most appropriate one, as there is an interest in further exploring this subject in a business environment. During the project, the Spark tool was also compared to the original MapReduce,
promoted by Hadoop. This comparison was only theoretical, based on external studies, in which Spark had better results in all cases, since its processing takes place in memory instead of on disk. Also, using this tool, it is easier to connect the various steps, namely the data extraction script with the access and loading logic, since the same programming language is used and the tool offers good compatibility. Moreover, with Spark it is easier to create and use predictive models through libraries embedded in the framework, a concept that is not so easy with MapReduce jobs.
So, the combination of all the processes in this work proves to be a good option for creating a Big Data / Business Intelligence solution when time, low cost, and commodity hardware are key indicators for the product. Although all frameworks and components are highly compatible with each other, their configuration was a long process, with ups and downs, where some intermediate steps presented difficulties and problems, compromising a couple of objectives that, if implemented, would have created a better final solution, such as a richer data analysis system. Due to these barriers, data training and complex analysis are postponed to future work.
In conclusion, the project was implemented successfully, with some gaps regarding complex data analysis, but the main idea was proven: the viability of using a low-cost solution to perform Big Data analysis with good performance. This technology will be useful in business, where in the future it will be possible to use this solution to analyze and present the varied data received from the many sources that work with the company in question, since it handles a large amount of information. If required, this architecture could also be replicated for potential customers who need a relatively inexpensive and viable solution to store and process any type of information.
5.2 Future Work
Some points can be highlighted as future work to improve the current solution.
The first would be to carry out more performance tests across different distribution technologies, namely between Hadoop and traditional warehouses: implementing a traditional warehouse in a separate environment and running performance tests with the same amount of information. So far, only theoretical case studies, based on external work, were made.
It would also be interesting to run performance tests between different data processing frameworks similar to the one used, namely between Spark and MapReduce jobs. Again, this analysis was done only in theory, and it would be interesting to have some tangible results for this comparison.
More important than the points above, the main improvement to the created implementation would be to apply Machine Learning to the stored data, using libraries available in the Spark framework, such as MLlib, where the data would be trained and processed beforehand and sent to the Power BI tool with additional columns containing more conclusive values about possible market forecasts.
Due to the problems that arose, only a few calculations were made, and these are limited, since the Power BI tool is not as powerful when it comes to processing information with regressive/predictive training models. However, given the existing possibilities, the best possible work was produced, building a reliable and efficient architecture for storing, loading, and processing a large set of information.
Figure 5.1: MLlib library for data training. [14]
References
[1] DataFlair. Hadoop architecture in detail: HDFS, YARN, MapReduce. Available at: https://data-flair.training/blogs/hadoop-architecture/, (Accessed September 2020).
[2] Dongchul Park and Yang-Suk Kee. In-storage computing for Hadoop MapReduce framework: Challenges and possibilities. page 4, July 2015.
[3] Shubham Sinha. Hadoop ecosystem: Hadoop tools for crunching big data. Available at: https://dzone.com/articles/hadoop-ecosystem-hadoop-tools-for-crunching-big-da, (Accessed September 2020).
[4] Takashi Kimoto, Kazuo Asakawa, Morio Yoda, and Masakazu Takeoka. Stock market prediction system with modular neural networks. page 2.
[5] Zhihao Peng. Stocks analysis and prediction using big data analytics. Technical report, Department of Computer Science, Dalian Neusoft Institute of Information, Dalian 116626, China.
[6] Mrs. Lathika J Shetty and Ms. Shetty Mamatha Gopal. Developing prediction model for stock exchange data set using Hadoop MapReduce technique. Technical report, International Research Journal of Engineering and Technology (IRJET), May 2016.
[7] Mahantesh C. Angadi and Amogh P. Kulkarni. Time series data analysis for stock market prediction using data mining techniques with R. Technical report, Acharya Institute of Technology and Sai Vidya Institute of Technology.
[8] Satish Gopalani and Rohan Arora. Comparing Apache Spark and MapReduce with performance analysis using k-means. Technical report, International Journal of Computer Applications.
[9] Abderrazak Sebaa, Fatima Chikh, Amina Nouicer, and Abdelkamel Tari. Research in big data warehousing using Hadoop. Technical report, LIMED laboratory, Computer Science Department, University of Bejaia, Bejaia, Algeria.
[10] Our World In Data. Internet. Available at: https://ourworldindata.org/internet, (Accessed November 2020).
[11] FlashDBA. The real cost of Oracle RAC. Available at: https://flashdba.com/2013/09/18/the-real-cost-of-oracle-rac/, (Accessed November 2020).
[12] Prosenjit Chakraborty.
[13] Apache Hive architecture. Available at: https://www.tutorialandexample.com/apache-hive-architecture/, (Accessed January 2021).
[14] Introduction of a big data machine learning tool: SparkML. Available at: https://yurongfan.wordpress.com/2017/01/10/introduction-of-a-big-data-machine-learning-tool-sparkml/, (Accessed January 2021).
[15] Hadoop. Apache Hadoop (2020), September 2020. Available at: https://hadoop.apache.org/.
[16] HDFS. Apache Hadoop HDFS (2020), 2020. Available at: http://hadoop.apache.org/hdfs, (Accessed September 2020).
[17] Apache Hadoop. MapReduce tutorial, 2020. Available at: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html, (Accessed September 2020).
[18] Apache Spark. Apache Spark™ is a unified analytics engine for large-scale data processing, 2020. Available at: https://spark.apache.org/, (Accessed September 2020).
[19] Apache Hadoop YARN. Apache Hadoop YARN, 2020. Available at: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html, (Accessed September 2020).
[20] Apache Hive. Apache Hive. Available at: https://hive.apache.org/, (Accessed September 2020).
[21] Intel IT Server. Apache Hive. page 2, March 2013.
[22] Abderrazak Sebaa, Fatima Chikh, Amina Nouicer, and Abdelkamel Tari. Research in big data warehousing using Hadoop. Journal of Information Systems Engineering Management, pages 3-4, March 30, 2017.
[23] Apache Parquet. Apache Parquet. Available at: https://parquet.apache.org/, (Accessed September 2020).
[24] Databricks. What is Parquet. Available at: https://databricks.com/glossary/what-is-parquet, (Accessed September 2020).
[25] Paul D. Yoo, Maria H. Kim, and Tony Jan. Machine learning techniques and use of event information for stock market prediction: A survey and evaluation. Technical report, Faculty of Information Technology, University of Technology, Sydney.
[26] D. E. Rumelhart et al. Parallel distributed processing, vol. 1. page 2, 1986.
[27] Ramon Lawrence. Using neural networks to forecast stock market prices. Technical report, Department of Computer Science, University of Manitoba, December 12, 1997.
[28] Y. Yoon, G. Swales, and T. Margavio. A comparison of discriminant analysis versus artificial neural networks. Technical report, Journal of the Operational Research Society.
[29] Arkilic. Stock price movement prediction using Mahout and Pydoop documentation. Technical report, October 2017.
[30] M.D. Jaweed and J. Jebathangam. Analysis of stock market by using big data processing environment. Technical report.
[31] Data clustering algorithms. Available at: https://sites.google.com/site/dataclusteringalgorithms/k-means-clustering-algorithm, (Accessed November 2020).
[32] HortonWorks. Determine YARN and MapReduce memory configuration settings. URL: https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html [last accessed 2020-12-21].