Facebook data extraction using R & processing in a Data Lake
TRANSCRIPT
All rights reserved © 2017 | Irisidea Technologies Private Limited | www.irisidea.com
Facebook data extraction using R & processing in a Data Lake
An approach to understand how retail companies perform Facebook data mining to analyze customers' behavioral patterns
By Gautam Goswami
By OnlineGuwahati Team
Abstract
Without visibility on social media, it is extremely tough for brands to interact with the under-the-radar influencers who mold the future. Nowadays, almost 94 per cent of buying decisions are influenced by the exponential growth of user participation on social media (mainly on Facebook). Facebook plays a critical role in increasing brand awareness. Businesses need to transform digital-native customers into brand advocates, and this can only be done if the relationship has been nurtured. The key for brands is to encourage consumers to endorse the brand and play a real part within the business.
Facebook data mining is becoming a major factor in making accurate business decisions, through analysis of user posts, comments, likes, shares, etc., as well as sentiment analysis on the business page. To analyze the data, we need a proper mechanism to extract it from the business page. With effective use of R, a programming language for statistical analysis, and the Hadoop Distributed File System (HDFS) with an ELT (Extraction, Loading and Transformation) approach, this e-book describes in detail how we can perform Facebook data mining.
Part 1
Facebook application creation and R installation
Before creating a Facebook application, we need a fair knowledge of the Facebook platform. Facebook has developed a set of application programming interfaces (APIs) and tools for third-party developers so that they can create applications that leverage and interact with core Facebook features. This entire set of APIs and tools, as a whole, is denoted the Facebook platform. The following are the high-level components consolidated in the Facebook platform.
- The Graph API can be used by application developers to read data from and write data to Facebook. It presents an overview of the Facebook social graph and the relationships between the entities therein.
- Authentication allows applications to interact with Facebook and users to sign on to various applications through Facebook via a PC, mobile phone or desktop app.
- Social plug-ins, such as the "Like" button, allow application developers to give their users a social experience through Facebook without gaining access to Facebook users' information.
- The Open Graph protocol can be used by application developers to integrate their pages with Facebook.
- IFrames can be used to create applications that are accessed via Facebook login but are hosted separately from Facebook.
- Microformats allow Facebook users to export details such as events to their own calendars or to mapping applications.
Facebook application creation
A Facebook application is a small application developed for Facebook profiles. As a first step, we need an account with Facebook. Log in to developers.facebook.com if you already have an account.
After a successful login, the dropdown box at the top-right corner changes to My Apps. On expanding it, we get the option to add a new application, as shown in the picture below.
Now we can create a new application by providing a few inputs. Application categories are already defined internally by Facebook. Based on our category and display-name selection, an ID is generated and assigned to the application. Since we are going to extract page data, the category selected from the dropdown box should be "Apps for the page".
After clicking "Create App Id", a new App ID is generated, with options to execute multiple operations on it.
To get all the mandatory information about the application, we navigate to the "Dashboard" that appears on the left side, below the application name. The App ID and App Secret are mandatory and sensitive pieces of information and hence should be copied into a safe file to avoid disclosure to others. Now we proceed to the "Settings" panel to add a platform, which will be "Website".
We are almost done with the application creation, except for the Website Site-URL input. Leave the browser open, without signing out from developers.facebook.com.
The next step is R installation and RStudio setup.
R installation and RStudio setup
R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of platforms, viz. UNIX, Windows and macOS. R provides effective data handling and storage facilities, with a suite of operators for calculations on arrays, in particular matrices. It is also a simple and effective programming language that includes conditionals, loops, user-defined recursive functions, and input and output facilities.
To download R, we can choose a preferred CRAN mirror. Here is the URL of the mirror list:
https://cran.r-project.org/mirrors.html.
If we choose Windows as the operating system, the installer appears as R-3.3.3-win.exe after download. The picture below shows how the R console looks after a successful installation.
RStudio is an integrated development environment (IDE) for R. It makes R easier to use. RStudio is a consolidated platform that includes a code editor and debugging and visualization tools. Its key components include a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, history and workspace management. RStudio is available in open source and commercial editions and runs on the desktop (Windows, Mac, and Linux) or in a browser connected to RStudio Server or RStudio Server Pro (Debian/Ubuntu, RedHat/CentOS, and SUSE Linux). As a beginner/learner, you can go for the open source edition and get it from
https://www.rstudio.com/products/rstudio/download/.
Note:- Without a proper installation of R, RStudio doesn't work, because RStudio runs on top of the R environment. We can compare this with the Eclipse IDE, where the Java runtime must be installed first in order to run it.
The picture above shows how RStudio looks in a Windows environment. Even though several basic packages come with RStudio by default, we need to install the "Rfacebook" package from CRAN separately. Additional packages can be installed from the top menu via "Tools" -> "Install Packages". Here we need "Rfacebook".
After a successful installation, it is advisable to check whether all the required packages are available. They are visible in the bottom-right pane; here is the list of packages.
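As a sketch, the same installation can also be done directly from the R console (the package name is as published on CRAN at the time of writing):

```r
# Install Rfacebook and its dependencies from the chosen CRAN mirror
install.packages("Rfacebook")

# Load it once to confirm the installation succeeded
library(Rfacebook)
```

If the library loads without errors, the package and its dependencies are in place.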
Finally, we can see a confirmation message in the RStudio editor.
Part 2
Generate an OAuth token and assign it to the Facebook R session.
R internally makes calls to the Facebook API to get all the relevant information, but each call/invocation to the Facebook API has to be authenticated first. That means, before releasing any data from its repository, the Facebook API verifies whether its methods are being invoked from a trusted source.
Once we generate an OAuth token using the App ID and App Secret obtained in Part 1, R internally uses it during the session to invoke various methods on the Facebook API.
Following are the steps for OAuth token creation; the token will subsequently be assigned to the R session. Start RStudio and load the library "Rfacebook".
Load the library in the RStudio editor:
> library("Rfacebook")
Pass the values of the App ID and App Secret to the fbOAuth method as parameters; the returned value should be stored in a local variable, say og_oauth. The values of the App ID and App Secret were already noted in Part 1.
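A minimal sketch of this call, with placeholder strings standing in for the real App ID and App Secret noted in Part 1:

```r
library(Rfacebook)

# Replace the placeholders with your own App ID and App Secret
og_oauth <- fbOAuth(app_id     = "123456789012345",
                    app_secret = "0123456789abcdef0123456789abcdef")
```

Running this starts the browser-based authentication flow described next.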
After entering the above command, a site URL, http://localhost:1410, is shown; we need to add it to the Facebook application created in Part 1 and click the "Save changes" button on the Facebook application page.
After that, press Enter in the RStudio console. A new browser tab immediately opens with a Facebook page asking for the password again. Once it is entered, we see the message "Authentication complete. Please close this page and return to R." on the page.
We can save the variable "og_oauth" to a file so that it can be reused in future sessions as the token argument of Rfacebook functions, using the following command in the RStudio editor.
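Sketched with an assumed file name:

```r
# Persist the OAuth token for later sessions
save(og_oauth, file = "og_oauth")

# In a future session, restore it with:
load("og_oauth")
```

After load(), og_oauth is available again without repeating the browser flow.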
For testing, we can extract information about the logged-in user, viz. the user's name, total likes, etc. To do that, we need to invoke the getUser() method, passing the User Token of the created Facebook application.
To get the User Token, we navigate to the Tools & Support menu on developers.facebook.com, where the application was created.
Copy the User Token and pass it as a parameter in the RStudio editor. "myself" is the local variable/object that the getUser() method returns.
If we type "myself$name" in the editor, the name of the logged-in user is displayed.
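As an illustrative sketch (the variable name user_token is assumed to hold the copied User Token string):

```r
library(Rfacebook)

# "me" refers to the user who authorized the token
myself <- getUser("me", token = user_token)

# Display the logged-in user's name
myself$name
```

If the name prints correctly, the token and session are working.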
As the connection between the Facebook API and R is now established, we are ready to extract data from any company's Facebook page for analysis.
Part 3
Extraction of posts/data from a public Facebook page
Here we consider an electronic-commerce company's pages, in order to precisely analyze customers' feedback, sentiments and other information. In a real business, a group of customers or other interested individuals gives companies a convenient way to keep members informed about products and services and to share information. Besides, posts and comments in groups on a Facebook page help companies understand customers'/buyers' expectations. To attract customers from rival companies and retain their own, companies need to enhance/improve products/services in an innovative way, referring to the criticism and complaints posted on rival companies' Facebook pages.
Once those posts/information are extracted and analyzed, it becomes very easy to decide on the additional steps to be performed that the rival companies have overlooked.
As an example, consider Snapdeal's public page: https://www.facebook.com/Snapdeal (page ID 471784335392).
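A sketch of pulling recent posts from such a public page with Rfacebook's getPage() function (the page name and post count here are illustrative, and og_oauth is the token created in Part 2):

```r
library(Rfacebook)

# Fetch up to 100 recent posts from the public page
page_posts <- getPage("Snapdeal", token = og_oauth, n = 100)

# Each row holds one post: message, created_time, likes_count,
# comments_count, shares_count, etc.
head(page_posts)
```

The resulting data frame is the raw extract that we will load into the Data Lake in Part 4.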
The data we have extracted from Facebook is unstructured, and it can neither be processed in a traditional RDBMS (databases like Oracle, MySQL, IBM's DB2, etc.) nor mined there to understand customers' behavior.
In a similar way, we can consolidate a huge volume of Facebook data from companies' pages. Besides, Twitter streaming data is another source from which we can analyze customers' sentiments, their views, etc. The data produced by the Twitter stream is semi-structured, namely JSON (JavaScript Object Notation), which is lightweight compared to XML.
Part 4
Processing data by adopting ELT in a distributed manner on the Data Lake.
The basic concept of a Data Lake is that we can use the ELT (Extraction, Loading and then Transformation) approach as opposed to the traditional ETL (Extraction, Transformation and then Loading) process. The ETL process applies to traditional data-warehousing systems, where a structured (row-and-column) data format is followed. The main characteristic of a Data Lake is that data is not classified when it gets persisted, and consequently the up-front data preparation, cleansing and subsequent transformation activities are eliminated. In a Data Warehouse, these activities consume the major part of the whole process. Storing Facebook comments, reviews, shares, likes, etc. in their rawest form enables business analysts of retail organizations, with multi-channel e-commerce as a separate vertical, to find answers in the data to questions they do not yet know. By leveraging HDFS (Hadoop Distributed File System), we can build a Data Lake to store data in any format for processing and analysis. Data can be loaded directly into the lake without transformation; transformation can later be performed on demand. There are a few components with which we can ingest data (independent of format) into the lake, such as Flume, Sqoop, and efficient file-transfer mechanisms.
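As an illustrative sketch (the file name and HDFS path are assumptions, and page_posts is the extract from Part 3), the raw data can be written to a local file and loaded, untransformed, into the lake:

```r
# Write the raw Rfacebook extract to a local CSV file, as-is
write.csv(page_posts, file = "snapdeal_posts.csv", row.names = FALSE)

# Load the file into HDFS without transformation; transformation
# happens later, on demand (the ELT approach)
system("hdfs dfs -mkdir -p /datalake/facebook/raw")
system("hdfs dfs -put snapdeal_posts.csv /datalake/facebook/raw/")
```

This keeps the rawest form of the data in the lake, which is the defining characteristic of ELT described above.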
[Figure: an HDFS cluster of nodes forming the Data Lake]
Since we are not going to analyze run-time streaming data, we follow a batch-processing approach, in which the data first gets persisted in the lake. As a next step, all junk, invalid and noisy data should be removed; this can be implemented using a chain of multiple Map-Reduce jobs, which is also denoted a data pipeline.
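The kind of cleansing such a pipeline performs can be sketched in R on a small sample (page_posts is assumed to be the extract from Part 3; the noise rules are illustrative):

```r
# Drop rows with empty or missing messages
clean_posts <- page_posts[!is.na(page_posts$message) &
                          nchar(page_posts$message) > 0, ]

# Strip URLs and control characters that add noise to text analysis
clean_posts$message <- gsub("http[s]?://\\S+", "", clean_posts$message)
clean_posts$message <- gsub("[[:cntrl:]]", " ", clean_posts$message)
```

At production scale, the same rules would run as Map-Reduce jobs over the files persisted in HDFS.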
Data quality is a major challenge in a data-lake implementation, even though the lake makes it possible to store all the data, ask complex and radically bigger business questions, and find hidden patterns and relationships in the data. After the data has been crunched successfully and in an optimized way, responsibility passes to the data scientists, who can perform ad-hoc queries and build advanced models at any time, iteratively.
By adopting machine-learning techniques, we can start to explore the data, as well as build predictive models of customers' interest in buying products. If our objective is to build a model that can predict which products a customer will buy, we proceed with a supervised approach. An unsupervised approach like clustering or PCA will likely mix the information about products customers are interested in with other, unrelated products and generate a worse predictor. By leveraging supervised machine-learning techniques on all the accumulated historical information, we can predict trends and the effects of seasonality for planning and forecasting.
On top of that, natural language processing (NLP) can be effectively utilized to analyze consumer sentiments and to feed those as inputs to the planning and forecasting process, for example to organize products in the "Related Products" or "You may also like" channel of an e-commerce portal.
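A toy sketch of such sentiment scoring over the cleansed posts (the word lists are illustrative stand-ins for a proper sentiment lexicon, and clean_posts is an assumed data frame of cleaned messages):

```r
# Minimal word-list sentiment lexicon (illustrative only)
positive <- c("good", "great", "love", "excellent", "happy")
negative <- c("bad", "poor", "hate", "terrible", "worst")

# Score one message: positive-word count minus negative-word count
score_post <- function(text) {
  words <- tolower(unlist(strsplit(text, "[^a-zA-Z]+")))
  sum(words %in% positive) - sum(words %in% negative)
}

sentiment <- sapply(clean_posts$message, score_post)
summary(sentiment)
```

Real systems would replace the word lists with a full lexicon or a trained model, but the pipeline shape stays the same.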
Retailers can gain insight into their customers as well as visitors by performing forecasting and planning through machine learning on the extracted Facebook data, which in turn helps to improve performance and profitability. By applying automated, complex tasks such as NLP to Facebook's massive unstructured data sets, actionable intelligence can be developed to achieve a better understanding of customers.
In addition, the information derived from these data sets by machine learning can be leveraged to channel the business and operational restructuring that improves business decisions.
In this e-book, we have just discussed an approach to analyzing Facebook data in a Data Lake using machine learning, so it cannot be taken as a cookbook for implementation in real scenarios. A large number of technical steps, with complete descriptions and configuration, are involved that we have not covered here. This e-book is presented to give a flavor of mining Facebook's unstructured data using R and Hadoop (a popular Big Data processing framework).