
All rights reserved © 2017 | Irisidea Technologies Private Limited | www.irisidea.com

Facebook data extraction using R & process in Data Lake

An approach to understand how retail companies can perform Facebook data mining to analyze customers' behavioral patterns

By Gautam Goswami

By OnlineGuwahati Team


Abstract

Without visibility on social media, it is extremely tough for brands to interact with the below-ground influencers who mold the future. Nowadays almost 94 per cent of buying decisions are influenced by the exponentially growing user participation on social media (mainly on Facebook). Facebook plays a critical role in increasing brand awareness. Businesses need to transform digital-native customers into brand advocates, and this can only be done if the relationship has been nurtured. The key for brands is to encourage consumers to endorse the brand and play a real part within the business.

Facebook data mining is becoming a major factor in making accurate business decisions, through analysis of user posts, comments, likes, shares etc. as well as sentiment analysis on the business page. To analyze the data, we need a proper mechanism to extract it from the business page. This e-book describes in detail how we can perform Facebook data mining by making effective use of R, a programming language for statistical analysis, and the Hadoop Distributed File System (HDFS) with an ELT (Extraction, Loading and Transformation) approach.

Part 1

Facebook application creation and R installation

Before creating a Facebook application, we need a fair knowledge of the Facebook platform. Facebook has developed a set of application programming interfaces (APIs) and tools for third-party developers so that they can create applications that leverage and interact with core Facebook features. This entire set of APIs and tools is collectively called the Facebook platform. The following high-level components make up the Facebook platform.

- Graph API can be used by application developers to read data from and write data to Facebook. Graph API provides a view of the Facebook social graph and the relationships between the entities in it.

- Authentication allows applications to interact with Facebook, and users to sign on to various applications through Facebook via a PC, mobile phone or desktop app.

- Social plug-ins, such as the "Like" button, allow application developers to give their users a social experience through Facebook without gaining access to Facebook users' information.

- Open Graph protocol can be used by application developers to integrate their pages with Facebook.

- IFrames can be used to create applications that are accessed via Facebook login but are hosted separately from Facebook.

- Microformats is a component that allows Facebook users to move event details to their own calendars or to mapping applications.

Facebook application creation

A Facebook application is a small application developed for Facebook profiles. As a first step, we need to have an account with Facebook. If we already have an account, we log in to developers.facebook.com.

After successful login, the dropdown box at the top right-hand corner changes to My Apps. On expanding it, we get the option to add a new application, as shown in the picture below.


Now we can create a new application by providing a few inputs. Applications are already built internally by Facebook; based on the category and display name we select, an ID is generated and assigned to the application. Since we are going to extract page data, the category selected from the dropdown box should be "Apps for the page".

After clicking on "Create App ID", a new App ID is generated, with options to execute multiple operations on it.

To get all the mandatory information about the application, we navigate to "Dashboard", which appears on the left side below the application name. App ID and App Secret are mandatory and sensitive, and hence need to be copied to a safe file to avoid disclosure to others. Next, we proceed to the "Settings" panel to add a platform, which should be "Website".


We are almost done with the application creation, except for the Website Site URL input. Leave the browser open without signing out from developers.facebook.com.

The next step is R installation and RStudio setup.

R installation and R-Studio setup

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of platforms, viz. UNIX, Windows and MacOS. R provides effective data handling and storage facilities, with a suite of operators for calculations on arrays, in particular matrices. It is also a simple and effective programming language that includes conditionals, loops, user-defined recursive functions, and input and output facilities.

To download R, we can choose a preferred CRAN mirror from https://cran.r-project.org/mirrors.html.

If we choose Windows as the operating system, the downloaded installer appears as R-3.3.3-win.exe. The picture below shows how the R console looks after successful installation.


RStudio is an integrated development environment (IDE) for R that makes R easier to use. It is a consolidated platform that includes a code editor and debugging and visualization tools. Its key components are a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, history and workspace management. RStudio is available in open source and commercial editions and runs on the desktop (Windows, Mac, and Linux) or in a browser connected to RStudio Server or RStudio Server Pro (Debian/Ubuntu, RedHat/CentOS, and SUSE Linux). As a beginner or learner, you can go for the open source edition and get it from https://www.rstudio.com/products/rstudio/download/.

Note: RStudio does not work without a proper installation of R, because RStudio runs on top of the R environment. This is comparable to the Eclipse IDE, where a Java runtime must be installed first in order to run it.


The picture above shows how RStudio looks in a Windows environment. Even though several basic packages come by default with R, we need to install the "Rfacebook" package from CRAN separately. Additional packages can be installed from the top menu via "Tools" -> "Install Packages". Here we need "Rfacebook".

After installation, it is mandatory to check whether all the required packages are available. They are visible in the lower-right pane, and here is the list of packages.
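The availability check can also be scripted from the console. The following is a minimal sketch, assuming that Rfacebook relies on the httr, rjson and httpuv packages (this dependency list is an assumption, and `missing_packages` is a hypothetical helper name, not part of any package):

```r
# Packages assumed to be needed for the Facebook session (assumption:
# Rfacebook plus the helper packages it relies on).
required <- c("Rfacebook", "httr", "rjson", "httpuv")

# Hypothetical helper: return the names of packages that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from the chosen CRAN mirror
} else {
  message("All required packages are available.")
}
```

The install call is left commented out so that running the sketch only reports what is missing and changes nothing on the machine.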


Finally, we can see a message in the RStudio editor.

Part 2

Generate and assign OAuth token to Facebook R session.

R internally makes calls to the Facebook API to get all the relevant information, but every call to the Facebook API has to be authenticated first. That means that, before releasing any data from its repository, the Facebook API verifies that its methods are being invoked from a trusted source.

Once we generate an OAuth token using the App ID and App Secret as explained in Part 1, R uses it internally during the session to invoke various methods of the Facebook API.

The following steps create the OAuth token, which is subsequently assigned to the R session. Start RStudio and load the library "Rfacebook".


Load the library in the RStudio editor with

> library("Rfacebook")

Pass the values of App ID and App Secret as parameters to the method fbOAuth, and store the returned value in a local variable, say og_oauth. The values of App ID and App Secret were already noted in Part 1.
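A minimal sketch of this step follows. The wrapper name `create_fb_oauth` and the environment variable names FB_APP_ID and FB_APP_SECRET are assumptions introduced for illustration; only `fbOAuth()` with its `app_id` and `app_secret` arguments comes from the Rfacebook package. Reading the credentials from the environment is one possible way to keep the App Secret out of scripts.

```r
# Hypothetical wrapper around Rfacebook::fbOAuth(). The App ID and App Secret
# are read from environment variables (FB_APP_ID / FB_APP_SECRET are assumed
# names) so that they never appear in a script.
create_fb_oauth <- function(app_id = Sys.getenv("FB_APP_ID"),
                            app_secret = Sys.getenv("FB_APP_SECRET")) {
  if (!nzchar(app_id) || !nzchar(app_secret)) {
    stop("App ID and App Secret must both be provided.")
  }
  if (!requireNamespace("Rfacebook", quietly = TRUE)) {
    stop("Package 'Rfacebook' is not installed.")
  }
  # Starts the interactive OAuth handshake for the given application.
  Rfacebook::fbOAuth(app_id = app_id, app_secret = app_secret)
}

# Interactive use (not run here, it needs real credentials and a browser):
# og_oauth <- create_fb_oauth()
```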

After entering the above command, a site URL is generated as http://localhost:1410. We need to add it to the Facebook application created in Part 1 and click the "Save changes" button on the Facebook application page.


After that, press any key or Enter in the RStudio editor. A new browser tab immediately opens with a Facebook page asking for the password again. Once it is entered, we see the message "Authentication complete. Please close this page and return to R." on the page.

We can save the variable "og_oauth" to a file from the RStudio editor so that it can be re-used in future sessions, where it can be passed as the token to functions.
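The exact command is not reproduced in this transcript. One way to persist and restore an R object is base R's `save()` and `load()`; the sketch below assumes that approach and uses a placeholder list in place of the real token and a temporary file path (both are assumptions for illustration):

```r
# Placeholder standing in for the token returned by fbOAuth().
og_oauth <- list(app = "placeholder-app", note = "not a real token")

# Hypothetical file location; a real session would use a private, fixed path.
token_file <- file.path(tempdir(), "og_oauth")

save(og_oauth, file = token_file)   # persist the object
rm(og_oauth)                        # simulate a fresh session
load(token_file)                    # restores the variable `og_oauth`

print(names(og_oauth))
```

Because the saved file holds the credentials for the session, it deserves the same care as the App Secret itself.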

For testing, we can extract information about the logged-in user, viz. the name of the user, total likes etc. To do that, we invoke the getUser() method, to which the User Token of the created Facebook application has to be passed.

To access the User Token, we navigate to the Tools & Support menu in developers.facebook.com, where the application was created.


Copy the User Token and pass it as a parameter in the RStudio editor. Here "myself" is a local variable/object holding the value that the getUser() method returns.

If we type "myself$name" in the editor, the name of the logged-in user is displayed.
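As a rough sketch of this test call: `getUser()` is the Rfacebook function named in the text, while the helper `get_logged_in_user`, its `fetch` parameter, the stand-in function and the placeholder token are assumptions added so that the example can be tried without a network connection.

```r
# Hypothetical helper: fetch the profile of the logged-in user.
# `fetch` defaults to Rfacebook::getUser and is a parameter only so that the
# helper can be tried out offline with a stand-in function.
get_logged_in_user <- function(token, fetch = Rfacebook::getUser) {
  stopifnot(!is.null(token))
  fetch("me", token = token)
}

# Offline demonstration with a stand-in for the API call (made-up values).
fake_user_fetch <- function(users, token) {
  data.frame(id = "0", name = "Example User", stringsAsFactors = FALSE)
}
myself <- get_logged_in_user(token = "USER_TOKEN_PLACEHOLDER",
                             fetch = fake_user_fetch)
print(myself$name)
```

In a real session the `fetch` argument would be omitted and the copied User Token passed as `token`.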

As the connection between the Facebook API and R is now established, we are ready to extract data from any company's Facebook page for analysis.

Part 3

Extraction of post/data from the public facebook page

Here we consider the pages of electronic commerce companies in order to analyze customers' feedback, sentiments and other information precisely. In real-world business, groups of customers or other interested individuals give companies a convenient way to keep the members informed about products and services and to share information. Besides, posts and comments in groups on a Facebook page help companies to understand customers'/buyers' expectations. To attract customers from rival companies and to retain their own, companies need to improve their products and services in innovative ways, referring to the criticism and complaints posted on rival companies' Facebook pages.


Once those posts are extracted and analyzed, it becomes much easier to decide on additional steps that the rival companies have overlooked.

https://www.facebook.com/Snapdeal 471784335392
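The extraction call itself is not reproduced in this transcript. A hedged sketch of how posts from such a page could be pulled with Rfacebook's `getPage()` and written out unchanged is shown below. The helper `extract_page_posts`, its `fetch` parameter, the output file name and the sample rows are assumptions for illustration; the stand-in function only mimics a few of the columns a real call would return.

```r
# Hypothetical helper: pull the latest posts of a public page and store them
# unchanged as a CSV file that can later be copied into the Data Lake.
extract_page_posts <- function(page, token, n = 100, out_file,
                               fetch = Rfacebook::getPage) {
  posts <- fetch(page, token = token, n = n)
  write.csv(posts, file = out_file, row.names = FALSE)
  invisible(posts)
}

# Offline demonstration with a stand-in for the API call (made-up rows).
fake_page_fetch <- function(page, token, n) {
  data.frame(
    from_name = page,
    message = c("Sample post one", "Sample post two"),
    likes_count = c(10, 3),
    stringsAsFactors = FALSE
  )
}
out_file <- file.path(tempdir(), "page_posts.csv")
posts <- extract_page_posts("Snapdeal", token = "USER_TOKEN_PLACEHOLDER",
                            n = 2, out_file = out_file,
                            fetch = fake_page_fetch)
print(nrow(read.csv(out_file)))
```

Writing the posts to a flat file without any cleaning matches the load-first approach taken in Part 4.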

The data we have extracted from Facebook is unstructured. It cannot be processed in a traditional RDBMS (databases like Oracle, MySQL, IBM's DB2 etc.), nor mined there to understand customers' behavior.

In a similar way we can consolidate a huge volume of Facebook data from companies' pages. Besides, Twitter streaming data is another source from which we can analyze customers' sentiments, views etc. The data produced by Twitter streaming is semi-structured, in JSON (JavaScript Object Notation), which is lightweight compared to XML.

Part 4

Process data by adopting ELT in a distributed manner on Data Lake.

The basic concept of a Data Lake is that we can use the ELT approach (Extraction, Loading and then Transformation) instead of the traditional ETL process (Extraction, Transformation and then Loading). ETL applies to traditional data warehousing systems, where data follows a structured format (rows and columns). The main characteristic of a Data Lake is that data is not classified when it is persisted, and because of that the data preparation, cleansing and subsequent transformation activities are eliminated at load time. In a Data Warehouse, these activities consume the major part of the time of the whole process. Storing Facebook comments, reviews, shares, 'Likes' etc. in their rawest form enables business analysts, in retail organizations that run multi-channel e-commerce as a separate vertical, to find answers in that data to questions they do not know yet. By leveraging HDFS (Hadoop Distributed File System), we can develop a Data Lake that stores data of any format for processing and analysis. Data can be loaded directly into the lake without transformation, and transformation can be performed later on demand. A few components are available with which we can ingest data (independent of any format) into the lake, such as Flume, Sqoop, efficient file transfer mechanisms etc.

[Figure: HDFS nodes]
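The load-first, transform-later idea can be illustrated on a very small scale. In the sketch below a local temporary directory stands in for HDFS and a handful of made-up comments stand in for extracted Facebook data; both are assumptions for illustration, and `transform_on_demand` is a hypothetical helper name.

```r
# Toy "lake": a local directory stands in for HDFS (assumption for this sketch).
lake_dir <- file.path(tempdir(), "data_lake_raw")
dir.create(lake_dir, showWarnings = FALSE, recursive = TRUE)

# Extract + Load: persist the records exactly as received, no cleansing.
raw_records <- c("  Great SERVICE!  ", "", "delivery was late", "Great service!")
raw_file <- file.path(lake_dir, "comments.txt")
writeLines(raw_records, raw_file)

# Transform: done later, on demand, when a question is asked of the data.
transform_on_demand <- function(path) {
  x <- readLines(path)
  x <- tolower(trimws(x))
  unique(x[nzchar(x)])
}

print(transform_on_demand(raw_file))
```

The raw file is never modified, so a different transformation can be applied later to answer a different question.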


Since we are not going to analyze run-time streaming data, we follow a batch processing approach in which data is first persisted in the lake. As a next step, all junk, invalid and noisy data should be removed. This can be implemented as a chain of multiple MapReduce jobs, which is also referred to as a data pipeline.
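The shape of such a chain can be sketched in memory with base R's `Map`, `Filter` and `Reduce`. This is only an analogue of the pattern, not a Hadoop job; the sample comments and the rule that a record with fewer than two words is junk are assumptions made for illustration.

```r
# Made-up comments standing in for extracted Facebook data.
comments <- c("Love the new phone!!", "", "asdfgh", "Phone arrived late",
              "love the discount")

# Split a cleaned string into words.
split_words <- function(x) strsplit(trimws(x), " +")[[1]]

# Stage 1 (cleansing): normalise text, then drop empty and junk records.
# A record is treated as junk here if it has fewer than two words (toy rule).
normalised <- Map(function(x) gsub("[^a-z ]", "", tolower(x)), comments)
cleaned <- Filter(function(x) length(split_words(x)) >= 2, normalised)

# Stage 2 (map): emit a named vector of word counts for each cleaned record.
count_words <- function(x) {
  words <- split_words(x)
  vapply(unique(words), function(w) sum(words == w), numeric(1))
}
mapped <- Map(count_words, cleaned)

# Stage 2 (reduce): merge the per-record counts into one total per word.
merge_counts <- function(a, b) {
  words <- union(names(a), names(b))
  vapply(words, function(w) sum(a[w], b[w], na.rm = TRUE), numeric(1))
}
word_counts <- Reduce(merge_counts, mapped)
print(sort(word_counts, decreasing = TRUE))
```

Each stage consumes the output of the previous one, which is what makes the sequence a pipeline.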

Data quality is a major challenge in a Data Lake implementation, even though the lake makes it possible to store all the data, ask complex and radically bigger business questions, and find hidden patterns and relationships in the data. After the data has been crunched in an optimized way, responsibility passes to the data scientists, who can run ad-hoc queries and build advanced models at any time, iteratively.

By adopting machine learning techniques, we can start to explore the data and build predictive models of customers' interest in buying products. If our objective is to build a model that predicts which products a customer will buy, we should proceed with a supervised approach. An unsupervised approach such as clustering or PCA will likely mix the information about products the customers are interested in with other, unrelated products and produce a worse predictor. Supervised machine learning techniques, applied to all the accumulated historical information, can predict trends and seasonal effects for planning and forecasting.
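To make the supervised approach concrete, the sketch below fits a logistic regression with base R's `glm()` on a made-up, randomly generated training set. The features (likes and comments per customer), the relationship used to generate the labels and the choice of logistic regression are all assumptions for illustration, not the method of this e-book.

```r
set.seed(42)  # reproducible toy data

# Made-up training set: each row is a customer, `bought` is the known label.
n <- 200
likes    <- rpois(n, lambda = 5)     # likes on product posts
comments <- rpois(n, lambda = 2)     # comments on product posts
# Assumed relationship, used only to generate labels for this sketch.
p_buy  <- plogis(-3 + 0.4 * likes + 0.5 * comments)
bought <- rbinom(n, size = 1, prob = p_buy)
train  <- data.frame(likes, comments, bought)

# Supervised learning: fit a logistic regression on the labelled examples.
model <- glm(bought ~ likes + comments, data = train, family = binomial)

# Predict the purchase probability for two new (hypothetical) customers.
new_customers <- data.frame(likes = c(1, 12), comments = c(0, 6))
pred <- predict(model, newdata = new_customers, type = "response")
print(round(pred, 2))
```

The labels are what make this supervised: the model learns from customers whose purchase outcome is already known.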

On top of that, natural language processing (NLP) can be used effectively to analyze consumer sentiments and apply them as inputs to the planning and forecasting process, for example to organize products in the "Related Products" or "You may also like" channel of the e-commerce portal. By performing forecasting and planning through machine learning on the extracted Facebook data, retailers can gain insight into their customers as well as their visitors, which in turn helps to improve performance and profitability. By applying automated complex tasks such as NLP to massive collections of unstructured Facebook data, actionable intelligence can be developed to achieve a better understanding of customers.
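A very simple form of sentiment analysis is lexicon-based scoring, sketched below in base R. The word lists and the sample comments are tiny, made-up assumptions for illustration, and `sentiment_score` is a hypothetical helper name; a real analysis would rely on a far richer lexicon or a trained model.

```r
# Tiny hypothetical lexicons; a real analysis would use a much larger one.
positive_words <- c("good", "great", "love", "fast")
negative_words <- c("bad", "late", "poor", "broken")

# Score = number of positive words minus number of negative words.
sentiment_score <- function(text) {
  words <- strsplit(gsub("[^a-z ]", "", tolower(text)), " +")[[1]]
  sum(words %in% positive_words) - sum(words %in% negative_words)
}

# Made-up comments standing in for extracted Facebook data.
comments <- c("Great phone, love it!",
              "Delivery was late and box was broken",
              "Received the order")
scores <- vapply(comments, sentiment_score, numeric(1), USE.NAMES = FALSE)
print(scores)
```

A positive score suggests a favorable comment, a negative score an unfavorable one, and zero a neutral or unrecognized one.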

In addition, the information obtained from the collected datasets through machine learning can be leveraged to steer business and operational reorganization and so improve business decisions.

In this e-book, we have only discussed an approach to analyzing Facebook data in a Data Lake using machine learning, so it cannot be taken as a cookbook for implementation in real scenarios. A large number of technical steps, with their complete description and configuration, are involved that we have not covered here. This e-book is presented to give a flavor of mining Facebook's unstructured data using R and Hadoop (a popular Big Data processing framework).