web data mart

Upload: pksinhadwh

Post on 15-Feb-2018

224 views

Category:

Documents


0 download

TRANSCRIPT

  • 7/23/2019 Web Data Mart

    1/12

    A STUDY ON DESIGN AND ANALYSIS

    OF WEB MART MINING AND ITS

    RELEVANCE TODAY

    Ravikumar G K*

    *Dr. MGR University, Chennai, Tamilnadu, INDIA,

    Manjunath T. N+

    +Bharathiar University, Coimbatore, Tamil Nadu, INDIA,

    Ravindra S. Hegadi#

    #Karnatak University,Dharwad,Karnataka,INDIA,

    Archana R A++

    ++SJB Institute of Technology,Bangalore,Karnataka,INDIA,

    Abstract:

    Data warehousing is one of the latest trends in computing environment and information technology

    applications. A data warehouse is a system that extracts, cleans and delivers source data into dimensional

    data store and then supports and implements querying and analysis for the purpose of decision making.

    From a data warehouse, data flows to various departments for their customized decision support systems.

    These individual departmental components are called data marts. A data mart is a set of dimensional

    tables supporting a business process. Data marts contain all atomic detail needed to support drilling down

    to the lowest level. Every company or organization in the world has a website. Beneath each web site are

    web logs that record every object either posted to or served from the web server. Web logs are important

    because they reveal the user traffic on the web site. The activity of parsing web logs and storing the

    results in a data mart to analyze customer activity is known as click stream data warehousing. The web

    mart - database schema is designed to make the underlying data structure more comprehensible to users

    and to simplify the query process. The recommended approach for data warehouse data modeling is to

    follow a Dimensional Modeling approach - Star Schema. We explore the design and analysis of web mart

    and its relevance today at minute level.

    Keywords: Data warehousing, ETL, Web log, Data mart, Web mart.

    1. Introduction - Star Schema of the Web Mart

    The web mart - database schema is designed to make the underlying data structure more comprehensible to

    users and to simplify the query process. The recommended approach for data warehouse data modeling is to

    follow a Dimensional Modeling approach-called Star Schema. The star schema has a central fact table with

    dimension tables at the points of the star. The single fact tables composite primary key requires a foreign key

    field corresponding to the primary key field of each dimension table. The dimension tables are hierarchical and

    thus highly denormalised [4] .A fact table is a primary table in the web mart that contain the business facts, and

    dimension tables are companion tables to the fact table that represent the business critical dimensions and

    contain the attributes for the business critical dimensions. The central fact table provides users the ability to doanalysis on business facts, and dimensional tables provide users the ability to do analysis on these business facts

    in various business critical dimensions[10].

    The figure-1 presents the overall view of the click stream fact and the associated dimensions.

    Ravikumar G K et al. / International Journal of Engineering Science and Technology (IJEST)

    ISSN : 0975-5462 Vol. 3 No. 4 Apr 2011 3141

  • 7/23/2019 Web Data Mart

    2/12

    Fig-1: Star Schema of the Web Mart Design -Click Stream Fact and its Associated Dimensions

    2. Description -Web Mart Objects

    This section explains about the detailed analytical capabilities of the model by giving the listing of the basic fact

    that the user will be able to analyze and the corresponding dimensions which gives user the capability of drill up

    and drill down and slicing and dicing on the base fact. Before the design of specific click stream data marts,

    there is a need to collect together as many dimensions as one can think of that may have relevance in a click

    stream environment. The unique dimensions of the click stream data warehouses are page, visitor, session and

    referral. The page dimension describes the page context for a web page event. It contains attributes like page

    key, page source, page function. The visitor dimension gives the details regarding visitor. The main attributesare userId, CookieId, Operating System and Browser. The session dimension provides one or more levels of

    diagnosis for the visitors session as a whole. For example, the local context of the session might be requesting

    product information, but the overall session context might be ordering a product. The referral dimension

    describes how the customer arrived at the current page [9] [11].

    2.1 Facts and Dimensions in the Web Mart

    The following table-1 presents the objects i.e. fact and dimensions available in the web mart for the analysis

    purpose.

    Table name Fact/Dimension Levels

    Click Stream Fact Fact -

    Universal Date Dimension Year, Quarter, Month, Week, Day

    Universal TOD Dimension Period of the day, Hour, Minute, Second

    Date Dimension Year, Quarter, Month, Week, Day

    TOD Dimension Period of the day, Hour, Minute, Second

    Visitor Dimension IP Address (or) Visitor Id (or) Cookie Id

    Page Dimension Object Type, File Type, Page Type, URL (a)

    Domain, Site, Directory, URL

    Session Dimension Session Type

    Click Stream Fact

    ObjectDimension

    Universal

    DateTOD

    Dimension

    OS

    Dimension

    Browser

    Dimension

    Universal

    TOD

    Visitor

    Dimension

    Page

    Dimension

    Session

    Dimension

    ReferrerDimension

    DateDimension

    Status

    Dimension

    GeographyDimension

    Ravikumar G K et al. / International Journal of Engineering Science and Technology (IJEST)

    ISSN : 0975-5462 Vol. 3 No. 4 Apr 2011 3142

  • 7/23/2019 Web Data Mart

    3/12

    Referrer Dimension Referrer Type, URL (a)

    Domain, Site, URL

    Status Dimension Type of the Status, Status description

    Visit Dimension -

    Content Page Dimension -

    Table-1: Objects of Web Mart

    2.2 Click Stream Fact Table

    2.3 The Table-2 presents various business measures on which the user will be able to do analysisusing the click stream fact table and the associated dimension tables.

    Field Name Description

    UniversalDatekey Foreign key for the Universal Date dimension

    UTODkey Foreign key for the Universal TOD dimension

    Datekey Foreign key for the Date dimension

    TODkey Foreign key for the TOD dimension

    Visitorkey Foreign key for the Visitor dimension

    Pagekey Foreign key for the Page dimension

    Sessionkey Foreign key for the Session dimension

    Referrerkey Foreign key for the Referrer dimension

    Statuskey Foreign key for the Status dimension

    Visitkey Foreign key for the Visit dimension

    ContentPagekey Foreign key for the Content Page dimension

    TimeViewed The time spent in seconds, by the Visitor on a particular object like

    page, file

    BytesTransferred The bytes transferred to the client machine.Table-2: Click Stream Fact Table Description

    2.3 Dimensions

    The dimension table gives users the ability to analyze the business measures in different dimensions by allowing

    the users to drill up and drill down and slice and dice with the attributes of the dimensions. Drilling down is

    adding detailed rows to an existing request and is nothing more than requesting to give more detail. Drilling upis subtracting row headers and is nothing more than looking at the data at more aggregated/consolidated form.

    Slicing is constraining the data that is displayed on an attribute found in a dimension and dicing is constraining

    the data that is displayed by attributes in multiple dimensions [4].

    2.3.1 Universal Date Dimension

    The figure-2 presents the universal date dimension with all its attributes. This universal date dimension

    facilitates analysis along the calendar period with respect to the Greenwich mean time. The hierarchical

    attributes of the dimension are represented using the arrow head connections and general attributes are

    represented using straight-line connections. Date is the lowest grain level that the user will be able to drill down

    to, and year is the highest level that the user will be able to drill up to. Drill down path can be identified by

    following the arrow headings.

    Ravikumar G K et al. / International Journal of Engineering Science and Technology (IJEST)

    ISSN : 0975-5462 Vol. 3 No. 4 Apr 2011 3143

  • 7/23/2019 Web Data Mart

    4/12

  • 7/23/2019 Web Data Mart

    5/12

    The table-4 describes the structure of the Universal TOD dimension table.

    Field Name Description Values/Example

    UTODkey Primary key for the dimension (Surrogate key) 1,2

    UniversalSecond Second of a Minute 1-60

    UniversalMinute Minutes of a hour 1-60

    UniversalHour Hours of a day 10-11, 11-12 UniversalTODPeriodofDay Collection of hours in a day Morning, Evening

    Table-4: Column Description of Universal TOD Dimension

    2.3.3 Date Dimension

    The figure-4 presents the date dimension with all its attributes. The hierarchical attributes of the dimension are

    represented using the arrow head connections and general attributes are represented using straight-line

    connections. Date is the lowest grain level that the user will be able to drill down to, and year is the highest level

    that the user will be able to drill up to. Drill down path can be identified by following the arrow headings [10]

    [1].

    Figure-4: Date Dimension with the hierarchy

    The table-5 describes the structure of the date dimension table.

    Field Name Description Values/Example

    Datekey Primary key for the dimension (Surrogate key) 1,2

    Date Date 25/01/2000

    DayOfWeek Day of the week Sunday

    DayOfWeekNumber Day of the week number 1-7, 1 being Sunday

    WeekNumber Week number in the month 1-5

    Week Week number in the year 1-52MonthDay Day number in the month 1-31

    MonthNumber Month of the year in number 1-12

    Month Month of the year January

    Quarter Quarter of the year 1-4

    Year Year of the date 2000

    Table-5: Column Description of Date dimension

    DATE

    DAY NUMBER IN MONTH

    MONTH NUMBER

    YEAR

    WEEK NUMBER IN YEAR

    YEAR

    QUARTER

    MONTH

    WEEK OF THE MONTH

    DAY OF THE WEEK

    DAY NUMBER IN WEEK

    Ravikumar G K et al. / International Journal of Engineering Science and Technology (IJEST)

    ISSN : 0975-5462 Vol. 3 No. 4 Apr 2011 3145

  • 7/23/2019 Web Data Mart

    6/12

    2.3.4 TOD Dimension

    The figure-5 presents the TOD dimension with all its attributes. The hierarchical attributes of the dimension are

    represented using the arrow head connections and general attributes are represented using straight-line

    connections. Seconds is the lowest grain level that the user will be able to drill down to, and period of the day is

    the highest level that the user will be able to drill up to. Drill down path can be identified by following the arrow

    headings [10] [1].

    Figure-5: TOD dimension with its attributes

    The table-6 describes the structure of the TOD dimension table.

    Field Name Description Values/Example

    TODkey Primary key for the dimension (Surrogate key) 1,2 Second Second of a Minute 1-60

    Minute Minutes of a hour 1-60

    Hour Hours of a day 10-11, 11-12

    TimeOfDay Time of the day 12:00:55;19:15:25

    PeriodofDay Collection of hours in a day Morning, Evening

    Table-6: Column description of TOD dimension

    2.3.5 Visitor Dimension

    The figure-6 presents the Visitor dimension with all its attributes. The hierarchical attributes of the dimension

    are represented using the arrow head connections and general attributes are represented using straight-line

    connections. User id/ Cookie id/ Domain name is the lowest grain level that the user will be able to drill down

    to, and country is the highest level that the user will be able to drill up to. The lowest granularity will be decidedat the client site. Drill down path can be identified by following the arrow headings [10] [1].

    Figure-6: Visitor Dimension with its attributes

    * - Demographics are collection of many fields. It is also possible to form a hierarchy in the demographic

    information. The table-7 describes the structure of the visitor dimension table.

    PERIOD OF THE DAY

    HOUR

    USER ID

    OPERATING SYSTEM

    BROWSER

    COOKIE ID

    DEMOGRAPHICS *

    COUNTRY

    DOMAIN NAME

    Ravikumar G K et al. / International Journal of Engineering Science and Technology (IJEST)

    ISSN : 0975-5462 Vol. 3 No. 4 Apr 2011 3146

  • 7/23/2019 Web Data Mart

    7/12

  • 7/23/2019 Web Data Mart

    8/12

    The following table-8 describes the various fields of the page dimension.

    Field Name Description Values/Example

    Pagekey Primary key for the dimension (Surrogate key) 1,2

    URL Full path of the page in the server C:\..\index.html

    PageName Name of the web page Welcome page, Product

    info page

    PageType Classification of pages in the web site News pages, Jobs &Career pages

    FileType The file type of the object Gif, au, ra, html

    ObjectType The type of the object Multimedia files,

    Application, Content

    pages

    FileName Name of the file being accessed by the user Index.html,

    ProductInfo.html

    Directory The directory in the server, of the accessed file C:\Inetpub\doc

    Site The site where the particular page is available

    Domain The domain of the site where the page resides

    Table-8: Column description of Page Dimension

    2.3.7. Session Dimension

    The figure-8 presents the session dimension with all its attributes. The general attributes are represented using

    straight-line connections; the connection with a circle at one end denotes that the specified item is a collection of

    fields. Session type assigns a meaning to a visit [1]

    Figure-8: Session Dimension with its attributes

    * - Session parameters are collection of fields, which describes the conditions for characterizing a session

    type.

    The following table-9 describes the various fields of the session dimension.

    Field Name Description Values/Example

    Sessionkey Primary key for the dimension (Surrogate key) 1,2

    SessionType The type of the user session. Session is defined

    bases on the business rules

    Quick hit and gone,

    Product Ordering

    SessionDescription The description of a particular session

    SessionParameters The parameters that characterize the particularsession. It can be split into multiple fields, based

    on the business rules provided by the customer.

    Ex: If Time Spent is inthe range of 1-10 min

    and the pages visited in

    general info or product

    info, then it is a

    Looking for Info

    session.

    Table-9: Column Description of Session dimension

    SESSION TYPE

    SESSION DESCRIPTIONSESSION PARAMETERS *

    Ravikumar G K et al. / International Journal of Engineering Science and Technology (IJEST)

    ISSN : 0975-5462 Vol. 3 No. 4 Apr 2011 3148

  • 7/23/2019 Web Data Mart

    9/12

    2.3.7 Referrer Dimension

    The figure-9 presents the session dimension with all its attributes. The hierarchical attributes of the dimension

    are represented using the arrow head connections and general attributes are represented using straight-line

    connections. URL is the lowest grain level that the user will be able to drill down to, and the domain and referrer

    type are the highest levels that the user will be able to drill up to. Drill down path can be identified by following

    the arrow headings [1].

    Figure-8: Session Dimension with its attributes

    The following table-10 describes the fields of the referrer dimension.

    Field Name Description Values/Example

    Referrerkey Primary key for the dimension (Surrogate key) 1,2

    ReferringURL The URL of the referring page C:\..\index.html

    ReferringSite The Site of the referring page

    ReferringDomain The domain of the referring page

    Keyword The keyword given by the user as search criteria

    to reach the page.

    Web mining,

    warehousing

    ReferrerType The type of the referrer Ad banner, Search

    engine

    Table-10: Column description of referrer dimension

    2.3.8 Status Dimension

    The figure-9 presents the status dimension with all its attributes. The hierarchical attributes of the dimension are

    represented using the arrow head connections and general attributes are represented using straight-line

    connections. Status id is the lowest grain level that the user will be able to drill down to, and the status type is

    the highest level that the user will be able to drill up to. Status description provides a description for the status

    id. Drill down path can be identified by following the arrow headings [10] [1].

    Figure-9: Status dimension with its attributes

    URL

    KEY WORD

    SITE

    DOMAIN

    REFERRER TYPE

    STATUS DESCRIPTIONSTATUS ID

    STATUS TYPE

    Ravikumar G K et al. / International Journal of Engineering Science and Technology (IJEST)

    ISSN : 0975-5462 Vol. 3 No. 4 Apr 2011 3149

  • 7/23/2019 Web Data Mart

    10/12

    The following table-11 describes the fields of the status dimension.

    Field Name Description Values/Example

    Statuskey Primary key for the dimension (Surrogate key) 1,2

    StatusId The Status Code 101, 201

    StatusDescription Description of the Status Successful, File not

    found error

    StatusType Type of the Status File errors

    Table-11: Column Description of Status dimension

    2.3.9 Visit Dimension

    This dimension has no hierarchy. This dimension is used for identifying the start and end of a visit, it is show in

    table-12

    Field Name Description Values/Example

    Visitkey Primary key for the dimension (Surrogate key) 1,2

    Description The value or description. For the start of the visit

    it is 'Start' for an end of the visit page 'End'

    Start, End

    Table-12: Column description of visit dimension

    2.3.10 Content Page Dimension

    This dimension has no hierarchy. This dimension is used for identifying a page as a content page or not. it is

    shown in table-13

    Field Name Description Values/Example

    ContentPagekey Primary key for the dimension (Surrogate key) 1,2

    Description 'Yes' to indicate a content page. 'No' to indicate

    other files

    'Yes', 'No'

    Table-13: Column description of content page dimension

    3. The Data Modeling-Star Schema

    The recommended approach for data warehouse data modeling is to follow a dimensional modeling approach

    i.e. star schema. The star schema has a central fact table with dimension tables at the points of the star. The

    single fact table composite primary key requires a foreign key field corresponding to the primary key field of

    each dimension table. The dimensional tables are hierarchical and thus highly denormalised. A fact table is a

    primary table in the web mart that contain the business facts, and dimension tables are companion tables to the

    fact table that represent the business critical dimensions and contain the attributes for the business critical

    dimensions[4]. The central fact table provides users the ability to do analysis on business facts, and dimensional

    tables provide users the ability to do analysis on these business facts in various business critical dimensions, this

    is shown in figure-10.

    Ravikumar G K et al. / International Journal of Engineering Science and Technology (IJEST)

    ISSN : 0975-5462 Vol. 3 No. 4 Apr 2011 3150

  • 7/23/2019 Web Data Mart

    11/12

    Figure-10: Star Schema model design for Web mart

    4. Results

    Now we can proceed to the interesting part of our data warehouse: relieving information.

    4.1 The average number of minutes from login to order

    4.2 The average number of days from first being invited to the site by email to the first order.

    5.Conclusions

    Understanding the behavior of users on your website is as valuable as following a customer around a store and

    recording his or her every move. Imagine how much more organized your store can be and how many

    opportunities you can have to dell merchandise if you know every move customers make while navigating your

    store. The ETL process in Clickstream data warehousing is significantly different from any other source you are

    likely to encounter.

    References[1] Ralph Kimball, The Data Warehouse ETL Toolkit, Wiley India Pvt Ltd., 2006

    [2] Dr. K.V.K.K Prasad, Data warehouse Development Tools, Dreamtech Press, 2006

    [3] White paper by Vivek R Gupta, Senior Consultant, System Services Corporation,An Introduction to data warehousing.[4] Manjunath T.N, Ravindra S Hegadi, Ravikumar G K."Analysis of Data Quality Aspects in DataWarehouse Systems", International

    Journal of Computer Science and Information Technologies, Vol. 2 (1), 2010, 477-485

    [5] Manjunath T.N, Ravindra S Hegadi, Ravikumar G K." A Survey on Multimedia Data Mining and Its Relevance Today",

    [6] International journal of Computer Science and Network Security. Vol. 10 No. 11 pp. 165-170.

    [7] Sanjeevkumar R. Jadhav, and Praveen Kumar Kumbargoudar, Multimedia Data Mining in Digital Libraries: Standards and

    [8] Features in Proc. READIT-2007, p. 54.[9] Shu-Ching Chen, Mei-Ling Shyu, Chengcui Zhang, and Jeff Strickrott, "Multimedia Data Mining for Traffic Video Sequences,"

    Proceedings of the Second International Workshop on Multimedia data Mining MDM/KDD'2001), in conjunction with the Seventh

    Ravikumar G K et al. / International Journal of Engineering Science and Technology (IJEST)

    ISSN : 0975-5462 Vol. 3 No. 4 Apr 2011 3151

  • 7/23/2019 Web Data Mart

    12/12

    ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 78-85, August 26, 2001, San Francisco, CA,

    USA.

    [10] Valery A. Petrushin and Latifur Khan, Multimedia Data Mining and Knowledge Discovery, 2007 - London: Springer-Verlag, pp.

    [11] 3- 17[12] S. Kotsiantis, D. Kanellopoulos, P. Pintelas, Multimedia Mining, SEAS Transactions on Systems, Issue 10, Volume 3, December

    2004, pp. 3263-3268

    [13] Sanjiv Purba Data Management Handbook Published by CRC Press, 1999[14] Bhavani M. Thuraisingham, Data Management Systems: Evolution and Interoperation, Published by CRC Press, 1997

    [15] Jiawei Han, Micheline Kamber Data Mining: Concepts and Techniques Published by Morgan Kaufmann, 2001

    [16]

    Sanjeevkumar R. Jadhav, and Praveen Kumar Kumbargoudar, Multimedia Data Mining in Digital Libraries: Standards and Features,ACVIT- 07, Dr. Babasaheb Ambedkar MarathWada University, Aurangabad,MS-India

    [17] Mori Y, Takahashi H, Oka R. Image-to-word transformation based on dividing and vector quantizing images with words. In:

    MISRM99 First International Workshop on Multimedia Intelligent Stotage and Retrievel management, 1999.[18] Ordenoz C, Omiecinski E. Discovering association rules based on image content. In: ADL 99: Proceedings of the IEEE Forum on

    Research and Technology Advances in Digital libraries. Washington, DC: IEEE Computer Society; 1999, p.38.

    [19] Chakrabarti, S. (2000): Data mining for hypertext: A tutorial survey. SIGKDD explorations, 1(2), pp. 111.

    [20] Ravikumar G K, Manjunath T. N, Ravindra S. Hegadi, Umesh I.M, Cross Industry Survey on Data mining Applications, InternationalJournal of Computer Science and Information Technologies, Vol. 2 (2) , 2011, 624-628.

    Authors Profile

    Ravikumar GK. received his Bachelors degree from Siddaganga Institute of Technology, Tumkur (Bangalore

    University) during the year 1996 and M. Tech in Systems Analysis and Computer Application from Karnataka

    Regional Engineering College Surthakal (NITK) during the year 2000. He is currently working towards his PhD

    degree in the Area of Data mining. He has published several papers in International and national levelconferences. He is having around 14 years of Professional experienced which includes Software Industry and

    teaching experience. His area of interests are Data Warehouse & Business Intelligence, multimedia and

    Databases.

    Manjunath T N. received his Bachelors Degree in computer Science and Engineering from Bangalore

    University, Bangalore, Karnataka, India during the year 2001 and M. Tech in computer Science and Engineering

    from VTU, Belgaum, Karnataka, India during the year 2004. Currently pursing Ph.D degree in Bharathiar

    University, Coimbatore. He is having total 10 years of Industry and teaching experience. His areas of interests

    are Data Warehouse & Business Intelligence, multimedia and Databases. He has published and presented papers

    in journals, international and national level conferences.

    Dr.Ravindra S Hegadi received his Master of Computer Applications (MCA) & M.Phil and Doctorate of

    Philosophy (Ph.D). in year 2007 in computer science from Gurbarga University, Karnataka; He is having 15years of Experience. He has visited overseas to various universities as SME.His area of interests are Image

    Mining, Image Processing and Databases and business intelligence. He has published and presented papers in

    journals, international and national level conferences.

    Archana.R.A received her Bachelors Degree in computer Science and Engineering from VTU ,Belgaum,

    Karnataka, India during the year 2007 and Master of Technology in year 2010 in computer science from

    VTU,Belgaum,Karnataka,India,she is working in SJB Institute of Technology,Bangalore,Karnataka,India.she is

    having 3 years of Experience.Her area of interests are Image Mining, Image Processing and Databases and

    business intelligence. She has published and presented papers in journals, international and national level

    conferences.

    Ravikumar G K et al. / International Journal of Engineering Science and Technology (IJEST)

    ISSN 0975 5462 V l 3 N 4 A 2011 3152