
UNIVERSITY OF PADUA

Department of Information Engineering

PersDL 2007

10th DELOS Thematic Workshop on Personalized Access, Profile Management, and Context Awareness in Digital Libraries

Corfu, Greece, 29–30 June 2007

Web Log Mining: A Study of User Sessions

Maristella Agosti

[email protected]

Giorgio Maria Di Nunzio

[email protected]

Information Management Systems Research Group

Giorgio Maria Di Nunzio Web Log Mining: A Study of User Sessions

Outline

- Motivations;
- Approach;
- Experimental Analysis;
- Current and Future Works.

PersDL 2007 - Corfu, Greece, 29–30 June 2007 1


Motivations

- The three main approaches to evaluating an information access service are:
  - studies based on test collection analysis, the so-called “Cranfield approach” [1],
  - user studies, and
  - the analysis of log data.
- Web log file analysis began as a way for Web site administrators to ensure adequate bandwidth and server capacity for their organization.
- It may also offer advice about
  - better ways to present Web content,
  - problems encountered by the users, and
  - security problems of the site.

[1] C. W. Cleverdon, “The Cranfield Tests on Index Language Devices”. In Readings in Information Retrieval, Morgan Kaufmann Publishers, Inc., San Francisco, California, pp. 47–60, 1997.


The European Library

- Case study: The European Library [2], a service set up by the Conference of European National Librarians (CENL) [3].
- Created to offer access to the combined resources (such as books, magazines, and journals, both digital and non-digital) of 47 national libraries of Europe.
- Analyse the data contained in the logs of its Web servers.

[2] http://www.theeuropeanlibrary.org/
[3] http://www.nlib.ee/cenl/


Scope of the study

- Evaluate the information access service in order to give recommendations for developing possible future personalization services.
- Report initial findings on a specific aspect that is highly relevant for personalization:
  - the study of user sessions.
- The log data used for the present analysis refer to the collections of the 27 national libraries, out of 47, that were full partners at the time of the analysis.


Approach

- Information extracted from the Web has to be managed efficiently for analysis.
- The proposed solution is based on database management methods, which permit the definition of an application that maintains and manages the necessary data [4].
- The database and the application that have been developed separate the different entities recorded and facilitate data mining and on-demand querying of the log data.

[4] M. Agosti, G. M. Di Nunzio and A. Niero, “From Web Log Analysis to Web User Profiling”. In DELOS Conference 2007, Working Notes, Pisa, Italy, 2007, pp. 121–132.


Methodology

- Log files usually come in a text file format.
- The proposed methodology for acquiring data from Web log files identifies two problems:
  - gathering, and
  - storing the information.
- Gathering data with parsers;
- Storing data with databases.


Gathering Data

- Web log files record activities in ordered fields:
  - date: Date, in the form yyyy-mm-dd.
  - time: Time, in the form hh:mm:ss.
  - s-ip: The IP address of the server.
  - cs-method: The requested action. Usually GET for common users.
  - cs-uri-stem: The URI stem of the request.
  - cs-uri-query: The URI query, where requested.
  - s-port: The port of the server for the transaction.
  - cs-username: The username identifying the user.
  - c-ip: The IP address of the client.
  - cs(User-Agent): The user-agent of the client. For a standard user this means the browser and other information about the operating system.
  - cs(Referer): The site where the link followed by the user was located.
  - sc-status: The HTTP status of the request, that is, the response of the server.
  - sc-substatus: The substatus error code.
  - sc-win32-status: The Windows status code.
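A minimal sketch of the gathering step, not the parser used in the study. It assumes that fields are separated by whitespace, that no field contains an embedded space, and that lines starting with `#` are directives. The function name `parse_line` and the sample record (with documentation-range IP addresses) are hypothetical.

```python
from datetime import datetime

# Field order as listed above.
FIELDS = [
    "date", "time", "s-ip", "cs-method", "cs-uri-stem", "cs-uri-query",
    "s-port", "cs-username", "c-ip", "cs(User-Agent)", "cs(Referer)",
    "sc-status", "sc-substatus", "sc-win32-status",
]


def parse_line(line):
    """Return a dict for one log record, or None for directives and malformed lines."""
    if line.startswith("#"):
        return None
    values = line.split()
    if len(values) != len(FIELDS):
        return None
    record = dict(zip(FIELDS, values))
    # Combine the date and time fields into a single timestamp.
    record["timestamp"] = datetime.strptime(
        record["date"] + " " + record["time"], "%Y-%m-%d %H:%M:%S"
    )
    return record


# Hypothetical record, for illustration only.
sample = (
    "2005-11-30 23:00:37 192.0.2.1 GET /index.htm - 80 - 192.0.2.10 "
    "Mozilla/4.0+(compatible;+MSIE+6.0) - 200 0 0"
)
record = parse_line(sample)
print(record["cs-method"], record["cs-uri-stem"], record["timestamp"])
```

Returning `None` for lines that do not match the expected field count keeps the sketch simple; a real parser would probably log or count such lines instead of silently skipping them.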


Storing Data

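[Figure: entity-relationship schema of the database. Entities and their identifying attributes: SERVER (IPaddress), CLIENT (IPaddress), USERAGENT (Useragent), URISTEM (Uristem), REQUEST (Timestamp), SESSION (ID), HEURISTIC (ID). The entities are connected by the relationships R1 to R5, with cardinalities (0, N), (1, N) and (1, 1).]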
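A rough sketch of how some of these entities could be stored and queried on demand, using SQLite from Python. Only the entity and attribute names come from the schema; the foreign keys, the surrogate `id` of `request`, and the omission of SERVER and HEURISTIC are assumptions made for brevity, since the relationships R1 to R5 cannot be read from the diagram here.

```python
import sqlite3

# Assumed relational layout; not the schema of the actual application.
SCHEMA = """
CREATE TABLE client    (ipaddress TEXT PRIMARY KEY);
CREATE TABLE useragent (useragent TEXT PRIMARY KEY);
CREATE TABLE uristem   (uristem   TEXT PRIMARY KEY);
CREATE TABLE session (
    id        INTEGER PRIMARY KEY,
    ipaddress TEXT NOT NULL REFERENCES client(ipaddress),
    useragent TEXT NOT NULL REFERENCES useragent(useragent)
);
CREATE TABLE request (
    id         INTEGER PRIMARY KEY,
    timestamp  TEXT NOT NULL,
    uristem    TEXT NOT NULL REFERENCES uristem(uristem),
    session_id INTEGER NOT NULL REFERENCES session(id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Hypothetical rows, for illustration only.
conn.execute("INSERT INTO client VALUES ('192.0.2.10')")
conn.execute("INSERT INTO useragent VALUES ('Mozilla/4.0')")
conn.executemany(
    "INSERT INTO uristem VALUES (?)",
    [("/index.htm",), ("/portal/index.htm",)],
)
conn.execute("INSERT INTO session VALUES (1, '192.0.2.10', 'Mozilla/4.0')")
conn.executemany(
    "INSERT INTO request (timestamp, uristem, session_id) VALUES (?, ?, ?)",
    [
        ("2005-11-30 23:00:37", "/index.htm", 1),
        ("2005-11-30 23:00:38", "/portal/index.htm", 1),
    ],
)

# On-demand query: number of requests per session.
rows = conn.execute(
    "SELECT session_id, COUNT(*) FROM request "
    "GROUP BY session_id ORDER BY session_id"
).fetchall()
print(rows)
```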


Approach - HTTP Requests

- Here we concentrate only on deriving information on user sessions from the analysis of the
  - HyperText Transfer Protocol (HTTP) requests made by clients,
  - grouped in sessions,
  - using a specific heuristic.
- A request represents the data of an HTTP request as recorded in the Web log files.

2005-11-30 23:00:37 192.87.31.35 GET /index.htm - 80 - 152.xxx.xxx.xxx Mozilla/4.0+(compatible; ...


Approach - Sessions

- A session is a particular set of requests made in a certain interval of time by the same client.

2005-11-30 23:00:37 192.87.31.35 GET /index.htm - 80 - 152.xxx.xxx.xxx Mozilla/4.0+(compatible; ...
2005-11-30 23:00:38 192.87.31.35 GET /portal/index.htm - 80 - 152.xxx.xxx.xxx Mozilla/4.0+( ...
2005-11-30 23:00:38 192.87.31.35 GET /portal/scripts/Hashtable.js - 80 - 152.xxx.xxx.xxx Mozilla/4.0+ ...
2005-11-30 23:00:44 192.87.31.35 GET /portal/scripts/Session.js - 80 - 152.xxx.xxx.xxx Mozilla/4.0+ ...
2005-11-30 23:00:46 192.87.31.35 GET /portal/scripts/Query.js - 80 - 152.xxx.xxx.xxx Mozilla/4.0+ ...
2005-11-30 23:00:47 192.87.31.35 GET /portal/scripts/Search.js - 80 - 152.xxx.xxx.xxx Mozilla/4.0+ ...

- Organizing the HTTP requests into sessions gives a better view of the actions performed by visitors.
- Sessions are identified through empirical rules when information about sessions is not available.


Approach - Sessions’ Reconstruction

- “Session reconstruction” may be used to map the recorded activities to the individual visitors of the site.
- Possible criteria for adding a request to a session:
  - the IP address and the user-agent are the same as those of the requests already inserted in the session [5],
  - the request is made less than fifteen minutes after the last request inserted [6].

[5] D. Nicholas, P. Huntington, A. Watkinson, “Scholarly journal usage: the results of deep log analysis”, Journal of Documentation, Vol. 61, No. 2, 2005.
[6] B. Berendt, B. Mobasher, M. Nakagawa, M. Spiliopoulou, “The Impact of Site Structure and User Environment on Session Reconstruction in Web Usage Analysis”, WEBKDD 2002, LNAI 2703, pp. 159–179, 2003.
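A minimal sketch of a session reconstruction heuristic, assuming that both criteria are applied together: a request joins a session when it comes from the same IP address and user-agent and arrives less than fifteen minutes after the last request in that session. The function name `reconstruct_sessions`, the tuple layout, and the sample requests are hypothetical, not the implementation used in the study.

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=15)


def reconstruct_sessions(requests):
    """Group (timestamp, ip, user_agent) tuples into sessions.

    Returns a list of sessions, each a list of requests in time order.
    """
    sessions = []
    open_session = {}  # (ip, user_agent) -> index of the client's latest session
    for request in sorted(requests, key=lambda r: r[0]):
        timestamp, ip, user_agent = request
        key = (ip, user_agent)
        index = open_session.get(key)
        if index is not None and timestamp - sessions[index][-1][0] < TIMEOUT:
            # Same client, and close enough in time to the previous request.
            sessions[index].append(request)
        else:
            # New client, or the time gap is too long: start a new session.
            open_session[key] = len(sessions)
            sessions.append([request])
    return sessions


# Hypothetical requests, for illustration only.
t0 = datetime(2005, 11, 30, 23, 0, 37)
requests = [
    (t0, "192.0.2.10", "Mozilla/4.0"),
    (t0 + timedelta(seconds=1), "192.0.2.10", "Mozilla/4.0"),
    (t0 + timedelta(minutes=20), "192.0.2.10", "Mozilla/4.0"),
    (t0 + timedelta(seconds=5), "192.0.2.20", "Opera/9.0"),
]
sessions = reconstruct_sessions(requests)
print([len(session) for session in sessions])
```

Here the third request of the first client arrives almost twenty minutes after its previous one, so it opens a new session for the same IP address and user-agent pair.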


Experimental Analyses

- Experimental analysis was performed on a sample of The European Library Web log files covering eleven months,
  - from 31st October 2005
  - to 25th September 2006.
- The structure of the log file records conforms to the W3C Extended Log File Format [7].
- The analyses presented in the following cover the software tools used by clients (operating systems and browsers), sessions in terms of daily distribution, and time intervals per number of HTTP requests.
- The numbers reported include all requests and sessions, even those that may belong to automatic crawlers and spiders.

[7] http://www.w3.org/TR/WD-logfile.html


Experimental Analyses - HTTP Requests

- A total of 25,881,469 HTTP requests were recorded in the log files of the eleven months.
- The distribution of the HTTP methods present in the log files is the following:

HTTP method    total number
CONNECT                   2
LINK                      6
PROPFIND                760
PUT                   3,640
OPTIONS               3,779
HEAD                 33,770
POST                844,058
GET              24,995,454
Total            25,881,469
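As a quick check on the table above, the sketch below sums the reported counts and derives the share of each method. The percentages are computed here and are not figures reported in the slides; GET comes out at roughly 96.6% of all requests.

```python
# Totals per HTTP method, as reported in the table above.
METHOD_COUNTS = {
    "CONNECT": 2,
    "LINK": 6,
    "PROPFIND": 760,
    "PUT": 3_640,
    "OPTIONS": 3_779,
    "HEAD": 33_770,
    "POST": 844_058,
    "GET": 24_995_454,
}

total = sum(METHOD_COUNTS.values())
# Share of each method, as a percentage of all recorded requests.
shares = {method: 100 * count / total for method, count in METHOD_COUNTS.items()}

for method, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{method:<9}{METHOD_COUNTS[method]:>12,}{share:>9.3f}%")
print(f"{'Total':<9}{total:>12,}")
```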


Experimental Analyses - Clients, OS, Browsers

- During this period, according to the chosen heuristic,
  - 949,643 sessions were reconstructed, with an average of ∼27 accesses per session;
  - 285,125 different pairs of IP address and user-agent were found.
- Operating systems (left) and browsers (right) used by clients

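[Figure: two pie charts. Left, operating systems: legend Windows, Linux, Mac, Others; slice labels 75%, 9%, 2%, 14%. Right, browsers: legend MSIE, Firefox, Mozilla, Opera, Gecko, AOL, Konqueror, Others; slice labels 56%, 14%, 5%, 4%, 3%, 3%, 2%, 13%.]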
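The slides do not say how user-agent strings were mapped to operating systems and browsers. One simple possibility, sketched below purely as an assumption, is ordered substring matching: the first token found in the string decides the category. The order of the token lists and the sample user-agent strings are hypothetical.

```python
from collections import Counter

# Order matters: many user-agent strings contain several of these tokens,
# e.g. most contain "Mozilla", and Firefox strings also contain "Gecko".
BROWSER_TOKENS = ["Opera", "Konqueror", "AOL", "Firefox", "MSIE", "Gecko", "Mozilla"]
OS_TOKENS = ["Windows", "Mac", "Linux"]


def classify(user_agent, tokens):
    """Return the first token contained in user_agent, or 'Others'."""
    for token in tokens:
        if token in user_agent:
            return token
    return "Others"


# Hypothetical user-agent strings, for illustration only.
user_agents = [
    "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1)",
    "Mozilla/5.0+(X11;+U;+Linux+i686;+en-US;+rv:1.8.0.7)+Gecko/20060909+Firefox/1.5.0.7",
    "ExampleBot/1.0",
]
browsers = Counter(classify(ua, BROWSER_TOKENS) for ua in user_agents)
systems = Counter(classify(ua, OS_TOKENS) for ua in user_agents)
print(browsers, systems)
```

Strings that match no token, such as many crawlers, fall into "Others" for both charts.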


Experimental Analyses - Times, Sessions’ Lengths

- Number of sessions per hour of day, Web server set on CET (left).
- Number of sessions per HTTP request intervals (right).

[Figure, left: number of sessions (scale ×10^4) per hour of day. Right: number of sessions (scale ×10^5) per range of requests per session: <= 25; > 25, <= 50; > 50, <= 75; > 75, <= 100; > 100.]
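A small sketch of how the per-hour counts could be computed. It assumes that a session is assigned to the hour of its first request and that timestamps are already in the server's local time; the function name and the sample start times are hypothetical.

```python
from collections import Counter
from datetime import datetime


def sessions_per_hour(session_starts):
    """Count sessions by the hour of day (0-23) of their first request."""
    counts = Counter(start.hour for start in session_starts)
    # Return one value per hour so that empty hours show up as zero.
    return [counts.get(hour, 0) for hour in range(24)]


# Hypothetical session start times, in server local time.
starts = [
    datetime(2005, 11, 30, 23, 0, 37),
    datetime(2005, 11, 30, 23, 41, 2),
    datetime(2005, 12, 1, 9, 15, 0),
]
histogram = sessions_per_hour(starts)
print(histogram[9], histogram[23])
```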


Experimental Analyses - Sessions Breakdown

- Breakdown of sessions according to both the number of requests and the length of the session.
- Sessions with 25 requests or fewer (left), with more than 25 requests (right).
- 16% of sessions last more than 60 seconds, regardless of the number of requests per session.
- 12% of the sessions contain more than 50 requests.

[Figure: number of sessions by session length (seconds) and requests per session. Left (scale ×10^5): sessions with <= 25 requests; session lengths <= 20; > 20, <= 40; > 40. Right (scale ×10^4): requests per session > 25, <= 50; > 50, <= 75; > 75, <= 100; > 100; session lengths <= 10; > 10, <= 20; > 20, <= 30; > 30, <= 40; > 40, <= 50; > 50, <= 60; > 60.]
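The two-way breakdown above can be sketched as a cross-tabulation over (requests, length) pairs. The bin edges follow the ranges shown in the figure; the helper `bin_label`, the input layout, and the sample sessions are assumptions for illustration.

```python
from bisect import bisect_left
from collections import Counter

REQUEST_EDGES = [25, 50, 75, 100]        # requests per session
LENGTH_EDGES = [10, 20, 30, 40, 50, 60]  # session length in seconds


def bin_label(value, edges):
    """Label of the range (lower, upper] that contains value."""
    i = bisect_left(edges, value)
    if i == 0:
        return f"<= {edges[0]}"
    if i == len(edges):
        return f"> {edges[-1]}"
    return f"> {edges[i - 1]}, <= {edges[i]}"


def breakdown(sessions):
    """Count sessions per (request range, length range) pair."""
    return Counter(
        (bin_label(n_requests, REQUEST_EDGES), bin_label(length, LENGTH_EDGES))
        for n_requests, length in sessions
    )


# Hypothetical sessions as (number of requests, length in seconds).
sessions = [(6, 10), (30, 45), (30, 61), (120, 900)]
table = breakdown(sessions)
# Share of sessions lasting more than 60 seconds, whatever their size.
long_share = sum(1 for _, length in sessions if length > 60) / len(sessions)
print(table, long_share)
```

`bisect_left` places a value equal to an edge in the lower range, which matches the "<=" upper bounds used in the figure.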


Experimental Analyses - Focus on Lengthy Sessions

- The sessions with more than 100 requests were analysed separately.
- The majority of sessions with a high number of requests last from 2 to 30 minutes.

[Figure: number of sessions (axis from 0 to 14000) by session length (seconds) and requests per session. Session lengths: > 60, <= 120; > 120, <= 300; > 300, <= 600; > 600, <= 1800; > 1800, <= 3600; > 3600. Requests per session: > 100, <= 200; > 200, <= 300; > 300, <= 400; > 400, <= 500; > 500.]


Conclusions

- Preliminary analysis of eleven months of The European Library Web log data, according to a methodology for gathering and mining information from Web log files based on a DataBase Management System (DBMS) application.
- Report on initial findings from the study of user sessions, which were reconstructed by means of heuristic methods since no personal data was available to track each user.
- The heuristics used to identify users and sessions suggested that authentication would be required, since it would allow Web servers to identify users, track their requests, and, more importantly, create more accurate profiles tailored to specific needs.
- Authentication would also help to mitigate the problem of crawler accesses, by granting access to some sections of the Web site only to registered users and blocking crawlers that use faked user agents.


Current and Future Works

- As a follow-up to the cooperation with The European Library, the Office of The European Library has implemented the changes suggested by this work:
  - the use of cookies in the HTTP server logging system (September 2006);
  - a user authentication procedure (since August 2006).
- A comparison of session reconstruction using the heuristic and using cookies should give hints about the use of heuristics.
- A more accurate profile for each user, studying nationality, language, and chosen collections.
