linear regression.pdf

Upload: christopher-herring

Post on 15-Feb-2018

226 views

Category:

Documents


0 download

TRANSCRIPT

  • 7/23/2019 linear regression.pdf

    1/4

    International Conference on Methods and Models in Computer Science, 2009

    A LINEAR Regression-Based Frequent Itemset

    Forecast Algorithm for Stream DATA

    Sonali Shukla , Sushil Kumar , Bhupendra Verma

    3

    TIT, Bhopal,

    e-mail:}[email protected]

    [email protected].

    b k _ v e r m a @ r e d i f f m a i l c o m

    Abstract-

    Data mining deals with extracting or mining

    knowledge from large and infinite amount of stream data.

    It also handles the data quality with limited volume of disk

    or memory. In such traditional transaction environment it

    is impossible to perform frequent items mining because it

    requires analyzing which item is a frequent one to

    continuously incoming stream data and which is probable

    to become a frequent item. This paper proposes a way to

    predict frequent items using regression model to the

    continuously incoming real time stream data. By

    establishing the regression model from the stream data, it

    may be used for prediction of uncertain items.

    After gathering real-time stream data through sliding

    window, algorithm FIM-2DS computes support for

    appointed sequence and describes linear equation to

    forecast sequence trends in the future.

    Keywords: Frequent items; stream data; linear

    regression; sliding window.

    I. INTRODUCTION

    S

    ream data is generated continuously in a dynamic

    environment, with huge volume, infmite flow, and

    fast changing behavior.

    In recent years, many time-series mining problems

    have been explored for a data stream environment.

    Because sequence forecast is an import subject

    of

    those,

    some methods were designed to develop sequence trend

    analysis in a dynamic environment. Yixin Chen etc. [9]

    investigated methods for multi-dimensional regression

    analysis of

    time-series stream data; they built up a

    partially materialized data cube model and took an

    exception-guided regression approach to analysis stream

    data. Wei-Guang Teng etc. [10] devised a FTP-DS

    algorithm to mine frequent temporal patterns of data

    stream. The FTP-DS algorithm uses linear regression to

    perform trend detection, it s an effective method, but it

    always omits some exceptions, which are too pivotal to

    lead to failure.

    In this paper, we present a FIM-2DS algorithm that

    use plan regression approach to ameliorate FTP-DS

    algorithm, so the sequence trends can cover those key

    exceptions. Our experiment on comparing between the

    FTP-DS algorithm and the FIM-2DS algorithm shows

    that coverage ofFIM-2DS algorithm ismuch wider.

    The rest

    of

    the paper is organized as follows. In next

    Section, we introduce the basic concepts of sequence

    forecast and concerned problem, Section 1 describes

    FIPM algorithm with example, and Section 2 describes

    the FTP-DS algorithm.

    In

    Section 3, we present the

    algorithm

    of

    FIM-2DS. Section 4 shows our

    experiments and performance analysis and our study

    and future research are concluded in Section 5 and 6

    respectively.

    II .

    RELATED CONCEPTS

    In this section, we introduce the basic concepts

    of

    sequence forecast in data stream and give a brief look at

    FIPM algorithm and FTP-DS algorithm.

    Sequence in data stream is essentially a sequence of

    sets of events, which conform to the given timing

    constraints [11]. As an example, the sequential pattern

    (A)(C, B)(D), encodes an interesting fact that event D

    occurs after an event-set(C, B), which in tum occurs

    after event A. If the support

    of

    sequence is no less than

    the threshold, this sequence is called frequent sequence.

    Sequence forecast plays an important role in gene

    analysis, fmance forecast, network security, web

    information extraction, communication statistic, and so

    on.

    1.FREQUENT ITEMS PREDICTION METHOD

    We explain the frequent items prediction method [8],

    called

    FIPM. The method uses a regression analysis model

    in one dimensional stream data and predicts frequent

    items by using the estimated regression model. First,

    this method preprocesses one dimensional stream data

    and transforms it to a form of sampling value for further

    regression. When a sampling value is generated, a

    regression model is found using the method of least

    squares, a way to estimate regression model.

    A. Data Preprocessing

    The purpose

    of

    regression analysis is to fmd an

    estimator in the future from observed values as samples.

    In other words,

    if

    n number

    of

    sample values were

    obtained, an estimator, yn+1 would be predicted based

    on the input, xn+1 upon the analysis of the samples.

  • 7/23/2019 linear regression.pdf

    2/4

    Therefore, it is necessary to transform the incoming

    stream data into a data pair, such as (xi, yi). Stream data

    has properties of time series data. As seen in the Figure

    1, stream data comes in continuously as inputs with time,

    t.

    The stream data preprocessing is as follows: first,

    each inputted value in stream data is reorganized into

    stream data at the incoming time of each inputted value.

    We first reorganize times of each value to

    perform

    preprocessing operation.

    From the reorganized stream data we can find the

    difference in time for each item. For example in the

    value A, the value, A is entered in the stream data at t

    IS, t-13, t-IO, t-S, t-4 , and t respectively. From the

    example of stream data in the Figure I , it is unknown

    when the value, A entered before the time

    oft-IS

    , but it

    is possible to see the differences in time for the input

    of

    A from the time oft-IS to the present time, t: {2, 3, 2 , 4,

    4}. Such differences in time can be sequentially paired

    as follows : (2, 3), (3, 2), (2, 4), (4, 4). These pairs

    indicate the differences in time before and after the

    value , A is entered in the stream data . ( As given in

    Table 2).

    U A

    B

    A r

    E c

    A

    E D

    r

    D E

    A

    I I I I

    I

    I I I I

    I

    ri

    1

    16

    1ll

    I J 11) Ig

    >1

    1-6 ,

    l-l I)

    r.[

    r

    time. In other words, previously appeared time intervals

    can be set as an independent variable and the time

    appeared intervals later as a dependent variable, we may

    be able to estimate the linear regression model that we

    want.

    Putting the values of Dependent and Independent

    variable in linear regression model, we will get the

    regression fit line.

    2. A BRIEF LOOK AT FPT-DS ALGORITHM

    Algorithm FTP-DS [10] was presented by Wei-Guang

    Teng on 29

    th

    VLDB conferences at 2003. There were

    two key defmitions for FTP-DS algorithm. One is the

    support of frequent sequence; the other is the pattern

    representation, which called a compact ATF. We will

    give a brieflook at those in below.

    The support of frequent sequence X at a specific time

    t is denoted by the ratio

    of

    the number

    of

    events having

    sequence

    X

    in the current sliding window to total

    number of events. Supposing the sliding window size is

    3, there are 3 sliding windows in Figurel , i.e., [0,3],

    [1,4], [2,5], the support of sequence is computed

    as: in [0,3], the is in event {ID:2, ID:5}, so its

    support is 2/5=0.4; in [1,4], the is in event {ID:I ,

    ID:4}, so its support is 2/5=0.4; in [2,5], the is in

    event {ID:I}, so its support is 1/5=0.2.

    TO

    St r{ am Sequence

    J

    1

    (b)

    (d ) rJ 1

    2

    (u , b. c (C. dl (d, f

    1

    3

    (b)

    (f.')

    J

    4

    a

    (b . eJ Cd)

    I

    5

    (b) c . tI)

    CI )

    .t

    aap

    l 2 J

    ,I

    Fig. 1. Exampleof onedimensionstreamdata

    Calculate the values

    of

    Dependent variable and

    Independent variable depend on the following model

    using pairs of Data Sets:

    TABLE I. EXAMPLEOFTHEORDERED PAIR,(X,V),

    FORTHEVALUE, A

    t

    i mcs

    input time

    1 2 3 4 5

    6

    7 8 9

    sequence

    independent

    2

    3

    2

    4 4 3

    5 7

    6

    variable(x)

    dependent

    3 2 4 4 3 5 7 6 5

    variable(y)

    TABLE2. PROCESS OFESTIMATINGREGRESSION

    FORMULA BETWEEN ANINDEPENDENTVARIABLE

    Fig. 2.Anexampleof supportcomputation

    The ATP

    form

    of time series was defmed as

    (ts, Df, f,

    It2

    and FPT-DS algorithm extracted a regression fit line:

    f = a ~ t

    Whereo, has to be calculated using some formulae.

    Seq.

    V....>

  • 7/23/2019 linear regression.pdf

    3/4

    A

    b

    o

    = j b

    l

    i

    A

    Ix

    ,r-

    .1\ I (x -

    i f

    V

    - \ .)

    l;

    = t J I .,-- = I

    J .

    ,

    -- (

    -) '

    L ;; - X z: .\ - x -

    Step5: Fit the regression model ofFlM-2DS:

    y =b

    [

    Step6: Calculate the error

    - -

    erroron

    FTPDS

    e r ro ron

    FIPM2D

    345678

    Fig. 4. Graph to show the mean residual comparison

    o

    0.3

    0.2

    0.1

    5. CONCLUSION

    In this method , the least square method is used to

    estimate a regression line based on the linear regression

    analysis. The regression in this method is a simple linear

    regression model estimating a regression model with

    changes in two different variables . Since stream data has

    a characteristic

    of being continuously transmitted in

    time, they defme the changes in time with two variables.

    These two variables indicate an independent variable

    and a dependent variable respectively.

    Here, in this method I have presented an overview

    of

    some vulnerabilities of algorithm FIPM and designed

    linear regression based algorithm FIM-2DS to predict

    frequent sequence for real time stream data, in which I

    reviewed the concept of preprocessing from the

    algorithm FTP-DS. Our experimental result shows the

    merit of our algorithm in terms

    of

    errors.

    6. FUTURE ENHANCEMENT

    REFERENCES

    [I] B. Babcock, S. Babu, M. Datar, R. Motwani, and 1. Widom,

    Models and Issues in Data Stream Systems, In Proc. of PODS,

    March 2002.

    [2] N. Davey, S. P. Hunt, and R. 1. Frank, Time Series Prediction

    and Neural Networks, In Journal of Intelligent and Robotic

    Systems, 200

    I.

    [3] L. Golab, M. Tamer Ozsu, Issues in Data StreamManagement,

    In SIGMODRecord, Volume 32, Number 2,2003.

    [4] X. Hao, D. Xu, Time Series Prediction based on Non

    Parametric, In SIGMODRecord, Volume 32, Number 2,2003 .

    [5] D. F. Specht, A General Regression Neural Network, IEEE

    Trans. on Neural Networks, Vol.2, No.6, pp.568-576, Nov. 1991.

    [6] D. Wacherly, W. Mendenhall,

    R.

    L. Scheaffer, Mathematical

    Statistics with Applications , 5th edition, 2005.

    [7] O. B. Yaik, C. H. Yong, and F. Haron, Time Series Prediction

    using Adaptive Association Rules, In Proc. Of DFMA05,

    pp.3IO-3I4,2005.

    [8] Duck Jin Chai, Eun Hee Kin, Long Jin, Frequent Items

    Prediction Method to one dimensional stream data , in Fifth

    International Conference on Computational Science And

    Applications, IEEE -2007.

    Let us analyze a few open problems of FIM-2DS

    algorithm from data stream background. Stream data is

    so huge , infinite, and fast changing that it always

    contains lots of noisy, overflowing, incomplete data

    items. As my work is concerned, it is not dealing with

    this problem that too at the cost of execution time, my

    future work is to resolve this problem.

    --- - e rro r o n FTPDS

    ____ e rro r o n F IPM2D

    - 1

    .- L . r

    Tl r-J

    Error

    comparison o n F TP DS a nd

    FIPM2D

    Fig. 3. Graph to show the residual comparison

    II .J>

    II

    I I

  • 7/23/2019 linear regression.pdf

    4/4

    [9] Y.Chen, G.Dong, J.Han et, al, Multi-Dimensional Regression

    Analysis of Time-Series Data Stream , Proceedings of the 28

    th

    International Conference on Very Large Data Base, Berlin, pp.

    323-334,August 2002.

    [10] Wei-Guang Teng, Ming-Syan Chen, Philip S.Yu, A

    Regression-Based Temporal Pattern Mining Scheme for Data

    Stream , Proceedings of the 29th International Conference on

    Very Large Data Base, Berlin, pp.607-617, August 2003.

    [11] Joshi M, Karypis G, A universal formulation of sequential

    patterns , Research report No.99-021, Department of Computer

    Science, University

    of

    Minnesota, Minnesota, 1999.