A Novel Framework for Discovering Emerging Topics in Streams of Social Networks


1. Introduction

1.1 Data Mining: Data mining is the process of automatically discovering useful information in large data repositories. Data mining techniques are deployed to scour large databases in order to find novel and useful patterns that might otherwise remain unknown. They also provide capabilities to predict the outcome of future observations, such as predicting whether a newly arrived customer will spend more than $100 at a department store.

(Figure 1.1 Data Mining flow chart)

1.2 Structure of Data Mining: Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

1.3 Data Mining Work: While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:
- Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.
- Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
- Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.
- Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.

Data mining consists of five major elements:
- Extract, transform, and load transaction data onto the data warehouse system.
- Store and manage the data in a multidimensional database system.
- Provide data access to business analysts and information technology professionals.
- Analyze the data by application software.
- Present the data in a useful format, such as a graph or table.

Different levels of analysis are available:
- Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
- Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
- Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID). CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that can be applied to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits, while CHAID segments using chi-square tests to create multi-way splits. CART typically requires less data preparation than CHAID.
- Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k >= 1). Sometimes called the k-nearest neighbor technique.
- Rule induction: The extraction of useful if-then rules from data based on statistical significance.
- Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.

1.4 Characteristics of Data Mining:
- Large quantities of data: The volume of data is so great that it has to be analyzed by automated techniques, e.g. satellite information, credit card transactions, etc.
- Noisy, incomplete data: Imprecise data is a characteristic of all data collection.
- Complex data structure: Conventional statistical analysis is not possible.
- Heterogeneous data stored in legacy systems.

1.5 Benefits of Data Mining:
1) It is one of the most effective services available today. With the help of data mining, one can discover precious information about customers and their behavior for a specific set of products, and evaluate, analyze, store, mine, and load data related to them.
2) An analytical CRM model and strategic business decisions can be made with the help of data mining, as it helps in providing a complete synopsis of customers.
3) A large number of organizations have installed data mining projects, and it has helped them make unprecedented improvements in their marketing strategies (campaigns).
4) Data mining is generally used by organizations with a solid customer focus. Because of its flexible applicability, it is used heavily in applications to foresee crucial data, including industry analysis and consumer buying behavior.
5) Fast-paced and prompt access to data, along with economic processing techniques, have made data mining one of the most suitable services a company can seek.

1.6 Advantages of Data Mining:
1.6.1 Marketing / Retail: Data mining helps marketing companies build models based on historical data to predict who will respond to new marketing campaigns such as direct mail, online marketing campaigns, etc. Through the results, marketers will have an appropriate approach to sell profitable products to targeted customers. Data mining brings a lot of benefits to retail companies in the same way as marketing.
Through market basket analysis, a store can arrange its products appropriately so that customers can conveniently buy frequently purchased products together. In addition, it also helps retail companies offer certain discounts for particular products that will attract more customers.
1.6.2 Finance / Banking: Data mining gives financial institutions information about loans and credit reporting. By building a model from historical customer data, a bank or financial institution can distinguish good loans from bad ones. In addition, data mining helps banks detect fraudulent credit card transactions to protect credit card owners.
1.6.3 Manufacturing: By applying data mining to operational engineering data, manufacturers can detect faulty equipment and determine optimal control parameters. For example, semiconductor manufacturers face the challenge that, even when the manufacturing conditions at different wafer production plants are similar, the quality of the wafers is not the same, and some wafers have defects for unknown reasons. Data mining has been applied to determine the ranges of control parameters that lead to the production of golden wafers. Those optimal control parameters are then used to manufacture wafers of the desired quality.
1.6.4 Governments: Data mining helps government agencies by digging through and analyzing records of financial transactions to build patterns that can detect money laundering or criminal activities.

1.6.5 Law Enforcement: Data mining can aid law enforcers in identifying criminal suspects as well as apprehending these criminals by examining trends in location, crime type, habit, and other patterns of behavior.
1.6.6 Researchers: Data mining can assist researchers by speeding up their data analysis process, thus allowing them more time to work on other projects.
1.7 Network: A network consists of two or more computers that are linked in order to share resources (such as printers and CDs), exchange files, or allow electronic communications. The computers on a network may be linked through cables, telephone lines, radio waves, satellites, or infrared light beams. Two very common types of networks are:
1.7.1 Local Area Network
1.7.2 Wide Area Network
You may also see references to a Metropolitan Area Network (MAN), a Wireless LAN (WLAN), or a Wireless WAN (WWAN).
1.7.1 Local Area Network: A Local Area Network (LAN) is a network that is confined to a relatively small area. It is generally limited to a geographic area such as a writing lab, school, or building. Computers connected to a network are broadly categorized as servers or workstations. Servers are generally not used by humans directly, but rather run continuously to provide "services" to the other computers (and their human users) on the network. Services provided can include printing and faxing, software hosting, file storage and sharing, messaging, data storage and retrieval, complete access control (security) for the network's resources, and many others. Workstations are called such because they typically have a human user who interacts with the network through them. Workstations were traditionally considered a desktop, consisting of a computer, keyboard, display, and mouse, or a laptop, with integrated keyboard, display, and touchpad. With the advent of the tablet computer and of touch screen devices such as the iPad and iPhone, our definition of workstation is quickly evolving to include those devices, because of their ability to interact with the network and utilize network services. Servers tend to be more powerful than workstations, although configurations are guided by needs. For example, a group of servers might be located in a secure area, away from humans, and only accessed through the network. In such cases, it would be common for the servers to operate without a dedicated display or keyboard. However, the size and speed of the server's processor(s), hard drive, and main memory might add dramatically to the cost of the system. On the other hand, a workstation might not need as much storage or working memory, but might require an expensive display to accommodate the needs of its user. Every computer on a network should be appropriately configured for its use. On a single LAN, computers and servers may be connected by cables or wirelessly. Wireless access to a wired network is made possible by wireless access points (WAPs). These WAP devices provide a bridge between computers and networks. A typical WAP might have the theoretical capacity to connect hundreds or even thousands of wireless users to a network, although practical capacity might be far less. Servers are nearly always connected to the network by cables, because cable connections remain the fastest.
Workstations that are stationary (desktops) are also usually connected by a cable to the network, although the cost of wireless adapters has dropped to the point that, when installing workstations in an existing facility with inadequate wiring, it can be easier and less expensive to use wireless for a desktop.
1.7.2 Wide Area Network: Wide Area Networks (WANs) connect networks in larger geographic areas, such as Florida, the United States, or the world. Dedicated transoceanic cabling or satellite uplinks may be used to connect this type of global network. Using a WAN, schools in Florida can communicate with places like Tokyo in a matter of seconds, without paying enormous phone bills. Two users a half-world apart with workstations equipped with microphones and webcams can teleconference in real time. A WAN is complicated. It uses multiplexers, bridges, and routers to connect local and metropolitan networks to global communications networks like the Internet. To users, however, a WAN will not appear much different from a LAN.
1.8 Social Network: A social network is a social structure made up of a set of social actors (such as individuals or organizations) and a set of the dyadic ties between these actors. The social network perspective provides a set of methods for analyzing the structure of whole social entities as well as a variety of theories explaining the patterns observed in these structures. The study of these structures uses social network analysis to identify local and global patterns, locate influential entities, and examine network dynamics. Social networks and their analysis form an inherently interdisciplinary academic field which emerged from social psychology, sociology, statistics, and graph theory. Georg Simmel authored early structural theories in sociology emphasizing the dynamics of triads and the "web of group affiliations." Jacob Moreno is credited with developing the first sociograms in the 1930s to study interpersonal relationships. These approaches were mathematically formalized in the 1950s, and theories and methods of social networks became pervasive in the social and behavioral sciences by the 1980s. Social network analysis is now one of the major paradigms in contemporary sociology, and is also employed in a number of other social and formal sciences. Together with other complex networks, it forms part of the nascent field of network science. Communication through social networks, such as Facebook and Twitter, is of increasing importance in our daily life. Since the information exchanged over social networks includes not only texts but also URLs, images, and videos, these networks are challenging test beds for the study of data mining. There is another type of information that is intentionally or unintentionally exchanged over social networks: mentions. Here we mean by mentions links to other users of the same social network in the form of message-to, reply-to, retweet-of, or explicit references in the text. One post may contain a number of mentions. Some users may include mentions in their posts rarely; other users may be mentioning their friends all the time. Some users (like celebrities) may receive mentions every minute; for others, being mentioned might be a rare occasion. In this sense, mention is like a language with the number of words equal to the number of users in a social network. We are interested in detecting emerging topics from social network streams based on monitoring the mentioning behavior of users.
Our basic assumption is that a new (emerging) topic is something people feel like discussing, commenting on, or forwarding to their friends. Conventional approaches for topic detection have mainly been concerned with the frequencies of (textual) words. A term-frequency-based approach could suffer from the ambiguity caused by synonyms or homonyms. It may also require complicated preprocessing (e.g., segmentation) depending on the target language. Moreover, it cannot be applied when the contents of the messages are mostly non-textual information. On the other hand, the words formed by mentions are unique, require little preprocessing to obtain (the information is often separated from the contents), and are available regardless of the nature of the contents.
1.9 Anomaly Detection: In data mining, anomaly detection (or outlier detection) is the identification of items, events, or observations which do not conform to an expected pattern or to other items in a dataset. Typically the anomalous items will translate to some kind of problem such as bank fraud, a structural defect, medical problems, or errors in text. Anomalies are also referred to as outliers, novelties, noise, deviations, and exceptions. In particular, in the context of abuse and network intrusion detection, the interesting objects are often not rare objects, but unexpected bursts in activity. This pattern does not adhere to the common statistical definition of an outlier as a rare object, and many outlier detection methods (in particular unsupervised methods) will fail on such data unless it has been aggregated appropriately. Instead, a cluster analysis algorithm may be able to detect the micro-clusters formed by these patterns. Three broad categories of anomaly detection techniques exist. Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal, by looking for instances that seem to fit least with the remainder of the data set. Supervised anomaly detection techniques require a data set that has been labeled as "normal" and "abnormal" and involve training a classifier (the key difference from many other statistical classification problems is the inherently unbalanced nature of outlier detection). Semi-supervised anomaly detection techniques construct a model representing normal behavior from a given normal training data set, and then test the likelihood of a test instance being generated by the learnt model.
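As a simple illustration of the unsupervised case described above, the following Java sketch scores each value of a one-dimensional data set by its deviation from the mean and flags the values that fit least with the rest. The class name, the use of a plain deviation (z-score), and the cutoff parameter are illustrative assumptions made for this report; they are not part of any specific library or of the proposed method.

import java.util.ArrayList;
import java.util.List;

// Minimal illustration of unsupervised anomaly detection on a 1-D data set:
// score each value by its distance from the mean in standard deviations and
// flag values whose score exceeds a chosen cutoff.
public class SimpleOutlierDetector {
    public static List<Integer> findOutliers(double[] data, double cutoff) {
        double mean = 0.0;
        for (double x : data) mean += x;
        mean /= data.length;

        double var = 0.0;
        for (double x : data) var += (x - mean) * (x - mean);
        double std = Math.sqrt(var / data.length);

        List<Integer> outliers = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            double z = Math.abs(data[i] - mean) / std;   // deviation score
            if (z > cutoff) outliers.add(i);             // "fits least" -> flag as anomaly
        }
        return outliers;
    }
}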

2. Literature Survey

2.1 Topic Detection and Tracking Pilot Study, AUTHORS: J. Allan et al. Topic Detection and Tracking (TDT) is a DARPA-sponsored initiative to investigate the state of the art in finding and following new events in a stream of broadcast news stories. The TDT problem consists of three major tasks: (1) segmenting a stream of data, especially recognized speech, into distinct stories; (2) identifying those news stories that are the first to discuss a new event occurring in the news; and (3) given a small number of sample news stories about an event, finding all following stories in the stream. The TDT Pilot Study ran from September 1996 through October 1997. The primary participants were DARPA, Carnegie Mellon University, Dragon Systems, and the University of Massachusetts at Amherst. This report summarizes the findings of the pilot study. The TDT work continues in a new project involving larger training and test corpora, more active participants, and a more broadly defined notion of "topic" than was used in the pilot study.

2.2 Bursty and Hierarchical Structure in Streams, AUTHORS: J. Kleinberg. A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. E-mail and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade away. The published literature in a particular research field can be seen to exhibit similar phenomena over a much longer time scale. Underlying much of the text mining work in this area is the following intuitive premise: that the appearance of a topic in a document stream is signaled by a "burst of activity," with certain features rising sharply in frequency as the topic emerges. The goal of the present work is to develop a formal approach for modeling such "bursts," in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content. The approach is based on modeling the stream using an infinite-state automaton, in which bursts appear naturally as state transitions; in some ways, it can be viewed as drawing an analogy with models from queuing theory for bursty network traffic. The resulting algorithms are highly efficient, and yield a nested representation of the set of bursts that imposes a hierarchical structure on the overall stream. Experiments with e-mail and research paper archives suggest that the resulting structures have a natural meaning in terms of the content that gave rise to them.

2.3 Real-Time Change-Point Detection Using Sequentially Discounting Normalized Maximum Likelihood Coding, AUTHORS: Y. Urabe, K. Yamanishi, R. Tomioka, and H. Iwai. We are concerned with the issue of real-time change-point detection in time series. This technology has recently received vast attention in the area of data mining since it can be applied to a wide variety of important risk management issues such as the detection of failures of computer devices from computer performance data, the detection of masqueraders/malicious executables from computer access logs, etc. In this paper we propose a new method of real-time change-point detection employing sequentially discounting normalized maximum likelihood coding (SDNML). Here the SDNML is a method for sequential data compression of a sequence, which we newly develop in this paper. It attains the least code length for the sequence, and the effect of past data is gradually discounted as time goes on, hence the data compression can be done adaptively to non-stationary data sources. In our method, the SDNML is used to learn the mechanism of a time series, and then a change-point score at each time is measured in terms of the SDNML code length. We empirically demonstrate the significant superiority of our method over existing methods, such as the predictive-coding method and the hypothesis testing method, in terms of detection accuracy and computational efficiency for artificial data sets. We further apply our method to a real security issue called malware detection. We empirically demonstrate that our method is able to detect unseen security incidents at significantly early stages.

2.4 Model Selection by Sequentially Normalized Least Squares, AUTHORS: J. Rissanen, T. Roos, and P. Myllymaki. Model selection by means of the predictive least squares (PLS) principle has been thoroughly studied in the context of regression model selection and autoregressive (AR) model order estimation. We introduce a new criterion based on sequentially minimized squared deviations, which are smaller than both the usual least squares and the squared prediction errors used in PLS. We also prove that our criterion has a probabilistic interpretation as a model which is asymptotically optimal within the given class of distributions by reaching the lower bound on the logarithmic prediction errors, given by the so-called stochastic complexity, and approximated by BIC. This holds when the regressor (design) matrix is non-random or determined by the observed data, as in AR models. The advantages of the criterion include the fact that it can be evaluated efficiently and exactly, without asymptotic approximations, and, importantly, there are no adjustable hyper-parameters, which makes it applicable to both small and large amounts of data.

2.5 Dynamic Syslog Mining for Network Failure Monitoring, AUTHORS: K. Yamanishi and Y. Maruyama. Syslog monitoring technologies have recently received vast attention in the areas of network management and network monitoring. They are used to address a wide range of important issues including network failure symptom detection and event correlation discovery. Syslogs are intrinsically dynamic in the sense that they form a time series and that their behavior may change over time. This paper proposes a new methodology of dynamic syslog mining in order to detect failure symptoms with higher confidence and to discover sequential alarm patterns among computer devices. The key ideas of dynamic syslog mining are 1) to represent syslog behavior using a mixture of Hidden Markov Models, 2) to adaptively learn the model using an on-line discounting learning algorithm in combination with dynamic selection of the optimal number of mixture components, and 3) to give anomaly scores using universal test statistics with a dynamically optimized threshold. Using real syslog data we demonstrate the validity of our methodology in the scenarios of failure symptom detection, emerging pattern identification, and correlation discovery.

3. System Study
3.1 Feasibility Study: A feasibility study, also known as a feasibility analysis, is an analysis of the viability of an idea. It describes a preliminary study undertaken to determine and document a project's viability. The results of this analysis are used in making the decision whether to proceed with the project or not. This analytical tool, used during the project planning phase, shows how a business would operate under a set of assumptions, such as the technology used, the facilities and equipment, the capital needs, and other financial aspects. The study is the first point in the project development process at which it is shown whether the project can be a technically and economically feasible concept. As the study requires a strong financial and technical background, outside consultants conduct most studies. A feasible project is one where the project could generate an adequate amount of cash flow and profits, withstand the risks it will encounter, remain viable in the long term, and meet the goals of the business. The venture can be a start-up of a new business, a purchase of an existing business, or an expansion of the current business. The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is to be carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.

(Figure 3.1 Model of the systems development life cycle, highlighting the maintenance phase)
Three key considerations involved in the feasibility analysis are:
3.1.1 Economical Feasibility
3.1.2 Technical Feasibility
3.1.3 Social Feasibility

3.1.1 Economical Feasibility: Economic evaluation is a vital part of investment appraisal, dealing with factors that can be quantified, measured, and compared in monetary terms. The results of an economic evaluation are considered together with other aspects to make the project investment decision, as proper investment appraisal helps to ensure that the right project is undertaken in a manner that gives it the best chance of success. Project investments involve the expenditure of capital funds and other resources to generate future benefits, whether in the form of profits, cost savings, or social benefits. For an investment to be worthwhile, the future benefits should compare favorably with the prior expenditure of resources needed to achieve them. This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited. The expenditures must be justified. The developed system is well within the budget, and this was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.
3.1.2 Technical Feasibility: This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client. The developed system must have modest requirements, as only minimal or no changes are required for implementing this system.
3.1.3 Social Feasibility: Social impact analysis / social feasibility. Social Impact Assessment (SIA) is a process that provides a framework for prioritizing, gathering, analyzing, and incorporating social information and participation into the design and delivery of projects. It ensures that infrastructure project development is:

informed and takes into account the key relevant social issues, and incorporates a participation strategy for involving a wide range of stakeholders. At the micro-level, SIA considers impacts on individuals; at the meso-level, impacts on collectives (e.g., groups of people, institutions, and organizations); and at the macro-level, impacts on social macro-systems (e.g., national and international political and legal systems). The stages in Social Impact Assessment are:
- Describe the relevant human environment / area of influence and baseline conditions.
- Develop an effective public plan to involve all potentially affected publics.
- Describe the proposed action or policy change and reasonable alternatives.
- Scoping to identify the full range of probable social impacts.
- Screening to determine the boundaries of the SIA.
- Predicting responses to impacts.
- Develop a monitoring plan and mitigation measures.
Ideally the SIA should be an integral part of other assessments, as shown below.

(Figure 3.2 SIA in relation to other assessments) This aspect of the study is to check the level of acceptance of the system by the users. This includes the process of training the users to use the system efficiently. The users must not feel threatened by the system; instead, they must accept it as a necessity. The level of acceptance by the users solely depends on the methods that are employed to educate the users about the system and to make them familiar with it. Their level of confidence must be raised so that they are also able to offer constructive criticism, which is welcomed, as they are the final users of the system.

4. System Analysis & System Requirements
After analyzing the requirements of the task to be performed, the next step is to analyze the problem and understand its context. The first activity in this phase is studying the existing system, and the other is understanding the requirements and domain of the new system. Both activities are equally important, but the first activity serves as a basis for giving the functional specifications and then the successful design of the proposed system. Understanding the properties and requirements of a new system is more difficult and requires creative thinking; understanding an existing running system is also difficult, and improper understanding of the present system can lead to diversion from the solution.

4.1 Analysis Model: SDLC Methodologies
This document plays a vital role in the software development life cycle (SDLC) as it describes the complete requirements of the system. It is meant for use by the developers and will be the basis during the testing phase. Any changes made to the requirements in the future will have to go through a formal change approval process.
The SPIRAL MODEL was defined by Barry Boehm in his 1988 article, "A Spiral Model of Software Development and Enhancement." This model was not the first model to discuss iterative development, but it was the first model to explain why the iteration matters. As originally envisioned, the iterations were typically six months to two years long. Each phase starts with a design goal and ends with the client reviewing the progress thus far. Analysis and engineering efforts are applied at each phase of the project, with an eye toward the end goal of the project.

The steps for Spiral Model can be generalized as follows:

- The new system requirements are defined in as much detail as possible. This usually involves interviewing a number of users representing all the external or internal users and other aspects of the existing system.
- A preliminary design is created for the new system.
- A first prototype of the new system is constructed from the preliminary design. This is usually a scaled-down system, and represents an approximation of the characteristics of the final product.
- A second prototype is evolved by a fourfold procedure: 1) evaluating the first prototype in terms of its strengths and weaknesses; 2) defining the requirements of the second prototype; 3) planning and designing the second prototype; 4) constructing and testing the second prototype.
- At the customer's option, the entire project can be aborted if the risk is deemed too great. Risk factors might involve development cost overruns, operating-cost miscalculation, or any other factor that could, in the customer's judgment, result in a less-than-satisfactory final product.
- The existing prototype is evaluated in the same manner as was the previous prototype, and, if necessary, another prototype is developed from it according to the fourfold procedure outlined above.
- The preceding steps are iterated until the customer is satisfied that the refined prototype represents the final product desired.
- The final system is constructed, based on the refined prototype.
- The final system is thoroughly evaluated and tested. Routine maintenance is carried out on a continuing basis to prevent large-scale failures and to minimize downtime.

The following diagram shows how a spiral model acts like:

(Figure 4.1-Spiral Model)

For flexibility of use, the interface has been developed with a graphics concept in mind, accessed through a browser interface. The GUIs at the top level have been categorized as:
1) Administrative user interface.
2) The operational or generic user interface.
The administrative user interface concentrates on the consistent information that is practically part of the organizational activities and which needs proper authentication for data collection. These interfaces help the administrators with all the transactional states, like data insertion, data deletion, and data updating, along with extensive data search capabilities.
The operational or generic user interface helps the users of the system in transactions through the existing data and required services. The operational user interface also helps the ordinary users in managing their own information in a customized manner, as per the assisted flexibilities.

4.2 Existing System: A new (emerging) topic is something people feel like discussing, commenting on, or forwarding to their friends. Conventional approaches for topic detection have mainly been concerned with the frequencies of (textual) words.
4.2.1 Disadvantages of Existing System: A term-frequency-based approach could suffer from the ambiguity caused by synonyms or homonyms. It may also require complicated preprocessing (e.g., segmentation) depending on the target language. Moreover, it cannot be applied when the contents of the messages are mostly non-textual information. On the other hand, the words formed by mentions are unique, require little preprocessing to obtain (the information is often separated from the contents), and are available regardless of the nature of the contents.
4.3 Proposed System: In this paper, we have proposed a new approach to detect the emergence of topics in a social network stream. The basic idea of our approach is to focus on the social aspect of the posts, reflected in the mentioning behavior of users, instead of the textual contents. We have proposed a probability model that captures both the number of mentions per post and the frequency of mentionees.
4.3.1 Advantages of Proposed System: Since the proposed method does not rely on the textual contents of social network posts, it is robust to rephrasing, and it can be applied to cases where topics are concerned with information other than texts, such as images, video, audio, and so on. The proposed link-anomaly-based methods performed even better than the keyword-based methods on the NASA and BBC data sets.
4.4 Developer's Responsibilities Overview:

The developer is responsible for:
- Developing the system so that it meets the SRS and solves all the requirements of the system.
- Demonstrating the system and installing the system at the client's location after the acceptance testing is successful.
- Submitting the required user manual describing the system interfaces, as well as the documents of the system.
- Conducting any user training that might be needed for using the system.
- Maintaining the system for a period of one year after installation.

4.5 Functional Requirements:

Functional Requirements refer to very important system requirements in a software engineering process (or, at the micro level, a sub-part of requirements engineering), such as technical specifications, system design parameters and guidelines, data manipulation, data processing, and calculation modules. Functional Requirements stand in contrast to other software design requirements referred to as Non-Functional Requirements, which are primarily based on parameters of system performance, software quality attributes, reliability and security, cost, constraints in design/implementation, etc. The key goal of determining functional requirements in a software product design and implementation is to capture the required behavior of a software system in terms of functionality and the technology implementation of the business processes. The Functional Requirement document (also called Functional Specifications or Functional Requirement Specifications) defines the capabilities and functions that a system must be able to perform successfully. Functional Requirements should include:
- Descriptions of data to be entered into the system.
- Descriptions of operations performed by each screen.
- Descriptions of work-flows performed by the system.
- Descriptions of system reports or other outputs.
- Who can enter the data into the system.
- How the system meets applicable regulatory requirements.
The functional specification is designed to be read by a general audience. Readers should understand the system, but no particular technical knowledge should be required to understand the document.

4.5.1 Examples of Functional Requirements:

Functional requirements should include functions performed by specific screens, outlines of work-flows performed by the system, and other business or compliance requirements the system must meet.

4.5.2 Interface Requirements:

- Field accepts numeric data entry.
- Field only accepts dates before the current date.
- Screen can print on-screen data to the printer.

4.5.3 Business Requirements:

- Data must be entered before a request can be approved.
- Clicking the Approve button moves the request to the Approval Workflow.
- All personnel using the system will be trained according to internal training strategies.

4.5.4 Regulatory/Compliance Requirements:

- The database will have a functional audit trail.
- The system will limit access to authorized users.
- The spreadsheet can secure data with electronic signatures.

4.5.5 Security Requirements:

- Members of the Data Entry group can enter requests but not approve or delete requests.
- Members of the Managers group can enter or approve a request, but not delete requests.
- Members of the Administrators group cannot enter or approve requests, but can delete requests.
The functional specification describes what the system must do; how the system does it is described in the Design Specification. If a User Requirement Specification was written, all requirements outlined in the User Requirement Specification should be addressed in the Functional Requirements.

4.6 Non-Functional Requirements:

All the other requirements which do not form a part of the above specification are categorized as Non-Functional Requirements. For example, a system may be required to present the user with a display of the number of records in a database.

4.7 Hardware Requirements:

- System: Pentium IV, 2.4 GHz
- Hard Disk: 40 GB
- Floppy Drive: 1.44 MB
- Monitor: 15" VGA Colour
- Mouse: Logitech
- RAM: 512 MB

4.8 Software Requirements:

- Operating system: Windows XP/7
- Coding Language: Java/J2EE
- IDE: NetBeans 7.4
- Database: MySQL

5. System Design
Systems design is the process of defining the architecture, components, modules, interfaces, and data for a system to satisfy specified requirements. Systems design could be seen as the application of systems theory to product development. There is some overlap with the disciplines of systems analysis, systems architecture, and systems engineering.
5.1 System Architecture:

(Figure 5.1 Overall flow of the proposed method)

5.2 Block Diagram

(Figure 5.2 Block Diagram of Proposed System)

5.3 Data Flow Diagram:
1. The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on this data, and the output data generated by the system.
2. The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, the external entities that interact with the system, and the information flows in the system.
3. The DFD shows how information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
4. A DFD may be used to represent a system at any level of abstraction. DFDs may be partitioned into levels that represent increasing information flow and functional detail.

Level 0:

(Figure 5.3 Data Flow Diagram of Twitter Trends)

Level 1:


(Figure 5.4 Data Flow Diagram of Perform Training)

Level 2:

(Figure 5.5 Data Flow Diagram of Change Point)

Level 3:

(Figure 5.6 Data Flow Diagram of Key Based Detection)

5.4 UML Diagrams: UML stands for Unified Modeling Language. UML is a standardized general-purpose modeling language in the field of object-oriented software engineering. The standard is managed, and was created, by the Object Management Group. The goal is for UML to become a common language for creating models of object-oriented computer software. In its current form, UML comprises two major components: a meta-model and a notation. In the future, some form of method or process may also be added to, or associated with, UML. The Unified Modeling Language is a standard language for specifying, visualizing, constructing, and documenting the artifacts of a software system, as well as for business modeling and other non-software systems. The UML represents a collection of best engineering practices that have proven successful in the modeling of large and complex systems. The UML is a very important part of developing object-oriented software and the software development process. The UML uses mostly graphical notations to express the design of software projects.

5.5 Goals: The primary goals in the design of the UML are as follows:
1. Provide users a ready-to-use, expressive visual modeling language so that they can develop and exchange meaningful models.
2. Provide extendibility and specialization mechanisms to extend the core concepts.
3. Be independent of particular programming languages and development processes.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of the OO tools market.
6. Support higher-level development concepts such as collaborations, frameworks, patterns, and components.
7. Integrate best practices.

5.6 Use Case Diagram:A use case diagram in the Unified Modeling Language (UML) is a type of behavioral diagram defined by and created from a Use-case analysis. Its purpose is to present a graphical overview of the functionality provided by a system in terms of actors, their goals (represented as use cases), and any dependencies between those use cases. The main purpose of a use case diagram is to show what system functions are performed for which actor. Roles of the actors in the system can be depicted.

(Figure 5.7 Actor & Use Case)

5.7 Class Diagram: In software engineering, a class diagram in the Unified Modeling Language (UML) is a type of static structure diagram that describes the structure of a system by showing the system's classes, their attributes, operations (or methods), and the relationships among the classes. It explains which class contains information.

(Figure 5.8 Class Diagram Representation )

5.8 Sequence Diagram: A sequence diagram in the Unified Modeling Language (UML) is a kind of interaction diagram that shows how processes operate with one another and in what order. It is a construct of a Message Sequence Chart. Sequence diagrams are sometimes called event diagrams, event scenarios, and timing diagrams.

(Figure 5.9 Sequence Diagram Representing Sequence Activities)

5.9 Activity Diagram: Activity diagrams are graphical representations of workflows of stepwise activities and actions with support for choice, iteration, and concurrency. In the Unified Modeling Language, activity diagrams can be used to describe the business and operational step-by-step workflows of components in a system. An activity diagram shows the overall flow of control.

(Figure 5.10 Activity Diagram Showing Sequence of Activities)

5.10 Collaborative Diagram :

(Figure 5.11 Collaborative Diagram Showing Collaboration Between All The Use Cases)

5.11 Input Design: The input design is the link between the information system and the user. It comprises developing specifications and procedures for data preparation, and those steps necessary to put transaction data into a usable form for processing, which can be achieved by having the computer read data from a written or printed document or by having people key the data directly into the system. The design of input focuses on controlling the amount of input required, controlling errors, avoiding delay, avoiding extra steps, and keeping the process simple. The input is designed in such a way that it provides security and ease of use while retaining privacy. Input design considered the following things:
- What data should be given as input?
- How should the data be arranged or coded?
- The dialog to guide the operating personnel in providing input.
- Methods for preparing input validations and the steps to follow when errors occur.

5.12 Objectives:
1. Input design is the process of converting a user-oriented description of the input into a computer-based system. This design is important to avoid errors in the data input process and to show the correct direction to the management for getting correct information from the computerized system.
2. It is achieved by creating user-friendly screens for data entry to handle large volumes of data. The goal of designing input is to make data entry easier and free from errors. The data entry screen is designed in such a way that all the data manipulations can be performed. It also provides record viewing facilities.
3. When the data is entered, it will be checked for validity. Data can be entered with the help of screens. Appropriate messages are provided as and when needed so that the user is not left confused. Thus the objective of input design is to create an input layout that is easy to follow.

5.13 Output Design: A quality output is one which meets the requirements of the end user and presents the information clearly. In any system, the results of processing are communicated to the users and to other systems through outputs. In output design it is determined how the information is to be displayed for immediate need, as well as the hard-copy output. It is the most important and direct source of information to the user. Efficient and intelligent output design improves the system's relationship with the user and helps in decision-making.
1. Designing computer output should proceed in an organized, well thought out manner; the right output must be developed while ensuring that each output element is designed so that people will find the system easy and effective to use. When analysts design computer output, they should identify the specific output that is needed to meet the requirements.
2. Select methods for presenting information.
3. Create documents, reports, or other formats that contain information produced by the system.
The output form of an information system should accomplish one or more of the following objectives:
- Convey information about past activities, current status, or projections of the future.
- Signal important events, opportunities, problems, or warnings.
- Trigger an action.
- Confirm an action.

5.14 Related Work:

Detection and tracking of topics have been studied extensively in the area of topic detection and tracking (TDT). In this context, the main task is to either classify a new document into one of the known topics (tracking) or to detect that it belongs to none of the known categories. Subsequently, the temporal structure of topics has been modeled and analyzed through dynamic model selection, temporal text mining, and factorial hidden Markov models. Another line of research is concerned with formalizing the notion of bursts in a stream of documents. In his seminal paper, Kleinberg modeled bursts using a time-varying Poisson process with a hidden discrete process that controls the firing rate. Recently, He and Parker developed a physics-inspired model of bursts based on the change in the momentum of topics. All the above-mentioned studies make use of the textual content of the documents, but not the social content of the documents. The social content (links) has been utilized in the study of citation networks. However, citation networks are often analyzed in a stationary setting. The novelty of the current paper lies in focusing on the social content of the documents (posts) and in combining this with a change-point analysis.

5.15 Proposed Method: The overall flow of the proposed method is shown in Figure 5.1. We assume that the data arrives from a social network service in a sequential manner through some API. For each new post we use samples within the past T time interval for the corresponding user for training the mention model we propose below. We assign an anomaly score to each post based on the learned probability distribution. The score is then aggregated over users and further fed into a change-point analysis.

A. Probability Model
We characterize a post in a social network stream by the number of mentions k it contains, and the set V of names (IDs) of the users mentioned in the post. Formally, we consider the following joint probability distribution:

    P(k, V | \pi, \theta) = P(k | \pi) \prod_{v \in V} \theta_v    (1)

Here the joint distribution consists of two parts: the probability of the number of mentions k and the probability of each mention given the number of mentions. The probability of the number of mentions P(k | \pi) is defined as a geometric distribution with parameter \pi as follows:

    P(k | \pi) = \pi (1 - \pi)^k    (2)

On the other hand, the probability of mentioning the users in V is defined as an independent, identical multinomial distribution with parameters \theta = (\theta_v). Suppose that we are given n training examples T = \{(k_1, V_1), \ldots, (k_n, V_n)\} from which we would like to learn the predictive distribution:

    P(x | T) = P(k | T) \prod_{v \in V} P(v | T)    (3)

First we compute the predictive distribution with respect to the number of mentions, P(k | T). This can be obtained by assuming a beta distribution as a prior and integrating out the parameter \pi. The density function of the beta prior distribution is written as follows:

    p(\pi | \alpha, \beta) = \frac{\pi^{\alpha - 1} (1 - \pi)^{\beta - 1}}{B(\alpha, \beta)}

where \alpha and \beta are parameters of the beta distribution and B(\alpha, \beta) is the beta function. By the Bayes rule, the predictive distribution can be obtained as follows:

    P(k | T) = \frac{\int P(k | \pi) \prod_{i=1}^{n} P(k_i | \pi) \, p(\pi | \alpha, \beta) \, d\pi}{\int \prod_{i=1}^{n} P(k_i | \pi) \, p(\pi | \alpha, \beta) \, d\pi}

Both the integrals in the numerator and the denominator can be obtained in closed form as beta functions, and the predictive distribution can be rewritten as follows:

    P(k | T) = \frac{B(\alpha + n + 1, \, \beta + m + k)}{B(\alpha + n, \, \beta + m)}

Using the relation between the beta function and the gamma function, we can further simplify the expression as follows:

    P(k | T) = \frac{\alpha + n}{\alpha + \beta + n + m} \prod_{i=1}^{k} \frac{\beta + m + i - 1}{\alpha + \beta + n + m + i}    (4)

where m = \sum_{i=1}^{n} k_i is the total number of mentions in the training set T. Next, we derive the predictive distribution P(v | T) of mentioning user v. The maximum likelihood (ML) estimator is given as P(v | T) = m_v / m, where m is the total number of mentions and m_v is the number of mentions to user v in the data set T. The ML estimator, however, cannot handle users that did not appear in the training set T; it would assign probability zero to all these users, which would appear infinitely anomalous in our framework. Instead we use a Chinese Restaurant Process (CRP; see [9]) based estimation. The CRP-based estimator assigns to each user v a probability proportional to the number of mentions m_v in the training set T; in addition, it keeps probability proportional to \gamma for mentioning someone who was not mentioned in the training set T. Accordingly, the probability of known users is given as follows:

    P(v | T) = \frac{m_v}{m + \gamma}    (5)

On the other hand, the probability of mentioning a new user is given as follows:

    P(v | T) = \frac{\gamma}{m + \gamma}    (6)
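The following Java sketch (Java being the implementation language listed in Section 4.8) shows one way the predictive probabilities above might be computed. The class name MentionModel, the way the hyperparameters are passed in, and the Lanczos log-gamma helper are illustrative assumptions made for this report; only the formulas themselves follow equations (4), (5), and (6). In the full method, one such model would be trained per user on that user's posts from the preceding T = 30 days.

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the mention probability model (assumed names/structure).
public class MentionModel {
    private final double alpha;   // beta prior parameter
    private final double beta;    // beta prior parameter
    private final double gamma;   // CRP parameter for unseen mentionees
    private int n = 0;            // number of training posts
    private int m = 0;            // total number of mentions in the training set T
    private final Map<String, Integer> mentionCounts = new HashMap<>();

    public MentionModel(double alpha, double beta, double gamma) {
        this.alpha = alpha; this.beta = beta; this.gamma = gamma;
    }

    // Add one training post: its mentions are the users in mentionedUsers.
    public void addTrainingPost(String[] mentionedUsers) {
        n++;
        m += mentionedUsers.length;
        for (String v : mentionedUsers) {
            mentionCounts.merge(v, 1, Integer::sum);
        }
    }

    // Predictive probability of observing k mentions, equivalent to equation (4),
    // computed here in its beta-function form via log-gamma for numerical stability.
    public double probNumMentions(int k) {
        double logP = logBeta(alpha + n + 1, beta + m + k) - logBeta(alpha + n, beta + m);
        return Math.exp(logP);
    }

    // Predictive probability of mentioning user v (CRP), equations (5) and (6).
    public double probMention(String v) {
        int mv = mentionCounts.getOrDefault(v, 0);
        return (mv > 0) ? mv / (m + gamma) : gamma / (m + gamma);
    }

    private static double logBeta(double a, double b) {
        return logGamma(a) + logGamma(b) - logGamma(a + b);
    }

    // Lanczos approximation of log Gamma(x); adequate for a sketch.
    private static double logGamma(double x) {
        double[] c = {676.5203681218851, -1259.1392167224028, 771.32342877765313,
                -176.61502916214059, 12.507343278686905, -0.13857109526572012,
                9.9843695780195716e-6, 1.5056327351493116e-7};
        if (x < 0.5) {
            return Math.log(Math.PI / Math.sin(Math.PI * x)) - logGamma(1 - x);
        }
        x -= 1;
        double a = 0.99999999999980993;
        double t = x + 7.5;
        for (int i = 0; i < c.length; i++) a += c[i] / (x + i + 1);
        return 0.5 * Math.log(2 * Math.PI) + (x + 0.5) * Math.log(t) - t + Math.log(a);
    }
}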

B. Computing the Link-Anomaly Score
In order to compute the anomaly score of a new post x = (t, u, k, V) by user u at time t containing k mentions to the users in V, we compute the probability (3) with the training set T_u^{(t)}, which is the collection of posts by user u in the time period [t - T, t] (we use T = 30 days in this paper). Accordingly, the link-anomaly score is defined as follows:

    s(x) = -\log P(k | T_u^{(t)}) - \sum_{v \in V} \log P(v | T_u^{(t)})    (7)

The two terms in the above equation can be computed via the predictive distribution of the number of mentions (4), and the predictive distribution of the mentionee (5), (6), respectively.

C. Combining Anomaly Scores from Different Users
The anomaly score in (7) is computed for each user depending on the current post of user u and his/her past behavior T_u^{(t)}. In order to measure the general trend of user behavior, we propose to aggregate the anomaly scores obtained for posts x_1, \ldots, x_n using a discretization of window size \tau > 0 as follows:

    s'_j = \frac{1}{\tau} \sum_{i \, : \, t_i \in [\tau (j-1), \, \tau j)} s_i    (8)

where x_i = (t_i, u_i, k_i, V_i) is the post at time t_i by user u_i including k_i mentions to the users in V_i.
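A minimal sketch of the link-anomaly score (7) and the aggregation (8) is given below. It reuses the illustrative MentionModel class from the previous sketch; the class and method names, and the representation of a post as a timestamp/score pair for aggregation, are assumptions made for this report rather than part of the published method.

import java.util.List;

// Illustrative computation and aggregation of per-post link-anomaly scores.
public class LinkAnomaly {

    // Equation (7): anomaly score of a post with k mentions to the users in
    // mentionedUsers, under the predictive model trained on the author's recent posts.
    public static double linkAnomalyScore(MentionModel model, String[] mentionedUsers) {
        double score = -Math.log(model.probNumMentions(mentionedUsers.length));
        for (String v : mentionedUsers) {
            score += -Math.log(model.probMention(v));
        }
        return score;
    }

    // Equation (8): sum the scores of all posts whose timestamps fall into the
    // j-th window [ (j-1)*tau, j*tau ), divided by the window length tau.
    public static double aggregate(List<double[]> timeAndScore, double tau, int j) {
        double lo = (j - 1) * tau, hi = j * tau, sum = 0.0;
        for (double[] ts : timeAndScore) {        // ts[0] = timestamp, ts[1] = score
            if (ts[0] >= lo && ts[0] < hi) sum += ts[1];
        }
        return sum / tau;
    }
}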

D. Change-Point Detection via Sequentially Discounting Normalized Maximum Likelihood Coding
Given an aggregated measure of anomaly (8), we apply a change-point detection technique based on SDNML coding [3]. This technique detects a change in the statistical dependence structure of a time series by monitoring the compressibility of the new piece of data. The SDNML proposed in [3] is an approximation of the normalized maximum likelihood (NML) code length that can be computed sequentially and employs discounting in the learning of the AR models. Algorithmically, the change-point detection procedure can be outlined as follows. For convenience, we denote the aggregate anomaly score at discrete time j as x_j instead of s'_j.

1. 1st layer learning: Let x^{j-1} = \{x_1, \ldots, x_{j-1}\} be the collection of aggregate anomaly scores from discrete time 1 to j - 1. Sequentially learn the SDNML density function p_{SDNML}(x_j | x^{j-1}) (j = 1, 2, \ldots); see Appendix A for details.

2. 1st layer scoring: Compute the intermediate change-point score by smoothing the log loss of the SDNML density function with window size \kappa as follows:

    y_j = \frac{1}{\kappa} \sum_{i = j - \kappa + 1}^{j} \left( -\log p_{SDNML}(x_i | x^{i-1}) \right)

3. 2nd layer learning: Let y^{j-1} = \{y_1, \ldots, y_{j-1}\} be the collection of smoothed change-point scores obtained as above. Sequentially learn the second-layer SDNML density function p_{SDNML}(y_j | y^{j-1}) (j = 1, 2, \ldots); see Appendix A.

4. 2nd layer scoring: Compute the final change-point score by smoothing the log loss of the second-layer SDNML density function as follows:

    Score(j) = \frac{1}{\kappa} \sum_{i = j - \kappa + 1}^{j} \left( -\log p_{SDNML}(y_i | y^{i-1}) \right)
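The two-layer procedure above can be sketched as follows. The SDNML density itself is involved to implement, so the sketch assumes a hypothetical SdnmlDensity interface standing in for the sequentially learned SDNML density of [3]; the interface, class, and field names are illustrative assumptions, not part of the published method.

// Hypothetical interface: an SDNML estimator that returns the code length
// (negative log SDNML density) of a new value and updates its internal AR model.
interface SdnmlDensity {
    double logLoss(double x);
}

// Sketch of the two-layer change-point scoring described above.
public class TwoLayerChangePoint {
    private final SdnmlDensity firstLayer;
    private final SdnmlDensity secondLayer;
    private final double[] firstLosses;   // ring buffer of 1st-layer log losses
    private final double[] secondLosses;  // ring buffer of 2nd-layer log losses
    private final int kappa;              // smoothing window size
    private int t = 0;

    public TwoLayerChangePoint(SdnmlDensity first, SdnmlDensity second, int kappa) {
        this.firstLayer = first;
        this.secondLayer = second;
        this.kappa = kappa;
        this.firstLosses = new double[kappa];
        this.secondLosses = new double[kappa];
    }

    // Feed one aggregated anomaly score x_j; returns the final change-point score.
    public double update(double x) {
        // 1st layer: log loss of x under the sequentially learned SDNML density.
        firstLosses[t % kappa] = firstLayer.logLoss(x);
        double y = average(firstLosses, Math.min(t + 1, kappa));  // intermediate score

        // 2nd layer: log loss of the smoothed score y under a second SDNML density.
        secondLosses[t % kappa] = secondLayer.logLoss(y);
        double z = average(secondLosses, Math.min(t + 1, kappa)); // final score
        t++;
        return z;
    }

    private static double average(double[] buf, int count) {
        double sum = 0.0;
        for (int i = 0; i < count; i++) sum += buf[i];
        return sum / count;
    }
}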

E. Dynamic Threshold Optimization (DTO)
We raise an alarm if the change-point score exceeds a threshold, which is determined adaptively using the method of dynamic threshold optimization (DTO). In DTO, we use a one-dimensional histogram for the representation of the score distribution, and we learn it in a sequential and discounting way. Then, for a specified value \rho, we determine the threshold to be the score value such that the tail probability beyond it does not exceed \rho. We call \rho the threshold parameter. The details of DTO are summarized in Algorithm 1.

Algorithm 1: Dynamic Threshold Optimization (DTO)
Given: \{Score(j)\}: change-point scores, N_H: total number of histogram cells, \rho: threshold parameter, \lambda_H: estimation parameter, r_H: discounting parameter, M: data size.
Initialization: Let q_1(h) (h = 1, \ldots, N_H) be a uniform distribution over the cells.
For j = 1, \ldots, M - 1 do:
  Threshold optimization: Let l be the least index such that \sum_{h=l}^{N_H} q_j(h) \le \rho; the threshold \eta(j) at time j is given as the score value corresponding to the lower boundary of the l-th cell.
  Alarm output: Raise an alarm if Score(j) \ge \eta(j).
  Histogram update: Discount the past and add weight to the cell into which Score(j) falls, i.e., q_{j+1}(h) \propto (1 - r_H) q_j(h) + r_H if Score(j) falls into the h-th cell, and q_{j+1}(h) \propto (1 - r_H) q_j(h) otherwise, followed by normalization.
End for.
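A minimal sketch of this thresholding step is given below. It assumes a fixed score range for the histogram cells and uses only the discounting update written above; the estimation parameter \lambda_H from Algorithm 1 is not used in this simplified sketch, and the class name and cell-boundary choices are illustrative assumptions.

// Illustrative sketch of dynamic threshold optimization over a fixed score range.
public class DynamicThreshold {
    private final double[] q;        // histogram cells (kept normalized)
    private final double rho;        // threshold (tail-probability) parameter
    private final double rH;         // discounting parameter
    private final double minScore;   // assumed fixed score range for binning
    private final double maxScore;

    public DynamicThreshold(int cells, double rho, double rH,
                            double minScore, double maxScore) {
        this.q = new double[cells];
        java.util.Arrays.fill(q, 1.0 / cells);   // start from a uniform histogram
        this.rho = rho;
        this.rH = rH;
        this.minScore = minScore;
        this.maxScore = maxScore;
    }

    // Returns true (alarm) if the score exceeds the current threshold,
    // then updates the histogram with discounting.
    public boolean processScore(double score) {
        boolean alarm = score >= currentThreshold();
        int h = cellOf(score);
        for (int i = 0; i < q.length; i++) q[i] *= (1.0 - rH);  // discount old mass
        q[h] += rH;                                             // add mass to the current cell
        normalize();
        return alarm;
    }

    // Boundary above which the histogram's tail probability does not exceed rho.
    private double currentThreshold() {
        double tail = 0.0;
        for (int h = q.length - 1; h >= 0; h--) {
            tail += q[h];
            if (tail > rho) {
                // upper boundary of cell h: cells above it hold mass <= rho
                return minScore + (h + 1) * (maxScore - minScore) / q.length;
            }
        }
        return minScore;
    }

    private int cellOf(double score) {
        double clipped = Math.max(minScore, Math.min(score, maxScore));
        int h = (int) ((clipped - minScore) / (maxScore - minScore) * q.length);
        return Math.min(h, q.length - 1);
    }

    private void normalize() {
        double sum = 0.0;
        for (double v : q) sum += v;
        for (int i = 0; i < q.length; i++) q[i] /= sum;
    }
}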

6. Implementation
6.1 Modules:
Module Descriptions:
6.1.1 Training
6.1.2 Aggregate
6.1.3 Identify Individual Anomaly Score
6.1.4 Change Point Analysis and DTO
6.1.5 Burst Detection

6.1.1 Training:
In this section, we describe the probability model that we used to capture the normal mentioning behavior of a user and how to train the model. We characterize a post in a social network stream by the number of mentions k it contains, and the set V of names (IDs) of the mentionees (users who are mentioned in the post). There are two types of infinity we have to take into account here. The first is the number k of users mentioned in a post. Although in practice a user cannot mention hundreds of other users in a post, we would like to avoid putting an artificial limit on the number of users mentioned in a post. Instead, we assume a geometric distribution and integrate out the parameter to avoid even an implicit limitation through the parameter. The second type of infinity is the number of users one can possibly mention. To avoid limiting the number of users who can possibly be mentioned, we use a Chinese Restaurant Process (CRP) based estimation, following prior work that uses the CRP for an infinite vocabulary.
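As an illustration only, the following Java sketch maintains the statistics such a per-user model needs: the per-post mention counts for the geometric part (with its parameter integrated out under an assumed Beta(alpha, beta) prior) and the per-mentionee counts for the CRP part (with an assumed concentration parameter gamma). The class name, method names, and hyperparameter values are assumptions, not part of the original system.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Per-user model of normal mentioning behavior (illustrative sketch). */
class MentionModel {
    // Assumed hyperparameters: Beta(alpha, beta) prior on the geometric
    // parameter, and CRP concentration gamma for the mentionee distribution.
    private final double alpha = 0.5, beta = 0.5, gamma = 1.0;

    private int numPosts = 0;                 // number of training posts
    private int totalMentions = 0;            // sum of mention counts k_i
    private final Map<String, Integer> mentioneeCounts = new HashMap<>();

    /** Add one training post with k mentions to the users in mentionees. */
    void addPost(int k, List<String> mentionees) {
        numPosts++;
        totalMentions += k;
        for (String v : mentionees) {
            mentioneeCounts.merge(v, 1, Integer::sum);
        }
    }

    /** Predictive probability of observing k mentions in a new post
     *  (geometric likelihood with its parameter integrated out). */
    double mentionCountProb(int k) {
        double a = alpha + numPosts;          // posterior Beta parameters
        double b = beta + totalMentions;
        double p = a / (a + b + k);
        for (int i = 0; i < k; i++) {
            p *= (b + i) / (a + b + i);
        }
        return p;
    }

    /** Predictive probability of mentioning user v (CRP-based). */
    double mentioneeProb(String v) {
        int m = mentioneeCounts.getOrDefault(v, 0);
        if (m > 0) {
            return m / (totalMentions + gamma);   // previously mentioned user
        }
        return gamma / (totalMentions + gamma);   // probability of a new user
    }
}

Loading the last T = 30 days of a user's posts through addPost() yields the two predictive probabilities needed for the link-anomaly score described in the next subsection.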

6.1.2 Identify Individual Anomaly Score:
In this subsection, we describe how to compute the deviation of a user's behavior from the normal mentioning behavior modeled in the previous subsection.
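Building on the hypothetical MentionModel sketch above (names are assumptions), the individual link-anomaly score of a new post can be computed as the sum of the negative log probabilities of its mention count and of each mentionee:

import java.util.List;

class LinkAnomaly {
    /** Link-anomaly score of a new post with k mentions to the users in mentionees. */
    static double score(MentionModel model, int k, List<String> mentionees) {
        double score = -Math.log(model.mentionCountProb(k));   // -log p(k | training set)
        for (String v : mentionees) {
            score -= Math.log(model.mentioneeProb(v));          // -log p(v | training set)
        }
        return score;
    }
}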

6.1.3 Aggregate:
In this subsection, we describe how to combine the anomaly scores from different users. The anomaly score is computed for each user depending on the current post of user u and his/her past behavior. To measure the general trend of user behavior, we propose to aggregate the anomaly scores obtained for posts x_1, ..., x_n using a discretization of window size T > 0.
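A minimal sketch of this aggregation step, assuming equal-width windows and a simple per-window average (the exact normalization is an assumption):

class ScoreAggregator {
    /** Average link-anomaly score per time window of length windowSize. */
    static double[] aggregate(double[] times, double[] scores,
                              double start, double end, double windowSize) {
        int numWindows = (int) Math.ceil((end - start) / windowSize);
        double[] sum = new double[numWindows];
        int[] count = new int[numWindows];
        for (int i = 0; i < times.length; i++) {
            int w = (int) ((times[i] - start) / windowSize);   // window index of post i
            if (w >= 0 && w < numWindows) {
                sum[w] += scores[i];
                count[w]++;
            }
        }
        double[] aggregate = new double[numWindows];
        for (int w = 0; w < numWindows; w++) {
            aggregate[w] = count[w] > 0 ? sum[w] / count[w] : 0.0;
        }
        return aggregate;
    }
}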

6.1.4 Change Point Analysis and DTO:
This technique is an extension of ChangeFinder, which detects a change in the statistical dependence structure of a time series by monitoring the compressibility of a new piece of data. Urabe et al. proposed to use a sequential version of normalized maximum-likelihood (NML) coding, called SDNML coding, as a coding criterion instead of the plug-in predictive distribution used there. Specifically, a change point is detected through two layers of scoring processes: the first layer detects outliers and the second layer detects change points. In each layer, the predictive loss based on the SDNML coding distribution for an autoregressive (AR) model is used as the scoring criterion. Although the NML code length is known to be optimal, it is often hard to compute. The SNML code length is an approximation to the NML code length that can be computed in a sequential manner, and SDNML further employs discounting in the learning of the AR models.
As a final step in our method, we need to convert the change-point scores into binary alarms by thresholding. Since the distribution of change-point scores may change over time, we need to dynamically adjust the threshold to analyze a sequence over a long period of time. In this subsection, we describe how to dynamically optimize the threshold using the method of dynamic threshold optimization (DTO). In DTO, we use a one-dimensional histogram for the representation of the score distribution and learn it in a sequential and discounting way.
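The following sketch illustrates one way such a histogram-based, sequentially discounted threshold could be implemented in Java; the class name, the fixed score range, and the parameter values are assumptions for illustration rather than the exact procedure of Algorithm 1.

/** Illustrative histogram-based dynamic threshold with discounting. */
class DynamicThreshold {
    private final double[] hist;      // discounted histogram of past scores
    private final double rho;         // threshold (tail-probability) parameter
    private final double lambda;      // discounting parameter in (0, 1)
    private final double cellWidth;

    DynamicThreshold(int cells, double scoreMax, double rho, double lambda) {
        this.hist = new double[cells];
        java.util.Arrays.fill(hist, 1.0 / cells);   // start from a uniform distribution
        this.rho = rho;
        this.lambda = lambda;
        this.cellWidth = scoreMax / cells;
    }

    /** Returns true (raise an alarm) if the score exceeds the current threshold,
     *  then updates the histogram with the new score. */
    boolean observe(double score) {
        double threshold = currentThreshold();
        boolean alarm = score > threshold;
        int h = Math.min(hist.length - 1, (int) (score / cellWidth));
        for (int i = 0; i < hist.length; i++) {
            hist[i] = (1 - lambda) * hist[i] + (i == h ? lambda : 0.0);  // discounting update
        }
        return alarm;
    }

    /** Largest score value whose tail probability does not exceed rho. */
    private double currentThreshold() {
        double tail = 0.0;
        int l = hist.length;
        while (l > 0 && tail + hist[l - 1] <= rho) {   // accumulate tail mass from the top
            tail += hist[l - 1];
            l--;
        }
        return l * cellWidth;   // lower edge of the least cell whose tail mass stays within rho
    }
}

Feeding the change-point scores to observe() one by one yields the binary alarm sequence while the histogram, and hence the threshold, adapts over time.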

6.1.5 Burst Detection:
In addition to the change-point detection based on SDNML followed by DTO described in the previous sections, we also test the combination of our method with Kleinberg's burst-detection method. More specifically, we implemented a two-state version of Kleinberg's burst-detection model. We chose the two-state version because in this experiment we expect a non-hierarchical structure. The burst-detection method is based on a probabilistic automaton model with two states, a burst state and a non-burst state. Some events (e.g., the arrival of posts) are assumed to happen according to a time-varying Poisson process whose rate parameter depends on the current state.

6.2 Java Technology:
Java technology is both a programming language and a platform.

6.3 The Java Programming Language:
The Java programming language is a high-level language that can be characterized by all of the following buzzwords:
- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure
With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, you first translate a program into an intermediate language called Java byte codes, the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.

(Figure 6.1 Java structure)

You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it's a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java byte codes help make "write once, run anywhere" possible. You can compile your program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. That means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or an iMac.

(Figure 6.2 Java Program & Compiler)

6.4 The Java Platform:
A platform is the hardware or software environment in which a program runs. We've already mentioned some of the most popular platforms, such as Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it's a software-only platform that runs on top of other hardware-based platforms. The Java platform has two components:
- The Java Virtual Machine (Java VM)
- The Java Application Programming Interface (Java API)

You've already been introduced to the Java VM. It's the base for the Java platform and is ported onto various hardware-based platforms. The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, "What Can Java Technology Do?", highlights what functionality some of the packages in the Java API provide. The following figure depicts a program that is running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

(Figure 6.3 Java Platform)

Native code is code that, after you compile it, runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can bring performance close to that of native code without threatening portability.

6.5 What Can Java Technology Do?
The most common types of programs written in the Java programming language are applets and applications. If you've surfed the Web, you're probably already familiar with applets. An applet is a program that adheres to certain conventions that allow it to run within a Java-enabled browser. However, the Java programming language is not just for writing cute, entertaining applets for the Web. The general-purpose, high-level Java programming language is also a powerful software platform. Using the generous API, you can write many types of programs. An application is a standalone program that runs directly on the Java platform. A special kind of application known as a server serves and supports clients on a network. Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another specialized program is a servlet. A servlet can almost be thought of as an applet that runs on the server side. Java Servlets are a popular choice for building interactive web applications, replacing the use of CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications. Instead of working in browsers, though, servlets run within Java Web servers, configuring or tailoring the server. How does the API support all these kinds of programs? It does so with packages of software components that provide a wide range of functionality. Every full implementation of the Java platform gives you the following features:
- The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
- Applets: The set of conventions used by applets.
- Networking: URLs, TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
- Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
- Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
- Software components: Known as JavaBeans, which can plug into existing component architectures.
- Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
- Java Database Connectivity (JDBC): Provides uniform access to a wide range of relational databases.
The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

(Figure 6.4 Java IDE)

6.6 How Will Java Technology Change My Life?
We can't promise you fame, fortune, or even a job if you learn the Java programming language. Still, it is likely to make your programs better and requires less effort than other languages. We believe that Java technology will help you do the following:
- Get started quickly: Although the Java programming language is a powerful object-oriented language, it's easy to learn, especially for programmers already familiar with C or C++.
- Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
- Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people's tested code and introduce fewer bugs.
- Develop programs more quickly: Your development time may be as much as twice as fast as writing the same program in C++. Why? You write fewer lines of code, and it is a simpler programming language than C++.
- Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
- Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
- Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded on the fly, without recompiling the entire program.

6.7 ODBC:
Microsoft Open Database Connectivity (ODBC) is a standard programming interface for application developers and database systems providers. Before ODBC became a de facto standard for Windows programs to interface with database systems, programmers had to use proprietary languages for each database they wanted to connect to. Now, ODBC has made the choice of the database system almost irrelevant from a coding perspective, which is as it should be. Application developers have much more important things to worry about than the syntax that is needed to port their program from one database to another when business needs suddenly change. Through the ODBC Administrator in Control Panel, you can specify the particular database that is associated with a data source that an ODBC application program is written to use. Think of an ODBC data source as a door with a name on it. Each door will lead you to a particular database. For example, the data source named Sales Figures might be a SQL Server database, whereas the Accounts Payable data source could refer to an Access database. The physical database referred to by a data source can reside anywhere on the LAN. The ODBC system files are not installed on your system by Windows 95. Rather, they are installed when you set up a separate database application, such as SQL Server Client or Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-alone program called ODBCADM.EXE.
There is a 16-bit and a 32-bit version of this program and each maintains a separate list of ODBC data sources.

From a programming perspective, the beauty of ODBC is that the application can be written to use the same set of function calls to interface with any data source, regardless of the database vendor. The source code of the application doesn't change whether it talks to Oracle or SQL Server. We only mention these two as an example. There are ODBC drivers available for several dozen popular database systems. Even Excel spreadsheets and plain text files can be turned into data sources. The operating system uses the Registry information written by ODBC Administrator to determine which low-level ODBC drivers are needed to talk to the data source (such as the interface to Oracle or SQL Server). The loading of the ODBC drivers is transparent to the ODBC application program. In a client/server environment, the ODBC API even handles many of the network issues for the application programmer. The advantages of this scheme are so numerous that you are probably thinking there must be some catch. The only disadvantage of ODBC is that it isn't as efficient as talking directly to the native database interface. ODBC has had many detractors make the charge that it is too slow. Microsoft has always claimed that the critical factor in performance is the quality of the driver software that is used. In our humble opinion, this is true. The availability of good ODBC drivers has improved a great deal recently. And anyway, the criticism about performance is somewhat analogous to those who said that compilers would never match the speed of pure assembly language. Maybe not, but the compiler (or ODBC) gives you the opportunity to write cleaner programs, which means you finish sooner. Meanwhile, computers get faster every year.

6.8 JDBC:
In an effort to set an independent database standard API for Java, Sun Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface is achieved through the use of plug-in database connectivity modules, or drivers. If a database vendor wishes to have JDBC support, he or she must provide the driver for each platform that the database and Java run on. To gain wider acceptance of JDBC, Sun based JDBC's framework on ODBC. As you discovered earlier in this chapter, ODBC has widespread support on a variety of platforms. Basing JDBC on ODBC allows vendors to bring JDBC drivers to market much faster than developing a completely new connectivity solution. JDBC was announced in March of 1996. It was released for a 90-day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after. The remainder of this section will cover enough information about JDBC for you to know what it is about and how to use it effectively. This is by no means a complete overview of JDBC. That would fill an entire book.
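As a brief illustration of the JDBC API discussed here (the connection URL, table, and credentials are placeholders, and any JDBC driver can be substituted), a query can be issued as follows:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JdbcExample {
    public static void main(String[] args) throws Exception {
        // Placeholder connection URL and credentials.
        String url = "jdbc:mysql://localhost:3306/testdb";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name FROM users")) {
            while (rs.next()) {
                System.out.println(rs.getInt("id") + " " + rs.getString("name"));
            }
        }
    }
}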

6.9 JDBC Goals:
Few software packages are designed without goals in mind, and JDBC is no exception: its many goals drove the development of the API. These goals, in conjunction with early reviewer feedback, finalized the JDBC class library into a solid framework for building database applications in Java. The goals that were set for JDBC are important; they give some insight into why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

6.9.1 SQL Level API:
The designers felt that their main goal was to define a SQL interface for Java. Although not the lowest database interface level possible, it is at a low enough level for higher-level tools and APIs to be created. Conversely, it is at a high enough level for application programmers to use it confidently. Attaining this goal allows future tool vendors to generate JDBC code and to hide many of JDBC's complexities from the end user.

6.9.2 SQL Conformance:
SQL syntax varies as you move from database vendor to database vendor. In an effort to support a wide variety of vendors, JDBC allows any query statement to be passed through it to the underlying database driver. This allows the connectivity module to handle non-standard functionality in a manner that is suitable for its users.

1. JDBC must be implementable on top of common database interfaces: The JDBC SQL API must sit on top of other common SQL level APIs. This goal allows JDBC to use existing ODBC level drivers through a software interface that translates JDBC calls to ODBC and vice versa.

2. Provide a Java interface that is consistent with the rest of the Java system: Because of Java's acceptance in the user community thus far, the designers felt that they should not stray from the current design of the core Java system.

3. Keep it simple: This goal probably appears in all software design goal listings. JDBC is no exception. Sun felt that the design of JDBC should be very simple, allowing for only one method of completing a task per mechanism. Allowing duplicate functionality only serves to confuse the users of the API.

4. Use strong, static typing wherever possible: Strong typing allows more error checking to be done at compile time; also, fewer errors appear at runtime.

5. Keep the common cases simple: Because, more often than not, the usual SQL calls used by the programmer are simple SELECTs, INSERTs, DELETEs and UPDATEs, these queries should be simple to perform with JDBC. However, more complex SQL statements should also be possible.

Java has two things: a programming language and a platform. Java is a high-level programming language that is all of the following:

- Simple
- Architecture-neutral
- Object-oriented
- Portable
- Distributed
- High-performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure
Java is also unusual in that each Java program is both compiled and interpreted. With a compiler, you translate a Java program into an intermediate language called Java byte codes, the platform-independent codes that are interpreted and run on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

(Diagram: Java Program -> Compiler -> Interpreter -> My Program)

(Figure 6.5 Java Program Cycle)

You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it's a Java development tool or a Web browser that can run Java applets, is an implementation of the Java VM. The Java VM can also be implemented in hardware. Java byte codes help make "write once, run anywhere" possible. You can compile your Java program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. For example, the same Java program can run on Windows NT, Solaris, and Macintosh.

6.10 Networking:
6.10.1 TCP/IP stack:

(Figure 6.6 Application & h/w interface)

The TCP/IP stack is shorter than the OSI one. TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

6.10.2 IP datagrams:
The IP layer provides a connectionless and unreliable delivery system. It considers each datagram independently of the others. Any association between datagrams must be supplied by the higher layers. The IP layer supplies a checksum that includes its own header. The header includes the source and destination addresses. The IP layer handles routing through an Internet. It is also responsible for breaking up large datagrams into smaller ones for transmission and reassembling them at the other end.

6.10.3 UDP:
UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model - see later.

6.10.4 TCP:
TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

6.10.5 Internet addresses:
In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32-bit integer which gives the IP address. This encodes a network ID and further addressing. The network ID falls into various classes according to the size of the network address.

6.10.6 Network address:
Class A uses 8 bits for the network address, with 24 bits left over for other addressing. Class B uses 16-bit network addressing. Class C uses 24-bit network addressing, and class D uses all 32.

6.10.7 Subnet address:
Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

6.10.8 Host address:
8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

6.10.9 Total address:

(Figure 6.7 Total Address)

The 32-bit address is usually written as 4 integers separated by dots.

6.10.10 Port addresses:
A service exists on a host and is identified by its port. This is a 16-bit number. To send a message to a server, you send it to the port for that service of the host that it is running on. This is not location transparency! Certain of these ports are "well known".
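To make the host/port addressing concrete, the short Java sketch below (the host name and port are placeholders; port 80 is the well-known HTTP port) opens a TCP connection to a service, sends a minimal request, and prints the first line of the reply:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class TcpClientExample {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("example.com", 80);          // placeholder host and port
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            out.println("GET / HTTP/1.0");    // minimal request line
            out.println();                    // blank line ends the request headers
            System.out.println(in.readLine()); // print the status line of the response
        }
    }
}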

6.10.11 Sockets:
A socket is a data structure maintained by the system to handle network connections. A socket is created using the call socket. It returns an integer that is like a file descriptor. In fact, under Windows, this handle can be used with the ReadFile and WriteFile functions.

#include <sys/types.h>
#include <sys/socket.h>
int socket(int family, int type, int protocol);

Here "family" will be AF_INET for IP communications, protocol will be zero, and type will depend on whether TCP or UDP is used. Two processes wishing to communicate over a network create a socket each. These are similar to two ends of a pipe, but the actual pipe does not yet exist.

6.10.12 JFreeChart:
JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart's extensive feature set includes:
- A consistent and well-documented API, supporting a wide range of chart types;
- A flexible design that is easy to extend, and targets both server-side and client-side applications;
- Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG).
JFreeChart is "open source" or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

1. Map Visualizations: Charts showing values that relate to geographical areas. Some examples include: (a) population density in each state of the United States, (b) income per capita for each country in Europe, (c) life expectancy in each country of the world. The tasks in this project include: sourcing freely redistributable vector outlines for the countries of the world and states/provinces in particular countries (USA in particular, but also other areas); creating an appropriate dataset interface (plus default implementation) and a renderer, and integrating this with the existing XYPlot class in JFreeChart; testing, documenting, testing some more, documenting some more.

2. Time Series Chart Interactivity: Implement a new (to JFreeChart) feature for interactive time series charts: display a separate control that shows a small version of ALL the time series data, with a sliding "view" rectangle that allows you to select the subset of the time series data to display in the main chart.

3. Dashboards: There is currently a lot of interest in dashboard displays. Create a flexible dashboard mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars, and lines/time series) that can be delivered easily via both Java Web Start and an applet.

4. Property Editors: The property editor mechanism in JFreeChart only handles a small subset of the properties that can be set for charts. Extend (or reimplement) this mechanism to provide greater end-user control over the appearance of the charts.
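As a small illustration of the JFreeChart API mentioned above (written against the JFreeChart 1.0 class names; the dataset values and output file name are placeholders), a pie chart can be created and saved as a PNG image as follows:

import java.io.File;
import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.data.general.DefaultPieDataset;

public class PieChartExample {
    public static void main(String[] args) throws Exception {
        DefaultPieDataset dataset = new DefaultPieDataset();
        dataset.setValue("Burst", 30);        // placeholder values
        dataset.setValue("Non-burst", 70);
        JFreeChart chart = ChartFactory.createPieChart(
                "Example Pie Chart", dataset, true, true, false);
        ChartUtilities.saveChartAsPNG(new File("chart.png"), chart, 500, 300);
    }
}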

6.11 What is a Java Web Application?
A Java web application generates interactive web pages containing various types of markup language (HTML, XML, and so on) and dynamic content. It is typically composed of web components such as JavaServer Pages (JSP), servlets and JavaBeans to modify and temporarily store data, interact with databases and web services, and render content in response to client requests.
Because many of the tasks involved in web application development can be repetitive or require a surplus of boilerplate code, web frameworks can be applied to alleviate the overhead associated with common activities. For example, many frameworks, such as JavaServer Faces, provide libraries for templating pages and session management, and often promote code reuse.

6.12 What is Java EE?
Java EE (Enterprise Edition) is a widely used platform containing a set of coordinated technologies that significantly reduce the cost and complexity of developing, deploying, and managing multi-tier, server-centric applications. Java EE builds upon the Java SE platform and provides a set of APIs (application programming interfaces) for developing and running portable, robust, scalable, reliable and secure server-side applications. Some of the fundamental components of Java EE include:
- Enterprise JavaBeans (EJB): a managed, server-side component architecture used to encapsulate the business logic of an application. EJB technology enables rapid and simplified development of distributed, transactional, secure and portable applications based on Java technology.
- Java Persistence API (JPA): a framework that allows developers to manage data using object-relational mapping (ORM) in applications built on the Java Platform.

6.13 JavaScript and Ajax Development:JavaScript is an object-oriented scripting language primarily used in client-side interfaces for web applications. Ajax (Asynchronous JavaScript and XML) is a Web 2.0 technique that allows changes to occur in a web page without the need to perform a page refresh. JavaScript toolkits can be leveraged to implement Ajax-enabled components and functionality in web pages.

6.14 Web Server and Client:
A web server is software that can process client requests and send responses back to the client. For example, Apache is one of the most widely used web servers. A web server runs on some physical machine and listens for client requests on a specific port.
A web client is software that helps in communicating with the server. Some of the most widely used web clients are Firefox, Google Chrome, Safari, etc. When we request something from a server (through a URL), the web client takes care of creating the request, sending it to the server, and then parsing the server's response and presenting it to the user.

6.15 HTML and HTTP:
The web server and web client are two separate pieces of software, so there should be some common language for communication. HTML (Hypertext Markup Language) is the common language between server and client. The server and client also need a common communication protocol; HTTP (HyperText Transfer Protocol) is that protocol, and it runs on top of TCP/IP.
Some of the important parts of an HTTP request are:
- HTTP Method - the action to be performed, usually GET, POST, PUT, etc.
- URL - the page to access.
- Form Parameters - similar to arguments in a Java method, for example the user and password details from a login page.
Sample HTTP request:

GET /FirstServletProject/jsps/hello.jsp HTTP/1.1
Host: localhost:8080
Cache-Control: no-cache

Some of the important parts of an HTTP response are:
- Status Code - an integer indicating whether the request was successful or not. Some of the well-known status codes are 200 for success, 404 for Not Found and 403 for Access Forbidden.
- Content Type - text, html, image, pdf, etc.; also known as the MIME type.
- Content - the actual data that is rendered by the client and shown to the user.
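For illustration, a response to the sample request above might look like the following (the headers and body are placeholders):

HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 48

<html><body><h1>Hello, world!</h1></body></html>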

MIME Type or Content Type: If you look at the sample HTTP response header above, it contains the tag Content-Type. It's also called the MIME type, and the server sends it to the client to let the client know the kind of data it is sending. It helps the client render the data for the user. Some of the most commonly used MIME types are text/html, text/xml, application/xml, etc.

6.16 Understanding the URL:
URL is an acronym for Uniform Resource Locator and is used to locate the server and resource. Every resource on the web has its own unique address. Let's see the parts of a URL with an example.

http://localhost:8080/FirstServletProject/jsps/hello.jsp

http:// - This is the first part of the URL and specifies the communication protocol to be used in server-client communication.

localhost - The unique address of the server; most of the time it's the hostname of the server, which maps to a unique IP address. Sometimes multiple hostnames point to the same IP address, and the web server's virtual host configuration takes care of sending the request to the particular server instance.

8080 - This is the port on which the server is listening; it's optional, and if we don't provide it in the URL then the request goes to the default port of the protocol. Port numbers 0 to 1023 are reserved for well-known services, for example 80 for HTTP, 443 for HTTPS, 21 for FTP, etc.

FirstServletProject/jsps/hello.jsp - The resource requested from the server. It can be a static HTML page, PDF, JSP, servlet, PHP, etc.

6.17 Why Do We Need Servlets and JSPs?
Web servers are good for static content such as HTML pages, but they don't know how to generate dynamic content or how to save data into databases, so we need another tool that we can use to generate dynamic content. There are several programming languages for dynamic content, such as PHP, Python, Ruby on Rails, Java Servlets and JSPs. Java Servlets and JSPs are server-side technologies that extend the capability of web servers by providing support for dynamic responses and data persistence.

6.18 Web Container:
Tomcat is a web container. When a request is made from a client to the web server, the web server passes the request to the web container, and it is the web container's job to find the correct resource to handle the request (a servlet or JSP), then use the response from that resource to generate the response and provide it to the web server. The web server then sends the response back to the client.
When the web container gets the request, and if it is for a servlet, the container creates two objects, HttpServletRequest and HttpServletResponse. Then it finds the correct servlet based on the URL and creates a thread for the request. It then invokes the servlet's service() method, and based on the HTTP method, service() invokes the doGet() or doPost() method. The servlet methods generate the dynamic page and write it to the response. Once the servlet thread is complete, the container converts the response to an HTTP response and sends it back to the client. Some of the important work done by the web container includes:
- Communication Support - The container provides an easy way of communication between the web server and the servlets and JSPs. Because of the container, we don't need to build a server socket to listen for requests from the web server, parse the request, and generate the response. All these important and complex tasks are done by the container, and all we need to focus on is the business logic of our application.
- Lifecycle and Resource Management - The container takes care of managing the life cycle of a servlet: loading servlets into memory, initializing them, invoking servlet methods and destroying them. The container also provides utilities such as JNDI for resource pooling and management.
- Multithreading Support - The container creates a new thread for every request to the servlet, and the thread dies when the request has been processed. So servlets are not initialized for each request, which saves time and memory.
- JSP Support - JSPs don't look like normal Java classes, and the web container provides support for them by translating JSPs into servlets before requests are handled.
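To make the container flow described above concrete, here is a minimal servlet sketch using the standard Servlet API; the class name and URL pattern are placeholders, and the container invokes doGet() for GET requests as described.

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet("/hello")   // the container maps this URL to the servlet
public class HelloServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        response.setContentType("text/html");   // MIME type of the response
        PrintWriter out = response.getWriter();
        out.println("<html><body><h1>Hello from the web container!</h1></body></html>");
    }
}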