A User-friendly Patent Search Paradigm


Post on 15-Jan-2015


Category:

Education



TRANSCRIPT

  • 1. A User-friendly Patent Search Paradigm

2. INTRODUCTION
Patents play a very important role in intellectual property protection. Because patent search helps patent examiners find previously published relevant patents and validate or invalidate new patent applications, it has become increasingly popular and has recently attracted much attention from both the industrial and academic communities. For example, many online systems support patent search, such as Google patent search, Derwent Innovations Index (DII), and USPTO. As most patent-search users have limited knowledge about the underlying patents, they have to use a try-and-see approach to repeatedly issue queries and check answers, which is a very tedious process.

3. ABSTRACT
As most patent-search users have limited knowledge about the underlying patents, they have to use a try-and-see approach to repeatedly issue queries and check answers, which is a very tedious process. To overcome this, our proposed system introduces an efficient patent search paradigm that helps users find relevant patents more easily and improves the user search experience. To address the typing-error problem in existing systems, our project introduces an error correction technique. In all, the project proposes three techniques, error correction, topic-based query suggestion, and query expansion, to improve the usability of patent search. To improve efficiency, we partition the patents into small partitions based on their topics and classes. Then, given a query, we find the highly relevant partitions and answer the query in each of them. Finally, we combine the answers from each partition and generate the top answers of the patent-search query.

4. SCOPE OF THE PROJECT:
In this project we improve search efficiency, provide more suggestions to help users check patents, and correct errors in the search keywords using query correction methods.

5. LITERATURE SURVEY:
Title: Improving Retrievability of Patents in Prior-Art Search
Authors: S. Bashir and A. Rauber
Year: 2010
Description: Prior-art search is an important task in patent retrieval. The success of this task relies upon the selection of relevant search queries. Typically, terms for prior-art queries are extracted from the claim fields of query patents. However, due to the complex technical structure of patents, and the presence of term mismatch and vague terms, selecting relevant terms for queries is a difficult task. When evaluating the patent retrievability coverage of prior-art queries generated from query patents, a large bias toward a subset of the collection is experienced. A large number of patents either have a very low retrievability score or cannot be discovered via any query. To increase the retrievability of patents, in this paper we expand prior-art queries generated from query patents using query expansion with pseudo relevance feedback. Missing terms from query patents are discovered from feedback patents, and better patents for relevance feedback are identified using a novel approach for checking their similarity with query patents. We specifically focus on how to automatically select better terms from query patents based on their proximity distribution with prior-art queries that are used as features for computing similarity. Our results show that the coverage of prior-art queries can be increased significantly by incorporating relevant query terms using query expansion.

6. Title: Latent Dirichlet Allocation
Authors: D. M. Blei, A. Y. Ng, and M. I. Jordan
Year: 2003
Description: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics.
Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.

7. Title: Suggesting Topic-Based Query Terms as You Type
Authors: J. Fan, H. Wu, G. Li, and L. Zhou
Year: 2010
Description: Query term suggestion that interactively expands the queries is an indispensable technique to help users formulate high-quality queries and has attracted much attention in the community of web search. Existing methods usually suggest terms based on statistics in documents as well as query logs and external dictionaries, and they neglect the fact that the topic information is very crucial because it helps retrieve topically relevant documents. To give users gratification, we propose a novel term suggestion method: as the user types in queries letter by letter, we suggest the terms that are topically coherent with the query and could retrieve relevant documents instantly. For effectively suggesting highly relevant terms, we propose a generative model by incorporating the topical coherence of terms. The model learns the topics from the underlying documents based on Latent Dirichlet Allocation (LDA). For achieving the goal of instant query suggestion, we use a trie structure to index and access terms. We devise an efficient top-k algorithm to suggest terms as users type in queries. Experimental results show that our approach not only improves the effectiveness of term suggestion, but also achieves better efficiency and scalability.

8. Title: Ranking structured documents: a large margin based approach for patent prior art search
Authors: Y. Guo and C. P. Gomes
Year: 2009
Description: We propose an approach for automatically ranking structured documents applied to patent prior art search. Our model, SVM Patent Ranking (SVMPR), incorporates margin constraints that directly capture the specificities of patent citation ranking. Our approach combines patent domain knowledge features with meta-score features from several different general Information Retrieval methods. The training algorithm is an extension of the Pegasos algorithm with performance guarantees, effectively handling hundreds of thousands of patent-pair judgments in a high dimensional feature space. Experiments on a homogeneous essential wireless patent dataset show that SVMPR performs on average 30%-40% better than many other state-of-the-art general-purpose Information Retrieval methods in terms of the NDCG measure at different cut-off positions.

9. Title: Efficient interactive fuzzy keyword search
Authors: S. Ji, G. Li, C. Li, and J. Feng
Year: 2009
Description: Traditional information systems return answers after a user submits a complete query. Users often feel "left in the dark" when they have limited knowledge about the underlying data, and have to use a try-and-see approach for finding information. A recent trend of supporting autocomplete in these systems is a first step towards solving this problem. In this paper, we study a new information-access paradigm, called "interactive, fuzzy search," in which the system searches the underlying data "on the fly" as the user types in query keywords. It extends autocomplete interfaces by (1) allowing keywords to appear in multiple attributes (in an arbitrary order) of the underlying data; and (2) finding relevant records that have keywords matching query keywords approximately. This framework allows users to explore data as they type, even in the presence of minor errors. We study research challenges in this framework for large amounts of data.
Since each keystroke of the user could invoke a query on the backend, we need efficient algorithms to process each query within milliseconds. We develop various incremental-search algorithms using previously computed and cached results in order to achieve an interactive speed. We have deployed several real prototypes using these techniques. One of them has been deployed to support interactive search on the UC Irvine people directory, which has been used regularly and well received by users due to its friendly interface and high efficiency.

10. Title: Efficient Merging and Filtering Algorithms for Approximate String Searches
Authors: C. Li, J. Lu, and Y. Lu
Year: 2008
Description: We study the following problem: how to efficiently find in a collection of strings those similar to a given query string? Various similarity functions can be used, such as edit distance, Jaccard similarity, and cosine similarity. This problem is of great interest to a variety of applications that need a high real-time performance, such as data cleaning, query relaxation, and spellchecking. Several algorithms have been proposed based on the idea of merging inverted lists of grams generated from the strings. In this paper we make two contributions. First, we develop several algorithms that can greatly improve the performance of existing algorithms. Second, we study how to integrate existing filtering techniques with these algorithms, and show that they should be used together judiciously, since the way to do the integration can greatly affect the performance. We have conducted experiments on several real data sets to evaluate the proposed techniques.

11. Title: Supporting Search-As-You-Type Using SQL in Databases
Authors: G. Li, J. Feng, and C. Li
Year: 2011
Description: A search-as-you-type system computes answers on-the-fly as a user types in a keyword query letter by letter. We study how to support search-as-you-type on data residing in a relational DBMS.
We focus on how to support this type of search using the native database language, SQL. A main challenge is how to leverage existing database functionalities to meet the high-performance requirement to achieve an interactive speed. We study how to use auxiliary indexes stored as tables to increase search performance. We present solutions for both single-keyword queries and multi-keyword queries, and develop novel techniques for fuzzy search using SQL by allowing mismatches between query keywords and answers. We present techniques to answer first-N queries and discuss how to support updates efficiently. Experiments on large, real data sets show that our techniques enable DBMS systems on a commodity computer to support search-as-you-type on tables with millions of records.

12. Title: Efficient fuzzy full-text type-ahead search
Authors: G. Li, S. Ji, C. Li, and J. Feng
Year: 2011
Description: Traditional information systems return answers after a user submits a complete query. Users often feel "left in the dark" when they have limited knowledge about the underlying data and have to use a try-and-see approach for finding information. A recent trend of supporting autocomplete in these systems is a first step toward solving this problem. In this paper, we study a new information-access paradigm, called "type-ahead search," in which the system searches the underlying data "on the fly" as the user types in query keywords. It extends autocomplete interfaces by allowing keywords to appear at different places in the underlying data. This framework allows users to explore data as they type, even in the presence of minor errors. We study research challenges in this framework for large amounts of data. Since each keystroke of the user could invoke a query on the backend, we need efficient algorithms to process each query within milliseconds.
We develop various incremental-search algorithms for both single-keyword queries and multi-keyword queries, using previously computed and cached results in order to achieve a high interactive speed. We develop novel techniques to support fuzzy search by allowing mismatches between query keywords and answers. We have deployed several real prototypes using these techniques. One of them has been deployed to support type-ahead search on the UC Irvine people directory, which has been used regularly and well received by users due to its friendly interface and high efficiency.

13. Title: EASE: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data
Authors: G. Li, B. C. Ooi, J. Feng, J. Wang, and L. Zhou
Year: 2008
Description: Conventional keyword search engines are restricted to a given data model and cannot easily adapt to unstructured, semi-structured or structured data. In this paper, we propose an efficient and adaptive keyword search method, called EASE, for indexing and querying large collections of heterogeneous data. To achieve high efficiency in processing keyword queries, we first model unstructured, semi-structured and structured data as graphs, and then summarize the graphs and construct graph indices instead of using traditional inverted indices. We propose an extended inverted index to facilitate keyword-based search, and present a novel ranking mechanism for enhancing search effectiveness. We have conducted an extensive experimental study using real datasets, and the results show that EASE achieves both high search efficiency and high accuracy, and outperforms the existing approaches significantly.

14. Title: Simple vs. sophisticated approaches for patent prior-art search
Authors: W. Magdy, P. Lopez, and G. J. F. Jones
Year: 2011
Description: Patent prior-art search is concerned with finding all filed patents relevant to a given patent application.
We report a comparison between two search approaches representing the state-of-the-art in patent prior-art search. The first approach uses simple and straightforward information retrieval (IR) techniques, while the second uses much more sophisticated techniques which try to model the steps taken by a patent examiner in patent search. Experiments show that the retrieval effectiveness using both techniques is statistically indistinguishable when patent applications contain some initial citations. However, the advanced search technique is statistically better when no initial citations are provided. Our findings suggest that less time and effort can be exerted by applying simple IR approaches when initial citations are provided.

15. Modules:
1. Login page
2. Client Search through query
2.1 Automatic Error correction
2.2 Topic based query suggestion
2.3 Query expansion
3. Ranking
4. Patent Partition selection
5. Query Processing

16. Module Description
1. Login page
Before a client session is created, the login page checks the user's credentials: we receive the username and password from the user and check in the database whether that user is authorized to send requests to the server. New users can also be added through user registration, which collects details such as name, gender, username, password, address, email ID, and phone number.
2. Client Search through query
In this module we first design the page that accepts the user's query; we then write the code in a Java file, and a JSP file passes the user's query request to the semantic storage.

17. 2.1 Automatic Error correction
Automatic error correction uses a trie structure for efficient keyword correction and completion. We consider the prefix of the query word; if it is not similar enough to a trie node, we do not consider the keywords under that node.
2.2 Topic based query suggestion
The topic-based model estimates the probability of the next query keyword.
If a keyword in patents is more topically coherent with the previously typed query words, it gets a higher score.
2.3 Query expansion
Query expansion uses a search engine to suggest relevant keywords, and also uses relevant keywords mined from the query log.

18. 3. Ranking
In this module we rank the answers obtained for the query by their probability of being relevant, so that the most relevant patents for the search are found first.
4. Patent Partition selection
In this module we select the partitions relevant to the patent search using two kinds of relevancy, topic relevancy and keyword relevancy, and use them to find the top relevant partitions.
5. Query Processing
The query processing module finds the top answers for the search by combining the rankings from all selected partitions.

19. Module Diagram
1. Login page
[Diagram elements: User, Login Page, Database, Patent search page]

20. 2. Client Search through query

21. 2.1 Automatic Error correction
[Diagram elements: User, Typing Query, Error Corrected Query]

22. 2.2 Topic based query suggestion

23. 2.3 Query expansion

24. 3. Ranking

25. 4. Patent Partition selection

26. 5. Query Processing

27. GIVEN INPUT EXPECTED OUTPUT
1. Login page
Input: User name and password
Output: Application transferred to the patent search engine
2. Client Search through query
Input: The patent keyword to search for
Output: Query shown in the search box
2.1 Automatic Error correction
Input: The patent keyword to search for
Output: Error-corrected patent keyword
2.2 Topic based query suggestion
Input: The patent keyword to search for
Output: Suggestions related to the topic
2.3 Query expansion
Input: The patent keyword to search for
Output: Query keyword in a relevant expanded form

28. 3. Ranking
Input: The patent keyword to search for
Output: Patents selected using ranking
4. Patent Partition selection
Input: The patent keyword to search for
Output: Partitions searched by topic and by keyword
5. Query Processing
Input: The patent keyword to search for
Output: Aggregated and ranked top answers

29. SYSTEM REQUIREMENTS
HARDWARE
PROCESSOR : PENTIUM IV 2.6 GHz, Intel Core 2 Duo
RAM : 512 MB DDR RAM
MONITOR : 15-inch COLOR
HARD DISK : 40 GB
CD DRIVE : LG 52X
SOFTWARE
Front End : JSP
Back End : MS SQL Server 2000/2005
Operating System : Windows XP/7
IDE : NetBeans, Eclipse

30. TECHNIQUE USED
1. Automatic Error Correction
2. Topic-based Query Suggestion
3. Query Expansion

31. Automatic Error Correction
As the query keywords that users type in may contain typos, traditional methods return no answer because they cannot find answers that contain the query keywords. This is clearly not user-friendly. Instead, it is better to correct the typos, recommend similar keywords to users, and return the answers of the similar keywords. To quantify the similarity between keywords, existing methods usually adopt edit distance. The edit distance between two keywords is the minimum number of edit operations (i.e., insertion, deletion, and substitution) of single characters needed to transform the first one into the second. For example, the edit distance of patent and paitant is 2. Two keywords are said to be similar if their edit distance is within a given threshold. There are some recent studies on efficient error correction, which use a filter-and-refine framework to find keywords similar to a query keyword. The method first uses a filter step to find a subset of keywords that may be similar to the query keyword. Then it uses a verification step to remove the false positives and get the final similar keywords.
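The edit distance and the filter-and-refine idea described above can be sketched as follows. This is a minimal illustration, not the implementation used in the project: the function names `edit_distance` and `similar_keywords` are hypothetical, and the length-based filter is an assumption chosen for brevity (the cited studies use more elaborate filters, such as merging inverted lists of grams).

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to transform a into b."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # transforming a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similar_keywords(query: str, vocabulary: list[str], threshold: int = 2) -> list[str]:
    """Filter-and-refine lookup of keywords within `threshold` edits of `query`."""
    # Filter step (assumed here): if two strings differ in length by more
    # than the threshold, their edit distance must exceed the threshold.
    candidates = [w for w in vocabulary if abs(len(w) - len(query)) <= threshold]
    # Verification step: compute the exact distance to drop false positives.
    return [w for w in candidates if edit_distance(query, w) <= threshold]


print(edit_distance("patent", "paitant"))  # 2, matching the example in the text
print(similar_keywords("paitant", ["patent", "patient", "paradigm", "pat"]))  # ['patent']
```

The filter is cheap and never discards a true match, so the more expensive dynamic-programming check only runs on the surviving candidates.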
Although these methods can efficiently suggest keywords for complete keywords, they cannot handle a partial keyword (prefix) that the user is still typing. To address this problem, we can use a trie structure to do efficient keyword correction and completion. With the trie, even when users type in only a partial keyword, we can still efficiently suggest relevant, accurate keywords. The basic idea is that if a prefix is not similar enough to a trie node, then we do not need to consider the keywords under that trie node. We can use this observation to prune the search and efficiently suggest similar keywords.

32. Topic based Query Suggestion
We devise a novel model for effectively suggesting keywords as users type in queries letter by letter. The basic idea of our method is to use a topic model to estimate the probability of the next query keyword. Intuitively, if a keyword in patents is more topically coherent with the previously typed query keywords, it obtains a higher score. Specifically, we focus on estimating two important probabilities: the probability of a keyword conditioned on topics, and the probability of sampling a keyword from a patent. Both probabilities are used to estimate the score of each keyword. An LDA model can be used to learn the keyword distribution over each topic from the underlying patents. LDA can be classified as a soft-clustering technique, which allows a keyword to appear in multiple topics and takes into account the degree to which a keyword belongs to each topic. The keyword distribution over a set of patents is learnt by using a language model. The language model approach can capture the properties of the patents and predict the likelihood of sampling a specific keyword. Thus we can combine the two probabilities and use the topic-based method to suggest relevant keywords.

33. Query Expansion
In many cases, users do not understand the underlying data precisely. As a result, they may type in ambiguous or inaccurate keywords.
In addition, the same concept may have different representations. To handle this, we can use WordNet to expand a keyword. If the query word is indexed by WordNet, we can easily get the keywords relevant to it using an inverted list structure. However, WordNet is manually built for common words. If the query keywords are not in WordNet, we cannot recommend relevant keywords. To address this problem, we have two solutions. The first is to utilize search engines, since most search engines suggest relevant keywords as users type in queries. We can issue the patent query to a search engine, such as Google, and get the relevant keywords from it. The second is to mine the relevant keywords from the query logs. To this end, we use the click-through data to mine correlated queries as follows. For two queries, if users click the same returned result (patent), the queries are potentially relevant. We use the number of times users clicked on the same patent to denote the relevance of the two queries. If the co-occurrence count of a keyword pair is larger than a given threshold, the two keywords are considered relevant and we use them for query expansion.

34. SYSTEM DESIGN
USECASE DIAGRAM
[Diagram elements: Login, User, Patent search, Ok, Patent Partitions, Query Process, Patent DB, Top answer]

35. CLASS DIAGRAM

36. OBJECT DIAGRAM

37. STATE DIAGRAM
[Diagram elements: User, Login, Enters Keyword, Error correction, Topic search, Ok, Verified, Expansion, Partition selection, Query processing, Top answers]

38. ACTIVITY DIAGRAM

39. SEQUENCE DIAGRAM

40. COLLABORATION DIAGRAM

41. SYSTEM ARCHITECTURE

42. DATA FLOW DIAGRAM LEVEL 1

43. DATA FLOW DIAGRAM LEVEL 2

44. E-R Diagram

45. FUTURE ENHANCEMENT
In future, our proposed patent search paradigm will be implemented by connecting a large number of databases. This will increase the efficiency and searchability of patents while keeping the approach user-friendly.
Advantage
1. Keyword error correction
2. Partition-based patent search
3. High search efficiency
4. Query suggestion and expansion
Application
1. Google patent search
2. Derwent Innovations Index (DII)
3. USPTO

46. CONCLUSION
In this paper, we proposed a new patent-search paradigm. We developed three effective techniques, error correction, topic-based query suggestion, and query expansion, to make patent search more user-friendly and improve the user search experience. Error correction can provide users with accurate keywords and correct typing errors. Topic-based query suggestion can suggest topically coherent keywords as users type in query keywords. Query expansion can suggest synonyms and other relevant keywords that express the same concept as the query keywords. We proposed a partition-based method to improve the search performance. Experimental results show that our method achieves high efficiency and quality.

47. REFERENCES
[1] L. Azzopardi, W. Vanderbauwhede, and H. Joho. Search system requirements of patent analysts. In SIGIR, pages 775-776, 2010.
[2] S. Bashir and A. Rauber. Improving retrievability of patents in prior art search. In ECIR, pages 457-470, 2010.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[4] J. Fan, H. Wu, G. Li, and L. Zhou. Suggesting topic-based query terms as you type. In APWeb, pages 61-67, 2010.
[5] Y. Guo and C. P. Gomes. Ranking structured documents: A large margin based approach for patent prior art search. In IJCAI, pages 1058-1064, 2009.
[6] S. Ji, G. Li, C. Li, and J. Feng. Efficient interactive fuzzy keyword search. In WWW, pages 371-380, 2009.

48. [7] L. S. Larkey. A patent search and classification system. In ACM DL, pages 179-187, 1999.
[8] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257-266, 2008.
[9] G. Li, J. Feng, and C. Li. Supporting search-as-you-type using sql in databases. IEEE TKDE, 2011.
[10] G. Li, S. Ji, C. Li, and J. Feng. Efficient fuzzy full-text type-ahead search. VLDB J., 20(4):617-640, 2011.
[11] G. Li, B. C. Ooi, J. Feng, J. Wang, and L. Zhou. Ease: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data. In SIGMOD Conference, pages 903-914, 2008.
[12] W. Magdy, P. Lopez, and G. J. F. Jones. Simple vs. sophisticated approaches for patent prior-art search. In ECIR, pages 725-728, 2011.