Neural Networks for Web Content Filtering

IEEE Intelligent Systems, 1094-7167/02/$17.00 © 2002 IEEE

Pui Y. Lee and Siu C. Hui, Nanyang Technological University
Alvis Cheuk M. Fong, Massey University, Albany

The Intelligent Classification Engine uses neural networks’ learning capabilities to provide fast, accurate differentiation between pornographic and nonpornographic Web pages. The basic framework can also serve to distinguish other types of Web content.

With the proliferation of harmful Internet content such as pornography, violence, and hate messages, effective content-filtering systems are essential. Many Web-filtering systems are commercially available, and potential users can download trial versions from the Internet. However, the techniques these systems use are insufficiently accurate and do not adapt well to the ever-changing Web. (For more on current approaches and systems, see the “Web Content-Filtering Approaches” and “Current Web Content-Filtering Systems” sidebars.)

To solve this problem, we propose using artificial neural networks (ANNs) [1, 2] to classify Web pages during content filtering. We focus on blocking pornography because it is among the most prolific and harmful Web content. According to CyberAtlas, pornography-related terms such as “sex” and “porn” are among the top 20 search terms queried at the 10 leading Internet portals and search engines [3]. Furthermore, research suggests that pornography is addictive and causes harmful side effects [4]. However, our general framework is adaptable for filtering other objectionable Web material.

Know the enemy

How do pornographic Web pages differ from others? We attempt to answer this question by studying these pages’ characteristics and analyzing data.

Characteristics

Understanding pornographic Web pages’ characteristics can help us develop effective content analysis techniques. Although it is well known that pornographic Web sites contain many sexually oriented images, text and other information can also help us distinguish these sites. We focus on three characteristics: page layout format, use of PICS (Platform for Internet Content Selection) ratings, and indicative terms.

Page layout format. We treat all the HTML documents used to construct a multiframe Web page as a single entity. This is because statistics obtained from any aspects of a Web page should be derived as a whole from the aggregated data collected from every HTML document that is part of that page.

PICS use. Web publishers can use PICS labels to limit access to Web content (see the “Web Content-Filtering Approaches” sidebar). A PICS label is usually distributed with the associated Web page by one of these methods:

• The Web publisher (or Web content owner) embeds the label in the HTML code’s header section.
• The Web server inserts the label in the HTTP packet’s header section before sending the page to a requesting client.

In both cases, we can’t determine whether a Web page has a PICS label simply by inspecting the Web page contents displayed in a browser. The first method requires viewing the Web page’s HTML code, which you can easily do with any major Web browser that supports such functionality. In contrast, the second method requires checking the HTTP packet’s header section, which you can’t do with any known browser. Consequently, to analyze pornographic Web sites’ use of PICS systems, we collect statistics from the samples’ HTML code.

Indicative terms. Terms (words or phrases) that indicate pornographic Web pages fall into two major groups according to their meanings and use. Most
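As a rough illustration of the first PICS distribution method described above, the following Python sketch looks for a label embedded in the HTML header. It is not the authors' implementation. It assumes the common W3C convention of carrying the label in a META element whose http-equiv attribute is "PICS-Label", and it scans every HTML document that makes up a multiframe page so the page is treated as a single entity. The names PicsLabelFinder and find_pics_labels, the sample label string, and the ratings.example.org URL are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class PicsLabelFinder(HTMLParser):
    """Collects the content of <meta http-equiv="PICS-Label"> elements."""

    def __init__(self):
        super().__init__()
        self.labels = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("http-equiv", "").strip().lower() == "pics-label":
            self.labels.append(attr_map.get("content", ""))


def find_pics_labels(html_documents):
    """Return every PICS label string found in the page's HTML documents.

    html_documents holds the source of each HTML document that makes up
    one (possibly multiframe) Web page; results are aggregated so the
    page is handled as a single entity.
    """
    labels = []
    for source in html_documents:
        finder = PicsLabelFinder()  # fresh parser per document
        finder.feed(source)
        finder.close()
        labels.extend(finder.labels)
    return labels


# Hypothetical two-document page: a frameset plus one labeled frame.
frameset_doc = (
    "<html><head><title>Example</title></head>"
    '<frameset><frame src="main.html"></frameset></html>'
)
frame_doc = (
    "<html><head>"
    "<META http-equiv=\"PICS-Label\" "
    "content='(PICS-1.1 \"http://ratings.example.org/v1\" l r (n 0 s 0))'>"
    "</head><body>Hello</body></html>"
)

page_labels = find_pics_labels([frameset_doc, frame_doc])
print("PICS label present:", bool(page_labels))
```

A label that the Web server inserts in the HTTP header (the second method) would not be visible to this check, which matches the article's reason for collecting statistics from the samples' HTML code only.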



8/7/2019 Neural Networks for Web Content Filtering
