SN Computer Science (2020) 1:11 https://doi.org/10.1007/s42979-019-0011-2
SURVEY ARTICLE
Classification of Anti‑phishing Solutions
S. Chanti1 · T. Chithralekha2
Received: 6 April 2019 / Accepted: 28 June 2019 / Published online: 16 July 2019 © Springer Nature Singapore Pte Ltd 2019
Abstract Phishing is an online fraud through which a phisher gains unauthorized access to a user's system and lures the user into giving up personal credentials (such as username, password, credit/debit card number, validity, CVV number, and PIN) for financial gain. Phishing can be carried out in many ways: through emails, phone calls, instant messages, advertisements and popups on websites, and DNS poisoning. To protect users from phishing, many anti-phishing toolbars/extensions have been developed. These anti-phishing tools help prevent Internet users from falling victim to phishing scams. No anti-phishing approach can give 100% security. In this paper, we present a complete classification of anti-phishing solutions from an algorithmic perspective. The taxonomy helps in understanding the various anti-phishing approaches and algorithms developed for phishing detection. Popular anti-phishing toolbars are examined to show the media they address, their mode of operation, and their pros and cons. The paper also identifies research gaps that remain to be addressed.
Keywords Phishing · Anti-phishing · Content-based approach · Non-content-based approach · Machine learning · Anti-phishing toolbars
Introduction
Phishing is an Internet scam used by the phisher to fool Internet users for malicious purposes. Phishing can be done in many ways. Among them, email phishing is the traditional and most common. Usually, the phisher sends an email stating some emergency that prompts the user to click on the hyperlink or the attachment provided in the email. The phisher comes up with a new technique every time to fool Internet users. According to the Anti-phishing Working Group [11] survey report, 1,220,523 unique phishing attacks occurred in January–March 2018. Pharming is an advanced form of phishing, where the phisher redirects the user to a spoofed site that looks and feels exactly like the original site. This can be done either by modifying the hosts file on the user system or by hijacking the DNS servers (replacing the IP address). If the IP address on the DNS server is changed, the entire traffic of the website is redirected to the site specified by the phisher. Pharming is more dangerous and very difficult to detect.
To protect Internet users from phishing scams, anti-phishing solutions have been developed. Anti-phishing helps in detecting phishing scams. In this study, we classify the existing anti-phishing solutions into two main categories, namely, content-based and non-content-based. The content-based approaches analyze the content of a webpage, URL, or email to decide whether it is phishing or not. The non-content-based approaches do not analyze the content; instead, they check against an existing blacklist (which stores phishing data) or a whitelist (a list of trusted sites). In this work, we focus on the following aspects that are different from the existing taxonomies:
• A complete classification of anti-phishing solutions.
• A literature survey on existing anti-phishing algorithms used by different approaches, in which the data sets used and the limitations are discussed in detail.
• A comparison of existing anti-phishing toolbars in the literature.
This article is part of the topical collection “Advances in Internet Research and Engineering” guest edited by Mohit Sethi, Debabrata Das, P. V. Ananda Mohan and Balaji Rajendran.
* S. Chanti [email protected]
T. Chithralekha [email protected]
1 Department of Banking Technology, Pondicherry University, Puducherry, India
2 Department of Computer Science, Pondicherry University, Puducherry, India
In this paper, a complete classification of anti-phishing solutions is provided: the classification helps in understanding the various approaches used for developing anti-phishing solutions and the current trends. “Research Methodology” describes how this literature survey was performed. “Anti-phishing Solutions” explains a complete classification of anti-phishing solutions and the existing anti-phishing approaches. “Existing Anti-phishing Browser Extensions/Toolbars” discusses existing anti-phishing browser extensions/toolbars with their pros and cons. “Discussion” answers all the research questions raised, and finally, “Conclusion” provides the conclusion of the paper.
Research Methodology
A complete classification of anti-phishing solutions was chosen as the research methodology for this study. The goal of this classification is to provide an overview of anti-phishing solutions and of the amount of research carried out in this area. The ideas and works of various authors [17, 28, 45] encouraged us to write this survey paper with the following research questions:
Research Questions
The main goal of the study was to provide a complete classification of anti-phishing solutions. To do that, we define the following questions:
RQ1 What are the areas that current anti-phishing solutions address?
RQ2 Do the existing anti-phishing toolbars cover all types of phishing attacks?
RQ3 What are the current research gaps in anti-phishing?
Searching for Papers
A preliminary search was conducted to collect articles from different sources. Keywords such as anti-phishing, email-based phishing detection, website-based phishing detection, URL-based phishing detection, social media phishing, and DNS phishing were used to search for relevant articles in digital libraries such as IEEE, ACM, Emeralds, Science Direct, and Springer. To find the relevant papers, the titles containing these keywords were filtered. Only leading journals and international conference papers were chosen for this study.
Finding the Relevant Papers
To find the most relevant articles, a first screening is done based on the presence of the keywords in the title of the search results. The second-stage filtering is performed by reading the abstracts of these papers, and the relevant papers are examined. The selected papers are classified as email-based, website-based, DNS-based, and social media-based phishing detection/contact-based, and notification-based, and examined thoroughly. After reading the papers, the classification of anti-phishing solutions is defined.
Details About the Papers
This section provides detailed information about the number of papers available on phishing, the different sources where they are available, how the relevant papers have been filtered, and the yearwise publication of those papers.
Selection process of papers The process of selecting the papers is given in Fig. 1. Initially, 5269 papers were retrieved from five digital libraries. The first filtering was performed on the titles of the papers, which selected 279 papers; papers outside the scope were removed. Next, the abstracts of these papers were studied, which narrowed the set to 113 papers. In the final stage, all 113 papers were read completely to exclude those that remained unclear, leaving 75 papers for the study.
Publication of papers in different sources All the search results are from digital libraries such as IEEE, ACM, Emeralds, Science Direct, and Springer. We considered only journal articles and conference proceeding articles for the study. We found 5269 articles, out of which 3288 were research articles and 1981 were conference proceedings. Springer and Science Direct have published more research articles than conference proceedings, whereas IEEE and ACM have more conference proceeding articles than research articles. The details are given in Fig. 2.
Yearwise publication of selected papers Figure 3 shows the yearwise publication of selected papers. The selected papers are from 1992 to 2018. Since 1992, the number of publications has increased steadily. From the selected papers, 11 papers (14.6%) were published in 2017, and 8 papers (10.6%) were published in 2007, 2011, and 2016. The number of phishing scams is increasing drastically, and the phisher uses different techniques to lure Internet users.
Anti‑phishing Solutions
According to the APWG survey, thousands of phishing sites are created every year and billions of dollars are lost. To overcome this problem, many companies and researchers have started developing anti-phishing solutions. Anti-phishing can be implemented on both the client side and the server side [28, 52]. Based on the works carried out on anti-phishing, we propose a taxonomy of anti-phishing solutions, as shown in Fig. 5.
An evolution roadmap of existing anti-phishing solutions is shown in Fig. 4. A consolidated list of features is given in Table 1, which includes the email, website, URL, and social media features used for phishing detection in various sources [2, 3, 5, 10, 12, 23, 25, 35, 36, 38, 42, 48, 49, 58, 59, 63, 66, 68, 74]. Different anti-phishing approaches use different algorithms to distinguish phishing attacks from legitimate ones. The content-based and non-content-based approaches are explained in detail below.
Fig. 1 Selection process of papers
Fig. 2 Publication of papers in different sources
Content‑Based Phishing Detection
In content-based phishing detection, the phishing attack is detected by analyzing the content of the website. Analyzing the content requires features such as spelling and grammar checks, password fields, links, images, URLs, page rank, WHOIS information, and verification of the HTML code and JavaScript [19, 71].
Social media Social media phishing is a new way of stealing user credentials using social networks such as Facebook, Twitter, LinkedIn, Google+, and so on. According to the study in Ref. [31], stealing of user credentials from social network sites is four times greater than in other phishing attacks. Social media phishing looks similar to email phishing, but it is not.
In email phishing, the phisher sends an email either to redirect the user to a suspicious site or to deliver malicious code as an attachment. In social media phishing, however, the attacker communicates with the user and slowly tries to collect their personal credentials or asks for financial support. As mentioned in paper [65], social media phishing differs from other forms in three ways:
• Social media phishing can be observed in the new social media environment, where the features and policies keep on changing.
• Second, the attack can be performed in two levels: in level 1, the attacker creates a fake account to interact with the victim in a different manner (like a friend). In level 2, the phisher collects the personal credentials of the victim.
• Finally, Social media phishing is successful, because it is very difficult to distinguish the fake request from a legitimate one.
In paper [7], detection of spear phishing attacks in relation to the individuals’ social media activities is performed. According to their preliminary results, social media sites provide the identity information, open to the public, which helps the phisher to target the individual user through spear phishing.
Website content-based phishing detection In website content-based phishing detection, the features from URL, Image, and text content are analyzed.
URL analysis URL analysis is conducted to verify whether the site requested by the user is trusted or phishing. This can be done by checking for features such as the presence of a special character (@), an IP address instead of the domain name, a prefix/suffix, and HTTPS in the domain part. Rule-based approaches apply conditions that distinguish phishing URLs from legitimate ones. Machine-learning approaches are also used for phishing detection [33].
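The URL checks listed above can be illustrated with a minimal Python sketch. The function names, the chosen feature set, and the `max_dots` cut-off are assumptions made for illustration only; they are not taken from any of the surveyed papers.

```python
import ipaddress
from urllib.parse import urlsplit


def url_features(url: str) -> dict:
    """Extract a few of the URL features named in the text (illustrative subset)."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        "has_at_symbol": "@" in url,             # special character (@)
        "host_is_ip": host_is_ip,                # IP address instead of a domain name
        "hyphen_in_domain": "-" in host,         # prefix/suffix joined with "-"
        "https_token_in_domain": "https" in host,
        "num_dots_in_host": host.count("."),
    }


def looks_suspicious(url: str, max_dots: int = 3) -> bool:
    """Hypothetical rule: flag the URL if any single heuristic fires."""
    f = url_features(url)
    return any([
        f["has_at_symbol"],
        f["host_is_ip"],
        f["hyphen_in_domain"],
        f["https_token_in_domain"],
        f["num_dots_in_host"] > max_dots,        # assumed threshold
    ])
```

A real rule-based detector would weight or combine such features rather than flag on any single one, since legitimate domains can also contain hyphens or many sub-domains.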
Image analysis Image analysis covers images, logos, CAPTCHAs, and screenshots of the website that help to distinguish a phishing website from a legitimate one. The visual content similarity-based approach performs image analysis by comparing the logo of the website. To do this, a screenshot of the page is captured and the logo is extracted from it. This logo is compared with the blacklist, and if it matches, then it is a phishing site [19, 51]. The text extracted from the screenshot can also be used for phishing detection.
Fig. 3 Yearwise publication of selected papers
Text analysis The text content of a website helps in better detection of phishing attacks. The text content examined may be simple keywords, scripts, whether secure sockets layer (SSL) is enabled, and so on. The text content-based approach [4], the rule-based approach [19, 73], and machine-learning approaches [18, 29, 37] are used to analyze the text content.
Email content-based phishing Email-based phishing is the most common way of phishing. In email phishing, the phisher either redirects the user to a fraudulent/spoofed site or sends a malicious attachment that downloads and installs automatically, without the user's knowledge, when they click on the link. It provides unauthorized access to the user's system.
Spam filtering Spam filters [4, 30] separate phishing emails from spam. A few instructions are given to the spam filter, such as checking whether the sender information is blacklisted and whether the content conveys urgency or contains malicious attachments or suspicious URLs; these checks help in distinguishing phishing emails from legitimate ones.
URL analysis Phishing email has become a very common and easy way of stealing the credentials of Internet users by redirecting them. Before the user visits the site, the URL is validated to find suspicious ones. When the user clicks on a phishing hyperlink in an email, information such as domain details, destination details, and age of the domain is collected for the URL and verified before the page loads, and the user is allowed to proceed only when the information is valid [12].
Spelling and grammar correction Phishers send thousands of emails every day to fool Internet users into giving up their personal credentials. The content and the links look like those of a genuine mail, which fools the user into clicking on the links provided. These types of emails can be verified by checking the incoming mail for misspelt words and grammatical errors [34]. In paper [21], a toolbar is developed with an additional feature, “scam blocker”, which identifies spelling and grammar errors in the email. The phisher normally uses misspelt words (for example, Goog1e instead of Google) that spam filters fail to detect. Scam Blocker assists in detecting this type of email and blocks it before it reaches the inbox.
DNS DNS phishing (pharming) is phishing without a lure. In phishing attacks, the attacker focuses on an individual, but in pharming, they target an entire network by modifying the DNS entries, so that all requests are redirected to the attacker's server. Pharming attacks are very difficult to detect, and even the URL looks exactly the same as the legitimate one. There are a few works on pharming detection that compare IP addresses. In paper [25], the author compared the IP address of the current site with the one returned by the default DNS server; if they do not match, the site is flagged as pharming. More details are provided in “Existing Anti-phishing Approaches”.
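The IP comparison idea can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the two address lists are supplied by the caller (how they are obtained, for example from the local resolver and from a trusted DNS server, is outside the sketch), and it is not the procedure of paper [25].

```python
import ipaddress


def is_possible_pharming(resolved_ips, reference_ips) -> bool:
    """Return True when none of the locally resolved addresses appears among
    the reference addresses for the same domain.

    Note: an empty `resolved_ips` list is also reported as a mismatch.
    """
    local = {ipaddress.ip_address(ip) for ip in resolved_ips}
    trusted = {ipaddress.ip_address(ip) for ip in reference_ips}
    return local.isdisjoint(trusted)


# Example with documentation-only address ranges (not real servers).
reference = ["192.0.2.10", "192.0.2.11"]
print(is_possible_pharming(["192.0.2.11"], reference))
print(is_possible_pharming(["203.0.113.7"], reference))
```

Sites served from content delivery networks may legitimately resolve to different addresses from different resolvers, so a plain mismatch is better treated as a warning signal than as proof of pharming.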
Non‑content‑Based Phishing Detection
Non-content-based approaches focus on the features other than the content. By verifying the suspicious URL in the blacklist, based on user rating, the popularity of the domain and many other features, it could be decided whether the site is phishing or legitimate.
Existing Anti‑phishing Approaches
The existing anti-phishing approaches are developed using either content-based or non-content-based detection techniques. The efficiency of anti-phishing approaches depends on factors such as the features, the collection of data sets, and their size. Machine-learning approaches require more data samples to train the model to detect the phishing attack. Anti-phishing algorithms are also developed for phishing detection. Table 2 shows different algorithms used in different approaches with their performance and limitations. The primary requirement for anti-phishing is a data set. In paper [17], the author listed some benchmarking data set sources that provide legitimate and phishing data sets. The data set from PhishTank.com is the most widely used data set for phishing. The existing content-based and non-content-based anti-phishing approaches are given below:
Fig. 4 Evolution roadmap of anti-phishing solutions
Behaviour-based The behaviour-based approaches work on the behaviour of the Internet user to detect suspicious activities. In paper [65], the author presented a behaviour-based technique to detect Social Network Phishing (SNP). A study was conducted by randomly selecting 127 students who use Facebook. Four accounts were created: (i) one with no photo, personal info, or friends; (ii) one with a photo but no friends; (iii) one without a photo but with ten friends; and (iv) one with a picture and ten friends. They categorized SNP into two levels. At the first level, the phisher uses phony profiles to identify Facebook users. At the second level, they try to extract the information directly. Users responded to requests from accounts with more friends even when no picture of the person was available.
Visual content similarity-based approach The visual content similarity-based approach is used to visually compare the images, logos, and screenshots of the phishing site. In Refs. [28, 52], the screenshot of the URL requested by the user is obtained from the PhishTank website. Using clipping tools, the logo is separated from the screenshot. Later, the logo is given to the Google search engine and text content is obtained from the search results. If the current URL is listed in the Google search results, then it is considered legitimate, else phishing.
In paper [19], the image in a website is captured, and optical character recognition (OCR) is used to extract the textual content from the image. This textual content is then loaded into Google for domain matching. If it matches, a green color (for a trusted site) indicator is displayed, else a red color (phishing site) indicator appears.
In paper [75], the author introduced a visual similarity-based approach with local and global features to compare the phishing web page with a legitimate web page. A logo detection method is used to extract local features, and a modified EMD algorithm is used for global feature extraction. The screenshot of the suspected site is taken to extract non-text content features that include images of flash objects in an HTML page. To improve the performance of their technique, they collected a large amount of phishing and legitimate web pages and produced the outcome with 90% true positive and 97% true negative rates.
Fig. 5 Anti-phishing solutions for phishing detection

Table 1 List of phishing detection features at different levels
Types of features: List of features

Email features
  Header features: Compare-Msg-Sender-Domain, HTML-mail, Text-mail, Multi-Part-Mail, Number-Of-Receivers, Number-Of-Attachments, Subject-Bank-Word, Subject-Debit-Word, Subject-Fwd-Word, Subject-Reply-Word, Subject-Verify-Word, Subject-Num-Chars, Subject-Num-Words, Subject-Richness, Send-Num-Words, Send-Diff-Replay-to, Number-Of-Recipients, Number-Of-CcRecipients, Number-Of-BccRecipients, Absence of names (first, middle, last)
  URL feature in Email: Num-Of-Link, Number-Of-Diff-Domain, Num-Of-Diff-Link-Text, Num-Domain-NLSender, Num-Of-Dots-InDomain, Non visible links, Non matching links, Number-Link-Contain@, Number-Of-Link-ContainIP, Number-Of-Link-Contain-Esc, Number-Of-Link-Contain-NSPort, url-Bag-Link, Url-Num-Port, Black-List-URL, No. of Links Behind an image, Link with following word: Click, Here, Login, Update
  Word list feature: Boolean indicators of whether the words or stems listed below appear in the email body: account, update, confirm, verify, secure, notify, log, click, inconvenience, customer, client, suspend, restrict, Hold, Verify, username, password, SSN, user
  Structural features: Total number of body parts, Total number of alternative parts
  HTML content: HTML form, Contains Script, Count SSL Link, Number of links using Image, Number of non-ASCII links, Script onclick, Script popup, Script status change
  Email body features: Size of the document, Dear (keyword), no. of characters, no. of words, no. of unique words, Body richness, no. of Functional words, no. of suspension words, Verify your account phrase, Disparities between “href” attribute and LINK text, Mention of money, Presence of reply inducing sentence, sense of urgency

Website features
  Address bar features: IP address, Long URL that hides suspicious part, Tiny URL, URL with @ symbol, redirect using “//”, prefix or suffix to domain, HTTPS, favicon, Using Non-Standard Port, Sub-domain and multi sub-domains, Using free hosting Domains, Count of digit, Length of URL, Ratio of special characters, Registration date of Domain, No. of dots (.) in the URL, Port no. in the URL, No. of triplets in the path of URL, No. of triplets in the domain name, No. of Phishing keywords in the URL
  Abnormal web features: Request URL, URL of Anchor, Links in {<meta>, <script>, and <Link>}, Suspicious action upon submitted information, Submitted information to email, Website Owner, Abnormal URL, Abnormal DNS record, Abnormal Anchors, Abnormal server form handler, Abnormal certificate in SSL, The no. of web pages, The avg no. of inbound links, The avg no. of internal links, The avg no. of images, The avg no. of input boxes, The avg no. of password boxes, The proportion of form links, Dynamic web page proportion
  HTML and JavaScript: Websites forwarding, Status bar customization, Disabling right click, Pop-up window, Iframe redirection, Count of hidden tags, Count of external links, Count of unsymmetric tags, Count of JavaScript segments, Count of plug-ins and Active X controls, Count of long string, Count of Unicode characters, Count of Hex and Octal coding, Count of replace() function call, Count of eval() and exec() function, Count of string functions, Count of obfuscation function, Evaluation of (form, title, image, meta description, meta keywords, script, link and href) tags
  Domain features: Age of domain, DNS record, Web traffic, Google index, Number of links pointing to the web page, Statistics report-based features
  Graphical features: Grayscale histogram, color histogram, Spatial relationship between subgraphs are extracted from web image
  Country-code and TLD: TLD evaluation in the domain name, TLD evaluation in the part of the URL, Country-code and TLD comparison

URL features: IP-based URL, Age of the domain, Length of URL, No. of dots, Longest common sequence in URL, Presence of “@” and “-” symbol, Rank, Link-in-count, Mld-results, Mld-ps-results, Cardinality, Ration-associated, Ration-related, Jaccard-(RR, RA, AR, AA), Jaccard-AR-Registered, Jaccard-AR-Renaming, Domain exists in Alexa rank, Sub-domain length, Path length, URL entropy, Length ratio, Punctuation count, Euclidean distance

Social media features
  Twitter
    Account-specific features: Length of the account name, Length (size) of the account Description, Total count of friends, Total count of followers, User reputation, Ratio of followers and friends, Life time of the user account, Rate of friends, Rate of followers, Total count of tweets posted, Average count of tweets posted per day, Average count of tweets generated per week, Total count of tweets liked/favorited, Total count of lists
    Object Specific Features: Average count of hash-tags present in a tweet, maximum count of hash-tags present in a tweet, Fraction of tweets with a hash-tag, Average count of URLs per tweet, Maximum count of URLs present in a tweet, Fraction of tweets with URLs, Average count of mentions per tweet, Maximum count of mentions per tweet, Fraction of tweets with mention, Average count of re-tweets per tweet, Maximum count of re-tweets per tweet, Average count of favourites per tweet

Rule-based approach The rule-based approach [25, 29] is a content-based approach that analyzes the content within a URL, email, social media, and web content with some conditions (heuristics). In URL analysis, the content of the URL alone is analyzed.
Heuristics such as a large number of dots and slashes in the domain part, whether the URL is IP-based or not, and the presence of any special character (@) are extracted from the URL to predict phishing.
In paper [4], heuristics such as the primary domain, sub-domain, path domain, page rank, Alexa rank, and Alexa reputation are considered. When the user clicks on any link, these features are extracted and checked against the conditions. If the URL satisfies the conditions, then it is legitimate, else phishing. The lifespan of phishing URLs is very short, so they do not appear in top search results.
In paper [73], the author introduced CANTINA, a content-based approach for phishing web page classification. Term frequency-inverse document frequency (TF-IDF) is used to calculate the score of each term in a web page. The words with the highest TF-IDF scores are taken to generate a lexical signature. This signature is then submitted to the Google search engine to check whether the domain is listed in the top 30 results or not. If the current domain is not listed in the top 30 search results, then it is a phishing site.
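The TF-IDF signature step described above can be sketched in Python. This is a simplified illustration, not the CANTINA implementation: the background corpus, the signature length `k`, the tokenizer, and the function names are assumptions, and the search-engine query is not performed; the caller passes in the result URLs.

```python
import math
import re
from collections import Counter
from urllib.parse import urlsplit


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page_text, corpus, k=5):
    """Return the k terms of page_text with the highest TF-IDF score,
    using `corpus` (a list of other documents) to estimate document frequency."""
    tf = Counter(tokenize(page_text))
    docs = [set(tokenize(d)) for d in corpus]
    n = len(docs) + 1                      # the page itself counts as a document
    scores = {}
    for term, count in tf.items():
        df = 1 + sum(term in d for d in docs)
        scores[term] = count * math.log(n / df)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def is_phishing(page_url, result_urls, top_n=30):
    """Flag the page when its host is absent from the top_n search results
    obtained (elsewhere) by querying the lexical signature."""
    host = urlsplit(page_url).hostname
    top_hosts = {urlsplit(u).hostname for u in result_urls[:top_n]}
    return host not in top_hosts
```

In use, the signature terms would be joined into a query string, sent to a search engine, and the returned URLs passed to `is_phishing` together with the URL of the page under test.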
Text content similarity-based approach The text content similarity-based approach targets keywords that look very similar to the actual words, such as IC1CI instead of ICICI, which are used to fool online customers into giving up their personal credentials. To prevent this type of fraud, the textual content is analyzed, and a list of keywords is stored for verification. The database contains the keywords (such as click here, verify, login, apply online, dear, free access) commonly used in phishing emails. These approaches can monitor the incoming emails to check whether these keywords are present or not. If so, the email is classified as spam.
Text analysis also compares the current website content with stored profiles to spot phishing scams. A stored profile contains URLs, SSL certificate details, images, HTML contents, and scripts. In Ref. [4], the toolbar maintains a database with these profiles and extracts these features from the current site. If the extracted information does not match the stored profiles, then the site is phishing.
In paper [4], a blacklist of keywords is maintained as tokens, and every token of the email is checked against that list of blocked keywords. Each time a token is found in the list, a count is incremented, and finally, if the count crosses the threshold value, then it is a phishing email.
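The counting scheme above can be sketched as follows. The keyword set is an illustrative sample drawn from the keywords mentioned in this section, and the threshold value and function names are assumptions; the source does not state the actual list or threshold used in paper [4].

```python
import re

# Illustrative single-word sample; not the list used in the cited work.
BLOCKED_KEYWORDS = {"verify", "login", "click", "free", "dear"}


def count_blocked_tokens(email_text, blocked=BLOCKED_KEYWORDS):
    """Count how many tokens of the email appear in the blocked-keyword list."""
    tokens = re.findall(r"[a-z0-9]+", email_text.lower())
    return sum(token in blocked for token in tokens)


def is_phishing_email(email_text, threshold=3, blocked=BLOCKED_KEYWORDS):
    """Classify as phishing when the count crosses the (assumed) threshold."""
    return count_blocked_tokens(email_text, blocked) > threshold
```

For example, "Dear customer, click here to verify your login for free" contains five blocked tokens and is flagged with a threshold of 3, while an email with no blocked tokens is not.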
Machine learning Machine learning is a complex computation process of automatic pattern recognition and intelligent decision making based on training sample data [18]. Supervised and unsupervised learning are the two main categories of machine learning. Machine learning has the ability to learn from data without being explicitly programmed. Initially, in the training phase, a few instances (each row in the data set is called one instance) are used to train the model with a machine-learning classifier, and then a set of new instances is loaded to check whether the model classifies them properly or not.
In paper [29], supervised machine-learning algorithms (Adaline network, back propagation network, and support vector machine) are used with 15 features, such as the presence of an IP-based URL, a special character (@), an added hyphen (-), the use of anchor tags, and the age of the domain, for phishing detection. The data set is collected from PhishTank (phishing URLs) and Alexa (trusted URLs). Because the classifiers are supervised, the output label must be known during training. Later, testing data without output labels are given to check the efficiency of the model developed for phishing detection. The detection rate of machine learning can be calculated in terms of accuracy, precision, recall, false positive rate, and false negative rate.
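For reference, these measures follow directly from the confusion-matrix counts. The sketch below uses the standard definitions, treating "phishing" as the positive class; the function name and the example counts are illustrative and do not come from any surveyed paper.

```python
def detection_metrics(tp, fp, tn, fn):
    """Standard measures from confusion-matrix counts.

    tp: phishing correctly flagged     fp: legitimate wrongly flagged
    tn: legitimate correctly passed    fn: phishing missed
    """
    def ratio(num, den):
        return num / den if den else 0.0   # avoid division by zero

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
    }


# Made-up counts, for illustration only.
print(detection_metrics(tp=90, fp=5, tn=95, fn=10))
```

With these made-up counts the accuracy is 185/200 = 0.925, the recall is 90/100 = 0.9, and the false positive rate is 5/100 = 0.05.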
The Bayesian anti-phishing toolbar (B-APT) is a browser extension used to filter phishing emails. B-APT [37] has two parts:
• User interface.
• B-APT engine.
Table 1 (continued)
Types of features: List of features

Social media features
  Facebook
    Account specific features: Average count of hash-tags present in a tweet, maximum count of hash-tags present in a tweet, Fraction of tweets with a hash-tag, Average count of URLs per tweet, Maximum count of URLs present in a tweet, Fraction of tweets with URLs, Average count of mentions per tweet, Maximum count of mentions per tweet, Fraction of tweets with mention, Average count of re-tweets per tweet, Maximum count of re-tweets per tweet, Average count of favourites per tweet
    Object specific features: Average count of hash-tags per post, Maximum count of hash-tags per post, Fraction of posts with hash-tags, Average count of an occurrence of URLs per post, Maximum count of URLs present in a post, Fraction of posts with URLs, Average count of tags per post, Maximum count of tags per post
Table 2 Popular anti-phishing algorithms used in phishing detection

Research papers | [63] | [35] | [5] | [73] | [1] | [68] | [38] | [12] | [55]
Data set source, phishing | APWG archives | PhishTank | Manual | PhishTank | PhishTank | World Wide Web | WestPac | PhishTank | PIRT report
Data set source, legitimate | – | Google whitelist | Manual | Alexa, Yahoo | Web crawler | World Wide Web | WestPac | Common crawl | Google search
Data set size, phishing | 203 archives | 200 websites | 600 emails | 100 URLs | 3611 websites | 279 websites | 613048 emails | 1 million emails | 30 samples
Data set size, legitimate | – | 200 websites | 400 emails | 100 URLs | 1638 websites | 100 websites | 4625 emails | 1 million emails | 500 samples
Features (* marks; column positions not recoverable) | Email: 3 papers marked; Website: 4 papers marked; URL: 4 papers marked; Social media: none; DNS: none
Approach used | Rule-based, pattern matching | Machine learning | Machine learning | Rule-based | Machine learning | Machine learning | Machine learning | Machine learning | Blacklist
Algorithm used | LinkGuard | TSVM | Natural language processing, Wordnet | TF-IDF | Google page rank | Support vector machine (SVM) | Decision trees | Random forest, LSTM | Blacklist generator
Performance in %, FPR | – | – | 2 | 1 | – | – | – | – | 9
Performance in %, FNR | – | – | 4 | – | – | – | – | – | –
Performance in %, Precision | – | 96.4 | 99.6 | – | – | – | – | 98.6 | –
Performance in %, Recall | – | 90.7 | 99.3 | – | – | – | – | 98.9 | –
Performance in %, Accuracy | 96 | 95.5 | 99.4 | 90 | – | 84 | 99.8 | 98.7 | –
Performance in %, F-Measure | – | – | – | – | – | – | – | 98.7 | –
The user interface contains a toolbar and a wizard. The toolbar normally interacts with the B-APT engine and provides the URL or HTML. The B-APT engine decides whether the incoming URL is phishing or not. The B-APT engine has three modules: a document object model (DOM) analyzer, a whitelist module, and a scoring module. The DOM analyzer is a JavaScript program that has the ability to navigate a web site’s DOM. The DOM analyzer matches the current domain with the whitelist, and it also verifies the presence of any input fields. If there are none, there is no way that a user can enter their personal credentials. If there is an input field on the page, then the HTML is tokenized and sent to the scoring module. The scoring module assigns weights to the tokens using Bogofilter. Bogofilter checks the number of times a particular token is repeated and assigns the weight accordingly. These tokens help in detecting the phishing site accurately.
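The decision flow just described (whitelist check, input-field check, token scoring) can be outlined in Python. This is a rough sketch under stated assumptions, not B-APT itself: the real DOM analyzer is a JavaScript program and the scoring uses Bogofilter, whereas here a hypothetical `token_weights` dictionary and an assumed `threshold` stand in for the scoring module, and a page without input fields is simply treated as harmless.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit


class InputFieldFinder(HTMLParser):
    """Records whether the page contains any field a user could type into."""

    def __init__(self):
        super().__init__()
        self.has_input = False

    def handle_starttag(self, tag, attrs):
        if tag in ("input", "textarea"):
            self.has_input = True


def classify_page(url, html, whitelist, token_weights, threshold=1.0):
    host = urlsplit(url).hostname or ""
    if host in whitelist:                      # whitelist module
        return "legitimate"
    finder = InputFieldFinder()
    finder.feed(html)
    if not finder.has_input:                   # nowhere to enter credentials
        return "legitimate"
    # Stand-in for the scoring module: repeated tokens add their weight again.
    tokens = re.findall(r"[a-z]+", html.lower())
    score = sum(token_weights.get(t, 0.0) for t in tokens)
    return "phishing" if score >= threshold else "legitimate"
```

Because each occurrence of a weighted token contributes to the score, a token that is repeated more often pushes the page closer to the threshold, which loosely mirrors the repetition-based weighting described above.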
The author in paper [6] proposed a machine-learning-based approach for detecting malicious URLs in social networks such as Twitter. The data set is collected using the Twitter API by filtering the tweets that contain URLs. From those URLs, 12 features are extracted for initial assessment, and pre-processing is then performed to improve the results. Random Forest, a supervised classifier, is used to classify whether a URL is phishing or not, with a recall value of 0.92.
In another work [14], the author proposed a logo-based phishing website detection scheme using machine learning, which has two steps. Logo extraction is the first step, where the logo image is extracted from the web page using a machine-learning technique. In the next step, the extracted image is submitted to the Google search engine, and the domain information associated with that image is compared with the current domain for phishing detection.
Email metadata Email metadata is the data stored in an email about the email itself [64]. Metadata is collected and stored as one file entry per email, and these data are used to cross-verify whether the emails are correctly classified as spam or not. Metadata contains a large number of fields, but only a few of them are required to separate phishing emails from legitimate ones.
In paper [30], the authors used the metadata of WEBCO (an email system) for classifying phishing emails in DROPBOXES, with the following fields: time stamp, source IP address, SMTP “mail to”, SMTP “mail from”, From, Subject, and URLs. They followed three different ways to classify the phishing emails [30]:
• Direct identification of DROPBOXES in WEBCO.
• Indirect identification of DROPBOXES in WEBCO.
• Identifying the source of DROPBOX email.
Table 2 (continued)

| Research papers | Limitations |
|---|---|
| [63] | LinkGuard may result in false positives, since using dotted decimal IP addresses instead of domain names may be desirable in some special circumstances |
| [35] | Major limitation of TSVM is that it involves an expensive matrix inverse operation when solving the dual problem |
| [5] | The data set size is small. The machine-learning classifier needs more data for training the model to get good results |
| [73] | It fails if the phisher uses a language other than English. It is a time-consuming process, as it queries Google each time. It also fails in the following cases: (a) using images in place of text, (b) using invisible text, (c) changing the words to confuse the system |
| [1] | Google pagerank algorithm can't classify phishing attacks correctly if it is a newly registered domain |
| [68] | A smaller number of mislabeled examples can drastically affect DNS phishing attacks |
| [38] | They considered only one part of the features and they didn't address DNS phishing attacks |
| [12] | The inner workings of LSTM are not easy to interpret. The random forest required expert knowledge for feature selection |
| [55] | Accuracy in detecting new phishing attacks is based on the updates received. It has a high false-positive rate |

Pattern matching Pattern matching is normally used to detect unknown phishing attacks. In pattern matching, the DNS information is verified to spot malicious links. Sometimes, the DNS name in the URL is different from the DNS name in the sender information. Pattern matching compares these two names to identify the phishing URL.
In paper [63], the LinkGuard algorithm is used for phishing detection using pattern matching. Pattern matching can be done by extracting the domain names from the URL and from the sender information; if these two pieces of information do not match, the link can be treated as phishing. It can also be done by manually storing a list of domain names and comparing the current domain name with that list to generate a similarity score, which helps to distinguish phishing URLs from legitimate URLs. DontPhishMe [43] is a browser extension (Firefox) that uses pattern matching for phishing detection.
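The two pattern-matching variants described above can be sketched roughly as follows. This is a minimal illustration and not the LinkGuard algorithm itself: the function names, the `KNOWN_DOMAINS` list, and the use of `difflib.SequenceMatcher` as the similarity score are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

# Hypothetical, manually stored list of legitimate domains (illustration only).
KNOWN_DOMAINS = ["paypal.com", "ebay.com"]


def link_host(url):
    """Return the lower-cased host part of a URL, or '' if it has none."""
    return (urlparse(url).hostname or "").lower()


def sender_domain(address):
    """Return the domain part of an email address."""
    return address.rsplit("@", 1)[-1].lower()


def names_mismatch(sender, url):
    """True when the link's host is neither the sender's domain nor a subdomain of it."""
    host, domain = link_host(url), sender_domain(sender)
    return not (host == domain or host.endswith("." + domain))


def closest_known_domain(host):
    """Return the stored domain most similar to `host` and its similarity in [0, 1]."""
    scored = [(d, SequenceMatcher(None, host, d).ratio()) for d in KNOWN_DOMAINS]
    return max(scored, key=lambda pair: pair[1])
```

A host that is very similar to a stored domain without being identical to it (for example, one swapped character) is the kind of case the similarity score is meant to surface; where to set the cut-off is left open here.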
Blockchain Blockchain-based solutions are good at detecting phishing attacks at the DNS level. As such a solution maintains its own naming system and all users can hold a complete copy of the information locally, any correction made is automatically updated everywhere. Namecoin, Blockstack, Nebulis, and Bitforest are examples of blockchain-based naming systems.
Namecoin [20] was developed by modifying the Bitcoin source code to store information other than digital currency. It is the first blockchain-based naming system, and it introduced the concept of merged mining (mining of more than one cryptocurrency) in the blockchain. Information stored in a blockchain is like an open ledger that is available to everyone in a decentralised manner. Data in the blockchain are immutable, so unauthorized modifications are not possible.
In paper [9], the author proposed an alternative naming system called Blockstack by addressing the limitations encountered in their previous work with Namecoin [20]. Blockstack is a blockchain-based naming and storage system. The main limitation of Namecoin is storage, which is addressed in Blockstack by providing a separate layer for storage. It splits the stored domain information into zones and maintains them in separate layers. Layer 1 is used for consensus on the data stored in the blockchain. Layer 2 is a virtual layer for Blockstack operations and maintains a virtual chain. Layer 3 is the routing layer, which helps in fetching data from the actual source and supports multiple storage providers. Layer 4 is the top layer, where the actual data are stored. These layers help in locating the data quickly, and Blockstack is proposed as an alternative to the DNS naming system.
Blacklist The blacklist-based approach maintains a list of known phishing URLs and checks whether the currently visited URL is listed in the data set or not. Phishing data can be collected manually or from a third party. It helps in detecting phishing in an easy and effective manner. However, a newly registered domain cannot be identified accurately unless the data set is updated frequently [28, 41, 72].
Whitelist The whitelist does not maintain any phishing data. Instead, it maintains a list of all trusted websites’ information. Any URL that does not appear in the whitelist is treated as suspicious. The whitelist should therefore hold the information of all trusted sites. However, it is not easy to maintain all the legitimate sites on the web under one roof to decide the legitimacy of a web page [28, 64].
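The difference between the two list-based checks can be shown with a small sketch. The host names and the two sets below are invented for the example; a real tool would load its lists from a maintained feed or database.

```python
from urllib.parse import urlparse

# Hypothetical list contents, for illustration only.
BLACKLIST = {"login-update.example.net"}
WHITELIST = {"bank.example.com"}


def host_of(url):
    """Return the lower-cased host part of a URL, or '' if it has none."""
    return (urlparse(url).hostname or "").lower()


def blacklist_verdict(url):
    """A blacklist can only confirm known phishing hosts; anything else is unknown."""
    return "phishing" if host_of(url) in BLACKLIST else "unknown"


def whitelist_verdict(url):
    """A whitelist trusts only listed hosts; anything else is treated as suspicious."""
    return "trusted" if host_of(url) in WHITELIST else "suspicious"
```

A newly registered phishing host comes back as "unknown" from the blacklist check until the list is updated, while a legitimate site missing from the whitelist comes back as "suspicious", which mirrors the two limitations noted above.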
Domain popularity The domain popularity-based approach [30] works on the basis of certificate details, domain registration details, certificate authority, and so on. If the user clicks on a suspicious link, the browser extension sends the link to a server under its control, which extracts features such as domain name, validity, and certificate authority and verifies this information with Google. Based on the results, the toolbar alerts the user.
Restricted form filling Restricted form filling [47, 67] is offered as an anti-phishing browser extension that keeps track of user credentials and alerts the users when they try to enter that information in any fraudulent site. The credentials of the user are stored and protected with a master password. The next time the user visits the site, there is no need to enter the credentials; instead, the user just clicks on the icon provided by the browser extension. After logging in to the browser extension once, the user can log into any of the websites without entering the credentials again. The anti-phishing tools [47, 67] maintain a database to store the login details of the users. These login details can be accessed from any system by simply installing the extension/toolbar.
Dummy content filling Dummy content filling [69] is offered as a browser extension that helps the user to not fall victim to phishing. When the user visits a fraudulent site by ignoring the security alert, BogusBitter generates a set of S-1 bogus credentials to go with the real one (S credentials in total); it then submits the credentials one by one with a delay of a few milliseconds and validates the web page. If the user instead clicks on the warning alert and goes back to the original site, the credentials are filled in on the trusted site.
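One way to read the description above is that the real credential is hidden at a random position among S-1 bogus ones. The sketch below follows that reading only; the function names and the way bogus values are generated are assumptions, not the actual BogusBitter implementation, and no submission to a site is performed.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def bogus_like(value):
    """Return a random string of the same length as `value` that differs from it."""
    if not value:
        raise ValueError("value must be non-empty")
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(len(value)))
        if candidate != value:
            return candidate


def build_submission_set(username, password, s):
    """Return S (username, password) pairs: S-1 bogus ones plus the real one."""
    batch = [(bogus_like(username), bogus_like(password)) for _ in range(s - 1)]
    # Insert the real credential at a random position so it does not stand out.
    batch.insert(secrets.randbelow(s), (username, password))
    return batch
```

As the survey notes elsewhere, a phisher may still filter such a batch, so the bogus values would need to be hard to tell apart from real ones; the random strings used here are only a placeholder for that.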
Layout similarity The layout-similarity-based approach works by comparing the layout of the web pages. This can be performed with the help of the document object model (DOM), an internal representation of the web pages. Extracting the DOM tree from the web page can be achieved in two ways:
• Simple HTML tags.
• Identifying the isomorphic sub-trees.
To detect phishing sites, the DOM tree is extracted from both websites (i.e., the current website and the original website). If both websites have the same layout although they are different sites, then the current website is treated as a phishing site that replicates the layout of the original website. DOM AntiPhish is an example of a layout-similarity-based approach. In DOM AntiPhish [51], the password is hashed, and the DOM tree of the website where the user first entered his/her credentials is stored. Later, if the credentials are used on any other site, it compares the layout of the page to see whether the current website is phishing or not.
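A rough sketch of the first extraction option (simple HTML tags) is given below: each page is reduced to its sequence of opening tags, and the two sequences are compared. This is an illustrative stand-in, not the DOM AntiPhish method; `layout_similarity` and the use of `difflib.SequenceMatcher` are assumptions, and isomorphic sub-tree matching is not attempted.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the names of opening tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Reduce a page to its list of opening tag names."""
    collector = TagCollector()
    collector.feed(html)
    return collector.tags


def layout_similarity(html_a, html_b):
    """Similarity in [0, 1] between the tag sequences of two pages."""
    return SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b)).ratio()
```

Two pages with identical markup but different visible text score 1.0 under this measure, which is the situation of a phishing page that copies the layout of the original site.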
User website rating In user website rating [21, 44], feedback on the website is collected from users, and the website’s trustworthiness is decided based on that feedback. When customers visit the site, they are asked to rate its legitimacy, so that the website can be classified according to the user responses. These tools consider other features in addition to the rating to decide whether the website is a phishing site or not.
Crowdsourcing The web of trust (WOT) is a crowdsourcing-based browser extension that depends on the ratings users give to the websites they visit [76]. It protects the user from attacks that can only be identified by the human eye, such as scams, unreliable web stores, and questionable content. WOT is a patented system in which the behaviour of the user is regularly observed and analyzed to justify the rating. WOT works as follows: when the user searches for some content in a search engine, each search result is displayed with an indicator at its corner. Green indicates a trusted site, yellow a doubtful site, and red a suspicious site.
Steganography-based The steganography approach uses a novel robust message-based image steganography (RMIS) algorithm [61]. Pre-processing is the first step in the RMIS technique; it outputs the embedding sequence by converting binary values to decimal values. Next, the product of the embedding sequence and the image size (rows × columns) gives the Stego-Key. The embedding phase hides the secret message in the given cover image in such a way that the resulting Stego-image cannot be distinguished from the cover image by the human visual system (HVS). The extraction phase extracts the embedded secret message from the Stego-image using the same secret key as in the embedding phase. A bank website that wishes to use the Pixastic plug-in should incorporate the Stego-image generated by the RMIS embedding algorithm in its website.
One-time password The one-time password (OTP) is very important for present-day financial security; it helps to defend against session hijacking attacks and ensures that only the valid customer can perform the transaction. The OTP is sent by the server to the customer during a transaction, either to the mobile phone or to the email address registered with the concerned bank account. The bank allows the particular transaction only if the OTP entered by the user matches [54]. The single password protocol (SPP) allows the customer to use a one-time password for their accounts. There are two one-time password protocols, namely, Lamport’s one-time password and Rubin’s one-time password. These two protocols work prior to the operation of SSL.
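Lamport's one-time password scheme, named above, is built on a hash chain: the server stores the n-th hash of a secret seed, and each login reveals the value one step earlier in the chain. The sketch below illustrates that idea only; SHA-256, the class name `Server`, and the helper names are choices made for the example and are not taken from [54].

```python
import hashlib
import hmac


def h(data):
    """One-way function used to build the chain (SHA-256 is an assumption here)."""
    return hashlib.sha256(data).digest()


def chain(seed, n):
    """Apply h to the seed n times."""
    value = seed
    for _ in range(n):
        value = h(value)
    return value


class Server:
    """Stores only the last accepted chain value, never the seed."""

    def __init__(self, anchor):
        self.stored = anchor

    def verify(self, otp):
        # A valid OTP hashes to the stored value; it then becomes the new stored value.
        if hmac.compare_digest(h(otp), self.stored):
            self.stored = otp
            return True
        return False
```

With a seed and n = 5, the server is initialised with `chain(seed, 5)`, the first login sends `chain(seed, 4)`, the second sends `chain(seed, 3)`, and so on; a captured password cannot be replayed, because the stored value has already moved one step down the chain.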
The author in paper [32] proposed a visual cryptography technique for phishing website detection. Image-based verification is applied in this technique, i.e., the original CAPTCHA is split into two shares: one is kept with the user and the other with the server. While authenticating, both shares should be presented simultaneously; only then is the CAPTCHA revealed and used as a password. This helps the two parties authenticate each other before connecting.
Watermarking Watermarking can be used to keep the user from entering credentials on a fraudulent site. In this approach, at the time of registration, the user is asked to select a watermarking image and the position of the watermarking image, and a secret key is collected. Based on this information, a customer is identified uniquely. When the user tries to log in, the user first identifies the position where the watermarking image is fixed and then enters the secret key to authenticate oneself [56].
DNS-based An advanced form of phishing, i.e., phishing without a lure, is called pharming or DNS-based phishing. To detect DNS-based phishing, we have to find out whether the IP address provided by the DNS server is genuine or fake. In a DNS attack, the phisher replaces the DNS entries of a targeted domain with the IP address of the phisher’s server to redirect the traffic.
In paper [13], a database is maintained to store the bank name, the IP address of its DNS server, and the user’s personal credentials. If the personal credentials of the user are being entered in some other site, an inverse DNS query is sent to the respective bank to confirm whether the site is a domain of that bank or not. Only then does the device allow the transaction to proceed.
In paper [26], the author developed a dual approach to detect client-side pharming attacks. When a user requests a website, the DNS request is sent to two DNS servers, i.e., the local (default) DNS server and a third-party DNS server. The approach checks whether the IP address given by the local DNS server is included in the list of IP addresses obtained from the third-party DNS server and allows the user to proceed only if it matches; otherwise, it collects the source code of the current page and of the original site (from the third-party DNS) for web content analysis [26]. A score is calculated and compared with a threshold value, and the site is considered a phishing site if the score crosses the threshold.
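The decision logic of the dual approach can be sketched as below. The DNS lookups and page downloads are assumed to have been done already and are passed in as plain values; the difference score based on `difflib.SequenceMatcher` and the default threshold of 0.5 are placeholders chosen for this example, not the scoring used in [26].

```python
from difflib import SequenceMatcher


def check_pharming(local_ip, reference_ips, current_html, reference_html, threshold=0.5):
    """Return 'allow' or 'phishing' following the two-step check described above.

    local_ip:       address returned by the local (default) DNS server
    reference_ips:  addresses returned by the third-party DNS server
    current_html:   source of the page reached through the local address
    reference_html: source of the page reached through the third-party addresses
    """
    # Step 1: the local answer is accepted if the third-party server agrees with it.
    if local_ip in reference_ips:
        return "allow"
    # Step 2: otherwise compare page contents; a large difference suggests pharming.
    difference = 1.0 - SequenceMatcher(None, current_html, reference_html).ratio()
    return "phishing" if difference > threshold else "allow"
```

The second step matters because a legitimate site can answer from several addresses; identical content behind a different address is allowed in this sketch, while clearly different content is flagged.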
Hashing‑Based
Hashing techniques can be used to protect the user credentials by hashing the password, domain name, email ID, etc., which helps in verifying the site before providing the password. Password hashing techniques available for phishing detection include Passpet [67] and PwdHash [53].
In paper [67], a single password (master password) is provided to manage multiple accounts. A user-assigned pet name helps to identify the site uniquely. The password (master password) is generated using some hashing techniques.
In paper [53], the hashing technique is applied to generate a separate password for each site. The hashed password is generated by combining the domain and the password of that site, so that a password disclosed to one site does not affect the other sites.
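The idea of deriving a per-site password from the domain and the user's password can be illustrated as follows. This is a sketch of the general technique only: HMAC-SHA-256, the base64 encoding, the 16-character length, and the name `site_password` are assumptions for the example and do not reproduce the exact PwdHash construction.

```python
import base64
import hashlib
import hmac


def site_password(user_password, domain, length=16):
    """Derive a site-specific password from the user's password and the site's domain."""
    digest = hmac.new(
        user_password.encode("utf-8"),   # the user's password acts as the key
        domain.lower().encode("utf-8"),  # the domain is the message being hashed
        hashlib.sha256,
    ).digest()
    # Encode the digest as printable characters and truncate to the wanted length.
    return base64.urlsafe_b64encode(digest).decode("ascii")[:length]
```

Because the domain is part of the input, a look-alike phishing domain receives a different derived password from the one the genuine site expects, so what the phisher captures is of no use on the real site.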
Existing Anti‑phishing Browser Extensions/Toolbars
Most of the anti-phishing solutions are available as a browser extension/toolbar. When users install an anti-phishing toolbar/browser extension, it keeps monitoring the user activities and alerts the user on any attempt to access a suspicious link. A few approaches are still at the research level and have not fully evolved into browser extensions. In Fig. 5, the black rectangular boxes are the approaches that evolved as browser extensions and the pink rectangular boxes are the approaches still at the research level.
Maturity Level
The maturity level of anti-phishing approaches is catego-rized into two types:
Anti‑phishing Approaches that Evolved as Browser Extensions/Toolbars
The anti-phishing browser extension is very useful in protecting Internet users from phishing attacks. Different types of anti-phishing solutions are available, and each of them follows various approaches such as blacklist, whitelist, heuristics, layout similarity, machine learning, and so on.
Some popular browser extensions/toolbars are listed in Table 3, which also includes the approach used, mode of operation, advantages, and disadvantages of these toolbars.
Anti‑phishing Approaches at Research Level
A lot of research is going on to find a better solution for the prevention of phishing attacks. Approaches such as watermarking, one-time password, and email metadata are at the research level and have not fully evolved into browser extensions/toolbars. Corporate companies, banks, anti-phishing organizations (APWG, PhishTank, PhishME, and so on), and many others are fighting against phishing. Machine-learning, rule-based, and list-based approaches (blacklist, whitelist) are available as browser extensions, and more research works on them are also available.
Mode of Operation
The anti-phishing toolbars work based on the data set used by the different anti-phishing approaches to detect phishing scams. Some toolbars maintain their own data set to check whether the given link is phishing or not. A few toolbars depend on a third party for phishing detection. Depending on the anti-phishing approach and the data set used, the mode of operation can be classified as follows:
1. Stand-alone In stand-alone mode, the toolbars maintain their own database or predefined rules for decision making. They classify phishing and non-phishing content from the locally available information. AntiPhish, BogusBitter, PhishZoo, etc., are example tools that work independently.
2. From server In this mode, the anti-phishing tools get help from their own server to check whether the given website or URL is phishing or not, for example, by maintaining an updated blacklist and whitelist to tell malicious URLs from trusted ones. TrustWatch, Pixastic, PhishProof, etc., are fully dependent on their server.
3. From third party In some cases, the anti-phishing tools must depend on third parties for better classification, for example, to verify the DNS information, domain validity, and SSL certification, to check URLs against a blacklist through an API, to extract the text from an image, and many more. GoldPhish, LinkGuard, and web of trust (WOT) are a few works that come under this category.
Discussion
In this paper, a taxonomy of anti-phishing solutions is discussed. The anti-phishing solutions are broadly classified into content-based and non-content-based approaches, which are briefly explained. The raised research questions are answered below:
RQ1 What are the areas that current anti-phishing solu-tions address?
When compared to non-content-based approaches, content-based approaches are better at detecting phishing. New phishing attacks are difficult to detect with non-content-based approaches because of the delay in their updates. Content-based approaches such as rule-based and machine learning are good at detection, but machine-learning approaches may sometimes have high false-positive rates. Blockchain-based solutions (Blockstack) are good at detecting DNS phishing (pharming). Different approaches use different anti-phishing algorithms for phishing detection. Mobile phishing, voice phishing, and social media phishing are the areas where more research is required.
Table 3 List of existing anti-phishing browser extensions

| S. no. | Name of the toolbar | Approach used | Mode of operation | PROS | CONS |
|---|---|---|---|---|---|
| 1. | AntiPhish [47] | Restricted form filling | Stand-alone | AntiPhish detects phishing attacks correctly if it is purely an HTML web page | It requires manual interaction of the user. Generates false alarms |
| 2. | B-APT [37] | Machine learning | Stand-alone | It uses a machine-learning approach with a DOM analyzer for phishing detection | B-APT is vulnerable to website spoofing attack |
| 3. | BogusBitter [69] | Dummy content filling | Stand-alone | It feeds a large number of bogus credentials to protect the user credentials from the phisher | The phisher uses filtering techniques to collect the credentials |
| 4. | DOM AntiPhish [51] | Layout similarity | Stand-alone | The browser automatically stores the user password by hashing it. If the password is reused, it gives an alert to the users | Spoofed web pages with similar images and visual looks of the legitimate site can fool the user |
| 5. | Dynamic security skin [16] | Visual similarity | Server | The user has to remember an image and an image to authenticate oneself to the server. To authenticate, the user has to perform a visual matching | There is a chance of leaking the verifier or the images, and visual contents can be spoofed by the phisher |
| 6. | eBay Account Guard [22] | Heuristic, blacklist | Server | It allows users to submit suspected sites to eBay, which can be added to their blacklist | Only applicable to eBay and PayPal sites, and denial of service attacks are possible |
| 7. | FirePhish [60] | Open database | Server | It maintains its own database to store the phishing sites for better detection of the attacks | They have to maintain their own safe and phishing sites |
| 8. | GoldPhish [19] | Visual similarity | Third party | Protects from zero-day phishing | Delays the rendering of a web page. Google PageRank algorithm is vulnerable to new phishing attacks |
| 9. | iTrustPage [50] | Blacklist, whitelist | Third party | It is effective and easy to use | Phishing pages should be discovered quickly and added to a blacklist. The blacklist alone can't be a better solution for phishing detection |
| 10. | LinkGuard [63] | Blacklist, whitelist, pattern matching | Third party | It detects known and unknown attacks with an accuracy of 96%. There are no false positives and false negatives for category 1 | False positives are possible in the category 2 solution in the case of IP address verification in the place of domain name |
| 11. | McAfee site advisor [57] | Rating the site with their own tests | Server | McAfee maintains their own database, which uses automatic crawlers that search the sites, perform tests, and include the results in the database | It is vulnerable to phishing sites with embedded objects |
| 12. | Microsoft smart screen filter [40] | Blacklist, heuristics | Server | It provides additional security at the network level. It also protects from malicious attachments like keyloggers | It may be vulnerable to newly created phishing attacks if the blacklist is not regularly updated |
Table 3 (continued)

| S. no. | Name of the toolbar | Approach used | Mode of operation | PROS | CONS |
|---|---|---|---|---|---|
| 13. | Netcraft [44] | Blacklist, heuristics, user rating | Stand-alone | It allows phishing site feed, provides phishing alerts, and maps current phishing attacks | The information like site rank, IP address, web server, net-block owner, and last changes made can help the phisher in many ways |
| 14. | Passpet [67] | Restricted form filling | Server | Allows the user to remember only one password to log in with multiple systems | Vulnerable to pharming attack. The phisher can steal the credentials of non-SSL protected sites by hijacking. It is also vulnerable to offline dictionary attacks |
| 15. | PhishProof [70] | Blacklist, whitelist, heuristics | Server | PhishProof uses three levels of security. It alerts the users on phishing sites. User input is not required. Users can also report phishing sites | It cannot protect the users from malware |
| 16. | PhishTank Site Checker [62] | Open database | Server | It blocks the users from the sites which are already reported as phishing in their open database | New phishing attacks become difficult to detect unless the database is updated frequently. It is slow, because the users have to report the site as phishing |
| 17. | PhishZoo [4] | Content similarity | Server | PhishZoo creates their own trusted profiles of legitimate sites using a fuzzy hashing technique to detect phishing | PhishZoo is vulnerable to website spoofing attack |
| 18. | Pixastic [61] | Steganography-based | Server | Robust message-based image steganography algorithm is used to hide the secret image and protect the users from entering personal credentials in phishing websites | Vulnerable to DNS spoofing attack and brute force attack, and print screen is also possible |
| 19. | SpoofGuard [15] | Heuristics | Stand-alone | The advantage of this toolbar is stopping the outgoing data to phishing sites by performing an image check and a password check | It shows a false alarm when the user visits the legitimate site for the first time |
| 20. | SpoofStick [39] | – | Stand-alone | The user can change the appearance of the toolbar because of its user-friendliness, and they address the graphics property | Vulnerable to iframes attack if the user opens multiple windows while surfing |
| 21. | The Earthlink toolbar [21] | Heuristics, user rating | Server | It relies on the combination of heuristics, user ratings, and manual verification. The toolbar displays a thumb to indicate whether the site is phishing or not | No alert message is displayed for users. User ratings produce more false alarms |
| 22. | TrustWatch [27] | Blacklist | Server | TrustWatch provides a personal security ID to prevent toolbar spoofing. It is easy to use | Vulnerable to newly created phishing attacks if the database is not updated regularly |
RQ2 Do the existing anti-phishing toolbars cover all the phishing attacks?
Whenever a researcher provides a solution to the problem, the attacker comes up with a new trick; it is like a race. Most of the anti-phishing toolbars work on one specific type of attack. BogusBitter [69] is a toolbar that feeds bogus credentials to the phishing site to protect the user credentials from the phisher. However, with a simple filtering technique, the phisher can filter the information. Web of trust (WOT) [76] is a crowdsourcing-based technique that depends on user ratings. If a single user rates the site as suspicious, the result will change drastically. A few toolbars use heuristics and a blacklist for phishing detection. However, they may fail to detect new phishing scams if the update is delayed. Most Internet users are not aware of many phishing attacks. The performance of an anti-phishing toolbar depends on the approach and the data set it uses.
RQ3 What are the current research gaps in anti-phishing?
Anti-phishing solutions help Internet users to accurately identify the phishing attack. Many works on email phishing detection and website phishing detection have been done and are published in many online sources. Social media phishing is difficult to detect due to its changing nature. The need to identify fake news, fake offers, malicious attachments, links, and fake profiles makes social media phishing complicated to detect. As stated in [8], fewer works have been done on instant messaging, social media, voice, blogs, and web forums.
Conclusion
In the above literature survey, we discussed phishing, anti-phishing, a complete classification of anti-phishing solutions, the evolution roadmap of anti-phishing solutions, a consolidated feature list for phishing detection, and a list of existing anti-phishing toolbars. Anti-phishing solutions can be classified into two categories, i.e., (i) content-based and (ii) non-content-based approaches. Content-based approaches work by analyzing the content of the web page, email, and URL. Non-content-based approaches use non-content features such as a blacklist, whitelist, and so on. Different anti-phishing approaches use different algorithms for phishing detection. These algorithms have been listed in Table 2 with their performance metrics, data sets, and limitations. Not all anti-phishing approaches have evolved into browser extensions; the few approaches that remain at the research level are listed. The approaches at the research level and the fully evolved browser extensions are distinguished with two different colors. The pros and cons of the existing anti-phishing toolbars are also listed. The study suggests that existing anti-phishing approaches focus only on specific types of attacks. Mobile phishing, voice phishing, and social media phishing are the areas where more research is required.

Table 3 (continued)

| S. no. | Name of the toolbar | Approach used | Mode of operation | PROS | CONS |
|---|---|---|---|---|---|
| 23. | Verisign EV green bar extension [24] | Domain popularity | Server | It detects the phishing sites by verifying the SSL certificates of the site | It only identifies SSL certificates given by VeriSign, not the other valid SSL certificates |
| 24. | Virtual browser extension [46] | Blacklist, heuristics, visual similarity | Third party | Alerts the users if the site is not present in the whitelist they are maintaining | Vulnerable to key-loggers, screen loggers, and client-side scripting attack |
| 25. | Web of trust (WOT) [76] | Blacklist, crowdsourcing | Third party | The reputation of the site is shown next to the search results. Very user-friendly | A single rating from a person can make the site unsafe, because it depends on user ratings |
Funding This study was not funded by anyone.
Compliance with Ethical Standards
Conflict of interest The authors declare that they have no conflict of interest.
References
1. Abunadi A, Akanbi O, Zainal A. Feature extraction process: a phishing detection approach. In: Intelligent systems design and applications (ISDA), 2013 13th international conference on. IEEE. 2013. pp. 331–335.
2. Abutair HY, Belghith A. Using case-based reasoning for phishing detection. Proc Comput Sci. 2017;109:281–8.
3. Adewumi OA, Akinyelu AA. A hybrid firefly and support vector machine classifier for phishing email detection. Kybernetes. 2016;45(6):977–94. https://doi.org/10.1108/K-07-2014-0129.
4. Afroz S, Greenstadt R. Phishzoo: detecting phishing websites by looking at them. In: 2011 IEEE fifth international conference on semantic computing. 2011. https://doi.org/10.1109/ICSC.2011.52.
5. Aggarwal S, Kumar V, Sudarsan S. Identification and detection of phishing emails using natural language processing techniques. In: Proceedings of the 7th international conference on security of information and networks. ACM, Glasgow, Scotland, UK. 2014. p. 217.
6. Al-Janabi M, Quincey E, Andras P. Using supervised machine learning algorithms to detect suspicious URLs in online social networks. In: Proceedings of the 2017 IEEE/ACM international conference on advances in social networks analysis and mining 2017. ASONAM ’17, ACM, New York, NY, USA. 2017. pp. 1104–1111. https://doi.org/10.1145/3110025.3116201.
7. Alam S, El-Khatib K. Phishing susceptibility detection through social media analytics. In: Proceedings of the 9th international conference on security of information and networks. SIN ’16, ACM, New York, NY, USA. 2016. pp. 61–64. https://doi.org/10.1145/2947626.2947637.
8. Aleroud A, Zhou L. Phishing environments, techniques, and countermeasures: a survey. Comput Secur. 2017;68:160–96.
9. Ali M, Nelson JC, Shea R, Freedman MJ. Blockstack: a global naming and storage system secured by blockchains. In: USENIX annual technical conference. 2016. pp. 181–194.
10. AlShboul R, Thabtah F, Abdelhamid N, Al-diabat M. A visualization cybersecurity method based on features’ dissimilarity. Comput Secur. 2018;77:289–303.
11. Anti-Phishing Working Group. Phishing Activity Trends Report, 1st Quarter 2018. 2018. pp. 1–12.
12. Bahnsen AC, Bohorquez EC, Villegas S, Vargas J, González FA. Classifying phishing URLs using recurrent neural networks. In: 2017 APWG symposium on electronic crime research (eCrime). 2017. pp. 1–8. https://doi.org/10.1109/ECRIME.2017.7945048.
13. Bin S, Qiaoyan W, Xiaoying L. A DNS based anti-phishing approach. In: Networks security wireless communications and trusted computing (NSWCTC), 2010 second international conference on. vol. 2. IEEE. 2010. pp. 262–265.
14. Chiew KL, Chang EH, Sze SN, Tiong WK. Utilisation of website logo for phishing detection. Comput Secur. 2015;54:16–26. https://doi.org/10.1016/j.cose.2015.07.006.
15. Chou N, Ledesma R, Teraguchi Y, Mitchell JC, Ca S. Client-side defense against web-based identity theft. In: NDSS 2004.
16. Dhamija R, Tygar JD. The battle against phishing: dynamic security skins. In: Proceedings of the 2005 symposium on usable privacy and security. ACM, 2005. pp. 77–88.
17. Dou Z, Khalil I, Khreishah A, Al-Fuqaha A, Guizani M. Systematization of knowledge (SoK): a systematic review of software-based web phishing detection. IEEE Commun Surv Tutor. 2017;19(4):2797–819. https://doi.org/10.1109/COMST.2017.2752087.
18. Dua S, Du X. Data mining and machine learning in cybersecurity. Boca Raton: CRC Press; 2016.
19. Dunlop M, Groat S, Shelly D. Goldphish: using images for content-based phishing analysis. In: 2010 Fifth international conference on internet monitoring and protection. 2010. pp. 123–128. https://doi.org/10.1109/ICIMP.2010.24.
20. Durham V. Namecoin. 2011. https://namecoin.info. Accessed Sept 2018.
21. Earthlink: Spam Blocker. 1994. http://www.earthlink.net/elink/issue95/security_archive.html. Accessed Oct 2018.
22. eBay Toolbar and Account Guard. http://pages.ebay.in/help/account/toolbar-account-guard.html. Accessed 5 Oct 2018.
23. Fette I, Sadeh N, Tomasic A. Learning to detect phishing emails. In: Proceedings of the 16th international conference on World Wide Web. ACM; 2007. pp. 649–656. https://doi.org/10.1145/1242572.1242660. Accessed May 2019.
24. Firefox: Verisign for firefox. 2007. https://addons.mozilla.org/en-US/firefox/addon/verisignev-green-bar-extensio/. Accessed Aug 2018.
25. Gastellier-Prevost S, Granadillo GG, Laurent M. Decisive heuristics to differentiate legitimate from phishing sites. In: 2011 Conference on Network and Information Systems Security. IEEE; 2011. pp. 1–9. https://doi.org/10.1109/SAR-SSI.2011.5931389.
26. Gastellier-Prevost S, Granadillo GG, Laurent M. A dual approach to detect pharming attacks at the client-side. In: New technologies, mobility and security (NTMS), 2011 4th IFIP international conference on. IEEE. 2011. pp. 1–5.
27. GeoTrust: TrustWatch Toolbar. https://www.geotrust.com/comcasttoolbar/. Accessed Nov 2018.
28. Gupta BB, Tewari A, Jain AK, Agrawal DP. Fighting against phishing attacks: state of the art and future challenges. Neural Comput Appl. 2017;28(12):3629–54.
29. Hajgude J, Ragha L. Phish mail guard: phishing mail detection technique by using textual and url analysis. In: 2012 World congress on information and communication technologies. 2012. pp. 297–302. https://doi.org/10.1109/WICT.2012.6409092.
30. Herzberg A, Jbara A. Security and identification indicators for browsers against spoofing and phishing attacks. ACM Trans Internet Technol. 2008;8(4):1–36. https://doi.org/10.1145/1391949.1391950.
31. Jagatic TN, Johnson NA, Jakobsson M, Menczer F. Social phishing. Commun ACM. 2007;50(10):94–100.
32. James D, Philip M. A novel anti phishing framework based on visual cryptography. In: 2012 International conference on power, signals, controls and computation. 2012. pp. 1–5. https://doi.org/10.1109/EPSCICON.2012.6175228.
33. Jeeva SC, Rajsingh EB. Intelligent phishing URL detection using association rule mining. Hum Centric Comput Inf Sci. 2016;6(1):10.
34. Laorden C, Ugarte-Pedrero X, Santos I, Sanz B, Bringas PG. Enhancing scalability in anomaly-based email spam filtering. In: Proceedings of the 8th annual collaboration, electronic messaging, anti-abuse and spam conference. CEAS ’11, ACM, New York, NY, USA, 2011. pp. 13–22. https://doi.org/10.1145/2030376.2030378.
35. Li Y, Xiao R, Feng J, Zhao L. A semi-supervised learning approach for detection of phishing webpages. Optik Int J Light Electron Opt. 2013;124(23):6027–33.
36. Li Y, Yang L, Ding J. A minimum enclosing ball-based support vector machine approach for detection of phishing websites. Optik Int J Light Electron Opt. 2016;127(1):345–51.
37. Likarish P, Jung E, Dunbar D, Hansen TE, Hourcade JP. B-apt: Bayesian anti-phishing toolbar. In: Communications, 2008. ICC’08. IEEE international conference on. IEEE. 2008. pp. 1745–1749.
38. Ma L, Ofoghi B, Watters P, Brown S. Detecting phishing emails using hybrid features. In: Ubiquitous, autonomic and trusted computing, 2009. UIC-ATC’09. Symposia and workshops on. IEEE. 2009. pp. 493–497.
39. Majorgeeks: SpoofStick. 2004. http://www.majorgeeks.com/files/details/spoofstick_for_internet_explorer.html. Accessed Nov 2018.
40. Microsoft: Microsoft Smart Screen Filter. https://support.microsoft.com/en-in/help/17443/windows-internet-explorer-smartscreen-filter-faq. Accessed Oct 2018.
41. Mishra M, Jain A. Anti-phishing techniques: a review. 2012;2(2):350–5.
42. Mohammad RM, Thabtah F, McCluskey L. Intelligent rule-based phishing websites classification. IET Inf Secur. 2014;8(3):153–60.
43. MYCERT: About DontPhishMe toolbar. 2010. http://www.brothersoft.com/dontphishme-390951.html. Accessed Dec 2018.
44. Netcraft: Netcraft Toolbar. 2004. http://toolbar.netcraft.com/. Accessed Dec 2018.
45. Purkait S. Phishing counter measures and their effectiveness-literature review. Inf Manag Comput Secur. 2012;20(5):382–420.
46. Purkait S. Preventing phishing attacks with virtual browser extension. IUP J Inf Technol. 2013;9(3):7.
47. Raffetseder T, Kirda E, Kruegel C. Building anti-phishing browser plug-ins: an experience report. In: Proceedings of the third international workshop on software engineering for secure systems. IEEE Computer Society. 2007. p. 6.
48. Rathore S, Loia V, Park JH. Spamspotter: an efficient spammer detection framework based on intelligent decision support system on facebook. Appl Soft Comput. 2018;67:920–32. https://doi.org/10.1016/j.asoc.2017.09.032.
49. Rathore S, Sangaiah AK, Park JH. A novel framework for internet of knowledge protection in social networking services. J Comput Sci. 2018;26:55–65. https://doi.org/10.1016/j.jocs.2017.12.010.
50. Ronda T, Saroiu S, Wolman A. iTrustPage: pretty good phishing protection. Toronto: University of Toronto; 2007.
51. Rosiello APE, Kirda E, Kruegel C, Ferrandi F. A layout-similarity-based approach for detecting phishing pages. In: 2007 third international conference on security and privacy in communications networks and the workshops—SecureComm 2007. 2007. pp. 454–463. https://doi.org/10.1109/SECCOM.2007.4550367.
52. Rosiello A. Anti-phishing security strategy. Black Hat Briefing. 2008. pp. 1–31. https://www.blackhat.com/presentations/bh-europe-08/Rosiello/Presentation/bh-eu-08-rosiello.pdf.
53. Ross B, Jackson C, Miyake N, Boneh D, Mitchell JC. Stronger password authentication using browser extensions. In: Proceedings of the 14th conference on USENIX security symposium—vol. 14. SSYM’05, USENIX Association, Berkeley, USA. 2005. pp. 2–2. http://dl.acm.org/citation.cfm?id=1251398.1251400. Accessed Oct 2018.
54. San Martino A, Perramon X. A model for securing e-banking authentication process: antiphishing approach. In: Services-part I, 2008. IEEE Congress on. IEEE. 2008. pp. 251–254.
55. Sharifi M, Siadati SH. A phishing sites blacklist generator. In: 2008 IEEE/ACS international conference on computer systems and applications. 2008. pp. 840–843. https://doi.org/10.1109/AICCSA.2008.4493625.
56. Singh AP, Kumar V, Sengar SS, Wairiya Manoj EVV, Thomas G, Lumban Gaol F. Detection and prevention of phishing attack using dynamic watermarking. In: Information technology and mobile communication. Berlin: Springer; 2011. pp. 132–137.
57. SiteAdvisor: McAfee SiteAdvisor. 2006. https://en.wikipedia.org/wiki/McAfee_SiteAdvisor. Accessed July 2018.
58. Smadi S, Aslam N, Zhang L. Detection of online phishing email using dynamic evolving neural network based on reinforcement learning. Decis Support Syst. 2018;107:88–102. https://doi.org/10.1016/j.dss.2018.01.001.
59. Sonowal G, Kuppusamy K. Phidma—a phishing detection model with multi-filter approach. J King Saud Univ Comput Inf Sci. 2017. https://doi.org/10.1016/j.jksuci.2017.07.005.
60. Sureshkumar A, Palanisamy S, Sowmiya RAS. Data isolation and protection in online social networks. In: 2013 International conference on information communication and embedded systems (ICICES). 2013. pp. 150–155. https://doi.org/10.1109/ICICES.2013.6508228.
61. Thiyagarajan P, Mahindra VPV. Pixastic: steganography based anti-phishing browser plug-in. J Internet Bank Commerce. 2012;17(1):1–19.
62. Ulevitch D. PhishTank site checker. 2006. https://addons.mozilla.org/en-US/firefox/addon/phishtank-sitechecker/.
63. Naresh U. Intelligent phishing website detection and prevention system by using link guard algorithm. IOSR J Comput Eng IOSR-JCE. 2013;14(3):28–36.
64. Vaishnaw N, Tandan SR. A bird’s eye view of anti-phishing techniques for classification of phishing e-mails. Int J Res Appl Sci Eng Technol. 2015;3(6):263–75.
65. Vishwanath A. Getting phished on social media. Decis Support Syst. 2017;103:70–81. https://doi.org/10.1016/j.dss.2017.09.004.
66. Wang R, Zhu Y, Tan J, Zhou B. Detection of malicious web pages based on hybrid analysis. J Inf Secur Appl. 2017;35:68–74.
67. Yee KP, Sitaker K. Passpet: convenient password management and phishing protection. In: Proceedings of the second symposium on Usable privacy and security. ACM. 2006. pp. 32–43.
68. Ying P, Xuhua D. Anomaly based web phishing page detection. In: 2006 22nd Annual Computer Security Applications Conference (ACSAC’06). IEEE, 2006. pp. 381–390. https://doi.org/10.1109/ACSAC.2006.13.
69. Yue C, Wang H. Bogusbiter: a transparent protection against phishing attacks. ACM Trans Internet Technol (TOIT). 2010;10(2):6.
70. Zahid T. An anti-phishing tool: Phishproof. Ph.D. thesis, University of Manchester. 2012.
71. Zhang H, Liu G, Chow TW, Liu W. Textual and visual content-based anti-phishing: a Bayesian approach. IEEE Trans Neural Netw. 2011;22(10):1532–46. https://doi.org/10.1109/TNN.2011.2161999.
72. Zhang Y, Egelman S, Cranor LF, Hong J. Phinding phish: evaluating anti-phishing tools. 2006.
73. Zhang Y, Hong JI, Cranor LF. Cantina: A content-based approach to detecting phishing web sites. In: Proceedings of the 16th international conference on world wide web. WWW ’07, ACM, New York, NY, USA, 2007. pp. 639–648. https://doi.org/10.1145/1242572.1242659.
74. Zhang N, Yuan Y. Phishing detection using neural network—CS229 lecture notes. 2012.
75. Zhou Y, Zhang Y, Xiao J, Wang Y, Lin W. Visual similarity based anti-phishing with the combination of local and global features. In: Proceedings—2014 IEEE 13th international conference on trust, security and privacy in computing and communications, TrustCom 2014, 2014. pp. 189–196. https://doi.org/10.1109/TrustCom.2014.28.
76. Zimmermann P. Web of trust (WOT). 1992. https://www.mywot.com/en/aboutus. Accessed May 2018.
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.