
Using Internet Data Sets to Understand Digital Threats

CONTENTS

EXECUTIVE SUMMARY

ACTIONS LEAVE BREADCRUMBS. MAKE SURE TO FOLLOW THEM

INFRASTRUCTURE CHAINING

INTERNET DATA SETS

• PASSIVE DNS

• WHOIS

• SSL CERTIFICATES

• ANALYTICAL TRACKERS

• HOST SEQUENCE PAIRS

• WEB COMPONENTS

• OPEN SOURCE INTELLIGENCE (OSINT)

• COOKIES

ABOUT PASSIVETOTAL


EXECUTIVE SUMMARY

As businesses adapt to the rapidly changing digital landscape, more customer and business operations are shifting from behind the protection of firewalls to the open internet. This new level of exposure makes your company, customers, and prospects vulnerable to extremely skilled, persistent threats across the web, mobile, social, and email. However, the good news for threat hunters and defenders is that data exists to help expose the infrastructure being used by attackers, which allows them to find, block, and prevent attacks.

Internet data can be sorted, classified, and monitored over time to provide a complete picture of your attackers and their evolving techniques. Infrastructure chaining leverages the relationships between these highly connected data sets to build out a thorough investigation. This process is the core of Threat Infrastructure Analysis and allows organizations to surface new connections, group similar attack activity, and substantiate assumptions during incident response.

In this white paper, we’ll explore the data sets available to security professionals and how to use them to proactively protect your organization’s digital presence.


ACTIONS LEAVE BREADCRUMBS. MAKE SURE TO FOLLOW THEM

One byproduct of using the internet is the creation of signals, or breadcrumbs. Whether or not an action is malicious, it will likely leave impressions that, if observed, can be used to correlate activity.

As an example, consider the series of actions that take place when sending an email from one person to another. At a high level, sending an email requires two people to have valid email addresses with registered service providers. Those providers need servers in place to host the email applications and a way to transmit data from one end of the action to the other, which often requires routers and switches. The act of sending an email generates hundreds, if not thousands, of breadcrumbs online across various systems, services, and networks.

Now, imagine you are a malicious actor looking to steal financial data from some people at a bank. If your goal is to compromise specific users within the organization, then sending a phishing email is a viable avenue of attack. As a first step, you would scan for information about the financial institution and identify the individuals you want to target. After finding a few, you would search for their work history or any public personal data you can use to engage them. Then, you would register a legitimate-looking business domain, set up an email address, and craft a convincing email message directing the user to a page meant to steal their credentials. Upon successfully phishing the users, you would start a VPN, authenticate to the financial provider’s network, and take as much data as you can.

This example is something that occurs on a regular basis across the internet. Users get phished for credentials, their workplace accounts are compromised, and data is stolen from under their noses. But, from an analyst’s perspective, all those actions taken by the malicious actor reveal breadcrumbs that can be used to trace their activity.

To begin investigations, analysts must ask a few key questions:

• What IP address was used to perform the initial target research?

• Were there any services the actor used to perform research that required having an account?

• What information was used when registering the domain?

• What server did the phishing email come from?

• Who was the hosting provider for the credential phishing?

Normally, questions like these would go unanswered without the proper data, but modern technology has afforded the ability to scan the internet in short periods of time to collect many of these impressions. RiskIQ®’s extensive global network of egress points deploys virtual users (web crawlers) to visit web pages and save the content of the visit to collect the data necessary to answer these questions. Collecting this data at scale opens up the opportunity to begin surfacing the breadcrumbs of malicious actors in places where they can’t hide.
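One of the questions above, which server a phishing email came from, can often be approached from the message itself. The following is a minimal sketch using only the Python standard library. The sample message, host names, and addresses are invented for illustration, and the function name `received_hops` is hypothetical. Received headers added before the message reached a trusted server can be forged, so the result is a lead to verify rather than proof.

```python
import re
from email import message_from_string

# Invented sample message; all hosts and addresses are placeholders.
RAW_MESSAGE = """\
Received: from mail.example.net (mail.example.net [203.0.113.7])
    by mx.bank.example with ESMTP id abc123; Mon, 6 Mar 2017 10:15:02 +0000
Received: from sender-host.example (unknown [198.51.100.23])
    by mail.example.net with ESMTP id def456; Mon, 6 Mar 2017 10:15:00 +0000
From: billing@legit-looking.example
To: victim@bank.example
Subject: Action required

Please sign in to review your account.
"""

# Matches an IPv4 address written in square brackets, as many mail servers do.
IP_IN_BRACKETS = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def received_hops(raw_message):
    """Return the bracketed IP of each Received header, origin first."""
    message = message_from_string(raw_message)
    hops = []
    for header in message.get_all("Received", []):
        match = IP_IN_BRACKETS.search(header)
        hops.append(match.group(1) if match else None)
    # Each server prepends its own header, so the last one is the earliest hop.
    return list(reversed(hops))


if __name__ == "__main__":
    for position, ip in enumerate(received_hops(RAW_MESSAGE), start=1):
        print(position, ip)
```

Each address returned this way is a breadcrumb that can be looked up in the data sets described in the rest of this paper.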

INFRASTRUCTURE CHAINING

Having a database of breadcrumbs from interactions on the internet lends itself well to identifying relationships between disparate entities. Infrastructure chaining, a powerful methodology leveraged by analysts, uses the highly connected nature of these internet breadcrumbs to expand one indicator into many based on overlapping details or shared characteristics. Building infrastructure chains also allows analysts to quickly build context around an incident or investigation, allowing for more effective triage of alerts and actioning of incidents within an organization.

Starting with a single point, analysts can look at any connected data sets to find more indicators. As the analyst continues to branch out at each stage, they form links back to the original starting point. Not only does this process expand the scope of the investigation and potentially find more malicious content, but it is also self-documenting, because any other analyst can see how connections were made from one data set to the next.

Since 2009, RiskIQ has been collecting data from web pages, global sensors, and a robust proxy network to provide analysts with a look back in time at how given parts of the internet once appeared. PassiveTotal supports the concept of infrastructure chaining and brings all RiskIQ data into a single platform, so analysts can spend their time focusing on threats to their organizations and not on data collection and processing.

INTERNET DATA SETS

Internet data sets are best defined as any source of content that describes a portion of or process on the internet. These data sets are vast and ever-changing, but provide immense insight into actions taken by malicious actors and give defenders an advantage when combating attacks. The following data sets are tracked and exposed through RiskIQ PassiveTotal™.

Figure: RiskIQ PassiveTotal user interface
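Each of the data sets below can serve as a pivot source for infrastructure chaining. As a minimal sketch of the idea, and not a description of how PassiveTotal is implemented, the expansion from one indicator to many can be modeled as a breadth-first walk that records which data set produced each link. The `PIVOTS` table, its indicator values, and the function names are all hypothetical.

```python
from collections import deque

# Hypothetical pivot results: indicator -> [(data_set, related_indicator), ...]
PIVOTS = {
    "phish-login.example": [
        ("passive_dns", "203.0.113.7"),
        ("whois_email", "reg@mail.example"),
    ],
    "203.0.113.7": [("passive_dns", "cdn-update.example")],
    "reg@mail.example": [("whois_email", "secure-portal.example")],
    "cdn-update.example": [("ssl_certificate", "cert:ab12cd34")],
}


def build_chain(seed, pivots, max_depth=2):
    """Expand a seed indicator breadth-first.

    Returns {indicator: (parent, data_set, depth)} so every result
    keeps a link back toward the starting point.
    """
    found = {seed: (None, None, 0)}
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        depth = found[current][2]
        if depth >= max_depth:
            continue  # stop branching once the depth limit is reached
        for data_set, related in pivots.get(current, []):
            if related not in found:
                found[related] = (current, data_set, depth + 1)
                queue.append(related)
    return found


def explain(indicator, found):
    """List the pivots that connect the seed to the given indicator."""
    steps = []
    while found[indicator][0] is not None:
        parent, data_set, _depth = found[indicator]
        steps.append(f"{parent} --{data_set}--> {indicator}")
        indicator = parent
    return list(reversed(steps))


if __name__ == "__main__":
    chain = build_chain("phish-login.example", PIVOTS)
    for step in explain("cdn-update.example", chain):
        print(step)
```

Storing the parent and data set for every discovered indicator is what makes the chain self-documenting: another analyst can replay `explain` to see how each connection was made. The depth limit is an assumption added here to keep the expansion from wandering into unrelated shared infrastructure.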


PASSIVE DNS

Passive DNS is a system of record that stores DNS resolution data for a given location, record, and time period. This historical resolution data set allows analysts to view which domains resolved to an IP address and vice versa. This data set allows for time-based correlation based on domain or IP overlap.

What to Look For

• Historical repository of domains and IP addresses that could show overlap between values

• Provides a method to get second-order domains and IP addresses that may be related to your original query

• Identifies subdomains associated with a particular query, potentially revealing target details or more suspect infrastructure

Questions to Ask

• Do the passive DNS results line up with the period I am interested in?

  ￮ Infrastructure like domain names and IP addresses may trade hands or be assigned to new customers by service providers over time. Beginning analysis with a known time frame of interest can aid in narrowing down what may be larger data sets spanning years of unrelated activity, allowing analysts to pinpoint specific attacker activity.

• Are there other data points (WHOIS, SSL Certificates, Host Pairs, etc.) that could be used to improve a connection point?

  ￮ Observations based on one type of data, such as infrastructure relationships, are sometimes sufficient on their own, but many observations can be strengthened and confirmed when combined with other supporting data. Supporting data points may also reveal other avenues of investigation that analysts were not initially aware of at the beginning of their research.

• Has the domain or IP address had a lot of changes over time?

• Does it appear like the domain or IP address is part of a shared hosting network?
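The first question above, whether the results line up with the period of interest, comes down to an interval-overlap check. The sketch below assumes passive DNS records are available as simple tuples with first-seen and last-seen dates. The record layout, sample values, and function name are hypothetical and do not reflect any particular product's export format.

```python
from datetime import date

# Hypothetical records: (domain, ip, first_seen, last_seen)
RESOLUTIONS = [
    ("phish-login.example", "203.0.113.7", date(2017, 1, 10), date(2017, 2, 20)),
    ("old-tenant.example", "203.0.113.7", date(2014, 5, 1), date(2015, 3, 15)),
    ("cdn-update.example", "203.0.113.7", date(2017, 2, 1), date(2017, 3, 5)),
]


def domains_in_window(records, ip, start, end):
    """Return domains that resolved to `ip` at some point within [start, end]."""
    return sorted(
        domain
        for domain, record_ip, first_seen, last_seen in records
        # Two date ranges overlap when each one starts before the other ends.
        if record_ip == ip and first_seen <= end and last_seen >= start
    )


if __name__ == "__main__":
    window = (date(2017, 2, 1), date(2017, 2, 28))
    print(domains_in_window(RESOLUTIONS, "203.0.113.7", *window))
```

Here the 2014 to 2015 tenant of the same IP address drops out, which is the narrowing effect described above when infrastructure has changed hands over time.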

WHOIS

WHOIS data, an internet database of ownership information about a domain, IP address, or subnet, can give an organization insight into those behind an attack campaign. WHOIS data helps determine the maliciousness of a given domain or IP address based on ownership records. Using domain registration information, an organization can unmask an attacker’s infrastructure by linking a suspicious domain to other domains registered using the same or similar information.

What to Look For

• Allows for attack timeline analysis based on domain registration and update or expiration time periods

• Leverage history (hosting/record) to identify trends or specific patterns in data for a given owner or set of owners

• Use the content of the various WHOIS fields to find other records that share similar patterns or exact values

Questions to Ask

• How long has the domain been registered or owned?

  ￮ New or recently created domains may help confirm suspicions about malicious activity, as many domains are registered shortly before staging an attack. This may indicate dedicated attacker domain ownership and can strengthen an observation’s value as an indicator of compromise. On the other hand, domains with older registration dates may indicate the use of compromised hosts, hijacked domains, or purchase of older domains from a reseller service.

• Is the WHOIS record privacy protected or using a third-party provider’s information to obscure the real identity of the registrant?

• Does any data supplied by the user appear to be unique (e.g., spelling errors, strange names, conventions observed across multiple domains, etc.)?

  ￮ Even if a set of domains does not share obvious commonalities such as the domain registrant name or email address, relationships can often be established based on multiple matching attributes such as domain registrar, nameserver domains (and second-order nameserver domain ownership), registration timeframe, and registrant contact email domains.

• Do the nameservers listed on the WHOIS record appear unique or reveal any additional infrastructure that may be related?

• Does the WHOIS record have any history associated with it? If so, how long and what information has changed?
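Two of the checks above lend themselves to a short sketch: grouping domains by a shared WHOIS field and flagging recent registrations. The example assumes WHOIS records have already been parsed into dictionaries. The field names, sample records, 30-day threshold, and function names are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical, already-parsed WHOIS records.
WHOIS_RECORDS = [
    {"domain": "secure-portal.example", "email": "reg@mail.example",
     "nameserver": "ns1.dns-host.example", "created": date(2017, 2, 25)},
    {"domain": "phish-login.example", "email": "reg@mail.example",
     "nameserver": "ns1.dns-host.example", "created": date(2017, 2, 27)},
    {"domain": "longtime-shop.example", "email": "owner@shop.example",
     "nameserver": "ns1.dns-host.example", "created": date(2009, 6, 1)},
]


def shared_values(records, field):
    """Map each value of `field` used by more than one domain to those domains."""
    groups = defaultdict(list)
    for record in records:
        groups[record[field]].append(record["domain"])
    return {value: domains for value, domains in groups.items() if len(domains) > 1}


def recently_registered(records, as_of, max_age_days=30):
    """Return domains created within `max_age_days` before `as_of`."""
    return [
        record["domain"]
        for record in records
        if 0 <= (as_of - record["created"]).days <= max_age_days
    ]


if __name__ == "__main__":
    print(shared_values(WHOIS_RECORDS, "email"))
    print(recently_registered(WHOIS_RECORDS, as_of=date(2017, 3, 6)))
```

In this sample, the shared nameserver groups all three domains, including the long-established shop. A single matching attribute from a common provider is weak evidence on its own, which is why the text above recommends combining several matching attributes.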

SSL CERTIFICATES

SSL certificates are cryptographically generated files used to provide an encrypted channel between a client and a server. Delegated authorities are in charge of validating organization details, issuing certificates, and maintaining the health of certificates issued. SSL certificates are most often associated with publicly facing websites and, for them to function, need to be accessible via a routable internet address. Beyond securing your data, certificates are a great way for analysts to connect disparate malicious network infrastructure through identifying overlapping usage of IP addresses. Actors often use the same certificate across multiple attack campaigns to encrypt command and control communication or make a malicious website look legitimate.

What to Look For

• Identifies additional infrastructure based on a shared certificate or infrastructure that was used to host the certificate

• May identify connections when WHOIS or DNS data come up with nothing

• Data within the certificate may overlap with other certificates, revealing more infrastructure

Questions to Ask

• Is the SSL certificate valid (i.e., not expired, not self-signed) and issued by a reputable provider?

• Does the SSL certificate belong to a content provider or content distribution network?

• Does any of the user-supplied data in the certificate appear unique?

• Do any of the details in the certificate reveal WHOIS or passive DNS leads?

• Has the certificate been hosted on more than one IP address or has it been moved from or shared between multiple servers?
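The last question above, whether a certificate has appeared on more than one IP address, can be sketched by fingerprinting each observed certificate and grouping hosts by fingerprint. The SHA-1 hash of the DER-encoded certificate is a commonly used identifier. The observations below use placeholder byte strings in place of real certificates, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict

# Hypothetical observations: (ip, DER-encoded certificate bytes).
# Placeholder bytes stand in for real certificates in this sketch.
OBSERVATIONS = [
    ("203.0.113.7", b"placeholder-certificate-A"),
    ("198.51.100.23", b"placeholder-certificate-A"),
    ("192.0.2.44", b"placeholder-certificate-B"),
]


def fingerprint(der_bytes):
    """Return the SHA-1 fingerprint of a DER-encoded certificate as hex."""
    return hashlib.sha1(der_bytes).hexdigest()


def shared_certificates(observations):
    """Map each fingerprint seen on more than one IP to the sorted IP list."""
    hosts = defaultdict(set)
    for ip, der_bytes in observations:
        hosts[fingerprint(der_bytes)].add(ip)
    return {fp: sorted(ips) for fp, ips in hosts.items() if len(ips) > 1}


if __name__ == "__main__":
    for fp, ips in shared_certificates(OBSERVATIONS).items():
        print(fp, ips)
```

For a live host, the DER bytes could be obtained with the standard library's `ssl.get_server_certificate` followed by `ssl.PEM_cert_to_DER_cert`. Certificates belonging to content distribution networks will naturally appear on many addresses, so a shared fingerprint should be weighed against the second question in the list above.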

ANALYTICAL TRACKERS

Trackers are unique codes or values found within web pages and often are used to track user interaction. Website operators that maintain multiple sites may utilize the same analytic service account across all sites, and these trackers can be used to correlate a disparate group of websites to a central entity. PassiveTotal’s tracker data set includes IDs from providers like Google, Yandex, Mixpanel, New Relic, and Clicky, and is continuing to grow.

What to Look For

• Some analytical providers will share a top-level account and create subkeys for different web properties that could reveal a single source owner

• Copied web pages often contain analytical codes that may not be scrubbed and could be used in conjunction with other data sets to identify abuse

• May identify overlapping connections between domains that would normally go undetected or unlinked due to dissimilar content

Questions to Ask

• Do the analytical tracker codes appear to be unique and owned by a legitimate brand?

• Are malicious actors making use of any specific tracking providers in their pages?

• Is there a high degree of overlap between unrelated websites?

• What other web pages have made use of a given tracker?

• Is there any relationship between multiple analytics accounts?
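As an illustration of the account and subkey relationship described above, classic Google Analytics IDs take the form `UA-<account>-<property>`, so sites that share the account number can be grouped even when their property suffixes differ. The sketch below extracts such IDs from page source with a regular expression. The page contents, site names, and function name are invented for the example, and other providers use different ID formats that this pattern does not cover.

```python
import re
from collections import defaultdict

# Classic Google Analytics ID: UA-<account number>-<property index>.
GA_ID = re.compile(r"\bUA-(\d{4,10})-(\d{1,4})\b")

# Hypothetical crawled page source keyed by site.
PAGES = {
    "brand.example": "<script>ga('create', 'UA-1234567-1', 'auto');</script>",
    "brand-login.example": "<script>ga('create', 'UA-1234567-1', 'auto');</script>",
    "brand-shop.example": "<script>ga('create', 'UA-1234567-2', 'auto');</script>",
    "unrelated.example": "<script>ga('create', 'UA-7654321-1', 'auto');</script>",
}


def sites_by_account(pages):
    """Group sites by the account portion of any tracker ID found in them."""
    accounts = defaultdict(set)
    for site, html in pages.items():
        for account, _property_index in GA_ID.findall(html):
            accounts[account].add(site)
    return {account: sorted(sites) for account, sites in accounts.items()}


if __name__ == "__main__":
    for account, sites in sites_by_account(PAGES).items():
        print(account, sites)
```

In this sample, `brand-login.example` reuses the exact ID of `brand.example`, which is the pattern a copied page with an unscrubbed tracker would produce. Whether that site is legitimate or abusive still has to be established from other data sets.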

HOST PAIRS

Host pairs are two domains (a parent and a child) that shared a connection observed from a RiskIQ web crawl. The connection could range from a top-level redirect (HTTP 302) to something more complex, like an iframe or script source reference. Each connection has a first time and last time observed that helps to establish a time period for the pair of web properties.

What to Look For

• Shows a natural sequence chain for a given set of websites that can be walked up or down to surface new hosts

• Preserves a given relationship between two web properties that may no longer exist or remain valid

• Context associated with the host pair dictates the nature of the relationship and could provide insight into the attack

Questions to Ask

• Are there any additional suspicious hosts redirecting or being redirected from the indicator being investigated?

• Are there any patterns with the redirections taking place?

• Does the number of pairs for a given property help determine whether it is malicious or non-malicious?
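Walking a sequence chain "up" from a suspicious host, as described above, amounts to repeatedly looking up parents in the set of observed pairs. The following sketch assumes host pairs are available as (parent, child, cause) tuples. The sample pairs, cause labels, and function names are hypothetical.

```python
# Hypothetical host pairs: (parent, child, cause of the connection)
HOST_PAIRS = [
    ("news-site.example", "ad-broker.example", "script.src"),
    ("ad-broker.example", "traffic-gate.example", "redirect"),
    ("traffic-gate.example", "exploit-landing.example", "iframe.src"),
    ("blog.example", "traffic-gate.example", "redirect"),
]


def parents_of(host, pairs):
    """Return hosts observed as a direct parent of `host`."""
    return sorted({parent for parent, child, _cause in pairs if child == host})


def walk_up(host, pairs):
    """Return every host that leads to `host` through one or more pairs."""
    seen = set()
    to_visit = [host]
    while to_visit:
        current = to_visit.pop()
        for parent in parents_of(current, pairs):
            if parent not in seen:  # also guards against redirect loops
                seen.add(parent)
                to_visit.append(parent)
    return sorted(seen)


if __name__ == "__main__":
    print(parents_of("exploit-landing.example", HOST_PAIRS))
    print(walk_up("exploit-landing.example", HOST_PAIRS))
```

Walking down works the same way with parent and child swapped. In this sample, two otherwise unrelated sites both lead to `traffic-gate.example`, the kind of redirection pattern the questions above ask the analyst to look for.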

WEB COMPONENTS

Web components are details describing a web page or server infrastructure gleaned from performing a web crawl using RiskIQ technology. These components provide analysts with a high-level understanding of what was used to host the page and what technologies may have been loaded at the specific time of the crawl. When possible, we attempt to categorize the specific components and include version numbers.

What to Look For

• Provide insight into a given server infrastructure or web page, content of which may have changed or may not exist anymore

• Specific versions or technologies used could be combined to form a component signature about an attacker’s operations

• Observation time periods are a good way to establish when an operation may have first been set up or gone live for an attack

Questions to Ask

• Are there any unique or less-popular technologies used that could indicate a favored platform for attackers?

• Are any of the technologies used vulnerable to attack via live exploits or known bugs?

• Is there any sort of technology that could be leveraged to obfuscate or hide elements of an attack?
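The "component signature" idea above can be sketched by treating the set of (category, name, version) values observed on a host as a single comparable key. The component data, version numbers, and function names below are invented for illustration, and an exact-match signature is only one possible definition.

```python
from collections import defaultdict

# Hypothetical crawl results: host -> [(category, name, version), ...]
COMPONENTS = {
    "phish-login.example": [("Server", "nginx", "1.10.1"), ("Framework", "PHP", "5.4.45")],
    "secure-portal.example": [("Framework", "PHP", "5.4.45"), ("Server", "nginx", "1.10.1")],
    "longtime-shop.example": [("Server", "Apache", "2.4.18"), ("Framework", "PHP", "7.0.4")],
}


def signature(components):
    """Build an order-independent signature from a host's components."""
    return frozenset(components)


def hosts_sharing_signature(observed):
    """Return groups of hosts whose component signatures match exactly."""
    groups = defaultdict(list)
    for host, components in observed.items():
        groups[signature(components)].append(host)
    return [sorted(hosts) for hosts in groups.values() if len(hosts) > 1]


if __name__ == "__main__":
    print(hosts_sharing_signature(COMPONENTS))
```

Common stacks will match across many unrelated hosts, so a signature like this is most useful when it includes an unusual technology or version, or when it is combined with overlaps from the other data sets.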

OPEN SOURCE INTELLIGENCE (OSINT)

Open source intelligence (OSINT) is data that can be found publicly online and is freely available for use inside your organization. This data is often produced by individuals or companies and is either given away in the form of marketing material or shared amongst other companies as a source of goodwill for defenders. However, while great content can easily be found online, it may not be a full replacement for paid intelligence services. Some OSINT may draw incorrect conclusions or could be missing significant analysis, so any data collected should be processed before being applied within your organization.

What to Look For

• Provides additional context to indicators that may be linked to your original query

• Aids analysts in discovering a larger narrative around the threat

• Could help an analyst find malware or other artifacts

• Shows third-party perspectives and could be used to begin a conversation with another organization

Questions to Ask

• How does the indicator I am interested in relate to the OSINT?

• Are the OSINT claims backed up using data?

• Is the OSINT provided by an individual, trusted group, or larger organization?

• Does there appear to be any misleading material in the OSINT?


COOKIES

Web cookies are small pieces of data passed from the server to the client during web browsing. These values are associated with the domain being viewed and can be used to keep track of state or other information the server may use. Cookies can be marked with a Secure flag so that they are only sent over encrypted (HTTPS) connections, and they are restricted to specific domains to ensure a level of security.

What to Look For

• Named services or additional indicators that could be derived from the cookie name or path associated with the cookie

• Number of results, low or high, from pivoting on the cookie name or path

• Time period of when the cookie was observed being associated with the clients

• Whether or not cookies are associated with the indicator being searched

Questions to Ask

• Does the cookie name appear unique?

• Does the cookie path match the value of the indicator being viewed?

• Is there a low frequency of shared items based on the cookie name?

• Does the cookie name reveal any additional infrastructure, services, or indicators?
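Pivoting on a cookie name and checking how many hosts share it can be sketched with the standard library's cookie parser. The observed `Set-Cookie` values, host names, the two-host threshold, and the function names below are assumptions made for illustration.

```python
from collections import defaultdict
from http.cookies import SimpleCookie

# Hypothetical observations: (host, Set-Cookie header value)
OBSERVED = [
    ("phish-login.example", "kit_sess_x9=abc123; Path=/panel"),
    ("secure-portal.example", "kit_sess_x9=def456; Path=/panel"),
    ("longtime-shop.example", "PHPSESSID=ghi789; Path=/"),
    ("blog.example", "PHPSESSID=jkl012; Path=/"),
    ("news-site.example", "PHPSESSID=mno345; Path=/"),
]


def hosts_by_cookie_name(observed):
    """Map each cookie name to the set of hosts that were seen setting it."""
    hosts = defaultdict(set)
    for host, header in observed:
        jar = SimpleCookie()
        jar.load(header)  # parses the cookie name, value, and attributes
        for name in jar:
            hosts[name].add(host)
    return hosts


def rare_cookie_names(observed, max_hosts=2):
    """Return cookie names seen on at most `max_hosts` hosts."""
    return {
        name: sorted(hosts)
        for name, hosts in hosts_by_cookie_name(observed).items()
        if len(hosts) <= max_hosts
    }


if __name__ == "__main__":
    print(rare_cookie_names(OBSERVED))
```

A framework default such as `PHPSESSID` appears on many unrelated hosts and says little, while a name shared by only a handful of hosts is a stronger lead, which is the low-frequency question raised above.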

ABOUT PASSIVETOTAL

PassiveTotal is a platform for security operations and analysis teams to simplify and accelerate the investigation processes for events, threats, and attacks. For example, any attack conducted via the internet is likely to leave behind traces or fingerprints.

The PassiveTotal platform automatically aggregates and correlates the most comprehensive internet data sets available, including passive DNS, email, SSL certificates, host pairs, web trackers, and WHOIS data to deliver insights about the ownership, use, and activity of specific assets involved in an event or attack, as well as show related assets that might otherwise go unknown. These insights allow security teams to investigate, respond to, and eliminate threats.

PassiveTotal leverages the RiskIQ crawl data set to provide deeper contextual understanding of assets through our virtual user technology. Understanding and interacting with web assets as a real user would experience them provides an unparalleled view into techniques employed by attackers.

PassiveTotal uses these vast data sets and predictive analytics to automate investigative processes and keep pace with the shifting threat landscape. Pivot between data relationships to gain the offensive edge against an attacker by preventing their next move.

Community Editions of RiskIQ products, including PassiveTotal, are now available. Cyberthreat hunters and defenders can sign up for RiskIQ Community for free to get started on the path to superior discovery, investigation, and research of threats. All RiskIQ products utilize the data sets that we discuss in this white paper.

ABOUT RISKIQ

RiskIQ is the leader in digital threat management, providing the most comprehensive discovery, intelligence, and mitigation of threats associated with an organization’s digital presence. With more than 75 percent of attacks originating outside the firewall, RiskIQ allows enterprises to gain unified insight and control over web, social, and mobile exposures. Trusted by thousands of security analysts, RiskIQ’s platform combines advanced internet data reconnaissance and analytics to expedite investigations, understand digital attack surfaces, assess risk, and take action to protect business, brand, and customers. Based in San Francisco, the company is backed by Summit Partners, Battery Ventures, Georgian Partners, and MassMutual Ventures.

THINK OUTSIDE THE FIREWALL™

riskiq.com • 22 Battery Street, 10th Floor, San Francisco, CA 94111, USA • [email protected] • 1.888.415.4447

© 2017 RiskIQ, Inc. All rights reserved. RiskIQ and PassiveTotal are registered trademarks and Outside the Firewall is a trademark of RiskIQ, Inc. in the United States and other countries. All other trademarks contained herein are property of their respective owners.
