Network Traffic Search using Apache HBase
Evans Ye @ TWHUG 2014 Q1
2014/3/8
• Evans Ye @
– Dumbo Team
  • Dumbo In Taiwan Blog
– Talk in TWHUG 2013 Q4
  • Building Hadoop Based Big Data Environment
– Apache Bigtop Contributor
Who am I
04/11/2023 Copyright 2013 Trend Micro Inc.
• Problem to Solve
• Solution Design
• Flume ETL Process
• Experience Sharing
• Future Work
Agenda
Step aside and let the professionals handle it!
Security Department: Hey SPN, I have a big data problem…
Network Traffic Analysis Example
[Diagram: TW branch and US branch; INTRANET and INTERNET zones; VICTIM 1 to VICTIM 4; C&C 1 to C&C 3]
• ArcSight Common Event Format
– Volume: 250 GB / 180 million records per day
Find Malicious Connections by Searching Netflow logs
• src: source ip
• dst: destination ip
• spt: source port
• dpt: destination port
• proto: protocol (TCP, UDP, …)
• rt: timestamp in epoch milliseconds, e.g. 1386018915000
Valuable Fields in Netflow log
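A minimal Python sketch of pulling these fields out of one CEF record. The sample line, vendor name and helper names are made up for illustration; the parser assumes the usual CEF shape of seven pipe-delimited header fields followed by space-separated key=value pairs.

```python
import re

# Fields the slide lists as valuable for search.
SEARCH_FIELDS = ("src", "dst", "spt", "dpt", "proto", "rt")

# key=value pairs; a value runs until the next " key=" or the end of the line.
_PAIR = re.compile(r"([\w.]+)=(.*?)(?=\s+[\w.]+=|$)")

def parse_cef_extension(line):
    """Return the key=value pairs of a CEF record's extension part as a dict."""
    # Everything after the 7th pipe is the extension.
    extension = line.split("|", 7)[-1]
    return {key: value for key, value in _PAIR.findall(extension)}

def extract_search_fields(line):
    """Keep only the searchable fields; missing ones come back as None."""
    fields = parse_cef_extension(line)
    return {name: fields.get(name) for name in SEARCH_FIELDS}

# Hypothetical sample record, not taken from the talk.
sample = ("CEF:0|ExampleVendor|NetflowLogger|1.0|100|flow|1|"
          "src=10.1.2.3 dst=203.0.113.7 spt=51234 dpt=443 "
          "proto=TCP rt=1386018915000 in=1200")
record = extract_search_fields(sample)
```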
Search for Connections
Query on NetflowLogger: about 8~10 min
Big Data Problem
• Big data solutions
• Why HBase?
– We want to try HBase Thrift and find out its limitations
– We want to see how HBase performs when dealing with this kind of problem
Choosing The Right Tool
Solution Design
Architecture
HBase Thrift
Server
Send Netflow via syslog
Data Source
Query
HBase Thrift: talk to HBase using C++, Python, PHP, Ruby, Perl…
Bottle: a simple Python web framework, only one file under 150k
• Searchable Fields
– src: source ip
– dst: destination ip
– spt: source port
– dpt: destination port
– proto: protocol (TCP, UDP, …)
– rt: timestamp, e.g. 1386018915000
• Values
– in, cn2, ad.tcp__flags
User Requirement
• Compose the searchable fields into the rowkey
• For client queries, scan with an HBase Filter
– RowFilter (=, 'regexstring:^src#dst#[^#]*#spt#dpt#proto$')
– See the HBase Thrift Filter doc
HBase Rowkey Design – First Attempt
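A sketch of the first attempt in Python. The field order, with rt in the third position, is an assumption read off the regex on the slide; the helper names are hypothetical. Unspecified fields become `[^#]*` so they match any value.

```python
import re

# Assumed rowkey layout: src#dst#rt#spt#dpt#proto
FIELD_ORDER = ("src", "dst", "rt", "spt", "dpt", "proto")

def make_rowkey(record):
    """Join all searchable fields into one '#'-delimited rowkey."""
    return "#".join(str(record[name]) for name in FIELD_ORDER)

def make_row_regex(query):
    """Regex over the rowkey; fields absent from the query match anything."""
    parts = [re.escape(str(query[name])) if name in query else "[^#]*"
             for name in FIELD_ORDER]
    return "^" + "#".join(parts) + "$"

def make_thrift_filter(query):
    """Filter string in the HBase Thrift filter language."""
    return "RowFilter (=, 'regexstring:%s')" % make_row_regex(query)

rowkey = make_rowkey({"src": "10.1.2.3", "dst": "203.0.113.7",
                      "rt": 1386018915000, "spt": 51234,
                      "dpt": 443, "proto": "TCP"})
regex = make_row_regex({"src": "10.1.2.3", "dpt": 443})
matches = re.match(regex, rowkey) is not None
```

The regex is evaluated against every rowkey in the table, which is what makes this design slow on large data.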
RD Style Search Portal
• Tested on 12 million sample records
• The search performance: 1.5~2 min
• Since we need to keep at least 3 months of data for querying, this performance might not be good enough…
Performance
• Avoid full table scans
– HBase Filters only keep unwanted data from being sent to the client side
– On the server side, every rowkey still has to be compared when a filter is applied
– Set STARTROW and STOPROW
Lesson Learned
• Since HBase is natively designed to store data sorted by rowkey, it is fast to scan rows when a rowkey prefix is specified
– This is only fast when the source ip is specified
– How about destination ip, port, protocol, …?
Avoid Full Table Scan
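Why a prefix scan is cheap can be shown with a small Python stand-in: a sorted list plays the role of the table, and STARTROW/STOPROW are located by binary search instead of comparing every key. The stop row is the prefix with its last byte incremented. This is only an illustration of the idea, not HBase code.

```python
import bisect

def prefix_stop_row(prefix):
    """Smallest key that sorts after every key starting with `prefix`."""
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return b""  # empty stop row: scan to the end of the table
    return trimmed[:-1] + bytes([trimmed[-1] + 1])

def prefix_scan(sorted_keys, prefix):
    """Return the keys in [STARTROW, STOPROW) without touching the rest."""
    start = bisect.bisect_left(sorted_keys, prefix)
    stop_row = prefix_stop_row(prefix)
    stop = bisect.bisect_left(sorted_keys, stop_row) if stop_row else len(sorted_keys)
    return sorted_keys[start:stop]

keys = sorted([
    b"10.1.2.3#203.0.113.7",
    b"10.1.2.3#198.51.100.9",
    b"10.1.2.30#203.0.113.7",
    b"10.9.9.9#203.0.113.7",
])
# Keeping the trailing '#' in the prefix stops 10.1.2.3 from matching 10.1.2.30.
hits = prefix_scan(keys, b"10.1.2.3#")
```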
• Searchable Fields
– src: source ip
– dst: destination ip
– spt: source port
– dpt: destination port
– proto: protocol
– rt: timestamp
• Users want to track down suspicious connections
– A query needs at least one IP (required)
Rethink The User Requirement
– Search on source ip: src table, rowkey src#dst
– Search on destination ip: dst table, rowkey dst#src
– Put the netflow timestamp into the HBase timestamp to leverage HBase TimeRange Scan
– Set VERSIONS => 2147483647 to avoid collisions
HBase Rowkey Design – Second Attempt !
• Search the other searchable fields by applying a QualifierFilter:
– QualifierFilter (=, 'regexstring:^spt#dpt#proto$')
HBase Rowkey Design – Second Attempt !
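The second attempt can be sketched in Python as a mapping from one netflow record to one cell per table, plus a builder for the qualifier filter string. Table names, the value format and the helper names are hypothetical; the column family is left out.

```python
import re
from collections import namedtuple

# One HBase cell in the second-attempt layout.
Cell = namedtuple("Cell", "table rowkey qualifier timestamp value")

def to_cells(record, value=""):
    """Map one netflow record to a cell in each of the two tables."""
    qualifier = "#".join(str(record[k]) for k in ("spt", "dpt", "proto"))
    timestamp = int(record["rt"])  # netflow time becomes the cell timestamp
    src, dst = record["src"], record["dst"]
    return [
        Cell("src_table", "%s#%s" % (src, dst), qualifier, timestamp, value),
        Cell("dst_table", "%s#%s" % (dst, src), qualifier, timestamp, value),
    ]

def qualifier_filter(spt=None, dpt=None, proto=None):
    """Thrift filter string; fields left as None match any value."""
    parts = ["[^#]*" if v is None else re.escape(str(v))
             for v in (spt, dpt, proto)]
    return "QualifierFilter (=, 'regexstring:^%s$')" % "#".join(parts)

record = {"src": "10.1.2.3", "dst": "203.0.113.7", "spt": 51234,
          "dpt": 443, "proto": "TCP", "rt": "1386018915000"}
cells = to_cells(record, value="in=1200")
flt = qualifier_filter(dpt=443)
```

Two flows with the same IPs, ports and protocol share a rowkey and qualifier and differ only in timestamp, so they end up as versions of one cell. That is presumably why VERSIONS is raised to its maximum.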
• Searchable Fields
– src: source ip → specify STARTROW/STOPROW
– dst: destination ip → specify STARTROW/STOPROW
– spt: source port → apply qualifier filter
– dpt: destination port → apply qualifier filter
– proto: protocol → apply qualifier filter
– rt: timestamp → specify HBase TimeRange
Check The User Requirement
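One way to read this mapping is as a small query planner that picks the table and row range from whichever IP the user supplied. The sketch below is an assumption about how such a planner could look, with hypothetical table and key names; it is not the portal's actual code.

```python
def plan_query(src=None, dst=None, start_ms=None, end_ms=None):
    """Decide which table to scan and how to bound the scan."""
    if src is None and dst is None:
        raise ValueError("a query needs at least one IP")
    if src is not None:
        table, first, second = "src_table", src, dst
    else:
        table, first, second = "dst_table", dst, None
    plan = {"table": table}
    if second is not None:
        # Both IPs known: the rowkey is fully determined.
        plan["row"] = "%s#%s" % (first, second)
    else:
        # Trailing '#' keeps 10.1.2.3 from also matching 10.1.2.30.
        plan["row_prefix"] = first + "#"
    if start_ms is not None and end_ms is not None:
        # HBase time ranges are half-open: [start_ms, end_ms).
        plan["time_range"] = (start_ms, end_ms)
    return plan

by_src = plan_query(src="10.1.2.3")
by_dst = plan_query(dst="203.0.113.7", start_ms=1386018000000, end_ms=1386019000000)
exact = plan_query(src="10.1.2.3", dst="203.0.113.7")
```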
Deliver New Portal
• Tested on 70 million sample records
• The search performance: <1 s to 1 min
• Enough?
– Since malicious connections won’t have a large volume, 80% of queries should be answered within a second
• Duplicate issue:
– Since we only store the needed fields in HBase, the data volume is only 150 MB/day, duplicated 300 MB/day
– Storing 3 months of data = 13.5 GB, duplicated 27 GB (GZed) (record count = 12 billion)
Performance
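The storage figures follow from simple arithmetic, sketched here in Python. It assumes "3 months" means 90 days and decimal units (1 GB = 1000 MB), which is what reproduces the numbers on the slide.

```python
MB_PER_DAY = 150      # needed fields only, per the slide
RETENTION_DAYS = 90   # assumption: 3 months taken as 90 days
TABLE_COPIES = 2      # every record is written to two tables

single_copy_gb = MB_PER_DAY * RETENTION_DAYS / 1000.0
duplicated_gb = single_copy_gb * TABLE_COPIES
```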
• Tested on 240 million sample records
• The search performance: <1 s to 3 min
• The query time holds up for the 80% query case
Test on Even Larger Data
Flume ETL Process
Architecture
HBase Thrift
Server
Send Netflow via syslog
Query
Data Source
Flume Process
[Diagram: Data Source → Flume Spooling Directory Source → Flume File Channel → Flume HBase Sink (Serializer)]
Serializer:
1. Extract the needed fields from the Netflow log
2. Create an HBase Put object for the Sink to execute
Dual Table Write
[Diagram: Data Source (Infosec) → Flume Spooling Directory Source → Channel1 → Sink1 and Channel2 → Sink2]
flume.conf
…
agent1.sinks.sink1.serializer.rowKey = src, dst
agent1.sinks.sink2.serializer.rowKey = dst, src
Duplicate, Again!
Step 1 • A put triggers the prePut coprocessor
Step 2 • In the coprocessor, put to the dst table in dst#src format
Step 3 • Do the regular put to the src table in src#dst format
More Elegant Way
[Diagram: Data Source (Infosec), Flume Spooling Directory Source, Channel1, Sink1, src table, dst table]
Hook a prePut Coprocessor
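A real coprocessor would be a Java RegionObserver overriding prePut. The Python below only simulates the data flow of the three steps with in-memory tables, so every class and function name in it is hypothetical.

```python
class Table:
    """Tiny in-memory stand-in for an HBase table with pre-put hooks."""

    def __init__(self, name):
        self.name = name
        self.rows = {}           # rowkey -> {qualifier: {timestamp: value}}
        self.pre_put_hooks = []  # callables run before each put

    def put(self, rowkey, qualifier, timestamp, value):
        # Step 1: the put triggers the hooks first.
        for hook in self.pre_put_hooks:
            hook(rowkey, qualifier, timestamp, value)
        # Step 3: the regular put to this table.
        cell = self.rows.setdefault(rowkey, {}).setdefault(qualifier, {})
        cell[timestamp] = value

def mirror_to(dst_table):
    """Build a hook that writes the swapped rowkey into dst_table."""
    def pre_put(rowkey, qualifier, timestamp, value):
        # Step 2: src#dst becomes dst#src in the other table.
        src, dst = rowkey.split("#", 1)
        dst_table.put("%s#%s" % (dst, src), qualifier, timestamp, value)
    return pre_put

src_table = Table("src_table")
dst_table = Table("dst_table")
src_table.pre_put_hooks.append(mirror_to(dst_table))

# The client (here, a single Flume sink) only writes to the src table.
src_table.put("10.1.2.3#203.0.113.7", "51234#443#TCP", 1386018915000, "in=1200")
```

With the hook in place a single channel and sink are enough, because the second write happens on the server side.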
Experience Sharing & Future Work
• Thrift
– Thrift is not a first-class citizen of HBase; for example, Thrift does not support Scan with TimeRange and Version
– New filters (for example, FuzzyRowFilter) are not supported, since Thrift has its own Filter Language
• Bottle
– It won’t hurt when you throw away web backend code that was implemented with Bottle
Experience Sharing
• Flume
– There is also a Flume Syslog UDP Source, but it does not work well without extra work
  • 768 bytes per message limitation (fixed in FLUME-2130)
  • Still has a 2048-byte limitation in the netty event decoder
  • Data may be lost because messages get concatenated...
– Spooling Directory Source is much more stable
Experience Sharing
• Make the index table transparent to clients
– Use a coprocessor to hook the client scan and decide which table to scan
• Make Thrift scan support specifying versions:
– For now I use scan to fetch rows and qualifiers, then use getVer to fetch the different versions (Thrift does support “version” on get)
Future Work
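The current two-step workaround can be sketched in Python with the two Thrift calls passed in as plain callables. Filtering versions by time on the client side is an assumption made here to stand in for the missing TimeRange scan; the in-memory store and all names are hypothetical.

```python
def search(scan_cells, get_versions, start_ms, end_ms):
    """Scan for (row, qualifier) pairs, then fetch every version of each
    cell and keep those with start_ms <= timestamp < end_ms."""
    results = []
    for row, qualifier in scan_cells():
        for timestamp, value in get_versions(row, qualifier):
            if start_ms <= timestamp < end_ms:
                results.append((row, qualifier, timestamp, value))
    return results

# In-memory stand-ins for the scan and getVer calls.
store = {
    ("10.1.2.3#203.0.113.7", "51234#443#TCP"): [
        (1386018915000, "a"),
        (1386105315000, "b"),  # one day later, outside the range below
    ],
}
hits = search(lambda: list(store),
              lambda row, qualifier: store[(row, qualifier)],
              1386018000000, 1386019000000)
```

Each matching cell costs an extra round trip, which is the overhead that scan-level version support would remove.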
Questions?
Thank you !