MySQL Story in POI Dedup

Outline

• Problem
• Proposal
• Test & Verify

Problem

Master DB: 23 million POIs
Daily incremental: 1 million POIs

Deduping: each incoming POI ends up as either an Add or an Update

Problem

• ProcessPOI (target)

1) Get candidates {POI: distance < 100 meters} from the Master DB
   a. Use the grid index

2) Compare the target with the candidates
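The candidate lookup in step 1 can be sketched as follows. This is only an illustration: the planar coordinates in meters, the cell size, the row-by-row node numbering, and the names `GridIndexSketch`, `nodeIndex` and `nodesWithin` are all assumptions, not the actual grid index used by the Master DB.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical square grid over a planar coordinate system measured in meters. */
final class GridIndexSketch {
    private final double cellMeters; // assumed edge length of one grid cell (node)
    private final long columns;      // assumed number of cells per grid row

    GridIndexSketch(double cellMeters, long columns) {
        this.cellMeters = cellMeters;
        this.columns = columns;
    }

    /** Node index of the cell at (col, row), numbered row by row. */
    long nodeIndex(long col, long row) {
        return row * columns + col;
    }

    /** All nodes overlapped by the bounding box of a circle around (x, y). */
    List<Long> nodesWithin(double x, double y, double radiusMeters) {
        long minCol = Math.max(0, (long) Math.floor((x - radiusMeters) / cellMeters));
        long maxCol = Math.min(columns - 1, (long) Math.floor((x + radiusMeters) / cellMeters));
        long minRow = Math.max(0, (long) Math.floor((y - radiusMeters) / cellMeters));
        long maxRow = (long) Math.floor((y + radiusMeters) / cellMeters);
        List<Long> nodes = new ArrayList<>();
        for (long row = minRow; row <= maxRow; row++) {
            for (long col = minCol; col <= maxCol; col++) {
                nodes.add(nodeIndex(col, row));
            }
        }
        return nodes;
    }
}
```

The returned nodes would then go into a single DB query, and the POIs that come back still need an exact distance check, because the bounding box covers more area than the circle.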

Problem

• DB access is time-consuming, according to the Content Team's experience

At 10ms/POI, 1 million POIs need about 2.8 hours (DB query only)

At 100ms/POI, 1 million POIs need about 28 hours (DB query only), and this is a daily update!
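The estimates above are plain multiplication for a single sequential worker. A minimal sketch of the arithmetic, where `DbTimeEstimate` and `total` are hypothetical names:

```java
import java.time.Duration;

final class DbTimeEstimate {
    /** Total sequential DB time for poiCount POIs at msPerPoi milliseconds each. */
    static Duration total(long poiCount, long msPerPoi) {
        return Duration.ofMillis(poiCount * msPerPoi);
    }
}
```

For example, `total(1_000_000, 10)` is 10,000 seconds (a little under 3 hours) and `total(1_000_000, 100)` is 100,000 seconds (more than a day).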

Proposal

• Build a local cache
• Multiple threads (multiple boxes, Map-Reduce)
• Separate DB query from Dedup computation
• Single SQL tuning

Single SQL Running: DAL vs JDBC

// DAL
CpPoiWorkDao dao = CpPoiWorkDao.getInstance();
List<IPoi> poi = dao.getAllPois(PoiDataSet.CS, Configs.isActive);

// JDBC
Statement statement = connect.createStatement();
ResultSet rs = statement.executeQuery("select * from cs_1");

// running
com.telenav.content.impl.JdbcPoiLoader 0:00:04.062 42985
com.telenav.content.impl.PoiLoader 0:00:10.969 42985

First Declaration

First Declaration: DAL is slower than JDBC; there is a performance loss in DAL

The truth

• DAL needs a 'warm-up' (one more query):

select id as id, table_set_name as table_set_name, current_work_suffix as current_work_suffix, current_live_suffix as current_live_suffix, table_set_size as table_set_size, update_time as update_time, create_time as create_time from active_table where table_set_name=?

Run        JDBC         DAL          (runs 2 to 6 in ms)
First run  0:00:04.125  0:00:09.360
2          3187         4797
3          3297         4672
4          3265         4828
5          3297         4828
6          3344         4891
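Because the first run includes the warm-up, one way to compare the two loaders is to time several runs and average all but the first. The sketch below is an assumption about how such a harness could look, not the harness used for the numbers above; `WarmupTimer`, `timeRuns` and `steadyStateAverage` are hypothetical names.

```java
final class WarmupTimer {
    /** Runs the task the given number of times and returns each run's elapsed milliseconds. */
    static long[] timeRuns(Runnable task, int runs) {
        long[] elapsedMs = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            elapsedMs[i] = (System.nanoTime() - start) / 1_000_000;
        }
        return elapsedMs;
    }

    /** Average of all runs except the first, which is treated as warm-up. */
    static double steadyStateAverage(long[] elapsedMs) {
        if (elapsedMs.length < 2) {
            throw new IllegalArgumentException("need at least two runs");
        }
        long sum = 0;
        for (int i = 1; i < elapsedMs.length; i++) {
            sum += elapsedMs[i];
        }
        return (double) sum / (elapsedMs.length - 1);
    }
}
```

Applied to the table above, the steady-state averages are 3278.0 ms for JDBC and 4803.2 ms for DAL, so most (though not all) of the first-run gap comes from the warm-up.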

Second SQL Running

select POI_RECORD_ID, POI_ID, LATITUDE, …, locality, locale from us_ta_1 where node_index in ( ?, ?, … ? )

Run        JDBC (ms)  DAL (ms)
First run  375        1156
2          406        313
3          375        281
4          391        375
5          375        266
6          406        297
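The second SQL takes one placeholder per grid node. A minimal sketch of building and binding such a statement with plain JDBC follows; `InClauseSketch`, `candidateSql` and `bindNodes` are hypothetical names, the column list is shortened to `*`, and the table name is assumed to be a trusted constant rather than user input.

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;

final class InClauseSketch {
    /** Builds "... where node_index in (?, ?, ...)" with one placeholder per node. */
    static String candidateSql(String table, int nodeCount) {
        if (nodeCount < 1) {
            throw new IllegalArgumentException("need at least one node");
        }
        String placeholders = String.join(", ", Collections.nCopies(nodeCount, "?"));
        return "select * from " + table + " where node_index in (" + placeholders + ")";
    }

    /** Binds each node index to its placeholder; JDBC parameters are 1-based. */
    static void bindNodes(PreparedStatement statement, List<Long> nodes) throws SQLException {
        for (int i = 0; i < nodes.size(); i++) {
            statement.setLong(i + 1, nodes.get(i));
        }
    }
}
```

With 4 nodes, `candidateSql("us_ta_1", 4)` yields `select * from us_ta_1 where node_index in (?, ?, ?, ?)`.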

First Declaration: DAL is slower than JDBC; there is a performance loss in DAL. (Not supported here: apart from the warm-up run, DAL is comparable to JDBC on this query.)

Benchmark Data

• It's slow, but how slow?
  – A single SQL is only a smoke test; we want real data

Benchmark Data

• Test Case
  – Run 10k POIs; for each POI:
    • DedupWorkPoiDao.getAdjacentDedupPois gets the candidate POIs for matching
    • IDeDuper.getDuplications(target, candidate) finds the matches among the candidates
  – Distance: 100 meters
  – Repeat the test 3 times

• Test Result

Process Time       DB Time  Dedup Time  Dedup candidate POI size  Matched POI ratio
0:01:46, 10ms/POI  4ms      6ms         51                        0.63

6387 POIs were matched

Second Declaration

Second Declaration: Dedup is the most important factor in the process; the DB is not the bottleneck

The truth

• DB is fast because of cache

#  Distance  Process Time         DB Parameters  DB Time  Dedup Time  Dedup candidate POI size  Matched POI ratio
   100       total 2min30s, 14ms  4              4ms      9ms         80                        0.68
1  500       total 30m, 180ms     37             128ms    51ms        474                       0.79
2  500       total 11m38s, 69ms                  18ms     51ms

At 500 meters there are 37 nodes in a single query, and each POI needs to be compared with 474 candidates.

The second (later) run is much faster than the first run.

The truth

• Clean the MySQL cache & restart MySQL
  – key_buffer_size 500m -> 8 byte
  – query_cache_size 64m -> 0

• No effect; the DB query is still fast.
  – The first-run time cannot be reproduced for the same data set.

The truth

• Clean the OS (Linux) file cache
  – echo 3 > /proc/sys/vm/drop_caches

• Test Result

Process Time           DB Time  Dedup Time   Dedup candidate POI size  Matched POI ratio
0:01:46, 10ms/POI      4ms      6ms          51                        0.63
0:22:58.844 (db only)  137ms    / (removed)  51                        /

The DB query is about 30 times slower when the OS file cache is cleaned.

Second Declaration: Dedup is the most important factor in the process; the DB is not the bottleneck. (Does not hold once the OS file cache is cold.)

MySQL Index Preloading

• MySQL index preloading
  – key_buffer_size 4096m
  – load index into cache us_ta_1 index (INDEX_NODEX_INDEX);

• Almost no effect; the DB query time is nearly the same.

Data file is bottleneck

• It seems the key index does not help; is the bottleneck in data file reading (an assumption)?

• Verify
  – 1) Reorder the 23 million records using a Hilbert curve, so that neighboring POIs are also adjacent on disk, reducing disk seeks
  – 2) Build a new table in which each row is <node, POIs in the node>, reducing the number of I/O operations needed to read one node's POIs

Data file is bottleneck

• Re-order POI in DB

insert into us_ta_2 (select * from us_ta_1 order by node_index)

• Test Result

Run        Process Time           DB Time  Dedup Time   Dedup candidate POI size  Matched POI ratio
           0:01:46, 10ms/POI      4ms      6ms          51                        0.63
First run  0:22:58.844 (db only)  137ms    / (removed)  51                        /
First run  0:03:10.985 (db only)  19ms     /            51                        /
           0:00:46.360 (db only)  4ms      / (removed)  51                        /

Multiple-Thread

• DB

Threads    Process Time (db only)  DB Time  Dedup Time  Dedup candidate POI size  Matched POI ratio
1 thread   0:03:10.985             19ms     /           51                        /
4 threads  0:01:05.406             24ms     /           51                        /
8 threads  0:00:38.328             29ms     /           51                        /

• DB & Dedup

Threads                        Process Time  DB Time  Dedup Time  Dedup candidate POI size  Matched POI ratio
1 thread                       0:04:07.125   18ms     5ms         51                        0.6387
4 threads DB, 2 threads dedup  0:01:11.328   25ms     9ms         51                        0.6387
4 threads DB, 1 thread dedup   0:01:22.953   28ms     7ms         51                        0.6387
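Running DB queries and dedup on separate thread pools, as in the "4 threads DB, 2 threads dedup" row, could be wired up roughly like this. This is a sketch under assumptions, not the code behind the measurements: `DedupPipelineSketch` and `run` are hypothetical, and `fetchCandidates` and `match` stand in for `DedupWorkPoiDao.getAdjacentDedupPois` and `IDeDuper.getDuplications`, whose real signatures are not shown in the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.BiFunction;
import java.util.function.Function;

final class DedupPipelineSketch {
    /**
     * Fetches candidates on dbThreads threads and matches them on dedupThreads threads.
     * Results are returned in the same order as the targets.
     */
    static <T, R> List<R> run(List<T> targets,
                              Function<T, List<T>> fetchCandidates,
                              BiFunction<T, List<T>, R> match,
                              int dbThreads,
                              int dedupThreads) {
        ExecutorService dbPool = Executors.newFixedThreadPool(dbThreads);
        ExecutorService dedupPool = Executors.newFixedThreadPool(dedupThreads);
        try {
            List<CompletableFuture<R>> futures = new ArrayList<>();
            for (T target : targets) {
                // Stage 1 (DB pool): load candidates. Stage 2 (dedup pool): compare.
                futures.add(CompletableFuture
                        .supplyAsync(() -> fetchCandidates.apply(target), dbPool)
                        .thenApplyAsync(candidates -> match.apply(target, candidates), dedupPool));
            }
            List<R> results = new ArrayList<>();
            for (CompletableFuture<R> future : futures) {
                results.add(future.join()); // wait for both stages of this target
            }
            return results;
        } finally {
            dbPool.shutdown();
            dedupPool.shutdown();
        }
    }
}
```

Keeping two pools lets the DB-bound and CPU-bound stages be sized independently, which is the knob the table above varies.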

Another assumption

Assumption: building a local cache and processing POIs in Hilbert curve order would help a great deal

Cache: <node, POIs in the node>

DB Query: get the POIs in the given nodes

Query:
– Pick up the nodes that are in the local cache
– DB Query: only the nodes that are not in the local cache
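The cache-then-query split described above can be sketched with an access-ordered `LinkedHashMap` as a small LRU cache. Everything here is an assumption for illustration: the class `NodeCacheSketch`, the `dbQuery` batch loader, the LRU eviction policy and the `maxNodes` limit are not taken from the original implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical local cache of <node, POIs in the node> in front of a batch DB query. */
final class NodeCacheSketch<P> {
    private final Map<Long, List<P>> cache;
    private final Function<List<Long>, Map<Long, List<P>>> dbQuery; // loads the given nodes
    long hits;    // node lookups answered from the cache
    long lookups; // all node lookups

    NodeCacheSketch(int maxNodes, Function<List<Long>, Map<Long, List<P>>> dbQuery) {
        this.dbQuery = dbQuery;
        // Access-ordered map: the eldest entry is the least recently used node.
        this.cache = new LinkedHashMap<Long, List<P>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, List<P>> eldest) {
                return size() > maxNodes;
            }
        };
    }

    /** Returns the POIs of all given nodes, querying the DB only for uncached nodes. */
    List<P> poisIn(List<Long> nodes) {
        List<P> result = new ArrayList<>();
        List<Long> missing = new ArrayList<>();
        for (Long node : nodes) {
            lookups++;
            List<P> cached = cache.get(node);
            if (cached != null) {
                hits++;
                result.addAll(cached);
            } else {
                missing.add(node);
            }
        }
        if (!missing.isEmpty()) {
            Map<Long, List<P>> loaded = dbQuery.apply(missing);
            for (Long node : missing) {
                List<P> pois = loaded.getOrDefault(node, new ArrayList<P>());
                cache.put(node, pois); // empty nodes are cached too
                result.addAll(pois);
            }
        }
        return result;
    }
}
```

The ratio `hits / lookups` corresponds to a cache hit ratio; it only grows when consecutive targets share nodes, which is why the processing order matters.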

Hilbert Curve

A Hilbert curve gives a mapping between 1D and 2D space that preserves locality fairly well.
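For reference, the widely used iterative conversion from a grid cell (x, y) to its distance along the Hilbert curve looks like this. It is a generic sketch of the standard algorithm, not necessarily the code used for the reordering in these slides; `HilbertSketch` and `xy2d` are illustrative names, and the grid side `n` is assumed to be a power of two.

```java
final class HilbertSketch {
    /** Distance along the Hilbert curve of cell (x, y) in an n-by-n grid; n is a power of two. */
    static long xy2d(int n, int x, int y) {
        long d = 0;
        for (int s = n / 2; s > 0; s /= 2) {
            int rx = (x & s) > 0 ? 1 : 0;
            int ry = (y & s) > 0 ? 1 : 0;
            d += (long) s * s * ((3 * rx) ^ ry);
            // Rotate or flip the quadrant so the next level sees a canonical orientation.
            if (ry == 0) {
                if (rx == 1) {
                    x = n - 1 - x;
                    y = n - 1 - y;
                }
                int swap = x;
                x = y;
                y = swap;
            }
        }
        return d;
    }
}
```

Sorting POIs by the `xy2d` value of their grid cell puts spatial neighbors close together in processing order, since consecutive curve positions are always adjacent cells.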

Hilbert Curve: 5k POI

[Figure: 5k POIs in DB ordering vs. Hilbert curve ordering]

The truth

Run        Distance    Total Time  DB Parameters  DB Time  Dedup candidate POI size  Cache hit ratio
first run  100         47s         4              4.7ms    80
           100, cache  41s         4              4.1ms    80                        5% (1679/40986)
first run  100, cache  48s         4              4.8ms    80                        5%
           100         41s         4              4.1ms    80

Note: the OS file cache is not cleaned in these runs.

Assumption (revised): building a local cache and processing POIs in Hilbert curve order gives some help (not great help), when the data is not so sparse

Run        Distance    Total Time  DB Parameters  DB Time  Dedup candidate POI size  Cache hit ratio
           500, cache              37             11ms     474                       11%
           500                     37             18ms     474

Summary

• The SQL itself is very simple; no tuning point?

select * from us_ta_1 where node_index in ( ?, ?, ?... )

• Multiple threads are necessary to increase throughput
  – Separate Dedup and DB Query (Dedup is also time-consuming when the candidate size is big)

Jump out of box

• A new <node, POI> table
• NoSQL storage with spatial support <node, POI>
• CoSE to search candidates
• Hadoop (Map-Reduce)

Performance Tuning Tips

• Test to verify each assumption
• Make the environment as close to the real one as possible
  – Do not mock
  – Do not talk to the US DB from CN

• Repeat tests to get a consistent result (the result can be reproduced)

• Do not miss any exceptional case (the first run is slower than later runs)

• Consider both the (MySQL) client and server sides
