Treasure Data Summer Internship 2016
TRANSCRIPT
![Page 1: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/1.jpg)
Internship Final Report
Sep 30, 2016
Yuta Iwama
![Page 2: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/2.jpg)
Who am I
• Yuta Iwama (@ganmacs)
• Master’s student, The University of Tokyo
• Research: Programming languages (My theme is extending language syntax)
• Group: Chiba Shigeru Group
![Page 3: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/3.jpg)
What I did during the summer internship
• Added features and enhancements to Fluentd v0.14.x
![Page 4: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/4.jpg)
What I did during the summer internship
• 6 features
  • Counter API (not merged yet)
  • Data compression in buffer plugins and forward plugins
  • New out_file plugin only for <secondary> section
  • A CLI tool to read dumped event data
  • Log rotation
  • `filter_with_time` method in filter plugins
• 2 enhancements
  • Optimizing multiple filter calls
  • Add event size to options in the forward protocol
• Some small tasks
![Page 5: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/5.jpg)
I’ll talk about
• Counter API
• Data compression in buffer plugins and forward plugins
• New out_file plugin only for <secondary> section
• A CLI tool to read dumped event data
• Log rotation
• Optimizing multiple filter calls
![Page 7: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/7.jpg)
Data Compression in buffer plugins and forward plugins
![Page 8: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/8.jpg)
Current buffer plugins and forward plugins in Fluentd
• Buffer plugins hold data as a string (formatted with MessagePack or a user's custom format)
• Forward plugins send data as a string (in the same format as buffer plugins)
• Although the data is serialized with MessagePack, its footprint is still large
• The current approach consumes a lot of memory and network bandwidth
![Page 9: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/9.jpg)
New buffer plugins and forward plugins
• String data in buffer plugins can be compressed
• Forward plugins can send and receive compressed data
• This makes it possible to
  • Save bandwidth across datacenters
  • Speed up transfers and save time
  • Reduce memory consumption and IaaS costs (EC2, etc.)
![Page 10: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/10.jpg)
Implementation
• I used “zlib” in Ruby to implement the compression/decompression methods
• It is hard to support both the compressed version and the raw version (to solve this problem, I used `extend` in Ruby so as not to break the existing interface)
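A minimal sketch of the idea on this slide, assuming a simplified chunk object: compression is added to individual objects with `extend`, so the original class and its callers keep the same `append`/`read` interface. `Chunk` and `Compressible` are hypothetical names for illustration and are not Fluentd's actual classes.

```ruby
require 'zlib'

# Hypothetical stand-in for a buffer chunk that stores serialized events as strings.
class Chunk
  def initialize
    @pieces = []
  end

  def append(data)
    @pieces << data
    self
  end

  def read
    @pieces.join
  end

  # Number of bytes actually held in memory.
  def bytesize
    @pieces.inject(0) { |sum, piece| sum + piece.bytesize }
  end
end

# Hypothetical module: mixed into a single chunk with `extend`, it compresses
# on append and decompresses on read, leaving the Chunk class itself untouched.
module Compressible
  def append(data)
    super(Zlib::Deflate.deflate(data))
  end

  def read
    @pieces.map { |piece| Zlib::Inflate.inflate(piece) }.join
  end
end

payload = '{"message":"dummy"}' * 100

raw = Chunk.new
raw.append(payload)

compressed = Chunk.new.extend(Compressible)
compressed.append(payload)

puts raw.bytesize
puts compressed.bytesize
```

Callers that only use `append` and `read` see the same data from both objects; only the stored size differs.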
![Page 11: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/11.jpg)
New out_file plugin only for <secondary> section
![Page 12: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/12.jpg)
Background
• Many users use the out_file plugin in a <secondary> section to dump the buffer when the primary buffered output plugin fails to flush
• But out_file is too complex and has too many features for that purpose
• => We need a simple out_file only for the <secondary> section, just to dump the buffer
![Page 13: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/13.jpg)
New plugin: secondary_out_file
• Only four attributes
  • directory: the directory where dumped data is saved
  • basename: the file name of the dumped data (default is dump.bin)
  • append: whether flushed data is appended to an existing file (default is false)
  • compress: the type of the file (gzip or txt, default is txt)
• Users can use this plugin by setting only directory
<match >
  @type forward
  ...
  <secondary>
    type secondary_file
    directory log/secondary/
  </secondary>
</match>
![Page 14: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/14.jpg)
A CLI tool which is used for reading dump data
![Page 15: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/15.jpg)
Background
• Dumped data is created by secondary plugins (e.g. secondary_out_file) when primary plugins fail to flush
• We can't read the dumped data directly because in most cases it is in a binary format (MessagePack)
• => Provide a CLI tool to read dumped data
![Page 16: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/16.jpg)
fluent-binlog-reader
• fluent-binlog-reader is bundled with Fluentd
• It reads dumped data and outputs it in a readable format
• Users can use Fluentd's formatter plugins as the output format
$ fluent-binlog-reader --help
Usage: fluent-binlog-reader <command> [<args>]

Commands of fluent-binlog-reader:
   cat     : Read files sequentially, writing them to standard output.
   head    : Display the beginning of a text file.
   format  : Display plugins that you can use.

See 'fluent-binlog-reader <command> --help' for more information on a specific command.
![Page 17: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/17.jpg)
fluent-binlog-reader
Default format is JSON
$ fluent-binlog-reader head packed.log
2016-08-12T17:24:18+09:00 packed.log {"message":"dummy"}
2016-08-12T17:24:18+09:00 packed.log {"message":"dummy"}
2016-08-12T17:24:18+09:00 packed.log {"message":"dummy"}
2016-08-12T17:24:18+09:00 packed.log {"message":"dummy"}
2016-08-12T17:24:18+09:00 packed.log {"message":"dummy"}
Using a “csv formatter” to output dumped data
$ fluent-binlog-reader cat --formats=csv -e fields=message packed.log
"dummy"
...
"dummy"
![Page 18: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/18.jpg)
Log rotation
![Page 19: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/19.jpg)
Background
• Fluentd couldn't rotate its own log
• As the log file grows, it becomes difficult to handle
• => Make Fluentd support log rotation to keep log files down to a manageable size
![Page 20: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/20.jpg)
Log rotation
• Two options
  • log-rotate-age: the number of old log files to keep
  • log-rotate-size: the maximum log file size
• Uses ServerEngine's log rotation (Fluentd uses the ServerEngine logger for its own log)
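To illustrate what age-based and size-based rotation mean, here is a minimal sketch that uses Ruby's standard `Logger` instead of ServerEngine. It assumes the two options behave like `Logger`'s `shift_age` and `shift_size` arguments, which is an assumption for illustration; the file name and the helper `write_with_rotation` are hypothetical.

```ruby
require 'logger'
require 'tmpdir'

# Writes `lines` messages through a rotating logger and returns the names
# of the files left in `dir`.
# age:  number of log files to keep (Logger's shift_age)
# size: size in bytes at which the current file is rotated (Logger's shift_size)
def write_with_rotation(dir, age:, size:, lines:)
  path = File.join(dir, 'fluentd.log')
  logger = Logger.new(path, age, size)
  lines.times { |i| logger.info("message #{i}") }
  logger.close
  (Dir.entries(dir) - %w[. ..]).sort
end

files = Dir.mktmpdir do |dir|
  write_with_rotation(dir, age: 3, size: 1024, lines: 200)
end
puts files
```

However many messages are written, the number of files stays bounded by the age setting, and each file stays close to the size limit.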
![Page 21: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/21.jpg)
Optimise multiple filter calls
![Page 22: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/22.jpg)
Background
• If users apply multiple filters to incoming events, Fluentd creates a lot of EventStream objects and calls their `add` method many times
• => Remove the useless EventStream instantiations and `add` method calls
![Page 23: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/23.jpg)
Current filters
[e1, e2, e3, e4, e5] → Filter 1 → [e1’, e2’, e3’, e4’, e5’] → ... → Filter n → [e1x, e2x, e3x, e4x, e5x]
Each filter does the following:
1. Create an EventStream object (1 time)
2. Apply a filter to each event (5 times)
3. Add a filtered event to an EventStream object (5 times)
If 10 filters are applied:
1. is called 10 times
2. is called 50 times
3. is called 50 times
![Page 24: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/24.jpg)
Optimised case
[e1, e2, e3, e4, e5] → Filter 1 ... Filter n (applied together) → [e1x, e2x, e3x, e4x, e5x]
1. Create an EventStream object (1 time)
2. Apply each filter to each event (n * 5 times)
3. Add a filtered event to an EventStream object (5 times)
If 10 filters are applied:
1. is called 1 time
2. is called 50 times
3. is called 5 times
Constraint: these filters must not implement the `filter_stream` method
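The call counts on the two slides can be reproduced with a small sketch. It assumes a simplified `EventStream` that only records how often it is created and how often `add` is called, and filters modelled as plain lambdas; none of these names correspond to Fluentd's real classes, and dropping events inside a filter is ignored here.

```ruby
# Hypothetical, simplified event stream that counts instantiations and add calls.
class EventStream
  attr_reader :events

  def initialize(stats)
    @stats = stats
    @stats[:streams] += 1
    @events = []
  end

  def add(event)
    @stats[:adds] += 1
    @events << event
  end
end

# Current behaviour: every filter builds its own EventStream.
def filter_one_by_one(events, filters, stats)
  filters.each do |filter|
    stream = EventStream.new(stats)
    events.each do |event|
      stats[:filter_calls] += 1
      stream.add(filter.call(event))
    end
    events = stream.events
  end
  events
end

# Optimised behaviour: each event passes through all filters,
# and only the final result is added to a single EventStream.
def filter_in_one_pass(events, filters, stats)
  stream = EventStream.new(stats)
  events.each do |event|
    filtered = filters.reduce(event) do |current, filter|
      stats[:filter_calls] += 1
      filter.call(current)
    end
    stream.add(filtered)
  end
  stream.events
end

events = %w[e1 e2 e3 e4 e5]
filters = Array.new(10) { ->(event) { "#{event}'" } }

current_stats = Hash.new(0)
current_result = filter_one_by_one(events, filters, current_stats)

optimised_stats = Hash.new(0)
optimised_result = filter_in_one_pass(events, filters, optimised_stats)

p current_stats
p optimised_stats
```

Both functions return the same filtered events; only the number of `EventStream` objects and `add` calls differs.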
![Page 25: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/25.jpg)
Performance
Tool: ruby-prof
Environment: Mac OS X 10.11.6 (BuildVersion 15G31), 2.7 GHz Intel Core i5, 8 GB 1867 MHz DDR3
Not optimized: 0.063186
Optimised: 0.051646
1.2 times faster when using 10 filters and 1000 events per sec
![Page 27: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/27.jpg)
Counter API
![Page 28: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/28.jpg)
Motivations
• To get metrics of Fluentd itself between processes
• To provide counter API to 3rd party plugins
• It is useful to implement counter plugins (e.g. fluent-plugin-datacounter and fluent-plugin-flowcounter)
![Page 29: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/29.jpg)
What’s the counter
• The counter:
  • Is a key-value store
  • Is used for storing the number of occurrences of a particular event in a specified time
  • Provides an API for users to operate on its values (e.g. inc, reset, etc.)
  • Is shared between processes
![Page 30: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/30.jpg)
What’s the counter (cont.)
Counter
key   value
key1  5
key2  1.2
key3  3
Process 1: inc(key1 => 2)
Process 2: reset(key2)
![Page 31: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/31.jpg)
What’s the counter (cont.)
Counter
key   value
key1  7
key2  1.2
key3  0
Process 1: inc(key1 => 2)
Process 2: reset(key2)
![Page 32: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/32.jpg)
Implementation
• RPC server and client
• All operations should be thread safe
• Cleaning up mutex objects for keys
![Page 34: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/34.jpg)
RPC server and client
• Because the counter is shared between processes, we need a server and clients (counter values are stored in the server, and clients manipulate them via RPC)
• I designed an RPC server and client for the counter
• I used cool.io to implement the RPC server and client
  • cool.io provides a high-performance event framework for Ruby (https://coolio.github.io/)
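The server/client split can be sketched as follows. This example deliberately uses Ruby's standard `socket` and `json` libraries rather than cool.io, and the line-delimited JSON wire format, `serve_counter` and `rpc_call` are all hypothetical; it only shows that the values live on the server side while the client manipulates them through requests. `UNIXSocket.pair` requires a Unix-like OS.

```ruby
require 'json'
require 'socket'

# Server side: owns the counter values and answers one JSON request per line.
# The request shape {"method": ..., "key": ..., "value": ...} is hypothetical.
# Unknown methods simply get a nil result in this sketch.
def serve_counter(socket, values)
  while (line = socket.gets)
    request = JSON.parse(line)
    key = request['key']
    result =
      case request['method']
      when 'inc' then values[key] += request['value']
      when 'get' then values[key]
      end
    socket.puts(JSON.generate('result' => result))
  end
end

# Client side: sends a request and waits for the server's answer.
def rpc_call(socket, method, key, value = nil)
  socket.puts(JSON.generate('method' => method, 'key' => key, 'value' => value))
  JSON.parse(socket.gets)['result']
end

server_io, client_io = UNIXSocket.pair
server = Thread.new { serve_counter(server_io, Hash.new(0)) }

rpc_call(client_io, 'inc', 'key1', 2)
rpc_call(client_io, 'inc', 'key1', 3)
value = rpc_call(client_io, 'get', 'key1')
puts value

# Closing the client end makes the server loop finish.
client_io.close
server.join
```

In a multi-process setup the socket pair would be replaced by a listening socket that several worker processes connect to.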
![Page 35: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/35.jpg)
API to operate counter values
• init: create a new value
• reset: reset a counter value
• delete: delete a counter value
• inc: increment or decrement a counter value
• get: fetch a counter value
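A minimal in-process sketch of these five operations, assuming a plain Hash guarded by a single Mutex. The class name `SimpleCounter`, the default values and the error behaviour for unknown or duplicate keys are assumptions for illustration, not the API proposed in the pull request.

```ruby
# Hypothetical in-memory counter exposing the five operations listed above.
class SimpleCounter
  def initialize
    @values = {}
    @lock = Mutex.new
  end

  # Creates a new value; an existing key is treated as an error in this sketch.
  def init(key, value = 0)
    @lock.synchronize do
      raise ArgumentError, "#{key} already exists" if @values.key?(key)
      @values[key] = value
    end
  end

  # Resets the value to 0 and returns the value it had before.
  def reset(key)
    @lock.synchronize do
      previous = @values.fetch(key)
      @values[key] = 0
      previous
    end
  end

  # Removes the key and returns its last value.
  def delete(key)
    @lock.synchronize do
      @values.fetch(key)
      @values.delete(key)
    end
  end

  # Increments (or decrements, with a negative delta) and returns the new value.
  def inc(key, delta = 1)
    @lock.synchronize { @values[key] = @values.fetch(key) + delta }
  end

  def get(key)
    @lock.synchronize { @values.fetch(key) }
  end
end

counter = SimpleCounter.new
counter.init('key1', 5)
counter.inc('key1', 2)
puts counter.get('key1')
```

Unknown keys raise `KeyError` through `Hash#fetch`, which keeps typos in key names from silently creating new counters.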
![Page 37: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/37.jpg)
All operations should be thread safe
• The counter runs in multiple threads
• You need to get a lock per key when you change a counter value
• The counter stores mutex objects in a hash (key_name => mutex_object)
![Page 38: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/38.jpg)
How an inc method works
Counter server: counter values (key1 => 2) and a mutex hash (key1 => mutex obj)
Client in worker1: inc(key1 => 2)
1. Call an inc method
2. Get a lock for the mutex hash (mutex hash: locked)
3. Get a lock for the key (key1: locked)
4. Unlock the mutex hash
5. Change the counter value (key1: 2 → 4)
6. Unlock the key lock (key1: unlocked)
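The six steps can be sketched in plain Ruby as below. This is a simplified, in-process illustration under the assumption that the counter values, the mutex hash and the lock guarding the mutex hash are three separate objects; `LockedCounter` is a hypothetical name and the code is not taken from the pull request.

```ruby
# Hypothetical counter that takes a per-key lock following the steps above.
class LockedCounter
  def initialize
    @values = Hash.new(0)
    @mutexes = {}            # key_name => mutex_object
    @hash_lock = Mutex.new   # guards @mutexes itself
  end

  def inc(key, delta)                              # 1. Call an inc method
    key_lock = nil
    @hash_lock.synchronize do                      # 2. Get a lock for the mutex hash
      key_lock = (@mutexes[key] ||= Mutex.new)
      key_lock.lock                                # 3. Get a lock for the key
    end                                            # 4. Unlock the mutex hash
    begin
      @values[key] += delta                        # 5. Change the counter value
    ensure
      key_lock.unlock                              # 6. Unlock the key lock
    end
  end

  def get(key)
    @values[key]
  end
end

counter = LockedCounter.new
threads = Array.new(4) do
  Thread.new { 1000.times { counter.inc('key1', 2) } }
end
threads.each(&:join)
puts counter.get('key1')
```

Releasing the mutex hash before changing the value keeps the hash lock short, so updates to different keys do not wait on each other's value changes.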
![Page 45: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/45.jpg)
Mutex objects for keys
• To avoid storing mutex objects for all keys, I implemented a cleanup thread which removes the mutex objects of unused keys (like GC)
• This thread removes mutex objects which have not been used for a certain period
![Page 46: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/46.jpg)
Cleaning up a mutex hash
Counter server: counter values (key1 => 2) and a mutex hash (key1 => mutex obj)
• If “key1” is not modified for a long period, “key1” may be unused from then on
1. Start a cleaning thread (once every 15 min)
2. Get a lock for the mutex hash (mutex hash: locked)
3. Remove the mutex for an unused key
4. Try to get a lock for the same key (if this thread can’t get the lock, restore the key-value pair)
5. Unlock the mutex hash
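A sketch of the cleanup pass, assuming the mutex hash also records when each key was last used. `KeyMutexTable`, its `ttl` parameter and the explicit `now` arguments are hypothetical and exist only to make the example deterministic; a real cleaning thread would call `cleanup` periodically (the slides mention once every 15 minutes).

```ruby
# Hypothetical mutex hash with a cleanup pass following the steps above.
class KeyMutexTable
  def initialize(ttl)
    @ttl = ttl               # seconds a key may stay unused before removal
    @hash_lock = Mutex.new
    @mutexes = {}            # key_name => mutex_object
    @last_used = {}          # key_name => time of last access
  end

  # Returns the mutex for `key`, creating it if needed, and records the access time.
  def mutex_for(key, now = Time.now)
    @hash_lock.synchronize do
      @last_used[key] = now
      @mutexes[key] ||= Mutex.new
    end
  end

  def key?(key)
    @hash_lock.synchronize { @mutexes.key?(key) }
  end

  def cleanup(now = Time.now)
    @hash_lock.synchronize do                      # 2. Get a lock for the mutex hash
      stale = @last_used.select { |_key, used_at| now - used_at >= @ttl }.keys
      stale.each do |key|
        mutex = @mutexes.delete(key)               # 3. Remove the mutex for an unused key
        next if mutex.nil?
        if mutex.try_lock                          # 4. Try to get a lock for the same key
          mutex.unlock
          @last_used.delete(key)
        else
          @mutexes[key] = mutex                    #    Still in use: restore the key-value pair
        end
      end
    end                                            # 5. Unlock the mutex hash
  end
end

table = KeyMutexTable.new(900)
start = Time.at(0)
table.mutex_for('key1', start)
table.cleanup(start + 1000)
puts table.key?('key1')
```

The `try_lock` check is what prevents the cleanup from discarding a mutex that some operation still holds.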
![Page 52: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/52.jpg)
Summary
• Added six features and two enhancements to Fluentd v0.14.x
• The Counter API is not merged yet
• The other PRs have been merged
![Page 53: Treasure Data Summer Internship 2016](https://reader033.vdocument.in/reader033/viewer/2022052606/588626c21a28ab8f2c8b61eb/html5/thumbnails/53.jpg)
Impressions of the internship
• The hardest thing for me was designing the Counter API (it took over 1 week)
• I learned about developing middleware which is used by many people
• I want to become more careful with the code I write (typos, descriptions, comments, etc.)