Parsing binaries and protocols with Erlang
DESCRIPTION
Delivered by Bhasker V Kode at foss.in/2009. Official talk page: http://foss.in/2009/schedules/talkdetailspub.php?talkid=17
Erlang's support for handling binaries and pattern matching makes it a great choice for parsing everything from IPv4 packets to payloads from the Memcached protocol, SWF files, or databases like Tokyo Cabinet. From a functional programming perspective, there are various ways of building these parsers that take advantage of the concurrent and recursive nature inherent to the language. The talk also covers challenges gathered while validating the storage & retrieval options for our distributed crawler, and while submitting patches to projects like Medici & Tora (Erlang-based Tokyo Cabinet clients).
The talk will also touch upon Tokyo Cabinet's support for mapreduce with Lua, and notes from building your own custom formats & our internal mapreduce-esque and caching frameworks, used in building a multi-million impression platform utilizing under a gig of RAM per node.
Notes on:
- trends in disk/memory/bandwidth
- why Erlang, RAM, binaries
- garbage collection in the Erlang VM
- message passing
- use-cases
TRANSCRIPT
“Parsing binaries and protocols with erlang ?!”
http://developers.hover.in
Bhasker V Kode, cofounder & CTO at hover.in
at foss.in, December 4th, 2009
“WHY ... ?!”
foss.in/2009 http://developers.hover.in
“BUT I'm BUILDING webapps !?!”
“Everything's quick enough :D”
“doh!”
“ha! of course I knew that... err... but people scale... that's what they do... that's our way out!!! scaling out... scaling up... auto scaling even...!!! :O”
“scale UP...! more RAM seems to stop those silly CPU-unit warnings my hosting provider gives... bring on those infinite loops & polling crons. RealTimeWeb FTW!”
“scaling OUT, maybe with a distributed filesystem, and figure out a way for nodes to talk, and... replication... and location transparency during weekends... and commodity hardware which I can't pay for”
More data is becoming archival, NOT by choice but by force.
Not being pushed to handle streams of data well (even Hadoop!) #bigdata
If you're not compromising, you're not pushing enough. Disk's loss must be someone else's gain. Fixed-length examples at Facebook, Twitter, Google.
Erlang for RAM on the web is the new Embedded C
“THE NEWS TODAY. Once-popular retro format 'binary' continues to go unnoticed after brief sightings on wallpapers during the Matrix trilogy...”
pssst!
● in files of any mime/content type
● in db's that accept binary
● in RAM, via caching engines
● compact for n/w transfer & storage
● the answer to unicode
“fine! Binaries are everywhere, disks are not keeping up, and I've got more cores on my nodes every year.”
“But I'm still not going near a strict, dynamically typed functional programming language with support for concurrency, communication, and distribution, automatic memory management & support for multiple platforms!!!”
Erlang!!!
overrated ? OR
underappreciated ?
“ [ 87, 84, 70] :O !”
What happens when you start an Erlang shell. SMP didn't exist before Erlang release R11 ('06)
“ahh... so processes are pseudo-threads in the Erlang VM that are lightweight & the base of Erlang programs, each having its own heap and message inbox, & are meant for message passing with Erlang primitives. Also, the developer can configure how many cores are used based on the # of schedulers, which run processes.”
A max of 1024 schedulers can be set => your Erlang source today should be able to utilize boxes with up to 1024 cores
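A minimal sketch of inspecting the scheduler count from inside a running VM. The module name `sched_info` is made up for illustration; `erlang:system_info/1` is the standard BIF, and the count can be chosen at startup with the `erl +S` flag.

```erlang
-module(sched_info).
-export([summary/0]).

%% Report how many scheduler threads this VM was started with, how many
%% are currently online, and how many logical CPUs the VM detected.
%% logical_processors may be the atom 'unknown' on some platforms.
summary() ->
    [{schedulers,        erlang:system_info(schedulers)},
     {schedulers_online, erlang:system_info(schedulers_online)},
     {logical_cpus,      erlang:system_info(logical_processors)}].
```

Starting the shell with, say, `erl +S 4:4` and calling `sched_info:summary()` should report four schedulers, all online.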
Let M = msgs to random users
Let N = 100,000 users
Route M msgs to the right N users!
Typical one-node approach:
  for i to M
    for j to N
      if match, add_update
Actor approach: N concurrent processes listening to all msgs
  As a new msg arrives, msg pass to all N pids
  In each concurrent process: if match, add_update
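The actor approach could look roughly like this in Erlang. This is an illustrative sketch, not the code from the talk: the module `route_demo`, the `{To, Body}` message shape, and the per-process list standing in for `add_update` are all assumptions.

```erlang
-module(route_demo).
-export([start_users/1, broadcast/2, collect/2, user_loop/2]).

%% Spawn one lightweight process per user id; each keeps its own state.
start_users(UserIds) ->
    [{Id, spawn(?MODULE, user_loop, [Id, []])} || Id <- UserIds].

%% As a new message arrives, message-pass it to all N pids.
broadcast(Users, Msg) ->
    lists:foreach(fun({_Id, Pid}) -> Pid ! {msg, Msg} end, Users).

%% Each process decides on its own whether a message matches.
user_loop(Id, Inbox) ->
    receive
        {msg, {To, Body}} when To =:= Id ->
            user_loop(Id, [Body | Inbox]);   % match: add_update
        {msg, _NotForMe} ->
            user_loop(Id, Inbox);            % no match: ignore
        {dump, From} ->
            From ! {inbox, Id, lists:reverse(Inbox)},
            user_loop(Id, Inbox);
        stop ->
            ok
    end.

%% Ask one user process what it has collected so far.
collect(Users, Id) ->
    {Id, Pid} = lists:keyfind(Id, 1, Users),
    Pid ! {dump, self()},
    receive
        {inbox, Id, Inbox} -> Inbox
    after 1000 ->
        timeout
    end.
```

The nested loop disappears: the sender does M sends per process, and the N match checks run concurrently across however many schedulers are available.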
3 papers to rule them all & 1 garbage collection method to free them!
EUREKA!!! we have a winner
“ahh... so this is what it's all about: no shared memory in Erlang, lightweight processes that are garbage collected easily since they don't hold references to data in each other's process heaps, & messages copied or shared based on their size and likelihood of reuse, with optimizations for binaries. Tell me more!!”
“How do you spawn a process?”
“Where can you spawn a process?”
“Can a spawned process talk back to the caller?”
“Can a spawned process listen as long as I want it to?”
“Can a spawned process stop listening when I want it to?”
“Can a spawned process spawn more processes?”
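One way to answer the questions above in code. This is a sketch rather than the demo shown in the talk; the module `spawn_demo` and its message shapes are invented for illustration, while `spawn/3` and `spawn/4` are the standard BIFs.

```erlang
-module(spawn_demo).
-export([start/0, start_on/1, ping/1, spawn_child/1, stop/1, loop/0]).

%% How: spawn/3 takes a module, a function name and an argument list.
start() ->
    spawn(?MODULE, loop, []).

%% Where: spawn/4 starts the process on another connected node.
start_on(Node) ->
    spawn(Node, ?MODULE, loop, []).

%% The process listens for as long as it keeps calling itself.
loop() ->
    receive
        {ping, From} ->
            From ! {pong, self()},             % talk back to the caller
            loop();
        {spawn_child, From} ->
            Child = spawn(?MODULE, loop, []),  % spawn more processes
            From ! {child, Child},
            loop();
        stop ->
            ok                                 % no recursion: it ends here
    end.

ping(Pid) ->
    Pid ! {ping, self()},
    receive {pong, Pid} -> pong after 500 -> timeout end.

spawn_child(Pid) ->
    Pid ! {spawn_child, self()},
    receive {child, Child} -> Child after 500 -> timeout end.

stop(Pid) ->
    Pid ! stop,
    ok.
```

After `stop/1`, a further `ping/1` to the same pid simply times out, since messages to a dead process are dropped.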
“So though Erlang gives you a library called OTP & a db called Mnesia for making life easier, you can also parse or create binaries easily, make client-server programs, distributed RPC calls, tail-recursive servers, message/priority queues for flow control, talk to ports and other languages, or create any data structure explicitly (a) in-memory or (b) on-disk on any connected node!”
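A hand-rolled client-server pair with a tail-recursive server loop and a client-side timeout, as a minimal sketch of the plain-Erlang route described above. The module `kv_server`, its `store`/`fetch` API, and the list-based store are assumptions for illustration; in production one would more likely reach for OTP's `gen_server`.

```erlang
-module(kv_server).
-export([start/0, store/3, fetch/2, loop/1]).

start() ->
    spawn(?MODULE, loop, [[]]).

%% Server: the state is threaded through a tail-recursive call, so the
%% loop can run indefinitely without growing the stack.
loop(Store) ->
    receive
        {From, Ref, {store, Key, Value}} ->
            From ! {Ref, ok},
            loop(lists:keystore(Key, 1, Store, {Key, Value}));
        {From, Ref, {fetch, Key}} ->
            From ! {Ref, lists:keyfind(Key, 1, Store)},
            loop(Store)
    end.

%% Client API.
store(Pid, Key, Value) -> call(Pid, {store, Key, Value}).
fetch(Pid, Key)        -> call(Pid, {fetch, Key}).

%% Tag each request with a unique reference so that only the matching
%% reply is accepted, and give up if the server does not answer in time.
call(Pid, Request) ->
    Ref = make_ref(),
    Pid ! {self(), Ref, Request},
    receive
        {Ref, Reply} -> Reply
    after 1000 ->
        {error, timeout}
    end.
```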
“show me the demos”
● Process related
– Message queues, Client-server
– RPC, Timeouts
● Binary
– Binary pattern matching, Parse swf/mp3 for metadata
– Networking, comm. with C, Tokyo Cabinet client e.g.
● Process + Binary!
– Building a production-ready in-memory CDN consistently faster than Am4z0n cl0udfr0nt, in stages
open & gzip < concat js's < in-memory < streaming?
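As a taste of the "parse swf for metadata" item, here is a small sketch that reads the first eight bytes of a SWF file. It is not the demo from the talk. The module name `swf_header` is hypothetical, and the layout used (a 3-byte `FWS`/`CWS` signature, a 1-byte version, then a 4-byte little-endian file length) is my understanding of the published SWF format.

```erlang
-module(swf_header).
-export([parse/1]).

%% The signature tells us whether the body is zlib-compressed.
parse(<<"FWS", Rest/binary>>) -> header(false, Rest);
parse(<<"CWS", Rest/binary>>) -> header(true, Rest);
parse(_Other)                 -> {error, not_swf}.

%% Version is one byte; the file length is a 32-bit little-endian integer.
header(Compressed, <<Version:8, Length:32/little-unsigned, _/binary>>) ->
    {ok, [{compressed, Compressed},
          {version, Version},
          {file_length, Length}]};
header(_Compressed, _TooShort) ->
    {error, truncated}.
```

With a real file this would be fed from `file:read_file/1`; only the leading bytes are inspected.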
“Binary pattern matching ?”
<<Value:Size/Type-Signedness-Endianness-unit:Unit>>
<<1:32>> = <<0,0,0,1>>.
<<1:32/unsigned-little>> = <<1,0,0,0>>.
<<_:8, "mnesia">> = <<"Amnesia">>.
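The same syntax scales up to real protocol headers. Below is a sketch of matching an IPv4 header in a single function clause. The module name `ipv4_header` and the choice of returned fields are illustrative; the field widths follow the standard IPv4 header layout.

```erlang
-module(ipv4_header).
-export([parse/1]).

%% Sizes are in bits; integers default to unsigned big-endian, which is
%% network byte order. The literal 4:4 only matches IPv4 packets.
parse(<<4:4, IHL:4, _TOS:8, TotalLen:16,
        _Id:16, _Flags:3, _FragOffset:13,
        TTL:8, Protocol:8, _Checksum:16,
        S1:8, S2:8, S3:8, S4:8,
        D1:8, D2:8, D3:8, D4:8,
        Rest/binary>>) when IHL >= 5 ->
    %% IHL counts 32-bit words; anything beyond 5 words is options.
    OptsLen = (IHL - 5) * 4,
    case Rest of
        <<_Opts:OptsLen/binary, Payload/binary>> ->
            {ok, [{total_length, TotalLen},
                  {ttl, TTL},
                  {protocol, Protocol},
                  {src, {S1, S2, S3, S4}},
                  {dst, {D1, D2, D3, D4}},
                  {payload, Payload}]};
        _ ->
            {error, truncated}
    end;
parse(_) ->
    {error, not_ipv4}.
```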
So <<Bin>> could be unicode characters (English, Hindi, Tamil), JPGs, HTTP headers, or basically any segments of binaries
NewBinary = <<Segment1/binary, Segment2/binary>>.
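Construction works the same way in reverse. As a sketch, here is a GET request for the Memcached binary protocol built from a header segment and a key segment. The module and function names are hypothetical, and the 24-byte header layout is the one I recall from the Memcached binary protocol description, so treat the field comments as assumptions to verify.

```erlang
-module(mc_request).
-export([get_request/1]).

%% Build a binary-protocol GET: a fixed 24-byte header followed by the key.
get_request(Key) when is_binary(Key) ->
    KeyLen = byte_size(Key),
    Header = <<16#80:8,     % magic: request
               16#00:8,     % opcode: GET
               KeyLen:16,   % key length
               0:8,         % extras length
               0:8,         % data type
               0:16,        % reserved
               KeyLen:32,   % total body length (extras + key + value)
               0:32,        % opaque, echoed back by the server
               0:64>>,      % CAS
    %% Binary segments need the /binary type when spliced together.
    <<Header/binary, Key/binary>>.
```

The result can be written straight to a socket with `gen_tcp:send/2`.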
Summary of tech at hover.in
● LYME stack since ~Dec '07, 4 (1) nodes (64-bit, 4GB)
● Python crawler + associated NLP parsers, indexes now in Tokyo Cabinet, inverted indexes in Erlang's Mnesia db with binaries of 5 different Indian languages + multiple content types, CPU time-splicing algos, priority queues for heat-seeking algo, flow control, caching engines, cyclic queues, mapreduces with non-blocking gathers, headless Firefox for thumbnails, patches to Tokyo Cabinet client 'medici'
● Beta in Jan '09, 1 million hovers/month in May '09
● 24 developers + several interns across ~2 years