TRANSCRIPT
Maxym Kharchenko & m@ team
Writing efficient Python code with pipelines and generators
Agenda
Style
Efficiency
Simplicity
Pipelines
Python is all about streaming (a.k.a. iteration)
Streaming in Python

# Lists
db_list = ['db1', 'db2', 'db3']
for db in db_list:
    print(db)

# Dictionaries
host_cpu = {'avg': 2.34, 'p99': 98.78, 'min': 0.01}
for stat in host_cpu:
    print("%s = %s" % (stat, host_cpu[stat]))

# Files, strings
file = open("/etc/oratab")
for line in file:
    for word in line.split(" "):
        print(word)

# Whatever is coming out of get_things()
for thing in get_things():
    print(thing)
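All of these loops run on the same underlying mechanism: the iterator protocol. A minimal sketch of what a for-loop desugars to:

```python
# A for-loop is sugar for iter() + next()
db_list = ['db1', 'db2', 'db3']

it = iter(db_list)   # ask the list for an iterator
print(next(it))      # db1
print(next(it))      # db2
print(next(it))      # db3
# One more next(it) would raise StopIteration,
# which is exactly what terminates a for-loop.
```

Anything that implements this protocol — lists, dicts, files, generators — can be consumed by the same for-loop.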
Quick example: Reading records from a file
def print_databases():
    """ Read /etc/oratab and print database names """
    file = open("/etc/oratab", 'r')
    while True:
        line = file.readline()  # Get next line
        # An empty string (with no trailing newline) means end of file
        if len(line) == 0 and not line.endswith('\n'):
            break
        # Parse an oratab line into components
        db_line = line.strip()
        db_info_array = db_line.split(':')
        db_name = db_info_array[0]
        print(db_name)
    file.close()
Reading records from a file: with “streaming”

def print_databases():
    """ Read /etc/oratab and print database names """
    with open("/etc/oratab") as file:
        for line in file:
            print(line.strip().split(':')[0])
Style matters!
OK, let’s do something useful with streaming. We have a bunch of ORACLE listener logs.
Let’s parse them for “client IPs”:
21-AUG-2015 21:29:56 * (CONNECT_DATA=(SID=orcl)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.107.137.91)(PORT=43105)) * establish * orcl * 0
And find where the clients are coming from
First attempt at listener log parser

def parse_listener_log(log_name):
    """ Parse listener log and return clients """
    client_hosts = []
    with open(log_name) as listener_log:
        for line in listener_log:
            host_match = <regex magic>
            if host_match:
                host = <regex magic>
                client_hosts.append(host)
    return client_hosts
MEMORY WASTE! Stores all results until return.
BLOCKING! Does NOT return until the entire log is processed.
Generators for efficiency

def parse_listener_log(log_name):
    """ Parse listener log and return clients """
    client_hosts = []
    with open(log_name) as listener_log:
        for line in listener_log:
            host_match = <regex magic>
            if host_match:
                host = <regex magic>
                client_hosts.append(host)
    return client_hosts
Generators for efficiency

def parse_listener_log(log_name):
    """ Parse listener log and yield clients """
    with open(log_name) as listener_log:
        for line in listener_log:
            host_match = <regex magic>
            if host_match:
                host = <regex magic>
                yield host    # Add this!
Generators in a nutshell

def test_generator():
    """ Test generator """
    print("ENTER()")
    for i in range(5):
        print("yield i=%d" % i)
        yield i
    print("EXIT()")

# MAIN
for i in test_generator():
    print("RET=%d" % i)

Output:
ENTER()
yield i=0
RET=0
yield i=1
RET=1
yield i=2
RET=2
yield i=3
RET=3
yield i=4
RET=4
EXIT()
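One detail worth stressing: calling a generator function does not run its body. Execution starts only on the first next() and pauses at each yield. A small sketch (names are illustrative):

```python
def gen():
    """ Tiny generator to show lazy execution """
    print("ENTER()")
    yield 1
    yield 2

g = gen()        # nothing printed yet -- the body has not started
value = next(g)  # prints ENTER(), runs up to the first yield
print(value)     # 1
print(next(g))   # 2
```

This is why the generator version above interleaves "yield" and "RET" lines: control bounces between producer and consumer one record at a time.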
Non-generators in a nutshell

def test_nongenerator():
    """ Test non-generator """
    result = []
    print("ENTER()")
    for i in range(5):
        print("add i=%d" % i)
        result.append(i)
    print("EXIT()")
    return result

# MAIN
for i in test_nongenerator():
    print("RET=%d" % i)

Output:
ENTER()
add i=0
add i=1
add i=2
add i=3
add i=4
EXIT()
RET=0
RET=1
RET=2
RET=3
RET=4
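The memory difference between the two styles is easy to see with sys.getsizeof; a quick sketch (exact sizes vary by Python version):

```python
import sys

nums_list = list(range(1_000_000))        # materializes every element
nums_gen = (i for i in range(1_000_000))  # stores only iteration state

print(sys.getsizeof(nums_list))  # megabytes
print(sys.getsizeof(nums_gen))   # a couple hundred bytes, regardless of range
```

The generator's footprint is constant no matter how many records flow through it.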
Generators to Pipelines

[Diagram] A three-stage pipeline:
Generator (extractor), 1 second per record
  -> 100,000 records (1st: 1 second)
Generator (filter: passes 1 in 2), 2 seconds per record
  -> 50,000 records (1st: 5 seconds)
Generator (mapper), 5 seconds per record
  -> 50,000 records (1st: 10 seconds)
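The latency claim can be sketched without real sleeps: with generators, the first record flows through every stage before the second record is even extracted. A minimal sketch with hypothetical stage names:

```python
events = []

def extractor():
    """ GENERATOR: produce raw records """
    for i in range(4):
        events.append('extract %d' % i)
        yield i

def mapper(records):
    """ GENERATOR: transform each record """
    for r in records:
        events.append('map %d' % r)
        yield r * 10

# Pull just the FIRST record through the whole pipeline
first = next(mapper(extractor()))
print(first)   # 0
print(events)  # ['extract 0', 'map 0'] -- record 1 was never extracted
```

A list-based version would log every 'extract' before the first 'map'; the generator version gives you the first result after one trip through the stages.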
Generator pipelining in Python

file_handles = open_files(LISTENER_LOGS)
log_lines = extract_lines(file_handles)
client_hosts = extract_client_ips(log_lines)

for host in client_hosts:
    print(host)
[Diagram] File names -> [Open files] -> File handles -> [Extract lines] -> File lines -> [Extract IPs] -> Client IPs
Generators for simplicity

def open_files(file_names):
    """ GENERATOR: file name -> file handle """
    for file_name in file_names:
        yield open(file_name)
Generators for simplicity

def extract_lines(file_handles):
    """ GENERATOR: file handles -> file lines
        Similar to UNIX: cat file1, file2, ... """
    for file in file_handles:
        for line in file:
            yield line
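As a side note, this flattening pattern is exactly what the standard library's itertools.chain.from_iterable does; a sketch of the equivalence:

```python
import itertools

# Two iterables of lines, merged into one stream -- same shape as
# extract_lines() but in a single standard-library call
merged = list(itertools.chain.from_iterable([['a\n', 'b\n'], ['c\n']]))
print(merged)  # ['a\n', 'b\n', 'c\n']
```

Writing the generator by hand, as on the slide, keeps the stage self-documenting; chain.from_iterable is the shortcut once the idiom is familiar.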
Generators for simplicity

def extract_client_ips(lines):
    """ GENERATOR: Extract client host """
    host_regex = re.compile(r'\(HOST=(\S+)\)\(PORT=')
    for line in lines:
        line_match = host_regex.search(line)
        if line_match:
            yield line_match.group(1)
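Putting the stages together, here is a self-contained sketch that pipes a fake listener log through the pipeline (io.StringIO stands in for real file handles, and the sample line is shortened from the log format shown earlier):

```python
import io
import re

def extract_lines(file_handles):
    """ GENERATOR: file handles -> file lines """
    for f in file_handles:
        for line in f:
            yield line

def extract_client_ips(lines):
    """ GENERATOR: file lines -> client IPs """
    host_regex = re.compile(r'\(HOST=(\S+)\)\(PORT=')
    for line in lines:
        match = host_regex.search(line)
        if match:
            yield match.group(1)

# Fake log "file" -- one matching line, one non-matching line
log = io.StringIO(
    '21-AUG-2015 21:29:56 * (ADDRESS=(PROTOCOL=tcp)'
    '(HOST=10.107.137.91)(PORT=43105)) * establish * orcl * 0\n'
    'some line without a host\n'
)

hosts = list(extract_client_ips(extract_lines([log])))
print(hosts)  # ['10.107.137.91']
```

No stage knows about the others; each consumes a stream and yields a stream, which is what makes the wiring on the previous slide trivial.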
Developer’s bliss: simple input, simple output, trivial function body.
Then, pipeline the results.
But, really …
[Diagram] The real pipeline grows more stages:
[Locate files] -> File names -> [Open files] -> File handles -> [Extract lines] -> File lines
-> [Filter db=orcl] -> db=orcl lines -> [Filter proto=TCP] -> db=orcl & proto=TCP lines
-> [Extract clients] -> Client IPs -> [IP -> host name] -> Client hosts
-> [Db writer] and [Text writer], each consuming Client hosts
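Extra stages like “Filter db=orcl” slot in as more generators of the same shape. A hypothetical sketch (the real log lines carry more fields; filter_db and the match-on-SID rule are illustrative):

```python
def filter_db(lines, db_name='orcl'):
    """ GENERATOR: keep only lines that mention one database (SID) """
    needle = 'SID=%s' % db_name
    for line in lines:
        if needle in line:
            yield line

lines = [
    '(CONNECT_DATA=(SID=orcl)) * establish',
    '(CONNECT_DATA=(SID=test)) * establish',
]
matched = list(filter_db(lines))
print(matched)  # only the SID=orcl line survives
```

Because every stage has the same stream-in/stream-out signature, inserting or removing a filter never touches the code of its neighbors.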
Why generators?
Simple functions that are easy to write and understand
Non-blocking operations: TOTAL execution time is faster; FIRST RESULTS arrive much faster
Efficient use of memory
Potential for parallelization and ASYNC processing
Special thanks to David Beazley …
For this: http://www.dabeaz.com/generators-uk/GeneratorsUK.pdf
Thank you!