Lazy Data Using Perl
DESCRIPTION
Your data's lifecycle is an important part of making it usable. Lazy data is loaded once, when it is needed, and does not need to be checked every time it is used. This talk describes using Trampoline objects, subroutines and state variables to initialize the data structure only once.
TRANSCRIPT
Virtuous Laziness For Your Data
Doing it once and knowing you've done it.
Steve Lembark
Workhorse [email protected]
Data wants to be free; it also wants to be very expensive.
● Data you never use is free; accessing it gets expensive.
● Managing data's cost requires controlling its lifecycle.
● Mismanaging the lifecycle causes problems:
    ● Caching table lookups in a forked httpd crashes your database.
    ● Preprocessing message files kills your startup times.
    ● Reading XML configuration data you never use wastes most of your testing budget.
    ● Your tests fail because of a connection failure to a database server that you don't use in the test.
● Fixing these problems requires using lazy data.
False Laziness
● We have all seen the two most common data management strategies:
    ● You load everything at startup to avoid checking it every time it is used.
    ● You check everything every time before using it to avoid loading it up front.
● Both approaches ignore important knowledge:
    ● You know when the data is needed.
    ● You know that the data was loaded.
One alternative: Scalar Cache
● $cache ||= read_cache_values;
● Seems nice: You know that $cache is populated.
● This requires dereferencing a hashref throughout the program, which is expensive.
● What you'd rather do is just use $cache{ $foobar } without having to check it every time.
● Even checking if( ! %cache ) is expensive – arrays are much cheaper to check.
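A minimal sketch of the scalar-cache idiom described above. The loader name read_cache_values comes from the slide; its body, the sample data, the $loads counter, and the lookup helper are assumptions made up for illustration.

```perl
use strict;
use warnings;

my $loads = 0;      # illustration only: counts how often the loader runs
my $cache;          # stays undef until the first lookup

# Stand-in for the slide's read_cache_values; the data is made up.
sub read_cache_values
{
    ++$loads;
    return { red => '#f00', green => '#0f0' };
}

# Hypothetical accessor: every call pays for the ||= test and the
# hashref dereference, even though the load itself happens only once.
sub lookup
{
    my $key = shift;

    $cache ||= read_cache_values();

    return $cache->{ $key };
}

my $red   = lookup( 'red' );
my $green = lookup( 'green' );

print "$red $green (loaded $loads time(s))\n";
```

The load happens once, but the check and the dereference are repeated on every access, which is the cost the later slides try to remove.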
True Laziness: Do Something Once
● Truly Lazy data means loading your data when you need it and knowing that it is loaded.
● Perl gives us – of course – more than one way:
    ● Trampoline objects and subroutines.
    ● “state” variables, introduced in v5.10.
● Trampolines are flyweight objects – data structures or subroutines that transform themselves when used.
● State variables are assigned only once, at runtime, the first time they are used.
Follow The Bouncing Object
● Object::Trampoline – flyweight data you don't know isn't there.
● These delay calling the “real” constructor until the object is actually used.
    ● Your constructor gets called before the first method call.
    ● At that point it can cache, parse, or compute whatever it needs.
● Spreads the cost of loading each data set over the lifetime of a process.
Example: Delay Expensive XML
● Use an initializer to read and parse the XML.
● Object::Trampoline calls your constructor once to transform the object the first time you call a method.
● Parsing the XML is pushed off until you actually use the data.
● Requires using a hashref – and possibly methods – to access the data.
    package XML::Message;
    use Object::Trampoline;
    ...

    # Hand back a trampoline; install() runs on the first method call.
    sub new
    {
        Object::Trampoline->install( 'XML::Message', $path );
    }

    # The "real" constructor: build the object, then load the XML.
    sub install
    {
        my $error = &construct;

        $error->initialize( @_ )
    }

    sub initialize
    {
        my ( $err, $path ) = @_;

        %$err = %{ XMLin $path => @lots_of_args };

        $err
    }

    # calling translate bounces the trampoline exactly once

    sub translate { … }
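To make the bounce visible without installing anything, here is a hand-rolled sketch of the same idea. It is not the implementation of Object::Trampoline; My::Trampoline, its prepare method, and the My::Messages class are hypothetical names invented for this example. The sketch relies on the method invocant in @_ being an alias for the caller's variable, so assigning to $_[0] replaces the caller's object in place.

```perl
use strict;
use warnings;

package My::Trampoline;     # hypothetical stand-in, not the CPAN module

our $AUTOLOAD;

# Wrap the real class and its constructor arguments in a blessed closure.
sub prepare
{
    my ( undef, $class, @args ) = @_;

    return bless sub { $class->new( @args ) }, __PACKAGE__;
}

# Any other method call lands here: build the real object, overwrite the
# caller's variable through the @_ alias, then redispatch the method.
sub AUTOLOAD
{
    my $name = ( split /::/, $AUTOLOAD )[-1];

    $_[0] = $_[0]->();

    my $method = $_[0]->can( $name )
        or die "Unknown method: '$name'\n";

    goto &$method;
}

sub DESTROY {}              # keep DESTROY from reaching AUTOLOAD

package My::Messages;       # hypothetical class with a costly constructor

my $constructed = 0;        # illustration only: counts constructor calls

sub new
{
    my ( $class, %table ) = @_;

    ++$constructed;

    my $self = { %table };
    return bless $self, $class;
}

sub translate   { $_[0]->{ $_[1] } }
sub constructed { $constructed }

package main;

my $msg    = My::Trampoline->prepare( 'My::Messages', hello => 'bonjour' );
my $before = My::Messages::constructed();
my $word   = $msg->translate( 'hello' );
my $after  = My::Messages::constructed();

print "before=$before word=$word after=$after class=", ref( $msg ), "\n";
```

Nothing is constructed until translate is called; after that call $msg holds a real My::Messages object and later method calls go straight to the class.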
Trampoline Subroutine
● Similar to a trampoline object: A portion of the code runs once and replaces itself.
● One-shot code initializes the cache, which can now be a simple hash.
● Symbol::qualify_to_ref makes this painless in Perl.
    ● An anonymous sub manages the cache.
    ● A named subroutine loads the cache, replaces itself with the manager, and redispatches to the manager.
The Simplest Version
● Minimal code includes:
    ● Closed-over cache,
    ● Subref to permanent cache manager,
    ● Initial subroutine.
● The initial subroutine initializes, installs, and redispatches.
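    use Symbol qw( qualify_to_ref );

    my %foo_cache = ();

    # permanent manager: a plain lookup in the closed-over hash
    my $handler = sub
    {
        my $foo = shift;

        $foo_cache{ $foo }
        or die "Bogus foo: '$foo' unknown"
    };

    # runs once: load the cache, install the manager, redispatch
    sub do_something
    {
        %foo_cache = initialize_foo_cache();

        my $ref = qualify_to_ref 'do_something';

        no warnings 'redefine';
        *$ref = $handler;

        goto &$handler
    }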
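The simplest version above can be exercised end to end with a stubbed initializer. In this sketch the body of initialize_foo_cache, its sample data, and the $init_count counter are assumptions added only to show that the loader runs a single time.

```perl
use strict;
use warnings;
use Symbol qw( qualify_to_ref );

my $init_count = 0;         # illustration only: counts initializer runs
my %foo_cache  = ();

# Stand-in for the slide's initialize_foo_cache; the data is made up.
sub initialize_foo_cache
{
    ++$init_count;
    return ( apple => 'fruit', carrot => 'root' );
}

# Permanent manager: a plain hash lookup with no "is it loaded?" test.
my $handler = sub
{
    my $foo = shift;

    $foo_cache{ $foo } or die "Bogus foo: '$foo' unknown\n";
};

# One-shot: load the cache, install the manager under this name,
# then redispatch the current call with @_ intact.
sub do_something
{
    %foo_cache = initialize_foo_cache();

    my $ref = qualify_to_ref 'do_something';

    no warnings 'redefine';
    *$ref = $handler;

    goto &$handler;
}

my $first  = do_something( 'apple' );
my $second = do_something( 'carrot' );

print "$first $second (initialized $init_count time(s))\n";
```

The first call pays for the load; every later call goes directly to the anonymous manager because the symbol table entry now points at it.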
BEGIN Blocks Are Cleaner
● The block isolates cache, ref, and sub variables; allows recycling the ref.
● This is also rather amenable to installation by module.
    BEGIN
    {
        my $name  = 'foo_handler';
        my $ref   = qualify_to_ref $name;
        my %cache = ();

        # permanent manager
        my $handler = sub
        {
            $cache{ $_[0] } or die ...
        };

        # one-shot: load, install the manager, redispatch
        *$ref = sub
        {
            %cache = init_the_cache();

            *$ref = $handler;

            goto &$handler
        }
    }
Sub::Trampoline
● Aside from the actual assignment, init code is identical.
● Simply pass the name, manager, and init assignment.
● The module calls $init, then replaces itself with the manager.
● Caller defines the cache and handler.
    sub install_trampoline
    {
        my ( $name, $init, $mgr ) = @_;

        my $caller = caller;
        my $ref    = qualify_to_ref $name, $caller;

        *$ref = sub
        {
            $init->();

            *$ref = $mgr;   # glob assignment swaps in the manager

            goto &$mgr
        }
    }
Using Sub::Trampoline

    use Carp;
    use List::Util qw( first );
    use Sub::Trampoline;

    my %cache1 = ();
    my @cache2 = ();

    my $cache1_mgr = sub { $cache1{ $_[0] } or croak "Unknown '$_[0]'" };

    # copy the argument first so the first {} block does not depend on @_
    my $cache2_mgr = sub { my $want = shift; first { $_ eq $want } @cache2 };

    my $init_cache1 = sub { %cache1 = select_from_hell() };
    sub init_cache2 { @cache2 = XMLin $nasty_messy_xml_struct }

    install_trampoline( subone => $init_cache1, $cache1_mgr );
    install_trampoline( known  => \&init_cache2, $cache2_mgr );

    # the names are installed at runtime, so call them with parentheses
    my $value1 = subone( $key1 );

    if( known( $value ) ) { … } else { carp "Unknown: '$value'" }
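A self-contained sketch of the installer and one use of it. The installer is defined locally so the example runs on its own, rather than coming from a Sub::Trampoline module; the color_of name, the sample data, and the $loads counter are hypothetical.

```perl
use strict;
use warnings;
use Carp;
use Symbol qw( qualify_to_ref );

# Local copy of the installer so the sketch needs no extra module.
sub install_trampoline
{
    my ( $name, $init, $mgr ) = @_;

    my $ref = qualify_to_ref( $name, scalar caller );

    no warnings 'redefine';

    *$ref = sub
    {
        $init->();          # populate the caller's cache
        *$ref = $mgr;       # swap in the permanent manager
        goto &$mgr;         # finish the current call with it
    };

    return;
}

my $loads  = 0;             # illustration only: counts initializer runs
my %colors = ();

my $init = sub
{
    ++$loads;
    %colors = ( red => '#f00', blue => '#00f' );    # made-up data
};

my $mgr = sub { $colors{ $_[0] } or croak "Unknown '$_[0]'" };

install_trampoline( color_of => $init, $mgr );

# color_of is installed at runtime, so it is called with parentheses.
my $red  = color_of( 'red' );
my $blue = color_of( 'blue' );

print "$red $blue (loaded $loads time(s))\n";
```

The caller owns both the cache and the manager; the installer only needs the name and the two code references.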
True OneShot: Empty $handler
● If you want to run something exactly once but don't know where it might be called initially:
my $manager = sub(){};
● You can also substitute a trampoline object with a constructor that does the work and no methods.
● Calling the object once constructs it, after which the class's constructor can stub itself.
● Useful for sharing the cache variable: the init populates it once and stubs itself to do nothing more.
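A short sketch of the empty-handler case, assuming a hypothetical setup_once routine that two unrelated subs both call first. The sub names, the return strings, and the $runs counter are invented for illustration; the stub here is a plain empty sub without a prototype.

```perl
use strict;
use warnings;
use Symbol qw( qualify_to_ref );

my $runs = 0;               # illustration only: counts real setup runs

{
    my $ref  = qualify_to_ref 'setup_once';
    my $stub = sub {};      # permanent replacement that does nothing

    *$ref = sub
    {
        ++$runs;            # stand-in for the real one-time work

        no warnings 'redefine';
        *$ref = $stub;

        return;
    };
}

# Either sub may be the first caller; neither has to know which.
sub handle_request { setup_once(); return "handled $_[0]" }
sub run_report     { setup_once(); return "report $_[0]"  }

my @out = ( run_report( 'a' ), handle_request( 'b' ), run_report( 'c' ) );

print "$_\n" for @out;
print "setup ran $runs time(s)\n";
```

Whichever caller arrives first triggers the setup; every later call hits the empty stub.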
Cycling The Cache
● There are times when you want to purge and reinitialize the cache.
● A trampoline object with populate and use subs that flip-flop can handle this easily.
● Reassigning a trampoline reinitializes the cache:
    $cache = Object::Trampoline->init_cache( $class => @argz )
        if $age > $time_max;
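The flip-flop idea can also be sketched with plain subroutines instead of a trampoline object, which is a substitution made here to keep the example free of extra modules. The names lookup and purge, the $fetch and $populate subrefs, and the sample data are all hypothetical.

```perl
use strict;
use warnings;
use Symbol qw( qualify_to_ref );

my $loads = 0;              # illustration only: counts cache loads
my %cache = ();
my $ref   = qualify_to_ref 'lookup';

# "use" state: a plain lookup.
my $fetch = sub { $cache{ $_[0] } };

# "populate" state: load, flip to the use state, redispatch.
my $populate = sub
{
    ++$loads;
    %cache = ( answer => 42 );      # made-up data

    no warnings 'redefine';
    *$ref = $fetch;

    goto &$fetch;
};

# Hypothetical purge: drop the data and flip back to the populate state.
sub purge
{
    %cache = ();

    no warnings 'redefine';
    *$ref = $populate;

    return;
}

*$ref = $populate;          # start in the populate state

my $first  = lookup( 'answer' );    # loads the cache
my $second = lookup( 'answer' );    # plain lookup
purge();
my $third  = lookup( 'answer' );    # loads again

print "$first $second $third after $loads load(s)\n";
```

A real purge would be driven by something like the age test on the slide; here it is called by hand to show the second load.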
v5.10 Introduced “state” Variables
● Scoped like a lexical.
● Assigned once at runtime.
● Maintains its value within a single lexical context throughout the program.
● Assign the cache or assign a flag variable with the side effect of populating the cache.
● As of v5.10, only scalars can be initialized this way; initializing state arrays and hashes arrived in v5.28.
Obvious case: Assign the cache.
● Assign the cache at runtime:
    sub cache_mangler { state $cache = init_cache; … }
● $cache will be assigned at runtime, the first time cache_mangler is called.
● The value will be retained between calls.
● Catch: $cache is only available within cache_mangler, not outside of it.
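A runnable sketch of the obvious case. The names cache_mangler and init_cache come from the slide; the body of init_cache, its sample data, and the $loads counter are assumptions added to show that the assignment happens once.

```perl
use v5.10;
use strict;
use warnings;

my $loads = 0;              # illustration only: counts init_cache runs

# Stand-in loader; the data is made up.
sub init_cache
{
    ++$loads;
    return { one => 1, two => 2 };
}

sub cache_mangler
{
    # Evaluated only on the first call; the hashref persists afterwards.
    state $cache = init_cache();

    my $key = shift;

    return $cache->{ $key };
}

my $one = cache_mangler( 'one' );
my $two = cache_mangler( 'two' );

say "$one $two (loaded $loads time(s))";
```

The cache is a hashref because a state scalar is the form v5.10 can initialize; it is visible only inside cache_mangler.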
Initialize a Shared Cache
● Subs may want to share a cache.
● $y and $z are assigned at most once per execution, when foo or bar is called.
● The sanity check in init only needs to be handled at most once per subroutine.

    use v5.10;

    my %cache = ();

    sub init
    {
        %cache or %cache = …;
    }

    sub foo { state $y = init; … }
    sub bar { state $z = init; … }

    …

    my $foo = foo 'bletch';
    my $bar = bar 'blort';
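The shared-cache pattern can be run end to end as follows. The sample data, the $checks and $loads counters, and the bodies of foo and bar are assumptions; the state variable is named $ready here to stress that it is only a flag whose assignment has the side effect of populating the shared hash.

```perl
use v5.10;
use strict;
use warnings;

my $checks = 0;             # illustration only: counts calls to init
my $loads  = 0;             # illustration only: counts real loads
my %cache  = ();

# Sanity check plus load; returns a true flag for the state variables.
sub init
{
    ++$checks;

    %cache or do
    {
        ++$loads;
        %cache = ( bletch => 1, blort => 2 );   # made-up data
    };

    return 1;
}

sub foo { state $ready = init(); return $cache{ $_[0] } }
sub bar { state $ready = init(); return $cache{ $_[0] } }

my $foo  = foo( 'bletch' );
my $bar  = bar( 'blort' );
my $foo2 = foo( 'blort' );

say "$foo $bar $foo2 (checks=$checks loads=$loads)";
```

init runs once for foo and once for bar, but only the first of those calls actually loads the data.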
Summary
● True laziness includes managing data.
● Preloading it all or testing it at each step are not lazy.
● Object::Trampoline provides one way.
● Trampoline subroutines offer another approach.
● v5.10 introduced state variables which provide a few ways to initialize something once.