pygrametl

www.pygrametl.org

The code has a few functions defined at the top. After the functions, the pygrametl Dimension, FactTable, and Source objects are created. Using these objects, the main method only requires 10 lines of code to load the data warehouse (DW). Note how easy it is to fill the page dimension, which is slowly changing and snowflaked.
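The page dimension itself is declared further up in the full script and is not part of this excerpt. As a rough sketch of how a snowflaked, slowly changing dimension can be set up with pygrametl's SlowlyChangingDimension and SnowflakedDimension classes (the table names, attribute names, and the exact snowflake levels below are assumptions, not the code from the full example):

from pygrametl.tables import CachedDimension, SlowlyChangingDimension, \
    SnowflakedDimension

# Assumed outer levels of the snowflake (ordinary cached dimensions)
topleveldim = CachedDimension(
    name='topleveldomain', key='topleveldomainid',
    attributes=['topleveldomain'])

domaindim = CachedDimension(
    name='domain', key='domainid',
    attributes=['domain', 'topleveldomainid'],
    lookupatts=['domain'])

# The page dimension: type-2 slowly changing, versioned on 'version'
pagedim = SlowlyChangingDimension(
    name='page', key='pageid',
    attributes=['url', 'size', 'validfrom', 'validto', 'version', 'domainid'],
    lookupatts=['url'],    # business key used by scdensure
    versionatt='version',  # incremented for each new version of a page
    fromatt='validfrom',   # validity interval of a version
    toatt='validto')

# Snowflake the levels together: page -> domain -> topleveldomain
pagesf = SnowflakedDimension([
    (pagedim, domaindim),
    (domaindim, topleveldim)])

With such a setup, the pagesf.scdensure(row) call in main looks the page up by url and either returns the key of the current version or inserts a new version, filling the referenced snowflake levels as needed.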

testdim = CachedDimension(
    name='test',
    key='testid',
    attributes=['testname', 'testauthor'],
    lookupatts=['testname'],
    prefill=True,
    defaultidvalue=-1)

datedim = CachedDimension(
    name='date',
    key='dateid',
    attributes=['date', 'day', 'month', 'year', 'week', 'weekyear'],
    lookupatts=['date'],
    rowexpander=datehandling)
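The rowexpander used above, datehandling, is one of the functions defined at the top of the script and is not shown in this excerpt. A CachedDimension calls its rowexpander as rowexpander(row, namemapping) when a lookup fails, just before the row is inserted, and the function must return the row with the missing attributes filled in. A minimal sketch, assuming the download dates are strings on the form YYYY-MM-DD:

import datetime
import time

def datehandling(row, namemapping):
    # Derive the calendar attributes from the date string so that the
    # source data only needs to carry the date itself.
    datestr = row[namemapping.get('date') or 'date']
    (year, month, day) = time.strptime(datestr, '%Y-%m-%d')[:3]
    (isoyear, isoweek, _) = datetime.date(year, month, day).isocalendar()
    row['day'] = day
    row['month'] = month
    row['year'] = year
    row['week'] = isoweek
    row['weekyear'] = isoyear
    return row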

facttbl = BulkFactTable(
    name='testresults',
    keyrefs=['pageid', 'testid', 'dateid'],
    measures=['errors'],
    bulkloader=pgcopybulkloader,
    bulksize=5000000)
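pgcopybulkloader is likewise one of the functions defined at the top of the script. A BulkFactTable buffers inserted rows in a temporary file and calls the bulkloader as bulkloader(name, attributes, fieldsep, rowsep, nullval, filehandle) once bulksize rows have been added or the connection is committed. A sketch for PostgreSQL using psycopg2's copy_from, assuming the ConnectionWrapper is bound to the name connection as in main below:

def pgcopybulkloader(name, atts, fieldsep, rowsep, nullval, filehandle):
    # Driver-specific bulk loading: stream the buffered rows into the
    # fact table with PostgreSQL's COPY.
    global connection
    curs = connection.cursor()
    curs.copy_from(file=filehandle, table=name, sep=fieldsep,
                   null=str(nullval), columns=atts)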

# Data sources - change the path if you have your files somewhere else
downloadlog = CSVSource(open('./DownloadLog.csv', 'r', 16384),
                        delimiter='\t')

testresults = CSVSource(open('./TestResults.csv', 'r', 16384),
                        delimiter='\t')

inputdata = MergeJoiningSource(downloadlog, 'localfile', testresults, 'localfile')

def main():
    for row in inputdata:
        extractdomaininfo(row)
        extractserverinfo(row)
        row['size'] = pygrametl.getint(row['size'])  # Convert to an int
        # Add the data to the dimension tables and the fact table
        row['pageid'] = pagesf.scdensure(row)
        row['dateid'] = datedim.ensure(row, {'date': 'downloaddate'})
        row['testid'] = testdim.lookup(row, {'testname': 'test'})
        facttbl.insert(row)
    connection.commit()

if __name__ == '__main__':
    main()
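Finally, the connection that main() commits is created at the top of the script: a PEP 249 connection is wrapped in pygrametl's ConnectionWrapper and registered as the default, so the Dimension and FactTable objects above use it implicitly. A minimal sketch, assuming PostgreSQL through psycopg2 and placeholder credentials:

import psycopg2
import pygrametl

# Shared connection used by all Dimension and FactTable objects
pgconn = psycopg2.connect(user='dwuser', password='dwpass', database='dw')
connection = pygrametl.ConnectionWrapper(pgconn)
connection.setasdefault()

The remaining helpers called in main, extractdomaininfo and extractserverinfo, are also among the functions defined at the top; judging from their names, they derive the domain- and server-related attributes that the snowflaked page dimension needs before the dimension lookups.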