Heritrix 3: librarian features
BnF proposal, March 2015
Context
• Follow-up to our NetarchiveSuite workshop in Tallinn:
  – https://sbforge.org/display/NAS/2015+Workshop+Conclusion
• Identified work packages:
  – tests
  – template migration
  – implementation of important but missing curator features for common operations in Heritrix 3
• BnF will further describe the use cases, share them with the community for feedback, and implement the following features as a minimal Heritrix UI add-on
From H1…
… to H3
Common curator operations
• Search crawl.log
• Add filter on current job (job configuration)
• Change domains/hosts budget (job configuration)
• View or delete frontier URIs
Search crawl.log (NASC61)
• Add a page with the same layout but with 2 additional form fields:
  – Regular expression:
  – Show matches: 1000 (default # of matching URIs)
  – Action => Display URIs (reversed order by default)
• Possibility to refresh display (F5)
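The search behaviour described above can be sketched in a few lines of Python. This is only an illustration of the proposed form fields, not Heritrix code: the function name `search_crawl_log` and its parameters are hypothetical, and the log is assumed to be available as an iterable of text lines.

```python
import re
from typing import Iterable, List, Tuple


def search_crawl_log(lines: Iterable[str], pattern: str,
                     max_matches: int = 1000,
                     reverse: bool = True) -> Tuple[List[str], int]:
    """Return (displayed lines, total number of matching lines).

    pattern     -- the "Regular expression" form field
    max_matches -- the "Show matches" form field (1000 by default)
    reverse     -- newest lines first, the proposed default order
    """
    regex = re.compile(pattern)
    # Keep every line in which the expression is found anywhere.
    matches = [line.rstrip("\n") for line in lines if regex.search(line)]
    if reverse:
        matches.reverse()
    # Only the first max_matches lines are displayed; the total is kept
    # so the page can show "displaying 1-1000 out of N".
    return matches[:max_matches], len(matches)
```

Refreshing the page (F5) would simply call the function again on the current state of the log.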
Draft UI for « Search crawl log »
[Mockup: header with status, job ID and Home link; "Display URIs" button; Forward / Reversed order choice; "Matching lines: 1000" field; pager "Lines: displaying 1-1000 out of 12345" shown above and below the results]
Add filter on current job (DecideRule) (NASC60)
• Not necessary to view active filters that were included from job start (NASC59)
• Add a page containing a rejectTemporarily area working with the following parameters:
  – Decision: REJECT
  – List-logic: OR
  – Regexp-list: empty at job start, free textarea which can be manually edited and sorted (440 px wide, 20 lines)
  – Action => Save: save current filters and activate them for the current job
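The parameters above (decision REJECT, list logic OR, one regular expression per textarea line) can be illustrated with a minimal Python sketch. The names `parse_regex_list` and `decide` are hypothetical, and the sketch assumes that an expression must match the whole URI and that a non-matching URI yields "NONE" (no opinion); both points are assumptions, not statements about the Heritrix implementation.

```python
import re
from typing import List


def parse_regex_list(textarea: str) -> List[re.Pattern]:
    """Turn the textarea content into compiled patterns, one per non-empty line."""
    return [re.compile(line.strip())
            for line in textarea.splitlines()
            if line.strip()]


def decide(uri: str, patterns: List[re.Pattern]) -> str:
    """Apply the list with OR logic: one matching expression is enough to reject."""
    # Assumption: the expression has to match the whole URI.
    if any(p.fullmatch(uri) for p in patterns):
        return "REJECT"
    # Assumption: no match means the filter expresses no opinion.
    return "NONE"
```

With an empty textarea, as at job start, `parse_regex_list` returns an empty list and no URI is rejected.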
Draft UI for « Add filter on current job »
[Mockup: header with status, job ID and Home link; caption "All URIs matching any of the following regular expressions will be rejected from the current job."; "Regular expressions:" textarea; Save button]
Change domains/hosts budget
• Works with queue-total-budget and quota-enforcer systems
• Add a page containing:
  – a list of domains/hosts (in alphabetical order by domain)
  – their associated budget value (which can be edited)
  – only those whose budget differs from the default
  – and a form field to add a new domain/host
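The content of such a page can be sketched as follows. This is an illustrative Python model only: `DEFAULT_BUDGET`, `overridden_budgets` and `set_budget` are hypothetical names, budgets are assumed to be held in a plain dictionary keyed by domain/host, and the default value is taken from the mockup figures.

```python
from typing import Dict, List, Tuple

# Assumption: the job-wide queue-total-budget, here 100 000 URIs.
DEFAULT_BUDGET = 100_000


def overridden_budgets(budgets: Dict[str, int],
                       default: int = DEFAULT_BUDGET) -> List[Tuple[str, int]]:
    """List only the domains/hosts whose budget differs from the default,
    in alphabetical order by domain."""
    return sorted((host, value) for host, value in budgets.items()
                  if value != default)


def set_budget(budgets: Dict[str, int], host: str, value: int) -> None:
    """Edit an existing budget or add a new domain/host (the form field)."""
    budgets[host.strip().lower()] = value
```

A domain whose budget is set back to the default value drops out of the list again, which keeps the page limited to the exceptions.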
Draft UI for « Change domains/hosts budget »
[Mockup: header with status, job ID and Home link; "Budget defined in job configuration: queue-total-budget of 100 000 URIs."; "Budgets of following domains/hosts have been changed in the current job:" followed by editable rows bnf.fr 140 000, ina.fr 139 000, cnc.fr 139 500, each with a Save button; "New domain/host:" field (example: toto.fr – 130 000) with a Save button]
View or delete frontier URIs (NASC56 + NASC57 + NASC58)
• Add a page containing 2 form fields:
  – Regular expression:
  – Show matches: 1000 (default # of matching URIs)
  – Action A => Display URIs: displays the matching URIs and the # of matching URIs, and makes it possible to view the next block of matching URIs
  – Action B => Delete URIs: deletes the matching URIs and indicates the # of deleted URIs
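Actions A and B can be sketched in Python as follows. The frontier is modelled here as a plain list of URI strings, which is a simplification for illustration; `view_frontier` and `delete_from_frontier` are hypothetical names and do not correspond to a Heritrix API.

```python
import re
from typing import List, Tuple


def view_frontier(uris: List[str], pattern: str,
                  offset: int = 0,
                  page_size: int = 1000) -> Tuple[List[str], int]:
    """Action A: return one block of matching URIs and the total match count.

    Calling again with offset + page_size gives the next block.
    """
    regex = re.compile(pattern)
    matches = [u for u in uris if regex.search(u)]
    return matches[offset:offset + page_size], len(matches)


def delete_from_frontier(uris: List[str], pattern: str) -> int:
    """Action B: remove matching URIs in place and return how many were removed."""
    regex = re.compile(pattern)
    kept = [u for u in uris if not regex.search(u)]
    deleted = len(uris) - len(kept)
    uris[:] = kept  # modify the caller's list rather than a copy
    return deleted
```

Running Action A with the same expression before Action B lets the curator check what would be deleted.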
Draft UI for « View or delete frontier URIs »
[Mockup: header with status, job ID and Home link; notice "Pause the job first to view frontier"; search buttons; "Matching lines: 1000" field; pager "URIs: displaying 1-1000 out of 12345" shown above and below the results]
Job configuration add filter – change budget
Comparison with BAnQ