Using JHOVE2 for Policy Assessment of Files

Richard Anderson
Code4LibCon Preconference

2/7/2011

http://code4lib.org/conference/2011/schedule#preconf
13:30-16:30 : Persimmon Room

Agenda 13:30-16:30

• What is JHOVE2?
• Characterization of digital objects
• Validation vs Assessment
• Examples of JHOVE2 output
• Source Units, Modules, Reportable Properties
• Implementation of Assessment
• Configuration of Assessment Rules

JHOVE2 is …

… a project to develop a next-generation open source framework and application for format-aware characterization

… a collaborative undertaking of the California Digital Library (CDL), Portico, and Stanford University

… funded by a two-year grant from the Library of Congress as part of its National Digital Information Infrastructure and Preservation Program (NDIIPP)

“What? So what?”

Characterization is the automated determination of the intrinsic and extrinsic properties of a formatted object

– Identification: determining the presumptive format of a digital object based on suggestive extrinsic hints and intrinsic signatures

– Feature extraction: reporting the intrinsic properties of an object significant for classification, analysis, and planning

– Validation

– Assessment

What's new in JHOVE2?

Processing of multi-file objects as well as embedded objects inside files

Recursive processing of container objects

Plug-in Format Modules

Buffered I/O

Internationalized output

Clean APIs and modern design patterns

Je ne sais quoi !

API design idioms

Separation of concerns

– Annotation and Reflection: confluence.ucop.edu/display/JHOVE2Info/Background+Papers

Inversion of Control (IOC) / Dependency Injection

– Martin Fowler: martinfowler.com/articles/injection.html

– Spring Framework: www.springsource.org/
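As a rough illustration of the dependency injection idiom named above, the following sketch wires a collaborator into a class through its constructor. All type names here (FormatIdentifier, ExtensionIdentifier, Characterizer) are hypothetical and are not JHOVE2 classes; the slides indicate that JHOVE2 relies on Spring for wiring, whereas this sketch wires by hand to keep the idea visible.

```java
import java.util.Locale;

// Hypothetical names throughout; none of these types come from JHOVE2.
interface FormatIdentifier {
    String identify(String fileName);
}

// One possible implementation: guess the format from the file extension.
class ExtensionIdentifier implements FormatIdentifier {
    public String identify(String fileName) {
        return fileName.toLowerCase(Locale.ROOT).endsWith(".xml") ? "XML" : "unknown";
    }
}

// The characterizer never constructs its collaborator; it receives it.
class Characterizer {
    private final FormatIdentifier identifier;

    Characterizer(FormatIdentifier identifier) {
        this.identifier = identifier;
    }

    String characterize(String fileName) {
        return fileName + " -> " + identifier.identify(fileName);
    }
}

class InjectionDemo {
    public static void main(String[] args) {
        // Wiring happens in one place; a container could do the same from a config file.
        Characterizer c = new Characterizer(new ExtensionIdentifier());
        System.out.println(c.characterize("mets.xml"));
    }
}
```

Because Characterizer depends only on the interface, a different identifier can be substituted without touching the class itself.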

Project Home

Domain name
– http://jhove2.org/

Code Repository
– https://bitbucket.org/jhove2/main/wiki/Home
  • Public Wiki/Documentation
  • Browse/Clone Source Code
  • Download Release Packages
  • Changeset History
  • Issue Tracking

Mailing lists
– JHOVE2-Announce-L@listserv.ucop.edu
– JHOVE2-Techtalk-L@listserve.ucop.edu

JHOVE2 Documentation

Complete documentation

– User’s guide

– Architectural overview

– Module specifications

– Programmer’s guide

Characterization

Validation vs. Assessment

Validation is the determination of the level of conformance to the normative requirements of a format’s authoritative specification

– To the extent that there is community consensus on these requirements, validation is an objective determination
– Hard coded in JHOVE2 Modules

Assessment is the determination of the level of acceptability for a specific purpose on the basis of locally-defined policy rules

– Since these rules are locally configurable, assessment is a subjective determination
– Scripted via config files

Format Specifications

Format      Specification
JPEG 2000   JP2 (ISO/IEC 15444-1), JPX (ISO/IEC 15444-2)
PDF         PDF 1.0 – 1.7, ISO 32000-1, PDF/A-1 (ISO 19005-1), PDF/X-1 (ISO 15930-1), -1a (ISO 15930-4), -2 (ISO 15930-5), -3 (ISO 15930-6)
TIFF        TIFF 4 – 6, Class B, F, G, P, R, Y, TIFF/EP (ISO 12234-2), TIFF/IT (ISO 12639), GeoTIFF, Exif (JEITA CP-3451), DNG
UTF-8       ASCII (ANSI X3.4)
WAVE        BWF (EBU N22-1997)

Putting it another way …

Assessment is the evaluation of a source unit's reportable properties against a set of policy-based rules

A source unit:
– File (UTF-8)
– File with embedded ByteStream(s) (TIFF with ICC profile)
– Aggregate (Directory, ZIP)
– ClumpSource (ShapeFile)

Reportable properties:
– Format Identification
– Features
– Validity

Policy-based rules:
– Is the item acceptable?
– Is there a preservation risk?
– What level of preservation service?
– Should we flag object for future action?

Practical Applications of Assessment

• Ingest workflows

• Migration workflows

• Digitization workflows

• Publishing workflows

Running JHOVE2

jhove2.sh -d Text -o outfile.txt myfile.xml

Display format choices are: Text (default), JSON, and XML.

File argument can be any of:
– Filename
– Directory name
– URL
– Set of space-delimited filepaths

http://bitbucket.org/jhove2/main/wiki/documents/JHOVE2-Users-Guide.pdf

JHOVE2 Output options

• Input File
  – xml-schemaLocation-cannot-resolve.xml

• Text
  – text-output.txt

• XML
  – xml-output.xml

• JSON
  – json-output.txt

JHOVE2 Output

FileSource:
 Path: E:\samples\xml\schema-sample.xml
 Size (byte): 9516
 LastModified: 2010-10-12T11:55:29-06:00
 SourceName: schema-sample.xml
 StartingOffset (byte): 0

Format Identification

PresumptiveFormats:

PresumptiveFormat {FormatIdentification}:

NativeIdentifier {I8R}:

Namespace: PUID

Value: fmt/101 (PRONOM Identifier)

JHOVE2Identifier {I8R}:

Namespace: JHOVE2

Value: http://jhove2.org/terms/format/xml

...

PRONOM Format Registry
http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=638

Name: Extensible Markup Language
Version: 1.0
Other names: XML (1.0)
Identifiers: PUID: fmt/101; Apple Uniform Type Identifier: public.xml; MIME: text/xml
Classification: Text (Mark-up)
Description: The Extensible Markup Language (XML) is a general purpose markup language for creating other, special purpose, markup languages, and is a simplified subset of SGML. …

Agent used for Identification

Module {DROIDIdentifier}:

SignatureFile: …/DROID_SignatureFile_V20.xml

Version: 2.0.0

ReleaseDate: 2010-09-10

WrappedProduct:

Name: DROID

Version: 4.0.0

ReleaseDate: 2009-07-23

...

DROID
http://sourceforge.net/projects/droid/

DROID (Digital Record Object Identification) is an automatic file format identification tool. It is the first in a planned series of tools developed by The National Archives under the umbrella of its PRONOM technical registry service.

XML Module

Module {XmlModule}:

SaxParser:

Parser: org.apache.xerces.parsers.SAXParser

XmlDeclaration:

Version: 1.0

Encoding: UTF-8

Standalone: no

RootElement:

Name: mets

Namespace: http://www.loc.gov/METS/

XML Module (namespaces)

NamespaceInformation:

NamespaceCount: 2

Namespaces:

Namespace:

URI: http://www.loc.gov/METS/

Declarations:

Prefix: [default]

SchemaLocations:

SchemaLocation:

Location: http://www.loc.gov/standards/mets/version15/mets.xsd

Namespace:

URI: http://www.loc.gov/mix/v10

Declarations:

Prefix: mix

XML Module (cont)

ValidationResults:

ParserWarnings {ValidationMessageList}:

ValidationMessageCount: 0

ParserErrors {ValidationMessageList}:

ValidationMessageCount: 0

FatalParserErrors {ValidationMessageList}:

ValidationMessageCount: 0

isWellFormed: true

isValid: true

Format Modules from JHOVE2 Team

– ICC color profile
– JPEG 2000
– PDF
– SGML
– Shapefile
– TIFF
– UTF-8
– WAVE
– XML
– Zip

JHOVE2 can identify (by DROID) many more formats than it can validate (by modules)

Other Module Development

3rd party development activities
– NetCDF and GRIB modules (Wegener Institute)
– Integration with DuraCloud (DuraSpace)
– ARC module (Bibliothèque nationale de France)
– WARC, JPEG, GIF modules (CDL, hopefully ;-)

Possible development efforts
– Additional format modules
– Configuration GUIs
– JHOVE2-as-a-service
– Integration with DAITSS, DSpace, Fedora, FITS, etc.

Suggestions, volunteers, and funders welcome

AssessmentModule

Module {AssessmentModule}:

AssessmentResultSets:

AssessmentResultSet:

RuleSetName: XmlRuleSet

RuleSetDescription: RuleSet for Xml Module

ObjectFilter: org.jhove2.module.format.xml.XmlModule

BooleanResult: true

AssessmentResults:

AssessmentResult:

RuleName: XmlValidityRule

RuleDescription: Is the XML file acceptable?

BooleanResult: true

NarrativeResult: Acceptable

JHOVE2 Abstractions

• Source Unit
• Module
• Reportable
• Reportable Property
• Message

Source Unit

A formatted object about which characterization information can be meaningfully reported

– Unitary
  • File, e.g. UTF-8 text file
  • File inside of a container, e.g. TIFF inside a Zip
  • Byte stream inside a file, e.g. ICC inside a TIFF

– Aggregate
  • Directory
  • Directory inside of a container
  • Clump, e.g. Shapefile
  • File set, e.g. command line arguments

For purposes of characterization, directories, file sets, and clumps are considered format types

Source Interface (Java)

public Set<FormatIdentification> getPresumptiveFormats() {
    return presumptiveFormatIdentifications;
}

public List<Module> getModules() {
    return this.modules;
}

public List<Source> getChildSources() {
    return this.children;
}
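Because each source exposes its child sources, a container can be processed recursively. The sketch below walks such a tree depth-first. SketchSource is a hypothetical stand-in that borrows only the getChildSources() idea from the accessors above; it is not the JHOVE2 Source type, and the file names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a source unit; only the child-source idea is taken from the slides.
class SketchSource {
    private final String name;
    private final List<SketchSource> children = new ArrayList<>();

    SketchSource(String name) {
        this.name = name;
    }

    SketchSource addChild(SketchSource child) {
        children.add(child);
        return this;
    }

    String getName() {
        return name;
    }

    List<SketchSource> getChildSources() {
        return children;
    }
}

class SourceWalker {
    // Depth-first walk: record the source itself, then each child source recursively.
    static List<String> walk(SketchSource source) {
        List<String> visited = new ArrayList<>();
        collect(source, 0, visited);
        return visited;
    }

    private static void collect(SketchSource source, int depth, List<String> out) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            indent.append("  ");
        }
        out.add(indent + source.getName());
        for (SketchSource child : source.getChildSources()) {
            collect(child, depth + 1, out);
        }
    }

    // Example tree: a Zip holding a TIFF with an embedded ICC profile, plus a text file.
    static SketchSource example() {
        SketchSource tiff = new SketchSource("scan.tif").addChild(new SketchSource("ICC profile"));
        return new SketchSource("batch.zip").addChild(tiff).addChild(new SketchSource("notes.txt"));
    }

    public static void main(String[] args) {
        for (String line : walk(example())) {
            System.out.println(line);
        }
    }
}
```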

Format Module

• implements Parser
• implements Validator
• implements Reportable
• imports org.jhove2.annotation.ReportableProperty

public long parse(JHOVE2 jhove2, Source source, Input input) {
    // extract features and
    // fill in the reportable properties fields
    . . .
}

Reportables

A Reportable is a named set of properties

– Reportables correspond to Java classes

– Including classes for sources and modules

Also define reportables for the major conceptual structures inherent to a format

– JPEG 2000: Box

– TIFF: IFH, IFD, IFD entry (“tag”)

– UTF-8: Character stream, character

– WAVE: Chunk

Reportable Interface

package org.jhove2.core

public interface Reportable {
    public I8R getReportableIdentifier();
    public String getReportableName();
    public void setReportableName(String name);
}

public abstract class AbstractReportable implements Reportable {
    protected I8R reportableIdentifier;
    protected String reportableName;
}

A reportable class implements the Reportable marker interface

ReportableProperties

A ReportableProperty is a named, typed value

– org.jhove2.annotation.ReportableProperty
– Unique formal identifier
– Data type: scalar or collection Java types, JHOVE2 primitive types, or JHOVE2 reportables
– Typed value
– Description of correct semantic interpretation
– Properties correspond to fields

ReportableProperty Annotation

Each reportable property is represented by a field and accessor and mutator methods.
The accessor method must be marked with the @ReportableProperty annotation.

public class MyReportable implements Reportable {
    protected String myProperty;

    @ReportableProperty(order=1, desc="description", ref="reference")
    public String getMyProperty() {
        return this.myProperty;
    }

    public void setMyProperty(String property) {
        this.myProperty = property;
    }
}
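To show how annotated accessors can be discovered at run time, here is a minimal reflection sketch. SketchProperty is a hypothetical stand-in for org.jhove2.annotation.ReportableProperty, and its attribute names are assumptions; the way JHOVE2 itself collects and displays properties may differ.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the JHOVE2 annotation; attribute names are assumptions.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface SketchProperty {
    int order();
    String value();
}

class SketchXmlInfo {
    @SketchProperty(order = 2, value = "Name of the document's root element")
    public String getRootElementName() {
        return "mets";
    }

    @SketchProperty(order = 1, value = "Java class used to parse the XML")
    public String getSaxParser() {
        return "org.apache.xerces.parsers.SAXParser";
    }

    // Not annotated, so it is not reported.
    public String getInternalState() {
        return "hidden";
    }
}

class PropertyReporter {
    // Finds annotated accessors, sorts them by declared order, and invokes each one.
    static List<String> report(Object reportable) {
        List<Method> accessors = new ArrayList<>();
        for (Method m : reportable.getClass().getMethods()) {
            if (m.isAnnotationPresent(SketchProperty.class)) {
                accessors.add(m);
            }
        }
        accessors.sort(Comparator.comparingInt(
                (Method m) -> m.getAnnotation(SketchProperty.class).order()));
        List<String> lines = new ArrayList<>();
        for (Method m : accessors) {
            try {
                // Strip the "get" prefix to derive the property name.
                lines.add(m.getName().substring(3) + ": " + m.invoke(reportable));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot read " + m.getName(), e);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : report(new SketchXmlInfo())) {
            System.out.println(line);
        }
    }
}
```

The annotation needs RUNTIME retention; with the default retention it would not be visible to reflection.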

Wave Reportable Properties

chunks[ ]

formatChunkNotBeforeDataChunkMessage

missingRequiredFormatChunkMessage

missingRequiredDataChunkMessage

missingRequiredFactChunkMessage

isValid

childChunks[ ]

hasPadByte

identifier

isValid

size

UTF-8 Reportable Properties

byteOrderMark

c0Characters

c1Characters

codeBlocks

eOLMarkers

invalidCharacters[ ]

isValid

numCharacters

numLines

numNonCharacters

c0Control

c1Control

codeBlock

codePoint

codePointOutOfRange

coverage

invalidByteValues

isByteOrderMark

isC0Control

isC1Control

isNonCharacter

isValid

size

XML Reportable Properties

Fields for the reportable properties

protected String saxParser = "org.apache.xerces.parsers.SAXParser";
protected XmlDeclaration xmlDeclaration = new XmlDeclaration();
protected String xmlRootElementName;
protected List<XmlDTD> xmlDTDs;
protected HashMap<String,XmlNamespace> xmlNamespaceMap;
protected List<XmlNotation> xmlNotations;
protected List<String> xmlCharacterReferences;
protected List<XmlEntity> xmlEntitys;
protected List<XmlProcessingInstruction> xmlProcessingInstructions;
protected List<String> xmlComments;
protected XmlValidationResults xmlValidationResults;
protected boolean wellFormed;

Getter methods for reportable properties

import org.jhove2.annotation.ReportableProperty;

@ReportableProperty(order = 1, value = "Java class used to parse the XML")
public String getSaxParser() {
    return saxParser;
}

@ReportableProperty(order = 2, value = "XML Declaration data")
public XmlDeclaration getXmlDeclaration() {
    return xmlDeclaration;
}

@ReportableProperty(order = 3, value = "Name of the document's root element")
public String getXmlRootElementName() {
    return xmlRootElementName;
}

Messages

if (position == start && ch.isByteOrderMark()) {
    Object[] messageParms = new Object[] {position};
    this.bomMessage = new Message(Severity.INFO,
                                  Context.OBJECT,
                                  "org.jhove2.module.format.utf8.UTF8Module.bomMessage",
                                  messageParms);
}

Messages

• Messages are reportable properties
  – Unique identifier: info:jhove2/message/…
  – Context
    • Process: condition arising from the process of characterization
    • Object: condition arising in the object being characterized
  – Severity
    • Error
    • Warning
    • Info
  – Internationalizable
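The following sketch shows one way a message could carry a severity, a context, and a code that doubles as a lookup key for localized text. The types SketchSeverity, SketchContext, and SketchMessage are hypothetical, as are the catalog contents and the short message code; JHOVE2's own Message class and its localization mechanism are not reproduced here.

```java
import java.text.MessageFormat;
import java.util.HashMap;
import java.util.Map;

// Hypothetical types that mirror the slide's vocabulary; not the JHOVE2 classes.
enum SketchSeverity { ERROR, WARNING, INFO }

enum SketchContext { PROCESS, OBJECT }

class SketchMessage {
    final SketchSeverity severity;
    final SketchContext context;
    final String code;       // unique identifier, also the lookup key for the text
    final Object[] params;

    SketchMessage(SketchSeverity severity, SketchContext context, String code, Object... params) {
        this.severity = severity;
        this.context = context;
        this.code = code;
        this.params = params;
    }

    // Looks up the template for this code in a per-language catalog and fills in the parameters.
    String localize(Map<String, String> catalog) {
        String template = catalog.get(code);
        if (template == null) {
            return code; // fall back to the bare identifier
        }
        return MessageFormat.format(template, params);
    }
}

class MessageDemo {
    // Invented catalogs for illustration only.
    static Map<String, String> english() {
        Map<String, String> catalog = new HashMap<>();
        catalog.put("utf8.bomMessage", "Byte order mark found in {0}");
        return catalog;
    }

    static Map<String, String> german() {
        Map<String, String> catalog = new HashMap<>();
        catalog.put("utf8.bomMessage", "Byte-Order-Mark gefunden in {0}");
        return catalog;
    }

    public static void main(String[] args) {
        SketchMessage m = new SketchMessage(
                SketchSeverity.INFO, SketchContext.OBJECT, "utf8.bomMessage", "notes.txt");
        System.out.println(m.severity + " " + m.localize(english()));
        System.out.println(m.severity + " " + m.localize(german()));
    }
}
```

Keeping the code separate from the display text is what makes the message internationalizable: the same message object renders differently depending on the catalog supplied.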

Assessment rules

Assertions (logical expressions) based on

– Presence/absence of a property
– Constraints on property values
– Combinations of properties/values

Predicate Logic

• Rules use a construct whose basic structure looks like this:

If (condition)

Then (consequent)

Else (alternative)

http://en.wikipedia.org/wiki/Conditional_(programming)

Condition

A condition is defined by a universal or existential quantifier

∀ “for all”
∃ “for any”
¬∃ “not any”

and an arbitrary set of predicates

{ALL_OF | ANY_OF | NONE_OF} (predicate) (predicate) ...

http://www.csm.ornl.gov/~sheldon/ds/sec1.6.html

Predicate

Each predicate is a string containing a boolean expression

xmlDeclaration.standalone == 'yes'

These assertions take the form:

property relation value

Supported relational operators include:

== != < > <= >=

contains

exists ( != null)
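The quantifier semantics can be sketched in plain Java. The quantifier names follow the slides, but everything else is an assumption made for illustration: the slides describe predicates as strings evaluated by an expression language, whereas this sketch substitutes Java lambdas, and XmlProps is a hypothetical holder for two properties.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Quantifier names follow the slides; the evaluation code is an illustrative sketch.
enum Quantifier {
    ALL_OF, ANY_OF, NONE_OF;

    // Applies the quantifier to a list of predicates evaluated against one object.
    <T> boolean test(List<Predicate<T>> predicates, T subject) {
        switch (this) {
            case ALL_OF:
                return predicates.stream().allMatch(p -> p.test(subject));
            case ANY_OF:
                return predicates.stream().anyMatch(p -> p.test(subject));
            default:
                return predicates.stream().noneMatch(p -> p.test(subject));
        }
    }
}

// Hypothetical property holder standing in for a module's reportable properties.
class XmlProps {
    final String validity;    // "true", "false", or "undetermined"
    final boolean wellFormed;

    XmlProps(String validity, boolean wellFormed) {
        this.validity = validity;
        this.wellFormed = wellFormed;
    }
}

class ConditionDemo {
    // If (condition) Then (consequent) Else (alternative).
    static String assess(XmlProps props) {
        List<Predicate<XmlProps>> predicates = new ArrayList<>();
        predicates.add(p -> p.validity.equals("true"));
        predicates.add(p -> p.validity.equals("undetermined") && p.wellFormed);
        return Quantifier.ANY_OF.test(predicates, props) ? "Acceptable" : "Not acceptable";
    }

    public static void main(String[] args) {
        System.out.println(assess(new XmlProps("undetermined", true)));
    }
}
```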

XML Assessment rule

If ANY_OF
    validity == true ;
    (validity == undetermined) and (wellFormed == true)
Then Acceptable
Else Not acceptable
End If

JPEG 2000 Assessment Rule

If ALL_OF
    validity == true ;
    exists(colourBox) ;
    exists(resolutionBox.capture)
Then Acceptable
Else Not acceptable
End If

Wave Assessment rule

If ALL_OF
    validity == true ;
    exists(broadcastWaveExtensionChunk) ;
    waveFormatChunk.nSamplesPerSec == 96000 ;
    waveFormatChunk.nBitsPerSample == 24
Then Acceptable
Else Not acceptable
End If

TIFF Assessment rule

If ANY_OF
    validity == true ;
    ((ifd.messages contains 'offsetNotByteAligned') or (ifd.messages contains 'dateNotWellFormed'))
Then Acceptable
Else Not acceptable
End If

Rules Engines

• JSR 94: Java™ Rule Engine API
  http://jcp.org/en/jsr/detail?id=94

• Rule Engines Overview
  http://jadex-rules.informatik.uni-hamburg.de/xwiki/bin/view/Resources/Rule+Engines

• Top 10 Java Business Rule Engines
  http://blog.taragana.com/index.php/archive/top-10-java-business-rule-engines/

• Introduction to Drools
  http://www.intltechventures.com/presentations/2008-01-26-Introduction-to-Drools.pdf

Expression Languages

• Predicates (conditions) are evaluated using a domain-specific language that supports scripted examination of Java objects

• MVEL (MVFLEX Expression Language)
  http://mvel.codehaus.org/

• OGNL (Object-Graph Navigation Language)
  http://www.opensymphony.com/ognl

• Groovy
  http://groovy.codehaus.org/

• Open Source Expression Languages in Java
  http://java-source.net/open-source/expression-languages
  http://www.java-opensource.com/open-source/expression-languages.html

Assessment Module at work

public void assess(JHOVE2 jhove2, Source source) throws JHOVE2Exception {
    /* Assess the source unit. */
    this.configInfo = jhove2.getConfigInfo();
    List<Module> modules = source.getModules();
    for (Module module : modules) {
        assessObject(module);
        this.getModuleAccessor().persistModule(this);
    }
    assessObject(source);
    this.getModuleAccessor().persistModule(this);
}

AssessObject Method

private void assessObject(Object assessedObject) throws JHOVE2Exception {
    // Rule sets are selected by the class name of the object being assessed
    String objectFilter = assessedObject.getClass().getName();
    List<RuleSet> ruleSetList = getRuleSetFactory().getRuleSetList(objectFilter);
    if (ruleSetList != null) {
        for (RuleSet ruleSet : ruleSetList) {
            if (ruleSet.isEnabled()) {
                AssessmentResultSet resultSet = new AssessmentResultSet();
                assessmentResultSets.add(resultSet);
                resultSet.setRuleSet(ruleSet);
                resultSet.fireAllRules(assessedObject);
            }
        }
    }
}

Fire Off the Rules

Sequence Diagram

[Figure: sequence diagram covering Identification, Feature extraction, and Assessment]

Assessment Configuration

• Lists of properties for a Module can be generated using the ReportableInstanceTraverser utility

USAGE: java -cp CLASSPATH org.jhove2.app.util.traverser.ReportableInstanceTraverser fully-qualified-class-name output-file-path {optional boolean should-recurse (default true)}

• wave-property-list.txt

• tiff-module-properties.txt

Assessment Configuration

• Rules are configured using the ARules utility
  – Utility developed by CDL to create a rule set in XML
  – Future plans: a GUI

• ARules output is a Spring config file

ARules configuration

ruleset XmlRuleSet enabled org.jhove2.module.format.xml.XmlModule
desc Ruleset for XML module

rule XmlStandaloneRule enabled
desc Does XML Declaration specify standalone status?
cons Is Standalone
alt Is Not Standalone
quant all
pred xmlDeclaration.standalone == "yes"

rule XmlAcceptableRule enabled
desc Is the XML status acceptable?
cons Acceptable
alt Not Acceptable
quant any
pred valid.name() == "True"
pred (valid.name() == "Undetermined") && (wellFormed.name() == "True")

RuleSet Spring Bean

<!-- RuleSet bean for the XmlModule -->
<bean id="XmlRuleSet" class="org.jhove2.module.assess.RuleSet" scope="singleton">
    <property name="name" value="XmlRuleSet"/>
    <property name="description" value="RuleSet for Xml Module"/>
    <property name="objectFilter" value="org.jhove2.module.format.xml.XmlModule"/>
    <property name="rules">
        <list value-type="org.jhove2.module.assess.Rule">
            <ref local="XmlStandaloneRule"/>
            <ref local="XmlValidityRule"/>
        </list>
    </property>
    <property name="enabled" value="true"/>
</bean>

Rule Spring Bean

<!-- Rule bean for evaluating validity value -->
<bean id="XmlValidityRule" class="org.jhove2.module.assess.Rule" scope="singleton">
    <property name="name" value="XmlValidityRule"/>
    <property name="description" value="Is the XML validity status acceptable?"/>
    <property name="consequent" value="Acceptable"/>
    <property name="alternative" value="Not Acceptable"/>
    <property name="quantifier" value="ANY_OF"/>
    <property name="predicates">
        <list value-type="java.lang.String">
            <value><![CDATA[ valid.toString() == 'true' ]]></value>
            <value><![CDATA[ (valid.toString() == 'undetermined') && (wellFormed.toString() == 'true') ]]></value>
        </list>
    </property>
    <property name="enabled" value="true"/>
</bean>

Spring Config Files

config
└───spring
    └───module
        ├───aggrefy
        │       jhove2-aggrefy-config.xml
        ├───assess
        │       jhove2-assess-config.xml
        │       jhove2-ruleset-xml-config.xml
        ├───digest
        │       jhove2-digest-config.xml
        ├───display
        │       jhove2-display-config.xml
        ├───identify
        │       jhove2-display-config.xml

Assessment Output

Results stored as new characterization properties

Rule evaluation output includes
– Rule's name and brief description
– Boolean value of the condition that was evaluated
– Text value of the consequent or alternative
– Details of the predicate evaluation results

Assessment Output Example

Module {AssessmentModule}:
 AssessmentResultSets:
  AssessmentResultSet:
   RuleSetName: XmlRuleSet
   RuleSetDescription: Ruleset for XML module
   ObjectFilter: org.jhove2.module.format.xml.XmlModule
   BooleanResult: false
   AssessmentResults:
    AssessmentResult:
     RuleName: XmlStandaloneRule
     RuleDescription: Does XML Declaration specify standalone status?
     BooleanResult: false
     NarrativeResult: Is Not Standalone
     AssessmentDetails: ALL_OF { xmlDeclaration.standalone == "yes" => false; }
    AssessmentResult:
     RuleName: XmlAcceptableRule
     RuleDescription: Is the XML status acceptable?
     BooleanResult: true
     NarrativeResult: Acceptable
     AssessmentDetails: ANY_OF { valid.name() == "True" => true; (valid.name() == "Undetermined") && (wellFormed.name() == "True") => false; }

Actionable Outcomes?

– Assessment outcome is informational data
– Surrounding workflows may utilize assessment results to guide control mechanism
– JHOVE2 provides API, but does not initiate actions
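Since assessment results are informational, any action is left to the surrounding workflow. The sketch below shows one possible workflow-side decision based on rule outcomes. SketchResult and IngestRouter are hypothetical, as are the INGEST and QUARANTINE outcomes; only the field names echo the assessment output, and nothing here is part of the JHOVE2 API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical result holder; field names echo the assessment output, but this is not a JHOVE2 class.
class SketchResult {
    final String ruleName;
    final boolean booleanResult;
    final String narrativeResult;

    SketchResult(String ruleName, boolean booleanResult, String narrativeResult) {
        this.ruleName = ruleName;
        this.booleanResult = booleanResult;
        this.narrativeResult = narrativeResult;
    }
}

class IngestRouter {
    // Accepts the object only when every rule that local policy marks as required has passed.
    static String route(List<SketchResult> results, List<String> requiredRules) {
        for (SketchResult r : results) {
            if (requiredRules.contains(r.ruleName) && !r.booleanResult) {
                return "QUARANTINE: " + r.ruleName + " -> " + r.narrativeResult;
            }
        }
        return "INGEST";
    }

    // Results matching the XML example: one informational rule failed, the acceptance rule passed.
    static List<SketchResult> exampleResults() {
        return Arrays.asList(
                new SketchResult("XmlStandaloneRule", false, "Is Not Standalone"),
                new SketchResult("XmlAcceptableRule", true, "Acceptable"));
    }

    public static void main(String[] args) {
        System.out.println(route(exampleResults(), Arrays.asList("XmlAcceptableRule")));
    }
}
```

Separating the list of required rules from the results lets the same assessment output drive different workflows with different tolerances.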

Assessment Enhancements

• Assessment Config file editing
  – Make it easier for a non-programmer to edit
  – Editing should be bullet-proofed if possible

• GUI User interface
  – Presents a GUI treeview that lists reportable properties in a navigable hierarchy

• Sanity checking
  – Pre-test config files to ensure compatibility
    • Command-line invocation of the sanity checker
    • Run check whenever installed modules have been changed
  – Also have robust reporting in case property is missing
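One way such a sanity check might work is to confirm, by reflection, that each property path used in a rule resolves to accessor methods on the target class. The sketch below assumes plain dotted paths and get/is accessor naming; it does not parse full predicate expressions (for example, method calls such as valid.name()), and the SketchXmlModule and SketchDeclaration classes are hypothetical.

```java
import java.lang.reflect.Method;

// Hypothetical classes standing in for a module and one of its nested reportables.
class SketchDeclaration {
    public String getStandalone() {
        return "no";
    }
}

class SketchXmlModule {
    public SketchDeclaration getXmlDeclaration() {
        return new SketchDeclaration();
    }

    public boolean isWellFormed() {
        return true;
    }
}

class PropertyPathChecker {
    // Returns true when each segment of a dotted path maps to a getXxx() or isXxx() accessor.
    static boolean resolves(Class<?> type, String path) {
        Class<?> current = type;
        for (String segment : path.split("\\.")) {
            Method accessor = findAccessor(current, segment);
            if (accessor == null) {
                return false;
            }
            current = accessor.getReturnType();
        }
        return true;
    }

    private static Method findAccessor(Class<?> type, String property) {
        if (property.isEmpty()) {
            return null;
        }
        String suffix = Character.toUpperCase(property.charAt(0)) + property.substring(1);
        for (String prefix : new String[] {"get", "is"}) {
            try {
                return type.getMethod(prefix + suffix);
            } catch (NoSuchMethodException e) {
                // Try the next prefix.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(resolves(SketchXmlModule.class, "xmlDeclaration.standalone"));
        System.out.println(resolves(SketchXmlModule.class, "xmlDeclaration.encoding"));
    }
}
```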

JHOVE2 Community

Wiki
– http://jhove2.org/
– https://bitbucket.org/jhove2/main/wiki/Modules

Mailing lists
– JHOVE2-Announce-L@listserv.ucop.edu
– JHOVE2-Techtalk-L@listserve.ucop.edu
