Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Upload: aldous-barker

Post on 28-Dec-2015


Page 1: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Metadata Extractors, Content Transformers & Renditions

Neil Mc Erlean

Page 2: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Who am I?

Lead Engineer in the Services Team

4 years at Alfresco (since 3.2)

Previously worked on
• Hybrid Sync
• Alfresco in the Cloud
• Various services/components
  • Transformers & Extractors
  • REST APIs
  • Actions & Behaviours and more…

Ex-astrophysicist (of which more later)

Page 3: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Talk content

What data is in your content?

How does Alfresco get at it?

What does Alfresco do with it?

How can you use these features?

Introductory material
• no prior knowledge assumed

Page 4: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Talk content - Breaking it down

Your content & its metadata

Alternative renditions of your content

Overviews of the 3 services

Java Foundation APIs. JavaScript.

Configuring & extending Alfresco.

All code samples available as runnable tests - download from the website.

Page 5: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

#1 Metadata Extraction

Page 6: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

#2 Content Transformation

Alfresco uses them to produce

• images (thumbnails)
• plain text (indexing)
• inter-Office transforms

Also generally useful

Page 7: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

#3 Rendition Service

• Very similar to transformations

• More general service

• More than just content to content

Page 8: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

How do these components work?

Mostly by leveraging existing OSS Java libs
• Notably Apache Tika

Some external OS processes too
• OpenOffice.org (OOo), LibreOffice
• ImageMagick
• pdf2swf (swftools)

Some bespoke implementations, e.g. zip to txt

‘Embedded’ thumbnails/previews: iWork, Office

Page 9: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

General Considerations

CPU, memory

In process vs. out of process vs. Remote CPU

Selection of ‘best’ extractor/transformer

Stay for Andy Hunt’s talk for Support’s troubleshooting tips

Page 10: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Metadata Extraction

Page 11: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

#1 Metadata Extraction

• Triggered on content creation or update
  • or on demand

• ‘Best’ available extractor obtained from MetadataExtracterRegistry

• This extractor pulls out the metadata
  • Format depends on the extractor lib/impl
  • key/value pairs

• These data are mapped onto the Alfresco content model
  • configurable mapping in <ExtractorClass>.properties

Page 12: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Metadata extraction - Java

// Look up the extractor registry from the Spring application context
MetadataExtracterRegistry registry = appContext.getBean("metadataExtracterRegistry", MetadataExtracterRegistry.class);

// Read the node's content and ask for the 'best' extractor for its MIME type
ContentReader reader = contentService.getReader(nodeRef, ContentModel.PROP_CONTENT);
MetadataExtracter extractor = registry.getExtracter(reader.getMimetype());

// Extracted properties are written into this map
Map<QName, Serializable> props = new HashMap<QName, Serializable>();
extractor.extract(reader, OverwritePolicy.EAGER, props);

Page 13: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Overwrite Policy – when re-extracting

• EAGER
  • extracted value is not null

• PRUDENT
  • db property doesn’t exist or is null or "" (+ above)

• CAUTIOUS
  • existing property == undefined
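The three policies can be read as a single "may I overwrite?" decision per property. The sketch below is a plain-Java illustration of that reading, not Alfresco's own OverwritePolicy enum: the names OverwriteSketch, allows and merge are hypothetical, property names are plain strings, and applying the "extracted value is not null" check to all three policies is an assumption.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the three policies; not Alfresco's own enum.
enum OverwriteSketch {
    // Write whenever a value was extracted.
    EAGER {
        @Override
        boolean allows(Map<String, Serializable> existing, String key) {
            return true;
        }
    },
    // Write only if the stored value is missing, null or the empty string.
    PRUDENT {
        @Override
        boolean allows(Map<String, Serializable> existing, String key) {
            Serializable current = existing.get(key);
            return current == null || "".equals(current);
        }
    },
    // Write only if the property is not defined at all.
    CAUTIOUS {
        @Override
        boolean allows(Map<String, Serializable> existing, String key) {
            return !existing.containsKey(key);
        }
    };

    abstract boolean allows(Map<String, Serializable> existing, String key);

    // Returns a copy of 'existing' with the extracted values merged in.
    Map<String, Serializable> merge(Map<String, Serializable> existing,
                                    Map<String, Serializable> extracted) {
        Map<String, Serializable> result = new HashMap<>(existing);
        for (Map.Entry<String, Serializable> e : extracted.entrySet()) {
            // Assumption: a null extracted value never overwrites, under any policy.
            if (e.getValue() != null && allows(existing, e.getKey())) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }
}
```

With an existing title of "Old" and an extracted title of "New", EAGER yields "New" while PRUDENT and CAUTIOUS keep "Old"; an existing empty-string author is replaced under PRUDENT but kept under CAUTIOUS.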

Page 14: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

<ExtractorClass>.properties mapping

namespace.prefix.cm=http://www.alfresco.org/model/content/1.0

author=cm:author

title=cm:title

# Note: ':' in a key name must be escaped

geo\:lat=cm:latitude

geo\:long=cm:longitude
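Why the colon needs escaping becomes clear when the file is read with java.util.Properties, which treats an unescaped ':' in a key as a key/value separator. The sketch below parses a mapping in the format shown and resolves each prefixed name to a "{uri}localName" string. MappingFileSketch and parse are hypothetical names, and the real extractor's loading code may differ.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

class MappingFileSketch {
    private static final String PREFIX_KEY = "namespace.prefix.";

    // Returns raw metadata key -> "{namespaceUri}localName".
    static Map<String, String> parse(String text) {
        Properties props = new Properties();
        try {
            // Properties.load turns the escaped key "geo\:lat" into "geo:lat".
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // First pass: collect the declared namespace prefixes.
        Map<String, String> namespaces = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(PREFIX_KEY)) {
                namespaces.put(key.substring(PREFIX_KEY.length()),
                        props.getProperty(key).trim());
            }
        }
        // Second pass: resolve "prefix:local" targets against those prefixes.
        Map<String, String> mapping = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(PREFIX_KEY)) {
                continue;
            }
            String target = props.getProperty(key).trim();
            int colon = target.indexOf(':');
            String uri = colon < 0 ? null : namespaces.get(target.substring(0, colon));
            if (uri == null) {
                throw new IllegalArgumentException("Cannot resolve prefix in: " + target);
            }
            mapping.put(key, "{" + uri + "}" + target.substring(colon + 1));
        }
        return mapping;
    }

    public static void main(String[] args) {
        String text = "namespace.prefix.cm=http://www.alfresco.org/model/content/1.0\n"
                + "author=cm:author\n"
                + "geo\\:lat=cm:latitude\n";
        System.out.println(parse(text));
    }
}
```

Without the backslash, the line geo:lat=cm:latitude would be read as key "geo" with value "lat=cm:latitude", which is why the escape matters.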

Page 15: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Mapping properties

• Can map extracted key-value onto multiple content properties

• Can ignore extracted key-values i.e. not map.
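A minimal sketch of both points, assuming the mapping has already been loaded into a map from raw key to a list of target properties. MappingApplySketch, the raw key "generator" and the target "my:creator" are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MappingApplySketch {
    // Copies each raw value to all of its mapped targets; unmapped keys are skipped.
    static Map<String, String> apply(Map<String, String> raw,
                                     Map<String, List<String>> mapping) {
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            List<String> targets = mapping.get(entry.getKey());
            if (targets == null) {
                continue; // no mapping defined, so the extracted value is ignored
            }
            for (String target : targets) {
                result.put(target, entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> raw = new HashMap<>();
        raw.put("author", "Ann");
        raw.put("generator", "SomeTool 1.0"); // extracted but deliberately unmapped

        Map<String, List<String>> mapping = new HashMap<>();
        mapping.put("author", Arrays.asList("cm:author", "my:creator"));

        // prints {cm:author=Ann, my:creator=Ann}
        System.out.println(apply(raw, mapping));
    }
}
```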

Page 16: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Metadata extraction - JavaScript

var action = actions.create('extract-metadata');
action.execute(nodeRef);

Page 17: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Ways to customise & extend

• Customisation of existing extractors
  • Define new mappings, to an existing or a new content model.

• Adding new extractors
  • Identify 3rd party lib that can read the binary file
  • Or write your own code to do this
  • Extend AbstractMappingMetadataExtracter
  • Or write a Tika plugin
  • Define metadata mappings

• org.alfresco.repo.content.metadata
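The "write your own code" step amounts to turning file content into raw key/value pairs, which the mapping then places onto the content model. As a rough illustration only, the sketch below reads a made-up format of "Key: Value" header lines ended by a blank line. HeaderExtractorSketch and the format are hypothetical; in Alfresco the equivalent logic would sit inside a subclass of AbstractMappingMetadataExtracter.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

class HeaderExtractorSketch {
    // Reads "Key: Value" lines up to the first blank line and returns them as raw pairs.
    static Map<String, String> extractRaw(String content) {
        Map<String, String> raw = new LinkedHashMap<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(content))) {
            String line;
            while ((line = reader.readLine()) != null && !line.trim().isEmpty()) {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    // Only the first colon separates key from value.
                    raw.put(line.substring(0, colon).trim(),
                            line.substring(colon + 1).trim());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return raw;
    }

    public static void main(String[] args) {
        System.out.println(extractRaw("Author: Ann\nTitle: Notes: part 1\n\nBody text\n"));
    }
}
```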

Page 18: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Recap

• Metadata extraction harvests ‘hidden’ data and maps it into Alfresco content model.

• Support for many MIME types

• Metadata insertion coming
  • it’s on HEAD but currently disabled
  • also maps metadata tags to cm:taggable

• “Best” extractor selection covered below

Page 19: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Content Transformers

Page 20: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Out of the box transformers
• text, html, xml
• Microsoft Office (doc & docx formats)
• OpenDocument Format
• iWork (Keynote, Pages, Numbers)
• Images
• Shockwave Flash (SWF)
• RFC822 email, Outlook .msg email
• Adobe PDF, Illustrator, PSD
• Electronic publication (epub)
• Rich Text (RTF)
• MP3
• Archives (ZIP, tar)
• Many more

Page 21: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Available transformers

• No ‘graph’ of transform paths/mime types

• Spring beans extend “baseContentTransformer”

• They implement isTransformable(from, to)

• They can be
  • simple (A to B)
  • ‘complex’ (A to C, via B)
  • failover (A to B, A to B…)
  • overlapping (multiple beans for same path)
  • dynamically un/available (e.g. OOo)
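The simple, complex and failover shapes can be pictured with a small plain-Java model. This is an illustrative sketch, not the Alfresco classes: TransformerSketch, SimpleSketch, ComplexSketch and FailoverSketch are hypothetical names, content is a String rather than a reader/writer pair, and only isTransformable(from, to) is taken from the slide.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical interface; content is a String here to keep the sketch small.
interface TransformerSketch {
    boolean isTransformable(String from, String to);
    String transform(String content);
}

// Simple: A to B in one step.
class SimpleSketch implements TransformerSketch {
    private final String from;
    private final String to;
    private final UnaryOperator<String> step;

    SimpleSketch(String from, String to, UnaryOperator<String> step) {
        this.from = from;
        this.to = to;
        this.step = step;
    }

    public boolean isTransformable(String f, String t) {
        return from.equals(f) && to.equals(t);
    }

    public String transform(String content) {
        return step.apply(content);
    }
}

// Complex: A to C via an intermediate type B.
class ComplexSketch implements TransformerSketch {
    private final String from;
    private final String via;
    private final String to;
    private final TransformerSketch first;
    private final TransformerSketch second;

    ComplexSketch(String from, String via, String to,
                  TransformerSketch first, TransformerSketch second) {
        this.from = from;
        this.via = via;
        this.to = to;
        this.first = first;
        this.second = second;
    }

    public boolean isTransformable(String f, String t) {
        return from.equals(f) && to.equals(t)
                && first.isTransformable(from, via) && second.isTransformable(via, to);
    }

    public String transform(String content) {
        return second.transform(first.transform(content));
    }
}

// Failover: several transformers for the same path, tried in order.
class FailoverSketch implements TransformerSketch {
    private final List<TransformerSketch> candidates;

    FailoverSketch(TransformerSketch... candidates) {
        this.candidates = Arrays.asList(candidates);
    }

    public boolean isTransformable(String f, String t) {
        for (TransformerSketch c : candidates) {
            if (c.isTransformable(f, t)) {
                return true;
            }
        }
        return false;
    }

    public String transform(String content) {
        RuntimeException last = new IllegalStateException("no transformers configured");
        for (TransformerSketch c : candidates) {
            try {
                return c.transform(content);
            } catch (RuntimeException e) {
                last = e; // remember the failure and try the next candidate
            }
        }
        throw last;
    }
}
```

Because each bean only answers isTransformable for its own path, the set of possible routes is never stored anywhere as a graph; it emerges from asking the beans.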

Page 22: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

/api/service/mimetypes webscript

http://localhost:8080/alfresco/service/mimetypes

•MIME types

•Metadata Extractors

•Content Transformers

•As services come and go (OOo), entries may disappear

Page 23: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

/api/service/mimetypes webscript

application/vnd.openxmlformats-officedocument.presentationml.presentation - pptx

Extractors: org.alfresco.repo.content.metadata.PoiMetadataExtracter

Transformable To:

application/pdf = Using a Direct Open Office Connection

application/vnd.ms-powerpoint = Using a Direct Open Office Connection

application/vnd.oasis.opendocument.presentation = Using a Direct Open Office Connection

application/x-shockwave-flash = Complex via: application/pdf

image/jpeg = Complex via: application/pdf

image/png = Complex via: application/pdf

text/html = org.alfresco.repo.content.transform.TikaAutoContentTransformer

text/plain = org.alfresco.repo.content.transform.TikaAutoContentTransformer

text/xml = org.alfresco.repo.content.transform.TikaAutoContentTransformer

Transformable From:

application/vnd.ms-powerpoint = Using a Direct Open Office Connection

application/vnd.oasis.opendocument.presentation = Using a Direct Open Office Connection

Page 24: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

“Best” transformer selection

• Alfresco prefers
  • available transformers (obviously)
  • ‘explicit’ transformers
  • previously fast transformers*

• Alfresco doesn’t understand the output quality
  • pass/fail
  • fast/slow

* past performance is not a guide to future performance.
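The preference order can be sketched as a filter followed by a two-level comparison. CandidateSketch and SelectionSketch are hypothetical, and treating "previously fast" as a recorded average time in milliseconds is an assumption. The real registry's bookkeeping may differ.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical record of what is known about one transformer bean.
class CandidateSketch {
    final String name;
    final boolean available;
    final boolean explicit;
    final long averageMillis;

    CandidateSketch(String name, boolean available, boolean explicit, long averageMillis) {
        this.name = name;
        this.available = available;
        this.explicit = explicit;
        this.averageMillis = averageMillis;
    }
}

class SelectionSketch {
    // Returns the preferred candidate, or null if none is available.
    static CandidateSketch best(List<CandidateSketch> candidates) {
        // false sorts before true, so explicit transformers come first.
        Comparator<CandidateSketch> explicitFirst =
                Comparator.comparing((CandidateSketch c) -> !c.explicit);
        // Ties are broken by the recorded average time (assumed metric).
        Comparator<CandidateSketch> order =
                explicitFirst.thenComparingLong(c -> c.averageMillis);

        CandidateSketch best = null;
        for (CandidateSketch c : candidates) {
            if (c.available && (best == null || order.compare(c, best) < 0)) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<CandidateSketch> all = Arrays.asList(
                new CandidateSketch("genericFast", true, false, 20),
                new CandidateSketch("explicitSlow", true, true, 500));
        System.out.println(best(all).name);
    }
}
```

Note that nothing in this ordering looks at output quality, which matches the slide: only availability, explicitness and speed are visible to the selection.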

Page 25: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Content Transformation - Java

ContentTransformerRegistry registry = appContext.getBean("contentTransformerRegistry", ContentTransformerRegistry.class);

// Reader on the source node's content
ContentReader reader = contentService.getReader(nodeRef, ContentModel.PROP_CONTENT);

// Writer on the target node; 'true' updates the content property when the stream closes
ContentWriter writer = contentService.getWriter(targetNode, ContentModel.PROP_CONTENT, true);
writer.setEncoding("UTF-8");
writer.setMimetype(MimetypeMap.MIMETYPE_TEXT_PLAIN);

// Now have a reader & writer ready to go

Page 26: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Content Transformation - Java ctd.

// Ask for a transformer from ZIP to plain text for content of this size
ContentTransformer transformer = registry.getTransformer(MimetypeMap.MIMETYPE_ZIP, reader.getSize(), MimetypeMap.MIMETYPE_TEXT_PLAIN, null);

transformer.transform(reader, writer);

Page 27: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Content Transformation - JavaScript

var action = actions.create('transform');
action.parameters["destination-folder"] = node.parent;
action.parameters["assoc-type"] = "{http://www.alfresco.org/model/content/1.0}contains";
action.parameters["assoc-name"] = node.name + "transformed";
action.parameters["mime-type"] = "text/plain";
action.execute(testNode);

Page 28: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Config: Transformer Filtering/Debugging

• org.alfresco.service.cmr.repository.TransformationOptionLimits
  • timeouts, size limits, page limits
  • content.transformer.OpenOffice.mimeTypeLimits.txt.pdf.maxSourceSizeKBytes=5120

• org.alfresco.repo.content.TransformerDebug
  • contextual logging
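To show how a limit such as maxSourceSizeKBytes might be applied, the sketch below builds the property key from the pattern in the example and compares it with a source size in bytes. SizeLimitSketch and withinLimit are hypothetical names; treating a missing or negative value as "no limit" is an assumption, not something stated on the slide.

```java
import java.util.Properties;

class SizeLimitSketch {
    // Returns true when the source is within the configured limit, or no limit applies.
    static boolean withinLimit(Properties config, String transformer,
                               String sourceExt, String targetExt, long sourceBytes) {
        String key = "content.transformer." + transformer
                + ".mimeTypeLimits." + sourceExt + "." + targetExt
                + ".maxSourceSizeKBytes";
        String value = config.getProperty(key);
        if (value == null) {
            return true; // assumption: no property means no limit
        }
        long maxKBytes = Long.parseLong(value.trim());
        // Assumption: a negative value also means "unlimited".
        return maxKBytes < 0 || sourceBytes <= maxKBytes * 1024;
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty(
                "content.transformer.OpenOffice.mimeTypeLimits.txt.pdf.maxSourceSizeKBytes",
                "5120");
        // A 6 MB text file exceeds the 5120 KB limit for txt to pdf.
        System.out.println(withinLimit(config, "OpenOffice", "txt", "pdf", 6L * 1024 * 1024));
    }
}
```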

Page 29: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Extending

• Follow the Alfresco patterns
  • org.alfresco.repo.content.transform

• Remember the chains

• Remember the subsystems
  • ImageMagick
  • OpenOffice

• Remember the Enterprise variants
  • JodConverter

Page 30: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Recap

• Many transformations & paths possible
  • No graph

• Can be expensive in CPU/memory

• Transformation to text = free indexing

• No link between source & transformed content
  • Thumbnails are children of their source nodes
  • Bespoke behaviours ensure thumbnails are updated

Page 31: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Renditions

Page 32: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Renditions

• A more general feature than transformers

• Although with a strong overlap
  • Thumbnails are renditions
  • Previews are renditions

• Not all renditions are thumbnails/previews

Page 33: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Renditions

• Flexible location

• Always associated to their source node.
  • Child nodes of their source node.
  • Child nodes of another folder node.

• Updated when their source updates.

• Can be disabled with marker aspect
  • rn:preventRenditions
  • See ‘preventRenditions’ spring bean to register other ‘unrenditionable’ content classes

• Can reflect the content and/or metadata of their source node.

Page 34: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Standard rendition engines

• reformat: redirects to vanilla transforms

• image: image manipulation parameters

• freemarker: run some FTL against source content

• xslt: run XSLT on (XML) source node

• composite: rendition series [reformat, crop]
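The composite engine's "series" idea, where each step consumes the previous step's output, can be sketched in a few lines. ImageSketch and CompositeSketch are hypothetical; an image is reduced to a MIME type plus dimensions so that the [reformat, crop] chain is easy to follow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical description of an image: just enough state to show the chaining.
class ImageSketch {
    final String mimetype;
    final int width;
    final int height;

    ImageSketch(String mimetype, int width, int height) {
        this.mimetype = mimetype;
        this.width = width;
        this.height = height;
    }
}

class CompositeSketch {
    private final List<UnaryOperator<ImageSketch>> steps = new ArrayList<>();

    CompositeSketch add(UnaryOperator<ImageSketch> step) {
        steps.add(step);
        return this;
    }

    // Runs the steps in order; each step receives the previous step's output.
    ImageSketch render(ImageSketch source) {
        ImageSketch current = source;
        for (UnaryOperator<ImageSketch> step : steps) {
            current = step.apply(current);
        }
        return current;
    }

    // Step that only changes the format.
    static UnaryOperator<ImageSketch> reformat(String targetMimetype) {
        return img -> new ImageSketch(targetMimetype, img.width, img.height);
    }

    // Step that trims the image to at most the given size.
    static UnaryOperator<ImageSketch> crop(int maxWidth, int maxHeight) {
        return img -> new ImageSketch(img.mimetype,
                Math.min(img.width, maxWidth), Math.min(img.height, maxHeight));
    }

    public static void main(String[] args) {
        ImageSketch out = new CompositeSketch()
                .add(reformat("image/png"))
                .add(crop(100, 100))
                .render(new ImageSketch("image/jpeg", 800, 600));
        System.out.println(out.mimetype + " " + out.width + "x" + out.height);
    }
}
```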

Page 35: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Persistence of Rendition Definitions

1. Create Rendition Definition

2. Set parameter values on it

3. Execute it against a source node

• Definitions can be persisted

• Useful for complex or commonly used definitions
  • RenditionService.saveRenditionDefinition(), .loadRenditionDefinition()

• Saved into Alfresco’s Data Dictionary
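The create, parameterise, save and reload cycle can be modelled with an in-memory store. This is a loose sketch only: DefinitionSketch and DefinitionStoreSketch are hypothetical, the parameter name "width" is a placeholder, and a HashMap stands in for the Data Dictionary.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical rendition definition: a name, an engine name and parameters.
class DefinitionSketch {
    final String name;
    final String engineName;
    final Map<String, Object> parameters = new HashMap<>();

    DefinitionSketch(String name, String engineName) {
        this.name = name;
        this.engineName = engineName;
    }

    DefinitionSketch copy() {
        DefinitionSketch c = new DefinitionSketch(name, engineName);
        c.parameters.putAll(parameters);
        return c;
    }
}

// Hypothetical store; a map stands in for persistent storage.
class DefinitionStoreSketch {
    private final Map<String, DefinitionSketch> saved = new HashMap<>();

    // Stores a snapshot, so later changes to 'def' do not alter the saved copy.
    void save(DefinitionSketch def) {
        saved.put(def.name, def.copy());
    }

    // Returns a fresh copy of the saved definition, or null if none exists.
    DefinitionSketch load(String name) {
        DefinitionSketch def = saved.get(name);
        return def == null ? null : def.copy();
    }

    public static void main(String[] args) {
        DefinitionStoreSketch store = new DefinitionStoreSketch();
        DefinitionSketch def = new DefinitionSketch("cm:myRendDefn", "imageRenderingEngine");
        def.parameters.put("width", 128);
        store.save(def);
        System.out.println(store.load("cm:myRendDefn").parameters);
    }
}
```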

Page 36: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Renditions - Java

NodeRef jpgNodeRef; // source image node, assumed to be set elsewhere
QName renditionName = QName.createQName(NamespaceService.CONTENT_MODEL_1_0_URI, "myRendDefn");

RenditionDefinition renditionDef = renditionService.createRenditionDefinition(renditionName, "imageRenderingEngine");

// Resize to 128 x 512 without preserving the source aspect ratio
renditionDef.setParameterValue(ImageRenderingEngine.PARAM_RESIZE_WIDTH, 128);
renditionDef.setParameterValue(ImageRenderingEngine.PARAM_RESIZE_HEIGHT, 512);
renditionDef.setParameterValue(ImageRenderingEngine.PARAM_MAINTAIN_ASPECT_RATIO, false);

// The result is the association from the source node to the new rendition node
ChildAssociationRef chAssRef = renditionService.render(jpgNodeRef, renditionDef);

Page 37: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Renditions - JavaScript

var renditionDef = renditionService.createRenditionDefinition("cm:cropResize", "imageRenderingEngine");

renditionDef.parameters["destination-path-template"] = "/Company Home/Cropped Images/${name}.jpg";
renditionDef.parameters["isAbsolute"] = true;
renditionDef.parameters["xSize"] = 50;
renditionDef.parameters["ySize"] = 50;

renditionService.render(testNode, renditionDef);

var renditions = renditionService.getRenditions(testNode);
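The destination-path-template parameter contains a ${name} placeholder that is filled in per source node. The sketch below shows that substitution with a plain regular expression. PathTemplateSketch is hypothetical and is a simplified stand-in: the real template processing in Alfresco may support more than simple ${...} variables.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PathTemplateSketch {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\w+)\\}");

    // Replaces each ${key} in the template with the matching value.
    static String expand(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("No value for " + m.group(1));
            }
            // quoteReplacement keeps '$' and '\' in the value literal.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("name", "photo");
        // prints /Company Home/Cropped Images/photo.jpg
        System.out.println(expand("/Company Home/Cropped Images/${name}.jpg", values));
    }
}
```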

Page 38: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

Recap

• Renditions == Transformations++

• More complex, more powerful

Page 39: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean
Page 40: Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean

End