Scripting EPrints
About This Talk
Light on syntax: method calls are written as object->function(arg1, arg2)
Incomplete
Designed to:
  give you a feel for the EPrints data model
  introduce you to the most significant objects, how they relate to one another, and their most common methods
  act as a jumping-off point for exploring
Finding Documentation
EPrints modules have embedded documentation
Extract it using perldoc:
  perldoc perl_lib/EPrints/EPrint.pm
EPrints 3.0
This talk is based on the EPrints 2.3 series
The 3.0 API is still being finalised:
  tidies up the object hierarchy
  resolves some of 2.3’s naming clashes
  adds lots of extra functionality, but the core data model remains the same
EPrints 3.0 is fully backwards-compatible: 2.3 scripts will work with EPrints 3.0
Roadmap
1. Data: EPrints, Users, Documents, Subjects, Subscriptions
2. Data collections: DataSets, MetaFields
3. Searching your data: SearchExpressions
4. Scripting your archive: Archives, Session
1. Data
EPrints, Users, Documents, Subjects, Subscriptions
Data Model Sketch
[Diagram, built up over several slides. Recoverable labels: an EPrint links to "all documents" (Document objects, each holding files such as HTML) and to its "owner" (a User); a User links to "owned eprints" and to "subscriptions" (Subscription objects); Subjects link to one another as "parent" and "child", and a Subject links to its "posted eprints".]
EPrint
An EPrint object represents a single deposit in your EPrints archive
  has some metadata fields
  has one or more documents
  is owned by a user
Creating EPrints
new(session, id)
  create an EPrint object for an existing deposit
create(session, dataset, data)
  create a new EPrint object
More on sessions and datasets later!
Introducing DataObj
EPrint is a subclass of DataObj
DataObj provides common methods for:
  accessing metadata
  rendering XHTML output
Inherited from DataObj
get_id()
get_url(staff)
  get the URL of an EPrint, e.g. the URL of the abstract page of an eprint in the archive
  if staff is true, returns the URL of the staff view, which shows more detail
get_type()
  get the EPrint type, e.g. article, book, thesis, conference paper...
Inherited from DataObj
get_value(fieldname)
  get the value of the named field
set_value(fieldname, value)
  set the value of the named field
  remember to call commit() to write the change to the database!
is_set(fieldname)
  true if the named field has a value
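A minimal Perl sketch of the set_value/commit pattern from this slide. The helper name set_if_empty is hypothetical. It uses only is_set, set_value and commit as described in this talk, and takes the object as a parameter rather than assuming any particular archive.

```perl
use strict;
use warnings;

# Hypothetical helper: fill in a field only when it has no value yet.
# $eprint is assumed to be an EPrint (or another DataObj that offers commit()).
sub set_if_empty
{
    my( $eprint, $fieldname, $value ) = @_;

    # is_set() is true when the named field already has a value
    return 0 if $eprint->is_set( $fieldname );

    # set_value() changes the object in memory only ...
    $eprint->set_value( $fieldname, $value );

    # ... commit() is what writes the change to the database
    $eprint->commit();

    return 1;
}
```

In a real script the object would come from something like new(session, id); the return value reports whether anything was written.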
EPrint Methods
remove()
  erase the eprint and any associated records/files from the database and filesystem
  this should only be called on EPrints in the "inbox" or "buffer" datasets
commit()
  write any changes made to the eprint to the database
datestamp()
  set the last modified date to today
Moving EPrints Around
move_to_deletion()
  transfer the eprint to the deletion dataset
  should only be called on eprints in the archive dataset
See also: move_to_inbox(), move_to_buffer(), move_to_archive()
Rendering EPrints
generate_static()
  generate the static abstract page for the eprint
  in a multi-language archive this will generate a page in each language
Rendering - Inherited from DataObj
render_citation(style)
  create an XHTML citation for the EPrint
  if style is set, use the named citation style defined in citations-en.xml
render_citation_link(style)
  as above, but the citation is linked to the EPrint’s abstract page
Rendering - Inherited from DataObj
render_value(fieldname, showall)
  get an XHTML fragment containing the rendered value of the named field in the current language
  if showall is true then all languages are rendered
  usually used for staff viewing (checking) data
Rendering Tips
Most rendering methods return XHTML
  but not as a string!
  they return XML Node objects: DocumentFragment, Element, TextNode...
In your scripts, build a document tree from these nodes
  e.g. node1->appendChild(node2)
  then flatten it to a string
Why? It’s easier to manipulate a tree than to manipulate a large string
More Rendering Tips
XML Node objects are not part of EPrints
  they come from the XML::DOM or XML::GDOME libraries
  explore these libraries using perldoc
XHTML is good for building Web pages
  but not so good for command line output!
  use tree_to_utf8()
  it extracts a string from the result of any rendering method
  e.g. tree_to_utf8( eprint->render_citation() )
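A hedged sketch of the command-line tip above: render each citation as a node tree, then flatten it to text. The helper name citation_lines is hypothetical. The flattening function is passed in as a code reference because this talk does not say which package tree_to_utf8 lives in. The comment below assumes EPrints::Utils, which should be checked with perldoc.

```perl
use strict;
use warnings;

# Hypothetical helper: one plain-text line per eprint, for command line output.
# $to_text is a code reference that flattens an XML node tree to a string;
# in an EPrints script it would wrap tree_to_utf8, for example
#   sub { EPrints::Utils::tree_to_utf8( $_[0] ) }   # package name is an assumption
sub citation_lines
{
    my( $to_text, @eprints ) = @_;

    my @lines;
    foreach my $eprint ( @eprints )
    {
        # render_citation() returns an XML node tree, not a string
        my $tree = $eprint->render_citation();

        push @lines, $eprint->get_id() . ": " . $to_text->( $tree );
    }

    return @lines;
}
```

Keeping the flattening step separate means the same loop can be reused with any function that turns a node tree into a string.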
Navigating to Related Objects
get_user()
  get a User object representing the user to whom the EPrint belongs
get_all_documents()
  get a list of all the Document objects associated with the EPrint
We will look at these objects next...
User
A User object represents a single registered user
Also a subclass of DataObj
  inherits metadata access methods: get_url, get_type, get_value, set_value, is_set
  inherits rendering methods: render_citation, render_citation_link, render_value
Also has commit and remove (inherited from DataObj in 3.0)
Creating Users
new(session, id)
  create a User object from an existing user record
user_with_email(session, email)
user_with_username(session, username)
create_user(session, access_level)
  create a new User
User Accessors
get_editable_eprints()
  get a list of EPrints that the user can edit
get_owned_eprints(dataset)
  get a list of EPrints owned by the user in the dataset
is_owner(eprint)
  true if the user is the owner of the EPrint
get_subscriptions()
  get a list of Subscriptions associated with the user
Document
A single document associated with an eprint
  may actually contain one or more physical files
  PDF = 1 file
  HTML + images = many files
Another subclass of DataObj
Creating a Document Object
new(session, docid)
  create a Document object from an existing record
create(session, eprint)
  create a new Document object for the given EPrint
Document Accessors
get_eprint()
  get the EPrint object the document is associated with
local_path()
  get the full path of the directory where the document is stored in the filesystem
files()
  get a list of (filename, file size) pairs
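A small sketch that walks from an EPrint to its documents and their files, using get_all_documents, local_path and files as described in this talk. The helper name file_report is hypothetical. It assumes the (filename, file size) pairs returned by files() can be read into a Perl hash, which the slides do not state explicitly.

```perl
use strict;
use warnings;

# Hypothetical helper: list every file of every document of one eprint.
sub file_report
{
    my( $eprint ) = @_;

    my @report;
    foreach my $doc ( $eprint->get_all_documents() )
    {
        # Assumption: files() yields filename => size pairs
        my %files = $doc->files();

        foreach my $filename ( sort keys %files )
        {
            push @report,
                $doc->local_path() . "/" . $filename . "\t" . $files{$filename};
        }
    }

    return @report;
}
```

Each returned line holds the full path of a file and its size, separated by a tab.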
Main File and Format
get_main()
set_main(main_file)
  get/set the ‘main’ file for the document
  e.g. if the document is multipage HTML with images, the main file needs to be set to the top index.html file
  when rendering document links, EPrints always links to the main file in the document
set_format(format)
  sets the document format
Adding Files to Documents
upload(filehandle, filename)
  uploads the contents of the given file handle
  adds the file to the document (using the given filename)
add_file(file, filename)
  adds a file to the document (using the given filename)
  file is the full path to the file
Adding Files to Documents
upload_url(url)
  grab file(s) from the given URL
  in the case of HTML, only relative links will be followed
add_archive(file, format)
  add files from a .zip or .tar.gz archive
remove_file(filename)
  remove the named file from the Document
Subject
A single subject from the subject hierarchy
Another subclass of DataObj
Creating Subjects
new(session, subjectid)
  create a Subject object from an existing subject
create(session, id, name, parent, depositable)
  create a new Subject
  depositable specifies whether or not users can deposit eprints in the subject
Subject Accessors
children()
  get a list of Subjects which are the children of the subject
get_parents()
  get a list of Subjects which are the parents of the subject
subject_label(session, subject_tag)
  get the full label of a subject, including parents
Subject Accessors
count_eprints(dataset)
  get the number of eprints associated with the subject
posted_eprints(dataset)
  get a list of EPrints associated with the subject
Rendering Subjects
render_with_path(session, topsubjid)
  get a DocumentFragment containing the subject path
  example of a subject path:
  H Social Sciences > HD Industries. Land use. Labor > HD28 Management. Industrial Management
Subscription
A stored search which is performed every day/week/month on behalf of a user
Another subclass of DataObj
get_user()
  get the User who owns the subscription
Creating Subscriptions
new(session, id)
  create a Subscription object from an existing subscription
create(session, userid)
  create a new Subscription object for the given user
Processing Subscriptions
send_out_subscription()
  search for new items matching the subscription settings
  email them to the user owning the subscription
DataObj Hierarchy
[Diagram: EPrint, User, Document, Subject and Subscription shown as subclasses of DataObj.]
So Far...
We’ve looked at individual data objects
  but an EPrints archive holds many eprints and documents, has many registered users etc.
  how do we access them collectively?
We’ve seen the get_value and set_value methods for metadata
  but an archive’s metadata is configurable
  so how do we know what metadata fields an EPrint, User etc. has?
  how do we access properties of the fields?
2. Data Collections
DataSets and MetaFields
Dataset
A collection of data items
Tells us all the possible types in the collection
  e.g. EPrints may be article, thesis
Tells us the fields in each type
  e.g. article has title, authors, publication...
  e.g. conference_item has title, authors, event_title, event_date...
Can also tell us all the fields that apply to a dataset
  title, authors, publication, event_title...
Dataset Configuration
ArchiveMetadataFieldsConfig.pm
  fields in each dataset
  additional system fields are defined in EPrint.pm, User.pm etc.
metadata-types.xml
  types in each dataset
  fields that apply to each type
Datasets in EPrints
archive
  EPrints that are live in the main archive
buffer
  EPrints that have been submitted for editorial approval
deletion
  EPrints that have been deleted from the archive
inbox
  EPrints which users are still working on
eprint
  all EPrints from archive, buffer, deletion and inbox
Datasets in EPrints
user
  all registered Users
subject
  all Subjects in the subject tree
document
  the Documents belonging to all EPrints in the archive
subscription
  the Subscriptions which Users have requested
DataSet Accessors
id()
  get the id of the dataset
count(session)
  get the number of items in the dataset
get_item_ids(session)
  get a list of ids of the items in the dataset
Datasets and MetaFields
Many Dataset methods return MetaField objects
A MetaField is a single field in a dataset
  tells us properties of the field, e.g. name, type, input_rows, maxlength, multiple etc.
  configured in ArchiveMetadataFieldsConfig.pm
  but does not hold the field value
  the value is specific to the individual EPrint, User etc., e.g. eprint->get_value("title")
MetaField Methods
get_name()
  get the field name
get_type()
  get the field type
get_property(name)
set_property(name, value)
  get the named property, or set it to the given value
MetaField Type Hierarchy
DataSet Accessors
has_field(fieldname)
  true if the dataset has a field of that name
get_field(fieldname)
  get a MetaField object describing the named field
DataSet Accessors
get_fields()
  get a list of MetaFields belonging to the dataset
get_types()
  get a list of all types in the dataset
  e.g. EPrint types: article, book, book_section, conference_item, monograph, patent, thesis, other
  e.g. User types: user, editor, admin
get_type_name(session, type)
  get a string containing a human-readable name for the specified type in the current language
DataSet Accessors
get_type_fields(type)
  get a list of MetaFields belonging to the given type
get_required_type_fields(type)
  get a list of the MetaFields which are required for the given type
field_required_in_type(field, type)
  true if the given field is required in the given type
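A sketch of how the DataSet accessors above might be combined to discover an archive's configured metadata at run time. The helper names as_list and required_field_names are hypothetical. The slides say these methods "get a list" but not whether that is a Perl list or an array reference, so as_list accepts either.

```perl
use strict;
use warnings;

# Hypothetical helper: accept either a plain list or a single array reference.
sub as_list
{
    return ( @_ == 1 && ref( $_[0] ) eq "ARRAY" ) ? @{ $_[0] } : @_;
}

# Hypothetical helper: map each type in a dataset to the names of the
# fields that are required for that type.
sub required_field_names
{
    my( $dataset ) = @_;

    my %required;
    foreach my $type ( as_list( $dataset->get_types() ) )
    {
        # get_required_type_fields() returns MetaField objects;
        # get_name() gives the field name of each one
        my @fields = as_list( $dataset->get_required_type_fields( $type ) );
        $required{$type} = [ map { $_->get_name() } @fields ];
    }

    return \%required;
}
```

The result is a hash reference keyed by type, for example the names of the required fields for "article", whatever the archive's configuration defines them to be.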
Rendering DataSets
render_name(session)
  get an XHTML fragment containing the name of the dataset in the language of the current session
render_type_name(session, type)
  get an XHTML fragment containing the name of the given type in the language of the session
Rendering MetaFields
render_name(session)
  get an XHTML fragment containing the name of the field in the current language
  e.g. from phrases-en.xml:
  <ep:phrase ref="eprint_fieldname_title">Title</ep:phrase>
Rendering MetaFields
render_input_field(session, value)
  get some XHTML containing input controls that will allow a user to input data to the field
  value is the default value
Rendering MetaFields
render_help(session, type)
  get some XHTML containing help text for a user inputting some data for the field
  type is optional: if it is specified, help specific to that type will be used if available
  e.g. from phrases-en.xml:
  <ep:phrase ref="eprint_fieldhelp_title">The title of the item.</ep:phrase>
  <ep:phrase ref="eprint_fieldhelp_title.book">The title of the book, usually found on the title page.</ep:phrase>
So Far...
We know how to access data objects in EPrints
  EPrint, User, Document...
We know how to access collections of these objects
  DataSets, MetaFields
Now, how do we search for items?
Searching Your Archive
SearchExpressions
SearchExpression
The conditions of a single search
new(data)
  create a new search expression from the given data
  se = new SearchExpression( session => session, dataset => dataset, custom_order => "title" )
  results are sorted by title, ascending
Adding Search Fields
add_field(metafield, value)
  add a new search field with the given value (search text) to the search expression
  if the search field already exists in the search expression, its value is replaced
Adding Search Fields
Example: full text search
  searchexp->add_field( dataset->get_field("title"), "routing", "IN", "ALL" )
Adding Search Fields
Example: full text search that matches the word in title OR abstract
  searchexp->add_field( [ dataset->get_field("title"), dataset->get_field("abstract") ], "routing", "IN", "ALL" )
Adding Search Fields
Example: date range search
  searchexp->add_field( dataset->get_field("date"), "2000-2004", "EQ", "ALL" )
Serialising Searches
serialise()
  get a text representation of the search expression, for persistent storage
from_string(string)
  unserialises the contents of string
  but only into the fields already existing in the SearchExpression
Rendering SearchExpressions
render_description()
  get some XHTML describing the current parameters of the search expression
render_search_form(help)
  render an input form for the search expression
  if help is true then this also renders the help for each search field in the current language
Processing Results
Carry out a search using:
  perform_search()
The results can then be accessed:
count()
  get the number of results
get_records(offset, count)
get_ids(offset, count)
  get a list of DataObjs (e.g. EPrint, User) representing the result set, or just their ids
  optionally specify a range of results to return from the result set using offset and count
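A hedged sketch tying the search steps together: add a title field, run the search, then read the count and the first few records. The helper name matching_titles is hypothetical. It expects a SearchExpression that the caller has already created with new, and assumes get_records returns a Perl list, as the slide wording suggests.

```perl
use strict;
use warnings;

# Hypothetical helper: search the title field for a word and return the
# total number of hits plus the titles of the first $max results.
sub matching_titles
{
    my( $searchexp, $dataset, $word, $max ) = @_;

    # Same arguments as the full text search example in this talk
    $searchexp->add_field( $dataset->get_field( "title" ), $word, "IN", "ALL" );

    $searchexp->perform_search();

    # count() is the size of the whole result set;
    # get_records( offset, count ) fetches only a slice of it
    my $total = $searchexp->count();
    my @records = $searchexp->get_records( 0, $max );

    return {
        total  => $total,
        titles => [ map { $_->get_value( "title" ) } @records ],
    };
}
```

Fetching a bounded slice keeps memory use predictable even when the result set is large.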
Processing Results
map(function, args)
  using get_records to get results uses a lot of memory if there are 1000s of results
  map applies the function to each result without that overhead
  function is called with args: (session, dataset, dataobj, args)
The DataSet object also has a map function
  creates a SearchExpression over the dataset
  sets allow_blank = 1
  passes args to searchexp->map
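A minimal sketch of the map pattern: the callback receives (session, dataset, dataobj, args) for each result, so a running tally can be kept without loading every record at once. The helper name count_by_type is hypothetical, and the sketch assumes the caller has already run perform_search on the search expression.

```perl
use strict;
use warnings;

# Hypothetical helper: tally results by type using map.
sub count_by_type
{
    my( $searchexp ) = @_;

    my %counts;

    $searchexp->map(
        sub {
            # Argument order as given on the slide
            my( $session, $dataset, $dataobj, $tally ) = @_;
            $tally->{ $dataobj->get_type() }++;
        },
        \%counts,    # handed to the callback as its last argument
    );

    return \%counts;
}
```

Passing a hash reference as args gives the callback somewhere to accumulate its result between calls.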
Aside: Lists in EPrints 3.0
In EPrints 3.0, searches return a List
  an ordered collection of DataObjs
In fact, any 2.3 function which returns a list (array) of DataObjs returns a List in 3.0
  list->reorder( neworder )
  list->union( list2 )
  list->intersect( list2 )
  list->remainder( list2 )
You can map over the items in a List, even arbitrarily constructed ones
Scripting Your Repository
Archives and Sessions
Archive
One EPrints installation can host multiple archives
An Archive object is a single EPrints archive
  gives access to archive-specific configuration
Don’t confuse the Archive object with the archive DataSet!
  archive->get_dataset("archive")
Archive is renamed Repository in 3.0
Archive Accessors
get_id()
  get the id string of the archive
get_conf(key, subkeys)
  get a named configuration setting
  probably set in ArchiveConfig.pm
  e.g. get_conf( "stuff", "en", "foo" )
Calling Archive Subs
call(cmd, params)
  calls the subroutine named cmd specified in the archive configuration (ArchiveConfig.pm etc.) with the given parameters and returns the result
can_call(cmd)
  true if the named cmd exists in the archive configuration
Lets you delegate processing to “user” space
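A short sketch of the delegation pattern above: check with can_call before using call, and fall back to a default when the archive configuration does not define the subroutine. The helper name call_or_default is hypothetical, and the subroutine name is supplied by the caller.

```perl
use strict;
use warnings;

# Hypothetical helper: call an archive-configured subroutine if it exists,
# otherwise return the supplied default value.
sub call_or_default
{
    my( $archive, $cmd, $default, @params ) = @_;

    # can_call() is true only if $cmd exists in the archive configuration
    return $default unless $archive->can_call( $cmd );

    return $archive->call( $cmd, @params );
}
```

This lets a script offer an optional hook without failing on archives that have not configured it.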
Session
Not a session in the traditional Web sense
  not stateful (although it might be in future!)
  3.0 introduces cookie-based authentication
A global object which provides access to:
  current language
  generic rendering functions
  CGI parameters (input from forms etc.)
  http request
Always create a session object at the beginning of your script
  don’t forget to terminate it at the end
Creating & Ending a Session
new(mode, param)
  set mode to 0 for an online session (CGI script): uses the language from the cookie, http headers, or the default language
  set mode to 1 for an offline session (command line script): param is the id of the archive, uses the default language
terminate()
  terminate the session, performing necessary cleanup
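A sketch of the "create at the start, terminate at the end" rule for offline scripts. The helper name with_offline_session is hypothetical. The session class is passed in by name (in an EPrints 2.3 script this would presumably be "EPrints::Session" after loading that module), and the sketch assumes new returns an undefined value when the session cannot be created.

```perl
use strict;
use warnings;

# Hypothetical helper: run $work with an offline session and always
# terminate the session afterwards, even if $work dies.
sub with_offline_session
{
    my( $session_class, $archive_id, $work ) = @_;

    # mode 1 = offline (command line) session; param is the archive id
    my $session = $session_class->new( 1, $archive_id );
    die "Could not create a session for archive '$archive_id'\n"
        unless defined $session;

    # Trap any error so that terminate() still runs
    my $result = eval { $work->( $session ) };
    my $error = $@;

    $session->terminate();

    die $error if $error;
    return $result;
}
```

The body of a script then becomes a single code reference that receives the session and can use it freely.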
Web Page Building Blocks
make_doc_fragment()
  create an empty XHTML document fragment
  fill it with things!
make_text(text)
  create an XML TextNode
make_element(name, attrs)
  create an XHTML Element
  make_element("p", align => "right")
  <p align="right" />
render_link(uri, target)
  create an XHTML link
  link = session->render_link("foo.html", "frame1")
  link->appendChild(session->make_text("Foo"))
  <a href="foo.html" target="frame1">Foo</a>
Web Page Building Blocks
Many methods for building input forms, including:
  render_form(method, dest)
  render_option_list(params)
  render_hidden_field(name, value)
  render_upload_field(name)
  render_action_buttons(buttons)
  ...
Web Pages
build_page(title, body)
  wraps your XHTML document in the archive template
send_page()
  flatten the page and send it to the user
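A sketch that assembles a small page from the building blocks and sends it. The helper name send_greeting_page and the page text are hypothetical. Passing the title as a TextNode is an assumption, since the talk does not say what type build_page expects for title.

```perl
use strict;
use warnings;

# Hypothetical helper: build a one-paragraph page with a link and send it.
sub send_greeting_page
{
    my( $session, $name ) = @_;

    # Start with an empty fragment and fill it with nodes
    my $body = $session->make_doc_fragment();

    my $para = $session->make_element( "p" );
    $para->appendChild( $session->make_text( "Hello, " . $name . ". " ) );

    # render_link() returns an empty link element; add its text as a child
    my $link = $session->render_link( "foo.html", "frame1" );
    $link->appendChild( $session->make_text( "Continue" ) );
    $para->appendChild( $link );

    $body->appendChild( $para );

    # Assumption: the title is given as a node, like the body
    $session->build_page( $session->make_text( "Greeting" ), $body );
    $session->send_page();

    return;
}
```

Everything stays a node tree until send_page flattens it, which is the approach recommended in the rendering tips.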
Language
change_lang(langid)
  change the session language to the given language ID
phrase(phraseid, inserts)
  get the given phrase (as a string) in the current language
  looks up phraseid in the language-specific phrase file, e.g. phrases-en.xml
Let’s look at an example of the inserts parameter...
Language
html_phrase(phraseid, inserts)
  render an XHTML phrase in the current language
  session->html_phrase( 'link_to_google', link => session->render_link("http://www.google.com") )
  gets the phrase link_to_google from phrases-en.xml:
  <ep:phrase><ep:pin ref="link">Search Google</ep:pin></ep:phrase>
  and renders:
  <a href="http://www.google.com">Search Google</a>
User Input
have_parameters
  true if parameters (POST or GET) were passed to the CGI script (e.g. from an input form)
param(name)
  get the value of a parameter passed to the CGI script
get_action_button()
  get the id of the button the user pressed
client
  get the name of the user’s browser
Navigating the API