ctio / dpps / archive — 2006 · 2019-12-01 · ctio / dpps / archive — 2006 andrew cooke1...

38
CTIO / DPPS / Archive — 2006 Andrew Cooke 1 Alvaro Ega˜ na 2 September 2006 1 [email protected] 2 [email protected]

Upload: others

Post on 22-May-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

CTIO / DPPS / Archive — 2006

Andrew Cooke1 Alvaro Egana2

September 2006

[email protected]@egana.net

Page 2: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

Abstract

This is a collection of papers that illustrate our work over the last year.

The first — Spring, Mule, Maven: Lightweight SOA with Java — is a ‘recipe’ for implementing services.Our aim is flexible, adaptable software, exploiting third party solutions only when we understand why andhow they will be used, keeping external interfaces separate from our core ‘business logic’. It’s not that wewant to re–invent every wheel, but we are tired of being bound to wheels that we don’t understand, rollingin directions we didn’t choose, at speeds we cannot control1.

Next comes a draft of our ADASS paper — FITS Files and Regular Grammars: A DMaSS Design CaseStudy — largely an exercise in stating the obvious. Which is not necessarily a bad thing; sometimes it’sworth taking the time to understand what is intuitive. For us, this paper is important because we aretrying to find a balance between the ‘craft’ of programming and its theory.

A Tiny Workflow in Spring and A Simple, Lazy, Expression Evaluator are shorter papers, describingtwo small libraries we created to help simplify our work2. The workflow helps structure our services; thisdescription should throw some light on the ideas in the first paper. The evaluator was a ‘bit of fun’, throwntogether in a couple of days; the ease with which something like that can be implemented in (recent) Javais worth noting.

Working in Chile, for a department still struggling to position itself within the VO, we sometimes feelthat both our colleagues and the community as a whole have little idea of what we do. This is one smallattempt to change that. It was written (mostly) in our own time which, we felt, gave us the liberty toexpress our own views, even when they differ from those of our co–workers.

Andrew & Alvaro

P.S. If there’s a common style or attitude across these papers, and if it strikes a chord with anyone else, andif you are that someone else, and if, in addition, you have some idea of how we could build a communityaround it, or have ideas or experiences3 you’d like to share, or, even, a couple of jobs you’d like to offer,then, please, drop us a line. Cheers.

1We try to avoid metaphors like this in the paper itself.2Please don’t think that’s all the code we’ve written this last year!3Or advice on proper use of the comma.

Page 3: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

2

Page 4: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

Contents

1 Spring, Mule, Maven: Lightweight SOA with Java 51.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.1.1 Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.1.2 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.1.3 Our Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.1.4 Road–map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.2 Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.2.2 Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.2.3 Communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.2.4 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.3 Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81.3.1 Java . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81.3.2 Spring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81.3.3 Mule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81.3.4 Maven . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101.4.1 Recapitulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101.4.2 Naming Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101.4.3 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101.4.4 Commentary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101.4.5 Deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111.4.6 Pattern: Nesting Facades . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121.5.1 Recipe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121.5.2 Simplicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131.5.3 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131.5.4 Philosophical Purity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

1.6 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2 FITS Files and Regular Grammars: A DMaSS Design Case Study 152.1 The Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.1.1 FITS Headers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152.1.2 NOAO DMaSS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152.1.3 Vocabularies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152.1.4 Types and Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.2 The Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162.2.1 Formal Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162.2.2 Simple Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162.2.3 Reintroducing Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162.2.4 Graph Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.2.5 Semantics, Ontology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3

Page 5: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

2.3 The Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.3.1 Recognition Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.3.2 Reality Check: Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182.3.3 Categories, Orthogonal Vocabularies . . . . . . . . . . . . . . . . . . . . . . . . . . . 182.3.4 Separation of Concerns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182.3.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182.3.6 Reality Check: Field Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202.4.1 Vocabularies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202.4.2 Loading and Remediation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202.4.3 Duplicate Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202.4.4 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202.4.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.5 Addendum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212.5.1 Non-Unique Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212.5.2 Hierarchical Headers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3 A Tiny Workflow in Spring 233.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233.2 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.3.1 Pseudo–Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233.3.2 DB Loader . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.4 Experience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263.4.1 XML Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263.4.2 Verbosity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263.4.3 Lack of Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263.4.4 Persistent State . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263.4.5 Pluggability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

4 A Simple, Lazy, Expression Evaluator 294.1 The Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4.1.1 The Remediation Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294.1.2 Remediation Actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294.1.3 Configuration via Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4.2 The Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304.2.1 Domain Specific Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304.2.2 Design Choices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304.2.3 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304.2.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304.2.5 Generics, Reflection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314.2.6 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334.3.1 Experience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334.3.2 Optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334.3.3 Philosophy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

A CTIO / DPPS / Archive Is... 35A.1 Andrew Cooke . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

A.1.1 Contact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35A.1.2 Interests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

A.2 Alvaro Egana . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35A.2.1 Contact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35A.2.2 Interests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4

Page 6: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

Chapter 1

Spring, Mule, Maven: LightweightSOA with Java

Authors. Andrew Cooke, Alvaro Egana.

Abstract. We show how Spring and Mule can helpimplement services within a Java–based SOA (Ser-vice Oriented Architecture). The presentation has apractical emphasis, based on our own experience.

The approach is minimal and incremental, build-ing around ‘core’ business logic implemented in‘plain’ objects (POJOs). Using careful design andchoice of technology we extend this, focusing on mes-saging.

1.1 Introduction

1.1.1 Aims

This paper introduces an ‘implementation recipe’,summarised in section 1.5.1. We are implementing aSOA system; this is the paper we wish we had readbefore we started.

Our approach is minimal and incremental. At itscore is the service’s business logic1, implemented in‘plain’ Java (POJOs), with an associated interface.

Some services depend on others. This is in-evitable. Rather than obscure this dependency, wemake it explicit; dependency is expressed througha direct call to the sub–service’s business interface.This allows easy verification of the system, via bothcompilation and ‘integration tests’2.

With careful design and choice of technology wecan then add further functionality without influenc-ing the core. This paper focuses particularly on mes-saging, but we believe that the approach is general;

1The ideas that the service embodies, no matter what sup-porting technology is involved.

2In–memory tests of multiple services without messaging

that one should start with the simplest system nec-essary, understand it, and then incrementally extendit.

1.1.2 Scope

Our ideas rely on the language having a (static) typesystem. In section 1.3 we recommend Java tech-nologies and we will use Java language terminologythroughout the paper, but a similar approach shouldbe possible with C++ or C#.

1.1.3 Our Background

We have been working with Mule and Spring atCTIO (Cerro–Tololo Inter–American Observatory),helping develop the DMaSS (Data Management andScience Solutions) platform (which includes the NSA(New Science Archive)). This is a (incomplete)Java–based set of services that can be assembled tocreate an archive for astronomical data3.

Before this, we have worked with variousJ2EE server–based systems (WebLogic, WebSphere,JOnAS and JBoss). In comparison, Spring and Muleapplications ‘feel’ much more flexible; the frame-works are less intrusive and the separation betweenlogic and support services is much clearer. However,we can also ‘retro–fit’ the lessons we have learnt backinto a J2EE server environment: we have writtenmany services that can be deployed in either.

1.1.4 Road–map

We present the basic ideas behind our approach insection 1.2; section 1.2.1 motivates all the subse-

3While we use DMaSS for our examples, and owe muchto discussions with our colleagues at NOAO, this paper is apersonal project; it is not an official NOAO publication anddoes not necessarily reflect how NOAO implements software.

5

Page 7: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

quent work. In section 1.3 we explain how certaintechnologies can simplify the approach. Section 1.4is an extended example that shows how a single ser-vice can be deployed in a variety of different scenar-ios. Our ‘recipe’ is summarised in section 1.5, wherewe also discuss future work.

1.2 Interfaces

1.2.1 Introduction

‘Programming to an interface’ is a standard idiomwithin the Java community; interfaces4 are used todefine a framework which is then implemented withset of classes.

We use interfaces to:— Document the components.— Verify that components are compatible.— Delay the choice of implementation.Verification is implemented by the compiler,

which checks that the code is consistent. This isimpossible in untyped languages and difficult whencomponents are completely isolated.

While static verification is important, it is not suf-ficient for a reliable system; tests are important too.We will show how the consistent use of interfaceshelps improve testing.

Delaying implementation choices until assembly isthe ‘dependency injection pattern’ (a.k.a. ‘inversionof control’) popularised by Spring[1] and adopted byEJB3[2].

These ideas are complementary: delaying imple-mentation helps manage inter–dependencies betweenpackages which, in turn, enables verification; verifi-cation assures that delayed choices are safe choices.

The rest of this paper shows how these ideas canbe applied to SOA.

1.2.2 Services

Verification requires that each service ‘know’ the in-terface provided by other services; but we must avoidcircular chains of references, or services cannot becompiled separately. There are several solutions tothis problem: restrict communication between ser-vices to avoid cycles; use a hub/spoke architecture;separate interface and implementation; forgo verifi-cation.

For simple systems, the first of these may be suf-ficient. Otherwise, separating interface and imple-

4The general software engineering idea of an interface isrepresented in Java by an ‘interface’ construct, which is equiv-alent to a purely abstract class.

mentation is preferred, since it allows verificationwithout restricting topology.

So we require each service to have a clear, simple‘business interface’5 that depends only on basic Javaclasses and stand–alone libraries6. Service businessimplementations then depend only on business in-terfaces to services (their own, and any services theydepend on).

Maven (section 1.3.4) will automatically order thecompilation if packages are structured in this way.

Note that we are relying on the ability to delay thechoice of implementations. If Service A calls ServiceB, it is compiled against B’s interface, alone. B’simplementation is compiled separately and injectedlater, at run–time.

1.2.3 Communication

So far we have not addressed messaging. This is notan oversight: messaging should influence neither thebusiness interface nor the core implementation.

Requiring that business interfaces and implemen-tations be independent of messaging brings severaladvantages: it makes it easy to test multiple ser-vices; allows different messaging technologies to beused; and leads to an implementation in which addi-tional functionality, like caching, is cleanly separatedfrom the main business logic.

However, if we are building a distributed system,we obviously cannot ignore messaging completely.We must be able to send, route, and receive mes-sages.

Sending Messages Messaging is sent via a client.The client is a facade that implements the standardbusiness service interface, but delegates implemen-tation, via messaging, to a remote server.

Receiving Messages A server is responsible forreceiving messages, unpacking them, and callingan embedded (injected) service implementation. Itmust also return the result back to the client.

The server implements the ‘communications inter-face’. This interface is exposed by the service via thecommunications system. Some callers may call ad-dress this interface directly, rather than using theclient.

5In the next section we will introduce a distinct ‘commu-nications interface’.

6Some libraries may depend on others, but none must de-pend on any service.

6

Page 8: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

Routing Messages We separate message routingfrom the client and server. The messaging technol-ogy must be capable of separate configuration.

So, in summary, each service should have a ‘com-munications’ package that provides client–serverfunctionality. The client is a facade that implementsthe service’s business interface, packages argumentsinto messages, dispatches them to the messagingsolution, and receives and unbundles the response.The server performs the reverse functions; receiv-ing messages, unbundling them, calling an embeddedimplementation, receiving, packaging and returningthe result.

We can then inject, into any user of a service, ei-ther the direct business implementation, or the com-munications client. In this way we can choose, atdeploy time, whether or not the two services are sep-arated by messaging.

Note that the service’s communications packagedoes not depend on the business implementationduring compilation (or vice–versa, of course).

Technology Independence As much as possible,neither client nor server should assume a particulartechnology. The client accepts a simple interface (in-jected at run–time) for sending a serialisable Javaobject. The server is a simple class7 that acceptsand returns a serialisable object; the main processingis handled by an embedded service implementationwhich, again, is injected at run–time.

Both servers and clients have a simple, regularstructure that allows easy generalisation of relatedfunctionality, like caching.

Common Messaging Common patterns in mes-saging can be abstracted to a separate library. Thisshould be restricted to have no dependencies onother packages (except libraries).

Typical message classes include carriers for a ‘pay-load’, an empty ‘acknowledge’, and carriers that in-clude either an exception or a payload.

Any object serialised by messaging (not just thegeneric messaging classes, but also payloads) musthave an appropriate class in both client and server.Following the guidelines above guarantees this, butit is important to understand that the constraint alsoapplies to chained exceptions.

7The user of ‘server’ may be misleading here. The respon-sibility for listening to ports, queues, etc, is left to whatevermessaging solution is used (in section 1.3.3 we recommendMule). Our ‘server’ receives a ‘message’ when a method thattakes a serialisable Java object is called.

Figure 1.1: The communications stack: service Asends a message to service B.

Two Interfaces? Services have two interfaces:what we have called the business and communica-tions interfaces. The two are closely related and thequestion naturally arises: which is more fundamen-tal?

We are most concerned here with relatively closedsystems where many inter–operating services arewritten in Java. In such a situation, the businessinterface dominates8. However, the communicationsinterface should not be ignored, as it documents animportant logical component of the system whichwill likely become more important as the systemmatures, growing stronger connections with exter-nal processes.

Layered Architecture The description above de-scribes each service as though it has a client–serverstructure. However, the same components fit withina layered framework, as shown in figure 1.1. Ourclient–server distinction is then a natural way to sep-arate the implementation of the network software.The common messaging classes describe a commonprotocol shared by all services.

This could be emphasised by splitting the commu-nications package into four: interface; client; server;messages. This minimises the network software re-quired for any particular service deployment.

1.2.4 Testing

Delayed choice of implementation allows services tobe assembled in–memory for testing, even if they aredeployed separately; instead of injecting the commu-nications client, the implementation is used directly.

This allows ‘integration tests’ to be run on de-veloper’s machines, without the need for complexdeployment and test harnesses.

8In a later paper we will address the automatic generationof the entire communications package.

7

Page 9: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

Acceptance tests, using a full deployment withmessaging, are still necessary, but the simpler in-tegration tests allow most cross–service bugs to bediagnosed and corrected more quickly.

Another advantage of the emphasis on interfacesis that dummy implementations of sub–componentscan be written and injected as needed.

1.3 Technologies

1.3.1 Java

The approach outlined here was developed withJava.

A similar approach might be possible with C++,but we can see little justification for doing so: Javaincludes memory management, a safer type system,and a wider choice of tools for this type of applica-tion; the advantages of C++ apply more to applica-tions with restrictive hardware requirements.

Microsoft’s .net platform is a possible alternative,but then it would be wiser to follow the implemen-tation route assumed by their proprietary tools.

Ruby, Python and Perl require a different ap-proach since their type systems do not support thecompile–time verification that motivates many of ourrecommendations.

More modern, statically typed languages (ML,OCaml, Haskell) are another option, but we are notconvinced that they have sufficient third–party sup-port (particularly for messaging). One promisingcandidate is Scala[3], which compiles to the JVM(Java Virtual Machine) and so can interoperate withJava–based tools.

1.3.2 Spring

Spring is a rather large application framework. Herewe are concerned only with the inversion of controlcontainer it provides (although it may also be usefullater for transaction management, ORM, and thepresentation layer).

The Spring container allows us to assemble ap-plications from the classes in the implementationpackage. It largely eliminates the need for factoriesand singletons in the code. Instead, at run–time,the Spring configuration defines how instances areconstructed. This works extremely well with theinterface–based approach we have described, sincemuch of the boilerplate code is provided ‘for free’.

1.3.3 Mule

Mule[4] can be considered as ‘Spring for messaging’.In as non–invasive a manner as possible it connectsJava objects with messaging technologies.

In particular, it can create Java objects from aSpring description and, when an appropriate mes-sage is received, it can call an instance method, pass-ing the message as an argument. Similarly, it cantake the value returned by the method and treat itas a response method. This is why our ‘server’ codecan be plain Java objects (POJOs).

It also provides a standard, generic interface toa wide variety of other messaging systems. So, forexample, it can interoperate with REST (or evenemail). It can also build synchronous messagingfrom an asynchronous transport like ActiveMQ[5].

1.3.4 Maven

We use Maven v2[6] to manage the build process. Itsrigid approach makes it easy to track dependenciesbetween packages.

We suggest using a flat ‘logical’ structure, with allpackages being children on a single top–level parent9.Each package is then configured separately, the par-ent is used only to set a few global parameters andallow whole–project compilation, testing, etc.

This does not mean that the packages themselvesneed to be in a flat directory structure. Instead, wesuggest grouping by service. So the parent directorycontains a sub–directory for each project; each ofthose contains further directories for each package(interface, implementation, communication, deploy,etc). A further child of the parent directory containsall the libraries, etc.

Guidelines Maven works just fine, as long as youfollow these simple rules:

• Within a project, use the standard directorystructure.

• Use many, many projects (Maven does not pro-vide functionality at a resolution lower thanprojects; you can easily use Maven to combinethem later if you need to).

• The projects can be located wherever you like(so you can group related projects in a singledirectory).

9If the auto–generated website is important then a twolevel system that groups packages by service may be prefer-able; even then it is simplest to keep each package as inde-pendent as possible.

8

Page 10: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

Naming Business

Communication Business

Metadata Business

Interface

NamingException

«interface»

Naming

Implementation

NamingImpl«interface»

State

«interface»

Formatter

InMemoryState

HexFormatter

Naming Communications

Interface

«interface»

MessageSender

Implementation

MuleSender

Client

Server

Interface

NamingServer

NamingClientNamingResponse NamingRequest

Libraries

Message Protocol

GenericResponse GenericRequest

Interface

PersistentState

«interface»StateManagement

Library XLibrary Y

Implementation

HibernateManagement

«interface»

NamingTarget

«realize»

«realize»

«realize»

«realize»

«realize»

«realize»

«realize»

«realize»

«realize»

«realize»

Figu

re1.2:

Nam

ing

Serv

icestru

cture

(with

relatedpackages,

inclu

din

gm

essaging

and

apossib

ledep

en-

den

cyon

Metad

ataServ

iceto

persist

state).D

ashed

,grey,

thick

lines

arecom

pile–tim

edep

enden

cies.

9

Page 11: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

• Use a simple bash script to create the top levelPOM (Project Object Model; a Maven con-figuration file) from those living in any sub–directory.

• Trust Maven. If something is hard, you are ei-ther doing it the wrong way (try splitting yourwork into more projects) or your code’s high–level structure is poor.

1.4 Example

1.4.1 Recapitulation

In the previous sections we advocated splitting eachservice’s business code into two separate packages:

Business Interface: A description of the service,in terms of simple Java language and libraryobjects.

Business Implementation: A direct implemen-tation of the business logic that conforms to thebusiness interface and refers to other servicesvia their own business interfaces. Spring canbe used to inject appropriate instances at run–time.

In addition, for communication, we recommendeda separate package (whose components might be sep-arated) containing:

Communication Interface: The interface ex-posed directly through the communications ser-vice. This includes the messages, based on acommon protocol library.

Communication Client: A facade that imple-ments the business interface but sends messagesto an implementation of the communications in-terface (the server).

Communication Server: An adapter that con-verts the business interface to the ‘messagefriendly’ communication interface. Spring canbe used to inject a business implementation;Mule can transfer messages between client andserver.

1.4.2 Naming Service

The Naming Service provides unique, system–wideidentifiers. Its interface is a single method, StringgetNewName(), that returns a new value on each call.

1.4.3 Structure

The Naming Service and its main dependencies aresketched in figure 1.2.

The business interface package containsthe Naming interface described above andNamingException, which is the exception thrownby the interface on error.

The main class in the business implementationtakes two interfaces: Source which provides a se-ries of long values; Formatter which converts longto String.

Different implementations of these inter-faces can be deployed for different behaviours.InMemoryState is a class with a AtomicLong

instance field, whose value is incremented andreturned (obviously this functionality is too simplefor a reliable, distributed system, but it will servefor testing). HexFormatter formats the number inhexadecimal.PersistentState is a second, more sophisticated

implementation of State that uses the MetadataService to persist values. This service is not shownin detail in the figure, but illustrates how one servicecan depend on another.

The communications interfaces package containstwo serialisable message objects: NamingRequest

is empty; NamingResponse contains either aNamingException or a String (the name).The communications client takes an interface,MessageSender to which it sends the request andreceives a response. The communications servertakes a Naming instance which it calls when itreceives the request.

1.4.4 Commentary

Figure 1.2 illustrates how our approach leads to asystem with the following properties:

• Anything can depend on a library.

• Anything except a library, or the Communica-tions Service and Message Protocol Library, candepend on any other service’s interface.

• Nothing can depend on a service’s business im-plementation package.

• Nothing can depend on a service’s communica-tion package.

• Only the communication packages depend onthe packages related to the CommunicationsService.

10

Page 12: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

«execution environment»

JVM

Caller

Naming Business Implementation

Naming Business Interface«realize»

Figure 1.3: Local deploy of the Naming Serviceshowing call and dependency.

Together, these guarantee that the core businesslogic of each service remains isolated from technol-ogy choices, from other service implementations, andfrom the majority of implementation details.

The communications package bundles together in-terface (including associated messages), client, andserver. Both client and server code are deployedwhen only one is required; the client and server areboth very simple, ‘thin’ classes, but they could beseparated if required.

It is important to understand that server and ser-vice are not synonyms. The service is the logicalcore; the server is a practical detail necessary forthe messaging technology we have chosen. Othertechnologies might (but not necessarily; the code isreasonably general) require different servers, but theservice itself would remain unchanged.

The description of the Naming Service businessimplementation and interface above (one exception,a handful of interfaces and a similar number ofclasses) is practically complete. It is difficult to con-vey quite how simple this code is. It contains noreferences to J2EE, container, or messaging technol-ogy; nor does it contain any singletons or factories(despite the clear need for a singleton State). Yetit can be deployed to give unique names to severaldifferent callers distributed across a network; this isaddressed in the next section.

1.4.5 Deployment

The service can be deployed in a number of ways,relative to some calling service:

• As part of a composite service, running withinthe same JVM as the caller.

• As a remote server, called from Java.

Machine 1

Caller

Naming Communications

Client

Naming Business Interface

Communications Implementation

Communications Interface

Machine 2

Mule

Mule

Naming Communications

Server

Naming Business Implementation

Naming Business Interface

«realize»

«realize»

«realize»

Figure 1.4: Remote deploy of the Naming Serviceshowing remote call and dependencies.

11

Page 13: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

• As a remote server, called by a non–Java service.

• Embedded within a J2EE server.

The choice of deployment method should dependon the particular circumstances; typically depend-ing on the underlying communications service beingused, the degree of reliability required, etc.

Composite Service in JVM. For the Java–based examples here, the user code is always com-piled against the Business Interface Package.

For a composite service, the caller will expect animplementation of Naming, which is injected. Thebusiness interface and implementation packages areused. This is shown in figure 1.3.

Remote Server, Java. The same user code asin the previous section can be configured to calla remote service by injecting the NamingClient.This will, in turn, require a suitable MessageSenderimplementation (typically MuleSender) configuredwith the correct Mule endpoint. Mule must be con-figured to route the endpoint correctly to the server.

The Naming Service server process is aNamingServer (from the communications Package)which is injected with a Naming implementation.See figure 1.4.

Note that the Naming business implementationpackage is not needed on the caller machine.

Remote Server, Non–Java. In the previous sec-tion the Naming service server process was called byJava code via NamingClient. The actual messag-ing work was done by Mule (possibly over an appro-priately configured delegate transport layer). Non–Java code (or Java code written separately from oursystem) can also call the server (more exactly, theNamingTarget interface) using any messaging imple-mentation that inter–operates with Mule.

J2EE Server. It is easy to write adapters that al-low services to be deployed in J2EE servers as state-less session beans. We place the necessary facadesin a separate package and use EJB3 annotations toinject the implementation.

We wrote a MessageSender implementation thatcan be deployed in a J2EE server and extended Muleso that it can call EJBs in the server10.

10This extension has been submitted to Mule and, hope-fully, will be included in the next official release.

1.4.6 Pattern: Nesting Facades

The variety of deployment options described aboveis possible because several different implementationsof the same interface are available, most of whichcan be injected with a further instance of the sameinterface. Injection gives two advantages: first, com-pilation depends only on interfaces (no circular de-pendencies); second, deployment decisions need notappear in the source code (the same jars can be de-ployed in different scenarios without recompiling).

A surprising amount of functionality can be com-posed in this way, especially if the interface is simple.It is important to note that each facade implemen-tation can have a more complex interface, which canbe configured by Swing (so various parameters canbe ‘fine–tuned’ during deploy); only the common ser-vice interface needs to be kept simple.

1.5 Conclusions

1.5.1 Recipe

• Use interfaces to document and verify the code,and to delay the choice of implementation.

• Separate interface and core implementation intoseparate packages.

• Require that interfaces and core implementa-tions ignore communications.

• Unit test.

• Place communications classes in a new package,with separate client and server functionality.

• Make the communications client implement thesame interface as the basic service.

• Choose the appropriate service implementation(direct or via the messaging client) at deploytime.

• Compose services in memory (without messag-ing) for further ‘integration’ testing.

• Use the messaging server directly to handle ‘ex-ternal’ connections.

• Test deployed services (with messaging) asmuch as possible.

• Java, Spring, Mule and Maven make this ap-proach easy.

12

Page 14: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

1.5.2 Simplicity

The implementation package contains the businesslogic — and nothing little else. This makes the ap-proach here future–proof: no matter what happens,the important ‘logic’ of the service is cleanly isolatedand ready for re–use.

Spring and Mule provide the extra structure nec-essary to convert the business logic into a practicalservice. Spring imposes the structure that wouldnormally require singleton and factory boilerplate.Mule adds messaging.

It is hard to understand just how simple the result-ing code is. Despite having used the techniques de-scribed here we still find ourselves writing code thatis too complex; in a later iteration we will delete fac-tories and singletons we had earlier written throughhabit.

1.5.3 Future Directions

Asynchronous Communication. The descrip-tion above has focused on a system in which com-munication is synchronous.

Mule can construct synchronous messaging overan asynchronous transport, but this does not addressthe more important problem of making the callerprocess reliable: if service A calls service B via Muleusing a reliable transport, then some instance of Awill11 eventually receive a reply; if A has restartedin the meantime, however, it will not contain thecorrect state to process the response.

To solve this problem we must either persist thestate of the caller while waiting for a response, orrestructure the system into smaller steps which com-municate asynchronously. The former of these canbe seen as a way of automating the latter.

The standard approach to persisting state is to usea workflow which manages persistence and the inter-face with the communications system. This could beadded to our Tiny workflow[7], but it would prob-ably make more sense to move to a more complexthird party workflow / framework.

This conclusion — that we may need to move to amore sophisticated technology in the future — doesnot alarm us. We do not claim to have solved everyproblem. More importantly, the approach describedhere is so simple, and commits so little to any onetechnology, that the transition should be easy12, par-

11In theory; in practice Mule will fail unless explicitly con-figured for asynchronous messaging, for the reasons describednext.

12With the possible introduction of continuations in thenext release of Java we may find a more lightweight solution.

ticularly if existing services use the Tiny workflow[7]to keep their structure simple.

SOAP. Separate from the discussion about asyn-chronous communication above, we want to com-ment briefly on SOAP and web services. The servicestructure outlined here is very similar to that usedby web services (our ‘client’ matches the client codegenerated from the WSDL). And Mule can interfaceto external SOAP services.

However, our most painful experience so far hasbeen trying to interface to an external web service.We think this was for the following reasons: poorcommunication between two different groups duringdevelopment; lack of experience with SOAP on ourpart; shifting standards and incompatible implemen-tations.

JBI. Java Business Integration (JBI)[8] usesSOAP and is supported by recent versions of Mule.This is another possible route for future develop-ment.

1.5.4 Philosophical Purity

And yet... The approach above seems to be a poorapproximation to a Service Oriented Architecture:if the services need each other’s interface to com-pile, then why not build a monolithic (but modular)system?

Perhaps the best defence of our approach is thatwe can build a ‘monolithic’ system in a way thatis so modular that the transition to ‘real’ SOA isrelatively painless. We (1) have described some ap-proaches to this transition (interfacing with external/ non–Java code via Mule; JBI; persistent state); (2)can often limit this transition only to isolated parts(often the ‘edges’) of our system; (3) believe that theadvantages of this approach — stronger guaranteeson consistency and simpler tools — are perfect forinitial development and, with (1) and (2), appropri-ate for the ‘core’ of many mature systems.

If SOA is more than fashion, it embodies some in-sight about engineering systems. We hope to respectthose insights without needing to wear an unneces-sary hair shirt. This is the best compromise we havefound.

1.6 Acknowledgements

The criticism of others on the NOAO NSA team hashelped us improve these ideas and motivated us towrite this paper: Sonya Lowry has tried very hard

13

Page 15: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

to explain SOA to us; Evan Deaubl was particu-larly helpful with Maven and packaging issues; with-out Phil Warner we would never have understoodhow these ideas can adapt to J2EE servers; BrianThomas at UMD pushed the limits of XML and,with Ping Huang, helped us hone our arguments for(sometimes) re–inventing the wheel.

14

Page 16: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

Chapter 2

FITS Files and Regular Grammars:A DMaSS Design Case Study

Authors. Andrew Cooke, Alvaro Egana, SonyaLowry.

Abstract. The NOAO DMaSS (Data Manage-ment and Science Solutions) Platform has alreadypassed through several development iterations andincludes a Database (DB) Loader service. Experi-ence using that service has been mixed.

Recent discussion of the proposed Metadata ser-vice focused on ‘vocabularies’. We realised thatwe could treat FITS headers as a formal languageand that this language was described by a regulargrammar. It then became clear that part of theDB Loader’s configuration was, in fact, a regularexpression (expressed in several files of rather ver-bose Spring XML!) Furthermore, a separate (andequally confusing) configuration was now clearly re-lated. And we could extend things further: the samedata structures could help detect ‘almost duplicate’data reliably.

Reviewing our results in terms of the system archi-tecture, we suggest moving newly identified respon-sibilities to appropriate services. The end result is acleaner design, better integrated within our architec-ture, and in which we have much more confidence.What started as an idle theoretical curiousity yieldedvery practical results.

2.1 The Problem

2.1.1 FITS Headers

Scientific data often consist of two components: acollection of ‘measurements’ (‘science data’; e.g. animage) and a ‘description’ of those values (‘sciencemetadata’).

In the most common file format, FITS (Flexi-ble Image Transport System[9]), the science meta-data are stored as name/value pairs (grouped in‘headers’). Apart from multi–extension (MEF) files(which we do not believe significantly complicate theproblem, and which we will ignore here), the meta-data have very little structure: the names form asimple set, with no repetition or nesting, and almostno ordering.

2.1.2 NOAO DMaSS

The NOAO DMaSS[10] is a set of services that canbe deployed as an archive for astronomical data.Early iterations already include a DB Loader ser-vice, responsible for loading science metadata intoan SQL database.

Experience with the DB Loader has identifiedproblems with the configuration, which is dis-tributed across several different sets of files: aworkflow[7], which identifies particular data prod-ucts (for example: images observed on a certaincamera/telescope); a mapping of metadata namesto database fields; and a description of the database.Only the last of these is generated automatically, theothers must be carefully constructed by hand1.

2.1.3 Vocabularies

Recent discussion within NOAO has focused onthe concept of ‘vocabularies’ — the types of sci-ence metadata expected for different data products.NOAO manages many instruments/telescopes, and

1This service currently loads database fields (SQL tablecolumns) directly. Changing the system to bind values in a setof objects, managed by the Metadata service, is anticipatedin the design, but would not simplify the problem addressedhere.

15

Page 17: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

the DMaSS is intended as a general system, so wemust be able to manage a wide variety of vocabular-ies.

Since science data enter the system as FITS filesthe range of vocabularies reflects the range of head-ers we must handle. Further complexity is added bythe evolution of vocabularies over time and modifi-cations/additions to the metadata by various data–handling systems.

2.1.4 Types and Values

Faced with a wide range of vocabularies, we neededa way to structure our problem. One source of com-plexity was the interplay between types and values:the set of names that a FITS header might containdepended on the values associated with some of thenames.

While this is completely obvious to anyone whohas worked with FITS data, it is easy to see howit can become difficult to manage in a software sys-tem. The same dependency within a naive OO de-sign would give objects whose class changes depend-ing on the values contained. While this is possiblewith predicate classes[11] it cannot be expressed di-rectly in a language with a fairly simple, static typesystem like Java or C++2.

On the other hand, this didn’t seem like a ‘hard’problem. After all, lots of people write programsthat manage FITS data! We were tempted to ig-nore the issue and simply ‘do the work’, but thenstumbled across an argument that led, in a sequenceof simple steps, to a solution that combines decentengineering with a consistent logical framework.

2.2 The Analysis

2.2.1 Formal Languages

In the following sections we use a few ideas from For-mal Languages[12], but we do not rigorously proveany results; our aim is to use the theory to help illu-minate a practical design. While we do not expect‘real life’ to behave in such an idealised manner, wedo hope that the argument will help us understandhow best to deal with exceptions and irregularities.

This theory is not particularly obscure; it is proba-bly familiar as the basis for parsing, regular expres-sions, etc. We will use the following ideas: that a

2This mixing of types and values also appears to be relatedto Russell’s paradox in set theory.

language3 is composed of an alphabet (a list of words)and a grammar; that grammars define how words arecombined to produce valid sentences; that grammarscan be categorised in various ways; and that perhapsthe simplest and most common of these is a regulargrammar, whose sentences can be described by reg-ular expressions.

2.2.2 Simple Model

The metadata in a FITS header are, for the mostpart, unordered. We start by imposing an arbitraryorder (e.g. sorting their names alphabetically).

To further simplify our initial model we will as-sociate each metadata name with a different letterand ignore the associated values. This gives us ouralphabet. Valid headers, sorted and represented inthis manner, form sentences.

With these assumptions, example FITS headersmight be abcde, abpqrs, or xyz.

We can match those examples using the regularexpression (ab(cde|pqrs)|xyz). Since we couldwrite a similar expression for any header (sentence)in this model we have shown that they are describedby a regular grammar.

This result is not very surprising. The next levelof complexity is a context free grammar. These arecharacterised by nested structures, but FITS headersare not that complex (they lack any idea of ‘names-paces’ for example).

2.2.3 Reintroducing Values

We assumed that the we could represent metadata asletters. This is reasonable if we do not care about thevalue. However, there are at least two cases whichrequire more care.

First, as mentioned in section 2.1.4, there are somemetadata whose values affect the choice of vocabu-lary. For example, the instrument type. In suchcases we must associate a different word in the al-phabet with each distinct option. If originally weassociated the name INSTR with the word a we mustnow introduce distinct words for each type of in-strument. INSTR=’MOSAIC’ might be mapped to a1,INSTR=’ODI’ to a2, etc.

Second, we might want to validate the values. Itis trivial to extend the argument to include regu-lar expressions for each value, but perhaps valida-tion requires a more complex function. Technically

3We use italicised text to indicate that the word is beingused in a formal manner, as the meanings do not coincideexactly with common use.

16

Page 18: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

a grammar is generative — it can be used to con-struct valid sentences — but in our case we are usingthe grammar only to recognise (not construct) validsentences. In such cases, we can use an arbitrarypredicate as a word, providing we do not use anyinformation from ‘outside’ the value in question.

The second case may also apply to the first. Theremight be some metadata that affected the choice ofvocabulary via a complex function. Provided thisfunction takes only the single metadata value as anargument we can extend our alphabet with a suitablepredicate4.

2.2.4 Graph Structure

The regular expression (ab(cde|pqrs)|xyz) fromsection 2.2.2 was constructed by grouping commonprefixes. Matching a FITS header against the ex-pression involves identifying a path through a tree,starting at the root (leftmost node) and ending atone of the leaves (e, s or z). We can consider the reg-ular expression as a ‘decision tree’ whose leaf nodesclassify FITS headers.

This tree–like structure is intuitive and easy torepresent in software. And the leaf nodes suggesta process for identifying vocabularies. In general,however, regular expressions can be more complex— they are equivalent to finite state machines. So wewould like to understand whether this tree structureis (1) sufficient and (2) practical.

A state machine can be represented as a directedgraph (DG) whose edges are transitions from onestate to another. Each edge is associated withmatching a word in the language. A DG may con-tain cycles, but we have already noted that words inour sentences do not repeat (every name in a FITSheader is unique; there is a single namespace). Socycles (repetition) do not seem to be necessary.

Without cycles, we have a directed acyclic graph(DAG). This is still more general than a tree; thebranches of a tree cannot ‘re–join’. How signifi-cant is this lack of expressivity for trees? For finitelength sentences we can enumerate all possible pathsthrough a graph, giving a tree. So for any DG wecan always construct a tree–like regular expressionthat identifies the language (at the expense of dupli-cation of information). Since DAGs are a subclassof DGs the argument also applies there. A tree issufficient, but not optimal.

We will return to this later when we discuss ‘or-thogonal vocabularies’.

4In extreme cases we might even group several metadatatogether and treat them as a single value (e.g. the WCS).

2.2.5 Semantics, Ontology

Our initial analysis assumed (section 2.2.2) an alpha-betical ordering of names. Optional metadata namesnear the start of the alphabet will introduce struc-ture in our tree–like regular expression (they will addunnecessary branching near the root). Our grammaris very sensitive to ‘noise’ in the FITS header.

We have not relied on the type of ordering in any ofour arguments above and are therefore free to chooseone to address this problem. We will impose an or-der based on our intuition about ‘importance’. Theintroduction of a semantic ordering will give a morecompact representation5.

This is (again) an argument that sounds opaquewithin our semi–formal approach, but which is ‘obvi-ous’ in practice. We are suggesting that the regularexpression should start with metadata common to allvalid FITS files6. Next should be ‘important’ fields(INSTR, for example, and then perhaps OBSTYPE)that identify different ‘types’ of FITS files.

In this way, our decision tree reflects the meaningwithin the metadata; the leaf nodes classify FITSheaders within some ontology.

2.3 The Solution

2.3.1 Recognition Trees

We call the semantic decision tree described above a‘recognition tree’. Recognition trees are a tool to or-ganise our knowledge of the structure of FITS head-ers and classify science metadata.

The application of a recognition trees to a FITSheader mirrors the use of a regular expression tomatch a sentence. Starting at the tree root we re-peatedly select a child node based on matching thescience metadata. Eventually we either fail to match— the FITS header is malformed or we have an in-complete tree — or we arrive at a leaf node.

Each edge in the tree corresponds to a predicatethat tests science metadata. Each predicate specifiesa single name and an optional test on the associatedvalue. The test need not be a regular expression,but must base its decision only on the value given.

The structure of a recognition tree is influencedby our knowledge of the semantics associated witheach predicate. This allows us to associate leaf nodeswith ontological concepts (e.g. to classify the FITSfile as a certain data product).

5This is, in some sense, the definition of a ‘useful’ ontology.6Later (section 2.3.4) we will restrict the use of this tree

to identifying vocabularies; metadata common to all can thenbe discarded.

17

Page 19: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

2.3.2 Reality Check: Workflow

The workflow in the current DB Loader is, in ret-rospect, a simple recognition tree. The first edgesare tests on the INSTR metadata value and the leafnodes identify different data products (Mosaic skyflats, generic bias frames, etc).

This workflow is a hand–written, somewhat ad–hoc, Spring XML configuration. Our analysis showsthat we can safely (ie without risk of using too re-strictive a representation) move this definition toa tree–like structure in the database. The empha-sis will change from ‘DB Loader configuration’ to amore specific ‘Vocabulary Service’.

2.3.3 Categories, Orthogonal Vocab-ularies

The introduction of a separate service, any associ-ated improvement in the user interface, and a betterunderstanding of the role of this information, will alllead to a more detailed recognition tree with a finergranularity in the ontology. Rather than a simple‘Mosaic sky flat’, we will have categories like ‘Mo-saic sky flat using WCS projection XXX with addi-tional information from v1.2 of the Kitt Peak DataHandling System’.

For many applications this is ‘too much informa-tion’. We must provide broader classes that grouprelated leaf nodes. A practical system will proba-bly support many different, overlapping categories.These different categories will probably correspondto different parts of our data model.

This fine level of detail in the leaf nodes is alsodriven partly by the use of a tree. We noted in sec-tion 2.2.4 that a DAG allowed branches to re–join.The structure we have chosen leads to a proliferationof branches.

An alternative solution to this problem might beto separate orthogonal issues into separate trees. Forexample, consider WCS–related header properties.We may want to detect different WCS representa-tions, even though these variations have little bear-ing on the classification into different data products.This could be separated into a second, WCS–specificdecision tree.

This separation appears to be possible even whenthere is some overlap of concerns. For example,we might include WCS coordinate type in both theWCS vocabulary and the science data product vo-cabulary (where it would be checked for consistencywith imaging or spectral data).

Or perhaps all three approaches (categories, or-thogonal trees and DAGs) are complementary. This

is still an open problem — we will probably imple-ment a single tree with categories initially and thenrevisit the issue in a later development iteration.

2.3.4 Separation of Concerns

There is a clear parallel between the Vocabulary ser-vice and a parser. Both use grammars to recogniseand validate data. A parser also returns values. Thisraises the questions ‘how much validation should theVocabulary service do?’ and ‘to what extent shouldthe Vocabulary service be responsible for extractingvalues from data?’

To answer these questions we need to consider theVocabulary service within the context of the systemarchitecture. Our architectural definition is:

A Vocabulary is a translation between the systemmetadata (as surfaced by the appropriate service)and an external syntax. The translation may be ineither or both directions.

It is easy to imagine examples (e.g. when the ex-ternal syntax is conversational English, and the vo-cabulary service is used to provide ‘helpful names’for system metadata) where the Vocabulary serviceis little more than a table that associates the appro-priate syntax with system metadata fields.

So a possible separation of concerns would be touse the recognition tree to select a vocabulary7 for agiven FITS file and then provide translation throughappropriate cross–references.

These ‘translations’ are a natural point at whichto store additional functionality that is specific to ametadata value: validation instructions; Java typeused for encapsulation, etc.

The Vocabulary service is then used in two stages.First, the relevant vocabulary is recognised. Second,the appropriate vocabulary is provided, with appro-priate translation information provided for each ele-ment in the data model8

This reduces the amount of detail needed in therecognition tree — only properties at branch nodesare required — and will reduce the complexity dis-cussed in the previous section.

2.3.5 Implementation

Figure 2.1 sketches a possible implementation of theideas described above. The recognition tree is on

7We group vocabularies that are related by a single recog-niser within a ‘vocabulary family’.

8One limitation of this analysis is that it only handles datawith a name/value structure; however, this is itself a broadclass.

18

Page 20: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

Vocabulary

Vocabulary Family

Recognition Tree

«interface»Node

Property NameValue Predicate

Leaf Node

System Ontology (Data

Model)

Data Model Class

Predicate Node Data Model Field Name

Validation Java Class

Translation

Translation Set Bean Path

Data Model Value

1

1..*

1..*

1

0..* 1

describes

1

1..*

1

1..*

selects

+children

1..* {ordered}

1

1..*

1

1

1..*

wraps

1..*

1

1

0..*

+superclass 0..*

1

1..*

names

1

1..*names

1

1

1..*

1

1

1..*

Figure 2.1: A possible implementation of the ideas discussed here. Details of the data model are intention-ally vague.

HIERARCH ESO DET WIN1 STRX = 3 / Lower left pixel in X

HIERARCH ESO INS FILT1 NAME = ‘OIII/3000’ / Filter name

HIERARCH ESO OBS NAME = ‘NGC 1275’ / Observation block name

Figure 2.2: Example ESO “hierarchical” headers.

19

Page 21: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

the left; the vocabulary in the middle; and the datamodel (part of the system metadata) on the right.

A Vocabulary is composed of ‘translation sets’that correspond to classes in the data model; abean path9 identifies the appropriate class and, to-gether with the field name for a particular transla-tion, would allow dynamic access to values.

This solution will probably evolve further. For ex-ample, it is probably incorrect to assume that thereis always a 1–to–1 connection between the metadataitems in the external syntax and those in the internalmodel.

2.3.6 Reality Check: Field Map

The ‘translations’ correspond to the mapping fromscience metadata to database (or object) fields thatis configured in the DB Loader (section 2.1.2).

Our experience with the DB Loader suggests thatmany translation sets will be shared between severalvocabularies.

2.4 Conclusions

2.4.1 Vocabularies

We started with very little understanding of how toapply the concept of vocabularies to FITS headerproperties. We now see that each distinct combi-nation of ‘interesting’ metadata names identifies aseparate vocabulary. Since FITS files vary widelywe have many vocabularies and need an efficient ap-proach to managing them. This work separated intotwo related steps.

First, we will use a recognition tree to iden-tify the vocabulary in question. In practice wemay split vocabularies into several orthogonal (sub-)vocabularies; we can use separate recognition treesfor each, or we can use a single tree whose leaf nodesare categorised by vocabulary.

Second, we structure vocabularies to match ourdata model. For each field in the data model,we store appropriate information in a ‘translation’.These are grouped in ‘translation sets’ that mirrorthe classes in the data model.

2.4.2 Loading and Remediation

In retrospect, the code we developed in an earlieriteration to remediate data and load the database,

9This discussion assumes that the data model is instan-tiated as JavaBeans[13]. A bean path is a dotted sequenceof names that select a path through linked objects (typicallytranslated to a sequence of calls to ‘getter’ methods).

already contained the structures we describe above.However, we had a poor understanding of the logicthat fixes the underlying relationships.

For example, we constructed (the equivalent in-formation to) vocabularies with a text–based con-figuration that ‘imported’ related files. The importmechanism was general and poorly constrained. Wenow understand that by restricting the configurationto reflect the structure in the data model (and mov-ing this information into the database) we can makethe relationship between vocabulary and data modelexplicit, avoid inconsistencies, and simplify the man-agement of these relationships.

2.4.3 Duplicate Data

An interesting additional use for the ideas identifiedabove is in the detection of ‘practically duplicate’data: we would like to be able to identify duplicatecopies of the ‘same’ data, even when some headervalues change. To do this reliably we must distin-guish between metadata which are meaningful, andthose which are ‘noise’ (e.g. time of entry into a datatransport system).

One solution to this would problem would be tomark certain translations as ‘meaningful’ and useonly the science metadata that they select. Fur-thermore, one way of defining ‘meaningful’ is to usemetadata names that appear in the recognition tree.

2.4.4 Theory

We are interested in how best to use ideas from‘theoretical’ computer science in practical program-ming10. While we see many interesting ideas we arealso acutely aware that we must avoid self–indulgenthand–waving that gives no practical results. Sincewe feel that this analysis was productive we aresomewhat encouraged to try a similar approach inthe future.

One reason for the success here, we believe, wasan iterative approach to development. Earlier codehelped identify problems and motivate our work. Itwas also reassuring to find concrete support for ournew ideas — albeit in a rather rough form — alreadyimplemented in our software.

10As is probably obvious, we are somewhat sensitive aboutour approach, feeling trapped between theoreticians, who willsee very little theory here, and those programmers who believeevery problem can be solved by using someone else’s toolkit.

20

Page 22: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

2.4.5 Problems

A further advantage of this analysis is that we canidentify possible problems by checking our assump-tions against reality. One obvious difficulty is thehandling of FITS properties which are ‘sequentialfragments’. This occurs with WCS[9] encodings,for example, which exceed the maximum length fora single header field11. The solution in this caseis probably a pre–processing step that generates asingle, concatenated value when parsing incomingmetadata (and a corresponding fragmentation stepwhen generating a header).

2.5 Addendum

This section was added after my ADASS talk in Oc-tober 2006. It addresses questions raised by the au-dience.

2.5.1 Non-Unique Fields

Rob Seaman pointed out that some files contain du-plicate fields; it seems that these are badly formedfiles from non-compliant software/bugs. This raisesan issue I did not address — verification.

The information stored in the vocabulary servicecan help do three things: identify headers; verifyheaders; extract information from headers. Only thefirst of these is covered in any detail here. Rob’s ex-ample shows that verification should occur beforeidentification (and implies that some remediationmay also be necessary, if these files are to be in-gested).

2.5.2 Hierarchical Headers

Dick Shaw raised the (non–standard) use byESO of “Hierarchical Keywords”. The ESOdocumentation[14] gives the example shown in fig-ure 2.2.

While it is true that a regular grammar cannothandle indefinite hierarchies, the scheme chosen byESO is sufficiently restricted that it would workwithin the framework described here. It is only nec-essary to treat the whole name (including spaces) asa single string.

11Based on punched cards

21

Page 23: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

22

Page 24: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

Chapter 3

A Tiny Workflow in Spring

Author. Andrew Cooke.

Abstract. I define a simple workflow withinSpring, give some examples, and discuss its use.While is works well for prototyping and structuringsimple flow–based processes, it doesn’t scale well ifthe control becomes complex.

3.1 Motivation

A workflow is a software tool that separates ‘control’,‘state’ and ‘process’. A workflow engine is often a‘composition language’ used to express the high levelfunctionality of a system.

There are many complex, powerful workflow en-gines available, but their associated cost makes themattractive only at the higher levels of a system. SinceI have just defined workflows in terms of ‘high levelfunctionality’ this might seem reasonable.

However, ‘high level’ is a relative judgement (andthe definition of ‘business logic’ is more flexible thanmany people pretend). I felt a lightweight workflowwould be a useful tool that could help clarify thestructure of many systems1.

3.2 Design

Figure 3.1 is a class diagram for the Tiny Workflowlibrary.

The design exploits the ‘nesting facade’pattern[18]: a workflow contains a Flow; manyimplementation of that interface can contain furtherembedded Flow instances.

The most abstract view of a workflow presents twoseparate ideas: decisions and processes. These sug-gest Test and Action interfaces. To reconcile this

1Ad-hoc ‘workflow–like’ classes emerge naturally inSpring–based systems. All I do here is provide a little morestructure.

with the pattern, I make Flow inherit from both andgive appropriate semantics: the Action in a Flow iscalled if the Test returns true2.

A useful practical class is Pair which binds a Test

with an Action. I can then develop separate de-cisions and process, combining them with Pair asneeded. Alternatively, instead of aggregation, I canuse inheritance: Null is a Flow that can be sub-classed as required (by default the test is Always

true and the Action does nothing).Implicit in the discussion so far is the idea of

State. This single interface is used both as the basisfor decisions, and as the data to be processed. This isnot an interface, but a generic parameter to almostall classes. In this way, the workflow is type–safe,but not intrusive. Test interface’s main method totake a State and returns a boolean; Action takesand returns State.

I can also define Test instances that are injectedwith other tests (nesting facades again). The librarycontains the obvious Or and And, as well as Always

and Never.The library also contains a set of composite Flow

instances. Sequence takes a list of Flow instancesand calls them in turn. First also takes a list, butstops running through the contents as soon as a Flowwhose Test returns true is found (and the corre-sponding Action invoked).

Finally, the workflow includes some support forconverting incoming data to the internal State rep-resentation: Normalizer and InputFilter.

3.3 Examples

3.3.1 Pseudo–Code

Figures 3.2 and 3.3 show the same ideas expresedin a Java–like syntax and as Tiny workflow compo-

2Assuming that the workflow has ‘reached’ the flow inquestion.

23

Page 25: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

State

Workflow

Named

«interface»

Action

Named

«interface»

Test

NamedBase

State

Or

NamedBase

State

Always

NamedBase

State

And

NamedBase

State

Never

NamedBase

State

Not

NamedBase

State

Debug

NamedBase

State

Null

NamedBase

State

Throw

NamedBase

State

FlowList

State

Null

State

Refuse

State

TryCatchFinally

State

Null

State

InputFilter

«interface»

Normalizer

«interface»

Flow

NamedBase

State

Pair

State

First

State

Sequence

Figure 3.1: The basic structure of the Tiny Workflow.

24

Page 26: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

if (testA(state)) {

try {

if (! testB(state)) {

state = processB(state)

} else if (testC(state)) {

state = processC1(state)

state = processC2(state)

state = processC3(state)

} else if (testD(state)) {

state = processD(state)

} else {

throw

}

} catch {

state = processE(state)

}

}

Figure 3.2: A fragment of pseudo–code; see figure 3.3for the equivalent workflow structure.

nents (for clarity some components have been omit-ted, typically where default values are used).

The advantages of structuring code like this in-clude:

• The ability to restructure the logic without re-compiling the software.

• Explicit state allows serialisation (see dicussionin [18]).

• Clear separation of decisions and process (oftenconfused in OO code).

• Use of a standard, simple interface allows com-position of generic components.

3.3.2 DB Loader

Figure 3.4 shows part of a fairly complex flow (fur-ther loaders have been omitted for clarity). TheFirst element will select the first sub–flow whosetest is successful. By ordering the loaders from mostto least specific the most complex that is consistentwith the data will be chosen. If none are chosean exception is thrown; this, or any other excep-tion thrown by a failing loader, is then caught and ageneric loader called.

Figure 3.5 is part of the test formosaicSkyFlatImageDataProductLoader. Threedifferent tests are combined with Or.

(This system used the XML–based representa-tion of state discussed in section 3.4.1, hence thetiny.xml namespace.)

Figure 3.3: A workflow structure that correspondsto the pseudo–code in figure 3.2.

25

Page 27: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

3.4 Experience

In general, my experience with the Tiny Workflowhas been positive. It was simple to develop, is easyto understand, and has proved to be very flexible.However, a number of issues restrict its use. It is nota replacement for a ‘real’ workflow, but is a usefuladditional tool.

3.4.1 XML Tests

An early iteration forced the state to conform toa specific interface, which provided an XML ‘view’of the contents. This had the advantage that testscould be written using XPath, and an XPath–basedTest could be used in all workflows. Often, no ad-ditional, state–specific tests were needed.

However, this approach had some drawbacks. Im-plementation details were exposed in the workflowconfiguration (this may have been due to poor de-sign of my XML). Maintaining consistency betweenthe state and the XML was expensive: either thestate was XML, in which case actions were ineffi-cient, or the state was Java objects, in which casetest were inefficient. Also, XPath is often not verycompact or readable; I found myself writing state–specific tests to provide a more friendly interface tothe user.

I eventually refactored the code so that the statewas generic; the XML–aware interfaces were then(trivially) re–implemented as more specific versionsof the generic base.

3.4.2 Verbosity

XML is not very compact. Spring configurationis not very compact, even for XML. Writing com-plex control flow logic in Spring XML soon becomesrather messy. To some extent this can be managedby standard programming practices (in particular,separating the workflow into a hierarchy of processesand placing each level in a different file), but eventhat soon becomes inadequate.

Tiny was used successfully in a variety of services,but when it was most useful — when the flexibilitywas really necessary — I switched to a different con-figuration mechanism once I understood the issuesinvolved[15]. This was not really a failure, since thelibrary had proved to be very useful, but did showthat the logic scales poorly.

3.4.3 Lack of Features

This workflow provides a way to structure relatedprocesses, but it does not have any other work-flow feature. In particular, it does not provide amechanism for persisting state. Sub–flows cannot bestopped/started in response to reliable asynchronousmessaging, or to provide continuity across systemfailures.

3.4.4 Persistent State

If this workflow were able to persist state itwould simplify the interface with asynchronousmessaging[18].

The next release of Java is rumoured to containsupport for continuations, but the Tiny is so simplethat this would be possible to add by hand; Stateand the call stack would need to be serialisable3.Extending State to track the call stack would beone approach.

In addition, some mechanism would be needed tointerface this functionality with the messaging user,but again, at first glance, this does not seem to be ahard problem.

3.4.5 Pluggability

Where Tiny really shines is in assembling simple,composite services. Combined with the approachdescribed in [18], the Flow interface makes buildingmodular, composite services, whose sub–services canbe configured independently, trivial. Sometimes it isnot even necessary/appropriate to expose the work-flow externally; the internal structure works well anda wrapper Java class can simplify configuration.

3The simple structure Tiny imposes on the code is whatmakes this so simple; there is no need to store any furtherinformation from the Java stack and heap.

26

Page 28: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

<bean name="loaderFlow"

class="edu.noao.nsa.util.workflow.tiny.xml.flow.TryCatchFinally">

<property name="action">

<bean class="edu.noao.nsa.util.workflow.tiny.xml.flow.First">

<property name="name" value="loadFlow.action"/>

<property name="flows"><list>

<ref bean="mosaicDomeFlatImageDataProductLoader"/>

<ref bean="mosaicSkyFlatImageDataProductLoader"/>

<ref bean="mosaicBiasImageDataProductLoader"/>

<ref bean="mosaicReducedObjectImageDataProductLoader"/>

<ref bean="mosaicRawObjectImageDataProductLoader"/>

...

<bean class="edu.noao.nsa.util.workflow.tiny.xml.flow.Pair">

<property name="action">

<bean class="edu.noao.nsa.util.workflow.tiny.xml.action.Throw">

<property name="message" value="No match with specific loaders"/>

</bean></property>

</bean>

</list></property>

</bean></property>

<property name="catch">

<ref bean="genericDataProductLoader"/>

</property>

</bean>

Figure 3.4: A flow for loading data; the sub–flows are ordered from most to least specific and only the firstthat succeeds is used.

<bean name="_obsTypeSkyFlat" class="edu.noao.nsa.util.workflow.tiny.xml.test.Or">

<property name="name" value="_obsTypeSkyFlat"/>

<property name="tests"><list>

<bean class="edu.noao.nsa.util.properties.tiny.PropertyTest">

<property name="property" value="OBSTYPE"/>

<property name="value" value="SFLAT"/>

</bean>

<bean class="edu.noao.nsa.util.properties.tiny.PropertyTest">

<property name="property" value="OBSTYPE"/>

<property name="value" value="SKYFLAT"/>

</bean>

<bean class="edu.noao.nsa.util.properties.tiny.PropertyTest">

<property name="property" value="OBSTYPE"/>

<property name="value" value="SKY FLAT"/>

</bean>

</list></property>

</bean>

Figure 3.5: A composite test for FITS header values in a remediation pipeline.

27

Page 29: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

28

Page 30: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

Chapter 4

A Simple, Lazy, Expression Evaluator

Author. Andrew Cooke.

Abstract. I explain the motivation for, design of,and experience using a simple domain-specific lan-guage in Java.

4.1 The Problem

4.1.1 The Remediation Service

The NOAO DMaSS (Data Management and ScienceSupport) platform includes a Remediation Service.Separate instances of this service can be individuallyconfigured to perform fairly complex ‘processing’ oftextual data. For example, one might be used toconstruct values for a general science data modelgiven data from instrument-specific FITS headers1

for different instruments.The Tiny Workflow[7] provides a suitable frame-

work for configuring remediation. Continuing withthe example above, it would allow different ‘flows’ tobe defined for each instrument, selecting the appro-priate flow depending on the science data received.

Data within the remediation workflow are repre-sented as name/value pairs. A simple library (sim-ilar to Java’s Properties class) supports this ab-straction, adding immutable values (both per-value,to protect system data, and per-collection, to allowmore efficient processing).

Given this framework, the work of implementingthe remediation service was reduced to writing var-ious ‘actions’ that, when appropriately configured,modified the workflow data.

4.1.2 Remediation Actions

Initial analysis suggested that a sufficiently flexiblesystem could be constructed from a set of very simple

1FITS headers contain astronomy image metadata in asimple name/value format.

actions:

• Require that a given set of properties (names)is present.

• Rename a set of properties.

• Provide default property values.

• Rewrite property values using regular expres-sions.

These different actions could be composed withinthe workflow (which included the ability to test forvalues by evaluating XPath expressions against anXML representation of the properties) to constructprogressively more complex procedures.

Although these simple actions were easy to imple-ment and configure individually, two problems soonbecame clear.

First, the Spring workflow configuration explodedinto a mass of unsightly XML. I consider myselfXML–tolerant, but felt that the verbosity involvedin configuring non-trivial processes was excessive.

Second, I had forgotten the need to concatenatevalues. While this could be fixed this with anotheraction, I was concerned that the set of actions wasgoing to continue to expand, making configurationeven more complex and opaque.

4.1.3 Configuration via Programs

One solution to the problems described above is toprovide a single, more powerful ‘general action’ forprocessing data — ie. provide access to a program-ming language. The configuration for the actionthen becomes a small program in its own right.

Various languages are available to provide ‘script-ing’ in Java (e.g: Groovy; Javascript in the nextmajor JDK release). However, these might causefurther problems.

29

Page 31: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

Introducing a general language provides a lot of(unnecessary) functionality at one point in a lay-ered architecture. There is a temptation to useit to solve unrelated problems. For easy mainte-nance/operation I would like to restrict the func-tionality available. For example, high–level controlshould remain in the workflow rather than being im-plemented in the same script used to manipulate val-ues2

There is also a cost involved in using externalpackages. This is typically dismissed as unimportantcompared to the ‘obvious expense’ of implementinga language from scratch. However, I felt otherwise.

4.2 The Solution

4.2.1 Domain Specific Languages

In recent years there has been an explosion of in-terest in Embedded / Little / Domain Specific Lan-guages (DSLs), particularly in the functional lan-guage programming community. This ‘movement’has recognised that:

• Implementing a simple language is not difficultor expensive.

• Languages targeted at particular problemsmake good user interfaces.

• Careful language design can make the interfaceeasy to control (it can be as restrictive or ex-tensible as needed).

Although much of the literature emphasises the(undeniable) advantages of functional programminglanguages in supporting this paradigm, I felt that asimilar approach should also be possible in Java.

4.2.2 Design Choices

A typical DSL builds on existing language support(e.g. combinator libraries, macros) or uses a dedi-cated recursive descent parser. Unfortunately, nei-ther of these approaches would work in this case:Java does not have simple runtime compilation orthe kind of features (e.g. high order functions) neces-sary for embedding; a typical hand-written recursivedescent parser would be tediously verbose.

I considered using a parser library (GI[16] is verynice) but instead decided that it would be simpler

2I later realised that the workflow is closely related to infor-mation in the Vocabulary Service[15]; this separation helpedus exploit that commonality.

to use a prefix syntax (like Lisp / Scheme) whoseparsing is trivial.

Initial requirements involved only function evalu-ation. Ignoring function and variable definition re-moved the need for scoping or mutable data. Nordid I need support for lists (at first this seems irrel-evant, given the syntax, but it simplifies parsing byeliminating quotation).

However, some control flow was necessary. Learn-ing from the behaviour of ‘?:’ (taken from C lan-guage syntax) in some IRAF[17] tasks, which evalu-ates both sides of the branch, it seemed simplest toavoid such problems by using lazy evaluation (termsare calculated only as they are required).

I also chose untyped semantics (a.k.a. ‘dynamic’or ‘tagged’ types), assumed arbitrary dynamic casts,and did not consider operator overloading.

Examples of the resulting syntax can be seen insection 4.2.6.

4.2.3 Parsing

The design described above is so simple it hardlywarrants being called a ‘language’ - it is almost iden-tical to the calculator example (terms and expres-sions) given in programming textbooks.

Program text is lexed using a simple state ma-chine (a single Java class with a case statement overan Enum state). Single character look–ahead is suf-ficient; I used nio.CharBuffer’s ‘mark’ mechanismto provide this. There is a token for ‘open’, ‘close’,‘eof’ and each atomic type (including ‘name’ for un-quoted text).

The stream of tokens is assembled into an AST us-ing a simple recursive descent parser that consumestokens and returns AST nodes. This was imple-mented as a single class (the productions are pri-vate methods). Since the JVM does not have tailcall optimisation I needed to worry about stack use;it is proportional to the maximum nesting depth inan expression which, for the intended use, is not aproblem.

4.2.4 Evaluation

Evaluation is equivalent to a traversal of the AST,selecting (evaluating) only those nodes required bythe semantics. This is not explicit in the im-plementation (there is no tree walker, for exam-ple). Instead, sub nodes are evaluated via theNode.evaluate(Namespace) method. Each nodeimplementation is responsible for deciding whichchild nodes to evaluate in turn.

30

Page 32: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

Types are identified (‘tagged’) by the subclasses ofthe AST Node class. The common superclass guar-antees that future extension to include first classfunctions and lists will be painless (both already ex-ist as Node subclasses, but the current operationalsemantics do not expose their dynamic creation tothe user).

Functions and variables are provided via aNamespace interface. Typically, variables are readfrom the Properties interface described earlier,while functions include a standard collection thatprovides basic functionality. There is also an adapterfor eager functions (the user programs to an interfacethat is provided with pre-evaluated arguments) sothat simple functions for specific applications can beadded without understanding how lazy argumentsare implemented.

4.2.5 Generics, Reflection

I was pleasantly surprised at the level of integrationpossible between the language being implementedand the Java host. Figure 4.1 shows almost all thecode necessary to implement literal values, includingthe base classes, support for dynamic casts and typeerrors, and the implementation class for string liter-als. Generics and reflection help us write code thatcan be used for all (literal) types while also help-ing leverage the support for these types in the hostlanguage.

The dynamic cast method (as) is a good exam-ple3. It is used in the example in figure 4.2.

For those not familiar with Java, the importantaspects of figure 4.1 are that (1) most of the logicfor literal types (everything except type conver-sions) is in a single type–safe generic class, Literal;(2) the code for dynamic type conversions is stati-cally checked and type–safe; (3) the runtime checkfor type errors (in the ‘new’ language, not Java)is made in a generic class (a static member ofTypeException) using reflection. The ease withwhich a ‘dynamic’ language can be implemented in‘static’ Java is surprising4.

3Although this implementation makes the addition of later‘third party’ types difficult, since casts are encoded in eachsource class; a ‘from’ approach would be better, but withan uglier syntax, if this kind of extension were an importantrequirement.

4I believe similar code would be possible in C#

4.2.6 Examples

The ‘If ’ Function

Figure 4.2 shows the implementation of a functionthat is equivalent to an ‘if statement’ in eager lan-guages. For example

(if false (/ 1 0) (+ 1 2))

will return 3 — there will be no division by zeroerror because the if function does not evaluate thesecond argument if the first is false (true and false

are parsed as literal boolean values).The first argument is evaluated, cast to a boolean,

and used to select the first of second argument.The evaluate method would be called from the s–expression, which itself implements Node.evaluate

by evaluating the list head (typically a Name whoseNode.evaluate retrieves the function from theNamespace), casting to a LazyFunction, and pass-ing the list tail as arguments.

Basic Actions

Section 4.1.2 listed basic remediation actions. Here Ishow how the function evaluator can reproduce thatfunctionality.

Exceptions in the evaluator are implemented asJava exceptions in the implementation. The evalua-tor is structured so that the evaluator automaticallyinherits the correct semantics.

Required name

(require name1 name2 ...)

The require function throws an exception if there isno binding for any of its arguments. If all argumentsare bound it returns () (the empty list).

Rename values

new-name = old-name

The simple evaluator does not bind values, so thesyntax above, and the associated responsibility forassigning a new value, belong to the calling package.

Default values

name = (default name "value")

The default function evaluates and returns the firstargument. If an exception is thrown during evalu-ation and the function has a further argument, the

31

Page 33: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

public abstract class BaseNode implements Node {

public abstract Node evaluate(Namespace namespace) throws Exception;

public <Cast extends Node> Cast as(Class<Cast> clazz) throws TypeException {

if (clazz.equals(StringLiteral.class)) {

return TypeException.assertType(new StringLiteral(toString()), clazz);

} else {

return TypeException.assertType(this, clazz);

}

}

}

public abstract class Literal<Internal> extends BaseNode {

private Internal value;

public Literal(Internal value) {this.value = value;}

public Node evaluate(Namespace namespace) throws Exception {return this;}

public Internal getValue() {return value;}

public String toString() {return getValue().toString();}

}

public class StringLiteral extends Literal<String> {

public StringLiteral(String value) {super(value);}

public <Cast extends Node> Cast as(Class<Cast> clazz) throws TypeException {

if (clazz.equals(BooleanLiteral.class)) {

return TypeException.assertType(new BooleanLiteral(getValue()), clazz);

} else if ( ... ) {

...

} else {

return super.as(clazz);

}

}

}

public class TypeException extends EvaluationException {

public TypeException(String msg) {super(msg);}

public static <Type extends Node> Type assertType(Node node, Class<Type> clazz)

throws TypeException {

if (! clazz.isAssignableFrom(node.getClass())) {

throw new TypeException("Expected " + clazz.getSimpleName()

+ ", but got " + node.getClass().getSimpleName());

}

return clazz.cast(node);

}

}

Figure 4.1: Generic annotations and reflection simplify integration between language and Java host.

32

Page 34: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

public class If extends LazyFunction {

public Node evaluate(NodeList arguments, Namespace namespace)

throws Exception {

Iterator<Node> args = arguments.iterator();

if (! args.next().evaluate(namespace).as(BooleanLiteral.class).getValue()) {

args.next();

}

return args.next().evaluate(namespace);

}

}

Figure 4.2: Implementing ‘if’ as a lazy function.

exception is discarded and the next argument eval-uated.

Note that only the expression enclosed by () eval-uated by the library described here. The final assign-ment to a value is the responsibility of the caller.

Regular expressions

(regexp firstword "^\\s*(\\w+).*" "$1")

(regexp lastword ".*?(\\w+)\\s*$" "$1")

(regexp firstletters "(\w)\w*\s*" "$1,")

(regexp dropcommas "(.*),$" "$1")

firstname = (firstword DTPI)

lastname = (lastword DTPI)

initials = (dropcomma (firstletters DTPI))

Regular expressions are not supported by the eval-uator, but they show how easy it is for the caller toadd extensions. Here regexp is a function, providedby the caller, that defines a Perl 5 regular expressionin the caller’s state as a side-effect. The caller thensupplies the compiled expression via Namespace forthe next call to the evaluator.

As before, assignment is the responsibility of thecaller. The caller is also free to schedule evaluationso that, for example, the regular expressions are de-fined only once, when the system starts.

4.3 Conclusions

4.3.1 Experience

The initial implementation of this ‘language’ (lexer;parser; interpreter; unit tests) took about 20 hours(two 10 hour days). As a result, the configurationfor remediation was simplified significantly, using atool whose power is limited, as required, but whichcould be easily extended in the future.

A careful choice of syntax and semantics allowedus to exploit techniques developed (or rediscovered)

with significantly more powerful tools. Implementa-tion time was short, allowing easy integration intoour iterative (agile) development process. I be-lieve that many of the approaches emerging from‘academic’ functional programming (an emphasis ondeclarative approaches; little languages; a realisationthat powerful modern languages make some heavy-weight libraries obsolete) fit well with the agile ap-proach.

Embedding a ‘dynamic’ language in ‘static’ Javawas surprisingly easy (section 4.2.5). Similar ideascan be seen in the elegant generic form handlingwithin the Spring MVC framework. In my opin-ion the power of generics and reflection are too oftenoverlooked by Java programmers.

The main drawback to this approach is raw execu-tion speed. A third party language that compiles tobytecode would be many orders of magnitude faster.For this application — evaluating simple expressionsduring remediation — this is not an issue; if it be-comes important later then this solution is, of course,carefully encapsulated within a small number of in-terfaces, and easily replaced.

4.3.2 Optimisation

As an experiment, I modified the system to cacheknown results if they were constant. Since I make noassumptions about purity (in particular, functionscould be implemented to have side effects and vari-ables mutable) this was not completely trivial; how-ever, it was simpler than I expected (one evening’swork).

The modification consisted of: adding Thunk withthe same interface as Node, but the ability to cachea value; extending NodeList to wrap Node instancesin a Thunk; changing the return type of evaluateto return both a Node and an indication whether itcould be cached (a signal to the surrounding Thunk);

33

Page 35: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

extending functions to include the necessary logic forpropagating ‘constness’.

In addition, to complete the work, I would haveneeded to make a distinction between pure and im-pure functions, and fixed and mutable variables (notdifficult; this distinction is already present in theProperties library, but the code there would need tobe made more generic, to handle functions as well asstrings).

This work was discarded, however, because itseemed (i) too complex for any expected gain (inparticular, intermediate objects were generated anddiscarded for each evaluation) and (ii) I could not seehow to make the changes either optional or cost–freewhen used in an impure context.

4.3.3 Philosophy

Finally, I also feel that there is a hidden value in worklike this: it gives real pleasure in what is otherwise anincreasingly mechanical and uninvolving profession5.Happy programmers are better programmers.

5Populated by technicians who use ‘never re–invent thewheel’ to justify over–complex solutions with fragile depen-dencies on rapidly evolving third–party systems; the conse-quent lack of knowledge about basic programming techniquesmakes their warning a self–fulfilling prophecy.

34

Page 36: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

Appendix A

CTIO / DPPS / Archive Is...

A.1 Andrew Cooke

A.1.1 Contact

Email [email protected]@noao.edu

Phone +56 51 205 286

A.1.2 Interests

Functional Languages. Language research, par-ticularly on declarative / functional languages, is afertile source of ideas, many of which help illumi-nate both SOA and agile processes. For example,the importance of state and type systems influencedalmost every paper in this collection.

Mid–Level / Lightweight Design. Somewherebetween ‘architecture’ and ‘algorithms’ there’s a re-gion where important decisions on implementationmust be made. The paper on SOA focuses on thisarea.

I’m interested in how decisions can be made in aprogressive way, so that we can iteratively developsystems without heading down expensive one–waystreets. By proceeding stepwise, we can understandthe problems and the relevant solutions. Once un-derstood, the right approach is often simple.

Remote Agility. I would love to understand howagile development can be made to work across con-tinents.

A.2 Alvaro Egana

A.2.1 Contact

Email [email protected]@noao.edu

Phone +56 51 205 325

A.2.2 Interests

Distributed Systems. Many arguments can bemade in favour of distributed systems, but I thinkthis phrase summarises most of them: don’t put allyour eggs in one basket. This was the motivation to,first, consider messaging–based designs and, later, tomove our attention to SOA.

35

Page 37: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

36

Page 38: CTIO / DPPS / Archive — 2006 · 2019-12-01 · CTIO / DPPS / Archive — 2006 Andrew Cooke1 Alvaro Egan˜a2 September 2006 1andrew@acooke.org 2alvaro@egana.net. Abstract This is

Bibliography

[1] Spring Application Frameworkhttp://www.springframework.org/

[2] JSR 220: Enterprise JavaBeans 3.0http://jcp.org/en/jsr/detail?id=220

[3] The Scala Programming Languagehttp://scala.epfl.ch/

[4] Mulehttp://mule.codehaus.org/

[5] ActiveMQhttp://www.activemq.org/site/home.html

[6] Apache Maven Projecthttp://maven.apache.org/

[7] A. Cooke 2006; A Tiny Workflow in Spring(chapter 3)http://www.acooke.org/andrew/papers

[8] JSR 208: Java Business Integrationhttp://jcp.org/en/jsr/detail?id=208

[9] FITS Documentationhttp://fits.gsfc.nasa.gov/fits documentation.html

[10] S Lowry 2006 (ADASS, in preparation)

[11] C, Chambers 1993; Predicate Classes, LectureNotes in Computer Sciencehttp://citeseer.ist.psu.edu/chambers93predicate.html

[12] D. Grune, C. Jacobs 1990; Parsing Techniques— A Practical Guidehttp://www.cs.vu.nl/ dick/PTAPG.html

[13] JavaBeanshttp://java.sun.com/products/javabeans/

[14] The ESO Data Interface Definitionhttp://archive.eso.org/dicb/

[15] A. Cooke, A. Egana, S. Lowry 2006; FITSFiles and Regular Grammars: A DMaSS De-sign Case Study (chapter 2)http://www.acooke.org/andrew/papers

[16] GI (Generic Interpreter):http://www.csupomona.edu/ carich/gi/

[17] IRAF (Image Reduction and Analysis Facil-ity):http://iraf.net

[18] A. Cooke, A. Egana 2006; Spring, Mule,Maven: Lightweight SOA with Java (chap-ter 1)http://www.acooke.org/andrew/papers

37