An overview of my PhD research

Analyzing & visualizing spreadsheets Felienne Hermans (@felienne)

In this slide deck I present an overview of my PhD research. I recently defended my dissertation, titled ‘Analyzing and Visualizing Spreadsheets’.

This one!

Bridging the gap

Funny story: I wasn’t hired to research spreadsheets at all. When I started my PhD project, I was supposed to research the gap between business users and programmers.

Users

Programmers

To research this gap, I started by studying business in practice

What surprised me is that this gap wasn’t that big: it was more like a small creek than a huge cliff.

Some programmers were heavily involved in business and, even more interesting, some business people were doing serious programming.

Programmers

Users








In Excel!

So I looked into some previous work on the impact of spreadsheets on business.

95% of all U.S. firms use spreadsheets for financial reporting

90% of all analysts in industry perform calculations in spreadsheets

50% of spreadsheets form the basis for decisions

Importance can grow over time

When studying the impact of spreadsheets, we found that they do not become important overnight. As processes change, spreadsheets can become key company assets over time.

Nobody sets out to create a mission-critical spreadsheet; they “just happen”.

This is a simple spreadsheet for many users

Furthermore, spreadsheets can become surprisingly complex.

And spreadsheets exist ‘under the radar’

Another interesting property of spreadsheets is that they often live ‘under the radar’: there is no list of spreadsheets, no one keeps track of which sheets are needed for which report, and some spreadsheets do not have a clear owner.

Only 33% of spreadsheets have a manual

Finally, spreadsheets lack documentation. In only one third of the spreadsheets did we find ‘documentation’ (i.e. some sort of explanation of how to use the spreadsheet). Technical documentation, explaining why a spreadsheet was designed as it is, was hardly ever found.

Complex spreadsheets without documentation can lead to serious errors

You can imagine that the combination of all the above facts:

• Spreadsheets are important

• They are complex

• They lack documentation

is a potential recipe for disaster.

And indeed, those errors happen

The European Spreadsheet Risk Interest Group (Eusprig.org) collects horror stories

Estimated loss: 10 billion dollars a year

We interviewed spreadsheet professionals

Once I had studied related spreadsheet work and the horror stories from Eusprig, I wanted to gain a deeper understanding of spreadsheet problems in practice. So I interviewed 27 spreadsheet professionals at the Dutch Robeco bank.

I asked only two questions (a semi-structured interview) to obtain an overall view of spreadsheet problems:

What annoys you?

And what makes you happy?

From the interviews, we learned the following facts:

Financial professionals spend 2 days a week working with Excel

Spreadsheets can have a long life, 5 years on average

The average sheet is used by 12 different people

There is a gap! Between importance and treatment.

Then I concluded that there is an interesting gap that needs bridging: the gap between how important spreadsheets are and how well they are treated.

So how could this gap be bridged?

It looks like software in the 70s!

Let’s summarize the problems around spreadsheets again:

• They lack documentation

• They contain errors

• They stay alive for several years and are used by several people

• They are complex

Does this remind you of something?

It reminded me of the problems in the early days of software.

Hence, we tried to bridge this gap with methods from software engineering.

Spreadsheet users lack great tool support

If you compare the tooling of spreadsheet developers with that of software developers, the difference is clear.

Modern IDEs (like Visual Studio) have all kinds of built-in tools to help you build software in a responsible way: debugging, testing, analyzing and visualizing are accessible at the click of a button.

Compare this to a spreadsheet environment like Excel. There is lots of support for creating a spreadsheet, with fonts and colors and borders, but none of the helpful tools for building a maintainable spreadsheet.

We did not start coding immediately

However tempting, we did not start to build a spreadsheet IDE immediately. Instead, we looked at the results of the interviews to find the most pressing information need that spreadsheet users had.

Most important problem: support for understanding spreadsheets was missing

To address this information need specifically, we developed our tool Breviz.

This tool visualizes the dependencies among worksheets, depicted as rectangles with arrows drawn between them. The thicker the arrow, the more connections there are.

Example: worksheet ‘POA Project’ contains formulas that refer to cells in ‘ProjectTeam’.
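To make the idea of worksheet dependencies concrete, here is a minimal Python sketch that counts how often formulas on one worksheet refer to cells on another. It is not how Breviz is implemented: the regular expression, the function name `worksheet_dependencies` and the example workbook are all assumptions made for illustration, and the pattern only covers simple references such as `ProjectTeam!B2` or `'POA Project'!B2`.

```python
import re
from collections import Counter

# Simplified pattern (an assumption): an optionally quoted sheet name,
# an exclamation mark, and a cell address such as B2 or $B$2.
SHEET_REF = re.compile(
    r"(?:'([^']+)'|([A-Za-z_][A-Za-z0-9_.]*))!\$?[A-Z]{1,3}\$?[0-9]+"
)


def worksheet_dependencies(formulas):
    """Count references between worksheets.

    `formulas` maps a worksheet name to a list of formula strings.
    Returns a Counter keyed by (sheet_with_formula, referenced_sheet).
    """
    edges = Counter()
    for sheet, sheet_formulas in formulas.items():
        for formula in sheet_formulas:
            for quoted, bare in SHEET_REF.findall(formula):
                target = quoted or bare
                if target != sheet:  # ignore references within the same sheet
                    edges[(sheet, target)] += 1
    return edges


# Hypothetical workbook contents, for illustration only.
example = {
    "POA Project": ["=ProjectTeam!B2*1.2", "=SUM(ProjectTeam!C2:C9)", "=A1+A2"],
    "ProjectTeam": ["=B1+B2"],
}

edges = worksheet_dependencies(example)
for (source, target), count in edges.items():
    print(f"{source} refers to {target}: {count} reference(s)")
```

The count per pair of worksheets is what a diagram could map to arrow thickness.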

We went back to practice

With our tool, we went back to practice, to see whether it really supported spreadsheet users.

It turned out that it did. Some of the responses of users:

“This diagram reminds me of what I had in mind when building”


This remark is interesting: apparently, this spreadsheet user did do some modeling before building a spreadsheet.

A clear sign that we were on the right track!

“This makes my job 10 times easier”

This work was published at ICSE 2011

http://dl.acm.org/authorize?414064

However, unexpected things also happened. Not all spreadsheets looked as well structured as this one.

Let’s look at some of them:

Here, pink blocks represent worksheets outside of the spreadsheet. So this spreadsheet gathers information from over 20 other worksheets and combines this information.

Users diagnosed with the diagrams

We found that, due to the diversity of the diagrams, users started to judge spreadsheets based on their dataflow diagrams.

We therefore formalized this feeling users had into ‘smells’ at the design level.

These spreadsheet smells turned out to be very similar to code smells as defined by Fowler.

Consider for instance the ‘feature envy’ smell. This occurs when a method from class B refers to many fields of another class A rather than to those of its own class. This method envies all the cool fields that A has, hence the name.

It is easy to see how this smell could be defined on spreadsheets, where a formula in worksheet B could be overly interested in cells on worksheet A.
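As a rough illustration of feature envy on spreadsheets, the Python sketch below flags a formula that refers to another worksheet more often than a chosen threshold. The function name `feature_envy`, the simplified reference pattern and the threshold of 3 are assumptions for this example, not the definitions used in Breviz.

```python
import re

# Simplified pattern (an assumption): optionally quoted sheet name,
# an exclamation mark, and a cell address.
SHEET_REF = re.compile(
    r"(?:'([^']+)'|([A-Za-z_][A-Za-z0-9_.]*))!\$?[A-Z]{1,3}\$?[0-9]+"
)


def feature_envy(formula, own_sheet, threshold=3):
    """Return the worksheets that `formula` references at least `threshold` times.

    References to the formula's own worksheet are ignored.
    """
    counts = {}
    for quoted, bare in SHEET_REF.findall(formula):
        target = quoted or bare
        if target != own_sheet:
            counts[target] = counts.get(target, 0) + 1
    return {sheet: n for sheet, n in counts.items() if n >= threshold}


# Hypothetical formula on worksheet B that mostly uses cells from worksheet A.
envious = feature_envy("=A!B1+A!B2+A!B3+C1", own_sheet="B")
print(envious)
```

A formula with only one reference to another worksheet yields an empty result, so it would not be flagged under this threshold.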

We added support in Breviz for detecting and visualizing these inter-worksheet code smells.


Next, of course, we went back to practice, to see how users felt about the detected smells.

“That should be improved”

Results showed that users understood why certain constructions were qualified as smelly.

“This must be confusing for others”

Published at ICSE 2012

http://dl.acm.org/citation.cfm?id=2337275

However, new problems were to be discovered. We found that, once the structure of the spreadsheets had been understood and validated, complex formulas still got in the way of understanding spreadsheets.

This led us to the idea of formula smells

Again, we took our inspiration from the smells that Fowler defines in his canonical book on refactoring.

Published at ICSM 2012

http://www.felienne.com/?p=394

In a recent extension of the paper, we also suggest refactorings corresponding to smells.

This formula, for instance, contains the same subformula twice. Extracting this subformula into a separate cell will improve readability.
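To show what spotting a repeated subformula could look like, here is a small Python sketch. It only looks at function calls with balanced parentheses and ignores string literals, so it is an illustrative assumption rather than the detection used in the paper. The example formula and the function names are hypothetical.

```python
from collections import Counter


def call_subformulas(formula):
    """Yield every function-call subexpression, such as SUM(A1:A5), in `formula`."""
    stack = []  # start index of the function name for each open parenthesis
    for i, ch in enumerate(formula):
        if ch == "(":
            start = i
            # Walk back over the function name that precedes the parenthesis.
            while start > 0 and (formula[start - 1].isalnum() or formula[start - 1] in "._"):
                start -= 1
            stack.append(start)
        elif ch == ")" and stack:
            yield formula[stack.pop():i + 1]


def duplicated_subformulas(formula):
    """Return the function-call subexpressions that occur more than once."""
    counts = Counter(call_subformulas(formula))
    return [sub for sub, n in counts.items() if n > 1]


# Hypothetical formula in which SUM(A1:A5) is computed twice.
example = "=IF(SUM(A1:A5)>100,SUM(A1:A5)*0.9,0)"
duplicates = duplicated_subformulas(example)
print(duplicates)
```

Each reported subformula is a candidate for extraction into its own cell, after which the original formula can refer to that cell instead.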


And again... A look in practice

We found that cloning (i.e. copy-pasting) in spreadsheets was a problem. If data is copy-pasted, updates will not be propagated to the copies, and that might lead to errors.

Based on existing work on clone detection in source code, we developed an algorithm to detect clones.
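The following Python sketch is a much simplified stand-in for such an algorithm, not the one from the paper. It assumes that values computed by formulas and plain constant values have already been read from the spreadsheet as equally sized rows, and it labels a row of constants as an exact clone when all values match and as a near-miss clone when most, but not all, values match. The function name, the range labels, the data and the `min_match` threshold are hypothetical.

```python
def find_clones(formula_rows, constant_rows, min_match=0.75):
    """Compare rows of formula results with rows of constants.

    Both arguments map a range label to a list of values.
    Returns a list of (formula_range, constant_range, kind) tuples,
    where kind is "exact" or "near-miss".
    """
    clones = []
    for f_name, f_values in formula_rows.items():
        for c_name, c_values in constant_rows.items():
            if not f_values or len(f_values) != len(c_values):
                continue
            same = sum(1 for a, b in zip(f_values, c_values) if a == b)
            ratio = same / len(f_values)
            if ratio == 1.0:
                clones.append((f_name, c_name, "exact"))
            elif ratio >= min_match:
                clones.append((f_name, c_name, "near-miss"))
    return clones


# Hypothetical data: one copied row was edited by hand after pasting.
formula_rows = {"Totals!B2:F2": [120, 80, 95, 60, 40]}
constant_rows = {
    "Report!B5:F5": [120, 80, 95, 60, 45],  # one value differs from the source
    "Report!B6:F6": [1, 2, 3, 4, 5],        # unrelated data
}

clones = find_clones(formula_rows, constant_rows)
print(clones)
```

Near-miss clones are the interesting ones to review, because a copy that differs in only a few values may no longer be in sync with its source.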

Clone visualization was added to our visualization, indicated with a dashed arrow. After all, when data is copy-pasted between worksheets, there is a dependency between those worksheets (albeit a different one than a formula link).

To validate our algorithm, we performed a case study at the distribution centre of the South Dutch food bank. There, they process 100,000 kilos of food per month, and keep track of that with spreadsheets.

We were able to detect 61 near-miss clones, of which 25 were actual errors.

Because of our analysis, this distribution centre is now running error-free spreadsheets!

To be published at ICSE 2013

http://www.felienne.com/?p=2338

And this paper concluded my PhD thesis.

I will continue to work on spreadsheet analysis for at least five more years at Delft University of Technology, so in the remaining few slides, I’ll outline what I will be working on in the future.

Remember that spreadsheets stay in business for 5 years and are used by 12 people during their life span?

This makes it interesting to consider ‘spreadsheet evolution’ and study how spreadsheets are created.

Visual Basic Analysis

In our current visualization and analysis technique, we only consider formulas.

However, spreadsheets also allow for code to interact with data and formulas (VBA code in Excel).

By analyzing this, we could make our analysis more complete and interesting.

Spreadsheet testing

Finally, we want to research how spreadsheet users test. One might think that spreadsheet users do not test, but this is not true.

In our previous studies, we often saw formulas like this one. Here, nothing is really calculated.

Instead, some sort of validation is performed: if ‘find zone’!W3 is smaller than 0, we are not interested in the value.

If we could extract these types of formulas, we could use them to test the spreadsheet.
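One possible way to recognize such validation formulas is sketched below in Python. The heuristic (an IF whose condition is a single comparison), the function name `extract_check` and the example formula are assumptions based on the description above, not a reconstruction of the formula on the slide.

```python
import re

# Heuristic (an assumption): the formula is an IF whose first argument
# is a single comparison such as  <reference> < 0 .
CHECK = re.compile(
    r"^=\s*IF\s*\(\s*([^,;]+?)\s*(<=|>=|<>|<|>|=)\s*([^,;]+?)\s*[,;]",
    re.IGNORECASE,
)


def extract_check(formula):
    """Return (left, operator, right) for an IF-based check, or None."""
    match = CHECK.match(formula)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)


# Hypothetical formula: the value is only of interest when it is not negative.
check = extract_check("=IF('find zone'!W3<0,0,'find zone'!W3)")
print(check)
print(extract_check("=SUM(A1:A5)"))
```

The extracted condition could then serve as an assertion about the referenced cell when the spreadsheet is tested.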

Analyzing and visualizing spreadsheets Felienne Hermans

Thanks for reading about the research adventure I have enjoyed over the past 4 years!

If you want to know more, have a look at my blog: www.felienne.com

If you are interested in collaborating, please send me an email ([email protected]) or a tweet (@felienne).