12 Vs of Big Data Governance - University of...
TRANSCRIPT
1
The 12 Vs of Big Data provide a Governance framework of questions that should
be of value in ensuring that our IT, Big Data and Analytics projects deliver
value to the various stakeholders.
One of the most important aspects of Governance frameworks is to provide
questions that allow each project team to consider how to be successful.
Typically, an assessment of risks and vulnerabilities will provide the context within
which to connect the Vs to the specific situation. From this, it will be possible to
develop answers that ensure enhanced success for the project.
Each of the following slides briefly explores the purpose of one of the Vs.
Whilst Volume is one of the definitional Vs, defining Big Data as that which
exceeds the limits of current technologies, it can also be used as a question to
help us understand the consequences for our projects in terms of the
technology stack that will be needed to ensure a satisfactory solution.
2
The velocity of data can pose significant challenges to the ability to provide
answers in the requisite timescales.
3
Variety refers to the wide range of formats and structures of the data that we
want to amalgamate and then process.
We now have structured and unstructured data, each of which poses significant
challenges.
It is vital to understand the variety of our data sources in order to ensure that we
do not end up with a fruit salad of mismatched data. If we get it wrong, there is
the potential to create significant reputational, security and financial vulnerabilities
for our organisations and users.
4
Variability can be a vital aspect of understanding the relevance and value of the
detected patterns in the data. It can be a challenge even where there is a highly
regular periodicity in the data.
As an example, weather tends to have a seasonal periodicity, such as winter and
summer, but can pose challenges because it is also highly variable from day to
day and even from year to year. This poses problems for retailers, where footfall
can vary according to sun or rain on a particular day.
It will also affect sports events’ attendance.
However, Variability can also refer to natural human variability in preferences and
likes. The fundamental question will be to assess the stability of personal
preferences (and even the honesty of answers to survey questionnaires). Do we
really understand the stability of the way that humans answer a personality profile
questionnaire over time? The relevant timescales could be days, weeks or
months.
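The stability question above can be made concrete: test-retest reliability of questionnaire answers is often summarised with a correlation between two sittings of the same item. The sketch below is illustrative only; the respondents and scores are invented, not real survey data:

```python
# Hypothetical illustration: test-retest stability of survey answers.
# The scores below are invented for the example; a real study would use
# matched respondents answering the same item on two occasions.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# The same five (hypothetical) respondents answering a 1-7 preference
# item in week 1 and again in week 4.
week1 = [5, 3, 6, 2, 4]
week4 = [4, 3, 6, 3, 5]

stability = pearson(week1, week4)
print(round(stability, 2))  # → 0.85
```

A value near 1 would suggest stable preferences over that interval; values drifting towards 0 would suggest the answers are not repeatable on the chosen timescale.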
5
Value is possibly one of the most difficult questions to answer.
What is the value of the project or application to the wide range of stakeholders in
the project?
How does each of the stakeholders obtain value?
Is it always monetary, or are there other factors, such as intellectual curiosity,
time saving, entertainment, customer retention, etc.? The answers are many.
6
Are the data true? J Easton (IBM 2012) points out that probably 80% of all data
are of uncertain veracity.
Examples could include IoT sensor calibration drift and Location Services
inaccuracy: ThinkNear data suggest that approximately 14% of all location data
used for location-based advertising are inaccurate by more than 60 miles (100 km).
Social Media provide another source of probably low-veracity data that needs
very careful analysis before any level of trust can be obtained.
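One way such location inaccuracy can be detected is to compare a reported position against a reference position derived from another signal and flag large discrepancies. The sketch below is a minimal, hedged illustration: the 100 km threshold, the coordinates and the reference-signal idea are assumptions for the example, not a description of ThinkNear's actual method:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Illustrative threshold: flag reported positions more than 100 km from
# the location implied by another signal (e.g. an IP-derived city centre).
THRESHOLD_KM = 100.0

def is_plausible(reported, reference):
    lat1, lon1 = reported
    lat2, lon2 = reference
    return haversine_km(lat1, lon1, lat2, lon2) <= THRESHOLD_KM

# A London report checked against a Manchester reference (~262 km apart)
# fails the plausibility check.
print(is_plausible((51.5074, -0.1278), (53.4808, -2.2426)))  # → False
```

Flagged records would then be quarantined or down-weighted rather than fed straight into location-based targeting.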
7
Linked to Veracity is the question of the Validity of the data and the analysis.
A classic example of the lack of validity is the centre picture, where the
correlation is totally spurious and completely invalid.
8
Volatility poses the question of the timescales over which data and the analyses
retain validity.
Weather forecasts provide an example in that the forecasts for a specific place
can change over relatively short timescales.
9
Verbosity particularly relates to questions about text sources and the problems of
machine understanding of the meaning of the text.
Natural language is often verbose in that it is not truly concise and is often highly
predictable. Does this help or hinder the analysis of text?
At the other end of the Verbosity problem are Tweets, with their 140-character
limit and the consequences for textual analysis when all the normal rules of
syntax and grammar are ignored.
Can machine learning really understand irony or humour?
10
Vulnerability poses a wide range of questions about the system, the data and the
stakeholders.
Are the vulnerabilities technical, legal, financial, reputational…?
11
Asking about Verification helps us to understand how we can demonstrate the
Validity and Veracity of the insights gained, based on verification and validation
of our data cleansing and analytics.
It helps us to demonstrate that we understand and have addressed the
Vulnerabilities.
12
Visualisation asks us to consider the best (and most ethical) ways of
visualising our data.
Are we accidentally or deliberately misleading the users of our visualisations?
Are we presenting our analyses in a way that is fair and honest?
13
This was the original presentation, which was intended to demonstrate the
similarity of the levels of investment in UK infrastructure. It did this by using a
log scale. However, as the next slide demonstrates, this was a very severe
misrepresentation of the information.
The visualisation worked because most readers are not very familiar with log
scales and assume the more usual linear scale.
14
As this version shows, with a linear scale, only Energy and Transport have
significant levels of investment.
Investment in flood defences is insignificant compared to the others. This was in
the context of the severe flooding that had happened over the previous winter in
the UK.
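The log-versus-linear effect in these two slides can be sketched numerically. The investment figures below are invented for illustration and are not the real UK data; the point is only how a log axis compresses a 200-to-1 difference into bars of similar height:

```python
import math

# Hypothetical, illustrative investment figures (£bn) - not the real UK data.
investment_bn = {
    "Energy": 100.0,
    "Transport": 70.0,
    "Water": 10.0,
    "Flood defences": 0.5,
}

# On a linear axis, bar height is proportional to the value itself.
linear_ratio = investment_bn["Energy"] / investment_bn["Flood defences"]

# On a log axis starting at 0.1, bar height is proportional to
# log10(value / axis_minimum), which compresses large differences.
axis_min = 0.1

def log_height(value):
    return math.log10(value / axis_min)

log_ratio = log_height(investment_bn["Energy"]) / log_height(
    investment_bn["Flood defences"])

print(f"Linear axis: Energy bar is {linear_ratio:.0f}x taller")  # → 200x
print(f"Log axis:    Energy bar is {log_ratio:.1f}x taller")     # → 4.3x
```

A reader who assumes a linear scale will read the log-axis bars as "roughly similar", which is exactly the misrepresentation the transcript describes.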
15