making of the data scientist profession


Upload: lalwast

Post on 12-Jan-2016


MSc dissertation in Digital Anthropology @UCL, written by Lukasz Alwast in 2014, in partial fulfilment for the degree certification.


Page 1: Making of the Data Scientist Profession

Making of the data scientist profession - DJDQ8

MSc in Digital Anthropology

Making of the data scientist profession:

provisional selves, career transitions and the boundaries between techne and episteme in data analysis roles

Łukasz Alwast

Dissertation submitted in partial fulfilment of the requirements for the degree of MSc in Digital Anthropology (UCL) of the University of London in 2014

Word Count: 14 254

UNIVERSITY COLLEGE LONDON

DEPARTMENT OF ANTHROPOLOGY

Note: This dissertation is an unrevised examination copy for consultation only and it should not be quoted or cited without the permission of the Chairman of the Board of Examiners for the MSc in Digital Anthropology (UCL)


Abstract

Over the past XX years, the term data science has swiftly moved into the vernacular of scientific and technological vocabulary. As this happened, it signified a larger phenomenon that is taking place in the sciences and society at large, namely – digitization and ‘datafication’ of many of the aspects of the world that had not been quantified and digitized before. This trend seems to have its own, new acolytes – data scientists. Heralded by the media as the ‘high-priests of algorithms’ and ‘the sexiest job of the XXI century’, the phenomenon unravels a more deeply grounded conversation about the establishment of a new profession in the public milieu, the making of science and scientists, and the evolving nature of handling and understanding data. Drawing on contributions from science and technology studies (STS), organizational studies, anthropology, and Internet studies, this work frames the research around the self-identity of a professional and group perception of the authenticity and competence of interdisciplinary, XXI century quantitative analysts.

#datascience #makingofscience #scientists #newprofessions #selfidentity #provisionalselves  #BigData #machinelearning


Table of contents

1. Introduction
2. Methodology
3. Research questions
4. Limitations of the study
5. Framing the research – literature review
- Locating and grounding the term ‘data’
- A historical trajectory of the statistics discipline and the ‘data-analyst’ role
- Big Data – the next frontier
- Provisional selves
- Communities of practice
6. Unpacking the key themes – research and analysis
- Computerization and digitization of sciences
- The data scientist training
- Transitioning from academia to industry
- Interdisciplinary work practice
- Tools of practice
- Evolving nature of the data analyst role
7. Discussion
8. Closing words
9. Bibliography


Acknowledgements

I would like to thank my supervisor Stefana Broadbent for her guidance, patience and confidence in setting this piece of research on track. I always found her enthusiasm contagious, which made this creative endeavour much more invigorating, intriguing, and within my reach.

I am also grateful to Haidy Geismar, my course co-convenor and personal tutor, whom I could always count on for thoughtful advice and a critical eye.

I also appreciate the help of Ciara Green, my course peer, who dedicated her time to listen to my rants on data science and proved to be a good, critical listener.

Then there are of course my informants, with whom a number of in-depth interviews allowed me to investigate my questions in sufficient depth.

And finally, thanks to my mom and dad, for always supporting me in whatever I decided to pursue.


I. Introduction

Data, as it stands, surrounds us. For computational systems, we, as human

beings, are carriers, herders and interpreters of it. Data, after all, is the

foundation for deriving information - the particular means of insight and

intelligence that enables us to make informed, individual and collective

decisions. Or so we believe.

Some profound changes in this area have been happening over the past

30 years. With the advancement of computational technologies our ability to

collect, share and analyse data (and therefore, information) has changed to a

degree that is historically unprecedented. In fact, according to one of the

corporations that helped set up the infrastructure for this transition, IBM (2013)

- “90% of the world’s data has been produced in the last two years” - and we are

yet to recognize how to “harness its potential”.

There is little doubt that amongst other phenomena, technology shapes

our lives (Bijker et al., 1989), and we, in turn, shape technology (Mackenzie &

Wajcman, 1985). After all, ‘technology’ - the outcome of ‘making

something’ (techne), and ‘science’ - the outcome of ‘thoughtfully pursuing’ and

‘understanding’ (episteme), are inherently linked to one another (Parry, 2014).

As Thomas Kuhn (1963) asserted long ago, science is inherently about the

data, and so too should technology be. And if science is the initiator of a new

way of understanding the world, it also creates opportunities for doing things

differently. This, unfortunately, often translates in the popular discourse into a

simplification that “scientific opportunities = money”, or “data = money”, and

there are a number of larger and smaller pitfalls in seeing the world through

such a lens. This is, however, often the reality of technology and business

narratives, and this is why the fairly new concept of data science, and its

acolytes - data scientists - appear so worthy of investigation.

There are, of course, some limitations to what only a few months of

research can capture in trying to unpack such a large phenomenon. This is why

this research aspired to become an ethnographic snapshot, on the level of its


day-to-day craftsmen, of the individuals who are expected to deliver on the

expectations associated with the proliferation of Big Data, ubiquitous sensing

and evidence-driven decision making.

A few years back, hardly anyone had heard about this profession as it now

stands. But circumstances have changed. Since the 2012 Harvard Business

Review article (Davenport & Patil, 2012) proclaimed the ‘data scientist’ to be

the “sexiest job of the 21st century”, the term has become a household

name. As a consequence, one could argue we are witnessing a profession in the

making; and as it is in the making, there seems to be a fair degree of dubiety

around it, showcasing intriguing aspects of how a professional identity is being

shaped, how communities of practice could be formed, what seems to be the

spirit of times in academic and technology research, and what could be the

implications for the individual, the organization one works for, and the society at

large.

In doing so, this dissertation follows a classical structure for ‘unpacking’

subsequent questions. The very beginning depicts the genesis behind choosing

this particular topic and approaching it from this - and not another – angle. It is

important to make it clear that investigating a community which is ill-defined,

dispersed, hard to access, and above all, diverse – would be challenging

for a four-month long, single-location based (London) ethnographic study. With

this in mind, I have strived to ground the research and analysis in its historical

context, and conducted a number of in-depth interviews with individuals pursuing

the role of the data scientists and individuals on the boundaries of the

profession. In addition, I participated in a number of meetings, which were

taken to be the best reflection of the seed of a data scientist community

(precisely, the Data Science London meet-ups) and analysed online conversations

and media accounts. The research also draws strongly on my experiences,

observations and thinking as an individual who had the opportunity to work with

and amongst people who acquired the professional title of data scientist. It

seemed inappropriate, however, to pursue an ethnographic study of an

organization I was actively a part of, so as not to affect its internal dynamics.


Preliminary consideration of these circumstances led me to believe that a

good way to frame the research would be through the lens of: (i) data as the

foundation of modern day decision-making (and to a degree, society constructed

around this assumption), (ii) the data analyst as the historical key-bearer to

deriving insights from data, (iii) Big Data seen as the next step to a new

paradigm of tools and methodologies associated with data analysis, (iv) inherent

links between science, academia and advanced data analysis training, (v) the

making of science and the scientists in the midst of digitization, (vi) establishing

provisional selfhood, self-identification of professional identity, authenticity of

competence, and (vii) the process of establishing a community of practice.

This research was an interdisciplinary endeavour that inherently involved

cutting through a number of academic disciplines to better understand what

might be the forces at play. The sections on data, Big Data and the data analyst

profession are informed by research stemming from historical accounts of

statistical and computing disciplines, information sciences and Internet studies.

The latter sections, the making of science and the data scientists and their

toolkits, are strongly rooted in the tradition of science and technology studies

(STS), philosophy of science, sociology of expertise and material culture

anthropology. The deliberations on the provisional selfhood, professional self-

identification and communities of practice have strong links with social identity

theory, organizational studies and social and cognitive anthropology.

Following this, the study then becomes more insights-grounded and

analytical in terms of using the recognized findings for informing the argument.

Ethnographic analysis led to a far larger number of interesting areas of

investigation than this study could possibly have covered. It was therefore a

deliberate decision to distil and link the emergent themes around three

underlying research questions.

• What constitutes the professional self-identity of a data scientist?

• How does the job of the data scientist fit into the larger picture of


the making of science and the evolution of the data analyst role?

• Where do data scientists sit along the changing nature (and

understanding) of knowledge associated with digitization of data

and Big Data?

To answer these questions, it seemed critical to start with a commentary

on the proliferation of computerization and digitization within sciences. This has

been something that all of the involved informants, as well as expert

commentaries and literature, recognized to be a phenomenon transforming the

world of science, in particular through the introduction of machine-learning

methodologies and tailor-made, computational data analysis tools. Secondly, it

needed to be underlined that the data scientist is, in the opinion of many, still a

scientist, therefore academia – mainly quantitative and computational scientific

training – seemed to be a key part of nurturing the skills and capabilities which

would then allow one to fit the role.

It has been observed that this process of training has experienced

external winds of change. Academia is often not best suited to provide all of the

required skills and knowledge, hence a myriad of alternative sources of training

are increasingly emerging - especially through online learning platforms and

industry-linked fellowships. Precisely for that reason, the transition out of

academia, and the decision to take a new, more industrial or tech-entrepreneurial

career path, seems to be an important choice and part of the data scientists’

professional identity. The job tasks, the tools and the methodologies the data

scientists pursue are part of this picture.

Data scientists, however, are also an element of a larger organizational

puzzle; the data scientist is often expected to leverage the organization’s data

capabilities and play an important role in establishing good practices around

data-literacy. This opens the lid to a rich conversation around the evolution of

the data analyst role, its place in modern society and implications for how it is

designed. Naturally, this is merely a snapshot of a much larger conversation


which is already happening.

Finally, as in any formal piece of academic research, the analytical part is

followed by a discussion that picks, unravels and comments on the key themes of

this study. This required: (i) colliding the arguments on the establishment of an

interdisciplinary profession happening on the boundaries of quantitative and

computing sciences, (ii) recognizing the importance of training in pursuing the

role of a ‘scientist’ within an organization (often associated with objectivity and

evidence-led approach to solving challenges), and above all, (iii) understanding

the data-, computing- and epistemological- educational role in translating the

expectations placed on Big Data into day-to-day tools, processes and

practices.

All of this takes place within an ongoing conversation and surrounding

semantic tension around the term data science, data scientists and the

organizational and institutional changes this entails for the future of work and

decision-making - making it highly pertinent for the growing body of knowledge

within the field of digital anthropology.


II. Methodology

Research for this study did not start with a pre-assumed research

question. It began with an iterative process of probing and exploring what would

be an appropriate angle to unravel one’s interests, recognize compelling

questions and seek out relevant, transferable knowledge. Building on those

interests (the processes of long-term socio-technical development, broadly

understood ‘innovation’, and the social perception (and expectations) of scientific

and technological change), I was looking for a phenomenon that would merge

those themes together - and data science and data scientists appeared as timely

and worthy candidates.

For the requirements of an ethnographic endeavour, however, this was not

going to be an easy task. Data scientist roles in organizations are still fairly

scarce, isolated, and highly industry-specific. Companies and recruiters are

outstripping each other in trying to acquire talent, and, if they are successful,

those individuals often work on some of the more critical aspects of

organizations’ processes, in many cases highly confidential and sensitive. It was

therefore very difficult to convince any of the individuals I had in my network,

or in their network, to let me pursue an organizational ethnography and investigate

their organization as a research field. Limiting the research to one organization

would also be a danger in itself, therefore a decision was made to conduct the

research with a number of informants from different organizations, focus on

them as individuals (rather than their organizations), and collate ethnographic

insight from an accessible field site, for which Data Science London appeared to

be a good candidate.

Inspiration for pursuing the research through such an approach emerged

from anthropological accounts of researchers who historically also tried to follow

either scientific, or ICT-heavy communities, from Latour and Woolgar (1986

(1979)), Levy (1984), Latour (1987), Miller and Slater (2001), Biao (2006), Kelty

(2008), to Coleman (2013). Advised by my supervisor, I started exploring the


issue by having explorative conversations with people involved in the field – a

researcher exploring how people interact with technologies in health and life

sciences and a data scientist working in one of London’s ‘top data science

agencies’. This was accompanied by participation in meetings of the Data

Science London community, taking part in hackathons and networking events.

These first encounters made me feel confident that, as a researcher, I would be

able to gather adequately diverse accounts of the topic, to allow the study to be

rich in insightful content. A result of this was a working hypothesis that data

science was an emergent profession, but due to its interdisciplinary character

and tech-pushed origins, quite a nebulous term by its nature - also for those

claiming to be data scientists themselves.

The key group of informants for the study were eight individuals with

whom in-depth semi-structured interviews were pursued in Spring and early

Summer 2014. The group was composed of individuals who had acquired the job

title of data scientist just months before the study began. Another group of

informants was composed of individuals who were data analysts in different

organizational settings – e.g. a post-doctoral astrophysicist, a statistician with 15

years of industrial experience, or an economic research fellow. The third group

were individuals actively engaged in shaping (Data Science London) or

researching the community.

During the period of this study - and throughout the digital anthropology

masters program - I worked in an organization that was actively hiring and

building a data science team. However, due to the nature of being an ‘actor’ in

an organizational setting, and the fact that the act of pursuing ethnographic

research might change the dynamics of my ‘social position’ and relationships

within such an environment (Berg & Lune, 2011), I made a deliberate decision that

this organization would not be a field of the study. However, without doubt, my

experiences and observations during that time helped me inform how the

research was framed and which themes would be selected for deeper

investigation, thus influencing the discussion and reflection on this subject.

Thirdly, my participation in Data Science London meet-ups led to a

number of conversations, observed practices and behaviours that were very


informative for understanding how a community self-organizes, how individuals

perceive their membership in such a community, how they self-identify in their

relationship to this community and what motivated them to invest their time

and attention into it. At certain points, I was tempted to conduct a more

in-depth analysis of the community itself; however, due to the infrequency

of the meetings – once every 1.5 to 2 months – the analysis would not have had

adequate depth. Here, rather than focusing on the self-identity of data

scientists, I deliberately chose to focus more on the ‘meet-ups’ as events

expressing a particular, local manifestation of an aspiring community of practice,

rather than as a larger picture of a profession.

Finally, the study was also complemented by on-going literature review

and analysis, both of historical accounts and media stories as they were

emerging. It is worth keeping in mind that this field attracted significant

attention throughout the year of this study (2013-2014). A number of new public

and private institutions legitimizing the term data science emerged both in the

UK and the US (e.g. Imperial College Data Science Institute and Data Science at

NYU) and led to a number of discussions and conversations around this topic - for

example, a meeting at Imperial College titled ‘A Data Scientist is a statistician

who lives in Shoreditch (?)’ (Data Science Institute, 2014).

This study might have also benefited more from applying

ethnomethodological approaches to the subject; however, due to the confidentiality of

the work pursued by some of the questioned data scientists, and their limited

pool, this approach had to be abandoned and the study restricted to a number of

semi-structured interviews. For the purpose of a more in-depth study, on a larger

group of informants, ethnomethodology would be highly recommended for

triangulation purposes.


III. Research questions

Direct exposure to the phenomenon in its making, the literature review and

emergent framework of analysis led to the following research questions:

• What constitutes the professional self-identity of a data scientist?

• How does the role of the data scientist fit the larger picture of the making of

science and the evolution of the data analyst role?

• Where do data scientists sit along the changing nature (and understanding) of

knowledge associated with digitization of data and Big Data? 

IV. Limitations of the study

Exploring the complexities of data analysis roles, data scientists themselves,

and data science as a phenomenon, one could easily recognize a microcosm of

trends and tensions that reflect the dynamics of society at large. However, the

limited scope of this study did not allow for investigation of the relationship

between those issues and data science in more detail - these are some of the

exemplars worth highlighting:

• STEM (science, technology, engineering and mathematics) disciplines and

the gender gap – participation in data science meet-ups, conversations with

data scientists and statistical data on female participation in STEM

professions (U.S. Department of Commerce, 2011) re-affirm that women are

under-represented in this profession. This links

closely with the issue of biased group inclusivity, in-group favouritism and the

gender-biased perception of competence (Moss-Racusin et al., 2012) and

might also have impacted this study.

• Values, beliefs and legacies of the open and free software movements – the

popularity of data science can be associated with the proliferation of tools

developed in the spirit of open software, which made it possible to tackle

increasingly sophisticated data questions. In fact, the Data Science London

organizers included in their mission statement the following claim:


“dedicated to free, open, dissemination of data science and promotion of

open source and open data” (Data Science London, 2014). This is resonant of

the concept of recursive publics and the maintenance of affinity – introduced,

amongst others, by Christopher Kelty (2008) – and sits closely to the notion of

‘geek cultures’ that data science clearly overlaps with. The relationship of

the two would be worthy of a separate investigation that this study could

unfortunately not address.

• Pro-innovation bias – as data science thrived alongside the proliferation of

the concept of Big Data, there is a danger that it might become a victim of

technological hype and of a not yet well-established body of critical literature - a

phenomenon named by Rogers (2010 (1962)) as ‘pro-innovation bias’. Critical

studies of Big Data (boyd & Crawford, 2013) are in themselves a theme that should

have strongly impacted the nature of this research. Although this critique is

acknowledged throughout this study, it does not constitute the core avenue of

argumentation.

• Monetization of data – there seems to be an on-going argument from a

number of technology enthusiasts that ‘more data’ equals ‘more money’,

‘more innovation’ and ‘better policy’ (Gartner, 2011). This is often

accompanied by a rush for immediate extraction of financial value from

‘any data’, which authors - such as David Harvey (2007) - would likely argue

to be linked with the persistence of economic liberalism within modern

political economies. Big Data indeed opens opportunities for economic

benefits, however, the fact that data science seems to have emerged from

within the American tech-industry bubble has also substantial implications for

its wider perception and interpretation that this study needs to acknowledge.

Keeping these few examples in mind, it is worth emphasising that this

study was also constrained by its short time scale (4 months), single-location

(London), and access to only limited accounts of the informants’ tasks and

routines (due to the often-confidential nature of their work). These are,

however, limitations well known to qualitative and ethnographic research (Berg

& Lune, 2011); therefore, where possible, historical and expert insights were

used to complement this picture.


V. Framing the research – literature review

At an early stage of the research process it became clear that the nature

of the research would require insights from a range of the sciences, not

restricted solely to anthropology, nor solely social sciences. For that reason the

analytic framework had to build on some more classical aspects of the body of

knowledge derived from theories of socialization (Mead, 1934; Tajfel & Turner,

1986), studies of professional identity (Becker & Carper, 1956; Wilensky, 1964;

Abbott, 1988), and situated learning (Lave & Wenger, 2008 (1991)) – all of which,

throughout the years, attracted attention from social and cognitive

anthropology, psychology and organizational behaviour studies. Also STS

literature (Latour 1979, 1987; Poovey 1988) and anthropology of policy and

bureaucracy (Shore, 1997; Riles, 2010), proved to be helpful in thinking

about the constitution of new fields of expertise and new kinds of fact.

With little doubt, due to the nature of the analysed community and

overlapping boundaries between different disciplines, it was required to refer to

perspectives from quantitative disciplines – statistics, mathematics, economics –

and computing disciplines - computer sciences, information retrieval sciences,

machine learning and artificial intelligence (examples including: Cleveland,

2001; Friedman, 2001; Varian, 2014). The larger picture of domains where data

science is applied, i.e. the life sciences and data analysis sectors such as

finance, business and quantitative policy (Mattman, 2013; Pentland, 2014), also

seemed vital. Needless to say, the topic of ‘Big Data’, which developed

alongside data science, also attracted significant and very interesting scholarly

work from researchers in Internet studies (Mayer-Schönberger & Cukier, 2013),

communications studies (Parks, 2014), digital anthropology (Boellstorff, 2013),

digital humanities (Manovich, 2011) and sociology (Ruppert, 2013).

This is why the literature review will be cross-cutting by nature,

pointing to contributions and sources available amongst the different areas


discussed above. Deeper commentary will be given only to those positions that

have been identified and judged to be important for supporting the research

questions and revealing the larger picture of conversations that take place

within this topic.

The first part of the literature review will briefly introduce the term

data, and show an interesting historical trajectory of two disciplines, namely

statistics and computer science, which have always tackled data from a

perspective relevant for the profession of the data scientist. The latter part of

the literature review will introduce the body of literature on professional self-

identity, organizational socialization and provisional selves. This will correspond

closely with the section on the characteristics and dynamics of communities of

practice, and the forms of practices, tools and behaviours that make a group and

the individuals within it both socialized and distinctive.

V. I. Locating and grounding the term ‘data’.

The way the term ‘data’ is used by different groups varies widely; words

change their semantics due to a confluence of social, cultural and linguistic

factors (Puschmann and Burgess, 2014: 1692), and almost every discipline and

disciplinary institution has its own norms and standards for the imagination of

data (Gitelman & Jackson, 2013: 12). Historically, the word ‘data’ is the

plural of the Latin datum which, in association with the verb ‘dare’ (to give),

translates into ‘something given’. A thoughtful investigation of the etymology of

the term conducted by Puschmann and Burgess (2014) suggests that the earliest

uses of the English word ‘data’ in theoretical and mathematical context were in

the 17th century, in reference to mathematical variables and descriptions of

historical events.

The principal sense of data shifted during the 18th century from anything

widely accepted as given, granted, or generally known, to a result of

experimentation, discovery or collection (Rosenberg, 2013). The usage of the

word increased over the 18th and 19th centuries, establishing itself firmly in

economics and administration beyond its earlier use only in mathematics and


natural philosophy. The term became entrenched in science, business and administration

in the 19th and 20th centuries, when both its frequency and contexts of use expanded

significantly (Puschmann and Burgess, 2014: 1692). It was in the 1940s that the

earlier uses were supplemented with the use of the word data to describe any

information used and stored in the context of computing. With the shift from

paper record to digital information, data was increasingly used to refer to digital

objects that could be manipulated using a computer rather than generally

accepted facts or outcomes of experimentation or observation. As computing

matured, data increasingly left laboratories and offices to play a role in new,

domestic and public environments.

An interesting argument elaborated on by Puschmann and Burgess (2014:

1693) suggests that data stored as a piece of digital information marks a

departure from previous understandings of the term. In its past meaning, the

processes of giving and interpreting appeared to be highlighted, whereas in the

more recent meaning, data seems to come into being by ‘acts of recording’. As a

result of this shift, the most pronounced difference between the two is the

aspect of agency in data creation. In the past, data was mostly associated with

the role of the statistician, or sometimes more broadly, the ‘data analyst’, and

today it is much more grounded in the design and operations of computational

systems.

V. II. A historical trajectory of the statistics discipline and the data analyst role

Individuals and institutions have gathered data for centuries. Among the

best documented examples from ancient history are Herod’s gathering of census data in

Palestine and the sophisticated tax-collection system imposed by the Roman

Empire. However, the modern institutionalization of data gathering and data

analysis is associated with the emergence of national statistical offices (18th

century - mostly UK, Sweden, Netherlands, Prussia) and the methodological

science diverging from mathematics – statistics (Rosling, 2010). A leap forward

from then, in 1979, the American Statistical Association - the leading US

association for statisticians - organized a conference focused on “The Analysis of

Large Complex Data Sets” to address the issue of increasingly large data sets

and the inadequate tools and knowledge available to tackle those new volumes of

data. Almost twenty years later, history repeated itself: in 1997 another ASA symposium

was themed “Data Mining and the analysis of large data sets”, and today (in 2013

and 2014), industry conferences such as O’Reilly Strata choose as their main

theme - ‘Making Data work - Big Data’ (O’Reilly, 2014).

Despite these clear historical loops, a number of respectable scholars and

pundits (Friedman, 2001; Anderson, 2008; Manovich, 2012; Mayer-Schönberger &

Cukier, 2013) argue that there has been a considerable change in the nature of

thinking about and dealing with data in recent years. Back in the 1970s and

1980s, large and complex data sets were rare and little need was seen to analyse

those few that did exist. Data was collected manually, and the cost of collecting

it was closely tied to its volume, making data collection fairly expensive

throughout the whole process. This changed as

computerization entered the space, and to some degree, the cost of setting up

data collecting infrastructures has decreased to a point that gave new

opportunities for entities that were not able to use these types of solutions

before. Needless to say, in extreme cases – like the NSA or Google data farms

(Forbes, 2014) - even today’s data infrastructures can be very expensive to

operate.

Not surprisingly, individuals in the field of statistics have repeatedly been

asking themselves the question: what is the role of statistics in the ‘data

revolution’? Friedman himself, a statistics Professor at Stanford, argued over a

decade ago that the idea of learning from data has been around for a long time,

but the “interest in analysing these large and complex data sets has only

recently [2000’s] become so intense” (Friedman, 2001: 5). He associated this

with the development of novel database management systems where large

quantities of data resided and which, as a result, provided fertile ground for ‘data

mining’ approaches, that is, processes of analysing data for purposes other than those for

which it was collected (Friedman, 2001: 6), shifting the usual application of data

from ‘transaction processing’ to ‘decision-support’.

This argument, dating back to 2001, is worth remembering when looking

at current conversations around Big Data, including a question Friedman posed

towards data mining - “although data mining appears to be a viable commercial

enterprise, one can ask whether or not it qualifies as an intellectual discipline?”

(Friedman, 2001: 7). In Friedman’s words “‘data mining’ is not yet an

intellectual discipline, but in the future, almost certainly [it will be]” and “one

can predict a big intellectual and academic future for new data mining

methodologies will emerge” (2001: 7). At that time, data mining packages were

already incorporating well-known procedures from the fields of machine

learning, pattern recognition, neural-networks and data visualization. And of

course, some questions remained unanswered - should statistics remain at “what

it’s good at” (i.e. probabilistic inference based on mathematics), or ought it be

concerned with a set of ‘problems’, rather than tools?

An important remark in Friedman’s argument was that “statisticians will

first and foremost have to ‘make peace’ with computing”, as this is where the

data is. If computing is to become one of the fundamental research tools,

then the community “will have to teach, or be sure that students learn, the

relevant Computer Science topics”, and some basic paradigms of the field will

have to be modified (Friedman, 2001: 9). This thought neatly corresponds with

what Hal Varian (2014), Professor at iSchool at Berkeley and Chief Economist at

Google, argues about the modern training of economists - in particular,

econometricians – and the type of skills and tools they need to start acquiring

from their computer science comrades.

This observation leads to conversations about career prospects in ‘data

analysis’ roles. Friedman argued (2001: 9) that up until around the 2000s, if

someone was interested in data analysis, then statistics was one of the very few

(even remotely) appropriate fields to work in. In 2013/2014, this is no longer the

case. “There are many other exciting data orientated sciences that are

competing [with statistics] for customers, students, jobs and even [our own]

statisticians” (Friedman, 2001: 9). “Even prominent statisticians are becoming

more interested in researching problems embraced by other fields, and prefer

to work or publish in other areas” (Friedman, 2001: 9). This is

a very important issue for locating the data scientist profession within the larger

‘data analyst’ perspective. For Friedman, this ‘brain drain’ of students and

researchers away from statistics represented the most serious threat to the

future of the discipline, requiring profound re-examination of its place amongst

the information sciences.

The entirety of Friedman’s argument, happening in the midst of the

Internet-bubble in 2001, corresponds very well with the thinking of William

Cleveland, Statistics Research Fellow at Bell Labs, who also in 2001 called

for ‘An Action Plan for Expanding the Technical Areas of the Field of

Statistics’, under the new label of data science (Cleveland, 2001). Cleveland

expected that soon ‘computer science [will] join mathematics as an area of

competency for the field of data science’, enlarging its intellectual foundations,

and more importantly ‘will carry statistical thinking to subject matter

disciplines’. This resembles, one could argue, a ‘re-branding’ of more

sophisticated data analyst roles, which from then on required computational skills that

transcend other analysis-intensive disciplines.

For this reason – explored later in the research and analysis section - it is

worthwhile to inquire how the data scientists themselves perceive this

phenomenon, and what causes them to believe what they do.

V. III. Big Data – the next frontier

A number of respected individuals and organizations argue that the

potential benefits and costs of using large volumes of data – e.g. for analysing

genetic sequences, social media interactions, health records, phone records,

government records, and other digital traces left by people – are significant, but

still not sufficiently explored (boyd & Crawford, 2013: 663; Mayer-Schönberger &

Cukier, 2013). According to Manovich (2011), the term Big Data has been used,

mostly in the sciences, to refer to data sets that are large enough that they

require supercomputers. However, what once required customized machines can

now be analysed on desktop computers with freely available software. This fact

reiterates that it is not the hard infrastructure (as it often was in the past) that

defined Big Data, but the practices, skills and tools used on more ‘mundane’

levels of interactions. The question is, however, whether the ‘mundane devices’

are, as material objects, enough to conduct the analysis required for Big Data?

boyd & Crawford argue (2013: 663) that Big Data is less about data than it is

about the capacity to search, aggregate, and cross-reference large data sets. They

define it as a cultural, technological and scholarly phenomenon that rests on

the interplay of:

‘Technology - maximizing computation power and algorithmic accuracy to gather, analyse, link

and compare large data sets. Analysis – drawing on large data sets to identify patterns in order

to make economic, social, technical and legal claims. Mythology – the widespread belief that

large data sets offer a higher form of intelligence and knowledge that can generate insights

that were previously impossible, with the aura of truth, objectivity and accuracy.’

Source: boyd & Crawford, 2013: 663

Looking at the etymology of the word, according to Steve Lohr’s

investigation on behalf of the New York Times (Lohr, 2012): “2012 was the

breakout year for Big Data as an idea, in the marketplace, and as a term”,

though its origins have hardly been explored before. Collaborating with an

economist from the University of Pennsylvania – Francis Diebold - the two

recognized the first reference to ‘Big Data’ in 2003 in a paper titled “Big Data

Dynamic Factor Models for Macroeconomic Measurement and

Forecasting” (Diebold, 2012).

However, after some further investigation, it appeared that the term Big

Data “probably originated in the lunch-table conversations at Silicon Graphics, a

high-performance computing manufacturer, in the mid-1990’s, and John Mashey,

its chief scientist prominently” (Diebold, 2012) and has since, with a significant

uptake from 2007, gained traction within the computer-software industry. This

is re-affirmed in the research by Puschmann and Burgess (2014) who argue that

the genesis of the term Big Data lies firmly in the business world. Although the

early discussions on data processing technologies in business closely reflected

the necessity for new tools (allowing companies to deliver faster search results

or store larger volumes of customer data), the debate has since evolved into a

conversation centred around using collected information for analytical purposes,

specifically for predictive modelling (Puschmann and Burgess, 2014: 1691).

Not surprisingly, according to boyd & Crawford (2013: 663), computerized

databases, the core of what Big Data stands for, are not new. The US Bureau of

the Census deployed the world’s first automated processing equipment in 1890,

with relational databases not emerging until the 1960s. It was personal computing

and the Internet that made it possible for a wider range of people – including

scholars, marketers, governmental agencies, educational institutions and

motivated individuals, to produce, share, interact with and organize data. One

could argue that this increased computing power and the development of

appropriate tools in the 2000s were the factors that contributed to the

diffusion of this argument.

Big Data, as boyd & Crawford also argue (2013: 664) is associated with a

possible change to the definition of ‘knowledge’. The introduction of Henry

Ford’s manufacturing system of mass production in the 20th century - using

specialized machinery and standardized products - quickly became the dominant

vision of technological progress, whilst Fordism produced a new understanding of

labour, the human relationship to work, and society at large. According to boyd

& Crawford (2012: 665), Big Data has emerged as a system of knowledge that is

already changing the objects of knowledge, while also having the power to inform how we

understand human networks and community.

“Change the instruments, and you will change the entire social theory that goes with

them”

Source: Latour, 2009 in boyd & Crawford, 2012: 665

This change is mostly experienced at the layers of epistemology and

ethics – Big Data is said to reframe key questions about the constitution of

knowledge, the processes of research, how we should engage with information,

and the nature and categorization of reality (boyd & Crawford, 2012: 665). This

follows a tendency to assume that the massive amounts of data, along with

applied mathematical and computational applications, will replace every other

tool that might be brought to bear (Anderson, 2008 in boyd & Crawford, 2012).

In a way, the question that arises from this is: “how the harvesters of Big Data

might change the meaning of learning, and what new possibilities and new

limitations may come with these systems of knowing”? With little doubt, this is

one of the kinds of questions data scientists are expected to cope with on a

recurring basis.

V. IV. Provisional selves

Becoming a data scientist is a fairly recent phenomenon, blossoming in

the last three to four years, and increasingly acquiring wider public attention. This leads

into questions about the nature of self-identity and self-adaptation to a

professional role, or as Herminia Ibarra (1999: 764), a Harvard organizational

psychologist, calls it - ‘provisional selves’. Professional identities are defined as

relatively stable and enduring constellations of attributes, beliefs, values, motives

and experiences in terms of which people define themselves in a professional role

(Schein, 1978 in Ibarra,

1999: 765). This process of ‘becoming a professional’ was well described in

sociological literature throughout the 1970s and 1980s (Hall, 1968; Wilensky, 1964;

Krause, 1971; Montagna, 1977 in Adams & Kowalski, 1980). Professional

identities form over time with varied experiences and feedback that allows

people to gain insight about their central and enduring preferences, talents and

values. For some professions - for example engineers, doctors, or architects -

official, professional association is granted as a result of certification, restricted

membership or educational accreditation. For the time being, however, what makes a

data scientist a ‘Data Scientist’ remains nebulous. There is no professional

association of data scientists, no accredited ‘certification’, and there are still few,

although a growing number over the last few years, educational programmes

finishing with a degree in data science.

This argument corresponds interestingly with the work of Gil Eyal (2013),

sociologist at Columbia University, who argues that the sociology of professions is in

fact a story of the past and that the sociology of expertise is a more timely and

comprehensive way to capture the changes of today’s world. In his words, sociology of

expertise maintains an analytical distinction between experts and expertise as

two irreducible models of analysis, treating expertise neither as an attribution,

nor a set of skills, but as a network connecting actors, instruments, statements

and institutional arrangements (Eyal, 2013).

Coming from that, a straightforward question seems to be whether it is

the ‘data’ or the ‘scientist’ part of the role, which has a stronger influence on

professional identification, or if it is something different? Ibarra suggests that in

assuming new roles, people not only acquire new skills but also adopt the social

norms and rules that govern how they should conduct themselves (Schein, 1978 in

Ibarra, 1999: 765). Practices and social norms of scientists in a lab setting were

already well investigated in a seminal study by Latour and Woolgar in 1979

(1986). However, it is quite clear that today’s labs, due to digitization and a

number of other social processes and institutional changes, represent a very

different environment. The question of ‘what is the field of the data scientist?’

is also very pertinent - the answer might be different depending on who

asks the question.

Data science, as a relatively new and undefined job title, often pushes

individuals into situations that require new skills, behaviours, attitudes and

patterns of interaction, that can produce fundamental changes to an individual’s

self-definition (Ibarra, 1999: 766). Not surprisingly, this phenomenon is

particularly relevant for data scientists as these individuals often transition from

academic/research backgrounds into industry, where the dynamics and

challenges of the environment require different practices. Identities have long

been seen as constructed and negotiated in social interaction (Mead, 1934;

Goffman, 1959) and socialization is not a unilateral process imposing conformity

on the individual. It is a negotiated adaptation by which people strive to improve

the fit between themselves and their work environment. People often make

identity claims by conveying images that signal how they view themselves or

hope to be viewed by others, but it is unclear to what degree they remain part

of their past scientific role, and to what degree they become part of a new role of data

analyst, lead scientist or consultant, as is often expected of them. Not without

significance is the self-perception of authenticity; that is, the degree of

congruence between what one feels and communicates in public behaviour

about his or her character or competence (McIntosh 1989, in Ibarra 1999: 778).

For the data scientists, this is an area where the concept of situated learning

and communities of practice falls neatly into place.

V. V. Communities of practice

Communities of practice (CoP) is a concept developed at the beginning of

the 1990s by Jean Lave and Etienne Wenger, who proposed a new model of

learning, described at the time as ‘situated learning theory’ (Lave & Wenger,

1991). The concept was a critique of earlier cognitivist theories of learning:

knowledge was said to be not primarily abstract and symbolic, but provisional,

mediated and socially constructed (Berger and Luckmann, 1966; Blackler, 1995).

Situated learning theory positions ‘communities of practice’ as the context in

which an individual develops the practices - values, norms, relationships - and

identities appropriate to that community. It differs in some aspects from

theories of socialization (Vygotsky, 1978) as it calls to attention the possibilities

for variation and intra-community conflict. Following this, learning is described

as an ‘integral and inseparable aspect of social practice’ that involves the

construction of identity through changing forms of participation in communities

of practice - based mostly on processes of participation, identity-construction,

and practices (Handley et al., 2006).

As Wenger argued, participation refers ‘not just to local events of

engagement in certain activities with certain people, but to a more

encompassing process of being active participants in the practices of social

communities and constructing identities in relation to these

communities’ (Wenger, 1998: 4). Therefore, participation is not seen just as a

physical action or event, but it involves both action (‘taking part’) as well as

‘connection’ (Wenger, 1998: 55). This implies the ‘possibility of mutual

recognition’ and the ability to negotiate meaning, but does not necessarily entail

equality, respect or collaboration.

A particularly intriguing aspect of participation is how members of a

community gain status within it. In their early works, Lave and Wenger

(1991) suggested that there is a distinction between a core and peripheries, and

it is through continuous participation that one gains recognition or moves to the

centre. They have, however, deviated slightly from this opinion since then and

acknowledged that participation may involve learning trajectories which do not

lead to a comprehensive ‘full’ participation (Handley et al., 2006: 644). This is

an important point to note in respect to the interviews from this study.

Another important aspect of CoP is identity. The concept of identity rests

on critical readings of social identity theory (Handley et al., 2006: 664);

according to Lave and Wenger (1991), learning is not simply about developing

one’s knowledge and practice, but also involves a process of understanding who

we are and in which communities of practice we belong and are accepted. Two

main processes of identity construction in a workplace are identity-regulation

and identity-work. According to Handley et al. (2006: 644), the first process refers to

regulation originating from the organization (e.g. recruitment, induction and

promotion policies) and the employees’ individual responses. The second process

of ‘identity-work’ refers to employees’ efforts to form, repair, maintain or revise

their perceptions of self, and this involves a negotiation between the

organizations’ efforts at identity-regulation (which the employee may, or may

not internalize) and the employees’ sense of self, derived from current work as

well as other identities (Handley et al., 2006: 645) – all highly relevant for data

scientists in their working environments.

The third and final aspect of CoP is indeed ‘practice’, which according to

Brown and Duguid (2001: 203) is an ‘undertaking or engaging fully in a task, job

or profession’. After all, by participating in a community, newcomers develop an

awareness of that community’s practice. They come to understand and engage

with - or adopt and transform - various tools, language, role-definitions and

other explicit artefacts and implicit relations. For data science, as an

intersection between science, computing and ‘data analysis’, this is particularly

interesting, as tools can be very defining of a profession, allowing for formal and

informal coordination and exchange of knowledge to be identified.

Finally, it is critical to note that communities of practice are not

homogenous, but differ across several dimensions – geographic spread, lifecycle

and pace of evolution. Individuals may participate to a different degree in loose

‘networks of practice’ both across and beyond organizational boundaries, but

according to Handley et al. (2006: 646), it is in relation to these communities

and networks that individuals develop their identities and practices through

processes such as role modelling, experimentation and identity-construction. An

individual’s continual negotiation of ‘self’ within and across multiple

communities of practice may generate intra-personal instabilities within the

community. An example of this is a scenario where a newcomer experiences a

conflict of identity in relation to a role or practice he or she is expected to

adopt (Ashforth and Humphrey, 1993) - a case that data scientists are

particularly exposed to, as they enter new organizational environments with,

often, inflated expectations about the nature of their work.

VI. Unpacking the key themes - research and analysis

VI. I. Computerization and digitization of sciences

“I believe ‘data science’ is more about the science bit, than the data”

- Data Scientist [1], working in industry

The last two decades have seen significant changes to the way we use

computational tools in modern workspaces (Brynjolfsson & McAfee, 2014;

Pentland, 2014). The Internet, email, cloud services and mobile phones are only

a few manifestations of this phenomenon. Alongside consumer products, big changes

have also been happening in the world of artistic and scientific crafts – namely,

the worlds of design and science.

A good example from design is architects, who are less and less being

trained to be proficient at the drawing board, and instead master the use of

design software such as AutoCAD, Adobe Suite and the like. Another example

is surgeons, who increasingly need to become proficient in using tools that

allow them to conduct distant, robotic surgeries. These changes are also

reaching the sciences. The basic tools of scientific practice have changed too –

in many cases, today, a laptop connected to the Internet and appropriate

research software is enough to pursue multiple scientific inquiries. This has

impact on conducting both qualitative and quantitative research, in the majority

of disciplines – ranging from the (digital) humanities to (digital) sociology and

Internet studies, to molecular biology, nanomedicine and computational

statistics. This phenomenon, at its macro-level, seems to be key to

understanding the context in which data science has come to life, and how it

reflects the current spirit of times.

Data is, after all, at the core of all sciences. The scientific pursuit is

perceived as one of the most rigorous ways of conducting research, as it

crunches and critically evaluates data. It is therefore difficult to imagine data

without science, or science without data. Especially, as scientific processes and

practices are increasingly taking place within the sphere of the ‘digital’, it is

increasingly difficult to see both science and data outside of a digitized context.

As a consequence, computational literacy is increasingly becoming a key factor

for scientific careers - either by making the research more sophisticated, or

leveraging scientific communication and engagement. It is therefore not surprising

that a number of disciplines are increasingly considering whether they should

improve their own, intra-disciplinary tools, or reach out to other disciplines to

borrow, apply and build on their tools.

This process of continual learning, swapping and experimenting with tools

seems to be at the heart of data science. In a number of public discussions that

are happening around data science, there is an agreement that it involves

matching skills from the computational sciences with statistical and

mathematical inference, and applying them to certain domain challenges

(Rauser, 2011). There are some limitations to this approach. Proficient training

in scientific domains – e.g. chemistry, regenerative medicine, fluid dynamics,

economics – already in itself is resource-consuming, with additional

computational training adding to this complexity. To some degree, this is why

computer science and machine learning – skills usually associated with data

scientists – are blending into other scientific tool kits, often perceived as

additional ‘data crunching’ resources. This phenomenon is well illustrated by

one of the informants:

“in cosmology there are enormous amounts of data to tackle; it’s not a controlled

environment and there’s pressure on taking as much data as there is possible – that’s where

machine learning becomes useful, when you don’t have enough information, or when you’re

dealing with noisy information (…) but there’s also a danger, sometimes machine learners are

satisfied with answers which are ‘good enough’, which is OK for industry – e.g. a web search in

Google – but often not enough for science”

- Astrophysicist, dealing with Big Data from space measurements

It is a convincing argument that the influence of computerization and

digitization on digital-literacy of the workforce (and, as a result, decision-

making) might have far-reaching organizational and institutional consequences

(Simon, 1965; Mayer-Schönberger & Cukier, 2013; Brynjolfsson & McAfee, 2014).

At the same time, this phenomenon reveals interesting tensions between making

(techne) and understanding (episteme) in data analysis jobs. The recurring

question seems to be how technically or scientifically literate the members of

the group, or their decision makers, need to be in order to make well-informed and

well-understood decisions. There is something in how this literacy/expertise is

constituted as a kind of political and organizational process. It could be an

expression of a wider, historically well known trend of incorporating

‘scientification’ into working environments – be it in the form of data-driven or

evidence-based decision making – and seems to sit in parallel to the ongoing

process of automation of work. This suggests data science is part of a larger

conversation around data-literacy, expertise and skills that might at one point

become a requirement for life-long learning. One could imagine data

scientists playing an active role not only as the ‘skilled technical

craftsman’ but also as the ‘digital champion’ or ‘educator’ of this transition in skills

training.

VI. II. The data scientists’ training

“you’re a scientist first of all – there is a process of how you see the world, design,

experiment - and then you need to have a combination of maths, computing and domain

knowledge”

- Data Scientist [1], working in industry

Along with the extensive digitization of data storage,

the interpretation of vast amounts of data meant that a new breed of researchers

familiar with both science and advanced computing needed to emerge. In the words of

Mattmann (2013), who at the time was the principal lead on big data

initiatives at the Jet Propulsion Laboratory in California, to solve Big-Data

challenges “researchers need skills both in science and computing”, and this

opinion strongly resonates with the standpoint of the informants:

“from my experience, people with whom I work with [bio-informaticians, bio-engineers]

and whom you would consider ‘data scientist’ need to know programming to be able to do high

performance computing (…) there is the cry for more data and computing literacy (…) and it’s

quite difficult to find people with those type of skills”

- Researcher, exploring how people interact with technologies in health and life sciences

“what would make a ‘data scientist’? – exceptional programming skills, use of common

statistical software and an academic background in physical sciences or statistics”

- Statistician / Data Analyst, with 15 years of industry experience

This perspective remains consistent among different interviewees, job

advertisements and the opinions of data scientists themselves and individuals

working in the wider field. In the past, an equivalent of data scientist training

would likely have been a degree in a quantitative discipline – maths, statistics, physics

or economics. However, as computing skills arrive at the forefront of market

and organizational transformations, this pool is enlarged to include computer

scientists and engineers. In fact, most of the research informants who associated

themselves with data science had advanced academic training at a PhD-level in

scientific disciplines: computational modelling of biological systems,

astrophysics, computer science or statistics. Not surprisingly, having a degree in

data science is still rare today, and it is only in the last two to three years that

academic institutions have launched, or are planning to launch, certificates and

degree programs to address this educational demand. It is also clear that

academic training alone is insufficient for acquiring a job in data science, which

is often the defining event that transforms a scientist into a data scientist.

“data scientists work less on the data collection side, or infrastructure (…) data science

is more about analysis ‘the front end between insight and data’ that refers to services and

data”

- Research Fellow in Economics

“they [data scientists] have to program, do statistics, follow the ‘digital trail’ and know

what to do at the end of it (…) there is much about experience – humility about the data”

- Data scientists [2], working in industry

With this in mind, academic training is often perceived merely as a filter,

a first step in shaping the profile of the data scientist. To be successful in an

organizational setting – helping to address data-challenges - one needs to be

able to adjust to the changing circumstances and demands of the role. This

entails different types of skills, behaviours and experiences that one would not

be exposed to in academia:

“in academia you’re getting points for being clever, and in industry it works out what you do, or

it doesn’t, it’s important that you can do things quickly (…) in academia if you show how you did

it, they’ll say ‘umm…that’s just linear regression’”

- Data scientists [2], working in industry

“when you leave academia you start learning C++ and Python (…) there seems to be a big

community movement to change tools – catch-up – with the ‘great tools’ of the outside world”

- Astrophysicists, dealing with Big Data from space measurements

There is a discrepancy between the availability of well-trained scientists with

computing and analytical skills and the market demand for them. A McKinsey

report argues that “by 2018, the United States alone could face a shortage of

140,000 to 190,000 people with deep analytical skills as well as 1.5 million

managers and analysts with the know-how to use the analysis of big data to

make effective decisions” (McKinsey, 2011). For that reason, a number of new

initiatives have emerged, including the Insight Data Science Programme, backed by

a consortium of Silicon Valley tech companies, which is a 6-week

training programme for post-doctoral, quantitative graduates to “bridge the gap

between career in academia and data science (…) and enable scientists to learn

the industry specific skills to work in the growing field of big data at leading

companies” (Insight Data Science Program, 2014). The training programme

illustrates well what data science training means for industry:

1. Intro to Data Science – a round table discussion introducing concepts of

data science, a big-picture overview of what the field is and what makes

a great data scientist.

2. Data project – a 3 week exercise to showcase existing data analysis skills

in a context that companies are familiar with, while ‘forcing to learn’ the

technical skills and technologies that are standard in industry, including:

software engineering best practice, storing and retrieving data,

statistical analysis and machine learning, visualizing and communicating

results.

Source: Insight Data Science Fellows (2014)

This, and other training programmes in data science - e.g. those led by

General Assembly - are part of a larger system of training packages

complemented by earlier established on-line education courses promoted by

academic power-houses such as MIT and Stanford (with a number of on-line or

distance courses in Data Science, Machine Learning and Data Visualization). This

strongly affirms that data science is a career path that did not exist in the past

(at least, not under such a name). It is only in recent years that traditional

organizations of power and educational credibility - such as the iSchool

at UC Berkeley, NYC and Imperial College London - have opened research

programmes in data science and entered this growing field.

VI. III. Transitioning from academia to industry

“in particle physics, cosmology, ‘data science’ has been done for years (…) after a PhD

there’s a gap to become a lecturer and many people want to become ‘data scientist’, as it’s

about the technical craft”

- Astrophysicists, dealing with Big Data from space measurements

One cannot look at the emergence of the data scientist role without the

context, and ongoing evolution of the labour market and novel employment

opportunities associated with socio-economic and technological change. For a

long time, some of the best and financially most rewarding career paths for

students graduating in mathematics, physics, economics and engineering – in

general, quantitative degrees - were in big technology, engineering or financial

organizations. An alternative to this were academic or corporate research

centres. It goes without saying that post-graduate education – what still remains

at the core of the data scientist training – has also changed over the years. More

and more PhD training opportunities have been offered at academic institutions

that were later not matched with further post-doc or tenured opportunities in

academia. CERN, the Switzerland-based research centre best known for the

empirical backing of the Higgs Boson, in itself was home to hundreds of PhD and

post-doctoral scientists in fields ranging from physics, through to engineering

and computer science. However, after the researchers’ term-of-practice is

finished, they may have to consider moving into industry due to the limited

opportunities at other research institutions. This is said to be one of the reasons

why the label data science has found fertile ground – scientists needed to ‘re-brand’ themselves for the purpose of industry roles.

Amongst the informants of the study were both experienced individuals

who have gone ‘down the path’ to becoming a data scientist and individuals

working at the gateways of this role. This provided the study with an interesting

perspective on where and when the transition into data science began, and

whether it could be a sustainable label for self-identification in a working

environment.

“I heard about ‘data science’ for the first time as a PhD student, during an industrial placement I did at a *major web-company*”

- Data Scientist [1], working in industry

“I heard for the first time about data science when I was working at *major tech company* 4-5

years ago and tech-companies were starting recruiting ‘data scientist’ ”

- Data Scientist [2], working in industry

“I heard about ‘data science’ for the first time when I was doing a PhD, and people were leaving for industry”

- Astrophysicists, dealing with Big Data from space measurements

“I heard about ‘data science’ probably for the first time with regards to Patil’s &

Davenport’s article in Harvard Business Review [the sexiest job of the 21st century]”

- Research Fellow in Economics

There is, however, a significant difference between training happening in

academia, and the type of expectations that the data scientists are often

supposed to match when acquiring a role at an organization driven by

commercial dynamics. As one of the informants highlighted, recognition and

reputation building in academia – which is an important part of establishing a

professional identity – is, unlike in industry, often more related to the robustness

and sophistication of getting to a certain output than to the results themselves.

“academia is pressured for novelty (…) that’s why each life scientist writes his own

code, because everybody in academia needs to come with their own solution, and industry is

more about finding the code that is the most efficient for the given task – and finance is

absolutely the best at it, and has for years been hiring some of the best of the best physicists”

- Researcher, exploring how people interact with technologies in health and life sciences

It also seems that the term ‘data scientist’ has strong origins in the tech-

industry, in particular in places such as San Francisco and Seattle (home to the

largest tech-companies). To some degree, this should not be surprising, as data

science emerged from the recruiting practices of companies that really could

pursue Big Data – Microsoft, IBM, Facebook, Twitter, Google etc., and it was (and

still often is) through the endeavours of these organizations that the term has been

receiving so much attention. These companies are also some of the main

recruiters on academic campuses, hiring both young graduates as well as

experienced scientists to run their more sophisticated streams of work. They

are the most involved in the transition of scientists into tech jobs, which also

fits the narrative of larger campaigns for STEM education.

VI. IV. Interdisciplinary work practice

“we have computer scientists being hired into biological projects and roles (…) because

the data is so vast, machine learning techniques make it so much easier”

- Researcher, exploring how people interact with technologies in health and life sciences

The type of projects which data scientists get involved in - unless they are

in strictly domain-specific areas such as banking, insurance or biological mapping

– often entail complexity that reaches far beyond what a simple computational-

system could frame, and are within the interplay of a number of dynamic socio-

technical systems. Data-driven decision making, which seems to be at the heart

of data science for such areas as public health, transport, or public services,

requires expert knowledge from a range of disciplines and standpoints,

often also taking into consideration social, political, scientific, usability

and aesthetic aspects of the developed solutions. However, inter-disciplinary

work comes at a price. Different backgrounds and practices of problem inquiry

require the usage of different language, processes and practices of work that are

not always complementary and mutually understood.

“I believe data science is more about the science bit, than the data (…) for example,

designers use data from a design perspective, but what they really do is design”

- Data Scientist [1], working in industry

“data in itself is useless, unless it can be used as a tool to solve business or research questions (…)

that’s why data scientists need to express not only technical proficiency and a data-driven approach, but

also soft skills: team working, storytelling, engaging communication”

- Research Fellow in Economics

As a result, there is a large emphasis on the ability to work across

different organizational and domain boundaries and adjusting to the technical

knowledge of the audience. Communication in itself is a useful capability and

often on its own requires a separate skillset – e.g. the practices and language to

speak are different for a design organization, a science organization and a policy

institution. However, these are the environments that the data scientists need

to operate within, as they have to be able to work not only on a level of

technical expertise (of handling data and applying computational techniques)

but also on a level of communication, and this is one of the reasons a

comprehensive skillset of a data scientist is so rare, and difficult to capture

within one individual. Each of the fields of training – science, computing,

communication, domain knowledge, and business acumen – is in itself an area

that requires substantial attention in achieving proficiency.

This takes us back to another larger conversation about the specialization

of labour and the wide array of skills that the modern economy requires. More and

more emphasis is placed on the collaborative output (data-analysis, synthesis,

communication, framing and delivery), which is often attributed to

interdisciplinary teams. Depending on the complexity of the issue - e.g. a fairly

simple web-study, or sequencing of the whole human genome - this can range

from several to several hundred individuals in different organizational

constellations. Mitigating this complexity of interactions often requires

appropriate management and organizational culture. The difficulty of

interdisciplinary work is well captured by the informants’ comments:

“A statisticians way of thinking is being comfortable with uncertainty, and that’s often

quite opposite of programmers, who in how they work need proof of correctness”

- Data Scientist [2], working in industry

“I can plot results in R, MatLab, Gnuplot, when I speak to another statisticians – but

with designers, they need to briefed in other ways (…) I met creatives who have no idea about

‘data science’ and just see it through visualization, but completely loose the science parts, and

that’s a completely lost perspective”

- Data Scientist [2], working in industry

In fact, an interesting perspective for how data-challenges can be

perceived is through the lens of data-science hackathons – events where data

scientists and other data-wonks gather for 24 or 48 hours to tackle data-

challenges.

“[Data Science London] This is a meetup for data scientists, data miners, statisticians, data

analysts, data engineers, data architects, data visualizers, data journalists, data science

practitioners, data consultants, academics, researchers, people from science and social

sciences, and in general people directly involved in data projects.”

- Meet-up website

The formula of these events is mostly built around a set of data provided

either by a third-party, or by the organizers themselves. This data is, in many

cases, unstructured and requires a certain level of ingenuity and expertise to

make use of it. This data is then available to teams of ‘hackers’ composed of

data scientists, software engineers, web developers, graphic designers and

others – to devise, in most cases, a prototype for a data-product that in some

way would address a need in a novel and impactful way. This is the space where

interdisciplinary work takes place at its most extreme. Participants often do not

know each other before the event and have to form teams through conversation

at the beginning of the meeting. At many of these events, it is reiterated that

group-work leads to the best results, requiring an effort to reach out to people

who are usually outside of one’s disciplinary background.

“I’m a technologist. I spent 20 years writing software building infrastructure, using

technology to answer hard questions. Maybe the hardest thing I learned in those 20 years is: in

order to do great work, you can’t limit yourself only to only knowing technical things. (…) you

need to know people who are very different to yourselves, and sadly, a tech education does not

prepare you for this very well”

- PhD candidate in Computer Science, one of the meetup’s presenters

But there is a difference between declarations and practice.

Hackathons entail an element of competition, either for an award, satisfaction,

or the ‘joy of play’. The spirit of competition impacts how teams are being

formed – a kind of ‘speed-dating’ process takes place, where individuals talk to

each other, recognize whether they are mutually interested in a given problem

and what each of them can bring to the table in terms of experience, skills and

tools. The boundary of who is considered as a credible team member varies, but

ultimately rests on a combination of: perceived educational training, exposure

to appropriate industry experience and a universal ability to apply a context-

agnostic toolset to certain data-problems. These were ultimately the kind of

features that were ‘respected’. Individuals from different fields had to make an

effort to be accepted into the group, particularly if they did not have the data

literacy and ability to use the appropriate data science tools.

VI. V. Tools of practice

Tools often play a key role in determining a professional occupation. While

proficiency in using the Adobe Suite might entitle someone to call themselves a

graphic designer, the mere use of a drawing board or AutoCAD does not make

someone an architect, at least in the view of formal institutions. This interplay

between professionalization, accreditation, regulation and use of vocational

skills is likely to be as old as the development of craftsman guilds. For a

nebulous term such as data science, this becomes even more confusing. It is

only in the last few years, if not months, that certain institutional frameworks for

recognition and accreditation have begun to be established (e.g. the Insight Data

Science Fellowships, Coursera or edX courses), and it seems quite clear that the

expectation for data science skills flourishes on the demand side of the market,

rather than the supply side. For that reason, it was worth exploring what

the skills and tools so much desired by industry recruiters were:

Facebook (Data Scientist): Fluency with at least one scripting language Python or PHP, familiarity with relational databases and SQL, expert knowledge of an analysis tool such as R, Matlab or SAS, experience working with large data sets Map/Reduce, Hadoop, Hive (and a PhD in a technical discipline)

Linkedin (Data Scientist): Experience programming in an object orientated language (Java, C++, etc.), knowledge of scripting languages Ruby or Python, comfortable in data analysis & visualization using tools like R, Matlab, or SciPy (and a MSc/PhD in a quantitative field, with a strong background in machine learning, statistics or information retrieval).

BAE Systems (Data Scientist): Hands-on analytical experience in technologies such as SAS and R, appreciation and understanding of relational databases, ETL principles, and platforms such as Hadoop or MongoDB (no particular education indicated)

WGSN (Senior Data Scientist): Hands-on experience with big data technologies (Hadoop, Elastic Search, Solr, Java, Pig, Map Reduce), expert SQL skills, exposure and understanding of development tools such as Java; predictive analytics and machine learning packages (and BA/BS in maths/statistics/machine learning or equivalent)

Source: Linkedin search results (2014)

One can easily recognize similarities between the requirements earlier

described in the training of the data scientist section - a wide range of tools

refers to programming languages (C++, Python, Java), data retrieval and

operation tools (Map/Reduce, Hadoop, Hive) and analytical software (R, MatLab,

SAS). However, many of the study informants on several occasions underlined

that it is not the knowledge of these tools, per se, that is key to the data

scientist doing his or her job correctly, but the ability to apply them to a data

problem. That is precisely what distinguishes a skilled data scientist from an

aspiring one.

“many people come into data science, or machine learning meet-ups, they’re provided

with data, they’ll run a simple algorithm and say – this is the result; I’m a data scientist (…)

that’s not how this works”

- Data Scientist [1], working in industry

“good ones [data scientists] know how to use proper tools for a given context, others are just ‘enthusiasts’”

- Data Scientist [2], working in industry

It is important to underline the origins of these tools as it reveals yet

another link between how leading technology companies are influencing the

drivers of change on the labour market and how academic curricula are changing

accordingly. After all, running computer-mediated transactions (which, in

many cases, go into the billions) produced data that was difficult to analyse with

conventional databases. Companies felt it was necessary to develop systems that

would allow them to manage this data on their own. Once this happened, the

tools were then released and labelled as ‘Big Data tools’, and gave space for the

development of data science.

“There is a difference between conventional and non-conventional ‘data science’ (…) and that

depends probably on the type of data source you’re working on (…) tech companies – Google,

Microsoft, Facebook - they get it right, they have the data, and can use it to do ‘data

science’ (…) they’re the people who hit the problems first, and develop the tools”

- Data Scientist [2], working in industry

VI. VI. The evolving nature of the data analyst role

“a few years ago the requirements for the role were: familiarity with maths skills,

mostly Bayesian techniques, and these days its machine learning, and more engineering

backgrounds”

- Data Scientist [2], working in industry

An underlying theme in the discussions about the data scientist profession

is the evolving nature of jobs associated with data analysis, its identification,

computerization and merger with usability and data-aesthetics. This process, as

any other happening in a society heavily impacted by technological change,

implies some notable consequences. Increasingly, more and more data analysis

is now conducted by autonomous software and sophisticated education and

training is required to be able to actively participate in the design and use of

those systems. As in past evolutions of the labour market, this means that

some individuals who “don’t catch-up” or do not have the ability or will to

receive appropriate training, are being left behind. A counter measure to this

seems to be the emergence of online-learning platforms, which are (at least to

some degree) striving to address this polarization. However, the real training of

the new ‘adepts’ of data science is still, as shown by the informants’ insights, an

intensive and time consuming process, which also requires a certain degree of

quantitative literacy to remain in the process.

“data is now on everyone’s mind (…) it’s a bit of a frenzy (…) and if you don’t use data – you’re

not innovating”

- Researcher, exploring how people interact with technologies in health and life sciences

“because it’s a newish thing [data science], it seems attractive for executive staff (…) like,

wow, he does ‘data science’ means he’s doing innovation”

- Data Scientist [1], working in industry

“a lot of people are trying to brand themselves as ‘data scientists’ – e.g. a quant wants to find a

job, or an excel analyst tries to sell himself (…) there is a sense of tribalism, you know ‘I want

to brand myself as a data-scientist’, these are the good network, and things I can pick-up and

progress in my career”

- Data Scientist [2], working in industry

As shown in the literature review, this might not be a completely

unexpected turn of events. It is the increasing degree of automation and

autonomy of computer systems that take over the decision-making process from

human beings that makes the difference. Advances in machine learning and

neural networks, in some cases, result in the so-called ‘wisdom of Big Data’

being ultimately more valued within organizations than more contemporary

analytical methods, at least in organisations with low or mediocre data

literacy. As a result, an open question arises: whether the centre of gravity of

data analysis remains with the analyst, or within the algorithm. Will the new

generation of analysts be composed of context-agnostic specialists with machine

learning and software engineering skills - as has already happened on some

occasions – or will these skills simply become a casual part of the science or MBA

training? The question remains whether such a breed of analysts will be able to

tackle the kind of complex problems that are standing in front of us today (e.g.

public health, environmental pollution, ageing infrastructure, energy constraints)

if the emphasis is given more to the software agent, data visualization and the

story, than to contextual and methodological rigour.

VII. Discussion

Research and analysis conducted during this short study leads to a

convincing argument that we might be witnessing the establishment of a new

profession - emerging at the boundaries of engineering, computing and statistics

- sitting within a longer tradition of the evolution of the data analyst role. The

need for a ‘new breed’ of researchers and data analysts has been expressed on

both sides of the market – amongst the scientific community and the industry –

spearheaded mainly by technology companies from the usual US innovation

hubs. This phenomenon fits into the picture of a more universal change

happening in society: that is, digitization and scientification of work practice.

This process might be often concealed under the messages of increasing

‘datafication of products, services and policy interventions’, and marked by

additional slogans of ‘data-driven analytics’ or ‘evidence-based decision

making’. This is, however, a manifestation of technological advancement and the

increasing consequences that the ICT revolution is having on subsequent areas of

our lives.

This case is also a good example for observing and evaluating how

professional identity is evolving within the community of people calling

themselves, or being labelled, as data scientists. This has much to do with the

attributes, beliefs, motives and experiences that they express and the identities

they construct in social interactions. This is particularly interesting for the data

scientist profession, as it is still in the making. By some it is viewed with

a degree of scepticism, by others eagerly taken on, whilst for many it still remains

nebulous. Additionally, the term has an inbuilt semantic and linguistic conflict between its

parts - at what point was there ever science without data? As this profession

forms out of a stream of scientific training, it fits in the larger conversation

about the making of science, of scientists, of science communication, and

scientific management. This leaves us with two questions: how does a social

group create a new role, and how does the self-perception of authenticity

accord with what one feels and communicates around his or her competencies?

For the self-perceived data scientist and for the industry recruiters who

also shape the perception of the profession, the role is strongly associated with

an advanced (Masters or PhD) degree in applied quantitative disciplines or

computer science. This is because the role of the data scientist seems to assume

a blend between computing and quantitative skills, backed with practice and

experience in conducting scientific work. This educational background is still

asserted by the labels of higher education, however respective on-line and

industry-led programmes have been made available for enriching the training

base. This training is mostly focused on acquiring knowledge about the use of Big

Data tools and their appropriate use for a changing organizational context. In

many cases the tools developed by industry in the last few years are the ones

mostly associated with the data scientist’s role: Hadoop, Cassandra, Map/Reduce,

Hive and Pig all belong to the new generation of Big Data tools. Programming

languages (C++, Python, Java) and analytical tools (MatLab, Stata

and R) join this group too.

These are the technical ‘craftsmanship tools’ that are expected of data

scientists by the labour market, and to some degree, by the data scientists

themselves. As this research and analysis suggests, it is not the pure knowledge

of these tools that makes one a respected data scientist amongst peers, but the

ability to independently choose the appropriate tools for the given context and

the aptitude to skilfully interpret and communicate the findings. Pure

knowledge of these tools, training, or education does not yet seem to make one

a data scientist. This is, rather, a consequence of the type of role one is

expected to pursue at the workplace and its organisational title. For example,

there are a number of individuals possessing the above traits, who pursue the

data scientist tasks, but are not labelled as data scientists.

Due to this, the phenomenon corresponds closely with the question of

how the job of the data scientist fits the larger picture of the making of science

and the evolution of data analysis roles. There is a push in the public narrative,

supported by the tech-industry and backed by some policy decisions, that the

society experiences a lack of appropriate training in scientific roles (STEM), which

puts a strain on the employment market. It is also the expression of how

computing progressively penetrates subsequent areas of our lives, including

scientific conduct and the making and communication of science. The impact of

information communication technologies and computing also reflects that the

way we work in organizations changes over time – be it academic, industrial,

public or third-sector. And data science and data scientists seem to be

universal labels to capture data- and computing- literate individuals who are

comfortable applying the novel tools that the technology sector (and academia)

create to tackle the increasingly complex environment of data-generating

instruments.

A particularly interesting aspect that has emerged over the last few years is

the increasing emphasis on applying machine learning methodologies to vast

data streams. This reflects yet another phenomenon increasingly taking place in

organizations, namely the automation of work – and either replacement of

human physical labour with robotics, or human cognitive labour with algorithms.

Taking this forward to data analysis roles, one can recognize that there is a

tendency towards substituting organisational resources of classical data analysts

(both in science and in industry) with computational solutions, and data

scientists, due to their role and training, are the ones bearing the torch. However,

due to the complexity of many of the issues at the centre of social expectations

– e.g. medical records and public health, environmental sensors and pollution

monitoring, mobility patterns and crisis management – a purely computational,

tech-led approach has often been considered misleading.

As in many other cases, domain knowledge is necessary in order to

recognize appropriate questions, frame the design of the research and soundly

interpret the results. For that reason, there is an increasing emphasis on

interdisciplinary skill-sets, creating teams with different skills and competences

that include different attributes, values, practices and experiences associated

with their roles. This can in some cases lead to enhanced outputs, but in the

process creates tensions resulting from different professional ‘philosophies of

conduct’ and approaches to addressing, interpreting and solving problems. Data

scientists, due to their interdisciplinary profile and exposure to some of the

most sophisticated data problems, are often at the forefront of these

conversations.

As a result, data scientists sit alongside the changing (or rather

evolving) nature of techne and episteme deriving from the introduction of Big

Data approaches to increasingly new areas of data analysis. They bear much

responsibility for how the interpretation of novel data analysis tools and

approaches will be translated into the fabric of the organizations that they are

working for (or the cause they are impacting). As with the use of machine

learning, without a thorough understanding of the investigated data and

context, Big Data methods can easily exclude certain observations and insights.

Data scientists are therefore playing dual-natured roles in the organizational

context. They are the source of wisdom, research and scientific rigour for the

data problem being pursued. But, as is already well documented in the philosophy of

science and STS (e.g. Thomas Kuhn, 1962 and Donna Haraway, 1988), there is

rarely (if ever) an ‘objective truth’ or ‘neutral agenda’ to a political question,

and so objectivity is situated and historically contextual.

This in some way circles the conversation back towards the notions of

education – data, computational and epistemological literacy – and the

implications for the systems of knowing and the meaning of learning. The ways

in which data scientists establish their professional identities –

attributes, beliefs, values, motives and experiences – as the profession grows

might have substantial implications for how decision making is conducted in

industry, business or the public sphere. For that reason, it is critical to make

sure the process of educating data scientists is comprehensive enough to

overcome the interpretative socio-technical and political limitations of Big Data,

machine learning, and whatever comes next. And this is what makes data

science, data scientists and the making of the next generation of data analyst

roles so important for further research.

VII. Closing words

This piece of research has been a genuine attempt at recognizing,

capturing and unleashing some of the most interesting conversations currently

taking place around the still nascent term of data science. The study

investigated how a new profession of advanced data analysts (data scientists) is

emerging out of the statistical and computational sciences, and how it

proliferates to other domains of labour, as data and Big Data become the

bedrock of science and decision-making. This research also investigated what

constitutes the professional identity of a data scientist. It recognized that

education, training and adaptation to a professional role are associated with the

move from academia to industry. It also stressed the role of the perception of

authenticity and competences linked to the ability to use certain tools (techne)

and the ability to use them in the right context (episteme), and how this

affected the process of professional self-identification. The study also touched

upon the place of data scientists in the making of science – influenced by

digitization, computerization, interdisciplinarity of work – and how this

corresponded with the evolution of the data analyst role, which increasingly

requires sophisticated and advanced computational and quantitative training.

Finally, the research, analysis and subsequent interpretation marked how

data scientists sit alongside the debates about the changing nature (and

understanding) of knowledge associated with the introduction of Big Data

methodologies, and what kind of responsibilities – linked with beliefs, values,

motives, data-literacy and science communication – lie ahead for individuals

moving into this role. In itself, the study also proved to be well suited for

applying a combination of ethnographic work, in-depth interviews and historical

analysis that led to insightful observations, helping to unpack a phenomenon

happening in front of our eyes – as the macro trends (digitization, datafication

and automation) drive and influence the socio- and techno-economic

environment we are currently experiencing. Above all, it also adds a small brick

to the body of knowledge of digital anthropology with additional, empirical

insights on the dynamics and forces that are shaping the relationships between

individuals, communities, decision-making agents and digital-era technology.

Bibliography

Abbott, A. (1988). The System of Professions: An Essay on the Division of Expert Labor. Chicago: The University of Chicago Press.

Adams, M., & Kowalski, G. (1980). Professional Self-Identification Among Art Students. Studies in Art Education vol. 21 no. 3, 31-39.

Anderson, C. (2008, 23 June). The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Retrieved from Wired Magazine: http://archive.wired.com/science/discoveries/magazine/16-07/pb_theory

Ashforth, B., & Humphrey, R. (1993). Emotional Labor in Service Roles: The Influence of Identity. Academy of Management Review vol. 18 no. 1, 88-115.

Bandura, A. (1977). Self-efficacy: Toward a Unifying Theory of Behavioural Change. Psychological Review vol. 84, no. 2, 191-215.

BBC. (2010). Joy of Stats (with Prof. Hans Rosling) [Motion Picture].

Becker, H. S., & Carper, J. (1956). The Development of Identification with an Occupation. The American Journal of Sociology, 289-298.

Berger, P. L., & Luckmann, T. (1966). The Social Construction of Reality. New York: Penguin Books.

Biao, X. (2006). Global "Body Shopping": An Indian Labour System in the Information Technology Industry. Princeton, NJ: Princeton University Press.

Bijker, W., & Law, J. (2012). Shaping Technology/Building Society: Studies in Sociotechnical Change. MIT.

Blackler, F. (1995). Knowledge, Knowledge Work and Organizations: An Overview and Interpretation. Organization Studies, 1021-1046.

Boellstorff, T. (2013). Making big data, in theory. First Monday vol. 18, no. 10.

boyd, d., & Crawford, K. (2012). Critical Questions For Big Data: Provocations for a cultural, technological and scholarly phenomenon. Information, Communication & Society vol. 15, iss. 5, 662-679.

Brown, J., & Duguid, P. (2001). Knowledge and Organization: A Social-Practice Perspective. Organization Science 12(2), 198-213.

Brynjolfsson, E., & McAfee, A. (2014). The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies. New York: W.W. Norton & Company.

Burkholder, L. (1992). Philosophy and the Computer. Boulder, San Francisco and Oxford: Westview Press.

Cleveland, W. S. (2001). Data Science: an Action Plan for Expanding the Technical Areas of the Field of Statistics. International Statistical Review, 21-26.

Coleman, G. (2013). Coding Freedom: The Ethics and Aesthetics of Hacking. Princeton: Princeton University Press.

Data Science Institute. (2014, September 5). Data Science Institute - Events. Retrieved from Imperial College London: http://www3.imperial.ac.uk/data-science/events

Data Science London. (2014, September 5). About @DS_LDN. Retrieved from Data Science London: http://datasciencelondon.org/data-science-london/

Davenport, T., & Patil, D. J. (2012). Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review.

Diebold, F. (2012). On the Origin(s) and Development of the Term "Big Data". Working Paper - Penn Economics.

Eyal, G. (2013). For a Sociology of Expertise: The Social Origins of the Autism Epidemic. American Journal of Sociology vol. 118, no. 4, 863-907.

Forbes. (2014, September 5). Article: Blueprints Of NSA's Ridiculously Expensive Data Center In Utah Suggest It Holds Less Info Than Thought. Retrieved from Forbes: http://www.forbes.com/sites/kashmirhill/2013/07/24/blueprints-of-nsa-data-center-in-utah-suggest-its-storage-capacity-is-less-impressive-than-thought/

Friedman, J. H. (2001). The Role of Statistics in the Data Revolution? International Statistical Review 69, (1), 5-10.

Gillespie, T. (2014). The relevance of algorithms. In T. Gillespie, P. Boczkowski, & K. Foot (Eds.), Media technologies: Essays on communication, materiality, and society (pp. 167-194). Cambridge, MA: MIT Press.

Gitelman, L., & Jackson, V. (2013). Introduction. In L. Gitelman (Ed.), "Raw Data" Is an Oxymoron (pp. 9-23). Cambridge, MA: The MIT Press.

Goffman, E. (1959). The Presentation of Self in Everyday Life. Anchor Books.

Google. (2014, September 5). Company Overview. Retrieved from Google Company: https://www.google.com/about/company/

Hall, R. (1968). Professionalization and bureaucratization. American Sociological Review, 92-104.

Handley, K., Sturdy, A., Fincham, R., & Clark, T. (2006). Within and beyond communities of practice: Making sense of learning through participation, identity and practice. Journal of Management Studies, 641-653.

Haraway, D. (1988). Situated Knowledges: The Science Question in Feminism and the Privilege of Partial Perspective. Feminist Studies vol. 14, no. 3, 575-599.

Harvey, D. (2007). A Brief History of Neoliberalism. Oxford: Oxford University Press.

Ibarra, H. (1999). Provisional selves: Experimenting with image and identity in professional adaptation. Administrative Science Quarterly vol. 44 iss. 4, 764-791.

IBM. (2014, September 5). Apply new analytics tools to reveal new opportunities. Retrieved from IBM Smarterplanet: http://www.ibm.com/smarterplanet/us/en/business_analytics/article/it_business_intelligence.html

Insight Data Science Program. (2014). White Paper. San Francisco: Insight Data Science Program.

Kelty, C. (2008). Two Bits - The Cultural Significance of Free Software. Durham and London: Duke University Press.

Krause, E. (1971). The Sociology of Occupations. Boston: Little, Brown and Company.

Kuhn, T. (1962). The Structure of Scientific Revolutions. Chicago: University of Chicago Press.

Latour, B. (1987). Science in Action: How to Follow Scientists and Engineers through Society. Cambridge, MA: Harvard University Press.

Latour, B., & Woolgar, S. (1986 (1979)). Laboratory Life: The Construction of Scientific Facts. Princeton, NJ: Princeton University Press.

Latour, B., Jensen, P., Venturini, T., Grauwin, S., & Boullier, D. (2012). ‘The whole is always smaller than its parts’ – a digital test of Gabriel Tarde's monads. The British Journal of Sociology vol. 63, iss. 4, 590-615.

Lave, J., & Wenger, E. (2008 (1991)). Communities of Practice: Learning, Meaning, and Identity. Cambridge University Press.

Levy, S. (1984). Hackers: Heroes of the Computer Revolution. New York: Nerraw Manijaime/Doubleday.

Lohr, S. (2012, August 11). How Big Data Became So Big. Retrieved from The New York Times: http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html?pagewanted=all&_r=0

Manovich, L. (2011). Trending: The Promises and the Challenges of Big Social Data. In M. K. Gold (Ed.), Debates in the Digital Humanities. Minneapolis: The University of Minnesota Press.

Mattmann, C. A. (2013). A vision for data science. Nature vol. 493, 473-475.

Mayer-Schönberger, V., & Cukier, K. (2013). Big Data: A Revolution That Will Transform How We Live, Work, and Think. Eamon Dolan/Houghton Mifflin Harcourt.

McIntosh, P. (1989). Feeling like a fraud: Part II. Stone Center Working Paper no. 37, Wellesley College.

McKinsey. (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey.

Mead, G. H. (1934). Mind, Self, and Society. Chicago: University of Chicago Press.

Miller, D., & Slater, D. (2001). The Internet: An Ethnographic Approach. London: Bloomsbury Academic.

Moss-Racusin, C., Dovidio, J. F., Brescoll, V., Graham, M., & Handelsman, J. (2012). Science faculty’s subtle gender biases favor male students. Proceedings of the National Academy of Sciences of the United States of America (vol. 109 no. 41), 16474-16479.

O'Reilly. (2013, September 5). Retrieved from Strata Conference: http://strataconf.com/

Parks, M. (2014). Big Data in Communication Research: Its Contents and Discontents. Journal of Communication vol. 64, iss. 2, 355-360.

Parry, R. (2014, September 5). Episteme and Techne. Retrieved from The Stanford Encyclopedia of Philosophy (Fall 2014 Edition): http://plato.stanford.edu/archives/fall2014/entries/episteme-techne/

Pentland, A. (2014). Social Physics: How Good Ideas Spread – The Lessons From a New Science. New York: The Penguin Press.

Poovey, M. (1988). A History of the Modern Fact: Problems of Knowledge in the Sciences of Wealth and Society. Chicago: The University of Chicago Press.

Puschmann, C., & Burgess, J. (2014). Metaphors of Big Data. International Journal of Communication, 1690-1709.

Rauser, J. (2014, September 5). Strata New York 2011: John Rauser, "What is a Career in Big Data?". Retrieved from Youtube: https://www.youtube.com/watch?v=0tuEEnL61HM

Riles, A. (2010). Collateral Expertise: Legal Knowledge in the Global Financial Markets. Current Anthropology 51(6), 795-818.

Rogers, E. (2010 (1962)). Diffusion of Innovations. New York: Free Press.

Rosenberg, D. (2013). Data before the fact. In L. Gitelman (Ed.), "Raw Data" Is an Oxymoron (pp. 15-40). Cambridge, MA: MIT Press.

Ruppert, E. (2013). Rethinking Empirical Social Sciences. Dialogues in Human Geography, 268-273.

Schein, E. (1978). Career Dynamics: Matching Individual and Organizational Need. Reading, MA: Addison-Wesley.

Shore, C. (1997). Anthropology of Policy: Perspectives on Governance and Power. London & New York: Routledge.

Simon, H. (1965). The Shape of Automation for Men and Management. New York: Harper and Row.

Tajfel, H., & Turner, J. (1986). The social identity theory of intergroup behaviour. In S. Worchel, & W. Austin, Psychology of Intergroup Relations (pp. 7-24). Chicago: Nelson-Hall.

U.S. Department of Commerce. (2011). Women in STEM: A Gender Gap to Innovation. Washington D.C.: U.S. Department of Commerce.

Varian, H. (2014). Big Data: New Tricks for Econometrics. Journal of Economic Perspectives 28(2), 3-28.

Vygotsky, L. (1978). Mind in Society. Cambridge, MA: Harvard University Press.

Wenger, E. (1998). Communities of Practice: Learning, Meaning, and Identity. Cambridge: Cambridge University Press.

Wilensky, H. (1964). The Professionalization of Everyone? American Journal of Sociology, 137-158.
