2018 Anaconda State of Data Science Report


In April this year, Anaconda Inc. launched its first survey of the Anaconda community. We wanted to get a better understanding of what users do with Anaconda, what they think about it, and the data sources, visualization, and scale-out approaches they use. The survey ran from March 22-April 30, resulting in 4,218 responses with a 100% survey completion rate.

We at Anaconda are very grateful to everyone who responded, especially because so many took the time to provide detailed comments and feedback.

Executive Summary

The Anaconda State of Data Science is strong. With 2 to 2.5 million downloads per month during January-March 2018, Anaconda is easily the most popular Python distribution, with a growing R following. Below are key conclusions from the survey:

• The future is bright for Python and R. Students and academics are strong users of Anaconda, comprising 41% of the respondents. Tomorrow’s data scientists and machine learning experts are learning and using Python and R today.

• There is a Data Scientist/Software Developer crossover. Both Data Scientists and Software Developers write Python or R code, and the two job roles are not mutually exclusive: Data Scientists who write Python and R libraries can also be Software Developers. But we did not expect almost as many Software Developer users as Data Scientists (15% vs. 16%).

• Machine learning is a key application for Anaconda users. 14% of respondents are doing machine learning.

• Cloud-native data science and use of cloud services continue to rise, at the expense of traditional Hadoop-centric “big data” infrastructure. Responses concerning data sources and scale-out technologies indicate strong uptake of APIs, cloud data services, and container-based approaches to data science at the expense of traditional Hadoop deployments.

• Matplotlib continues to enjoy its first-mover advantage in visualization, sweeping the category. But it is still a highly crowded space with many strong competitors, both open source and commercial. Plotly, Tableau, Microsoft Power BI, and Tibco Spotfire are all strong commercial competitors to Matplotlib and other open source projects like ggplot, Bokeh, D3, and Altair.

• It matters a lot that Anaconda is free… but not so much that it is open source. Free was ranked the most important attribute, while the open source licensing was second to last.


Demographics

We wanted to understand the occupations of Anaconda users, so this was a single choice question where the results sum to 100%. The leading category is Student (26%), which is understandable given that Anaconda is very popular in the teaching realm due to its ease of use, as well as its ability to ensure that every student gets exactly the same Python and R environment and can reproduce the instructor’s results. It also suggests that the growth of Anaconda Python/R users will continue to accelerate as these students graduate and put their expertise to work.

Outside of students, Data Scientists (16%), Academics (15%), and Software Developers (15%) account for most of the remaining users. Because Anaconda was born out of frustration with the difficulty of doing reproducible Python data science, we expected that Data Scientists and Academics would be popular user occupations, but we did not expect almost as many Software Developers as Data Scientists. There is an overlap of needs given both groups write code, but Software Developers also have some distinct priorities, and our product team has been reading all the comments carefully to ensure we understand those and can serve them better.

The most popular response in the “Other” category was Scientist (including specializations like Geoscientist, Chemist, etc.) followed by Engineer.

Python versus R

99% of respondents use Anaconda for Python, as we might expect for a project that came out of the Python community. However, 1% of respondents use Anaconda for R only, with 14% using both R and Python. We have recently boosted Anaconda’s R capabilities and made Microsoft MRO the default package set in the distribution, and we will continue to expand our R support.

Occupation                                      Share     Responses
Student (attending school full or part time)    25.53%    1,077
Data Scientist                                  16.38%    691
Academic (e.g. Researcher, Professor)           15.48%    653
Software Developer / Engineer / Programmer      14.65%    618
Analyst (e.g. Business Analyst)                 8.39%     354
Hobbyist                                        8.25%     348
Other (Please Specify)                          3.70%     156
IT (any role within an IT org)                  3.58%     151
Consultant                                      2.84%     120
Trainer or Educator                             1.19%     50


How Anaconda is Being Used

We also asked respondents to provide a brief summary of what they are doing with Anaconda, as a free text field. We found that Anaconda usage truly spans a wide variety of fields in academia, industry, and government. Use cases include financial asset allocation, formulating cancer public policy, 4D oilfield seismic data processing, medical image machine learning, and molecular dynamics simulations.

Using word matching, we tagged responses with broader categories. For example, for machine learning we matched words and phrases like “ML,” “machine learning,” and “Neural network,” as well as the names of popular ML libraries such as TensorFlow and scikit-learn. While not an exact science, it does give an indication of the relative popularity of use cases. Responses could be given more than one category tag.
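The word-matching approach described above can be sketched in a few lines of Python. This is only an illustration, not the script used for the survey: the machine-learning terms are the ones named in the report, while the second category, its keywords, and the sample responses are hypothetical.

```python
import re
from collections import Counter

# Keywords per category. The machine-learning terms are those named in the
# report; the "package management" terms are hypothetical examples.
CATEGORY_KEYWORDS = {
    "machine learning": ["ml", "machine learning", "neural network",
                         "tensorflow", "scikit-learn"],
    "package management": ["conda install", "package management", "environments"],
}


def _compile(keywords):
    # Word boundaries keep short terms such as "ml" from matching inside "html".
    alternation = "|".join(re.escape(k) for k in keywords)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


CATEGORY_PATTERNS = {cat: _compile(kws) for cat, kws in CATEGORY_KEYWORDS.items()}


def tag_response(text):
    """Return the set of categories whose keywords appear in a free-text answer."""
    return {cat for cat, pattern in CATEGORY_PATTERNS.items() if pattern.search(text)}


def category_shares(responses):
    """Percentage of responses carrying each tag; one response may carry several."""
    if not responses:
        return {}
    counts = Counter(tag for text in responses for tag in tag_response(text))
    return {cat: 100 * n / len(responses) for cat, n in counts.items()}


# Hypothetical free-text answers, for illustration only.
SAMPLE_RESPONSES = [
    "Medical image ML with TensorFlow",
    "Package management for my lab's environments",
    "Writing HTML reports",
    "Neural network training and conda install of dependencies",
]

if __name__ == "__main__":
    for category, share in category_shares(SAMPLE_RESPONSES).items():
        print(f"{category}: {share:.0f}%")
```

Because a response can carry more than one tag, the category percentages produced this way need not sum to 100%.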

Perhaps unsurprisingly, data and numerical analysis (understanding data and what it tells us) was the largest category of usage at 20%. Python and R package management came in just behind at 18%, followed by machine learning at 14%. We did not try to differentiate between different forms of machine learning, given this was a free text field and many responses didn’t provide that level of detail.

What’s Important To Anaconda Users

For this question, we asked respondents to stack-rank seven qualities of Anaconda from first to last in order to understand what was most important to them. The displayed order was randomized for each respondent to help minimize the impact of cognitive bias.

One of the challenges faced by any open source project is how to fund its development, and there are certainly plenty of opinions about the best way to do that within the open source community. Anaconda has always been free, and the survey responses validate that approach: being free is the most important characteristic of the product, scoring 5.16. Zero cost is much more important than its open source licensing, which scores 3.26.

Close behind Free is the ease of installing and managing Python and R packages (4.82). Behind that is the fact that Anaconda Inc. provides a trusted source of packages to the community through our free open repository (4.16). Ease of reproducing data science across systems and the use of conda environments are almost a tie, at 3.86 and 3.70 respectively.

The Navigator GUI comes last at 3.04. In reading the comments, we discovered two reasons for this: first, many users prefer the command line to a GUI, especially when automating tasks; and second, there is room for improvement in Anaconda Navigator, and the product team has already begun planning improvements there.

Quality                                    Score
Free                                       5.16
Easy to install and manage packages        4.82
Trusted source of packages                 4.16
Easily reproduce same results elsewhere    3.86
conda environments                         3.70
BSD open source license                    3.26
Navigator GUI                              3.04
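The report does not say how the ranking scores were calculated. One common convention, assumed here, gives an item n points when it is ranked first out of n, down to 1 point when ranked last, and then averages over respondents. The seven published scores sum to 28.00, which equals 7 + 6 + ... + 1 and is consistent with that convention, though it does not prove it. The sketch below also shows one way to randomize the displayed order per respondent; the function names and sample rankings are hypothetical.

```python
import random
from collections import defaultdict


def shuffled_options(options, rng):
    """Return the options in a fresh random order for one respondent."""
    return rng.sample(options, k=len(options))


def average_rank_scores(rankings):
    """Average score per item: first of n earns n points, last earns 1 (assumed scheme)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            totals[item] += n - position
            counts[item] += 1
    return {item: totals[item] / counts[item] for item in totals}


# Hypothetical stack-ranks from three respondents, most important first.
SAMPLE_RANKINGS = [
    ["Free", "Easy install", "Navigator GUI"],
    ["Easy install", "Free", "Navigator GUI"],
    ["Free", "Navigator GUI", "Easy install"],
]

# Scores published in the report for the seven qualities.
PUBLISHED_SCORES = {
    "Free": 5.16,
    "Easy to install and manage packages": 4.82,
    "Trusted source of packages": 4.16,
    "Easily reproduce same results elsewhere": 3.86,
    "conda environments": 3.70,
    "BSD open source license": 3.26,
    "Navigator GUI": 3.04,
}

if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    print(shuffled_options(list(PUBLISHED_SCORES), rng))
    print(average_rank_scores(SAMPLE_RANKINGS))
    print(round(sum(PUBLISHED_SCORES.values()), 2))
```

Under this scheme, when every respondent ranks every item, the average scores always sum to n(n+1)/2, which gives a quick sanity check on published figures.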

Data Sources

Files reign supreme when it comes to data sources, with 89% of respondents using them for data access, followed by classical SQL databases at 49%. In third place are REST APIs at 25%, demonstrating that getting data from other applications and websites is a key part of modern data science.

In fourth place, Google Cloud’s data services (16.95%) just edge out traditional big data stores like HDFS / Hadoop and Spark (16.88%) and Amazon Web Services’ data offerings (16.12%). AWS is the dominant player in the IaaS marketplace and pioneered modern large scale object storage that is a fraction of the cost of big data technologies like HDFS, as well as fast large-scale query-oriented databases like RedShift. The strength of Google Cloud’s showing is impressive given that AWS has 10x the IaaS revenue of Google (per Gartner: www.gartner.com/newsroom/id/3808563). Google Cloud has a strong data services play and has been focused on leveraging that into growth of customer base, and this result indicates that it’s paying off with the Anaconda community.

Data source                                                         Share     Responses
CSV or other files                                                  89.09%    3,758
SQL database (e.g. Oracle, SQL Server, MySQL, MariaDB, Postgres)    49.19%    2,075
REST API from another app (e.g. Twitter API)                        24.59%    1,037
Google Cloud (e.g. Cloud Storage, BigQuery, BigTable)               16.95%    715
HDFS / Hadoop / Spark                                               16.88%    712
AWS (e.g. S3, Redshift, Dynamo)                                     16.12%    680
NoSQL database (e.g. MongoDB, CouchDB, Cassandra)                   13.61%    574
Distributed SQL engines (e.g. Hive, Impala, Presto)                 8.80%     371
Azure (e.g. Blob Storage, Cloud Database)                           7.14%     301
Other (please specify)                                              6.92%     292

Hadoop-style big data performed relatively weakly versus the other options given this is a data-centric audience. Hadoop has dominated on-premises (non-cloud) data infrastructure for the past 10 years and spawned two tech IPOs (Hortonworks and Cloudera). What was “big data” in 2005 when Hadoop began now easily fits into a single server’s memory, and there is a plethora of alternatives to building a Hadoop data lake.

NoSQL databases came in at 14%, right behind the cloud services, demonstrating their value for storing and processing semi-structured data. Microsoft Azure usage came in at 7.14%.
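As a quick arithmetic check on the data-source figures, the sketch below recomputes each percentage from the published response counts, assuming the denominator is all 4,218 survey respondents (an inference; the report does not state it). The counts add up to well over 4,218, which suggests respondents could select several data sources.

```python
# Total survey responses reported in the introduction.
TOTAL_RESPONSES = 4218

# Response counts for the data-source question, as published in the report.
DATA_SOURCE_COUNTS = {
    "CSV or other files": 3758,
    "SQL database": 2075,
    "REST API from another app": 1037,
    "Google Cloud": 715,
    "HDFS / Hadoop / Spark": 712,
    "AWS": 680,
    "NoSQL database": 574,
    "Distributed SQL engines": 371,
    "Azure": 301,
    "Other": 292,
}


def shares(counts, total):
    """Return each option's share of all respondents, in percent (2 decimals)."""
    return {option: round(100 * n / total, 2) for option, n in counts.items()}


if __name__ == "__main__":
    for option, pct in shares(DATA_SOURCE_COUNTS, TOTAL_RESPONSES).items():
        print(f"{option:<28} {pct:6.2f}%")
    # More selections than respondents implies a multiple-choice question.
    print(sum(DATA_SOURCE_COUNTS.values()), "selections from", TOTAL_RESPONSES, "respondents")
```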

Scaling out data science and machine learning

We asked respondents what technologies they use for scaling out their data science. The majority (52%) don’t use scale-out technologies, and the next closest is deployment to a Linux server at 34%.

Docker makes a strong showing at 19%, beating out Hadoop/Spark with 15%, followed by Kubernetes at 5.8%. This result suggests modern cloud-native style architectures like Docker and Kubernetes are in the ascendancy, at the expense of traditional Hadoop “big data” and Apache Mesos (0.85%).

Dask, an open source technology for parallelizing single host algorithms and machine learning across multiple CPU cores or multiple servers, came in at 3.0% of responses.

The “Other” category at 1.5% included a variety of supercomputers and various AWS, Google Cloud, and Microsoft Azure services.

Technology                Share     Responses
None                      52.18%    2,201
Linux Servers             33.90%    1,430
Docker                    18.94%    799
Hadoop/Spark              15.48%    653
Kubernetes                5.76%     243
Dask                      3.06%     129
Other (Please Specify)    1.54%     65
Mesos                     0.85%     36


Visualization

For this question we used alphabetical ordering for the responses in order to make it easier for respondents to find and select the visualization tools they use.

Matplotlib in all its guises absolutely crushed this category, with 75% using it directly, 47% using it via Pandas Plotting, and 27% via Seaborn (multiple responses were allowed).

Plotly and ggplot (for R) came in almost neck-and-neck at 24.4% and 24.3% respectively. The usage of ggplot far exceeds the percentage of R users represented in the survey, indicating that Python users also like to use it. Tableau came in next at 20%, followed by Bokeh at 14% and D3 at 10%.

The “Other” category was substantial for this question. The most common write-in answers, from most to least popular, were:

1. Microsoft Power BI

2. Tibco Spotfire

3. Microsoft Excel

4. Qlik

5. Altair

Tool                      Share     Responses
Matplotlib                74.82%    3,156
Pandas plotting           46.80%    1,974
Seaborn                   27.17%    1,146
Plotly                    24.37%    1,028
ggplot (R)                24.30%    1,025
Tableau                   20.29%    856
Bokeh                     13.92%    587
d3                        10.10%    426
Other (please specify)    9.39%     396
Holoviews                 1.92%     81


Where users get help

Perhaps unsurprisingly, Anaconda users would rather hit Google Search (4.89) or Stack Overflow (4.42) to find an answer than read the docs (4.05)! With this question we randomized the order of responses for each respondent to help minimize the effects of cognitive bias. This result, along with comments from respondents, indicates room for improvement in the Anaconda documentation, which the team will be working on.

What would Anaconda users do differently?

We deliberately left this as a final, open-ended question so we could receive candid answers on anything we can do to improve. One common theme was better interoperability between the pip installer and conda, especially when pip-installing packages into conda environments. This is something that the conda team is actively working on, for release later this year. Better documentation was another theme.

Improved interoperability with Docker and container-building in general was also popular, with users looking to build smaller containers more quickly using Miniconda and conda environments.

Finally, it warmed our hearts that the word “Love” was one of the most frequent words encountered in these responses, as in “Love what you do already” and “Nothing, I love it!”

We’d like to thank all the respondents for taking the time to complete our survey. Let’s do it again next year!

About Anaconda, Inc.

With 6 million users, Anaconda is the world’s most popular Python data science platform. Anaconda, Inc. continues to lead open source projects like Anaconda, NumPy, and SciPy that form the foundation of modern data science. Anaconda’s flagship product, Anaconda Enterprise, allows organizations to secure, govern, scale, and extend Anaconda to deliver actionable insights that drive businesses and industries forward.

Chart (Where users get help), ranked options: Google for an answer; StackOverflow; Read the documentation; GitHub; Anaconda email list/Google Group; Social media (e.g. Twitter)