SDS PODCAST EPISODE 249: DIVING INTO DATA SCIENCE CONSULTING

Post on 17-Jul-2020

Kirill Eremenko: This is episode number 249 with the CEO and Co-

Founder at SFL Scientific, Michael Segala.

Kirill Eremenko: Welcome to the SuperDataScience Podcast. My name

is Kirill Eremenko, Data Science Coach and Lifestyle

Entrepreneur. And each week we bring you inspiring

people and ideas to help you build your successful

career in Data Science. Thanks for being here today

and now let's make the complex simple.

Kirill Eremenko: This episode is brought to you by our very own Data

Science Conference, DataScienceGO 2019. There are

plenty of Data Science conferences out there.

DataScienceGo is not your ordinary data science

event. This is a conference dedicated to career

advancement. We have three days of immersive talks,

panels and training sessions designed to teach, inspire

and guide you. There's three separate career tracks

involved. Whether you're a beginner, a practitioner or a

manager, you can find a career track for you and

select the right talks to advance your career. We're

expecting 40 speakers, that's four zero. 40 speakers to

join us for DataScienceGO 2019 and just to give you a

taste of what to expect, here are some of the speakers

that we had in the previous years. Creator of Makeover

Monday, Andy Kriebel. AI thought leader Ben Taylor,

Data Science influencer Randy Lao, Data Science

mentor Kristen Kehrer, Founder of Visual Cinnamon

Nadieh Bremer, Technology Futurist Pablos Holman

and many, many more.

Kirill Eremenko: This year we will have over 800 attendees from

beginners to data scientists to managers and leaders.

There'll be plenty of networking opportunities with our

attendees and speakers and you don't want to miss

out on that. That's the best way to grow your data

science network and grow your career. And as a bonus

there will be a track for executives. If you're a

executive listening to this, check this out. Last year at

DataScienceGO X, which is our special track for

executives, we had key business decision makers from

Ellie Mae, Levi Strauss, Dell, Red Bull and more.

Kirill Eremenko: Whether you're a beginner, practitioner, manager or

executive, DataScienceGO is for you. DataScienceGO

is happening on the 27th, 28th, 29th of September,

2019 in San Diego. Don't miss out. You can get your

tickets at www.datasciencego.com. I would personally

love to see you there, network with you and help

inspire your career or progress your business into the

space of Data Science. Once again, the website is

www.datasciencego.com. And I'll see you there.

Kirill Eremenko: Welcome back to the SuperDataScience Podcast ladies

and gentlemen, I'm super excited to have you back

here on the show because we've got a returning guest.

For the second time round, Michael Segala is joining

us. He is the CEO and Co-Founder of an AI, Data

Science, Machine Learning consulting firm based out

of Boston but operating globally called SFL Scientific.

Kirill Eremenko: Previously we had a super exciting discussion with

Mike that was back in the middle of 2017 and it was

episode number 65 on the SuperDataScience podcast

if you missed it and today Mike is back with even more

case studies and more inspiration for you guys in the

space of data science. Here are some things that we

talked about, just as last time Mike shared three case

studies and of course they were different this time.

This time we talked about healthcare imaging and we

delved deep into neural networks and the architecture

and design of neural networks.

Kirill Eremenko: Then we talked about logistics and supply chain and

the challenges there and we talked about things such

as bottlenecks and routes and how machine learning

can help in those spaces and what kind of projects

they're doing in that industry. And we talked about

energy and in the space of energy, Mike actually gave

us two case studies, and some of the things that you'll

learn there are dealing with unbalanced data sets,

creating fake data sets, unsupervised learning for

anomaly detection and supervised learning with small

data sets and in general, this challenge of small data.

Those are just a couple of things that you'll learn,

there's plenty, plenty more that Mike shared, including

an overview of the world of Data Science projects and

Data Science Consulting in general, which I think you

will find extremely valuable and why companies in

2019 and 2020 might actually start defunding artificial

intelligence and machine learning and what we can do

about it.

Kirill Eremenko: As you can imagine, this is going to be a very, very

powerful podcast, can't wait to jump into it. But before

we do, I wanted to give a shout out to our fan of the

week. And this one is from Ronnie who says, "If you

have an interest in programming automation, big data,

machine learning, etc., this is a must listen, focuses

on data science, analytics, etc., in the corporate

world." Thank you very much Ronnie. Very, very

inspiring to hear that.

Kirill Eremenko: And for those of you out there who are listening to the

show and you haven't yet left a review, then head on

over there on your podcast app or just go to iTunes

and leave a review for the SuperDataScience podcast,

that would be just amazing. I'd really appreciate it

because I love reading your reviews. And with that

said, I'm super excited about today's episode. And

without further ado, for the second time round I bring

to you Mike Segala CEO and Co-Founder of SFL

Scientific.

Kirill Eremenko: Welcome back ladies and gentlemen to the

SuperDataScience podcast. Super excited to have you

on the show because we've got a returning amazing

guest with us here. The one and only Mike Segala from

Boston's SFL Scientific. Mike, welcome back. How are

you doing today?

Michael Segala: I'm doing great. Thanks for having me back. It's a

pleasure to talk again.

Kirill Eremenko: The pleasure's all mine, the podcast we had last time

was an amazing success and totally totally rocked it,

so looking forward to having another one today. How's

the weather in Boston these days?

Michael Segala: Well, it's late February, so we're cold and windy, but

not too bad snow this year, I can't complain too much,

but not nearly as nice as where you're at in the world.

Kirill Eremenko: Yeah, man, I'm in Tasmania now and like I was

mentioning before, it was freezing last night. It's my first

time in Tasmania, literally the day I got here, they

have like the worst wind and weather ever and it's

freezing cold. But it's a nice [crosstalk 00:07:05]

Michael Segala: That's how it always is, right? You go on vacation,

and they're like, "Oh, this is the worst seaweed we've

ever had." There's always something but it keeps it

fun, right?

Kirill Eremenko: Yeah, that's true. A bit of variety. That's right. That's

exciting. It's been over one and a half years since we

last spoke, the previous episode, by the way for our

listeners, if you haven't heard it, highly recommend

checking out. Mike shares amazing case studies. It's

episode number 65, so you can find it at

superdatascience.com/65, with Mike it was over one

and a half years ago. What's been happening since then?

Michael Segala: A lot. Just to kind of recap real quick for the audience.

I've run SFL Scientific, we're a Data Science consulting

company. Unlike a lot of these traditional product

companies or vendors, we're purely focused on really

attacking this Data Science market from a purely kind

of consultative standpoint. Truly kind of service

oriented. What that means for us is we get to have a

lot of really smart folks on staff that get to work across

a really far ranging kind of sets of clients and topics

across the data science and data engineering space.

Michael Segala: For us we're really just continuing to grow and move

with the market. As everything continues to mature

and money gets fed into this AI market, SFL is taking a

really nice ride along with them and continue to kind

of execute on really interesting innovative projects and

just grow the business. It's been a great time and it's

kind of very similar to yours Kirill. We both kind of

started the companies a couple of years ago in the

beginning of this phase and are doing great stuff so

congrats to you as we've been taking this ride a little

bit together here.

Kirill Eremenko: Thanks for that. Thanks mate. Yeah, I can only say the

same. It's exciting to see the explosive growth you had.

I sometimes go on the SFL Scientific website and even

if you're not a business owner, if you're a data

scientist or you're a data science manager aspiring,

highly recommend checking out sflscientific.com. I just

go there for inspiration sometimes, you go to solutions

or our work. I like how you have this grid of different

industries you've worked in, from advertising,

marketing, agriculture, insurance. And then like I click

on one of them and I'm like, "Oh, that's really cool."

What have you done in agriculture, satellite imaging,

resource management, crop forecasting, livestock

monitoring. Those are some really cool things. There's

a ton of industries you guys have worked on. It's crazy.

How do you keep up with all these projects?

Michael Segala: Well, keep up with the project is different than

executing. Keep up is a lot of late nights and email

exchanges. But everybody on this podcast listening is

pretty educated at least from a data science

perspective, and as we know, algorithms, data sets,

they all kind of boil down to the same fundamental

data types and challenges. What do we have

fundamentally? We have images, we have time series

data, we have text data and a couple other types of

fundamental modalities of data. And what you can

start doing is thinking about, all right, if I had an

image and this image came off of an MRI machine or a

satellite image or even a camera in my house, how

would I classify that image? Or how would I segment

that image?

Michael Segala: And if you're really good at thinking through the

fundamental challenge behind capturing, collecting

and storing and then solving the problems of those

data types, you can kind of abstract away some of that

industry vocabulary and difficulties that very industry

specific folks focus on. What we really try to focus on

as a company is saying, "Hey, I want to hire the best in

class folks at computer vision or time series analysis

or NLP analysis." And arm them with that kind of 95%

of the knowledge to solve all problems. And then when

we talk to somebody from Ad tech or from Pharma or

from finance, being able to slot in and solve an NLP

problem or computer vision problem is kind of very,

very similar and almost a rinse and repeat because

you have that core knowledge. And then you can really

apply it across all these verticals very, very easily.

Michael Segala: That's the way that we attack the market. Now granted

that's not for everybody, but we find that to be

extremely successful and we really had no issues with

that so far.

Kirill Eremenko: That's amazing. I love that you mentioned because we

talked many times with many guests about the

transferability of data science skills. That's why I

personally enjoy Data Science which I think it's such a

cool industry to be in is because you develop those

skills as you mentioned, and then you can take, you

can separate in your mind the data science side of

things and the domain knowledge or the business

knowledge and you can take your data science skills

and transfer them into different areas and very quickly

graft that domain knowledge and consulting is like one

of the places of course where that is the most evident.

Michael Segala: Yeah, absolutely. And I mean for us, we don't think of

data science as a point position around algorithms. I

actually think that's the least interesting thing going

right now in data science. Because when you think

about data science, all these algorithms, take anything

off the shelf, your XGBoost models, your TensorFlow

models, right? These are all becoming very commodity

and it's almost trivial at this point to take some data,

run it through XGBoost and get a prediction. Literally

if that takes you more than 20 minutes, if you're just

kind of doing rinse and repeat, you don't know what

you're doing.

Michael Segala: When we're thinking about consulting, it's so much

more than this kind of very singular thought around

algorithms. We like to take that very holistic approach

of saying, "If you're a real organization who needs to

solve a real data problem, how do you do that?"

Michael Segala: And the first way that you do that is as a data scientist

to take a big step back and think about the strategic

vision here, what's the real business use case that

you're thinking about? How would you solve this?

What's that ROI look like? What do I actually get at the

end of these algorithms? And you're really thinking

through not just the sciencey algorithm stuff, but also

the business stuff. And then also thinking about, well,

how would I engineer that solution? How would I do

that in a kind of scalable, secure environment where I

can now go and productionize this thing.

Michael Segala: And kind of having that, and coding that around these

algorithms is really where that interest lies. And again,

the reason that I'm saying that is because if you're a

consultant and if you want to get into this space where

if you really want to be a great data scientist, what we

find is, these just very simple algorithms, they're going

to be commoditized. If you want to stay above that

curve, you have to really think about that larger

picture. And that's also very repeatable across

industries, all of these themes make you an extremely

innovative folk and be able to be used across all these

different problem statements. It just kind of keeps

going and going.

Kirill Eremenko: Yeah, totally agree. And you mentioned just before the

podcast that you have grown to over 30 people. What

kind of roles do you have on your team? Is like

everybody doing data science projects end to end or do

you have some people specialized in certain types of

industries, certain types of areas or parts of the data

science project?

Michael Segala: We have two very different groups of teams, first is

more of the sales and the business folks that sit under

me, but we'll put them further aside for the moment.

They have their great roles, they do their things, but

not really for this podcast, actually let me just stop

there for two seconds. I actually make all of my sales

and business people take your courses.

Kirill Eremenko: Oh no.

Michael Segala: I swear to god. As their first two weeks or three weeks

of their introduction, they have to take your, I think

two of your courses as their introduction to data

science.

Kirill Eremenko: Wow.

Michael Segala: Everybody [inaudible 00:15:30]

Kirill Eremenko: Thanks, man, that's so exciting to hear.

Michael Segala: Because it's a great resource. It's an absolute great

resource and I feel that everybody on my team, no

matter if you're a sales folk or if you're whoever else,

you have to be a data scientist at least some novice

level. You have great resources so we really appreciate

them.

Kirill Eremenko: That's awesome. It's so exciting to hear as well that, I

think this stands to show that data culture or data

driven thinking and culture. This is on one hand of

course it's about knowing your product and what

you're selling. But on the other hand, this way your

team as a whole can develop this data driven mindset.

If a salesperson is talking to a client, they might be

like, "Oh, this might be helpful." XGBoost or Decision

Trees, Random Forests. Really, really cool. Thanks

man, you put a huge smile on my face.

Michael Segala: I'll answer your other question but I want to get back

to this as well because I think a lot of folks that listen

to your podcast could be from that sales and business

side of the world. And at least me, right in my team, I

run that department myself ... I'm a physicist. I'm a

scientist. I'm a data scientist but now basically I'm a

sales guy and I have a very core belief, exactly

parroting what you're saying that if you want to sell

data science, if you want to be in that role of data

science but not a technical employee, it is

phenomenally critical that you have that same

vocabulary. You understand the real challenges and

you can have at least a five-minute conversation where

you're actually conveying real knowledge about the

topic.

Michael Segala: Otherwise you just kind of look silly compared to

people who know what they're talking about. It's

extremely crucial to have a real base line in there. But

anyways, putting that to the side for the moment, on

the technical side of the house, we usually have two

types of individuals. One is our data scientists and our

data scientists, we look for people who are generalists

but extremely gifted generalists. I need you on one day

to be able to solve cutting edge 3D medical imaging

projects and then the next day doing NLP work. We

tend to not hire folks who only know how to do one

thing because you're a consulting company. That

project might be up in six weeks and then you're off to

something else.

Michael Segala: Our goal is to hire really well rounded folks, but we

tend to double down a bit in the healthcare market,

healthcare, Pharma, biotech. It's really nice when

people have that kind of general backgrounds, physics,

biology, chemistry and things of that nature. But really

bright individuals, that kind of know the data science

space. That's kind of the one group of team. And then

the other side is more of our engineering folks, we call

them like AI engineers because they're not like day-to-day

sysadmin folks or SQL people. They're the ones kind

of deploying these solutions at scale, all the way from

very large petabyte size image loads to realtime data

transfer and kind of model deployments.

Michael Segala: We tend to have those two kind of engineering and

data science teams, but they work huge overlaps. Both

can kind of pair each other and do a really nice job.

That's how we set up the teams internally.

Kirill Eremenko: Gotcha. What's the split approximately between the

data scientists and AI engineers?

Michael Segala: I would say 70, 30 maybe 70% data science, 30%

engineering.

Kirill Eremenko: Gotcha.

Michael Segala: Give or take, something like that.

Kirill Eremenko: Am I understanding this correctly that you not only

deliver the insights and solve the problem for the

client using data science, but you also help

organizations actually deploy their solutions into

production and actually have those models working on

an ongoing basis, hook up all the tools and make sure

that everything's working right. Is my understanding

correct?

Michael Segala: Absolutely. And I think if you don't do that, you're

falling very short of what it actually means to do data

science. Data science isn't running a POC on your

laptop, with a CSV file, it could be, but for most real

organizations, they need something much more robust

than that. That can fit into a real process and kind of

take in real data and kind of show the results and

kind of fold into more of their business process. It's

really critical for us, obviously the first phase in most

projects is very simple, take this data, show me that

you can predict something, great, show it in a sandbox

environment. And then what we really need to

transition them into, where most organizations fall

short and why most data science projects fail is not

because the data's no good or because the models are

no good.

Michael Segala: It's actually because the folks don't know how to

integrate these things and productionize the code.

That's a huge problem we see in the industry. We

really try to be thoughtful, when we kind of prove out

the POC to show them and work with them to deploy

it, 'cause unless you deploy it, it's really a failed

project. Absolutely. It's extremely important.

Kirill Eremenko: It's kind of like a follow through, like getting things

done. I imagine it as American football, imagine one

player throws the ball and the other one has to catch

it, the data science side of things, that's throwing the

ball. How fast you can throw it, how accurately you

can throw it, how you can avoid other players jumping

at you when you're throwing it, all that stuff. But if

there's nobody to catch it, then where's that ball going

to go? It's just going to land by itself.

Michael Segala: It's going to hit somebody in the back of the head.

That's all it's going to do.

Kirill Eremenko: Exactly.

Michael Segala: But I agree. Any analogy you want to make.

Absolutely. The fact that we still don't have a culture

in the Data Science space around deployment and

productionization, I think is one of the biggest issues

that I see. And one of the biggest risks of folks not

investing longer term, kind of in their data strategies,

these kind of failed POCs. And a lot of that really

just kind of comes down to integration and

productionization.

Kirill Eremenko: When you say POC, what do you mean? Just so we're

all in the same boat?

Michael Segala: Yeah, sorry. A POC usually is take any, I don't know,

take a problem, pick a use case, whatever it happens

to be. Predicting churn for my customers, pick

something ... A POC is normally, here's 10,000

historical customers, here's the data. Show me that

you can predict with some given level of probability

that these customers can churn, pretty straight

forward. They give it to you on a CSV file, you fire up

XGBoost within a couple hours you could probably do

something. You need to show the business that, that is

validated and you can do it. But now you need to then

productionize this by saying, "Okay, now I have real

customer data coming in every day, I'm collecting it,

I'm adding external information. How do I integrate

this code and algorithm into my actual workflow?"

That kind of [inaudible 00:22:28] has the POC into

more of a real kind of implementation phase. That

make sense?

Kirill Eremenko: POC is basically proof of concept that you can get it

done.

Michael Segala: Proof of concept.
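
The two phases Mike describes, a one-off POC on a historical CSV and then scoring real customers as they arrive, can be sketched roughly as follows. This is a minimal illustration only: the column names, the sample rows, and the helper names `train_baseline` and `score` are hypothetical, and a simple per-plan churn rate stands in for the XGBoost model mentioned in the conversation.

```python
import csv
import io
from collections import defaultdict

# Hypothetical historical extract; the columns and rows are made up for illustration.
HISTORICAL_CSV = """customer_id,plan,churned
1,basic,1
2,basic,1
3,basic,0
4,premium,0
5,premium,0
6,premium,1
7,basic,1
8,premium,0
"""


def train_baseline(csv_text):
    """POC step: learn the historical churn rate for each plan from a CSV extract."""
    counts = defaultdict(lambda: [0, 0])  # plan -> [churned, total]
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["plan"]][0] += int(row["churned"])
        counts[row["plan"]][1] += 1
    return {plan: churned / total for plan, (churned, total) in counts.items()}


def score(model, customer, default=0.5):
    """Production step: score one newly arrived customer record.

    Plans never seen in training fall back to an assumed default probability.
    """
    return model.get(customer["plan"], default)


model = train_baseline(HISTORICAL_CSV)
new_customer = {"customer_id": 9, "plan": "basic"}
print(score(model, new_customer))
```

The training call is the part that fits in a sandbox. The `score` function is the part that has to be wired into the daily workflow, which is where Mike says most of the integration effort goes.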

Kirill Eremenko: Okay. Gotcha. All right. Well, this would be

interesting to hear from you, who's in the field. You

guys work in tons of these projects. What would you

say roughly is the estimated amount of time ratio

between the data science side of things, doing the

work and preparing the model and the productionization

of the model? How would you split the time required

by your team on to those two part components?

Michael Segala: It's a very open ended question and it depends

phenomenally on the project, obviously. You have to

realize that for us we tend to work on more innovative

type of projects, because a lot of these low hanging

fruit problems, internal data science teams are doing,

or you can call some API to do it, you might

necessarily need to bring us in for some of the bigger

type of stuff. A lot of our projects are more kind of that

cutting edge bigger projects. For us, I tend to try to

run a first POC in the matter of say 4 to 12 weeks, give

or take that timeframe, if it's fast 4 weeks, if it's a little

longer 12 weeks, in that probably half of that time is

spent getting the data, thinking about it, doing some

kind of exploratory analysis, cleaning it, playing with

it.

Michael Segala: Maybe a quarter is spent modeling it and then the last

quarter is spent explaining to the client, walking it

through, understanding it, validating it and things of

that nature. The first half of the project, maybe only

half of the time is spent with the algorithms. And then

I would say to productionize that, I mean that could

itself take anywhere from a day to a year. It really

depends on the business and how complex their IT

infrastructure is, how complex the data is. If there's

security issues, if there's compliance issues. That's

when you get into the whirlwind of just craziness. It

really depends.

Kirill Eremenko: Wow. Sounds like that part is the more uncertain one

from a day to a year. Well it's lots of uncertainty there.

Michael Segala: Yeah. I mean, I'm being a little heavy handed with the

day, call it a couple of weeks. But yeah, I mean it

could be very quick to a very arduous task.
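
As a rough worked example of the POC split Mike quotes (about half the time on data, a quarter on modeling, a quarter on walking the client through it), the arithmetic for a 4-week and a 12-week POC looks like this. The phase names and the function `poc_time_split` are hypothetical labels for illustration, and the shares are the approximate figures from the conversation, not measured values. Productionization is left out because Mike gives no fixed share for it.

```python
# Shares are the rough figures quoted in the conversation, not measured values.
POC_SHARES = {
    "data_preparation": 0.5,
    "modeling": 0.25,
    "client_walkthrough": 0.25,
}


def poc_time_split(total_weeks):
    """Return the number of weeks per phase for a POC of the given length."""
    return {phase: total_weeks * share for phase, share in POC_SHARES.items()}


for weeks in (4, 12):
    print(weeks, poc_time_split(weeks))
```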

Kirill Eremenko: Okay. Well that's good to know. And that also shows

that there's a massive hidden complexity involved in

data science projects that a lot of executives don't

consider. If you have a data science strategy, that's

something you should have as part of your data science

strategy. If you're just developing your data science

strategy, not only should you include things like, do

you have the data, do you have data silos, how you're

going to break those silos, what kind of team are you

going to hire or who are you going to approach about

these projects? What kind of tools are you going to be

using for these projects, but also you need to include

this whole productionization of the models.

Michael Segala: 100% yep. Absolutely.

Kirill Eremenko: All right, let's shift gears a bit. That was an awesome

intro and like awesome overview of the world of data

science consulting and just in general data science

projects. Let's talk about some case studies. Last time

you shared three incredible case studies on the show.

In fact they had multiple components. I would say

even more than three case studies. Do you have any

new exciting things that you've been working on for

the past one and a half years that you can share with

us?

Michael Segala: I can and I should have remembered which ones I

shared, but I'll pick three and they'll probably be

different, and if I repeat myself and you remember, just tell

me.

Kirill Eremenko: What you shared, first one was on cleaning

unstructured data with NLP pipelines. Then second

one was deep learning to detect cancer. And also we

talked about growing organs with deep learning. And

case study number three was gaining an advantage in

sports betting using machine learning.

Michael Segala: Fair enough. All right, let's actually, let's do a couple of

different ones as well. I like to always go back to

medical imaging. I remember that's what I talked

about last time. We've been working for about the past

year or so and I'll give you three again, just kind of

three or four random ones pretty quickly. We've been

working for about the past, I'll call it a year, a year and

a half with a client who is kind of bleeding edge from a

medical imaging perspective. And medical imaging is

extremely important for lots of different reasons. Let's

take a step back and think about why we care about

automation of medical imaging. Right now you go and

you get an MRI, you get a CT scan, you get a pathology

reading and basically what we're doing, we're detecting

cancer, we're detecting breaks, we're detecting

whatever it happens to be.

Michael Segala: There's this kind of coolness factor of can I use an

algorithm to predict probabilistically is this a tumor

and can I do that at a rate that is more accurate than

a radiologist. That's kind of the cool factor and

sure, right? We're getting to the point where we can do

that and we're getting to the point where FDA clears it.

But what's really interesting and why we really want to

do it is for two reasons.

Michael Segala: The first reason is reducing variability within the

medical profession, because right now, if I had an MRI

and I gave that to a doctor to predict or for them to tell

me if I have a cancer they technically will disagree with

a group of radiologists and they'll even disagree with

themselves at a pretty large fraction of a clip.

Michael Segala: If we design a system that is unanimous and reduces

that variance, we're now getting to the point where we

can give care to a population in a very unbiased way,

it's a pretty significant kind of implication. The second

implication is, this actually takes doctors lots of time

to do, this could take minutes to hours of their time,

that is not spent with patients.

Michael Segala: Now you're kind of giving them back all of this time

where they can go and do what's really important,

which is seeing and talking with patients. That's really

why we want to do medical imaging and why it's such

a popular field within deep learning and data science.

And I won't go on with this along with all of them, I

really like medical imaging for lots of reasons.

Michael Segala: What we're doing with this medical imaging project is

we have the world's largest collection of 3D CT and

MRI brain scans looking for different cancers within

the sinus cavities. I think it's like 51 different tumor

types that can just establish within your seven

different cavity regions within the brain or within the

face. What we've done there is amassed large amounts

of data, paid, well, our client has paid lots and lots of

money for doctors to label it. And we've built extremely

sophisticated algorithms to detect very, very small

signatures of malignant like tumor cells within these

3D images.

Michael Segala: That's the first one, and that's been going on for a

while. Extremely successful. Kind of has shown them

to have accuracies, I can't really say the accuracy

numbers, but far exceeding what they would need to

be to get real kind of clinical validation. Very very

interesting, very profound. If we think about the

implications, so that's kind of the first one.

Kirill Eremenko: Quick question. What do you mean 3D images? Is it

like multiple layers of MRI scans?

Michael Segala: Well, an MRI is a 3D, it's not a single 2D plane. You

actually have a stack of like 128 2D images that make up

one 3D image. You have to look across the X, Y, and

the Z plane. And obviously within that Z dimension,

you can have, that's where a tumor might be

embedded within two or three of the actual slices. It's

a very complex problem, because now you've taken a

data set and for every image you basically multiplied it

by a factor of a hundred. Just think of the size of these

datasets and the complexity of the algorithms that have

to happen.
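
To make the "stack of 2D slices" idea concrete, here is a minimal sketch in plain Python. The dimensions (256 by 256 pixels per slice, a toy 8-slice volume) and the helper names are assumptions for illustration only; the transcript mentions only "like 128" slices, and nothing here reflects the actual project code.

```python
# Toy illustration only: real CT/MRI volumes are far larger than this.

def build_volume(slices, height, width, fill=0.0):
    """Return a volume indexed as volume[z][y][x], filled with `fill`."""
    return [[[fill] * width for _ in range(height)] for _ in range(slices)]


def voxel_count(slices, height, width):
    """Number of values a model has to look at for one scan."""
    return slices * height * width


def slices_with_signal(volume, threshold):
    """Return the z indices of slices containing any value above threshold."""
    return [
        z
        for z, plane in enumerate(volume)
        if any(value > threshold for row in plane for value in row)
    ]


# A hypothetical small volume with a bright spot spanning three adjacent slices.
volume = build_volume(slices=8, height=4, width=4)
for z in (3, 4, 5):
    volume[z][1][2] = 1.0

lesion_slices = slices_with_signal(volume, threshold=0.5)
print(lesion_slices)

# Assumed slice size of 256 x 256: a 128-slice scan holds 128 times
# as many values as a single 2D image of the same size.
full_scan = voxel_count(128, 256, 256)
single_slice = voxel_count(1, 256, 256)
print(full_scan // single_slice)
```

The bright spot only shows up as one connected object when the slices are read along the Z dimension together, which is why a purely slice-by-slice 2D view can miss it.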

Kirill Eremenko: Yeah. Can imagine. If it's possible for you to share

what kind of algorithms or even branches of machine

learning or other areas of AI did you guys use for this?

Michael Segala: This is all deep learning, this is all computer vision

and I just want to make a point here because this is a

great question. You cannot take an off the shelf VGG

16 or 19 or whatever they have out now and do

transfer learning and expect to get a medically viable

algorithm. The stuff that people play with is great from

an education standpoint, if you do it on Kaggle sure,

that's fun. But if you really want to be serious about

solving these problems, you're really starting from

scratch and designing from a research perspective

these algorithms in extremely deep networks, very

complex systems, and you'd better have access to lots

of really big and powerful GPUs.

Michael Segala: We write all of this from scratch in pure TensorFlow,

because [inaudible 00:31:52] is way too restrictive and

they just go to town and just really, these take a long

time to do. It's all very custom kind of convolutional

networks and stuff like that. And you do lots of

cleaning and pre-processing and post-processing that,

just go on and on to get the accuracies up and up.

Kirill Eremenko: Gotcha. How'd you guys choose TensorFlow over

PyTorch?

Michael Segala: I mean the team does for whatever reason. Sometimes

the client demands it, sometimes for whatever reasons,

our team chooses it. For this client specifically, I don't

remember why the choice was made but for us, I

mean, it's not a one or the other. It's whatever best fits

that very specific situation. For this maybe it was,

TensorFlow was better for these 3D images over

PyTorch, but I'm literally making that up. I don't know

why that specific choice was made, but for this client it

was made, I'm sure for a very specific reason.

Kirill Eremenko: Wow. So many questions.

Michael Segala: Sure.

Kirill Eremenko: With deep learning, very interesting. First one would

probably be one of the main parts of deep learning is

architecting the neural network, finding out or

experimenting with how many hidden layers you have,

how many neurons in those layers and things like

that. Do you guys have any approaches that you have

developed in SFL Scientific over the years on what's

the most efficient way to experiment with neural

network architectures to get to the end result faster or

is it completely dependent on the project and it's a

creative component that people, that you rely on your

team to execute.

Michael Segala: I mean it's a little bit of both. It's a lot of experience

and a little bit of creativity. And now I'm speaking for

an area where my team would be much better suited

to speak on than I will but I'll pretend to know a lot

more than I really do. You have to realize that we've

worked in these kinds of medical imaging problems for

years, from a kind of all the way from our graduate

background for the past several years and a lot of our

folks have been working on problems like this for 10 or

20 years. We know computer vision and deep

learning in the medical space very well. We happen to

have a pretty good understanding of how to build

architectures around understanding and segmenting

and classifying DICOM, like CT or MRI images. And we

know kind of the computing power, we know the size

of the data. We can calculate the number of neurons

to say, "Hey, I need to show incrementally that we're

getting better and better accuracies."

Michael Segala: Because you don't start by throwing the kitchen sink

at the problem. You start small and you start quick to

kind of iteratively show that you can make progress.

Design a network that you can do in a couple of hours

and then show it works. Now a couple more hours or a

couple of days or a couple of weeks. You're always

building on that, intentionally moving in a kind of

structured way. It is obviously just knowing some stuff

and then being smart around selecting and kind of fine

tuning your network and growing that as a function of

your accuracy demanding it. Not a great answer, but

it's my answer.

Kirill Eremenko: No, no. I like what you said about starting small, I

think that's important because maybe somebody might

be working on a project and they get an accuracy rate

with a certain architecture of, I don't know, like 60%

and that really is discouraging to them. And they

completely change the approach, they abandon that

first idea that they had and they try something

completely different. But what I'm getting from what

you're saying is that, okay you got to 60%, see if you

can get that to 70%. Can you adjust it rather than

completely abandoning it.

Kirill Eremenko: You might've had a great idea at the start, see if you

can adjust it and increase, increase, increase and get

to that end goal. The point is not to hit the bull's eye

right away, but just like keep throwing the darts until

you get closer and closer and closer. And you finally

hit the bull's eye.

Michael Segala: There's two things. Great data scientists are great

problem solvers, hands down. Being thoughtful about

why things aren't converging or what can be improved

on and then second to your kind of number of 60%. I

challenge a lot of our folks and a lot of our clients,

when we start throwing out numbers like 60%, 70%,

80%, I'll always say, "Well what's good, is 80%

accurate on detecting cancer good?" And it actually

invokes a lot of thought and like what is an actual

good accuracy and what would you do if it was 80% or

60% or 99%. When you're a data scientist and you're

sitting there and you're building these algorithms and

you're getting your accuracy numbers, you really need

to think about, well, what is needed for the business

and what do these accuracy values actually

correspond to in terms of an outcome and what level

do I really need to achieve?

Michael Segala: It's not this kind of playground science laboratory.

You're doing this for a business, for a real purpose so

figure out that purpose then work backwards in terms

of what your accuracy needs to get to. I think that's

such a critical point that most folks just ignore.
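
As a small worked example of why a bare accuracy number says little on its own, the sketch below uses made-up figures (1,000 scans, 20 of them positive) that are assumptions, not numbers from the episode. A model that never flags anything still scores very high on accuracy while finding none of the positive cases.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true positives that the model actually caught."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_positives = sum(y_true)
    return true_positives / actual_positives if actual_positives else 0.0


# Hypothetical screening population: 20 positive scans out of 1,000.
y_true = [1] * 20 + [0] * 980

# A useless "model" that predicts negative for every scan.
always_negative = [0] * 1000

lazy_accuracy = accuracy(y_true, always_negative)
lazy_recall = recall(y_true, always_negative)
print(f"accuracy={lazy_accuracy:.2f}, recall={lazy_recall:.2f}")
```

Which metric matters, and what level it has to reach, depends on the business or clinical outcome being targeted, which is the point being made above.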

Kirill Eremenko: Okay. Totally, totally agree. Thank you for that. That

was case study number one, medical imaging.

Michael Segala: All right, let's see. I have another great case study. I

hope I don't get in trouble for this one. We'll see. I'm

going to be very, very light with the details. We do

some work within the federal government. One of them

happens to be with a client that develops in airports

the baggage screening stuff that you walk through.

Stuff that you physically put your baggage through

and then stuff that your baggage that you check in

goes underneath and goes through. Those are actually

just large CT scans. They're large CT images. And what

happens is as your bag is going through, like you

know, you go through the airport security, you're

sitting there, it takes a second and then you have a

screener, a TSA agent sitting there and they say, "Hey,

I see an interesting object." It could be a knife, it could

be a gun or they're looking for other objects like

explosives and things of that nature.

Michael Segala: You could imagine that these machines might have

some interesting algorithms built into them. And you

can imagine-

Kirill Eremenko: You'd hope so.

Michael Segala: You could imagine even further that nowadays we

would probably want to enhance those algorithms by

using like a deep learning solution or really innovative

solutions. If you imagined all those things, the TSA

probably works with consulting companies that

design and develop these types of algorithms for

folks. We may be one of those companies doing some

really interesting work around detection for the TSA.

Kirill Eremenko: Or maybe not.

Michael Segala: Or maybe not, I don't know. That could be use case

two, but we won't, I don't know how much I'll get in

trouble for that one so we'll skip that one for today.

Kirill Eremenko: Sounds good.

Michael Segala: But very similar. It's object detection, it's

segmentation, it's classification around really

interesting images. And that image could be anything,

it could be, go ahead.

Kirill Eremenko: I was just going to say that, it just shows that your

existing expertise in the medical space with imaging is

very transferable to other industries such as scanning

baggage.

Michael Segala: Yup. Absolutely. Other types of use cases, we're seeing

a lot in these very traditional industries like

manufacturing, retail, consumer goods, where they

have lots of logistical and supply chain problems. This

one's not a real sexy one, but it's something that we

see a tremendous amount of potential lift for.

Increasing logistics and supply chain is an area where

there's a lot of hot press happening at the moment.

Michael Segala: If you're a beverage company, if you're a company that

sells lots of jeans or whatever you happen to do and

you're selling tens of thousands if not hundreds of

thousands of these products, the question becomes

very simple, how can I use an AI solution? Whatever

that means. Some machine learning or deep learning

to actually allocate merchandise in a much more

optimized way. We have a few different clients and a

lot of these big industries that ship, talking hundreds

or millions of individual items every day, every week,

every month that they want to be able to dynamically

understand, how do I ship them? How do I become

better about not wasting material? How do I increase

my bottom line by just doing that in a more optimal

way. We've seen very recently a lot of these industries

looking out because they're seeing what machine

learning can do over their very traditional kind of rule

based forecasting methods just to enhance these

operations.

Michael Segala: A lot of our use cases just literally in the last couple

months have been around that supply chain and

logistics. If you're somebody who's looking at

interesting problems, I think that most big companies,

most Fortune companies or even smaller mid

market, they all have very similar types of use cases

around this space. Forecasting, supply chain

manufacturing where you can do a lot of interesting

stuff. That's kind of a not really a single use case, but

lots of use cases baked into one there. A lot of real

great value there, very different, very time series like

data and things of that nature. And what you can start

doing there is coupling from a predictive side if you're

also doing supply chain, you have things that are

failing. Machines are failing, equipment is failing. And

the question becomes in that same supply chain when

I'm doing my forecast, can I also understand failure

events, right? Predictive maintenance and whatever it

happens to be on those same machines.

Michael Segala: We're seeing these companies starting to collect and

analyze all this information, wanting to predict when

their machines are going to fail. How often do we need

to take them offline? How will that affect their

shipping? How will that affect their logistics? And kind

of solving two problems at once, better forecasts,

plus being able to augment and not necessarily need

to fix machines before they break, kind of fix them

beforehand. It's kind of two things boiled into one. But

you could potentially do it all together. It's kind of the

second use case we've been seeing a lot of recently.

Kirill Eremenko: Gotcha. On that, with logistics, I see there's lots of

components where data science can

be applied to solve challenges. But some of the two

challenges that I'm quite familiar with are bottlenecks

in logistics, like where is the bottleneck? Is it at the

factory? Is it at the pickup location? Is it through the

route? Is it at the end? And the other one would be

optimizing routes. Things like the traveling salesman

problem. How do you get to as many destinations?

Like if you're delivering the milk to different stores, how

do you get to them in the most efficient way

possible? Could you give us some examples of, what

kind of algorithms would you use in maybe these two

problems, bottlenecks and optimizing routes or maybe

other problems and challenges in logistics that you

guys have worked with before.

Michael Segala: Yeah. For instance to kind of answer your first

question first, lots of different algorithms could be

used here. Folks are starting to experiment and get a

lot of success with these reinforcement algorithms,

whatever you want to call them, deep reinforcement or

regular reinforcement learning algorithms to do these

kind of difficult optimization problems, right? Because

that's what it is, it's an optimization. If you kind of

take a step back, a lot of these traditional Markov

models or Monte Carlo simulation, very similar. You

have a very complex dynamical system. How do you

optimize across this entire system. It's not necessarily

the same as just a single kind of prediction variable,

but now you're doing it in a very complex manner.

Michael Segala: We're seeing a lot of interest in movement, especially

in some of our use cases in that manner of kind of

using some of these [inaudible 00:44:17] bleeding edge

methodologies. Others, if you can still turn into a

traditional machine learning algorithm, if you can

predict something, either binary or categorical or

forecasting, you can use whatever traditional ML

algorithm you want, or a deep learning algorithm. This

is a problem that lends itself to lots of different

opportunities, optimizations is a different class of

problems, you can even use like genetic algorithms or

things of that nature. Lots of interesting stuff there.

Michael Segala: And then your second question was more specifically-

Kirill Eremenko: Sorry on that one. Do you guys have any approaches

like genetic algorithms or reinforcement algorithms,

deep reinforcement learning or machine learning, do

you have any of those that have shown to be the most

useful, the easiest for you guys to deploy or the

quickest win for your clients? Any comments on that?

Michael Segala: Oh yeah. It always depends, it depends on the problem

statement, it depends on the complexity of the

problem. Quickest wins are always going to be the

easiest algorithms, if you can map anything into a very

simple machine learning algorithm with a prediction

variable, you're going forward with some features.

Yeah, I mean you could do that pretty easily. If you

need to dynamically optimize a really complex system

and you need to go to like a deep reinforcement

algorithm. Yeah. I mean that's going to take a lot more

time and the amount of lift you might get there. It might

be extremely incremental.

Michael Segala: Again, right, there's a trade off in all of these things. I

always advocate very heavily for determining a

baseline as fast as possible, literally whatever the

fastest path to getting a number out to set a baseline,

do it, then start experimenting and making it more

complex.

Michael Segala: Whatever that means for this problem, start with an

ML model, then go to a deep learning model, then go to

a genetic algorithm, whatever it happens to be for your

problem. Always kind of think of it in a very

incremental fashion in complexity. That's at least my

opinion.
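
The baseline-first advice can be sketched for a forecasting problem as follows. The weekly demand numbers, the function names, and the choice of mean absolute error are all assumptions for illustration. The sketch scores the fastest possible baseline (repeat the last observed value) and then one slightly more complex candidate (a 3-period moving average) on the same targets.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def naive_forecast(series, start):
    """Predict each value from index `start` onward as the previous value."""
    return [series[i - 1] for i in range(start, len(series))]


def moving_average_forecast(series, window):
    """Predict each value from index `window` onward as the mean of the prior window."""
    return [
        sum(series[i - window:i]) / window
        for i in range(window, len(series))
    ]


# Made-up weekly demand for a single product.
demand = [100, 102, 98, 105, 110, 108, 115, 120, 118, 125]
window = 3
targets = demand[window:]

baseline_mae = mean_absolute_error(targets, naive_forecast(demand, window))
candidate_mae = mean_absolute_error(targets, moving_average_forecast(demand, window))

print(f"naive baseline MAE: {baseline_mae:.2f}")
print(f"moving average MAE: {candidate_mae:.2f}")
```

On this trending toy series the naive baseline actually comes out ahead of the moving average, which is the kind of thing a quick baseline reveals before time is spent on a more complex model.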

Kirill Eremenko: Love it.

Michael Segala: I think the best way to approach the problem.

Kirill Eremenko: Gotcha. Love it. Love the establish a best baseline as

fast as possible. I think that's golden advice for data

scientists out there.

Michael Segala: And then what was your second question?

Kirill Eremenko: The different types of problems I think, bottlenecks,

optimizing routes, maybe if you had some more to add

to those.

Michael Segala: Yeah. For instance, one of the problem statements that

we're currently working on, that's of a lot of interest, is in

clinical trials. Clinical trials is actually a very

complicated problem because you're a big Pharma

company and you need to run a trial for your

medication against a pretty diverse and large

population of people. Think of something simple like

Tylenol. You're not running clinical trials on Tylenol,

but if you were, you'd have a bunch of Tylenol, you

find a bunch of people at a bunch of medical sites and

you'd ship them some Tylenol or they would take it

and you would monitor how they interact with the

drug and things of that nature. That's how clinical

trials work.

Kirill Eremenko: What is Tylenol? Sorry, not familiar with the US term.

Michael Segala: Oh geez. Just helps your headache. It's a, what is it, a

acetaminophen whatever that word is.

Kirill Eremenko: So it's for headaches?

Michael Segala: It's for headaches yeah. It's been around for a hundred

years. But that was just kind of a silly example.

Kirill Eremenko: In Australia we have Panadol.

Michael Segala: Sure. I don't know what that is but sure. For clinical

trials you have this problem of needing to send out

things to the medical facilities for them to be able to

run their clinical trials, collect data, collect vital

information, collect even blood or other types of,

whatever you're collecting directly from the patient and

do so in a very complex manner. Because you have

patients in different countries, different ages, races,

genetic profiles, whatever it happens to be. You're

sending, you're shipping, you're receiving these things

that are all highly perishable. It has to happen in a

very kind of dynamic environment.

Michael Segala: For them, this logistic problem with bottlenecks in

things that are highly perishable that are shipping all

over the globe is a huge problem. And it's this very

dynamic system that we're applying these exact type of

algorithms that we've been talking about. Clinical

trials is a good one, but everything's the same. If

you're shipping a pair of jeans, a bottle of coke or a lab

kit for a clinical trial, the methodology is very similar

and the way to attack the problem is very similar. It's

just kind of that end use case. What you call it is a

little bit different.

Kirill Eremenko: Okay. Gotcha. All right, well thank you very much.

Logistics, amazing case study. Do you have one more

for us? I think we have time for one more.

Michael Segala: Jeez, how about this. What do you care about? I'll give

you a use case on something.

Kirill Eremenko: Ooh, good one. Let's do energy, the energy space. Do

you have any of those?

Michael Segala: Of course. I have some on energy. Let's see. I have a

few on energy. What could be interesting? I'll give you

two quick ones in energy.

Kirill Eremenko: Sounds good.

Michael Segala: One quick one is a lot of people, if you have a meter, I

don't know how you guys do it, but I assume you have

an energy meter sitting outside of your house, and

that energy meter is basically collecting information on

how much energy you used. It turns out that two types

of people tend to want to screw with that energy meter.

One is people from, sometimes, not as affluent

communities who don't necessarily want to pay the bill

or affluent communities who don't want to pay the bill.

It could be here. Or drug dealers who are using an

absurd amount of energy and that at that peak some

kind of alarm, but they need to hide that.

Michael Segala: What you can do is you actually take a magnet to the

outside. I've never done it, but I've been told, you can

take a magnet to the outside and you can actually

trick the smart meter or whatever it is from the

reading. And it shows much less consumption than

what you're actually having. We did some work with a

large energy company out in the UK that was running

into a lot of these problems, people were literally

putting magnets on their meters outside, fooling the

system and they were people fooling it because they

didn't want to pay the bill or drug dealers. And the

question was, can you take all of that time series data,

'cause it's very temporal time series data and look for

patterns that would be anomalous that they think

corresponds to somebody kind of adding these

fraudulent activities.

Michael Segala: We were given a pretty large set of data, but a very,

very small set of labeled data. Literally only like a few

10 or 20 labeled cases of these anomalies. We attacked

the problem a couple of different ways, both in a

supervised and unsupervised manner. We did a lot of

different things, could be really thoughtful about it

and we were able actually to show, you can spot these

anomalies and you can really see when people are

gaming the system from an energy perspective. That's

kind of a one quick use case-

Kirill Eremenko: Just quickly on that, that's very interesting because

you had only 10 to 20 examples like in a spreadsheet

or like in a database with millions of rows of negative

results. You had 10 to 20 positives. How'd you deal

with situations like that? What is your advice to data

scientists out there? How do you attack a problem

where you only have under 20 examples of what is a

positive outcome that you are actually trying to

identify?

Michael Segala: This happens a lot of times. We actually fool ourselves

that big data is the challenge. The actual challenge is

small data, big data's not a challenge, It's "Ah, okay,

we have big data." We get into a lot of these cases

where you have very small labeled positive examples.

You have to be very thoughtful about it, you could

theoretically create fake data sets to encapsulate very

similar behavior right through that kind of same

simulation in modeling. You could do that or you can

start attacking the problem because you could treat

this as anomaly detection, pretending you didn't know

any labeled data. Can you actually spot anomalies?

And you have 10 or 20 in your back pocket to think

about, or you turn it into a supervised learning

problem with a very, very small holdout set. And find

and experiment with it.

Michael Segala: There's lots of different scenarios, but again it's really

about being a problem solver and thinking about, can

you do something that's convincing enough to yourself

from a technology standpoint that it's working and can

you make a business case that it should be

implemented? There's lots of different ways to solve a

problem, but you have to do it in a kind of systematic

way and be thoughtful about it.
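
One way to picture the "pretend you have no labels, then check against the handful you do have" approach is the sketch below. Everything in it is an assumption for illustration: the meter IDs, the made-up feature (this week's consumption divided by the meter's own historical average), the robust z-score technique, and the cutoff of -3.5. It is not the method used on the project described above.

```python
import statistics


def robust_z_scores(values):
    """Score each value by its distance from the median, scaled by the MAD."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return [0.0 for _ in values]
    # 0.6745 rescales the MAD so scores are comparable to ordinary z-scores.
    return [0.6745 * (v - median) / mad for v in values]


# Hypothetical feature per meter: recent consumption / historical average.
consumption_ratio = {
    "m01": 1.02, "m02": 0.97, "m03": 1.05, "m04": 0.99, "m05": 1.01,
    "m06": 0.95, "m07": 1.03, "m08": 0.98, "m09": 1.00, "m10": 1.04,
    "m11": 0.96, "m12": 1.01, "m13": 0.30, "m14": 0.25,
}

# The few labeled tampering cases kept "in the back pocket" for checking.
known_tampered = {"m13", "m14"}

meter_ids = list(consumption_ratio)
scores = robust_z_scores([consumption_ratio[m] for m in meter_ids])

# Unsupervised step: flag unusually large drops without using any labels.
flagged = {m for m, s in zip(meter_ids, scores) if s < -3.5}

# Sanity check: how many of the known cases did the unsupervised step catch?
caught = len(flagged & known_tampered) / len(known_tampered)
print(sorted(flagged), caught)
```

With only a handful of labels, a check like this is evidence rather than proof, so it is worth pairing with a review of the flagged cases by someone who knows the domain.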

Kirill Eremenko: That's so cool. I love your three examples, just to recap

on those, create fake data sets, anomaly detection so

pretend you don't even have those 20 and see what the

algorithm will do, completely unsupervised or

supervised learning with a very small holdout dataset.

I'll probably just add to that, that it's also important to

talk with the clients. And correct me if I'm wrong here,

but I think it's important to talk to a client and

understand how important for them are false positives

and false negatives.

Kirill Eremenko: In your case, in this case of an energy company, is it

really bad if you have a high rate of false positives,

would they prefer high rate of false positives or high

rate of false negatives? If you identify more cases

where people are allegedly trying to trick the meter,

how difficult is it for them to ask the electrician next

time they go out outside to check if there's a magnet

on the box or not?

Kirill Eremenko: Based on that conversation with a client, you can fine

tune your algorithm to either output more findings in

terms of like these anomalies or less. And in some

cases it might be, I don't think in the case of energy it

would be as bad as, say, in the case of medicine where

a false positive can actually change somebody's life.

Michael Segala: Absolutely. You're absolutely right. And what you said,

it's the real critical part in what the real kind of

mindset needs to be is how do you tweak this

algorithm to actually fit what we care about capturing

and what does that cost. Because the question is really

what would that cost for the electrician to go back and

report it? And then how would they report that? And

where's that data stored? You get into this kind of

cascading effects of what your algorithm actually

mandates to the business to actually have to

implement. It's not a trivial problem. That's actually

where the real ingenuity and kind of problem solving

comes in and kind of tweaking that outcome to

actually be effective. You're 100% on there.
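
The trade-off between false positives and false negatives can be sketched as a cost calculation. The scores, labels, and costs below are invented for illustration (for example, 50 for sending an electrician to check a meter that turns out to be fine, 500 for missing a tampered one). The idea is simply to pick the alert threshold that minimizes total cost under whatever costs the client agrees to.

```python
def total_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Total cost of flagging every case whose score is at or above threshold."""
    cost = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 0:
            cost += cost_fp  # unnecessary check
        elif not flagged and label == 1:
            cost += cost_fn  # missed case
    return cost


def best_threshold(scores, labels, cost_fp, cost_fn):
    """Return (threshold, cost) for the cheapest threshold among observed scores."""
    candidates = sorted(set(scores))
    return min(
        ((t, total_cost(scores, labels, t, cost_fp, cost_fn)) for t in candidates),
        key=lambda pair: (pair[1], pair[0]),
    )


# Hypothetical model scores and true labels (1 = tampering confirmed).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

# Cheap checks, expensive misses: the best threshold is lower (more alerts).
cheap_checks = best_threshold(scores, labels, cost_fp=50, cost_fn=500)

# Expensive checks, cheap misses: the best threshold is higher (fewer alerts).
costly_checks = best_threshold(scores, labels, cost_fp=500, cost_fn=50)

print(cheap_checks, costly_checks)
```

The same model produces different operating points depending on the costs, which is why the conversation with the client has to happen before the threshold is fixed.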

Kirill Eremenko: Gotcha. Okay. That was the first one on energy. And

second one?

Michael Segala: Second one. Let's see. We have a few. The other one we

were doing, kind of similar, a little bit different. This is

energy as related to internal devices in the home. And

the question for them is, if I had all of the kind of time

series data of the meter coming in, can I understand

which appliance that data is coming from? There's this

concept of energy disaggregation. Meaning, if I only

gave one overall signal, can I see what came from the

refrigerator or the microwave or the TV or whatever

else it happens to be.

Michael Segala: Again, it's a very interesting class of algorithms where

you can kind of look at consumption patterns and

then kind of detangle them in terms of understanding

exactly where your consumption comes from and why

you'd want to do that is because you would be able to

show, "Hey, your appliances over here are causing

80% of your bill. Get something more efficient or

unplug it or do something of that nature." It's this

really kind of personalization that is happening,

especially within these big energy companies that want

to kind of get consumer buy in and kind of always

have consumers coming back and never leaving them

is showing them these kind of innovative solutions

towards some of their energy bills and outputs and

things of that nature, especially as we become a

greener and greener society.

Michael Segala: This was a very interesting one and actually showed

extremely promising results as well, that that

company is using.

Kirill Eremenko: How do you go about a problem like that? How did you

disaggregate the components of a signal?

Michael Segala: This was actually a while ago, if I remember correctly,

and I can be completely wrong here, I think we had a

pretty small training set as well of, they had a couple

of houses or dozen houses or a hundred houses, I

don't remember at this point, that actually had smart

meters plugged into all of the devices. You were able to

see a real training set of, here's the total consumption

and here's what all the devices were, which was fine,

you can show that. Now the question becomes on a

very new house, does that algorithm actually transfer

over and is it generalized?
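
The generalization question raised here can be sketched as a leave-one-house-out evaluation: fit on the sub-metered houses, then score on a house the model has never seen. Everything below is an assumption for illustration: the house names, the sample values, and the deliberately naive model that predicts the refrigerator's draw as a fixed share of total consumption.

```python
from statistics import mean

# Hypothetical sub-metered data: per house, (total_watts, fridge_watts) samples.
HOUSES = {
    "house_a": [(1000, 150), (400, 140), (2000, 160)],
    "house_b": [(900, 120), (300, 130), (1500, 125)],
    "house_c": [(1200, 200), (500, 210), (2500, 190)],
}


def fit_share(train_samples):
    """Naive model: the fridge's share of total consumption in the training data."""
    total = sum(t for t, _ in train_samples)
    fridge = sum(f for _, f in train_samples)
    return fridge / total


def leave_one_house_out(houses):
    """Mean absolute error in watts on each house when it is held out of training."""
    errors = {}
    for held_out, test_samples in houses.items():
        # Train only on the other houses so the held-out house is truly unseen.
        train = [
            sample
            for name, samples in houses.items()
            if name != held_out
            for sample in samples
        ]
        share = fit_share(train)
        errors[held_out] = mean(abs(f - share * t) for t, f in test_samples)
    return errors


errors = leave_one_house_out(HOUSES)
for house, error in errors.items():
    print(house, round(error, 1))
```

Splitting by house rather than by individual sample is the point of the sketch: a random split would leak each house's habits into training and overstate how well the model transfers.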

Michael Segala: That's really the big question and I think we used two

different approaches. The first being a lot of these

Markov models, hidden Markov models I believe had

worked really well for this case. This was maybe about

two years ago when deep learning was still kind of in

its infancy, not really infancy, but really being used,

especially for time series. I think we started playing

around with some deep learning in the time series space

there as well. And that was showing some really nice

progress, but we were able to achieve what they

wanted to in those kinds of Markov models and they

kind of took that and ran with it. If I remember

correctly, that's how we attacked that problem back

then.
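
As a rough illustration of how a hidden Markov model applies to this kind of signal, the sketch below decodes the most likely on/off sequence of a single appliance from noisy power readings using the Viterbi algorithm. It is not the model from the project described above: the two states, the transition probabilities, the Gaussian emission parameters and the readings are all assumed values, and real disaggregation work usually models several appliances at once.

```python
import math

STATES = ("off", "on")
START_P = {"off": 0.5, "on": 0.5}
# Assumed "sticky" transitions: an appliance tends to stay in its current state.
TRANS_P = {"off": {"off": 0.9, "on": 0.1}, "on": {"off": 0.1, "on": 0.9}}
# Assumed Gaussian emissions: mean power draw in watts per state, shared std dev.
EMIT_MEAN = {"off": 0.0, "on": 150.0}
EMIT_STD = 30.0


def log_gaussian(x, mean, std):
    """Log density of a normal distribution at x."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))


def viterbi(readings):
    """Most likely hidden state sequence for the readings, computed in log space."""
    scores = {
        s: math.log(START_P[s]) + log_gaussian(readings[0], EMIT_MEAN[s], EMIT_STD)
        for s in STATES
    }
    backpointers = []
    for x in readings[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best previous state from which to arrive in state s.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS_P[p][s]))
            new_scores[s] = (
                scores[prev]
                + math.log(TRANS_P[prev][s])
                + log_gaussian(x, EMIT_MEAN[s], EMIT_STD)
            )
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical readings in watts; the dip to 70 is noise while the appliance is on.
readings = [3, 8, 145, 160, 70, 152, 4, 1]
path = viterbi(readings)
print(path)
```

The reading of 70 W is slightly closer to the "off" mean than to the "on" mean, yet the decoded path keeps the appliance on, because two state switches cost more than one poorly explained reading. That smoothing over time is what a per-reading threshold lacks.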

Kirill Eremenko: Okay. Well, Mike thank you so much for showing

those case studies. Amazing medical imaging, logistics,

energy case studies. If our listeners want to ... If you

guys want to check out more case studies as I

mentioned at the start, head on over to

sflscientific.com and they have a tab there called

solutions or the other one is Our work and you can

read quite a bit about different use cases in different

industries.

Kirill Eremenko: Before we finish up, 'cause we are slowly getting to the

end of this super exciting podcast which could

probably go on for a few more hours. But before we

finish up, I wanted to ask you a question, a

more philosophical question I like to ask guests

sometimes and that is, from where you stand and from

all these projects and clients and industries and

approaches and employees, you've seen people, you've

seen the data science. Where do you think the field of

data science is going and what do our listeners need to

look out for to prepare for the future that's coming

ahead?

Michael Segala: That's a tough question. Are you asking that as

somebody who wants to get into the data science

space, as a data scientist? Or are you asking that in

terms of where do I think industries are going?

Kirill Eremenko: Ooh, that's a good question, how about we do both.

Michael Segala: 'Cause those are very different conversations.

Kirill Eremenko: How about we do both, what's your view on both of

those?

Michael Segala: All right. Both of them will go quick because I don't, we

could talk for a long time. Sometimes I get too

talkative. Let's start easy, data scientists. And we see

this a lot, I'll have an open REQ and by the way, we

have lots of open REQs if somebody wants a job, come

and talk. But we see more and more people wanting to

become data scientists transitioning into this space.

There's a lot of great potential, money being invested

and people honing their skills with courses like the

ones you teach, people going to conferences like the

ones that you guys give, a lot of great mind share,

knowledge share and things of this nature, which are

so much easier than when I started about six years

ago. I think that's gonna continue to happen.

Michael Segala: However, I think algorithms themselves [inaudible

01:00:22] come and already are kind of becoming very

commodity. Everybody nowadays can fire up XGBoost

and run something, that doesn't make you a good data

scientist, that makes you a commodity in your

job. I think data science is going to start to become a

wider role that is going to be, as we're talking about

here, it's really a problem solver. How do you take a

business problem and solve it with data? That's really

the big question here. And unless you're capable of

thinking about the larger problem and the impact that

it has on the business and how you're actually going to

take that algorithm and actually allow your business

to generate revenue or cut costs, you're probably not

going to be a very successful data scientist, especially

as these tools become more and more efficient and will

start to automate some of your job away.

Michael Segala: I really think the trend in our industry will also be to

automate out some of our own data scientists who are

doing just kind of very routine type of work. But the

ones that survive and do a great job I think are going

to be probably one of the most critical folks within the

company by far. That's really how I see that transition

happening. And I actually don't think that, that's far

away, I think within the next 12 to 24 months. Maybe

the next time we talk on one of these, we'll start to see

that already.

Michael Segala: In terms of, and let me know if that didn't answer the

question-

Kirill Eremenko: That totally makes sense. I just want to add here that

from my experience 'cause I ... For listeners who don't

know, I worked at Deloitte for two years in the data

science division and what I can definitely say and

probably you've gathered from this conversation we're

having here with Mike that being in consulting really

helps with that, becoming a problem solver,

understanding how to not just like do a cool project or

a cool algorithm but think of the business as a whole.

Kirill Eremenko: If you are looking for a job, I just want to reiterate

Mike's call, give Mike a shout out and/or contact Mike

on LinkedIn or somewhere else and chat to him

because a few years in consulting really puts the

whole game of data science into a different

perspective. Not to say that you can't get there on your

own without consulting, if you're in an industry that's

totally fine as well. Just from experience, I know that

consulting is a great way to get to that type of mindset.

Michael Segala: Yeah. I tell my new employees, within 12 months

they'll probably have more project depth and skills

than somebody who sits in a single kind of vertical for

10 years. Just the breadth of project and the depth

that we get to get into extremely quick. It's exciting.

But it's hard, it's not stagnant and you're always kind

of thinking and moving on your feet. It's not for

everybody, but I love it.

Michael Segala: In terms of businesses, I think we're really at this

critical junction in terms of where data science will go.

We see industry starting to invest for sure. They invest

kind of small pockets of money on a few small

initiatives, the big companies that make the media

hype, the Apples, the Googles, the Airbnbs, those

aren't even relevant, those are the outliers, the

anomalies.

Michael Segala: I'm talking about the other 99% of the market. And we

know, we work with so many of them, they see that

there's a lot of interest out there. There's a lot of

innovation happening and there's a lot of hype and

potential. They're starting to make strategic bets into

this space by funding a couple POCs, proof of

concepts, hiring a few individuals or a larger team

depending on the organization. But we're really at that

critical point where now in the beginning of 2019 over

the next kind of nine months, a lot of folks have

budgeted Data Science into this 2019 workflow that

need to start paying off. They need to see real revenue

generated or margins decrease by better automation

and cutting costs and things of that nature or margins

increase.

Michael Segala: I think if we don't start delivering past POCs and really

start embedding algorithms into deeper kind of

production workflows, it's actually going to take a big

hit and a big step back and people will start defunding

AI into their 2020 and 2021 plans. And I honestly

think that, there's a lot of folks that, very [inaudible

01:04:50].

Michael Segala: Here's a fun app on Instagram and I just want to go

and repeat that and kind of play off of them and you're

always going to see that in the market, but that's

quickly going to become cannibalized in this AI space.

When you have all these big IBM commercials and

Microsoft commercials that are really hyping AI and

people are investing, they need to see something very

quickly pay off or we're just not going to continue to

get funding and this market will start to slow for sure.

Michael Segala: It's up to you, the listeners, you have so many great

listeners on this podcast that are the ones in the

trenches. And I say that wholeheartedly, like I think

that your audience is by and large some of the greatest

audience, especially in the data science space that I've

interacted with and I still get, literally every week

people [inaudible 01:05:35] about your podcast and

what you're doing. And they come to me and say, "Oh,

I heard you on Kirill's podcast. That's great."

Michael Segala: You definitely are driving the correct audience. It's

kind of all of our responsibility as data

scientists to ensure that these projects are successful

and we don't just kind of cannibalize ourselves in the

next year or two and not get any bigger funding.

'Cause then we're all going to be out of a job. That's

honestly how I think the market is going to mature.

Kirill Eremenko: Fantastic. That ties back into that productization

discussion that we had. For data scientists out there,

don't just leave your project. It feels very satisfying to

find the insights and deliver them, but talk to your

manager, boss, client, whoever it is you're talking to

and consult them, advise them on next steps on how

they can actually put that into production, follow up

with them, go back in a few weeks and check if your

model is performing, if it's deteriorating or if it needs

some maintenance. Be proactive in that [inaudible

01:06:32], it's kind of like marriage. If you get married,

you don't just stop there. You have to keep dating,

your wife I mean. Or husband. You have to keep

caring for each other. It's not like you won the

game once you got married. There's lots more. And

now it's the aftermath and the commitment that comes

afterwards.
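
The follow-up check described above, going back after a few weeks to see whether the model is deteriorating, can be sketched as a comparison between recent prediction errors and the errors measured at delivery. The function name `needs_maintenance`, the 20% tolerance and the error values are assumptions for illustration; a real check would use whatever metric the project was judged on.

```python
from statistics import mean


def needs_maintenance(baseline_errors, recent_errors, tolerance=0.2):
    """Flag the model when recent mean absolute error exceeds the
    baseline mean absolute error by more than `tolerance` (relative)."""
    baseline = mean(abs(e) for e in baseline_errors)
    recent = mean(abs(e) for e in recent_errors)
    return recent > baseline * (1 + tolerance)


# Hypothetical prediction errors (actual minus predicted) at delivery and later on.
baseline_errors = [1.0, -0.5, 0.8, -1.2]
week_3_errors = [0.9, -0.7, 1.0, -0.9]
week_8_errors = [2.1, -1.8, 1.5, -2.4]

print(needs_maintenance(baseline_errors, week_3_errors))
print(needs_maintenance(baseline_errors, week_8_errors))
```

With these made-up numbers the week 3 check passes and the week 8 check flags the model, which is the cue to go back to the client and talk about retraining.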

Michael Segala: I see you've been well trained as a husband.

Kirill Eremenko: Not a husband yet my friends.

Michael Segala: Soon to be [crosstalk 01:07:01]

Kirill Eremenko: Yeah one day.

Michael Segala: One day, good for you.

Kirill Eremenko: Mike, wanted to ask you, how many clients have you

guys worked with if it's not a secret, just curious.

Michael Segala: Oh geez. I don't know the number, but it's in the

hundreds.

Kirill Eremenko: Wow. You guys-

Michael Segala: I don't know the number ...

Kirill Eremenko: Are doing really well. All right, well Mike, thanks so

much for coming on the show, been a huge pleasure.

Before I let you go, what are some of the best ways for

our listeners to contact you, whether they are

interested in working with you or whether they're

interested in maybe joining your team.

Michael Segala: Please come to the website, sflscientific.com. There is a

place there. I think you could either chat with us or

you could inbound an email. That all comes directly to

my folks, who tell me right away. If you're looking for a

job, that HR, I think we have like an HR jobs page,

that gets looked at. I tell you, we interview almost a

person a day at this point. A lot of them are great

candidates, but for whatever reason don't work out.

We're always looking for really great folks. If you

inbound to us, I guarantee one of our folks will see it

in a few minutes and reply back accordingly. So please

be in touch, that's probably the best way to get in

touch is just through the website.

Kirill Eremenko: Okay, great. And is it okay for people to connect with

you on LinkedIn as well?

Michael Segala: Of course. It's my pleasure.

Kirill Eremenko: Awesome. Thanks so much. Of course we'll share all of

those links on the show notes. And on that note, Mike,

thanks so much for joining us today and sharing

amazing case studies and your view on the world of

data science.

Michael Segala: Again, thank you so much. It's always an honor and

pleasure to see your progression as well. Best of luck

with you and hope we can talk again soon.

Kirill Eremenko: There you have it. That was Mike Segala from SFL

Scientific. I hope you enjoyed today's episode and got a

lot of valuable takeaways from the show. If you'd like

to connect with Mike, hit him up on LinkedIn, you can

find the URL as well as all the other materials

mentioned on the episode in the show notes at

www.superdatascience.com/249 that's

superdatascience.com/249.

Kirill Eremenko: There you can also find the transcript for this episode

if you'd like to read it. And my personal favorite part

for today was the challenge of small data, dealing with

unbalanced datasets. And the three approaches that

Mike shared with us ranging from creating fake data

sets to unsupervised anomaly detection, to supervised

learning with a small holdout dataset. Some very

exciting stuff. And of course apart from just the

challenges of small data, there are plenty of other

valuable gems shared by Mike.

Kirill Eremenko: And I'd like to reiterate again the call to action from

Mike and the team at SFL. If you're looking for a job

and you'd like to join consulting, then go, head on over

to SFL Scientific and look for the careers page and

apply there. If you're a business owner, an executive

director and you have some challenges that

you think can be solved with machine learning, you'd

like to explore the space of AI and Data Science, then

hit up Mike, don't hold back and see how SFL

Scientific can help your business grow and become

even more competitive.

Kirill Eremenko: And on that note, if you're enjoying the

SuperDataScience show, make sure to head on over to

iTunes or to your favorite app for playing podcasts and

leave us a review there. I'll really appreciate it. I love

reading your reviews. Thanks so much and I look

forward to seeing you back here next time. Until then,

happy analyzing.