INTEGRATE MACHINE LEARNING MODELS WITH PYTHON AND MICROSTRATEGY


Thank you for participating in a workshop at

MicroStrategy World 2019. If you missed or did

not finish an exercise and want to complete it

after the conference, use this workbook and

access the supporting files at

microstrategy.com/world-2019-workshops

*Workshop files will expire 4/30/2019

© 2018 MicroStrategy, Inc.

MicroStrategy offers features that enable analysts and data scientists to use machine learning to extract meaningful insights from their data, including out-of-the-box functions and basic algorithms for a variety of use cases and problems. However, data scientists often rely on a wide range of tools, especially open-source languages like R and Python. To support those tools, data scientists can now code in the language of their choice and continue to work with MicroStrategy via open-source packages.

The purpose of this session is to show you how MicroStrategy and Python can work together to produce machine learning results within the context of business intelligence. In this workshop, you will:

• Learn what actually happens when building a machine learning model and explore a framework for thinking about the model building life cycle.

• Train a deep learning network to predict flight delays in Python.

• Learn why a BI system is a core piece of the technology stack that enables data science teams to be successful.

Machine Learning 101

Broad definition Machine learning (ML) can be loosely defined as statistical and mathematical techniques that allow computer systems to learn from data.
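As a toy illustration of "learning from data" (not part of the workshop), ordinary least squares can recover a line's coefficients from sample points; all values below are made up:

```python
import numpy as np

# Points that follow y = 2x + 1; the "system" will learn the relationship.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

# Fit a degree-1 polynomial; polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)

print(round(slope, 3), round(intercept, 3))  # recovers 2 and 1
```

The same statistical idea, fitting parameters to historical data, underlies far more complex models.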



Machine learning implies that the performance of a specific task is progressively improved. To achieve this, different algorithms can be exposed to historical data to create a trained model, and then tested on unseen data to evaluate how well the model performs.
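The train-on-historical-data, test-on-unseen-data idea can be sketched in plain NumPy; the dataset and the 80/20 ratio here are illustrative assumptions:

```python
import numpy as np

# Made-up "historical data": 100 rows, 4 columns.
rng = np.random.default_rng(42)
data = rng.normal(size=(100, 4))

indices = rng.permutation(len(data))   # shuffle so the split is random
cut = int(0.8 * len(data))             # 80% reserved for training
train, test = data[indices[:cut]], data[indices[cut:]]

# A model would be fit on `train` and evaluated on the unseen `test` rows.
print(train.shape, test.shape)         # (80, 4) (20, 4)
```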

Common examples Examples of ML are more common than you think. Some you may be aware of:

• Selecting the next song in a playlist on a streaming service.

• Granting or denying a loan when you apply.

• Curating the news feed on a social media site.

• Product placement ads in your browser.

Exercise 1: Training with iris data To get our feet wet with machine learning, let’s look at an example with a dataset often used to introduce data science techniques: the iris dataset. This data, shown below, contains the dimensions of the sepals and petals of a flower, and the species these sets of measurements belong to within the iris genus.

The dimensions of an iris can be used to learn how to classify it into the species it belongs to. Let’s use an interactive site to explore and observe the data points.
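That classification task can be sketched with scikit-learn (which this workshop uses later); the model choice and split below are illustrative assumptions, not the workshop's code:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Four measurements per flower -> species label (0, 1, or 2).
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Classify each test flower by its 3 nearest neighbors in measurement space.
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)  # fraction of correct species
print(f"held-out accuracy: {accuracy:.2f}")
```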


1 Open a browser and navigate to:

https://plot.ly/~AmenRadix/128.embed

The page displays a graph similar to the image below. You can intuitively identify three clusters.


Without having to create any algorithms, your brain trained itself to view the clusters. Our brains are good at abstracting up to three dimensions when we have a visual representation of the model. But what happens when you have five dimensions—or ten? It becomes much harder. Now imagine a weather system where thousands of factors are taken into account.

This is why it is important to have a framework to manage, train, and evaluate models. Among those frameworks, many data scientists will operate using the CRISP-DM, discussed below.


CRISP-DM framework Companies around the world use machine learning to create insights into their businesses. But how does one create a machine learning system?

Many data science teams use a process framework, CRISP-DM, to guide their work. CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It lays out the fundamental steps common to nearly every machine learning project. Here’s a diagram that visualizes the framework:

The process is a cycle, and starts with business understanding.

Business understanding Business understanding is about trying to identify both the core drivers of and problems with your business. In this part of the process, it is vital to spend time examining the business problems that might be good candidates to approach with machine learning.


You may find it helpful to imagine what it would be like to solve the business problem you’re working on with machine learning. What would it mean to use algorithms to help find a solution? How would they be adapted internally?

Data understanding The next step is data understanding. Once we’ve identified our problem, we need to take inventory of the data that might be useful for analysis. We need to seek out high-quality, reliable, and reproducible sources of data. We also need to spend a lot of time understanding what the data contains, and more importantly, what it doesn’t contain.

Sometimes at this stage, you need to go back to the business understanding step and re-examine the problem in light of the available data. You’ll see that this back and forth behavior is common in the CRISP-DM framework and in data science projects in the real world.

Data preparation Many data scientists use a rule of thumb that you should expect to spend about 80% of a data science project on data preparation. This includes data clean-up, creating new variables, writing code to extract data from databases, and reorganizing the data in a way that machine learning algorithms need it to be structured.

This is a critical step of the project because you’re building the “plumbing” that every subsequent step in the machine learning process relies on. It is not uncommon to get to the modeling step of a project only to realize that something critical was missed during data preparation. For this reason, many data scientists build automated pipelines to manage data preparation.

Modeling In the modeling phase, we choose from among hundreds of algorithms that might work for our problem. For example, in a time-series forecasting problem, you might want to use a moving-average based model or something that takes seasonality and time dependence into account, such as ARIMA.
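The simplest of those approaches, a moving-average forecast, can be sketched in a few lines; the series and window size below are made up for illustration:

```python
import numpy as np

# A made-up daily series; a real project would load historical data.
series = np.array([10., 12., 11., 13., 15., 14., 16., 18.])
window = 3

# Forecast the next point as the mean of the last `window` observations.
forecast = series[-window:].mean()
print(forecast)  # mean of 14, 16, 18 -> 16.0
```

ARIMA-style models extend this idea by also accounting for trend, seasonality, and time dependence.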


It’s common to decide on a basket of modeling approaches, rather than relying on just one, and to quantitatively evaluate which model is best.

Evaluation In the evaluation stage, you assess how each of the algorithms performed using an objective scoring approach. You may have heard of r-squared or mean squared error, for example. These are evaluation metrics used to help data scientists understand whether the algorithm has succeeded in generalizing beyond the training data.
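Both metrics can be computed directly from actual and predicted values; the numbers below are made up for illustration:

```python
import numpy as np

actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.5, 5.0, 7.5, 9.0])

# Mean squared error: average of the squared prediction errors.
mse = np.mean((actual - predicted) ** 2)

# r-squared: 1 minus residual variance over total variance.
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mse, round(r2, 4))  # 0.125 and 0.975
```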

Another important part of this step is checking the quality of the model in business terms. This means that we revisit our business problem and ask ourselves if the model will be helpful in our business context.

Deployment Finally, we have to deploy our model! The real return on investment from machine learning comes from organizations successfully deploying their models into production and integrating them in the decision-making fabric of their organization.

Now that we are familiar with the CRISP-DM framework, let’s put it into action for a specific problem: predicting flight delays.


Walk-through: Business understanding

Let’s assume we are members of the analytics team at a major US airport. We have data on every inbound and outbound flight for an entire year—over 5 million flights. We also have the outcome of each flight: whether it was canceled, delayed, or left on time.

Our primary goal is to train a model that predicts the probability that each flight will be canceled, delayed, or leave on time. Our secondary goal is to use those predictions to display this information to passengers, so they are proactively informed about their flight’s status.

For this workshop, we will be using four datasets. We will also pull external region data into MicroStrategy to create a new Intelligent Cube.


• Flights: This dataset contains data on over 5 million individual flights from 2015. The data contains dates and times for each flight, flight destination, flight departure airport, the airline, and other core data.

• Airlines: This dataset contains 14 airline names and their corresponding airline codes.

• Airports: This dataset contains data on 322 airports, along with their state, city, latitude, and longitude.

• US States: This dataset contains information about each of the 50 US states and territories.

At this step of the process, it’s good practice to start making a list of potential features—new variables—to add that are not available in the raw data. Later in the data prep process, we’ll add that data to the raw data using Python functions.

A good initial hypothesis is that weather—especially winter weather—has a lot to do with flight cancellations. Another hypothesis is that there are groups of states in regions like the northeast that experience intense winter weather, and therefore might have a higher chance of experiencing flight delays. We will use region codes to help our algorithm learn that some states share similar weather patterns.
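That region-code feature can be sketched with a pandas mapping; the state-to-region table below is a small made-up subset, not the workshop's actual region data:

```python
import pandas as pd

# Illustrative subset of a state-to-region lookup (hypothetical values).
state_to_region = {'ME': 'Northeast', 'NY': 'Northeast',
                   'FL': 'South', 'CA': 'West'}

flights = pd.DataFrame({'origin_state': ['NY', 'FL', 'ME', 'CA']})

# Derive the new feature: each flight's origin region.
flights['origin_region'] = flights['origin_state'].map(state_to_region)
print(flights)
```

Grouping states this way lets the model treat, say, ME and NY as sharing winter-weather risk even though they are distinct values in the raw data.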

Walk-through: Data preparation We’ll spend a fair amount of time in the data preparation stage. There are a few sub-steps in the data prep phase.

• Getting data

• Data preparation

• Data splitting

Get data The first step in our process is to physically get data to train our model with. Let’s discuss how this is done when MicroStrategy is the data source for training the model.


MicroStrategy solutions to these challenges

We’ll need a way to connect our machine learning server with our BI system and extract the data. Using the MicroStrategy REST API, we’ll extract the data in our MicroStrategy cubes and make a copy of the data locally to train our model.

We’ll use Python and a few popular machine learning packages throughout this workshop.

We’ll use scikit-learn and its helper functions to help structure our data and evaluate the accuracy of our final models. We will also use Keras, which is a neural network library that interfaces with Google’s TensorFlow library, to create and train our model.

Exercise 1: Get your own ML environment To start, we need to connect to a MicroStrategy on AWS environment that’s been pre-configured with the ML tools and data needed to complete the exercises below.

Access the provisioning console

1 Navigate to the MicroStrategy on AWS provisioning console at: https://provision.customer.cloud.microstrategy.com/

2 On the provisioning console login page, enter the credentials provided below:


• Username: [email protected]

• Password: workshopmstr!

3 Find the environment with the number that your instructor provided you with at the beginning of the workshop.

4 In the Actions section, select the ellipses icon and click Edit Contact.

5 In the Edit Contact Information window, replace the information in the boxes with your first name, last name, and email address. Then click Apply. You will receive an email with your environment credentials.


Make sure to use an email address that you can access immediately, as you will be sent an email with your environment credentials.

6 From the MicroStrategy on AWS email, select Access MicroStrategy Platform. Log in with your MicroStrategy Badge or enter your credentials.

7 On the landing page, scroll down and hover your cursor over Remote Desktop Gateway, then click the Launch icon that is displayed.

8 In the Remote Desktop Connection window, in the Username and Password boxes, type the user name and password listed in your Welcome to MicroStrategy on AWS email.

9 Click Login.

10 On the web page, under All Connections, click Developer Instance RDP.


Your remote desktop session opens. Complete the rest of this workshop in this environment.

Exercise 2: Review datasets and start analysis

Review the datasets in MicroStrategy

1 Log into MicroStrategy Web using the credentials you received in your MicroStrategy on AWS email.

https://env-XXXXX.trial.cloud.microstrategy.com/MicroStrategy/servlet/mstrWeb

The XXXXX above represents the environment number you received in your Welcome to MicroStrategy on AWS email.

2 Click MicroStrategy Tutorial.

3 Select Go to MicroStrategy Web in the right corner of the screen.

4 Follow these steps to create three MicroStrategy cubes from .csv files and retrieve their IDs. These steps must be completed for each of the three datasets (Airlines, Airports, and States) we want to import into MicroStrategy. We will use Airlines as an example.

a Click Create, then Add External Data.


b Click File from Disk.

c Click Choose Files.

d Navigate to C:\Users\mstr\Desktop\Demo\Data\Raw, and click airlines.csv. Click Open.

e For each of the MicroStrategy cubes (Airlines, Airports, and States), do the following:

a Save the dataset in Shared Reports under the name airlines, airports, or states. Pay attention to the name: casing is important, so save all three dataset names in lowercase.

b Review the data at the bottom of the screen.

c Close the window.

d Right-click the cube and select Properties.

e Copy the ID and click OK.

f Search for Notepad++ in the search bar. Open a Notepad++ document, paste the ID and write the name of the cube beside it. You’ll use this information later.

Your Notepad++ document should resemble the following:


We will conduct our analysis on a Windows server in the environment we previously configured, using a Jupyter Notebook (run through the Anaconda Python distribution) to work with the Python scripts in a web browser. Jupyter Notebooks are interactive, showing code output in real time and making troubleshooting easier.

1 Click the Start menu and start typing the word Anaconda.

2 When you see Anaconda Prompt, right-click it and select Run as administrator.

3 Click Yes to accept the message that opens.

4 To ensure that we’re working in the correct directory, type cd C:\Users\mstr\ and press Enter.

5 Type jupyter notebook, then press Enter.


If this is your first time using a Jupyter Notebook, here is a short introduction:

• Each cell contains a snippet of Python code to run.

• To run a cell, click to select it and then either press Shift+Enter on the keyboard or click Run in the toolbar.

• You can use the + in the toolbar to add a cell to the notebook for your own code.

• You can use the Up and Down arrows to move a cell in the notebook.

• When a cell completes its processing, a number appears in the brackets [ ] on the left. When it’s processing, you will see a star (*). You can also look at the circle located at the top right of the page next to the name Python 3, as shown below. An empty circle means it’s ready, while a full circle means it’s busy processing.


• If there is output from the code in a cell, it appears below the cell. If you want to clear the output of a cell, use the Cell menu under Current output or All output.

• Use the Kernel menu to reset the notebook to its initial state.

6 Click Desktop, then click Demo, then Code, then Notebooks.

This folder contains the Python scripts we’ll use in our analysis.

7 Click 01_prep_raw_cubes.ipynb.

The notebook will open.


Import the cubes from MicroStrategy

Let’s examine the code as we execute the pieces of the script.

1 Locate the cell containing:

import warnings
warnings.simplefilter('ignore')

import pandas as pd
import os
import sys
import time

from mstrio import microstrategy

home_dir = 'C:/Users/mstr/Desktop/Demo'

The first two lines disable warnings so we won’t be distracted during this workshop.

The next few lines import supporting libraries (pandas, os, sys, and time) used throughout the script.

The next line loads MicroStrategy's mstrio library, which uses MicroStrategy's REST API to connect Python and the Intelligence Server.

The last line defines our home directory, the location we will run our files from.

2 Execute the cell by pressing Shift+Enter, or click Run.

3 Locate the next cell containing:

# API / server params
username = "mstr"
password = "password"
base_url = "https://env-XXXXX.customer.cloud.microstrategy.com/MicroStrategyLibrary/api"
project_name = "MicroStrategy Tutorial"  # Tutorial Project

Best Practice

This cell contains a series of variables used to connect to your Intelligence Server, including your user credentials. It is a good practice to define variables this way, as changing them will allow you to adapt the code to another environment quickly.


4 Replace the password value with the password you received in your MicroStrategy on AWS email.

5 Replace XXXXX with your environment number.

6 Execute the cell.

7 Locate the cell containing:

#conn = microstrategy.Connection(base_url=base_url, username=username, password=password, project_name=project_name)
conn = microstrategy.Connection(base_url, username, password, project_name)
conn.connect()

This sends a request to the MicroStrategy Intelligence Server establishing a REST API connection between our machine learning server and the Intelligence Server. The mstrio library takes care of managing the authentication token and cookies needed to access the REST API server.

8 Execute the cell.

9 Locate the cell containing:

# Cubes to download
cube_names = ['airlines', 'airports', 'states']
cube_ids = ['C983680C11E8D236B87F0080EF35FE86',
            'EADC795811E8D236B83F0080EF15BE86',
            '5946FA1211E8D237BB800080EFB5FF89']

This cell lists the names and IDs of the cubes that we need to download from the MicroStrategy environment to our machine learning server. The two lists must be in the same order: because the first value in cube_names is “airlines,” the first value of cube_ids must be the ID of the airlines cube, and so on.

10 Execute the cell.
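Outside the notebook, the name-to-ID pairing can be illustrated like this (the IDs below are placeholders, not real cube IDs):

```python
# zip pairs each name with the ID at the same position in the other list.
cube_names = ['airlines', 'airports', 'states']
cube_ids = ['ID_A', 'ID_B', 'ID_C']  # placeholder IDs for illustration

pairs = dict(zip(cube_names, cube_ids))
print(pairs['airlines'])  # ID_A -- reorder either list and the pairs break
```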

11 Locate the cell containing:


# Persist cubes on disk
for cube_id, cube_name in zip(cube_ids, cube_names):
    print("Fetching the " + cube_name + " cube from the Intelligence Server..." + "\n")
    cube = conn.get_cube(cube_id=cube_id)
    cube.drop(labels=list(cube.filter(like='Row Count')), inplace=True, axis=1)
    cube.columns = cube.columns.str.lower()
    cube.columns = cube.columns.str.replace(' ', '_')
    print("Preview of the data:")
    print(cube.head())
    print("\n")
    time.sleep(3)

    # Adjust for missing data
    if cube_name == 'airports':
        cube.ix[cube.iata_code == 'ECP', ['latitude', 'longitude']] = "30.3416666667", "-85.7972222222"
        cube.ix[cube.iata_code == 'UST', ['latitude', 'longitude']] = "29.95861111", "-81.33888888"
        cube.ix[cube.iata_code == 'PBG', ['latitude', 'longitude']] = "44.65083", "-73.46806"

    print("Saving " + cube_name + " data locally..." + "\n")
    with pd.HDFStore(os.path.join(home_dir, 'Data\\clean.h5')) as hdf:
        hdf.append(key=cube_name, value=cube)

# Close MicroStrategy connection
conn.close()


This cell is long, so let’s split the explanation, focusing first on the following lines:

# Persist cubes on disk
for cube_id, cube_name in zip(cube_ids, cube_names):
    print("Fetching the " + cube_name + " cube from the Intelligence Server..." + "\n")
    cube = conn.get_cube(cube_id=cube_id)
    cube.drop(labels=list(cube.filter(like='Row Count')), inplace=True, axis=1)
    cube.columns = cube.columns.str.lower()
    cube.columns = cube.columns.str.replace(' ', '_')
    print("Preview of the data:")
    print(cube.head())
    print("\n")
    time.sleep(3)

This is a loop that iterates through the paired lists of cube names and IDs. A line is printed in the notebook to tell us which cube is being retrieved.

The next line uses our conn object (our connection to MicroStrategy through REST) to extract the cube data from the MicroStrategy Server into Python and store it in a pandas DataFrame. The code then drops the Row Count columns from the cube.

The next two lines convert every column name to lowercase and replace spaces with underscores.

Then, three print statements offer us some feedback inside the notebook itself. Note the use of the time.sleep() function: the sleep command waits three seconds before moving on to the next cube. This is done deliberately to allow us slow humans to read the output on screen. Focusing on the next portion of this cell:

# Adjust for missing data
if cube_name == 'airports':
    cube.ix[cube.iata_code == 'ECP', ['latitude', 'longitude']] = "30.3416666667", "-85.7972222222"
    cube.ix[cube.iata_code == 'UST', ['latitude', 'longitude']] = "29.95861111", "-81.33888888"
    cube.ix[cube.iata_code == 'PBG', ['latitude', 'longitude']] = "44.65083", "-73.46806"

This code executes within the for loop. For the airports cube, it inserts latitude and longitude values for three airports to correct some data quality errors. This is performed here instead of at the source, as you may not always have access to the source data.

The next few lines in the cell are as follows:

print("Saving " + cube_name + " data locally..." + "\n")
with pd.HDFStore(os.path.join(home_dir, 'Data\\clean.h5')) as hdf:
    hdf.append(key=cube_name, value=cube)

This section tells the user that we are about to save the data from the current cube to our server.

Finally, we write the data to a file called clean.h5, joining the home directory path defined earlier with the file name. The file is in HDF5 format, a file-based format designed for storing large tabular and array data.

The last lines of the cell are:

# Close MicroStrategy connection
conn.close()


These lines terminate the API session after the data has been moved successfully from the Intelligence Server to our machine learning server.

12 Execute the cell. Keep an eye on the output, as shown below:


13 In File Explorer, navigate to C:\Users\mstr\Desktop\Demo\Data to locate the file clean.h5.

Notice the small size of the file—the data from our three cubes was not very large. This will change with the flight data we are about to load.

Import the local flights data

Our next notebook will walk us through loading a local file containing our flight data.

1 Return to the main Jupyter notebook tab.

2 Click 02_prep_raw_flights.ipynb.

The notebook opens as shown below.


Let’s walk through the cells and execute them together.

3 Locate the cell containing the following code:

import pandas as pd
import numpy as np
import os
import sys

home_dir = 'C:/Users/mstr/Desktop/Demo'

As before, this cell imports a few necessary libraries, including pandas and NumPy, and sets our home directory. We are not loading mstrio here, as this data is not located on an Intelligence Server.

4 Execute the cell.


5 Locate the cell containing this code:

flights = pd.read_csv(home_dir + '/Data/Raw/flights.csv')
flights.columns = flights.columns.str.lower()
flights.columns = flights.columns.str.replace(' ', '_')

The first line reads the flight data from the CSV file stored locally.

The next two lines make every column name lowercase and replace spaces with underscores. A warning is displayed; you can ignore it.

6 Execute the cell.

7 Locate the cell containing:

flights.head()

This line displays a few rows from the flights dataset in the notebook.

8 Execute the line.

The output below the cell should look like the image below:

9 Locate the cell containing this code:


flights.origin_airport = flights.origin_airport.astype(str)
flights.destination_airport = flights.destination_airport.astype(str)

with pd.HDFStore(os.path.join(home_dir, 'Data\\clean.h5')) as hdf:
    airports = hdf.get(key="airports")

# Drop flights to/from airports that are not in the airports list
flights = flights.ix[np.isin(flights.origin_airport, airports.iata_code), :]
flights = flights.ix[np.isin(flights.destination_airport, airports.iata_code), :]

# Drop (6) flights without scheduled time
flights = flights.ix[np.isnan(flights.scheduled_time) == False, :]

# Add a unique ID for each flight
flights['FL_ID'] = ["FL_" + str(x) for x in np.arange(0, len(flights))]

# Delete critical future leak. Do not give model info it is looking for.
drop_cols = ['departure_time', 'taxi_out', 'wheels_off', 'elapsed_time', 'air_time',
             'wheels_on', 'taxi_in', 'arrival_time', 'arrival_delay', 'diverted',
             'cancellation_reason', 'air_system_delay', 'security_delay', 'airline_delay',
             'late_aircraft_delay', 'weather_delay', 'year', 'tail_number']

flights.drop(drop_cols, inplace=True, axis=1)

Let’s digest this long cell in smaller chunks. Focus on the following lines:

flights.origin_airport = flights.origin_airport.astype(str)

flights.destination_airport = flights.destination_airport.astype(str)

These lines change the data type of the origin and destination airports into text data, known as strings, to ensure they retain their original format.

with pd.HDFStore(os.path.join(home_dir, 'Data\\clean.h5')) as hdf:
    airports = hdf.get(key="airports")

These lines retrieve the airports table from the clean.h5 file we saved in the previous script.

# Drop flights to/from airports that are not in the airports list

flights = flights.ix[np.isin(flights.origin_airport, airports.iata_code), :]

flights = flights.ix[np.isin(flights.destination_airport, airports.iata_code), :]

These lines use the airport codes to keep only the flights related to the airports in the table.

# Drop (6) flights without scheduled time

flights = flights.ix[np.isnan(flights.scheduled_time)==False, :]


This line removes flights that do not have a scheduled time listed, so our dataset doesn’t have incomplete data.

flights['FL_ID'] = ["FL_" + str(x) for x in np.arange(0, len(flights))]

This line iterates over the flights table and adds a new identifier for each flight in the FL_ID column.

# Delete critical future leak. Do not give model info it is looking for.
drop_cols = ['departure_time', 'taxi_out', 'wheels_off', 'elapsed_time', 'air_time',
             'wheels_on', 'taxi_in', 'arrival_time', 'arrival_delay', 'diverted',
             'cancellation_reason', 'air_system_delay', 'security_delay', 'airline_delay',
             'late_aircraft_delay', 'weather_delay', 'year', 'tail_number']

flights.drop(drop_cols, inplace=True, axis=1)

These lines delete the listed columns from the flight dataset. These columns contain information that is only known after a flight has departed, so keeping them would bias the model by leaking the very outcome it is trying to predict. Such variables are often called "future leaks."

10 Execute the cell. It may take a few minutes to run.

11 Locate the cell containing:

with pd.HDFStore(os.path.join(home_dir, 'Data\\clean.h5')) as hdf:
    hdf.append(key="flights", value=flights)

This cell commits the new flights table and appends it to the clean.h5 file we have locally.

12 Execute the cell. Saving may take a few moments because the data is large.

13 Click Kernel, then click Shutdown. Confirm the shutdown.

14 Close the browser tab.

15 In File Explorer, locate the clean.h5 file.

Data preparation

When we train a model, typically we use a single table that contains all of the data we want the model to use. In that table, we add features (new data that wasn't present in the raw data) that reflect our knowledge of the business problem or our hypotheses about what is correlated with the business problem.

This is done through joins, transformations, merges, and lookups using the available data sources.

As an example, our Flights table had the origin and destination airports, and our Airports data contained the latitude and longitude for each airport. We want to use the latitude and longitude in our model, in case there’s a relationship between those variables and the outcome for each flight.

To do that, we have to join airports with flights.

After we do this a couple times, we end up with a very large table. In this case, we have our 5 million flights, accompanied by 190 columns! In total these columns reflect all of our ideas, hypotheses, and thoughts we want the model to learn from to understand what causes flights to be delayed.
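The airport join described above can be sketched with pandas. The tiny tables and exact merge arguments below are illustrative, not the notebook's actual code, though the suffixed column names (iata_code_orig, latitude_dest, and so on) mirror the ones that appear in the workshop data:

```python
import pandas as pd

# Toy stand-ins for the real flights and airports tables
flights = pd.DataFrame({
    'FL_ID': ['FL_0', 'FL_1'],
    'origin_airport': ['JFK', 'LAX'],
    'destination_airport': ['LAX', 'SEA'],
})
airports = pd.DataFrame({
    'iata_code': ['JFK', 'LAX', 'SEA'],
    'latitude': [40.64, 33.94, 47.45],
    'longitude': [-73.78, -118.41, -122.31],
})

# Join once for the origin airport and once for the destination,
# suffixing the airport columns so the two joins do not collide
flights = (flights
           .merge(airports.add_suffix('_orig'),
                  left_on='origin_airport', right_on='iata_code_orig')
           .merge(airports.add_suffix('_dest'),
                  left_on='destination_airport', right_on='iata_code_dest'))

print(flights[['FL_ID', 'latitude_orig', 'latitude_dest']])
```

Each such join widens the table; repeating the pattern for weather, carrier, and calendar features is how the 190-column training table is assembled.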

Prepare the data

1 On the main Jupyter Notebook page, click 03_prep_training. The notebook opens.

Rather than running each cell individually, we will run the entire script to save time. Please see the Appendix for individual steps.

2 In the Cell menu, select Run All.

3 Once completed, click Kernel, then click Shutdown. Confirm the

shutdown.

4 Close the browser tab.

Data splitting

This is the last step in data preparation.

We divide our data into different sets:

• We need to create a partition of the data to use for training the model. This is the data the machine learning algorithm uses to fit the model.

• We also need a test set. The test set is used to evaluate the quality of the model after it has been trained with an objective metric like r-squared or root mean squared error.

• The final set is the validation set. The validation set serves a similar purpose to the test set. It is used to provide a confirmation that the error rate from both the validation set and the test set are similar.

The purpose of the training-test split is to estimate the reliability of our model when using unseen data. We do this to get a sense of how well our model will perform when it’s in production.

Another way of doing this is through cross-validation, where this train-test-validation splitting process is repeated numerous times, and produces a distribution of error estimates instead of a single error estimate.
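A minimal numpy-only sketch of the k-fold idea behind cross-validation is shown below. The "model" here is just the training mean and the data is synthetic; it only illustrates how repeated splits yield a distribution of error estimates rather than a single one:

```python
import numpy as np

def kfold_indices(n_rows, k, seed=91919):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(n_rows)
    folds = np.array_split(idx, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

y = np.arange(100, dtype=float)            # stand-in "data"
errors = []
for train_idx, test_idx in kfold_indices(len(y), k=5):
    pred = y[train_idx].mean()             # stand-in "model": predict the training mean
    errors.append(np.sqrt(np.mean((y[test_idx] - pred) ** 2)))

print(np.mean(errors), np.std(errors))     # a distribution of error estimates
```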

In this case, we'll do a random split of the data. About 65% of the data is used for training, and the remaining data is split evenly across the test and validation sets.

Due to time limitations for this workshop, we will use smaller sets to train and evaluate the data. Normally the model would use the entirety of the data.

Split the data

1 Locate the following cell:

# ########################### #
# Train-Test-Validation Split #
# ########################### #

prod = flights[np.logical_and(flights.month == 12, flights.day == 31)]
flights = flights[~np.logical_and(flights.month == 12, flights.day == 31)]

train, test = train_test_split(flights, test_size=0.35)
test, oos = train_test_split(test, test_size=0.5)

This cell creates our train, test, and validation splits. You see the creation of the train and test sets, followed by a split of the test dataset to create the validation set. Additionally, a data frame called prod is created, which represents the data our model will not see: all the flights for December 31st. We will use this data later in the workshop as an out-of-sample test.

2 Execute this cell. This may take a few moments.

3 Locate the cell containing the following:

# ############# #
# Store in HDF5 #
# ############# #

with pd.HDFStore(os.path.join(home_dir, 'Data\\ready.h5')) as hdf:
    hdf.append(key="train", value=train)
    hdf.append(key="test", value=test)
    hdf.append(key="oos", value=oos)
    hdf.append(key="prod", value=prod)

Now that we have some data prepared, we will store it in our home directory so it can be modified later if needed.

4 Execute this cell. Saving may take a few moments.

5 Click Kernel, then click Shutdown. Confirm the shutdown.

6 Close the browser tab.

7 In the File Explorer, locate the ready.h5 file and notice the file is now two and a half gigabytes.

Walkthrough: Modeling

In the modeling phase, we select an algorithm or collection of algorithms suitable for our problem. Thanks to our earlier work, the data is structured for use in our machine learning algorithms, but we might not know exactly which algorithm is going to work best.

Classes of machine learning problems

There are three main classes of machine learning algorithms, which are represented in the diagram below:

• Unsupervised learning algorithms are used most often for pattern discovery; for example, when you have data but aren't exactly sure what question is being asked.

• Supervised learning algorithms are used when we want to infer a relationship between input and output pairs. They comprise many of the applications you’ve probably read about. For example, supervised learning tasks include regression analysis and data classification. Unlike unsupervised learning, these tasks require labeled data. In other words, we must know the true outcome for each record in our dataset.

• Reinforcement learning algorithms are intended to optimize performance outcomes. Unlike supervised learning algorithms, they do not require the correct input and output pairs. They are used in many industrial and software applications, including in manufacturing and automation.

Neural networks

Since we are trying to predict whether a flight is likely to be on time, delayed, or canceled, and we have labeled data, we are now working on a supervised learning classification problem. To solve it, we will use a neural network.

Neural networks are one of the most important machine learning techniques. As the diagram below shows, a neural network consists of three parts:

1. An input layer, composed of the labeled real-world observations from our dataset, such as origin airport and departure time.

2. An output layer, which contains our predictions regarding the probable status of a flight given its input characteristics.

3. Multiple hidden layers, composed of sequential algorithms that analyze and process data from the input layer and previous layers in order to generate the outputs. Because we are using multiple hidden layers, this is a deep learning model.

To use our neural network, we must first “train” it. We do so by feeding input layer data into an activation function. The hidden layers of the neural network will then automatically perform a series of calculations to tune the weights of each node in the network, ensuring that the output layer most closely matches the true output that we observed in our data—in other words, making sure that our model produces the most accurate predictions possible.
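The weight-tuning loop described above can be illustrated with a single neuron trained by gradient descent on a toy task. This numpy sketch is not the workshop's model; it only shows the mechanism that Keras automates across many layers and with a smarter optimizer:

```python
import numpy as np

rng = np.random.RandomState(91919)
x = rng.uniform(-1, 1, size=(200, 2))         # two input features
y = (x[:, 0] + x[:, 1] > 0).astype(float)     # the "true output" we observed

w = np.zeros(2)                               # weights to be tuned
b = 0.0                                       # bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))           # the activation function

for epoch in range(500):
    pred = sigmoid(x @ w + b)                 # forward pass through the neuron
    grad_w = x.T @ (pred - y) / len(y)        # gradient of the loss w.r.t. the weights
    grad_b = np.mean(pred - y)
    w -= 0.5 * grad_w                         # nudge the weights to reduce the loss
    b -= 0.5 * grad_b

accuracy = np.mean((sigmoid(x @ w + b) > 0.5) == y)
print(accuracy)
```

Each pass over the data plays the role of a training epoch: the loss shrinks and the outputs move closer to the observed labels.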

Train the model

To train our model, we start in our 04_train.ipynb notebook.

1 On the Jupyter notebook main page, click 04_train.

The notebook opens.

2 Locate the cell at the top containing the following code:

# package imports
import sys
import os
import gc

import numpy as np
import pandas as pd
import pickle

from keras.models import Sequential
from keras.layers.core import Dense, Dropout
from keras.layers.normalization import BatchNormalization
from keras.callbacks import EarlyStopping
from keras import regularizers
from sklearn.preprocessing import MaxAbsScaler

This cell loads the packages we need in order to train our model. You will receive some warnings from TensorFlow, but you can safely ignore them.

3 Execute this cell.

4 Locate the cell containing the following code:

# helper function for returning the target variables from the training data
def get_targets(df):
    targets = ['cancelled', 'delayed', 'on_time']
    return df.filter(items=targets, axis=1)

This contains a support function we will use later. It returns the target variables when we run the model.

5 Execute this cell.

6 Locate the cell containing the following code:

# helper function for dropping columns that we do not wish to train the model on
def drop_cols(df):
    drop = ['month', 'day', 'day_of_week', 'airline', 'flight_number', 'iata_code_orig',
            'state_orig', 'iata_code_dest', 'state_dest', 'origin_airport',
            'destination_airport', 'scheduled_departure', 'departure_delay',
            'scheduled_time', 'distance', 'scheduled_arrival', 'FL_ID',
            'cancelled', 'delayed', 'on_time']
    return df.drop(drop, axis=1)

This cell also contains a support function. It removes, or "drops," some columns from a data frame that we do not need.

7 Execute this cell.

8 Locate the cell containing the following code:

# set home directory
home_dir = "C:\\Users\\mstr\\Desktop\\Demo"

# set seed for reproducibility
np.random.seed(91919)

In this cell, we define the path to our files and set a fixed seed number that allows us to reproduce the same results every time. The network's weights are initialized with pseudo-random values, and fixing the seed makes those starting values identical on every run.

9 Execute this cell.

10 Locate the cell containing the following code:

# ################ #
# Load in the data #
# ################ #

with pd.HDFStore(os.path.join(home_dir, 'Data\\ready.h5')) as hdf:
    train = hdf.get(key="train")
    test = hdf.get(key="test")

This cell loads in our training data as well as our test data.

11 Execute this cell. Note that these datasets are quite large, so you should not be concerned if they take time to load.

12 Locate the cell containing the following:

# ######### #
# Data prep #
# ######### #

if True:
    # To speed up model training, we're taking a subset of the data.
    # If you want to train the full model, set the previous line to 'False'.
    train, test = [df.sample(n=25000) for df in [train, test]]

# x-vars and y-vars
x_train, x_test = [np.array(drop_cols(df=df)) for df in [train, test]]
y_train, y_test = [np.array(get_targets(df=df)) for df in [train, test]]

Note that while this condition is set to True, each set is limited to 25,000 rows. To train on the complete data frame, set the condition to False.

We’ll use a subset of our data to train and test our model. Depending on the application and the dataset, neural networks can exhibit a lot of variation in the time they take to fit a model. By only using a sample of the data, we can ensure that this process will only take a few minutes.

We also want to manage the program's memory profile as it runs. The more data we want to analyze, the more system resources the program must consume.

13 Execute this cell. By doing so, we split our data into x and y sets. The x set contains the input observations that the model will use to learn from. The y set contains the outcome observations.

14 Locate the cell containing the following code:

# ####### #
# Scaling #
# ####### #

scaler = MaxAbsScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# clean-up
del train, test
gc.collect()

This cell scales our values by dividing each column by its maximum absolute value, which brings every feature into the range [-1, 1] and keeps the inputs in a numerically friendly range for training.

Note that gc.collect is a garbage collection command used here to help reduce memory usage.
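What MaxAbsScaler computes can be sketched in plain numpy with made-up values. Note how new data is scaled with the maxima learned from the training data, mirroring the fit_transform/transform pattern above:

```python
import numpy as np

# MaxAbsScaler divides each column by its maximum absolute value,
# so every training feature lands in [-1, 1] without shifting zeros
x_train = np.array([[ 10.,  -2.],
                    [ 50.,   4.],
                    [-25.,   1.]])

max_abs = np.abs(x_train).max(axis=0)    # learned on the training data only
x_train_scaled = x_train / max_abs

# New data reuses the *training* maxima, mirroring scaler.transform
x_test = np.array([[100., 2.]])
x_test_scaled = x_test / max_abs
print(x_test_scaled)                     # values can exceed 1 for unseen extremes
```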

15 Execute this cell.

16 Locate the cell containing the following code:

# ############################ #
# Configure the neural network #
# ############################ #

dnn = Sequential()

dnn.add(Dense(1024, input_dim=190,
              activation='relu',
              kernel_initializer='uniform',
              bias_initializer='normal',
              kernel_regularizer=regularizers.l2(0.001),

              activity_regularizer=regularizers.l2(0.001)))
dnn.add(Dropout(0.2))

dnn.add(Dense(512,
              activation='tanh',
              kernel_initializer='normal',
              bias_initializer='uniform',
              kernel_regularizer=regularizers.l2(0.001),
              activity_regularizer=regularizers.l2(0.01)))
dnn.add(BatchNormalization())
dnn.add(Dropout(0.2))

dnn.add(Dense(128,
              activation='relu',
              kernel_initializer='zeros',
              bias_initializer='uniform',
              kernel_regularizer=regularizers.l2(0.001),
              activity_regularizer=regularizers.l2(0.1)))
dnn.add(BatchNormalization())

dnn.add(Dense(3, activation='softmax', kernel_initializer='normal', bias_initializer='ones'))

dnn.compile(loss='categorical_crossentropy', optimizer='adagrad', metrics=['categorical_accuracy'])

earlystopping = EarlyStopping(monitor='val_loss', min_delta=0, patience=0, verbose=0, mode='auto')

This cell configures the parameters of our neural network, which we must do in order to initialize its structure.

Notice the parameters passed to each dnn.add(Dense(...)) call. These set the number of nodes in each layer: from 1024, to 512, and ultimately down to 3. The selection of this "network topology" is a common area of debate, and data scientists typically spend a long time selecting and fine-tuning these kinds of parameters.
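The final three-node layer uses a softmax activation, which turns three raw scores (one per class: cancelled, delayed, on_time) into probabilities that sum to 1. A numpy sketch with hypothetical scores:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())     # subtract the max for numerical stability
    return e / e.sum()

raw_scores = np.array([0.2, 1.1, 3.0])   # hypothetical pre-activation outputs
probs = softmax(raw_scores)

print(probs)            # three probabilities, largest for the largest score
print(probs.sum())      # always 1.0
```

This is why, later in the workshop, each flight receives a probability for every status rather than a single hard label.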

17 Execute this cell.

18 Locate the cell containing the following:

# ######## #
# Training #
# ######## #

dnn.fit(x=x_train,
        y=y_train,
        batch_size=2500,
        epochs=10,
        verbose=True,
        validation_data=(x_test, y_test),
        callbacks=[earlystopping])

This cell will initiate the training of our neural network. You should see notifications in the console informing you of what is happening as the model attempts to optimize and achieve the lowest loss statistic.

Note that as each training "epoch" is completed, the loss statistic declines. This means our model is becoming a more accurate predictor compared to previous iterations.

19 Open the Task Manager (right click the taskbar and click Task Manager).

20 Keep an eye on the name Python in the Processes tab.

This provides a preview of the CPU and memory resources that our calculations are consuming. In general, this demand will scale with amount of data we are trying to process. More demanding tasks, in other words, require more system resources.

Note that training the model will require significant CPU resources. Indeed, you should not be surprised to see Python frequently utilizing close to 100% of those available.

21 Execute the cell from Step 18. This will take a few minutes.

22 Locate the cell containing the following code:

# ############ #

# Save to disk #

# ############ #

# save the model and pre-processing scaler to disk
dnn.save(os.path.join(home_dir, 'Data\\Model\\dnn_weights.h5'))
pickle.dump(scaler, open(os.path.join(home_dir, 'Data\\Model\\dnn_scaler.pkl'), 'wb'))

Once the model is trained, we must save it somewhere. This cell saves it to the Model subfolder of the Data folder in your home directory.

23 Execute this cell. Saving may take a few moments.

24 Click Kernel, then click Shutdown. Confirm the shutdown.

25 Close the browser tab.

26 In File Explorer, locate the Model folder. You should see two saved model files, as shown below:

• “dnn_scaler.pkl” contains the scaler, a pre-processing utility used to structure the data for the neural network.

• “dnn_weights.h5” contains the weights of the neural network.

Walkthrough: Evaluation

Now that we have trained our model, we can evaluate how well it predicts flight departure status.

To help with the discussion, we have calculated the Area Under the Curve (AUC) statistic for the model on each of the outcomes. We have used this statistic to plot the Receiver Operating Characteristic (ROC) curve, which visualizes the ratio of true positives (correctly predicted examples) to false positives (incorrectly predicted examples). An excellent model should look like a sharp curve approaching the upper left-hand corner of the graph. The dashed line is a reference point for a model that randomly guesses the outcome.

The first ROC graph visualizes the AUC for the canceled flights. The model’s AUC score of 0.71 means it did much better than random guesses. A more accurate model would have an AUC score even closer to 1, and have a curve more sharply sloped towards the upper left-hand corner of the graph.

Next, we have the AUC for delayed flights, which came in at 0.64.

And finally, we have the AUC for on-time flights which was also 0.64.

Before putting these models into production, we want to spend more time comprehensively assessing the performance of the model under a number of different assumptions.

Let’s assume for this workshop that we have successfully tested this model.

Congratulations! You just trained a model in Python.

Let’s now use this model to predict departure statuses for flights that we have not yet observed.

Predict flight status

We’ll use our previously trained neural network to predict the departure status of unobserved flights. To do this, we’ll open our 05_predict.ipynb script. This script acquires the new data, applies the neural network to the new data, and then creates a cube containing predictions inside of MicroStrategy, which we can use inside a dossier or dashboard.

1 In the Jupyter notebook main page, click 05_predict.

The notebook opens.

2 Locate the cell at the top containing the following code:

import pandas as pd
import numpy as np
import os
import sys
import gc
import pickle

from keras.models import load_model
from mstrio import microstrategy

This cell will import the libraries we need for the notebook. Notice the return of mstrio, as we will be interacting with MicroStrategy towards the end of this phase.

3 Execute this cell.

4 Locate the cell containing the following:

# helper function for returning the target variables from the training data
def get_targets(df):
    targets = ['cancelled', 'delayed', 'on_time']
    return df.filter(items=targets, axis=1)

This filters out the dependent variables (“cancelled,” “delayed,” and “on time”) from our test dataset.

5 Execute this cell.

6 Locate the cell containing the following code:

# helper function for dropping columns that we do not wish to train the model on
def drop_cols(df):
    drop = ['month', 'day', 'day_of_week', 'airline', 'flight_number', 'iata_code_orig',
            'state_orig', 'iata_code_dest', 'state_dest', 'origin_airport',
            'destination_airport', 'scheduled_departure', 'departure_delay',
            'scheduled_time', 'distance', 'scheduled_arrival', 'FL_ID',
            'cancelled', 'delayed', 'on_time']
    return df.drop(drop, axis=1)

This cell will drop the columns the model was not trained on, so the prediction inputs match the training inputs.

7 Execute this cell.

8 Locate the cell containing the following code:

# set home directory
home_dir = "C:\\Users\\mstr\\Desktop\\Demo"

# set seed for reproducibility
np.random.seed(91919)

In this cell, we set the path to our files and the initializing pseudo-random seed. Note that you are using the same number as before.

9 Execute this cell.

10 Locate the cell containing the following code:

# ######################################### #
# Load in the network and data preprocessor #
# ######################################### #

dnn = load_model(filepath=os.path.join(home_dir, 'Data\\Model\\dnn_weights.h5'))
scaler = pickle.load(open(os.path.join(home_dir, 'Data\\Model\\dnn_scaler.pkl'), 'rb'))

This cell will load in the model weights and the pre-processor scaler.

11 Execute this cell.

12 Locate the cell containing the following code:

# ################ #
# Load in the data #
# ################ #

with pd.HDFStore(os.path.join(home_dir, 'Data\\ready.h5')) as hdf:
    df = hdf.get(key="prod")

This cell loads the test flight data. Note that prod is a portion of the flights data for December 31st only.

13 Execute this cell.

14 Locate the cell containing the following code:

# ######### #
# Data prep #
# ######### #

new_x = scaler.transform(np.array(drop_cols(df=df)))
new_y = np.array(get_targets(df=df))

# clean-up
gc.collect()

This cell applies the transformation scaler to the data, just as we did when we trained our model. This is crucial, because the neural network expects the data to be in a certain distribution and range.

15 Execute this cell.

16 Locate the cell containing the following code:

# ######################################################## #
# Predict the outcome for each flight in test and oos sets #
# ######################################################## #

pred_score = dnn.predict(x=new_x, batch_size=1028).astype('float64')
pred_class = dnn.predict_classes(x=new_x, batch_size=1028).astype('float64')

Using this cell, you’ll create the predictions.

17 Execute this cell.

18 Locate the cell containing the following code:

# ####################################### #
# Append the new scores to the data frame #
# ####################################### #

# Map predicted class to label
df['predicted_status'] = np.where(pred_class == 0, 'Cancelled', np.where(pred_class == 1, 'Delayed', 'On-Time'))

# Add the probabilities for each status to the data frame
df['prob_cancel'] = pred_score[:, 0]
df['prob_delayed'] = pred_score[:, 1]
df['prob_ontime'] = pred_score[:, 2]

Now that we have generated predictions, we can add them to our flight data. This cell merges our observed data and our predictions, so we can compare the predictions against the true flight information.

19 Execute this cell.

20 Locate the cell containing the following code:

# Take a subset of all the columns before pushing to MSTR
cols_to_keep = ['FL_ID', 'month', 'day',
                'flight_number', 'origin_airport', 'destination_airport',
                'scheduled_departure', 'scheduled_arrival',
                'latitude_orig', 'longitude_orig',
                'latitude_dest', 'longitude_dest',
                'state_orig', 'state_dest',
                'predicted_status', 'prob_cancel', 'prob_delayed', 'prob_ontime']

df = df.filter(items=cols_to_keep, axis=1)

In this cell, we make a data frame containing the variables from our observations that we think are relevant for analysts and other MicroStrategy end users. Note that we have data on 190 different variables for each observation, and this cell will select a subset of eighteen of them.

21 Execute this cell.

22 Locate the cell containing the following code:

# ################################## #
# Push the results to MSTR as a cube #
# ################################## #

# API / server params
username = "mstr"
password = "password"
base_url = "https://env-XXXXX.customer.cloud.microstrategy.com/MicroStrategyLibrary/api"

project_name = "MicroStrategy Tutorial" # Tutorial Project

In this cell, we define the connection parameters for the Intelligence Server, which we will use with MicroStrategy's REST API to push our table from Python.

23 Replace the password value with the password you received in your MicroStrategy on AWS email.

24 In base_url, replace XXXXX with your environment number.

25 Execute the cell.

26 To preview our table, we can look at it in Python using a df.head() function. To do so, insert a cell below the one mentioned above and input the following code:

df.head()

27 Execute the cell.

You should see a table that contains our predictions about each flight’s departure status: the expected probabilities that a given flight will be canceled, delayed, or on time.

28 Locate the cell containing:

conn = microstrategy.Connection(base_url, username, password, project_name)
conn.connect()

This cell will log into MicroStrategy using the mstrio library.

29 Execute this cell.

30 Locate the cell containing the following code:

# Convert the data frame to a cube and send it
newDatasetId, newTableId = conn.create_dataset(data_frame=df, dataset_name="Flights", table_name="Predictions")
print(newDatasetId)
print(newTableId)

This cell takes the data frame from Python, converts it into a format accessible to MicroStrategy, and sends it to the Intelligence Server in a new cube called Flights.

31 Execute this cell.

32 Locate the cell containing the following code:

# Log out

conn.close()

This cell closes our connection to the MicroStrategy Intelligence Server.

33 Click Kernel, then click Shutdown. Confirm the shutdown.

34 Close the browser tab.

Preview the new cube

Let’s see if we can access the new flight status cube.

1 Log into MicroStrategy Web using the credentials you received in your MicroStrategy on AWS email.

https://env-XXXXX.customer.cloud.microstrategy.com/MicroStrategy/servlet/mstrWeb, where XXXXX represents the environment number you used previously.

2 Select the MicroStrategy Tutorial folder, and then select MicroStrategy Web.

3 Click My Reports.

You should see a new cube called Flights, as shown below.

4 Click Edit.

You should see the predicted status attribute. This cube will contain the same data and columns that you previewed in Python earlier.

With our predictions now accessible within MicroStrategy, analysts and other end users can make deployable dashboards and dossiers.

5 Click Cancel.

Walkthrough: Deployment

Our mission was to make these predictions available to our passengers. To do this, we took test data, predicted the flight status for each flight, and then made a cube in MicroStrategy containing those predictions. Next, we began to build a dossier to show the status probability for each flight.

You can imagine our dossier being presented to users on a large flight information screen, or using the MicroStrategy SDKs to integrate this data into a custom mobile application.

You’ve participated in a full life cycle for machine learning model development:

1 We identified a business problem to solve.

2 We identified relevant data to use.

3 We acquired that data, in this case from MicroStrategy cubes, and cleaned it.

4 We trained a model on that data.

5 We connected our modeling pipeline with our production BI system by pushing a cube containing predictions back into MicroStrategy, so that other business analysts and analytics consumers could use that data.

Conclusion

You have completed the Integrate Machine Learning Models with Python and MicroStrategy workshop!

For more information on using mstrio with Python and R, MicroStrategy offers the 11.711 Data Mining and Predictive Analytics course.

For resources and certifications with MicroStrategy Education, visit: https://www.microstrategy.com/us/services/education/.

1850 Towers Crescent Plaza | Tysons Corner, VA 22182 | Copyright ©2019. All Rights Reserved. microstrategy.com EV-COLL-0969