
MATRICES AND MODELS
• The C language provides only the most basic of basics, such as addition and division, and everything else is provided by a library. So before you can do data-oriented mathematics, you will need a library to handle matrices and vectors.

• This book uses the GNU Scientific Library (GSL). The GSL is recommended because it is actively supported and will work on about as many platforms as C itself. Beyond functions useful for statistics, it also includes a few hundred functions useful in engineering and physics.

• This book co-evolved with the Apophenia library, which builds upon the GSL for more statistics-oriented work.

THE GSL’S MATRICES AND VECTORS
#include <apop.h>

int main(){
    gsl_matrix *m = gsl_matrix_alloc(20,15);
    gsl_matrix_set_all(m, 1);
    for (int i=0; i< m->size1; i++){
        Apop_matrix_row(m, i, one_row);
        gsl_vector_scale(one_row, i+1);
    }
    for (int i=0; i< m->size2; i++){
        Apop_matrix_col(m, i, one_col);
        gsl_vector_scale(one_col, i+1);
    }
    apop_matrix_show(m);
    gsl_matrix_free(m);
}

• The matrix is allocated in the introductory section, on line four. It is no surprise that it has alloc in the name, giving indication that memory is being allocated for the matrix.

• In this case, the matrix has 20 rows and 15 columns. Row always comes first, then column.

• Line five is the first matrix-level operation: set every element in the matrix to one.

• The rest of the file works one row or column at a time. The first loop, from lines six to nine, begins with the Apop_matrix_row macro to pull a single row, which it puts into a vector named one_row.

• Given the vector one_row, line eight multiplies every element by i+1. When this happens again by columns on line 12, we have a multiplication table.

• Line 14 displays the constructed matrix to the screen.
• Line 15 frees the matrix.
• The system automatically frees all matrices at the end of the program.

Naming conventions
• Every function in the GSL library will begin with gsl_, and the first argument of all of these functions will be the object to be acted upon.
• Most GSL functions that affect a matrix will begin with gsl_matrix_ and most that operate on vectors begin with gsl_vector_.
• 100% of Apophenia’s functions begin with apop_ and a great majority of them begin with a data type such as apop_data_ or apop_model_.
• GLib’s functions all begin with g_ followed by the object name: g_tree_, g_list_.

• If you find the naming scheme to be too verbose, you can write your own wrapper functions that require less typing. For example, you could write a file my_convenience_fns.c, which could include:

void mset(gsl_matrix *m, int row, int col, double data){
    gsl_matrix_set(m, row, col, data);
}
void vset(gsl_vector *v, int row, double data){
    gsl_vector_set(v, row, data);
}

• You would also need a header file, my_convenience_fns.h:

#include <gsl/gsl_matrix.h>
#include <gsl/gsl_vector.h>
void mset(gsl_matrix *m, int row, int col, double data);
void vset(gsl_vector *v, int row, double data);
#define VECTOR_ALLOC(vname, length) gsl_vector *vname = gsl_vector_alloc(length);
// For simple functions, you can rename them via #define; see page 212:
#define vget(v, row) gsl_vector_get(v, row)
#define mget(m, row, col) gsl_matrix_get(m, row, col)

• After throwing an #include "my_convenience_fns.h" at the top of your program, you will be able to use your abbreviated syntax such as vget(v, 3).

BASIC MATRIX AND VECTOR OPERATIONS

• The simplest operations on matrices and vectors are element-by-element operations such as adding the elements of one matrix to those of another. The GSL provides the functions you would expect to do such things

Apply and map
• The function in Listing 4.2 will take in a double indicating taxable income and will return US income taxes owed, assuming a head of household with two dependents taking the standard deduction (as of 2006; see Internal Revenue Service (2007)).

• This function can be applied to a vector of incomes to produce a vector of taxes owed.

• Example taxes

• strncpy(apop_opts.db_name_column, "geo_name", 100);
• If apop_opts.db_name_column is set (it defaults to being "row_names"), and the name of a column matches the name, then the row names are read from that column.

• Apop_col_t(d, "income", income_vector);
• #define APOP_COL_T( m, col, v )
• After this call, v will hold a vector view of the colth column of m.

• d->vector = apop_vector_map(income_vector, calc_taxes);
• gsl_vector* apop_vector_map(const gsl_vector * v, double(*)(double) fn )
• Map a function onto every element of a vector. The function that you input takes in a double and returns a double.

http://apophenia.info/structapop__opts__type.html

http://apophenia.info/group__mapply.html

• apop_name_add(d->names, "tax owed", 'v');
• int apop_name_add(apop_name * n, char const * add_me, char type )
• Adds a name to the apop_name structure. Puts it at the end of the given list.
• Parameters:
• n: An existing, allocated apop_name structure.
• add_me: A string. If NULL, do nothing; return -1.
• type: 'r': add a row name
'c': add a column name
't': add a text category name
'h': add a title (or a header. 't' is taken).
'v': add (or overwrite) the vector name

http://apophenia.info/types_8h.html

http://apophenia.info/structapop__name.html



• void apop_data_show(const apop_data * in)
• This function prettyprints the apop_data set to a screen.

http://apophenia.info/group__output.html

http://apophenia.info/structapop__data.html


• void apop_vector_apply(gsl_vector * v, void(*)(double *) fn )
• Apply a function to every element of a vector. The function that you input takes in a pointer to a double and returns nothing. apop_vector_apply will send a pointer to each element of your vector to your function.

http://apophenia.info/group__mapply.html

apop_data
• The apop_data set naturally represents a data set. It turns out that a lot of real-world data processing is about quotidian annoyances about text versus numeric data or dealing with missing values, and the apop_data set and its many support functions are intended to make data processing in C easy.

• The structure basically includes six parts:
• a vector
• a matrix
• a grid of text elements
• a vector of weights
• names for everything: row names, a vector name, matrix column names, text names
• a link to a second page of data

http://apophenia.info/gentle.html


As per the example above, Apophenia will generally assume that one row across all of these elements describes a single observation or data point.

apop_data Struct Reference
• Data Fields
• gsl_vector * vector
• gsl_matrix * matrix
• apop_name * names
• char *** text
• size_t textsize [2]
• gsl_vector * weights
• struct apop_data * more
• char error



• The apop_data structure represents a data set. It primarily joins together a gsl_vector, a gsl_matrix, and a table of strings, then gives them all row and column names. It tries to be minimally intrusive, so you can use it everywhere you would use a gsl_matrix or a gsl_vector.

• Here is a diagram showing a sample data set with all of the elements in place. Together, they represent a data set where each row is an observation, which includes both numeric and text values, and where each row/column is named.


• Allocate using apop_data_alloc, free via apop_data_free, or more generally, see the apop_data_... section of the index (in the header links) for the many other functions that operate on this struct.

• There are various means of creating an apop_data set, including apop_query_to_data, apop_matrix_to_data, apop_vector_to_data, or creating a blank slate with apop_data_alloc.

• The example uses apop_query_to_data to read the table into an apop_data set, and by setting the apop_opts.db_name_column option to a column name on line 20, the query sets row names for the data.

• You can easily operate on the subelements of the structure. If your matrix_manipulate function requires a gsl_matrix, but your_data is an apop_data structure, then you can call matrix_manipulate(your_data->matrix).

apop_data *set = apop_query_to_text(...);
for (int r=0; r< set->textsize[0]; r++){
    for (int c=0; c< set->textsize[1]; c++)
        printf("%s\t", set->text[r][c]);
    printf("\n");
}

• apop_data * newdata_m = apop_data_alloc(vector_size, n_rows, n_cols);
• apop_data* apop_data_alloc(const size_t size1, const size_t size2, const int size3 )
• Allocate an apop_data structure, to be filled with data.
• The typical case is three arguments, like apop_data_alloc(2,3,4): vector size, matrix rows, matrix cols. If the first argument is zero, you get a NULL vector.

• Two arguments, apop_data_alloc(2,3), would allocate just a matrix, leaving the vector NULL.

• One argument, apop_data_alloc(2), would allocate just a vector, leaving the matrix NULL.

• Zero arguments, apop_data_alloc(), will produce a basically blank set, with out->matrix==out->vector==NULL.




http://apophenia.info/apop__data_8c.html


Get, set, and point
• There is a suite of functions for setting and getting an element from an apop_data set using the names.
• Let t be a title and i be a numeric index; then you may refer to the row–column coordinate using the (i, i), (t, i), (i, t), or (t, t) form:
apop_data_get(your_data, i, j);
apop_data_get_ti(your_data, "rowname", j);
apop_data_get_it(your_data, i, "colname");
apop_data_get_tt(your_data, "rowname", "colname");
apop_data_set(your_data, i, j, new_value);
apop_data_set_ti(your_data, "rowname", j, new_value);
...
apop_data_ptr(your_data, i, j);
apop_data_ptr_ti(your_data, "rowname", j);
...

• double apop_data_get(const apop_data * data, const size_t row, const int col, const char * rowname, const char * colname, const char * page )
• Returns the data element at the given point.
• In case of error (probably that you asked for a data point out of bounds), returns GSL_NAN.

• double* apop_data_ptr(apop_data * data, const int row, const int col, const char * rowname, const char * colname, const char * page )
• Get a pointer to an element of an apop_data set.
• If a NULL vector or matrix (as the case may be), stop (unless apop_opts.stop_on_warning='n', then return NULL).
• If the row/column you requested is outside the bounds of the matrix (or the name isn't found), always return NULL.

http://apophenia.info/group__data__set__get.html





• int apop_data_set(apop_data * data, const size_t row, const int col, const double val, const char * colname, const char * rowname, const char * page )
• Set a data element.
apop_data_set(d, 3, 8, 5);
apop_data_set(d, .row = 3, .col=8, .val=5);
apop_data_set(d, .row = 3, .colname="Column 8", .val=5);
//but:
apop_data_set(d, .row = 3, .colname="Column 8", 5); //invalid---the value doesn't follow the colname.







Forming partitioned matrices
• You can copy the entire data set, stack two data matrices one on top of the other (stack rows), stack two data matrices one to the right of the other (stack columns), or stack two data vectors:
apop_data *newcopy = apop_data_copy(oldset);
apop_data *newcopy_tall = apop_data_stack(oldset_one, oldset_two, 'r');
apop_data *newcopy_wide = apop_data_stack(oldset_one, oldset_two, 'c');
apop_data *newcopy_vector = apop_data_stack(oldset_one, oldset_two, 'v');

• apop_data* apop_data_copy(const apop_data * in)
• Copy one apop_data structure to another. That is, all data is duplicated.
• Parameters:
• in: the input data
• Returns: a structure that this function will allocate and fill. If input is NULL, then this will be NULL.





• apop_data* apop_data_stack(apop_data * m1, apop_data * m2, char posn, char inplace )
• Put the first data set either on top of or to the left of the second data set.
• m1: the upper/leftmost data set (default = NULL)
• m2: the second data set (default = NULL)
• posn: if 'r', stack rows of m1's matrix above rows of m2's; if 'c', stack columns of m1's matrix to left of m2's (default = 'r')
• inplace: if 'i', 'y', or 1, use apop_matrix_realloc and apop_vector_realloc to modify m1 in place; see the caveats on those functions. Otherwise, allocate a new vector, leaving m1 unmolested. (default = 'n')







Copying structures
• void * memmove ( void * destination, const void * source, size_t num );
• Move block of memory
• Copies the values of num bytes from the location pointed to by source to the memory block pointed to by destination. Copying takes place as if an intermediate buffer were used, allowing the destination and source to overlap.
complex first = {.real = 3, .imaginary = -1};
complex second;
memmove(&second, &first, sizeof(complex));
• The computer will go to the location of first and blindly copy what it finds to the location of second, up to the size of one complex struct.
• Since first and second now have identical data, their constituent parts are guaranteed to also be identical.

• void * memcpy ( void * destination, const void * source, size_t num );
• Copy block of memory
• Copies the values of num bytes from the location pointed to by source directly to the memory block pointed to by destination.

• The gsl_..._memcpy functions assume that the destination to which you are copying has already been allocated; this allows you to reuse the same space and otherwise carefully oversee memory.

gsl_vector *copy = gsl_vector_alloc(original->size);
gsl_vector_memcpy(copy, original);
gsl_vector *copy2 = apop_vector_copy(original);
• The functions for allocating memory to a vector follow the style of malloc and free. In addition they also perform their own error checking. If there is insufficient memory available to allocate a vector then the functions call the GSL error handler (with an error number of GSL_ENOMEM) in addition to returning a null pointer.

gsl_matrix *copy = gsl_matrix_alloc(original->size1, original->size2);
gsl_matrix_memcpy(copy, original);
gsl_matrix *copy2 = apop_matrix_copy(original);
• int gsl_matrix_memcpy (gsl_matrix * dest, const gsl_matrix * src)
• This function copies the elements of the matrix src into the matrix dest. The two matrices must have the same size.

• gsl_matrix* apop_matrix_copy(const gsl_matrix * in)
• Copy one gsl_matrix to another. That is, all data is duplicated. Unlike gsl_matrix_memcpy, this function allocates and returns the destination, so you can use it like this:
gsl_matrix *a_copy = apop_matrix_copy(original);
• Parameters:
• in: the input data
• Returns: a structure that this function will allocate and fill. If gsl_matrix_alloc fails, returns NULL.

http://apophenia.info/group__convenience__fns.html


apop_data *copy1 = apop_data_alloc(original->vector->size, original->matrix->size1, original->matrix->size2);
apop_data_memcpy(copy1, original);
apop_data *copy2 = apop_data_copy(original);

• void apop_data_memcpy(apop_data * out, const apop_data * in )
• Copy one apop_data structure to another. That is, all data on the first page is duplicated.

//Copy the contents of row i of mydata to row j.
Apop_data_row(mydata, i, fromrow);
Apop_data_row(mydata, j, torow);
apop_data_memcpy(torow, fromrow);





http://apophenia.info/stats_8h.html



• apop_data* apop_data_copy(const apop_data * in)• Copy one apop_data structure to another. That is, all data is duplicated.





Function call
• These are functions designed to convert one format to another.
• There are two ways to express a matrix of doubles. The analog to using a pointer is to declare a list of pointers-to-pointers, and the analog to an automatically allocated array is to use double-subscripts:

double **method_one = malloc(sizeof(double*)*size_1);
for (int i=0; i< size_1; i++)
    method_one[i] = malloc(sizeof(double)*size_2);
double method_two[size_1][size_2] = {{2,3,4},{5,6,7}};

• The first method is rather inconvenient. The second method seems convenient, because it lets you allocate the matrix at once.

• apop_text_to_db("original.txt", "tablename", 0 , 1, NULL);
• Read a text file into a database table.
• The first number states whether the file has row names; the second whether it has column names. Finally, if no colnames are present, you can provide them in the last argument as a char**.

• apop_data* copyd = apop_text_to_data("original.txt", 0 , 1);
• Read a delimited text file into the matrix element of an apop_data set.


• double original[] = {2,3,4,5,6,7};
• gsl_vector * copv = apop_array_to_vector(original, original_size);
• Just copies a one-dimensional array to a gsl_vector. The input array is undisturbed.

• gsl_matrix * copm = apop_array_to_matrix(original, original_size1, original_size2);
• Just copies a two-dimensional array, such as {{2,3,4}, {5,6,7}}, to a gsl_matrix. The input array is undisturbed.

• double original[] = {2,3,4,5,6,7};
• int orig_vsize = 0, orig_size1 = 2, orig_size2 = 3;
• gsl_matrix * copym = apop_line_to_matrix(original, orig_size1, orig_size2);
• gsl_matrix* apop_line_to_matrix(double * line, int rows, int cols )
• Convert a double * array to a gsl_matrix. Input data is copied.
• Returns: the gsl_matrix, allocated for you and ready to use.

http://apophenia.info/group__conversions.html

• apop_data * copyd = apop_line_to_data(original, orig_vsize, orig_size1, orig_size2);
• apop_data* apop_line_to_data(double * in, int vsize, int rows, int cols )
• A convenience function to convert a double * array to an apop_data set. It will have no names. The input data is copied, not pointed to.




• double *copyd = apop_vector_to_array(original_vec);
• Convert a gsl_vector to a double * array.

• gsl_matrix * copym = apop_vector_to_matrix(original_vec);
• This function copies the data in a vector to a new one-column (or one-row) matrix and returns the newly-allocated and filled matrix.
• Returns: a newly-allocated gsl_matrix with one column (or row).

• apop_data * copydv = apop_vector_to_data(original_vec);
• apop_data* apop_vector_to_data(gsl_vector * v)
• Wrap an apop_data structure around an existing gsl_vector. The vector is not copied, but is pointed to by the new apop_data struct.
• Parameters:
• v: the data vector
• Returns: an allocated, ready-to-use apop_data structure.






• apop_data * copydm = apop_matrix_to_data(original_matrix);
• apop_data* apop_matrix_to_data(gsl_matrix * m)
• Wrap an apop_data structure around an existing gsl_matrix. The matrix is not copied, but is pointed to by the new apop_data struct.





Printing
• Apophenia’s printing functions are actually four-in-one functions: you can dump your data to either the screen, a file, a database, or a system pipe.
• At first, you will want to print all of your results to the screen; later, you will want to save temporary results to the database; and next month, a colleague will ask for a text file of the output. You can make all of these major changes in output by changing one character in your code.

• The four choices for the apop_opts.output_type variable are:
apop_opts.output_type = 's'; //default: print to screen.
apop_opts.output_type = 'f'; //print to file.
apop_opts.output_type = 'd'; //store in a database table.
apop_opts.output_type = 'p'; //write to the pipe in apop_opts.output_pipe.

Querying
• The only way to get data out of a database is to query it out.
apop_query("create table copy as \
select * from original");
• int apop_query(const char * fmt, ... )
• Send a query to the database that returns no data.
• Returns: 0 on success, 1 on failure.

http://apophenia.info/group__queries.html

• apop_data * d = apop_query_to_data("select * from original");
• apop_data* apop_query_to_data(const char * fmt, ... )
• Queries the database, and dumps the result into an apop_data set.
• If apop_opts.db_name_column is set (it defaults to being "row_names"), and the name of a column matches the name, then the row names are read from that column.

• As with the other apop_query_to_... functions, the query can include printf-style format specifiers, such as apop_query_to_data("select age from %s where id=%i;", tablename, id_number).

• Returns: If no rows are returned, NULL; else an apop_data set with the data in place. Most data will be in the matrix element of the output. Column names are appropriately placed. If apop_opts.db_name_column matches one of the fields in your query's output, then that column will be used for row names (and therefore will not appear in the matrix).





http://apophenia.info/structapop__opts__type.html


• double d = apop_query_to_float("select value from original");
• double apop_query_to_float(const char * fmt, ... )
• Queries the database, and dumps the result into a single double-precision floating point number.

• gsl_vector *v = apop_query_to_vector("select * from original");
• gsl_vector* apop_query_to_vector(const char * fmt, ... )
• Queries the database, and dumps the first column of the result into a gsl_vector.

• gsl_matrix * m = apop_query_to_matrix("select * from original");
• gsl_matrix* apop_query_to_matrix(const char * fmt, ... )
• Queries the database, and dumps the result into a matrix.


Views
• Pointers make it reasonably easy and natural to look at subsets of a matrix. Do you want a matrix that represents X with the first row lopped off?

• Then just set up a matrix whose data pointer points to the second row. Since the new matrix is pointing to the same data as the original, any changes will affect both matrices, which is often what you want; if not, then you can copy the submatrix’s data to a new location.

• It is not quite as easy as just finding the second row and pointing to it, since a gsl_matrix includes information about your data (i.e., metadata), such as the number of rows and columns. Thus, there are a few macros to help you pull a row, column, or submatrix from a larger matrix.

Apop_matrix_row(m, 3, row_v);
Apop_matrix_col(m, 5, col_v);
Apop_submatrix(m, 2, 4, 6, 8, submatrix);

• Given that m is a gsl_matrix*, these macros will produce a gsl_vector* named row_v holding the third row, another named col_v holding the fifth column, and a 6×8 gsl_matrix* named submatrix whose (0, 0)th element is at (2, 4) in the original.

LINEAR ALGEBRA
• Consider a transition matrix, showing whether the system can go from a row state to a column state. For example, Figure 4.4 was such a transition matrix, showing which formats can be converted to which other formats.

• Marking each transition with a one and each dot in Figure 4.4 with a zero, we get the following transition matrix:

#include <apop.h>
int main(){
    apop_data *t = apop_text_to_data("data-markov", 0, 0);
    apop_data *out = apop_dot(t, t, 0, 0);
    apop_data_show(out);
}

• The apop_dot function takes up to four arguments: two apop_data structures, and one flag for each matrix indicating what to do with it ('t' = transpose the matrix, 'v' = use the vector element, 0 = use the matrix as-is).

• If X is a matrix, then apop_dot(X, X, 't', 0); will find X′X: the function takes the dot product of X with itself, and the first version is transposed and the second is not.

• If a data set has a matrix component, then it will be used for the dot product, and if the matrix element is NULL then the vector component is used.

Regression
• Before too long the sky’s overcast, temperatures are dipping, and it looks like rain. Even worse, ticket sales are hit. The guys are in trouble, and they can’t afford for this to happen again.

• What the guys want is to be able to predict what concert attendance will be given predicted hours of sunshine. That way, they’ll be able to gauge the impact an overcast day is likely to have on attendance. If it looks like attendance will fall below 3,500 people, the point where ticket sales won’t cover expenses, then they’ll cancel the concert.

• They need your help.

Let’s analyze sunshine and attendance

• Here’s sample data showing the predicted hours of sunshine and concert attendance for different events. How can we use this to estimate ticket sales based on the predicted hours of sunshine for the day?

• The problem this time is, what would we find the mean and standard deviation of? Would we use the concert attendance as the basis for our calculations, or would we use the hours of sunshine? Neither one of them gives us all the information that we need. Instead of considering just one set of data, we need to look at both.

• We’ve looked at independent random variables, but not ones that are dependent. We can assume that if the weather is poor, the probability of high attendance at an open air concert will be lower than if the weather is sunny.

• But how do we model this connection, and how do we use this to predict attendance based on hours of sunshine?

Exploring types of data
• Univariate data concerns the frequency or probability of a single variable. As an example, univariate data could describe the winnings at a casino or the weights of brides in Statsville. In each case, just one thing is being described.

• What univariate data can’t do is show you connections between sets of data. For example, if you had univariate data describing the attendance figures at an open air concert, it wouldn’t tell you anything about the predicted hours of sunshine on that day. It would just give you figures for concert attendance.

• So what if we do need to know what the connection is between variables? While univariate data can’t give us this information, there’s another type of data that can: bivariate data.

• Bivariate data gives you the value of two variables for each observation, not just one. As an example, it can give you both the predicted hours of sunshine and the concert attendance for a single event or observation, like this.

• If one of the variables has been controlled in some way or is used to explain the other, it is called the independent or explanatory variable. The other variable is called the dependent or response variable.

• In our example, we want to use sunshine to predict attendance, so sunshine is the independent variable, and attendance is the dependent.

Visualizing bivariate data
• Just as with univariate data, you can draw charts for bivariate data to help you see patterns. Instead of plotting a value against its frequency or probability, you plot one variable on the x-axis and the other variable against it on the y-axis. This helps you to visualize the connection between the two variables.

• This sort of chart is called a scatter diagram or scatter plot.
• The independent variable normally goes along the x-axis, leaving the dependent variable to go on the y-axis.
• Once you’ve drawn your axes, you then take the values for each observation and plot them on the scatter plot.

• [Figure: scatter plot showing the number of hours of sunshine and concert attendance figures for particular events or observations.]

• As the predicted number of hours sunshine is the independent variable, we’ve plotted it on the x-axis. The concert attendance is the dependent variable, so that’s on the y-axis.

• Can you see how the scatter diagram helps you visualize patterns in the data?

• Can you see how this might help us to define the connection between open air concert attendance and predicted number of hours sunshine for the day?

Scatter diagrams show you patterns

• Scatter diagrams are useful because they show the actual pattern of the data. They enable you to more clearly visualize what connection there is between two variables.

• The scatter diagram for the concert data shows a distinct pattern: the data points are clustered along a straight line. We call this a correlation.

• Scatter diagrams show the correlation between pairs of values.
• Correlations are mathematical relationships between variables.
• The correlation is said to be linear if the scatter diagram shows the points lying in an approximately straight line.

• [Figures: positive linear correlation; negative linear correlation.]

We need to predict the concert attendance

• What we need to do next is see how we can use the data to make predictions for concert attendance, based on predicted hours of sunshine.

Predict values with a line of best fit
• You’ve seen how scatter diagrams can help you see whether there’s a correlation between values, by showing you if there’s some sort of pattern.

• But how can you use this to predict concert attendance, based on the predicted amount of sunshine? How would you use your existing scatter diagram to predict the concert attendance if you know how many hours of sunshine are expected for the day?

• One way of doing this is to draw a straight line through the points on the scatter diagram, making it fit the points as closely as possible. You won’t be able to get the straight line to go through every point, but if there’s a linear correlation, you should be able to make sure every point is reasonably close to the line you draw.

Your best guess is still a guess
• Imagine if you asked three different people to draw what each of them thinks is the line of best fit for the open air concert data. It’s quite likely that each person would come up with a slightly different line of best fit, like this:

We need to find the equation of the line

• The equation for a straight line takes the form y = a + bx, where a is the point where the line crosses the y-axis, and b is the slope of the line. This means that we can write the line of best fit in the form y = a + bx.

• In our case, we’re using x to represent the predicted number of hours of sunshine, and y to represent the corresponding open air concert attendance figures. If we can use the concert attendance data to somehow find the most suitable values of a and b, we’ll have a reliable way to find the equation of the line, and a more reliable way of predicting concert attendance based on predicted hours of sunshine.

We need to minimize the errors
• The best fitting line is the one that most accurately predicts the true values of all the points. This means that for each known value of x, we need each of the y variables in the data set to be as close as possible to what we’d estimate them to be using the line of best fit.

• In other words, given a certain number of hours sunshine, we want our estimates for open air concert attendance to be as close as possible to the actual values.

• Let’s represent each of the y values in our data set using $y_i$, and its estimate using the line of best fit as $\hat{y}_i$.

• We need to minimize the total differences between $y_i$ and $\hat{y}_i$. We could try doing this by minimizing $\sum (y_i - \hat{y}_i)$.

Introducing the sum of squared errors

• Because positive and negative differences can cancel each other out, instead of the total distance between the actual and expected points, we need to add together the distances squared. That way, we make sure that all the values are positive.

• The total sum of the distances squared is called the sum of squared errors, or SSE. It’s given by: $\mathrm{SSE} = \sum (y - \hat{y})^2$

Find the equation for the line of best fit

• We’ve said that we want to minimize the sum of squared errors, $\sum (y - \hat{y})^2$, where $\hat{y} = a + bx$. By doing this, we’ll be able to find optimal values for a and b, and that will give us the equation for the line of best fit.
• Let’s start with b.
• The value of b for the line y = a + bx gives us the slope, or steepness, of the line. In other words, b is the slope for the line of best fit.

Finding the slope for the line of best fit

NUMBERS
• Floating-point numbers can take several special values, the most important of which are INFINITY, -INFINITY, and NAN.
• NAN (read: not a number) is an appropriate way to represent missing data.
• Assigning double d = 1.0/0.0 will result in d == INFINITY, and d = 0.0/0.0 will result in d being set to NAN.
• Comparison to a NAN value always fails:
double blank = NAN;
blank == NAN;    // This evaluates to false.
blank == blank;  // This evaluates to false. (!)
isnan(blank);    // Returns nonzero: the correct way to check for a NaN value.

#include <math.h>  //NaN handlers
#include <stdio.h> //printf

int main(){
    double missing_data = NAN;
    double big_number = INFINITY;
    double negative_big_number = -INFINITY;
    if (isnan(missing_data))
        printf("missing_data is missing a data point.\n");
    if (isfinite(big_number) == 0)
        printf("big_number is not finite.\n");
    if (isfinite(missing_data) == 0)
        printf("missing_data isn't finite either.\n");
    //isinf is only guaranteed to return nonzero, so check the sign separately.
    if (isinf(negative_big_number) && negative_big_number < 0)
        printf("negative_big_number is negative infinity.\n");
}

PRECISION
• The computer basically stores non-integer numbers using scientific notation. For those who have forgotten this notation, π is written as 3.14159 × 10^0, or 3.14159e0, and 100π as 3.14159 × 10^2, or 3.14159e2.

• Your computer works in binary, so floating-point numbers (of type float and double) are of the form d × 2^n, where d is a string of ones and zeros and n is an exponent.

• The scale of a number is its overall magnitude, and is expressed by the exponent n.

• Because the scale is stored in the exponent, it is as easy to express three picometers (3e−12 meters) as three million kilometers (3e9 meters).

• There is a fixed space for d, and when that space is exceeded, n is adjusted to suit, but that change probably means a loss in precision for d.

• For example, say that the space for d is only three digits; then 5.89e0 × 892e0 = 525e1, though 5.89 × 892 ≈ 5,254. The final four was truncated to zero to fit d into its given space.

• The loss of precision becomes especially acute when multiplying together a long list of numbers. This matters for maximum likelihood estimation, because the likelihood function involves exactly such multiplication.

• Say we have a column of a thousand values, each around a half.
• Then the product of the thousand elements is about 2^−1000, which strains what a double can represent.
• For i > 1,000 (a modest number of data points) a double throws in the towel and calls 2^i = ∞ and 2^−i = 0.
• These are referred to as an overflow error and underflow error, respectively.

#include <math.h>
#include <stdio.h>
int main(){
    printf("Powers of two held in a double:\n");
    for(int i=0; i< 1400; i+=100)
        printf("%i\t %g \t %g\n", i, ldexp(1,i), ldexp(1,-i));
    printf("Powers of two held in a long double:\n");
    for(int i=0; i< 18000; i+=1000)
        printf("%i\t %Lg \t %Lg\n", i, ldexpl(1,i), ldexpl(1,-i));
}
• The solution to the problem of finding the product of a large number of elements is to calculate the log of the product rather than the product itself.

Standard Deviation and Variance

• Standard Deviation
• The Standard Deviation is a measure of how spread out numbers are.
• Its symbol is σ (the Greek letter sigma).
• The formula is easy: it is the square root of the Variance. So now you ask, "What is the Variance?"

• Variance
• The Variance is defined as: the average of the squared differences from the Mean.
• To calculate the variance follow these steps:
• Work out the Mean (the simple average of the numbers).
• Then for each number: subtract the Mean and square the result (the squared difference).
• Then work out the average of those squared differences.

http://www.mathsisfun.com/mean.html

http://www.mathsisfun.com/data/standard-deviation.html

Finding a Central Value

Probability
• Life is full of uncertainty.
• Sometimes it can be impossible to say what will happen from one minute to the next. But certain events are more likely to occur than others, and that’s where probability theory comes into play.

• Probability lets you predict the future by assessing how likely outcomes are, and knowing what could happen helps you make informed decisions.

• Probability is a way of measuring the chance of something happening. You can use it to indicate how likely an occurrence is

• An event is any occurrence that has a probability attached to it; in other words, an event is any outcome where you can say how likely it is to occur.

• The roulette wheel used in Fat Dan’s Casino has 38 pockets that the ball can fall into. The main pockets are numbered from 1 to 36, and each pocket is colored either red or black.

• There are two extra pockets numbered 0 and 00. These pockets are both green

• You can place all sorts of bets with roulette. For instance, you can bet on a particular number, whether that number is odd or even, or the color of the pocket.

• You’ll hear more about other bets when you start playing. One other thing to remember: if the ball lands on a green pocket, you lose


Find roulette probabilities

• S is known as the possibility space, or sample space. It’s a shorthand way of referring to all of the possible outcomes. Possible events are all subsets of S.

You can visualize probabilities with a Venn diagram

Complementary events
• There’s a shorthand way of indicating the event that A does not occur: A′. A′ is known as the complementary event of A.
• There’s a clever way of calculating P(A′). A′ covers every possibility that’s not in event A, so between them, A and A′ must cover every eventuality.
• If something’s in A, it can’t be in A′, and if something’s not in A, it must be in A′. This means that if you add P(A) and P(A′) together, you get 1. In other words, there’s a 100% chance that something will be in either A or A′.
• This gives us P(A) + P(A′) = 1

• P(Green): 2 of the pockets are green, and there are 38 pockets total, so: Probability = 2/38 = 0.053 (to 3 decimal places)

• P(9): The probability of getting a 9 is exactly the same as getting a 7, as there’s an equal chance of the ball falling into each pocket. Probability = 1/38 = 0.026 (to 3 decimal places)

• P(Black): 18 of the pockets are black, and there are 38 pockets, so: Probability = 18/38 = 0.474 (to 3 decimal places)

• P(38): This event is actually impossible: there is no pocket labeled 38. Therefore, the probability is 0.

• There must be a fix! The probability of getting a black is far higher than getting a green or 0. What went wrong? I want to win!
• Probabilities are only indications of how likely events are; they’re not guarantees.
• The important thing to remember is that a probability indicates a long-term trend only. If you were to play roulette thousands of times, you would expect the ball to land in a black pocket in 18/38 of the spins, approximately 47% of the time, and in a green pocket in 2/38 of the spins, or 5% of the time.

• Even though you’d expect the ball to land in a green pocket relatively infrequently, that doesn’t mean it can’t happen

Let’s bet on an even more likely event

• Let’s look at the probability of an event that should be more likely to happen. Instead of betting that the ball will land in a black pocket, let’s bet that the ball will land in a black or a red pocket.

• To work out the probability, all we have to do is count how many pockets are red or black, then divide by the number of pockets. Sound easy enough?

• Take a look at your roulette board. There are only three colors for the ball to land on: red, black, or green. As we’ve already worked out what P(Green) is, we can use this value to find our probability without having to count all those black and red pockets.

• P(Black or Red) = P(Green′)
= 1 - P(Green)
= 1 - 0.053
= 0.947 (to 3 decimal places)

• Parameter: A characteristic that describes a population is called a parameter. Because it is often difficult (or impossible) to measure an entire population, parameters are most often estimated.

• Parameters are usually written using Greek letters. I’ve already taught you about two: the population mean, µ, and the population standard deviation, σ.

MODELS
• The goal is to estimate the parameters of a model using data.
• The Apophenia library provides functions and data structures at exactly this level of abstraction, in the form of the apop_model and apop_data structures and the functions that operate on them.

• The apop_model structure provides similar forms of strength through constraint: it encapsulates model information in a uniform manner, and allows models to be used interchangeably in functions that can take any model as an input.

• A great deal of statistical work consists of converting or combining existing models to form new ones. That is, models can be filtered to produce models just as data can be filtered to provide new information.

• We can read estimation as filtering an un-parameterized model into a parameterized one.

• Bayesian updating takes in a prior model, a likelihood function, and data, and outputs a new model—which can then be used as the input to another round of filtering when new data comes in.

• Another example, discussed below, is the imposition of a constraint: begin by estimating a general model, then generate a new model with a constraint imposed on some of the parameters, and re-estimate.

• The difference in log likelihoods of the constrained and unconstrained models can then be used for hypothesis testing

The structure of the model struct
• A model intermediates between data and parameters.
• The model can go in three directions:
• i) X ⇒ β: Given data, estimate parameters.
• ii) β ⇒ X: Given parameters, generate artificial data (e.g., make random draws from the model, or find the expected value).
• iii) (X, β) ⇒ p: Given both data and parameters, estimate their likelihood or probability.

• Form (i) is the descriptive problem, such as estimating a covariance or OLS parameters.

• Monte Carlo methods use form (ii): producing a few million draws from the model given fixed parameters.

• Bayesian estimation is based on form (iii), describing a posterior probability given both data and parameters, as are the likelihoods in maximum likelihood estimation

• There are apop_models already written, including distributions like the Normal, Multivariate Normal, Gamma, Zipf, et cetera, and generalized linear models like OLS, WLS, probit, and logit. Because they are in a standardized form, they can be sent to model-handling functions, and be applied to data in sequence.

• For example, you can fit the data to a Gamma, a Lognormal, and an Exponential distribution and compare the outcomes

• Every model can be estimated via a form such as:
apop_model * est = apop_estimate(data, apop_normal);

• The apop_data structure represents a data set (of course). Data sets are inherently complex, but there are many functions that act on apop_data sets to make life easier.

• The apop_model encapsulates the sort of actions one would take with a model, like estimating model parameters or predicting values based on new inputs.

• Databases are great, and a perfect fit for the sort of paradigm here. Apophenia provides functions to make it easy to jump between database tables and apop_data set



http://apophenia.info/structapop__model.html


• You could use Apophenia for simple stats-package--like fitting of models, where the user gathers data, cleans it, and runs a series of regressions. Or you could use the library as input to the design of other systems, like fitting a model and then using the fitted model to generate agents in your simulation, or designing hierarchical models built from simpler base models

The workflow of a typical fitting-a-model project using Apophenia's tools goes something like this:

• Read the raw data into the database using apop_text_to_db.
• Use SQL queries handled by apop_query to massage the data as needed.
• Use apop_query_to_data to pull some of the data into an in-memory apop_data set.
• Call a model estimation such as apop_estimate (data_set, apop_ols) or apop_estimate (data_set, apop_probit) to fit parameters to the data. This will return an apop_model with parameter estimates.
• Interrogate the returned estimate, by dumping it to the screen with apop_model_print, sending its parameters and variance-covariance matrices to additional tests (the estimate step runs a few for you), or sending the model's output to be input to another model.





http://apophenia.info/group__models.html






http://apophenia.info/group__output.html

#include <apop.h>
int main(void)
{
    apop_text_to_db(.text_file="data", .tabname="d");
    apop_data *data = apop_query_to_data("select * from d");
    apop_model *est = apop_estimate(data, apop_ols);
    apop_model_show(est);
}
• To run this, you will need a file named data in comma-separated form. The first column is the dependent variable; the remaining columns are the independent variables:







If you saved the code to sample.c, then you can compile it with
gcc sample.c -std=gnu99 -lapophenia -lgsl -lgslcblas -lsqlite3 -o run_me
and then run it with ./run_me.

NETWORK DATA
• With data describing ranks (score of first place, second place, . . . ) things get more interesting, because such data often appears in multiple forms. For example, say that we have a classroom where every student wrote down the ID number of his or her best friend, and we tallied this list of student numbers:

• 1 1 2 2 2 2 3 4 4 4 6 7 7 7.
• First, we would need to count how often each student appeared:

• In SQL:
select id_no, count(*) as ct
from surveys
group by id_no
order by ct desc
• If we were talking about city sizes (another favorite for rank-type analysis), we would list the size of the largest city, the second largest, et cetera. The labels are not relevant to the analysis; you would simply send the row of counts for most popular, second most popular, et cetera:

• 4 3 3 2 1 1.

• You can add groups of settings to a model to tweak its behavior. In the case of the models commonly used for rank analysis, you can signal to the model that it will be getting rank-ordered data. For example:

apop_model * rank_version = apop_model_copy(apop_zipf);
Apop_settings_add_group (rank_version, apop_rank, NULL);
apop_model_show(apop_estimate(ranked_draws, rank_version));

Maximum likelihood

• In statistics, maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model. When applied to a data set and given a statistical model, maximum-likelihood estimation provides estimates for the model's parameters.

• The method of maximum likelihood corresponds to many well-known estimation methods in statistics.

http://en.wikipedia.org/wiki/Estimator

http://en.wikipedia.org/wiki/Parameter

http://en.wikipedia.org/wiki/Statistical_model


• For example, one may be interested in the heights of adult female penguins, but be unable to measure the height of every single penguin in a population due to cost or time constraints. Assuming that the heights are normally (Gaussian) distributed with some unknown mean and variance, the mean and variance can be estimated with MLE while only knowing the heights of some sample of the overall population.

• MLE would accomplish this by taking the mean and variance as parameters and finding particular parametric values that make the observed results the most probable (given the model).

http://en.wikipedia.org/wiki/Normal_distribution

http://en.wikipedia.org/wiki/Mean

http://en.wikipedia.org/wiki/Variance
