
M.Sc. I.T. Part I Semester I

Data Analysis Tools

MANUAL FOR PRACTICAL

2013 – 2014


M.Sc in Information Technology Part I

Course III : Data Analysis Tools

Practical based on the Book “Modelling with Data”

Practical Problems Prepared and Implemented by

Mr. Mahesh Naik, Valia College, Andheri

&

Mr. Jayesh Shinde, UDIT, Santacruz

Compiled By

R. Srivaramangai, UDIT, Santacruz


INDEX

1. List of Practical
2. Installation procedure for Cygwin
3. Installation procedure for Ubuntu
4. Practical 1
5. Practical 2
6. Practical 3
7. Practical 4
8. Practical 5
9. Practical 6
10. Practical 7
11. Practical 8
12. Practical 9
13. Practical 10
14. References

List of Practical

1. SQL queries based on Unit I
   a. DDL commands of SQL
   b. Select clause
      i. Simple select
      ii. Select queries with where clause
      iii. Select queries with arithmetic, relational and logical operators
      iv. Select queries with order by, group by, having, limit and offset
      v. Select queries with aggregation functions and distinct
      vi. Select queries with sub queries and joins

2. Implementing gsl matrices and vectors
   a. Illustration of gsl matrix multiplication
   b. Illustration of gsl vector with database query embedded

3. Graph plotting
   a. Gnuplot for plotting vectors 1
   b. Gnuplot for plotting vectors 2
   c. Gnuplot for plotting vectors 3

4. Implementing statistical distributions
   Discrete distributions
   a) Bernoulli distribution
   b) Binomial distribution
   c) Poisson distribution
   d) Multinomial distribution
   e) Hypergeometric distribution
   Continuous distributions
   a) Normal distribution
   b) Lognormal distribution
   c) Gamma distribution
   d) Exponential distribution
   e) Beta distribution

5. Implementing regression and goodness of fit
   a. Implementing OLS regression
   b. Implementing goodness of fit (chi-square)

6. Illustrating the maximum likelihood

7. Generating random numbers with the Monte Carlo method using
   a. Exponential distribution
   b. Uniform distribution
   c. Binomial distribution

8. Implementing parametric testing
   a. Using t-test
   b. Using F-test

9. Illustrating the method of inference

10. Implementing non-parametric testing - ANOVA

Installation of cygwin

1) Download the Cygwin software from http://www.cygwin.com/

The most recent version of the Cygwin DLL is 1.7.20-1 (announcement: http://cygwin.com/ml/cygwin-announce/2013-06/msg00008.html).

2) Download the Apophenia library of functions from http://apophenia.info/

3) Install Cygwin by running its setup.exe (http://cygwin.com/setup.exe).

4) Cygwin offers numerous packages, so select those required for the practicals, namely the gcc compiler, make, gsl, gnuplot and sqlite.

5) The Apophenia library must now be added to the Cygwin installation. Installing Cygwin creates a cygwin folder on the C: drive. Within that folder, go to the home directory and then to your user subdirectory, for example C:\cygwin\home\yourname (here C:\cygwin\home\Jayesh).

6) Copy the Apophenia archive into that directory (Jayesh).

7) Double-click the Cygwin terminal icon. The terminal window opens and displays the present working directory.

8) Unpack the Apophenia library and move into its folder by typing:

tar xvzf apophenia-0.99-09_Jul_13.tgz
cd apophenia-0.99

9) Configure the library:

./configure

To test:

1. Once the Cygwin installation is complete, check it by compiling and running a test program.

2. Take a test program, for example abc.c.

3. Run the following commands in bash:

gcc -std=gnu99 abc.c -o abc.out -lapophenia -lgsl -lsqlite3
./abc.out


Ubuntu Installation as per the free download.

How to install SQLite on Ubuntu 13.04

1) Download the SQLite source archive sqlite-autoconf-3071700.tar.gz from http://www.sqlite.org.

2) After the download completes, copy sqlite-autoconf-3071700.tar.gz to the Home folder of Ubuntu 13.04.

3) Open the Terminal; it opens in the current directory. Extract the package with the command:

tar xvfz sqlite-autoconf-3071700.tar.gz

4) Extraction creates a folder named sqlite-autoconf-3071700 in the current directory.

5) Move into the new folder:

jayesh@jayesh-G31M-S2L:~$ cd sqlite-autoconf-3071700
jayesh@jayesh-G31M-S2L:~/sqlite-autoconf-3071700$

6) Configure the files present in the sqlite-autoconf-3071700 folder:

jayesh@jayesh-G31M-S2L:~/sqlite-autoconf-3071700$ ./configure

7) After the configuration has finished, build the package:

jayesh@jayesh-G31M-S2L:~/sqlite-autoconf-3071700$ sudo make

You will be asked for the password; type it and press the Enter key.

8) Install the built package:

jayesh@jayesh-G31M-S2L:~/sqlite-autoconf-3071700$ sudo make install

9) Refresh the shared-library cache:

jayesh@jayesh-G31M-S2L:~/sqlite-autoconf-3071700$ sudo ldconfig

How to install GSL on Ubuntu 13.04

1) Download the GSL source archive gsl-1.16.tar.gz from http://www.gnu.org/s/gsl/

2) After the download completes, copy gsl-1.16.tar.gz to the Home folder of Ubuntu 13.04.

3) Open the Terminal; it opens in the current directory. Extract the package with the command:

tar xvfz gsl-1.16.tar.gz

4) Extraction creates a folder named gsl-1.16 in the current directory.

5) Move into the new folder:

jayesh@jayesh-G31M-S2L:~$ cd gsl-1.16
jayesh@jayesh-G31M-S2L:~/gsl-1.16$

6) Configure the files present in the gsl-1.16 folder:

jayesh@jayesh-G31M-S2L:~/gsl-1.16$ ./configure

7) After the configuration has finished, build the package:

jayesh@jayesh-G31M-S2L:~/gsl-1.16$ sudo make

You will be asked for the password; type it and press the Enter key.

8) After make has finished, install GSL:

jayesh@jayesh-G31M-S2L:~/gsl-1.16$ sudo make install

9) Refresh the shared-library cache:

jayesh@jayesh-G31M-S2L:~/gsl-1.16$ sudo ldconfig

How to install Apophenia on Ubuntu 13.04

1) Download the Apophenia source archive apophenia-0.99.tar.gz from http://apophenia.info/

2) After the download completes, copy apophenia-0.99.tar.gz to the Home folder of Ubuntu 13.04.

3) Open the Terminal; it opens in the current directory. Extract the package with the command:

tar xvfz apophenia-0.99.tar.gz

4) Extraction creates a folder named apophenia-0.99 in the current directory.

5) Move into the new folder:

jayesh@jayesh-G31M-S2L:~$ cd apophenia-0.99
jayesh@jayesh-G31M-S2L:~/apophenia-0.99$

6) Configure the files present in the apophenia-0.99 folder:

jayesh@jayesh-G31M-S2L:~/apophenia-0.99$ ./configure

7) After the configuration has finished, build and install the library:

jayesh@jayesh-G31M-S2L:~/apophenia-0.99$ sudo make install

You will be asked for the password; type it and press the Enter key.

8) Refresh the shared-library cache:

jayesh@jayesh-G31M-S2L:~/apophenia-0.99$ sudo ldconfig

Installation of GNUPLOT On Ubuntu 13.04

sudo apt-get install gnuplot-x11


Practical No.1 - SQL queries based on Unit I

For all database related practical, create a database in Sqlite3

jayesh@jayesh-G31M-S2L:~$ sqlite3 testDB.db

SQLite version 3.7.17 2013-05-20 00:56:22

Enter ".help" for instructions

Enter SQL statements terminated with a ";"

To check whether the database has been created:

sqlite> .databases

seq name file

--- --------------- ----------------------------------------------------------

0 main /home/jayesh/testDB.db

sqlite>

Problem statement : To execute SQL queries in order to store and retrieve the data under study in a database. Sqlite is used for executing the queries.

i) Queries for performing DDL commands. DDL commands are used to create, modify and delete database objects. The data is stored in an RDBMS in the form of tables. Following are the queries to be performed for DDL commands in Sqlite

sqlite> CREATE TABLE COMPANY(

ID INT PRIMARY KEY NOT NULL,

NAME TEXT NOT NULL,

AGE INT NOT NULL,

ADDRESS CHAR(50),

SALARY REAL

);


sqlite> CREATE TABLE DEPARTMENT(

ID INT PRIMARY KEY NOT NULL,

DEPT CHAR(50) NOT NULL,

EMP_ID INT NOT NULL

);

You can verify that the tables have been created successfully using the SQLite .tables command:

sqlite>.tables

COMPANY DEPARTMENT

ii) Inserting values into the COMPANY and DEPARTMENT tables

INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (1, 'Paul', 32, 'California', 20000.00 );
INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (2, 'Allen', 25, 'Texas', 15000.00 );
INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (3, 'Teddy', 23, 'Norway', 20000.00 );
INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (4, 'Mark', 25, 'Rich-Mond ', 65000.00 );
INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (5, 'David', 27, 'Texas', 85000.00 );
INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (6, 'Kim', 22, 'South-Hall', 45000.00 );
INSERT INTO COMPANY VALUES (7, 'James', 24, 'Houston', 10000.00 );

INSERT INTO DEPARTMENT (ID, DEPT, EMP_ID) VALUES (1, 'IT Billing', 1 );
INSERT INTO DEPARTMENT (ID, DEPT, EMP_ID) VALUES (2, 'Engineering', 2 );
INSERT INTO DEPARTMENT (ID, DEPT, EMP_ID) VALUES (3, 'Finance', 7 );

iii) The SELECT clause is a data manipulation command used for retrieving data in the desired format from database objects. The various forms of the SELECT clause and their purposes are given below.

List all the records of the COMPANY table:

sqlite> SELECT * FROM COMPANY;

a) List all the records where AGE is greater than or equal to 25 AND SALARY is greater than or equal to 65000.00:

sqlite> SELECT * FROM COMPANY WHERE AGE >= 25 AND SALARY >= 65000;

List all the records where AGE is greater than or equal to 25 OR SALARY is greater than or equal to 65000.00:

sqlite> SELECT * FROM COMPANY WHERE AGE >= 25 OR SALARY >= 65000;

List all the records where AGE is not NULL. This returns every record, because no record has AGE equal to NULL:

sqlite> SELECT * FROM COMPANY WHERE AGE IS NOT NULL;

List all the records where NAME starts with 'Ki', whatever comes after 'Ki':

sqlite> SELECT * FROM COMPANY WHERE NAME LIKE 'Ki%';

List all the records where AGE is either 25 or 27:

sqlite> SELECT * FROM COMPANY WHERE AGE IN ( 25, 27 );

List all the records where AGE is neither 25 nor 27:

sqlite> SELECT * FROM COMPANY WHERE AGE NOT IN ( 25, 27 );

List all the records where AGE is between 25 and 27 (both inclusive):

sqlite> SELECT * FROM COMPANY WHERE AGE BETWEEN 25 AND 27;

List the AGE of every record, provided that at least one record has SALARY > 65000. The subquery does not refer to the outer row, so EXISTS is either true for all rows or false for all rows:

sqlite> SELECT AGE FROM COMPANY WHERE EXISTS (SELECT AGE FROM COMPANY WHERE SALARY > 65000);

Find the total salary for each name:

sqlite> SELECT NAME, SUM(SALARY) FROM COMPANY GROUP BY NAME;

Now add records so that the COMPANY table has several records with the same name:

INSERT INTO COMPANY VALUES (8, 'Paul', 24, 'Houston', 20000.00 );
INSERT INTO COMPANY VALUES (9, 'James', 44, 'Norway', 5000.00 );
INSERT INTO COMPANY VALUES (10, 'James', 45, 'Texas', 5000.00 );

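b) ORDER BY clause

sqlite> SELECT NAME, SUM(SALARY) FROM COMPANY GROUP BY NAME ORDER BY NAME;

At this point the COMPANY table has the following records:

ID   NAME    AGE   ADDRESS      SALARY
1    Paul    32    California   20000.0
2    Allen   25    Texas        15000.0
3    Teddy   23    Norway       20000.0
4    Mark    25    Rich-Mond    65000.0
5    David   27    Texas        85000.0
6    Kim     22    South-Hall   45000.0
7    James   24    Houston      10000.0
8    Paul    24    Houston      20000.0
9    James   44    Norway       5000.0
10   James   45    Texas        5000.0

c) HAVING clause. The following example displays the records for which the name count is less than 2:

sqlite> SELECT * FROM COMPANY GROUP BY name HAVING count(name) < 2;

The following example displays the records for which the name count is greater than 2:

sqlite> SELECT * FROM COMPANY GROUP BY name HAVING count(name) > 2;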
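What GROUP BY, SUM and HAVING compute can be traced by hand. The following is a minimal sketch in plain C, not how SQLite works internally: it holds the NAME and SALARY columns of the ten COMPANY rows in arrays and reproduces the grouping logic with loops. The function names sum_salary, count_name and starts_group are hypothetical helpers introduced only for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define ROWS 10

/* NAME and SALARY columns of the ten COMPANY rows inserted in this practical. */
static const char *names[ROWS] = {"Paul", "Allen", "Teddy", "Mark", "David",
                                  "Kim", "James", "Paul", "James", "James"};
static const double salaries[ROWS] = {20000, 15000, 20000, 65000, 85000,
                                      45000, 10000, 20000, 5000, 5000};

/* SUM(SALARY) for one group: add up every row whose NAME matches. */
double sum_salary(const char *name) {
    assert(name != NULL);
    double total = 0;
    for (int i = 0; i < ROWS; i++)
        if (strcmp(names[i], name) == 0) total += salaries[i];
    return total;
}

/* count(NAME) for one group. */
int count_name(const char *name) {
    assert(name != NULL);
    int n = 0;
    for (int i = 0; i < ROWS; i++)
        if (strcmp(names[i], name) == 0) n++;
    return n;
}

/* Returns 1 if row i is the first row with its NAME, i.e. it opens a group. */
int starts_group(int i) {
    for (int j = 0; j < i; j++)
        if (strcmp(names[j], names[i]) == 0) return 0;
    return 1;
}

#ifdef DEMO_MAIN
int main(void) {
    /* Emulates: SELECT NAME, SUM(SALARY) ... GROUP BY NAME HAVING count(NAME) > 2 */
    for (int i = 0; i < ROWS; i++)
        if (starts_group(i) && count_name(names[i]) > 2)
            printf("%s %g\n", names[i], sum_salary(names[i]));
    return 0;
}
#endif
```

Compiling with gcc -std=gnu99 -DDEMO_MAIN builds the small driver. Only James appears three times, so only that group passes the HAVING filter.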


d) Sort the result in ascending order by SALARY:

sqlite> SELECT * FROM COMPANY ORDER BY SALARY ASC;

e) Sort the result in descending order by NAME:

sqlite> SELECT * FROM COMPANY ORDER BY NAME DESC;

f) LIMIT restricts the result to the number of rows you want to fetch from the table. The following returns at most 6 rows:

sqlite> SELECT * FROM COMPANY LIMIT 6;

With OFFSET, rows are skipped first. The following skips 2 rows and then returns the next 3:

sqlite> SELECT * FROM COMPANY LIMIT 3 OFFSET 2;

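g) Joins

Cross join: every COMPANY row is paired with every DEPARTMENT row.

sqlite> SELECT EMP_ID, NAME, DEPT FROM COMPANY CROSS JOIN DEPARTMENT;

Inner join: only the pairs that satisfy the ON condition are returned.

sqlite> SELECT EMP_ID, NAME, DEPT FROM COMPANY INNER JOIN DEPARTMENT ON COMPANY.ID = DEPARTMENT.EMP_ID;

Left outer join: every COMPANY row is returned; EMP_ID and DEPT are NULL where no DEPARTMENT row matches.

sqlite> SELECT EMP_ID, NAME, DEPT FROM COMPANY LEFT OUTER JOIN DEPARTMENT ON COMPANY.ID = DEPARTMENT.EMP_ID;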
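The difference between the inner and the left outer join can be seen with a nested loop over the two tables. The sketch below is plain C over hard-coded copies of the ID, NAME, EMP_ID and DEPT columns; it is an illustration of the join semantics only, and the function name join_rows is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

#define N_COMPANY 10
#define N_DEPT 3

/* ID and NAME of the ten COMPANY rows; EMP_ID and DEPT of the DEPARTMENT rows. */
static const int company_id[N_COMPANY] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
static const char *company_name[N_COMPANY] = {"Paul", "Allen", "Teddy", "Mark",
    "David", "Kim", "James", "Paul", "James", "James"};
static const int dept_emp_id[N_DEPT] = {1, 2, 7};
static const char *dept_name[N_DEPT] = {"IT Billing", "Engineering", "Finance"};

/* Prints EMP_ID|NAME|DEPT for the join and returns the number of rows.
   left_outer = 0: inner join. left_outer = 1: left outer join. */
int join_rows(int left_outer) {
    assert(left_outer == 0 || left_outer == 1);
    int rows = 0;
    for (int c = 0; c < N_COMPANY; c++) {
        int matched = 0;
        for (int d = 0; d < N_DEPT; d++) {
            if (company_id[c] == dept_emp_id[d]) {   /* the ON condition */
                printf("%d|%s|%s\n", dept_emp_id[d], company_name[c], dept_name[d]);
                matched = 1;
                rows++;
            }
        }
        if (left_outer && !matched) {   /* keep the unmatched row, NULLs left empty */
            printf("|%s|\n", company_name[c]);
            rows++;
        }
    }
    return rows;
}

#ifdef DEMO_MAIN
int main(void) {
    printf("inner join: %d rows\n", join_rows(0));
    printf("left outer join: %d rows\n", join_rows(1));
    return 0;
}
#endif
```

With the data of this practical the inner join yields 3 rows (IDs 1, 2 and 7) and the left outer join yields all 10 COMPANY rows. A cross join would simply print every one of the 10 x 3 pairs.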


Practical 2

i) Multiplication Table

#include <apop.h>

int main(){
    gsl_matrix *m = gsl_matrix_alloc(20,15);
    gsl_matrix_set_all(m, 1);
    for (int i=0; i< m->size1; i++){
        Apop_matrix_row(m, i, one_row);
        gsl_vector_scale(one_row, i+1);
    }
    for (int i=0; i< m->size2; i++){
        Apop_matrix_col(m, i, one_col);
        gsl_vector_scale(one_col, i+1);
    }
    apop_matrix_show(m);
    gsl_matrix_free(m);
}

jayesh@jayesh-G31M-S2L:~$ gcc -std=gnu99 multiplicationtable.c -o multiplicationtable.out -lapophenia -lgsl -lsqlite3
jayesh@jayesh-G31M-S2L:~$ ./multiplicationtable.out
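The row and column scaling in the listing can be checked without any library. The following sketch repeats the same three passes on a plain C array (all ones, scale row i by i+1, scale column j by j+1), so entry (i, j) ends up as (i+1)(j+1). The function name fill_table is a hypothetical stand-in for the gsl calls.

```c
#include <assert.h>
#include <stdio.h>

#define ROWS 20
#define COLS 15

/* Builds the multiplication table with the same three passes as the listing. */
void fill_table(double t[ROWS][COLS]) {
    assert(t != NULL);
    for (int i = 0; i < ROWS; i++)            /* like gsl_matrix_set_all(m, 1) */
        for (int j = 0; j < COLS; j++) t[i][j] = 1;
    for (int i = 0; i < ROWS; i++)            /* scale row i by i+1 */
        for (int j = 0; j < COLS; j++) t[i][j] *= i + 1;
    for (int j = 0; j < COLS; j++)            /* scale column j by j+1 */
        for (int i = 0; i < ROWS; i++) t[i][j] *= j + 1;
}

#ifdef DEMO_MAIN
int main(void) {
    static double t[ROWS][COLS];
    fill_table(t);
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) printf("%g\t", t[i][j]);
        printf("\n");
    }
    return 0;
}
#endif
```

Compile with -DDEMO_MAIN to print the full 20 x 15 table.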


ii) The function calc_taxes takes a double indicating taxable income and returns the US income taxes owed, assuming a head of household with two dependents taking the standard deduction. The main program applies it to the median household income of each state.

#include <apop.h>

double calc_taxes(double income){
    double cutoffs[] = {0, 11200, 42650, 110100, 178350, 349700, INFINITY};
    double rates[] = {0, 0.10, .15, .25, .28, .33, .35};
    double tax = 0;
    int bracket = 1;
    income -= 7850;     //Head of household standard deduction
    income -= 3400*3;   //exemption: self plus two dependents.
    while (income > 0){
        tax += rates[bracket] * GSL_MIN(income, cutoffs[bracket]-cutoffs[bracket-1]);
        //remove only the width of the bracket just taxed, not the whole cutoff
        income -= cutoffs[bracket]-cutoffs[bracket-1];
        bracket ++;
    }
    return tax;
}

int main(){
    apop_db_open("data-census.db");
    strncpy(apop_opts.db_name_column, "geo_name", 100);
    apop_data *d = apop_query_to_data("select geo_name, Household_median_in as income\
            from income where sumlevel = '040'\
            order by household_median_in desc");
    Apop_col_t(d, "income", income_vector);
    d->vector = apop_vector_map(income_vector, calc_taxes);
    apop_name_add(d->names, "tax owed", 'v');
    apop_data_show(d);
}

jayesh@jayesh-G31M-S2L:~$ gcc -std=gnu99 taxes.c -o taxes.out -lapophenia -lgsl -lsqlite3
jayesh@jayesh-G31M-S2L:~$ ./taxes.out
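The bracket arithmetic is easier to check by hand when it is isolated from the database code. The sketch below is a standalone variant under stated assumptions: it uses the same cutoffs and rates, leaves out the deduction and exemptions, replaces GSL_MIN with a plain comparison, and subtracts the width of each bracket after taxing it. The name bracket_tax is hypothetical.

```c
#include <assert.h>
#include <math.h>    /* INFINITY */
#include <stdio.h>

/* Tax owed on income that is already net of deductions and exemptions. */
double bracket_tax(double taxable) {
    const double cutoffs[] = {0, 11200, 42650, 110100, 178350, 349700, INFINITY};
    const double rates[]   = {0, 0.10, 0.15, 0.25, 0.28, 0.33, 0.35};
    assert(sizeof cutoffs == sizeof rates);   /* one rate per cutoff */
    double tax = 0;
    for (int b = 1; taxable > 0; b++) {
        double width = cutoffs[b] - cutoffs[b-1];           /* size of bracket b */
        double slice = taxable < width ? taxable : width;   /* income inside it */
        tax += rates[b] * slice;
        taxable -= width;   /* what remains for the higher brackets */
    }
    return tax;
}

#ifdef DEMO_MAIN
int main(void) {
    printf("%g\n", bracket_tax(11200));
    printf("%g\n", bracket_tax(50000));
    return 0;
}
#endif
```

As a hand check, a taxable amount of 50000 is taxed as 11200 at 10%, 31450 at 15% and the remaining 7350 at 25%, which is 1120 + 4717.5 + 1837.5 = 7675. The last bracket has infinite width, so the loop always stops there.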


Practical 3

Plotting a vector

#include <apop.h>

void plot_matrix_now(gsl_matrix *data){
    static FILE *gp = NULL;
    if (!gp)
        gp = popen("gnuplot -persist", "w");
    if (!gp){
        printf("Couldn't open Gnuplot.\n");
        return;
    }
    fprintf(gp,"reset; plot '-' \n");
    apop_matrix_print(data, .output_pipe=gp);
    fflush(gp);
}

int main(){
    apop_db_open("data-climate.db");
    plot_matrix_now(apop_query_to_matrix("select (year*12+month)/12., temp from temp"));
}

jayesh@jayesh-G31M-S2L:~$ gcc -std=gnu99 pipeplot.c -o pipeplot.out -lapophenia -lgsl -lsqlite3
jayesh@jayesh-G31M-S2L:~$ ./pipeplot.out


Eigen vector

#include "eigenbox.h"

apop_data *query_data(){
    apop_db_open("data-census.db");
    return apop_query_to_data(" select postcode as row_names, "
        " m_per_100_f, population/1e6 as population, median_age "
        " from geography, income,demos,postcodes "
        " where income.sumlevel= '040' "
        " and geography.geo_id = demos.geo_id "
        " and income.geo_name = postcodes.state "
        " and geography.geo_id = income.geo_id ");
}

void show_projection(gsl_matrix *pc_space, apop_data *data){
    fprintf(stderr,"The eigenvectors:\n");
    apop_matrix_print(pc_space, .output_pipe=stderr);
    apop_data *projected = apop_dot(data, apop_matrix_to_data(pc_space));
    printf("plot '-' using 2:3:1 with labels\n");
    apop_data_show(projected);
}

int main(){
    apop_plot_lattice(query_data(), "out");
}

jayesh@jayesh-G31M-S2L:~$ gcc -std=gnu99 eigenbox.c -o eigenbox.out -lapophenia -lgsl -lsqlite3
jayesh@jayesh-G31M-S2L:~$ ./eigenbox.out
jayesh@jayesh-G31M-S2L:~$ gnuplot -persist < out

Query out the month, average, and variance, and plot the data using errorbars. The program prints to stdout, so pipe the output through Gnuplot.

#include <apop.h>

int main(){
    apop_db_open("data-climate.db");
    apop_data *d = apop_query_to_data("select \
            (yearmonth/100. - round(yearmonth/100.))*100 as month, \
            avg(tmp), stddev(tmp) \
            from precip group by month");
    printf("set xrange [0:13]; plot '-' with errorbars\n");
    apop_matrix_show(d->matrix);
}

jayesh@jayesh-G31M-S2L:~$ gcc -std=gnu99 errorbars.c -o errorbars.out -lapophenia -lgsl -lsqlite3
jayesh@jayesh-G31M-S2L:~$ ./errorbars.out | gnuplot -persist

Practical 4

Implement the statistical distributions

Discrete distributions

1. Bernoulli distribution
2. Binomial distribution
3. Poisson distribution
4. Multinomial distribution
5. Hypergeometric distribution

Continuous distributions

1. Normal distribution
2. Lognormal distribution
3. Gamma distribution
4. Exponential distribution
5. Beta distribution

Bernoulli distribution (bernoulli.c)

#include <stdio.h>
#include <gsl/gsl_randist.h>

int main (void)
{
    int i;
    double p = 0.6;
    float sum=0;
    /* prints probability distribution table */
    printf("random variable|||probability |||cumulative prob.\n");
    printf("-------------------------------------------------------\n");
    for (i = 0; i <= 1; i++)
    {
        float k = gsl_ran_bernoulli_pdf (i,p);
        sum=sum+k;
        printf("%d\t\t%f\t\t%f\n",i,k,sum);
    }
    printf("\n");
    return 0;
}

Binomial distribution (binomial.c)

#include <stdio.h>
#include <gsl/gsl_randist.h>

int main (void)
{
    int i,n=5;
    double p = 0.6;
    float sum=0;
    /* prints probability distribution table */
    printf("random variable|||probability |||cumulative prob.\n");
    printf("-------------------------------------------------------\n");
    for (i = 0; i <= n; i++)
    {
        float k = gsl_ran_binomial_pdf (i,p,n);
        sum=sum+k;
        printf("%d\t\t%f\t\t%f\n",i,k,sum);
    }
    printf("\n");
    return 0;
}
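To see what the table contains, the binomial probability C(n,k) p^k (1-p)^(n-k) can also be evaluated directly. The sketch below is a hand-rolled illustration, not GSL's implementation; the name binomial_pmf is hypothetical, and its argument order (k, p, n) simply mirrors the call in the listing.

```c
#include <assert.h>
#include <stdio.h>

/* P(X = k) for X ~ Binomial(n, p), computed as C(n,k) p^k (1-p)^(n-k). */
double binomial_pmf(unsigned int k, double p, unsigned int n) {
    assert(p >= 0 && p <= 1);
    if (k > n) return 0;
    double prob = 1;
    for (unsigned int i = 1; i <= k; i++)      /* C(n,k) * p^k, one factor at a time */
        prob *= (double)(n - k + i) / i * p;
    for (unsigned int i = 0; i < n - k; i++)   /* (1-p)^(n-k) */
        prob *= 1 - p;
    return prob;
}

#ifdef DEMO_MAIN
int main(void) {
    unsigned int n = 5;
    double p = 0.6, sum = 0;
    for (unsigned int k = 0; k <= n; k++) {
        double pk = binomial_pmf(k, p, n);
        sum += pk;
        printf("%u\t\t%f\t\t%f\n", k, pk, sum);   /* value, probability, cumulative */
    }
    return 0;
}
#endif
```

For n = 5 and p = 0.6 the first entry is 0.4^5 = 0.01024 and the last is 0.6^5 = 0.07776; the cumulative column ends at 1.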


Poisson distribution (poi.c)

#include <stdio.h>
#include <gsl/gsl_randist.h>

int main (void)
{
    int i, n = 10;
    double mu = 3.0;
    float sum=0;
    /* prints probability distribution table */
    printf("random variable|||probability |||cumulative prob.\n");
    printf("-------------------------------------------------------\n");
    for (i = 0; i <= n; i++)
    {
        float k = gsl_ran_poisson_pdf (i,mu);
        sum=sum+k;
        printf("%d\t\t%f\t\t%f\n",i,k,sum);
    }
    printf("\n");
    return 0;
}


Uniform distribution (uniform.c)

#include <stdio.h>
#include <gsl/gsl_randist.h>

int main (void)
{
    double x;
    int a,b;
    printf("enter value for x, a, b\n");
    scanf("%lf",&x);   /* x is a double, so %lf is required (not %f) */
    scanf("%d",&a);
    scanf("%d",&b);
    /* prints the density of the uniform distribution on [a, b) at x */
    printf("random variable|||probability \n");
    printf("-------------------------------------------------------\n");
    float k = (float)gsl_ran_flat_pdf (x,a,b);
    printf("%f\t\t%f\n",x,k);
    return 0;
}

Multinomial distribution (multinomial.c)

#include <stdio.h>
#include <gsl/gsl_randist.h>

int main (void)
{
    int k=3;
    const double p[]={0.2,0.4,0.4};
    const unsigned int n[]={2,3,4};
    /* prints probability */
    printf("random variable|||probability \n");
    printf("-------------------------------------------------------\n");
    double pmf = gsl_ran_multinomial_pdf(k,p,n);
    printf("%3.9f\n",pmf);
    return 0;
}


The following formula gives the probability of obtaining a specific set of outcomes when there are three possible outcomes for each event:

p = \frac{n!}{n_1!\, n_2!\, n_3!}\; p_1^{n_1}\, p_2^{n_2}\, p_3^{n_3}

where

p is the probability,
n is the total number of events,
n1 is the number of times Outcome 1 occurs,
n2 is the number of times Outcome 2 occurs,
n3 is the number of times Outcome 3 occurs,
p1 is the probability of Outcome 1,
p2 is the probability of Outcome 2, and
p3 is the probability of Outcome 3.

For the chess example,

n = 12 (12 games are played),
n1 = 7 (number won by Player A),
n2 = 2 (number won by Player B),
n3 = 3 (the number drawn),
p1 = 0.40 (probability Player A wins),
p2 = 0.35 (probability Player B wins),
p3 = 0.25 (probability of a draw).

The formula for k outcomes is

p = \frac{n!}{n_1!\, n_2! \cdots n_k!}\; p_1^{n_1}\, p_2^{n_2} \cdots p_k^{n_k}
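The k-outcome formula can be evaluated for the chess example with a few lines of C. This is a minimal sketch, not the GSL routine: the name multinomial_pmf is hypothetical, and the multinomial coefficient is accumulated one factor at a time so that no large factorial is ever formed.

```c
#include <assert.h>
#include <stdio.h>

/* n!/(n_1! ... n_k!) * p_1^n_1 ... p_k^n_k for k outcomes. */
double multinomial_pmf(int k, const double p[], const unsigned int n[]) {
    assert(k > 0);
    double prob = 1.0;
    unsigned int placed = 0;   /* events accounted for so far */
    for (int j = 0; j < k; j++) {
        for (unsigned int i = 1; i <= n[j]; i++) {
            placed++;
            /* the factors placed/i multiply up to n!/(n_1! ... n_k!) */
            prob *= (double)placed / i * p[j];
        }
    }
    return prob;
}

#ifdef DEMO_MAIN
int main(void) {
    const double p[] = {0.40, 0.35, 0.25};   /* A wins, B wins, draw */
    const unsigned int n[] = {7, 2, 3};      /* 12 games in total */
    printf("%.4f\n", multinomial_pmf(3, p, n));
    return 0;
}
#endif
```

For the chess example the coefficient is 12!/(7! 2! 3!) = 7920 and the probability comes to about 0.0248.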

Hypergeometric distribution (hyper.c)

#include <stdio.h>
#include <gsl/gsl_randist.h>

int main (void)
{
    int x,s,f,n;
    n=6;
    x=2;    //random variable
    s=13;   //success
    f=39;   //failure
    /* prints probability */
    printf("random variable|||probability \n");
    printf("-----------------------------------\n");
    double pmf = gsl_ran_hypergeometric_pdf(x,s,f,n);
    printf("%d %3.6f\n",x,pmf);
    return 0;
}


Continuous distributions (contdist.c)

#include <stdio.h>
#include <math.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <gsl/gsl_cdf.h>

void normal();
void beta();
void gamma1();
void exponential();
void lognormal();

int main()
{
    int choice;
    printf("continuous distributions\n");
    printf("-----------------------\n");
    printf("1:Normal distribution\n");
    printf("2:Gamma distribution\n");
    printf("3:Exponential distribution\n");
    printf("4:Beta distribution\n");
    printf("5:Lognormal distribution\n");
    printf("enter your choice\n");
    scanf("%d",&choice);
    switch(choice)
    {
        case 1: normal();
                break;
        case 2: gamma1();
                break;
        case 3: exponential();
                break;
        case 4: beta();
                break;
        case 5: lognormal();
                break;
        default: printf("wrong choice\n");
    }
    return 0;
}

void normal()
{
    double P, Q;
    double x = 10;
    double sigma=5;
    double pdf;
    printf("Normal distribution :x=%f sigma=%f\n",x,sigma);
    pdf = gsl_ran_gaussian_pdf (x,sigma);
    printf ("prob(x = %f) = %f\n", x, pdf);
    P = gsl_cdf_gaussian_P (x,sigma);
    printf ("prob(x < %f) = %f\n", x, P);
    Q = gsl_cdf_gaussian_Q (x,sigma);
    printf ("prob(x > %f) = %f\n", x, Q);
    x = gsl_cdf_gaussian_Pinv (P,sigma);
    printf ("Pinv(%f) = %f\n", P, x);
    x = gsl_cdf_gaussian_Qinv (Q,sigma);
    printf ("Qinv(%f) = %f\n", Q, x);
}

void gamma1()
{
    double P, Q;
    double x = 1.5;
    double a=1;
    double b=2;
    double pdf;
    printf("Gamma distribution :x=%f a=%f b=%f\n",x,a,b);
    pdf = gsl_ran_gamma_pdf (x,a,b);
    printf ("prob(x = %f) = %f\n", x, pdf);
    P = gsl_cdf_gamma_P (x,a,b);
    printf ("prob(x < %f) = %f\n", x, P);
    Q = gsl_cdf_gamma_Q (x,a,b);
    printf ("prob(x > %f) = %f\n", x, Q);
    x = gsl_cdf_gamma_Pinv (P,a,b);
    printf ("Pinv(%f) = %f\n", P, x);
    x = gsl_cdf_gamma_Qinv (Q,a,b);
    printf ("Qinv(%f) = %f\n", Q, x);
}

void exponential()
{
    double P, Q;
    double x = 0.05;
    double lambda=2;
    double pdf;
    printf("Exponential distribution :x=%f lambda=%f\n",x,lambda);
    pdf = gsl_ran_exponential_pdf (x,lambda);
    printf ("prob(x = %f) = %f\n", x, pdf);
    P = gsl_cdf_exponential_P (x,lambda);
    printf ("prob(x < %f) = %f\n", x, P);
    Q = gsl_cdf_exponential_Q (x,lambda);
    printf ("prob(x > %f) = %f\n", x, Q);
    x = gsl_cdf_exponential_Pinv (P,lambda);
    printf ("Pinv(%f) = %f\n", P, x);
    x = gsl_cdf_exponential_Qinv (Q,lambda);
    printf ("Qinv(%f) = %f\n", Q, x);
}

void beta()
{
    double P, Q;
    double x = 0.8;
    double a=0.5;
    double b=0.5;
    double pdf;
    printf("Beta distribution :x=%f a=%f b=%f\n",x,a,b);
    pdf = gsl_ran_beta_pdf (x,a,b);
    printf ("prob(x = %f) = %f\n", x, pdf);
    P = gsl_cdf_beta_P (x,a,b);
    printf ("prob(x < %f) = %f\n", x, P);
    Q = gsl_cdf_beta_Q (x,a,b);
    printf ("prob(x > %f) = %f\n", x, Q);
    x = gsl_cdf_beta_Pinv (P,a,b);
    printf ("Pinv(%f) = %f\n", P, x);
    x = gsl_cdf_beta_Qinv (Q,a,b);
    printf ("Qinv(%f) = %f\n", Q, x);
}

void lognormal()
{
    double P, Q;
    double x = 4;
    double zeta=2;
    double sigma=1.5;
    double pdf;
    printf("Lognormal distribution :x=%f zeta=%f sigma=%f\n",x,zeta,sigma);
    pdf = gsl_ran_lognormal_pdf (x,zeta,sigma);
    printf ("prob(x = %f) = %f\n", x, pdf);
    P = gsl_cdf_lognormal_P (x,zeta,sigma);
    printf ("prob(x < %f) = %f\n", x, P);
    Q = gsl_cdf_lognormal_Q (x,zeta,sigma);
    printf ("prob(x > %f) = %f\n", x, Q);
    x = gsl_cdf_lognormal_Pinv (P,zeta,sigma);
    printf ("Pinv(%f) = %f\n", P, x);
    x = gsl_cdf_lognormal_Qinv (Q,zeta,sigma);
    printf ("Qinv(%f) = %f\n", Q, x);
}

Practical No. 5

Implement regression and goodness of fit

A. Implementing regression

Functions used:

int gsl_fit_wlinear (const double * x, const size_t xstride,
                     const double * w, const size_t wstride,
                     const double * y, const size_t ystride,
                     size_t n, double * c0, double * c1,
                     double * cov00, double * cov01, double * cov11,
                     double * chisq)

This function computes the best-fit linear regression coefficients (c0, c1) of the model Y = c_0 + c_1 X for the weighted dataset (x, y), two vectors of length n with strides xstride and ystride. The vector w, of length n and stride wstride, specifies the weight of each datapoint. The weight is the reciprocal of the variance for each datapoint in y.

The covariance matrix for the parameters (c0, c1) is computed using the weights and returned via the parameters (cov00, cov01, cov11). The weighted sum of squares of the residuals from the best-fit line, \chi^2, is returned in chisq.

int gsl_fit_linear_est (double x, double c0, double c1,
                        double cov00, double cov01, double cov11,
                        double * y, double * y_err)

This function uses the best-fit linear regression coefficients c0, c1 and their covariance cov00, cov01, cov11 to compute the fitted function y and its standard deviation y_err for the model Y = c_0 + c_1 X at the point x.

The following program computes a least squares straight-line fit to a simple dataset, and outputs the best-fit line and its associated one standard-deviation error bars.

#include <stdio.h>
#include <math.h>   /* sqrt */
#include <gsl/gsl_fit.h>

int main (void)
{
    int i, n = 4;
    double x[4] = { 1970, 1980, 1990, 2000 };
    double y[4] = { 12, 11, 14, 13 };
    double w[4] = { 0.1, 0.2, 0.3, 0.4 };
    double c0, c1, cov00, cov01, cov11, chisq;

    gsl_fit_wlinear (x, 1, w, 1, y, 1, n,
                     &c0, &c1, &cov00, &cov01, &cov11, &chisq);

    printf ("# best fit: Y = %g + %g X\n", c0, c1);
    printf ("# covariance matrix:\n");
    printf ("# [ %g, %g\n# %g, %g]\n", cov00, cov01, cov01, cov11);
    printf ("# chisq = %g\n", chisq);

    for (i = 0; i < n; i++)
        printf ("data: %g %g %g\n", x[i], y[i], 1/sqrt(w[i]));
    printf ("\n");

    for (i = -30; i < 130; i++)
    {
        double xf = x[0] + (i/100.0) * (x[n-1] - x[0]);
        double yf, yf_err;
        gsl_fit_linear_est (xf, c0, c1, cov00, cov01, cov11, &yf, &yf_err);
        printf ("fit: %g %g\n", xf, yf);
        printf ("hi : %g %g\n", xf, yf + yf_err);
        printf ("lo : %g %g\n", xf, yf - yf_err);
    }
    return 0;
}
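The coefficients of a weighted straight-line fit have a closed form, which makes the result easy to cross-check by hand. The following is a sketch of the textbook weighted least-squares formulas in plain C, covering only c0 and c1 (no covariance matrix or chi-square); it is not GSL's source, and the name weighted_line_fit is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Weighted least-squares line y = c0 + c1*x, where w[i] weights point i. */
void weighted_line_fit(const double *x, const double *w, const double *y,
                       size_t n, double *c0, double *c1) {
    assert(n >= 2);
    double W = 0, xbar = 0, ybar = 0;
    for (size_t i = 0; i < n; i++) {   /* weighted means of x and y */
        W += w[i];
        xbar += w[i] * x[i];
        ybar += w[i] * y[i];
    }
    xbar /= W;
    ybar /= W;
    double sxx = 0, sxy = 0;
    for (size_t i = 0; i < n; i++) {   /* weighted sums of squares and products */
        double dx = x[i] - xbar, dy = y[i] - ybar;
        sxx += w[i] * dx * dx;
        sxy += w[i] * dx * dy;
    }
    *c1 = sxy / sxx;            /* slope */
    *c0 = ybar - *c1 * xbar;    /* intercept */
}

#ifdef DEMO_MAIN
int main(void) {
    double x[4] = { 1970, 1980, 1990, 2000 };
    double y[4] = { 12, 11, 14, 13 };
    double w[4] = { 0.1, 0.2, 0.3, 0.4 };
    double c0, c1;
    weighted_line_fit(x, w, y, 4, &c0, &c1);
    printf("# best fit: Y = %g + %g X\n", c0, c1);
    return 0;
}
#endif
```

For the dataset of the listing the weighted means are 1990 and 12.8, the slope is 6/100 = 0.06 and the intercept is 12.8 - 0.06 x 1990 = -106.6.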


B. Implementing goodness of fit: Chi Square

Functions used:

int apop_db_open ( char const * filename )

If you want to use a database on the hard drive instead of memory, then call this once and only once before using any other database utilities.

When you are done doing your database manipulations, be sure to call apop_db_close if writing to disk.

Parameters:

filename   The name of a file on the hard drive on which to store the database.

Returns:

0: everything OK
1: database did not open.

apop_model* apop_estimate ( apop_data * d, apop_model m )

Estimate the parameters of a model given data. This function copies the input model, preps it, and calls m.estimate(d,&m). If the model has no estimate method, the function falls back on apop_maximum_likelihood(d, m) with the default MLE parameters.

Parameters:

d   The data
m   The model

Returns: A pointer to an output model, which typically matches the input model but has its parameters element filled in.

apop_model* apop_model_to_pmf ( apop_model * model, apop_data * binspec, long int draws, int bin_count, gsl_rng * rng )

Make random draws from an apop_model, and bin them using a binspec in the style of apop_data_to_bins. If you have a data set that used the same binspec, you now have synced histograms, which you can plot or sensibly test hypotheses about.

The output is normalized to integrate to one.

Parameters:

binspec     A description of the bins in which to place the draws; see apop_data_to_bins. (default: as in apop_data_to_bins.)
model       The model to be drawn from. Because this function works via random draws, the model needs to have a draw method. (No default)
draws       The number of random draws to make. (arbitrary default = 10,000)
bin_count   If no bin spec, the number of bins to use. (default: as per apop_data_to_bins)
rng         The gsl_rng used to make random draws. (default: see note on Auto-allocated RNGs)

Returns:

An apop_pmf model.

This function uses the Designated initializers syntax for inputs.

#include <apop.h>

int main(){
    apop_db_open("data-climate.db");
    apop_data *precip = apop_query_to_data("select PCP from precip");
    apop_model *est = apop_estimate(precip, apop_normal);
    apop_data *precip_binned = apop_data_to_bins(precip/*, .bin_count=180*/);
    apop_model *datahist = apop_estimate(precip_binned, apop_pmf);
    apop_model *modelhist = apop_model_to_pmf(.model=est,
            .binspec=apop_data_get_page(precip_binned, "<binspec>"), .draws=1e5);
    double scaling = apop_sum(datahist->data->weights)/apop_sum(modelhist->data->weights);
    gsl_vector_scale(modelhist->data->weights, scaling);
    apop_data_show(apop_histograms_test_goodness_of_fit(datahist, modelhist));
}
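The last call of the program compares the data histogram with the model histogram bin by bin. A common statistic for this kind of comparison is Pearson's chi-square, the sum over bins of (observed - expected)^2 / expected. The sketch below computes it for made-up counts; it is an illustration of the statistic only and is not taken from Apophenia's source. The name chi_square_stat and the counts are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Pearson statistic: sum over bins of (observed - expected)^2 / expected. */
double chi_square_stat(const double *observed, const double *expected, size_t bins) {
    double stat = 0;
    for (size_t i = 0; i < bins; i++) {
        assert(expected[i] > 0);   /* bins with zero expected count must be merged first */
        double diff = observed[i] - expected[i];
        stat += diff * diff / expected[i];
    }
    return stat;
}

#ifdef DEMO_MAIN
int main(void) {
    /* hypothetical bin counts with equal totals, as after the scaling step */
    double observed[] = {10, 20, 30};
    double expected[] = {15, 20, 25};
    printf("chi-square = %g\n", chi_square_stat(observed, expected, 3));
    return 0;
}
#endif
```

For these counts the statistic is 25/15 + 0 + 25/25, about 2.67. It would then be compared with a chi-square distribution whose degrees of freedom depend on the number of bins and estimated parameters.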



Prac 6. Implement testing with likelihood

1. Building an optimized model and then solving the same for the maximum (a function can be provided in this case).

The search methods used are:

APOP_SIMPLEX_NM   Nelder-Mead simplex (gradient handling rule is irrelevant)
APOP_CG_FR        Conjugate gradient (Fletcher-Reeves) (default)
APOP_SIMAN        Simulated annealing
APOP_RF_NEWTON    Find a root of the derivative via Newton's method

The model to be maximized (sinsq.c):

#include <apop.h>

double sin_square(apop_data *data, apop_model *m){
    double x = apop_data_get(m->parameters, 0, -1);
    return -sin(x)*gsl_pow_2(x);
}

apop_model sin_sq_model ={"-sin(x) times x^2",1, .p = sin_square};

The search program, which includes sinsq.c:

#include "sinsq.c"

void do_search(int number, char *name, char *trace){
    apop_model *out;
    double p[] = {0};
    double result;
    char *outf;
    asprintf(&outf, "localmax_out/%s.gplot", trace);
    Apop_model_add_group(&sin_sq_model, apop_mle, .starting_pt= p,
            .method= number, .tolerance= 1e-4, .mu_t= 1.25,
            .trace_path= outf);
    out = apop_estimate(NULL, sin_sq_model);
    result = gsl_vector_get(out->parameters->vector, 0);
    printf("The %s algorithm found %g.\n", name, result);
    Apop_settings_rm_group(&sin_sq_model, apop_mle);
}

int main(){
    system ("mkdir -p localmax_out; rm -f localmax_out/*.gplot");
    apop_opts.verbose ++;
    do_search(APOP_SIMPLEX_NM, "N-M Simplex", "simplex");
    do_search(APOP_CG_FR, "F-R Conjugate gradient", "fr");
    do_search(APOP_SIMAN, "Simulated annealing", "siman");
    do_search(APOP_RF_NEWTON, "Root-finding", "root");
    fflush(NULL);
    system("sed -i \"1iplot '-'\" localmax_out/*.gplot");
}

2. Comparing 2 models using likelihood ratio

The two models to be compared (dummies.c):

#include <apop.h>

apop_model * dummies(int slope_dummies){
    apop_data *d = apop_query_to_mixed_data("mmt", "select riders, year-1977, line \
            from riders, lines \
            where riders.station=lines.station");
    apop_data *dummified = apop_data_to_dummies(d, 0, 't', .append='y', .remove='y');
    if (slope_dummies){
        Apop_col(d, 1, yeardata);
        for(int i=0; i < dummified->matrix->size2; i ++){
            Apop_col(dummified, i, c);
            gsl_vector_mul(c, yeardata);
        }
    }
    apop_model *out = apop_estimate(dummified, apop_ols);
    apop_model_show(out);
    return out;
}

#ifndef TESTING
int main(){
    apop_db_open("data-metro.db");
    printf("With constant dummies:\n");
    dummies(0);
    printf("With slope dummies:\n");
    dummies(1);
}
#endif

The test program, which includes dummies.c:

#define TESTING
#include "dummies.c"

void show_normal_test(apop_model *unconstrained, apop_model *constrained, int n){
    double statistic = (apop_data_get(unconstrained->info, .rowname="log likelihood")
                      - apop_data_get(constrained->info, .rowname="log likelihood"))/sqrt(n);
    double confidence = gsl_cdf_gaussian_P(fabs(statistic), 1); //one-tailed.
    printf("The Normal statistic is: %g, so reject the null of no difference between models "
           "with %g%% confidence.\n", statistic, confidence*100);
}

int main(){
    apop_db_open("data-metro.db");
    apop_model *m0 = dummies(0);
    apop_model *m1 = dummies(1);
    show_normal_test(m0, m1, m0->data->matrix->size1);
}

Prac 7. Generate random numbers using Monte Carlo method using

1. Exponential distribution 2. uniform distribution 3. binomial distribution

some functions used for random number generation the functions used for random number generation are declared in the header file `gsl_rng.h'.

const gsl_rng_type * T : holds static information about each type of generator.

gsl_rng_env_setup() : This function reads the environment variables GSL_RNG_TYPE and GSL_RNG_SEED and uses their values to set the corresponding library variables gsl_rng_default and gsl_rng_default_seed.

Program to create a global generator using the environment variables GSL_RNG_TYPE and GSL_RNG_SEED:

#include <stdio.h>
#include <gsl/gsl_rng.h>

gsl_rng * r;  /* global generator */

int main (void)
{
    const gsl_rng_type * T;

    gsl_rng_env_setup();
    T = gsl_rng_default;
    r = gsl_rng_alloc (T);

    printf ("generator type: %s\n", gsl_rng_name (r));
    printf ("seed = %lu\n", gsl_rng_default_seed);
    printf ("first value = %lu\n", gsl_rng_get (r));

    gsl_rng_free (r);
    return 0;
}

Running the program without any environment variables uses the initial defaults: an mt19937 generator with a seed of 0.

Setting the two variables on the command line changes the default generator and the seed, for example by starting the program as GSL_RNG_TYPE=taus GSL_RNG_SEED=123 ./a.out (the executable name a.out is only an example).

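Using exponential distribution

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>

/* Usage: program n alpha */
int main(int argc, char *argv[])
{
    int i, n;
    float x, alpha;

    if (argc < 3) {   /* both arguments are required */
        fprintf(stderr, "usage: %s n alpha\n", argv[0]);
        return 1;
    }
    gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937);   /* initialises GSL RNG */
    n = atoi(argv[1]);
    alpha = atof(argv[2]);
    x = 0;
    for (i = 0; i < n; i++) {
        /* alpha times the previous value plus an exponential draw with mean 1 */
        x = alpha*x + gsl_ran_exponential(r, 1);
        printf(" %2.4f \n", x);
    }
    gsl_rng_free(r);
    return(0);
}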

Generating uniform random numbers in the range [0.0, 1.0) using uniform distribution

#include <stdio.h>
#include <gsl/gsl_rng.h>

int main (void)
{
    const gsl_rng_type * T;
    gsl_rng * r;
    int i, n = 10;

    gsl_rng_env_setup();
    T = gsl_rng_default;
    r = gsl_rng_alloc (T);

    for (i = 0; i < n; i++)
    {
        double u = gsl_rng_uniform (r);
        printf ("%.5f\n", u);
    }

    gsl_rng_free (r);
    return 0;
}
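Uniform draws are the raw material of the Monte Carlo method. As a hedged illustration, the sketch below estimates pi from the fraction of random points in the unit square that fall inside the quarter circle. It assumes the C library's rand() in place of gsl_rng_uniform(), and the names uniform01 and estimate_pi are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Uniform draw in [0, 1), standing in for gsl_rng_uniform(). */
double uniform01(void){
    return rand() / ((double)RAND_MAX + 1.0);
}

/* Four times the fraction of random points (x, y) in the unit square
   that land inside the quarter circle x*x + y*y < 1. */
double estimate_pi(int n, unsigned int seed){
    assert(n > 0);
    srand(seed);
    int inside = 0;
    for (int i = 0; i < n; i++){
        double x = uniform01(), y = uniform01();
        if (x*x + y*y < 1.0)
            inside++;
    }
    return 4.0 * inside / n;
}
```

The estimate improves slowly, with an error that shrinks roughly like 1/sqrt(n), which is typical of Monte Carlo estimates.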

Using binomial distribution

#include <stdio.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>

int main (void)
{
    const gsl_rng_type * T;
    gsl_rng * r;
    int i, n = 10;

    /* create a generator chosen by the environment variable GSL_RNG_TYPE */
    gsl_rng_env_setup();
    T = gsl_rng_default;
    r = gsl_rng_alloc (T);
    float p = 0.3;

    /* print n random variates chosen from the binomial distribution
       with success probability p and n trials */
    for (i = 0; i < n; i++)
    {
        unsigned int k = gsl_ran_binomial(r, p, n);
        printf (" %u", k);
    }
    printf ("\n");

    gsl_rng_free (r);
    return 0;
}
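A binomial variate counts the successes in a fixed number of independent trials, so it can be simulated directly from uniform draws. The sketch below is a plain-C illustration of that idea, assuming rand() in place of a GSL generator; draw_binomial and mean_of_binomial_draws are hypothetical names, and GSL's own gsl_ran_binomial may use a different, faster algorithm.

```c
#include <assert.h>
#include <stdlib.h>

/* One Binomial(trials, p) draw: count the successes in `trials`
   independent Bernoulli(p) trials. */
unsigned int draw_binomial(double p, unsigned int trials){
    assert(p >= 0.0 && p <= 1.0);
    unsigned int successes = 0;
    for (unsigned int i = 0; i < trials; i++){
        double u = rand() / ((double)RAND_MAX + 1.0);   /* uniform in [0, 1) */
        if (u < p)
            successes++;
    }
    return successes;
}

/* Average of `draws` binomial variates; this should be close to trials * p. */
double mean_of_binomial_draws(double p, unsigned int trials, int draws, unsigned int seed){
    assert(draws > 0);
    srand(seed);
    double sum = 0;
    for (int i = 0; i < draws; i++)
        sum += draw_binomial(p, trials);
    return sum / draws;
}
```

With p = 0.3 and 10 trials, as in the program above, the average of many draws should be near 10 * 0.3 = 3.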

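Other gsl_ran_* functions declared in gsl/gsl_randist.h (for example gsl_ran_gaussian and gsl_ran_poisson) can be used in the same way to generate random numbers from other distributions, given the parameters each one requires.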

Practical No. 8 Implementing Parametric testing

1. t test

#include <apop.h>

int main(){
    apop_db_open("data-census.db");
    gsl_vector *n = apop_query_to_vector("select in_per_capita from income "
            "where state= (select state from geography where name ='North Dakota')");
    gsl_vector *s = apop_query_to_vector("select in_per_capita from income "
            "where state= (select state from geography where name ='South Dakota')");
    apop_data *t = apop_t_test(n,s);
    apop_data_show(t); //show the whole output set...
    printf ("\n confidence: %g\n", apop_data_get(t, .rowname="conf.*2 tail")); //...or just one value.
}

2. F test

apop_data* apop_f_test(apop_model *est, apop_data *contrast)

Runs an F-test specified by q and c. Your best bet is to see the chapter on hypothesis testing in Modeling With Data (http://modelingwithdata.org/), p 309. It will tell you that:

$$\frac{N-K}{q} \cdot \frac{(Q'\hat\beta - c)' \, [Q'(X'X)^{-1}Q]^{-1} \, (Q'\hat\beta - c)}{u'u} \sim F_{q,\,N-K},$$

and that's what this function is based on.

Parameters:

est An apop_model that you have already calculated. (No default)

contrast The matrix $Q$ and the vector $c$, where each row represents a hypothesis. (Defaults: if the matrix is NULL, it is set to the identity matrix with the top row missing. If the vector is NULL, it is set to a zero vector of length equal to the height of the contrast matrix. Thus, if the entire apop_data set is NULL or omitted, we are testing the hypothesis that all coefficients but the constant term are zero.)

Returns: An apop_data set with a few variants on the confidence with which we can reject the joint hypothesis.

Todo: There should be a way to get OLS and GLS to store $(X'X)^{-1}$. In fact, if you did GLS, this is invalid, because you need $(X'\Sigma X)^{-1}$, and I didn't ask for $\Sigma$.

There are two approaches to an F-test: the ANOVA approach, which is typically built around the claim that all effects but the mean are zero; and the more general regression form, which allows for any set of linear claims about the data. If you send a NULL contrast set, I will generate the set of linear contrasts that are equivalent to the ANOVA-type approach. Readers of Modeling with Data, note that there's a bug in the book that claims that the traditional ANOVA approach also checks that the coefficient for the constant term is zero; this is not the custom and doesn't produce the equivalence presented in that and other textbooks.

Exceptions:

out->error='a' Allocation error.

out->error='d' dimension-matching error.

out->error='i' matrix inversion error.

out->error='m' GSL math error.

#include "eigenbox.h"

int main(){
    double line[] = {0, 0, 0, 1};
    apop_data *constr = apop_line_to_data(line, 1, 1, 3);
    apop_data *d = query_data();
    apop_model *est = apop_estimate(d, apop_ols);
    apop_model_show(est);
    apop_data_show(apop_f_test(est, constr));
}


Practical No. 9 Drawing an Inference

Obtaining the mean, standard error and p value for the given data.

#include <apop.h>

/* Fill boot_sample with draws from base_data, sampled with replacement. */
void one_boot(gsl_vector *base_data, gsl_rng *r, gsl_vector *boot_sample){
    for (int i = 0; i < boot_sample->size; i++)
        gsl_vector_set(boot_sample, i,
                gsl_vector_get(base_data, gsl_rng_uniform_int(r, base_data->size)));
}

int main(){
    int rep_ct = 10000;
    gsl_rng *r = apop_rng_alloc(0);
    apop_db_open("data-census.db");
    gsl_vector *base_data = apop_query_to_vector("select in_per_capita from income where sumlevel+0.0 =40");
    double RI = apop_query_to_float("select in_per_capita from income where sumlevel+0.0 =40 and geo_id2+0.0=44");
    gsl_vector *boot_sample = gsl_vector_alloc(base_data->size);
    gsl_vector *replications = gsl_vector_alloc(rep_ct);
    /* each replication is the mean of one bootstrap sample */
    for (int i = 0; i < rep_ct; i++){
        one_boot(base_data, r, boot_sample);
        gsl_vector_set(replications, i, apop_mean(boot_sample));
    }
    double stderror = sqrt(apop_var(replications));
    double mean = apop_mean(replications);
    printf("mean: %g; standard error: %g; (RI-mean)/stderr: %g; p value: %g\n",
            mean, stderror, (RI-mean)/stderror,
            2*gsl_cdf_gaussian_Q(fabs(RI-mean), stderror));
}


Practical No. 10 Implementing Non-parametric Testing

1. Anova

apop_data* apop_anova(char *table, char *data, char *grouping1, char *grouping2)

This function produces a traditional one- or two-way ANOVA table. It works from data in an SQL table, using queries of the form select data from table group by grouping1, grouping2.

Parameters:

table The table to be queried. Anything that can go in an SQL from clause is OK, so this can be a plain table name or a temp table specification like (select ... ), with parens.

data The name of the column holding the count or other such data

grouping1 The name of the first column by which to group data

grouping2 If this is NULL, then the function will return a one-way ANOVA. Otherwise, the name of the second column by which to group data in a two-way ANOVA.

#include <apop.h>

int main(){
    apop_db_open("data-metro.db");
    char joinedtab[] = "(select year, riders, line \
            from riders, lines \
            where riders.station = lines.station)";
    apop_data_show(apop_anova(joinedtab, "riders", "line", "year"));
}
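The program above requests a two-way table. The simpler one-way case can be worked by hand, and the sketch below computes its F statistic in plain C. It is an illustration under stated assumptions, not Apophenia's implementation: the name anova_f and the back-to-back array layout are hypothetical choices made for this example.

```c
#include <assert.h>

/* One-way ANOVA F statistic. values[] holds the groups back to back and
   sizes[g] is the number of observations in group g (each assumed > 0). */
double anova_f(const double *values, const int *sizes, int groups){
    assert(groups > 1);
    int total = 0;
    for (int g = 0; g < groups; g++) total += sizes[g];
    assert(total > groups);

    double grand = 0;                      /* grand mean over all observations */
    for (int i = 0; i < total; i++) grand += values[i];
    grand /= total;

    double ss_between = 0, ss_within = 0;
    const double *v = values;              /* start of the current group */
    for (int g = 0; g < groups; g++){
        double mean = 0;
        for (int i = 0; i < sizes[g]; i++) mean += v[i];
        mean /= sizes[g];
        ss_between += sizes[g] * (mean - grand) * (mean - grand);
        for (int i = 0; i < sizes[g]; i++)
            ss_within += (v[i] - mean) * (v[i] - mean);
        v += sizes[g];
    }
    /* ratio of mean squares: each sum of squares over its degrees of freedom */
    return (ss_between / (groups - 1)) / (ss_within / (total - groups));
}
```

For the groups {1, 2, 3}, {2, 3, 4} and {6, 7, 8}, the group means are 2, 3 and 7 and the grand mean is 4, so the between-group sum of squares is 42, the within-group sum of squares is 6, and F = (42/2)/(6/6) = 21.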



References

1. Modeling with Data, Ben Klemens, Princeton University Press
2. Computational Statistics, James E. Gentle, Springer
3. Computational Statistics, Second Edition, Geof H. Givens and Jennifer A. Hoeting, Wiley Publications
4. www.cygwin.com
5. http://apophenia.info/
