Upload: lamtuyen

Post on 14-May-2018


ISC 233 Computer Science Midterm (96 points total)

Due: Monday, March 31, 5:59 pm

The take-home midterm project consists of four parts:

0. Honor pledge (hard copy, signed and dated)
1. Written Questions
2. Programming Exercise: Analyzing Protein-Protein Interaction Network
3. Programming Exercise: MapColors

No collaboration is permitted on this project. Note that this is stricter than the usual policy: you may not discuss the project at all with anyone else, even to clarify your understanding of a question. If you need clarifications on the questions or programming exercises, please contact the preceptors. In the interests of fairness, we will make all clarifications available to the entire class. Please note that clarifications are limited to inaccuracies or ambiguities in the wording of the questions. We will not provide help, as this project is intended to demonstrate your full grasp of the material.

This exam is open book and open notes, but “closed” internet. You cannot search the internet for help with this exam; the only exceptions are the course website, the booksite, and any direct links from those two.

Note that no project or part of the exam may be handed in late. There is no grace period. The project should be submitted by Monday, March 31 at 5:59 pm. Projects handed in after this deadline will receive a grade of 0. You have a week (plus optionally another week) to work on this exam, so start early.

You should submit the programming exercises (parts 2 and 3) on Dropbox. You may submit the written exercises (part 1) as a PDF document on Dropbox, or if you prefer, you may submit it on paper to the Integrated Science submission mailbox in Carl Icahn Laboratory. In addition, everyone, including those submitting electronically, should print out this page. Please write and sign the honor pledge in full, and leave a copy in the assignment submission mailbox in Carl Icahn Laboratory. You will not receive credit for the midterm project if you do not turn in your honor pledge.

Honor Pledge

Please write out and sign the honor pledge:

“I pledge my honor that I have not violated the honor code during this examination.”

Write:

Sign:

Part 1: Questions (36 points total, 3 per question)

Object-Oriented Programming

1) a) What is a constructor in Java and what is it used for?

b) What is the API of a class?

c) What is a main method used for?

d) The API for a class is given below. Fill in the code. FYI, the Fibonacci sequence looks like {1, 1, 2, 3, 5, …}, where f(n) = f(n-1) + f(n-2). We want to compute the ratio between the nth Fibonacci number and the (n-1)th Fibonacci number.

// Returns f(n)/f(n-1), where f(n) is the value of the nth Fibonacci number.
// Takes an integer n as an argument, and returns the value f(n)/f(n-1).
public static double fibonacciRatio(int n) {

}

// Returns the nth Fibonacci number.
public static int fibonacci(int n) {

}

(zero-point bonus) What is this ratio an approximation of?

e) When should one use private vs. public for both a variable and a method?

Binary Search

3) How many times would the search function be called to find the A in the array below?

A D E J U W Z

The code for binary search from class is given below. Hint: it may be helpful to draw out the recursion tree (i.e. a tree of all function calls).

public static int search(String key, String[] a) {
    return search(key, a, 0, a.length);
}

public static int search(String key, String[] a, int lo, int hi) {
    if (hi <= lo) return -1;
    int mid = (hi + lo) / 2;
    int cmp = a[mid].compareTo(key);
    if (cmp > 0) return search(key, a, lo, mid);
    else if (cmp < 0) return search(key, a, mid+1, hi);
    else return mid;
}

Queues/Stacks

4) a) Describe in words how to implement a queue using an array that maintains O(1) running time for enqueues (insertions) and dequeues (deletions). Your solution shouldn’t require shifting the elements in the array (which would require O(n) time).

b) Two words are anagrams if they are composed of the same letters with the same frequency. For example, DISEASE and SEASIDE are anagrams. Using a HashTable, develop an algorithm (pseudocode) that takes two words as arguments and outputs a boolean indicating whether or not they are anagrams of each other. Argue briefly that your solution has linear complexity.
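To illustrate the anagram definition in 4b, here is a minimal Java sketch of one possible counting approach. It uses java.util.HashMap as the hash table; the class name AnagramCheck and the method name isAnagram are hypothetical, and the question itself only asks for pseudocode. Each word is scanned once, with one expected O(1) hash table operation per character.

```java
import java.util.HashMap;
import java.util.Map;

class AnagramCheck {
    // Returns true if a and b contain the same letters with the same frequency.
    static boolean isAnagram(String a, String b) {
        if (a.length() != b.length()) return false;
        Map<Character, Integer> counts = new HashMap<>();
        // First pass: count every character of the first word.
        for (char c : a.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        // Second pass: cancel those counts using the second word.
        for (char c : b.toCharArray()) {
            Integer left = counts.get(c);
            if (left == null) return false;      // b uses c more often than a does
            if (left == 1) counts.remove(c);
            else counts.put(c, left - 1);
        }
        return counts.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(isAnagram("DISEASE", "SEASIDE"));
        System.out.println(isAnagram("DISEASE", "SEASONS"));
    }
}
```

The early length check means the second pass can only fail by running out of a character, so an empty map at the end confirms a match.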

K-means

5) Recall that k-means initializes cluster centroids randomly. Consequently, different results (clusterings) could be obtained with each run of the algorithm. Describe a potential way (assuming the algorithm could be run any number of times) to use multiple k-means runs to get a clustering that’s more likely to be correct than the result of any single run.
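One commonly used way to combine runs (not necessarily the answer the course has in mind) is to repeat k-means with different random initializations and keep the run with the lowest within-cluster sum of squared distances. The sketch below assumes 1-D points for brevity; the class KMeansRestarts, its methods, and the iteration cap of 100 are all hypothetical choices.

```java
import java.util.Random;

class KMeansRestarts {
    // Outcome of one run: final centroids and the within-cluster sum of squares.
    static class Result {
        final double[] centroids;
        final double cost;
        Result(double[] centroids, double cost) { this.centroids = centroids; this.cost = cost; }
    }

    // Index of the centroid closest to x (ties go to the lower index).
    static int nearest(double x, double[] centroids) {
        int best = 0;
        for (int j = 1; j < centroids.length; j++) {
            if (Math.abs(x - centroids[j]) < Math.abs(x - centroids[best])) best = j;
        }
        return best;
    }

    // Sum of squared distances from each point to its nearest centroid.
    static double cost(double[] points, double[] centroids) {
        double total = 0.0;
        for (double x : points) {
            double d = x - centroids[nearest(x, centroids)];
            total += d * d;
        }
        return total;
    }

    // One k-means run; centroids start at randomly chosen data points.
    static Result runOnce(double[] points, int k, Random rng) {
        double[] centroids = new double[k];
        for (int j = 0; j < k; j++) centroids[j] = points[rng.nextInt(points.length)];
        for (int iter = 0; iter < 100; iter++) {
            double[] sum = new double[k];
            int[] count = new int[k];
            for (double x : points) {              // assignment step
                int j = nearest(x, centroids);
                sum[j] += x;
                count[j]++;
            }
            boolean changed = false;
            for (int j = 0; j < k; j++) {          // update step
                if (count[j] == 0) continue;       // empty cluster keeps its old centroid
                double mean = sum[j] / count[j];
                if (mean != centroids[j]) { centroids[j] = mean; changed = true; }
            }
            if (!changed) break;
        }
        return new Result(centroids, cost(points, centroids));
    }

    // Run k-means several times and keep the run with the lowest cost.
    static Result bestOf(double[] points, int k, int runs, long seed) {
        Random rng = new Random(seed);
        Result best = null;
        for (int r = 0; r < runs; r++) {
            Result candidate = runOnce(points, k, rng);
            if (best == null || candidate.cost < best.cost) best = candidate;
        }
        return best;
    }

    public static void main(String[] args) {
        double[] points = {1, 2, 3, 101, 102, 103};
        System.out.println("best cost = " + bestOf(points, 2, 10, 42L).cost);
    }
}
```

A lower cost only means tighter clusters under this particular objective; other ways of combining runs, such as checking which points are grouped together consistently across runs, are equally reasonable.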

Circular Doubly Linked List

6) a) Briefly explain the advantages of using Arrays over LinkedLists, and vice versa (maximum of 2 sentences each). Hint: In the Guitar Hero assignment, why did we implement the Queue as an Array and not a LinkedList?

Array:

LinkedList:

b) In the Traveling Salesman assignment you implemented a singly linked circular list (each node contained a single reference to the next node in the list). Consider the following class that implements a circular doubly linked list where each node contains a reference to the next node and the previous node. Fill in the code for the method insertBefore, which inserts a node before another node in the list. Assume that beforeNode and iNode are NOT null. Hint: use your insertion method for the Traveling Salesman assignment as a guide, but remember that in this case you must update two references (one to the previous node and one to the next node) instead of just one (the next node).

public class CircularDoublyLinkedList {

    private Node head = null;

    /* Insert node iNode before beforeNode in the circular doubly linked list */
    public void insertBefore(Node beforeNode, Node iNode) {
        // Fill in code
    }

    private class Node {
        Point p;
        Node prev;
        Node next;
    }
}
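The pointer updates involved in inserting into a circular doubly linked list can be sketched as follows. This is one possible arrangement under stated assumptions, not the course's reference solution: the class CircularDoublyLinkedListSketch, the int payload (standing in for Point p), and the helpers insertFirst, forward and backward are hypothetical, and beforeNode is assumed to be in the list already. The sketch leaves head unchanged even when inserting before head, which places the new node at the tail.

```java
class CircularDoublyLinkedListSketch {
    static class Node {
        int value;          // stand-in payload; the exam's Node stores a Point p
        Node prev;
        Node next;
        Node(int value) { this.value = value; }
    }

    Node head = null;

    // Insert iNode immediately before beforeNode. Four references change:
    // both links of iNode, the next link of the old predecessor, and the
    // prev link of beforeNode.
    void insertBefore(Node beforeNode, Node iNode) {
        Node before = beforeNode.prev;
        iNode.prev = before;
        iNode.next = beforeNode;
        before.next = iNode;
        beforeNode.prev = iNode;
    }

    // Hypothetical helper: start a one-element circular list.
    void insertFirst(Node n) {
        n.prev = n;
        n.next = n;
        head = n;
    }

    // Values in forward order, starting at head.
    String forward() {
        if (head == null) return "";
        StringBuilder sb = new StringBuilder();
        Node cur = head;
        do {
            if (sb.length() > 0) sb.append(' ');
            sb.append(cur.value);
            cur = cur.next;
        } while (cur != head);
        return sb.toString();
    }

    // Values in backward order, starting at the node before head.
    String backward() {
        if (head == null) return "";
        StringBuilder sb = new StringBuilder();
        Node tail = head.prev;
        Node cur = tail;
        do {
            if (sb.length() > 0) sb.append(' ');
            sb.append(cur.value);
            cur = cur.prev;
        } while (cur != tail);
        return sb.toString();
    }

    public static void main(String[] args) {
        CircularDoublyLinkedListSketch list = new CircularDoublyLinkedListSketch();
        Node n1 = new Node(1), n2 = new Node(2), n3 = new Node(3);
        list.insertFirst(n1);
        list.insertBefore(n1, n3);
        list.insertBefore(n3, n2);
        System.out.println(list.forward());
        System.out.println(list.backward());
    }
}
```

Walking the list in both directions is a quick way to check that the prev and next references agree after each insertion.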

Theory

7) Give regular expressions for the following languages on {a, b, c}:

(a) all strings containing exactly one a

(b) all strings containing no more than three a’s.

8) Consider the sets of strings on {0, 1} defined by the requirements below. For each, construct an equivalent DFA.

(a) Every 00 is followed immediately by a 1. For example, the strings 101, 0010, 0010011001 are accepted, but 0001 and 00100 are not.

(b) The leftmost symbol differs from the rightmost one.

Turing Machine

9) Describe the following Turing machine in English (maximum 1 sentence). The ‘#’ character represents a blank space on the tape. The state in yellow is the starting state, and the H state is the stopping state.

Symbol Table

10) A symbol table is a data type that we use to associate values with keys. Symbol tables can be implemented in several ways.

a) The binary search implementation maintains two parallel arrays of keys and values, keeping them in key-sorted order. It uses binary search for retrieving values based on keys. The running time of binary search depends on the shape of BSTs. Considering a symbol table that stores 7 key-value pairs, draw a BST that represents the best-case shape and another BST that represents the worst-case shape.

b) The hash table implementation uses a hash code function to compute an index into an array of buckets or slots, from which the correct value can be found. Explain in words: 1. Why can the hash table implementation be faster than the binary search implementation? 2. Why does the performance of the hash table implementation depend on the hash function?

Trees and Graphs

11) MULTIPLE CHOICE (answer true or false)

a. DFA and NFA are equivalent in terms of which languages they can express.
b. A Turing machine is equivalent to any computer for solving search problems.
c. To prove that a problem is NP complete, we must show that its running time is exponential.
d. To prove that a problem is in P, it’s sufficient to propose one polynomial time algorithm for it.
e. If a problem is proven to be NP complete, we cannot possibly address it with computer science approaches.
f. All problems are solvable given a sufficiently powerful computer and a really smart algorithm.

12)

a) Describe the output of the DFS pseudocode below on the tree drawn below. Assume the search always starts with vertex A, and the for loop over a set of vertices follows alphabetical order (e.g. the for loop over the neighbour vertices of B follows A->D->E order).

DFS(G,v)   ( v is the vertex where the search starts )
    Stack S := {};   ( start with an empty stack )
    for each vertex u, set visited[u] := false;
    push S, v;
    while (S is not empty) do
        u := pop S;
        print name of u;
        if (not visited[u]) then
            visited[u] := true;
            for each unvisited neighbour w of u
                push S, w;
        end if
    end while
END DFS()

b) Describe the output on the graph below if the pseudocode above employs a queue instead of a stack. Assume the search always starts with vertex A, and the for loop over a set of vertices follows alphabetical order (e.g. for loop over neighbour vertices of B follows A->D->E order)
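The traversal pseudocode above can be translated into Java fairly directly. The sketch below is an illustration only: the class TraversalSketch, its method names, and the six-vertex tree in main are hypothetical (the exam's own figures are not reproduced in this text). A single Deque serves as either the stack or the queue, depending on the useQueue flag, and neighbours are visited in alphabetical order by storing them in a TreeSet.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class TraversalSketch {
    // Undirected graph; TreeSet keeps each neighbour list in alphabetical order.
    private final Map<String, TreeSet<String>> adj = new TreeMap<>();

    void addEdge(String u, String v) {
        adj.computeIfAbsent(u, key -> new TreeSet<>()).add(v);
        adj.computeIfAbsent(v, key -> new TreeSet<>()).add(u);
    }

    // Returns the vertex names in the order the pseudocode would print them.
    List<String> traverse(String start, boolean useQueue) {
        List<String> printed = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.addFirst(start);
        while (!frontier.isEmpty()) {
            String u = frontier.removeFirst();
            printed.add(u);                    // as in the pseudocode, print before the visited check
            if (!visited.contains(u)) {
                visited.add(u);
                for (String w : adj.getOrDefault(u, new TreeSet<>())) {
                    if (visited.contains(w)) continue;
                    if (useQueue) frontier.addLast(w);   // queue: remove from the opposite end
                    else frontier.addFirst(w);           // stack: remove from the same end
                }
            }
        }
        return printed;
    }

    public static void main(String[] args) {
        TraversalSketch t = new TraversalSketch();
        // Hypothetical tree, not the one drawn in the exam.
        t.addEdge("A", "B");
        t.addEdge("A", "C");
        t.addEdge("B", "D");
        t.addEdge("B", "E");
        t.addEdge("C", "F");
        System.out.println(String.join(" ", t.traverse("A", false)));  // A C F B E D
        System.out.println(String.join(" ", t.traverse("A", true)));   // A B C D E F
    }
}
```

Because the name is printed before the visited check, a vertex that is pushed more than once (possible in a graph with cycles) would appear more than once in the output; in a tree each vertex is pushed exactly once.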

Part 2: Programming: Analyzing Protein-Protein Interaction Network (30 points)

It would be a shame if we didn’t have a problem in computational biology, and it would be sad if we didn’t have a chance to analyze cool biological data, specifically protein-protein interaction networks. Just to make it a little more interesting, let’s work with human data and save the world. The Database of Interacting Proteins (DIP) is a database that documents experimentally validated protein interactions from many species ranging from yeast to human. Every once in a while, the database is updated as more and more protein interactions are experimentally verified by biologists. You will be analyzing the human protein-protein interaction network.

Glossary:

A path in a graph is a sequence of vertices such that from each of its vertices there is an edge to the next vertex in the sequence.

The length of a path, for our purposes, is the number of edges in it.

The degree of a vertex v is the number of vertices adjacent to (connected by an edge to) v.
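To make the glossary terms concrete, here is a minimal sketch of degree and shortest-path length on a small undirected graph. Everything in it is an assumption made for illustration: the class GraphGlossarySketch, its adjacency-map representation, and the vertex names P1 to P6 are hypothetical, and the Graph.java provided with the exam may store the network quite differently. The path length is found with breadth-first search, which reaches vertices in order of increasing number of edges from the start.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class GraphGlossarySketch {
    // Undirected graph stored as vertex -> set of adjacent vertices.
    private final Map<String, Set<String>> adj = new HashMap<>();

    void addEdge(String u, String v) {
        adj.computeIfAbsent(u, key -> new HashSet<>()).add(v);
        adj.computeIfAbsent(v, key -> new HashSet<>()).add(u);
    }

    // Degree: the number of vertices adjacent to v.
    int degree(String v) {
        Set<String> neighbours = adj.get(v);
        return neighbours == null ? 0 : neighbours.size();
    }

    // Number of edges on a shortest path from start to end,
    // or Integer.MAX_VALUE if no path exists.
    int distance(String start, String end) {
        if (!adj.containsKey(start) || !adj.containsKey(end)) return Integer.MAX_VALUE;
        Map<String, Integer> dist = new HashMap<>();
        Queue<String> queue = new ArrayDeque<>();
        dist.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            String u = queue.remove();
            if (u.equals(end)) return dist.get(u);
            for (String w : adj.get(u)) {
                if (!dist.containsKey(w)) {        // first time w is reached
                    dist.put(w, dist.get(u) + 1);
                    queue.add(w);
                }
            }
        }
        return Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        GraphGlossarySketch g = new GraphGlossarySketch();
        g.addEdge("P1", "P2");
        g.addEdge("P2", "P3");
        g.addEdge("P3", "P4");
        g.addEdge("P1", "P3");
        g.addEdge("P5", "P6");
        System.out.println(g.degree("P3"));          // 3
        System.out.println(g.distance("P1", "P4"));  // 2
    }
}
```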

Download Midterm.zip, found in the midterm section of the course website. Your goal here is to implement the following methods of the Graph class given in the Graph.java file from the course website:

// returns the vertex with the biggest degree in the graph
public String theMaxDegreeVertex()

// returns the length of the shortest path (the number of edges) between vertices start and end;
// if there’s no path, returns Integer.MAX_VALUE
public int distanceTo(String start, String end)

Part 3: Programming Exercise 2: MapColors (30 points)

You’ve seen two types of clustering this semester: K-means and hierarchical. Both have been applied to problems in computational biology: clustering genes based on expression profile, and generating a phylogenetic tree from protein sequences. Clustering has applications outside computational biology as well, and is widely used in computer science. For example, you could cluster the colors in a picture, and then assign a single color to all the colors in the same cluster. This can be used for artistic effect (music videos in the 1980s), or to take up less space (many graphics on the web are color-reduced). Here is what happens when you apply color-reduction to one of the rotating photos on the Princeton Computer Science department’s home page.

Original photo (millions of colors): cs.png

32 colors: out32.png

8 colors: out8.png

4 colors: out4.png

Write a program MapColors that takes in three arguments: the name of the picture to read in, the name of the picture to write out, and the number of colors.

java MapColors cs.png out4.png 4

We provide a skeletal MapColors.java, which supplies the basic outline of K-Means clustering. Thus, this question tests your conceptual understanding and your ability to read and understand code, rather than your ability to write code from scratch. In addition to the program, please also turn in your answers to the following questions (many of which are hints for the assignment).

1. A Color is just a triplet (R, G, B), where each of the three color channels ranges from 0 to 255. How do you find the distance between two Colors?

2. K-Means consists of setting initial clusters randomly, assigning items to the closest cluster, then shifting the centroids based on the items in the cluster. But for a photographic image, there could be millions of different colors. It might take hours to converge perfectly (such that no clusters change). Why might K-Means take a long time to converge? How do you relax the convergence criterion to speed up the process?

3. Suppose that we used single-link hierarchical clustering instead. How long would it take to cluster an M×N pixel image down to k colors? Assume that each pixel is a different color.

4. Why is there a big line in the sky across the 32-color photo of the CS building that wasn’t present in the 8-color and 4-color photos? Shouldn’t more colors deliver a more realistic photo?
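The color-reduction idea described in this part (assign every color in a cluster to a single color) can be sketched with plain integer triplets. This is a hedged illustration, not the provided MapColors.java: the class ColorDistanceSketch and its methods are hypothetical, colors are represented as int arrays {R, G, B}, and Euclidean distance in RGB space is only one common choice of distance.

```java
class ColorDistanceSketch {
    // Squared Euclidean distance between two {R, G, B} triplets.
    // The square root is skipped because it does not change which centroid is closest.
    static int squaredDistance(int[] a, int[] b) {
        int dr = a[0] - b[0];
        int dg = a[1] - b[1];
        int db = a[2] - b[2];
        return dr * dr + dg * dg + db * db;
    }

    // Index of the centroid closest to the given color (ties go to the lower index).
    static int nearest(int[] color, int[][] centroids) {
        int best = 0;
        for (int j = 1; j < centroids.length; j++) {
            if (squaredDistance(color, centroids[j]) < squaredDistance(color, centroids[best])) {
                best = j;
            }
        }
        return best;
    }

    // Color reduction: replace every pixel by a copy of its closest centroid.
    static int[][] reduce(int[][] pixels, int[][] centroids) {
        int[][] out = new int[pixels.length][];
        for (int i = 0; i < pixels.length; i++) {
            out[i] = centroids[nearest(pixels[i], centroids)].clone();
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] centroids = { {0, 0, 0}, {255, 0, 0} };       // black and red
        int[][] pixels = { {250, 10, 10}, {20, 30, 10} };
        int[][] reduced = reduce(pixels, centroids);
        System.out.println(java.util.Arrays.deepToString(reduced));
    }
}
```

With 8-bit channels the largest possible squared distance is 3 × 255², which fits comfortably in an int.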