proposing a recommendation system for donorschoose€¦ · and love to donate. implementing the...

Dataset chosen: Donors Choose.org

Report by Avengers

SAS GLOBAL FORM 2016: SAS STUDENT SYMPOSIUM

Proposing a Recommendation system for DonorsChoose.org

SAS GLOBAL FORUM 2016: SAS STUDENT SYMPOSIUM

Resources on the

Donors Choose website

CURRENT FEATURES News on the Donors Choose http://www.donorschoose.org/

TIMELINE Detailed timeline of their history http://www.donorschoose.org/about/stor

y.html

API AND OPEN DATA

There is a complete data available free

for download. There is data available

for five major categories: Projects: All

classroom projects that have been

posted so far on the website,

including lots of school info such as its

NCES ID (government-issued),

latitude/longitude, and city/state/zip.

Donations: All donations, including

donor city, state, and partial-zip. Gift

cards: All website-purchased gift cards,

including donor and recipient city,

state, and partial-zip. Project

resources: All materials/resources

requested for the classroom projects,

including vendor name. Project

written requests / essays: Full text of

the teacher-written requests

accompanying all classroom projects.

http://data.donorschoose.org/open-

data/overview/

BLOG Interesting articles and success stories http://www.donorschoose.org/blog/

HOW IT WORKS Brief video on how this works http://www.donorschoose.org/about

IMPACT Informative visualizations on the

impact of Donors Choose http://www.donorschoose.org/about/stor

y.html

PARTNERS A complete list of their partners and

supporters http://www.donorschoose.org/about/part

nership_center.html

VOLUNTEER WITH

DONORSCHOOSE Volunteers review the thank-you notes

to protect student privacy, check them

into our computer system, repackage

them, and mail them off to eager

donors — all within 24 hours. http://www.donorschoose.org/volunteer

STAFF AND BOARD Detailed list of people associated with

Donors Choose – The Team http://www.donorschoose.org/about/mee

t_the_team.html

OPEN PROJECTS Comprehensive list of projects that

needs funding http://www.donorschoose.org/donors/sea

rch.html

Table of Contents

INTRODUCTION ..................................................................................................................... 4

PROBLEM STATEMENT .......................................................................................................... 5

DATA CLEANING AND VALIDATION ........................................................................................ 5

VISUALIZATION ..................................................................................................................... 6

ANALYSIS .............................................................................................................................. 6

PREDICTIVE MODELING ..................................................................................................................6

DONOR SEGMENTATION ................................................................................................................7

RECOMMENDATION SYSTEM ..........................................................................................................8

SUGGESTION FOR FUTURE STUDY .......................................................................................... 9

CONCLUSIONS ....................................................................................................................... 9

APPENDICES ........................................................................................................................ 10

APPENDIX A: DATA EXTRACTION AND CLEANING .......................................................................... 10

APPENDIX B: EXPLORATORY ANALYSIS .......................................................................................... 12

APPENDIX C: PREDICTIVE MODELING AND SEGMENTATION RESULTS ............................................ 15

DonorsChoose.org is a United States–based nonprofit organization that allows individuals to donate directly to public school

classroom projects. Founded in 2000 by former public school teacher Charles Best, DonorsChoose.org was among the first civic

crowd funding platforms of its kind. The organization has received Charity Navigator’s highest rating every year since 2005.

Charles Best, a social studies teacher at Wings Academy in the Bronx, New York, founded donorsChoose.org. Charles and his

colleagues often spent their own money on school supplies for their students, and discussed materials they wished they could

afford in the teachers’ lunchroom.

Charles envisioned a platform for individuals to connect directly with classrooms in need, providing materials requested by

teachers. With the help of his students, he built the first version of the site in his classroom and invited colleagues to post

material requests. Charles anonymously funded the first 10 project requests to demonstrate the effectiveness of the site.

- Wikipedia

Contents

INTRODUCTION Donorschoose.org is a nonprofit organization that allows individuals to donate directly to public school

classroom projects. Teachers from public schools post request for funding project with a short essay

describing it. Donors all around the world can look at these projects when they login to

Donorschoose.org and donate to projects of their choice. The idea is to have personalized

recommendation webpage for all the donors, which will show them the projects, which they prefer, like

and love to donate. Implementing the recommender system for DonorsChoose.org website will improve

user experience and help more projects to meet their funding goals. It also will help us in understanding

the donors’ preferences and delivering to them what they want or value. One type of recommendation

system can be designed by predicting projects that will less likely to meet funding goal, segmenting and

profiling the donors and using that information for recommending right projects when the donors login

to DonorsChoose.org.

Figure 1: Process overview of how DonorsChoose.org

DATA UNDERSTANDING DonorsChoose.org is a United States based 501 nonprofit organization that allows individuals to donate

directly to public school classroom projects. Figure 1 gives a process overview of DonorsChoose.org. The

official website has a schema diagram that gives a brief understanding of the relationships among the

datasets. There are five different datasets related to the Donorschoose.org. They are “Projects”,

“Essays”, “Gift cards”, “Resources” and “Donations”. The “Projects” dataset contained school level,

teacher level, project level and demographic information about the project and data about when a

certain project was posted online, the completion date and the date when thank you packet was sent to

the donors. It also provided the funding status of the projects. It contained around 900K records and 50

different variables. “Donation” dataset contained several details about donors such as demographics

(city, state), behavior (donation message), and donation amount. Donation dataset had 5 Million

observations and around 20 variables. “Essays” datasets contained details about the essays, which were

posted by teachers regarding their projects. Essays dataset had 900K observations and 7 variables out of

which project ID is the unique identifier and rest 6 variables were plain text. “Resources” dataset

contained information regarding resources required for a certain project. “Gift cards” dataset contained

information regarding the gift cards bought on the website along with the mode of payment, city-state

and other details. Initial analysis suggested that most of the important information was contained in

Projects, donations and essays datasets and hence greater emphasis was placed on working with these

three datasets. A field project ID was used as the unique identifier for joining the projects and essays

dataset.

PROBLEM STATEMENT It is important that DonorsChoose.org understands what Donors like donating to and their preferences.

This in turn will help in getting a higher number of projects meet their funding goals. Literature review

on Donorschoose.org dataset suggested that substantial amount of research has been done on ranking

the projects that are posted on the website. There has been very little work done for building any kind

of recommendation system. Exploratory research revealed that approximately 70% of the projects meet

the funding goal where as 30% do not. There is a need to understand factors, which affect the funding

status of projects. In order to build a recommendation system for DonorsChoose.org, the team came up

with three-step process.

1. Analyze the impact of different factors on project funding and try to build a predictive model to find

whether a project will get funding or not.

2. Attempt to understand the donors’ preferences and needs to group them into clusters. Segmenting,

profiling the existing donors helps us to understand the new donors as well

3. Target the specific donor segments with the predicted unfunded projects [from 1st part], so that

number of funded projects increases.

By implementing this recommendation system, projects that are not likely to meet its funding goal can

be recommended to Donors based on their preferences.

DATA CLEANING AND VALIDATION

The datasets were in .csv format. There were many punctuations marks in the text fields which included

characters such as “,”,”/”,”//” and “.”. The issue was addressed with the help of UNIX shell scripting and

SAS codes. A composite delimiter “$|” and dummy indicator “<new _ line>” was included to maintain

the data consistency. In order to convert.csv files into SAS datasets, DLMSTR=’$|’ option was used. After

the SAS datasets were created, validation of data was done to confirm that none of the observations

and variables were truncated. Many new variables were created in SAS datasets. The steps followed to

convert the .csv files in SAS files can be viewed in Appendix A: Data cleaning and validation part.

Projects dataset: Exploratory analysis showed that 67.45% of the projects were completed, 28.83% had

expired and the rest were into other (re-allocated and go-live. Basic descriptive statistics show that

variables such as students _ reached, total _ price _ excluding _ price have an extremely high skewness

and kurtosis values. Looking at the box plots and histogram of these variables, suggested that there

were many outliers. For example, the highest value in total_price_excluding_optional was 10,250,017

whereas the 99th percentile was only 2,567.49. Therefore, the data was filtered and transformed in

order to reduce the skewness and kurtosis to permissible values (closer to zero). Essays dataset: It

contained many text variables. We made sure that the text was imported successfully into the SAS

datasets. Donations dataset: This dataset contained data regarding donations by donors. Several

variables such as donation_amount had to be cleaned after examining their distributions. Resources and

gift cards: These datasets revealed information about resources for each project and various gift cards

given. Several high performance procedures like HPIMPUTE, HPSUMMARY were used to clean

donations, essays, resources, projects and gift cards datasets.

VISUALIZATION The visualizations done in SAS studio can be seen in Appendix B: Exploratory Analysis

ANALYSIS Analysis and approach to this problem statement is three fold.

1. Predict the projects which will not meet funding goals; involves the finding the factors responsible

for predicting the project funding status and building a predictive model for the same.

2. Segment donors based on the preferences and behavior by looking at their past donations.

3. Suggest the unfunded projects to the donors who may like it and fund it.

The first part of the analysis involves the finding the factors responsible for predicting the project

funding and building a predictive model. The step-by-step analysis done can be seen below in figure 2.

Figure 2: Steps followed to solve the problem statement

Predicting if a project meets its funding goal or not:

In order to predict if a project will reach its funding goal or not, we created a target variable called as

‘funding _ coded’. This takes a value of ‘1’ if a particular project meets its funding goal and takes a value

of ‘0’ if it is expired. There were four categories of funding status in the project. Completed, expired, live

and re-allocated. We have a value of ‘1’ for completed projects and a value of ‘0’ for expired and re-

allocated projects. The live projects were filtered and used as scoring dataset. Initial distribution of

projects, which met its funding goal, and those, which expired, was approximately 70% and 30%

respectively. Stratified sampling was done with 60% for training the model, 20% for validating the model

and 20% for testing purpose. Prior to predictive modeling variables were imputed using various kinds of

imputation techniques. Filtering and transformation techniques were used to deal with outliers,

reducing skewness and kurtosis of different variables. Various kinds of variable selection and variable

reduction techniques were used. Few important variables we choose from projects dataset were

resource _ type, poverty _ level, grade _ level, number _ of _ students _ reached and total _ price _

excluding _ optional.

PREDICTIVE MODELING Various kinds of predictive models such as Logistic regression, Decision trees, Random forest, Neural

Networks, Ensemble and Rule-based models were built in order to predict if a given project will meet its

funding goal or not. Statistics such as ROC index and Misclassification rate were used to pick the best

models. Input variables used were structured data from projects dataset, text clusters, text topics

created from essays datasets and a combination of these together. Rule-based models were built using

essays text field.

Dealing with text variables: Essays dataset contained many text fields such as title, short _ description,

need statement and essays. These are the text fields that are filled up by the teachers when they fill out

the projects on donorschoose.org. The way these essays were written can affect if the project would

reach its funding goal or not. Initially the dataset was parsed and interactive filtering was done to drop

few words and view concept links. We used the HPTMINE node in text miner to build text cluster and

text topics. These text topics and clusters were used as input variables in the predictive modeling.

Predictive Modeling results: Analysis reveals that for every $10 increase in cost of the project, the odds

of project reaching funding goal is likely to reduce by 10%. The odds for grades level 6-8 are 12 % lower

than the odds of pre k-2 schools. Whereas the odds of pre k-2 are 8% higher than the projects belonging

to grades 3-5. The odds of project belonging to applied sciences primary subject is 22% higher than the

odds of visual arts. The odds of projects belonging to literature & writing are 25% lower than the

projects belonging to visual arts. Similarly, the odds of projects, which have primary subject of music, is

37% higher than that of visual arts. We observe that projects posted in summer have higher probability

of project reaching funding goal that those posted in the rest of the year. The odds for projects posted in

May are 20% higher compared to the projects posted in the month of December. The projects which

have a teacher prefix of ‘Mrs.’ have 13 % lower odds compared to those which have a teacher prefix of

‘Ms.’. Technology projects have a 60% lower odds compared to projects which have resource type as

‘Books’. Projects belonging to schools from highest poverty area have 21 % higher odds compared to

projects belonging to moderate poverty. Modeling results can be seen in Appendix 4.4

The best model turned out to be Neural Network with a Misclassification rate of 27.68% and a ROC

index of 0.74. Model performance was verified in order to avoid any kind of over-fitting and under-

fitting of models. In the final model the variables used were resource_type, grade_level, poverty_level,

total cost of project excluding optional, number of students reached, month when project was posted,

state to which the project belonged to and the cluster membership of various text variables in essays

dataset.

Scoring the live projects: There are currently around 26,000 projects which are under live phase in

projects dataset. The best model stated above, was used to score this dataset in order to predict if these

projects will reach its funding goal or not. The analysis results can be seen in appendix C.

DONOR SEGMENTATION The next step done was segmenting the existing donors from the Donors dataset. There were around 5

million records in the Donors dataset. Aggregation was done at a donor level using Donor_ID as the

primary key and various new variables were created. There are approximately 1.6 million unique donors

for DonorsChoose.org.

New variables created: Many new variables such as number of total donations, average time between

each donation, number of times donor donated to projects in same state, number of times donor

donated to each kind of grade level, poverty level, resource needed to projects etc. These variables were

used as input variables for donor segmentation. All the necessary variable transformation for the

clustering and segmentation is done and details are attached in the appendix C. Segmentation and

segment profiling was done. The entire donors in Donations dataset were divided into 7 segments.

Table 1: Table explaining the weighted importance of each variable with respect to its segments

Table 2: Table showing the different segments and the values they care about

Details of the seven segments and the characteristics in each of these segments can be seen in table 1

and table 2. Names were assigned to each of these segments in order to uniquely define them.

RECOMMENDATION SYSTEM Approximately 75% of the Donors have donated only

once. Therefore, there are not enough data points to

create a recommender system on an individual donor

level. That is the reason why segmentation of donors was

done. Using the characteristics of the segments, of

suitable projects were recommended. For example,

consider this case: the donor [ID:

00000ce845c00cbf0686c992fc369df4] belongs to the

Technology lovers’ segments. Based on the results of the

predictive models, a list of live projects, which are likely

not going to meet the funding goal, was generated. These

were filtered with the conditions based on the

characteristics from Technology lover segments and

sorted it by the probability of the project being unfunded. Then, based on the location where the

project exists, recommendations were given to the donor when he or she login to DonorsChoose.org.

For the donorID: 00000ce845c00cbf0686c992fc369df4 [Technology lover], the recommended live

projects title is shown in the table. In this way, the donor does not have to put much effort to find the

project, which donor likes, and the projects, which are in the verge of unfunded, will get full funding.

This is just an example of 12 projects recommended to the donor segment 6, who love donating to

projects involving technology. This way we created a recommendation for all the seven segment of

donors. A few projects might exist in one or more recommendation for segments. This way after the

donor has donated one time, the next time he or she login to DonorsChoose.org, based on previous

donation history, segmentation can be done as shown above and recommendation of projects can be

done. This recommendation system works for donors who have done a prior donation. Before a donor

donates for the first time, no prior account is created, nor is the donor asked to give any preferences.

Hence, with the existing process in operation of the website, it is difficult to build a recommendation

system for first time donors.

SUGGESTION FOR FUTURE STUDY In the existing system, a recommendation cannot be given to donor who is about to donate for first time

as there is no data about their preferences. Instead the system can be slightly modified such that prior

to first donation, preferences of donor can be asked. Based on preferences, we can map the donor to

one of the segments demonstrated above and a recommendation can be made. Using click stream data,

the results of predictive as well as segmentation models used can be improved. There are groups of

donors who donate only because they are teachers in the school, or their children study in the school, or

they know the teacher personally, or the teacher is a relative. The donation message can be text mined

to improve the segmentation results.

CONCLUSIONS Wherever a transaction is done in online, everyone come across “You may like this”, “Users Who Viewed

This Item Also Viewed” section. It makes the users’ life easier by showing the related products which

they may like and help them in choosing the right product effectively. If this can be done for a monetary

benefit in all domains, why can’t we do the same for a good cause as well? This project is a simple

prototype for our idea. Using the best model, the live projects were scored. 1.6 million donors were

segmented into seven segments, and based on the characteristics of each segment, recommendations

were made from the set of live projects to each segments. These recommended projects match the

likes, and preferences of donors based on previous donation data. If the donor has donated previously

then this recommendation system can be used to recommend projects that donors like. This

recommendation system gives a good experience to donors, by reducing the time taken to search for

projects they like. By implementing this recommendation system, we can understand what the donor is

looking at and what they like to donate for. It can be made better by adding the details mentioned in the

scope for the future section. As demonstrated in this document, DonorsChoose.org website can be

made better by implementing this recommendation system. It will not only help in donors having a

better experience, but also can increase the percentage of projects that meet funding goal.

APPENDICES

APPENDIX A: DATA EXTRACTION AND CLEANING 1.1: Code used to extract and clean csv files and convert them into SAS files

UNIX shell scripting for data preparation.

We have used following scripts to address data issues of .csv files. We have converted the .csv files to

text files.

We had a unique issue of presence of new line characters in the text columns. We replaced these new

line characters with <new_line> to facilitate SAS dataset creation.

Additionally the comma delimiter is replaced by the composite delimiter $|

/////////////// Code begins for opendata_projects.csv ///////////////////////////

cat opendata_projects.csv | awk 'gsub(/\r/,"") {printf "%s" , $0;next} {print}' > new.txt

// this is to remove additional new line characters in the data

cat new.txt | awk '{ if (gsub(/\\$/, " ")) printf "%s",$0; else print }' > 1.txt

sed 's/ \\\\\\\\/ <new_line> /g' 1.txt > 2.txt

// this replaces the comma delimiter by $| to address the parsing issue of any .csv files

sed 's/\",\"/\"$|\"/g' 2.txt > 3.txt

sed '1s/\,/$|/g' 3.txt > opendata_projects.txt

////////////////////////////// Code ends //////////////////////////////////////////

Similar code is used for modifying the other 6 datasets.

The Donation dataset has quantitative data pertaining to donations as well as the donations message

sent by a donor while donating to a project. We have separated the textual information (donation

message) from the quantitative data by creating two separate datasets as per below.

cut -d "," -f1-22 opendata_donations.txt > donations.txt

cut -d "," -f1,23 opendata_donations.txt > donations_message.txt

1.2: Code used to extract files from and push files into HDFS

1.3: Code used to clean the datasets and convert variables into appropriate data types

APPENDIX B: EXPLORATORY ANALYSIS 2.1: Distribution of project status in projects dataset

2.2: Average total price excluding optional for different project status

2.3: Cross tab between the resource_type needed for project and the project funding status

2.4: One-way frequencies of different secondary focus area in the project

2.5: Summary of missing values of different categorical variables in projects dataset

2.6: Summary statistics of different continuous variables in dataset

APPENDIX C: PREDICTIVE MODELING AND SEGMENTATION RESULTS 3.1: Data partition result for modeling purpose (60:20:20)

3.2: Imputation results of different categorical variables

3.3: Filtering bounds for different continuous variables

3.4: iteration plot and ROC index of logistic regression, used to predict if a given project meets funding

3.5: Model comparison of different models built

3.6: Concept links for word help present in essays variable of essays dataset

3.7: Different words present in text topics built for essays text variable in essays dataset

3.8: Number of documents by topics and number of terms by topics analysis

3.9: Different rules for rule-based models built to predict the funding status of different projects

4.0: Fit statistics of rule-based models

4.1: Segmentation of donors output

4.2: Variable worth for of different variables in various segments of donors

4.3: Donor segment profiling output

4.4: SAS Enterprise Miner modeling diagram

proposing a recommendation system for donorschoose€¦ · and love to donate. implementing the...

Documents

stakeholder communication in proposing change

proposing a marketing disaster severity index: measuring

proposing e-health in indonesia

insights from donorschoose

shady grove swsa x donorschoose.org fundraiser€¦ ·...

www.donorschoose.org requesting resources for your school at...

jessica sinclair urban assembly school for wildlife ... for...

proposing varanasi(1)

optimizely experience - donorschoose.org

how to use donorschoose.org and get the classroom supplies...

district toolkit - casbo€¦ · crowdfunding best practice...

proposing an interpolated pipeline adc

lessons from a crowdfunding platform: how donorschoose.org...

proposing sql statement coverage metrics

proposing institution: south carolina state university

proposing a model for collective stupidity

proposing for vla and vlba time

evaluating existing and proposing new seismic design

what the district is proposing……

beehive accelerator program at universitas...