proposing a recommendation system for donorschoose€¦ · and love to donate. implementing the...
Post on 01-May-2020
2 Views
Preview:
TRANSCRIPT
1
Dataset chosen: Donors Choose.org
Report by Avengers
SAS GLOBAL FORM 2016: SAS STUDENT SYMPOSIUM
Proposing a Recommendation system for DonorsChoose.org
SAS GLOBAL FORUM 2016: SAS STUDENT SYMPOSIUM
2
Resources on the
Donors Choose website
CURRENT FEATURES News on the Donors Choose http://www.donorschoose.org/
TIMELINE Detailed timeline of their history http://www.donorschoose.org/about/stor
y.html
API AND OPEN DATA
There is a complete data available free
for download. There is data available
for five major categories: Projects: All
classroom projects that have been
posted so far on the website,
including lots of school info such as its
NCES ID (government-issued),
latitude/longitude, and city/state/zip.
Donations: All donations, including
donor city, state, and partial-zip. Gift
cards: All website-purchased gift cards,
including donor and recipient city,
state, and partial-zip. Project
resources: All materials/resources
requested for the classroom projects,
including vendor name. Project
written requests / essays: Full text of
the teacher-written requests
accompanying all classroom projects.
http://data.donorschoose.org/open-
data/overview/
BLOG Interesting articles and success stories http://www.donorschoose.org/blog/
HOW IT WORKS Brief video on how this works http://www.donorschoose.org/about
IMPACT Informative visualizations on the
impact of Donors Choose http://www.donorschoose.org/about/stor
y.html
PARTNERS A complete list of their partners and
supporters http://www.donorschoose.org/about/part
nership_center.html
VOLUNTEER WITH
DONORSCHOOSE Volunteers review the thank-you notes
to protect student privacy, check them
into our computer system, repackage
them, and mail them off to eager
donors — all within 24 hours. http://www.donorschoose.org/volunteer
STAFF AND BOARD Detailed list of people associated with
Donors Choose – The Team http://www.donorschoose.org/about/mee
t_the_team.html
OPEN PROJECTS Comprehensive list of projects that
needs funding http://www.donorschoose.org/donors/sea
rch.html
3
Table of Contents
INTRODUCTION ..................................................................................................................... 4
PROBLEM STATEMENT .......................................................................................................... 5
DATA CLEANING AND VALIDATION ........................................................................................ 5
VISUALIZATION ..................................................................................................................... 6
ANALYSIS .............................................................................................................................. 6
PREDICTIVE MODELING ..................................................................................................................6
DONOR SEGMENTATION ................................................................................................................7
RECOMMENDATION SYSTEM ..........................................................................................................8
SUGGESTION FOR FUTURE STUDY .......................................................................................... 9
CONCLUSIONS ....................................................................................................................... 9
APPENDICES ........................................................................................................................ 10
APPENDIX A: DATA EXTRACTION AND CLEANING .......................................................................... 10
APPENDIX B: EXPLORATORY ANALYSIS .......................................................................................... 12
APPENDIX C: PREDICTIVE MODELING AND SEGMENTATION RESULTS ............................................ 15
DonorsChoose.org is a United States–based nonprofit organization that allows individuals to donate directly to public school
classroom projects. Founded in 2000 by former public school teacher Charles Best, DonorsChoose.org was among the first civic
crowd funding platforms of its kind. The organization has received Charity Navigator’s highest rating every year since 2005.
Charles Best, a social studies teacher at Wings Academy in the Bronx, New York, founded donorsChoose.org. Charles and his
colleagues often spent their own money on school supplies for their students, and discussed materials they wished they could
afford in the teachers’ lunchroom.
Charles envisioned a platform for individuals to connect directly with classrooms in need, providing materials requested by
teachers. With the help of his students, he built the first version of the site in his classroom and invited colleagues to post
material requests. Charles anonymously funded the first 10 project requests to demonstrate the effectiveness of the site.
- Wikipedia
Contents
4
INTRODUCTION Donorschoose.org is a nonprofit organization that allows individuals to donate directly to public school
classroom projects. Teachers from public schools post request for funding project with a short essay
describing it. Donors all around the world can look at these projects when they login to
Donorschoose.org and donate to projects of their choice. The idea is to have personalized
recommendation webpage for all the donors, which will show them the projects, which they prefer, like
and love to donate. Implementing the recommender system for DonorsChoose.org website will improve
user experience and help more projects to meet their funding goals. It also will help us in understanding
the donors’ preferences and delivering to them what they want or value. One type of recommendation
system can be designed by predicting projects that will less likely to meet funding goal, segmenting and
profiling the donors and using that information for recommending right projects when the donors login
to DonorsChoose.org.
Figure 1: Process overview of how DonorsChoose.org
DATA UNDERSTANDING DonorsChoose.org is a United States based 501 nonprofit organization that allows individuals to donate
directly to public school classroom projects. Figure 1 gives a process overview of DonorsChoose.org. The
official website has a schema diagram that gives a brief understanding of the relationships among the
datasets. There are five different datasets related to the Donorschoose.org. They are “Projects”,
“Essays”, “Gift cards”, “Resources” and “Donations”. The “Projects” dataset contained school level,
teacher level, project level and demographic information about the project and data about when a
certain project was posted online, the completion date and the date when thank you packet was sent to
the donors. It also provided the funding status of the projects. It contained around 900K records and 50
different variables. “Donation” dataset contained several details about donors such as demographics
(city, state), behavior (donation message), and donation amount. Donation dataset had 5 Million
observations and around 20 variables. “Essays” datasets contained details about the essays, which were
posted by teachers regarding their projects. Essays dataset had 900K observations and 7 variables out of
which project ID is the unique identifier and rest 6 variables were plain text. “Resources” dataset
contained information regarding resources required for a certain project. “Gift cards” dataset contained
information regarding the gift cards bought on the website along with the mode of payment, city-state
and other details. Initial analysis suggested that most of the important information was contained in
Projects, donations and essays datasets and hence greater emphasis was placed on working with these
5
three datasets. A field project ID was used as the unique identifier for joining the projects and essays
dataset.
PROBLEM STATEMENT It is important that DonorsChoose.org understands what Donors like donating to and their preferences.
This in turn will help in getting a higher number of projects meet their funding goals. Literature review
on Donorschoose.org dataset suggested that substantial amount of research has been done on ranking
the projects that are posted on the website. There has been very little work done for building any kind
of recommendation system. Exploratory research revealed that approximately 70% of the projects meet
the funding goal where as 30% do not. There is a need to understand factors, which affect the funding
status of projects. In order to build a recommendation system for DonorsChoose.org, the team came up
with three-step process.
1. Analyze the impact of different factors on project funding and try to build a predictive model to find
whether a project will get funding or not.
2. Attempt to understand the donors’ preferences and needs to group them into clusters. Segmenting,
profiling the existing donors helps us to understand the new donors as well
3. Target the specific donor segments with the predicted unfunded projects [from 1st part], so that
number of funded projects increases.
By implementing this recommendation system, projects that are not likely to meet its funding goal can
be recommended to Donors based on their preferences.
DATA CLEANING AND VALIDATION
The datasets were in .csv format. There were many punctuations marks in the text fields which included
characters such as “,”,”/”,”//” and “.”. The issue was addressed with the help of UNIX shell scripting and
SAS codes. A composite delimiter “$|” and dummy indicator “<new _ line>” was included to maintain
the data consistency. In order to convert.csv files into SAS datasets, DLMSTR=’$|’ option was used. After
the SAS datasets were created, validation of data was done to confirm that none of the observations
and variables were truncated. Many new variables were created in SAS datasets. The steps followed to
convert the .csv files in SAS files can be viewed in Appendix A: Data cleaning and validation part.
Projects dataset: Exploratory analysis showed that 67.45% of the projects were completed, 28.83% had
expired and the rest were into other (re-allocated and go-live. Basic descriptive statistics show that
variables such as students _ reached, total _ price _ excluding _ price have an extremely high skewness
and kurtosis values. Looking at the box plots and histogram of these variables, suggested that there
were many outliers. For example, the highest value in total_price_excluding_optional was 10,250,017
whereas the 99th percentile was only 2,567.49. Therefore, the data was filtered and transformed in
order to reduce the skewness and kurtosis to permissible values (closer to zero). Essays dataset: It
contained many text variables. We made sure that the text was imported successfully into the SAS
datasets. Donations dataset: This dataset contained data regarding donations by donors. Several
variables such as donation_amount had to be cleaned after examining their distributions. Resources and
gift cards: These datasets revealed information about resources for each project and various gift cards
given. Several high performance procedures like HPIMPUTE, HPSUMMARY were used to clean
donations, essays, resources, projects and gift cards datasets.
6
VISUALIZATION The visualizations done in SAS studio can be seen in Appendix B: Exploratory Analysis
ANALYSIS Analysis and approach to this problem statement is three fold.
1. Predict the projects which will not meet funding goals; involves the finding the factors responsible
for predicting the project funding status and building a predictive model for the same.
2. Segment donors based on the preferences and behavior by looking at their past donations.
3. Suggest the unfunded projects to the donors who may like it and fund it.
The first part of the analysis involves the finding the factors responsible for predicting the project
funding and building a predictive model. The step-by-step analysis done can be seen below in figure 2.
Figure 2: Steps followed to solve the problem statement
Predicting if a project meets its funding goal or not:
In order to predict if a project will reach its funding goal or not, we created a target variable called as
‘funding _ coded’. This takes a value of ‘1’ if a particular project meets its funding goal and takes a value
of ‘0’ if it is expired. There were four categories of funding status in the project. Completed, expired, live
and re-allocated. We have a value of ‘1’ for completed projects and a value of ‘0’ for expired and re-
allocated projects. The live projects were filtered and used as scoring dataset. Initial distribution of
projects, which met its funding goal, and those, which expired, was approximately 70% and 30%
respectively. Stratified sampling was done with 60% for training the model, 20% for validating the model
and 20% for testing purpose. Prior to predictive modeling variables were imputed using various kinds of
imputation techniques. Filtering and transformation techniques were used to deal with outliers,
reducing skewness and kurtosis of different variables. Various kinds of variable selection and variable
reduction techniques were used. Few important variables we choose from projects dataset were
resource _ type, poverty _ level, grade _ level, number _ of _ students _ reached and total _ price _
excluding _ optional.
PREDICTIVE MODELING Various kinds of predictive models such as Logistic regression, Decision trees, Random forest, Neural
Networks, Ensemble and Rule-based models were built in order to predict if a given project will meet its
funding goal or not. Statistics such as ROC index and Misclassification rate were used to pick the best
models. Input variables used were structured data from projects dataset, text clusters, text topics
created from essays datasets and a combination of these together. Rule-based models were built using
essays text field.
Dealing with text variables: Essays dataset contained many text fields such as title, short _ description,
need statement and essays. These are the text fields that are filled up by the teachers when they fill out
the projects on donorschoose.org. The way these essays were written can affect if the project would
reach its funding goal or not. Initially the dataset was parsed and interactive filtering was done to drop
7
few words and view concept links. We used the HPTMINE node in text miner to build text cluster and
text topics. These text topics and clusters were used as input variables in the predictive modeling.
Predictive Modeling results: Analysis reveals that for every $10 increase in cost of the project, the odds
of project reaching funding goal is likely to reduce by 10%. The odds for grades level 6-8 are 12 % lower
than the odds of pre k-2 schools. Whereas the odds of pre k-2 are 8% higher than the projects belonging
to grades 3-5. The odds of project belonging to applied sciences primary subject is 22% higher than the
odds of visual arts. The odds of projects belonging to literature & writing are 25% lower than the
projects belonging to visual arts. Similarly, the odds of projects, which have primary subject of music, is
37% higher than that of visual arts. We observe that projects posted in summer have higher probability
of project reaching funding goal that those posted in the rest of the year. The odds for projects posted in
May are 20% higher compared to the projects posted in the month of December. The projects which
have a teacher prefix of ‘Mrs.’ have 13 % lower odds compared to those which have a teacher prefix of
‘Ms.’. Technology projects have a 60% lower odds compared to projects which have resource type as
‘Books’. Projects belonging to schools from highest poverty area have 21 % higher odds compared to
projects belonging to moderate poverty. Modeling results can be seen in Appendix 4.4
The best model turned out to be Neural Network with a Misclassification rate of 27.68% and a ROC
index of 0.74. Model performance was verified in order to avoid any kind of over-fitting and under-
fitting of models. In the final model the variables used were resource_type, grade_level, poverty_level,
total cost of project excluding optional, number of students reached, month when project was posted,
state to which the project belonged to and the cluster membership of various text variables in essays
dataset.
Scoring the live projects: There are currently around 26,000 projects which are under live phase in
projects dataset. The best model stated above, was used to score this dataset in order to predict if these
projects will reach its funding goal or not. The analysis results can be seen in appendix C.
DONOR SEGMENTATION The next step done was segmenting the existing donors from the Donors dataset. There were around 5
million records in the Donors dataset. Aggregation was done at a donor level using Donor_ID as the
primary key and various new variables were created. There are approximately 1.6 million unique donors
for DonorsChoose.org.
New variables created: Many new variables such as number of total donations, average time between
each donation, number of times donor donated to projects in same state, number of times donor
donated to each kind of grade level, poverty level, resource needed to projects etc. These variables were
used as input variables for donor segmentation. All the necessary variable transformation for the
clustering and segmentation is done and details are attached in the appendix C. Segmentation and
segment profiling was done. The entire donors in Donations dataset were divided into 7 segments.
8
Table 1: Table explaining the weighted importance of each variable with respect to its segments
Table 2: Table showing the different segments and the values they care about
Details of the seven segments and the characteristics in each of these segments can be seen in table 1
and table 2. Names were assigned to each of these segments in order to uniquely define them.
RECOMMENDATION SYSTEM Approximately 75% of the Donors have donated only
once. Therefore, there are not enough data points to
create a recommender system on an individual donor
level. That is the reason why segmentation of donors was
done. Using the characteristics of the segments, of
suitable projects were recommended. For example,
consider this case: the donor [ID:
00000ce845c00cbf0686c992fc369df4] belongs to the
Technology lovers’ segments. Based on the results of the
predictive models, a list of live projects, which are likely
not going to meet the funding goal, was generated. These
were filtered with the conditions based on the
characteristics from Technology lover segments and
sorted it by the probability of the project being unfunded. Then, based on the location where the
project exists, recommendations were given to the donor when he or she login to DonorsChoose.org.
For the donorID: 00000ce845c00cbf0686c992fc369df4 [Technology lover], the recommended live
projects title is shown in the table. In this way, the donor does not have to put much effort to find the
9
project, which donor likes, and the projects, which are in the verge of unfunded, will get full funding.
This is just an example of 12 projects recommended to the donor segment 6, who love donating to
projects involving technology. This way we created a recommendation for all the seven segment of
donors. A few projects might exist in one or more recommendation for segments. This way after the
donor has donated one time, the next time he or she login to DonorsChoose.org, based on previous
donation history, segmentation can be done as shown above and recommendation of projects can be
done. This recommendation system works for donors who have done a prior donation. Before a donor
donates for the first time, no prior account is created, nor is the donor asked to give any preferences.
Hence, with the existing process in operation of the website, it is difficult to build a recommendation
system for first time donors.
SUGGESTION FOR FUTURE STUDY In the existing system, a recommendation cannot be given to donor who is about to donate for first time
as there is no data about their preferences. Instead the system can be slightly modified such that prior
to first donation, preferences of donor can be asked. Based on preferences, we can map the donor to
one of the segments demonstrated above and a recommendation can be made. Using click stream data,
the results of predictive as well as segmentation models used can be improved. There are groups of
donors who donate only because they are teachers in the school, or their children study in the school, or
they know the teacher personally, or the teacher is a relative. The donation message can be text mined
to improve the segmentation results.
CONCLUSIONS Wherever a transaction is done in online, everyone come across “You may like this”, “Users Who Viewed
This Item Also Viewed” section. It makes the users’ life easier by showing the related products which
they may like and help them in choosing the right product effectively. If this can be done for a monetary
benefit in all domains, why can’t we do the same for a good cause as well? This project is a simple
prototype for our idea. Using the best model, the live projects were scored. 1.6 million donors were
segmented into seven segments, and based on the characteristics of each segment, recommendations
were made from the set of live projects to each segments. These recommended projects match the
likes, and preferences of donors based on previous donation data. If the donor has donated previously
then this recommendation system can be used to recommend projects that donors like. This
recommendation system gives a good experience to donors, by reducing the time taken to search for
projects they like. By implementing this recommendation system, we can understand what the donor is
looking at and what they like to donate for. It can be made better by adding the details mentioned in the
scope for the future section. As demonstrated in this document, DonorsChoose.org website can be
made better by implementing this recommendation system. It will not only help in donors having a
better experience, but also can increase the percentage of projects that meet funding goal.
10
APPENDICES
APPENDIX A: DATA EXTRACTION AND CLEANING 1.1: Code used to extract and clean csv files and convert them into SAS files
UNIX shell scripting for data preparation.
We have used following scripts to address data issues of .csv files. We have converted the .csv files to
text files.
We had a unique issue of presence of new line characters in the text columns. We replaced these new
line characters with <new_line> to facilitate SAS dataset creation.
Additionally the comma delimiter is replaced by the composite delimiter $|
/////////////// Code begins for opendata_projects.csv ///////////////////////////
cat opendata_projects.csv | awk 'gsub(/\r/,"") {printf "%s" , $0;next} {print}' > new.txt
// this is to remove additional new line characters in the data
cat new.txt | awk '{ if (gsub(/\\$/, " ")) printf "%s",$0; else print }' > 1.txt
sed 's/ \\\\\\\\/ <new_line> /g' 1.txt > 2.txt
// this replaces the comma delimiter by $| to address the parsing issue of any .csv files
sed 's/\",\"/\"$|\"/g' 2.txt > 3.txt
sed '1s/\,/$|/g' 3.txt > opendata_projects.txt
////////////////////////////// Code ends //////////////////////////////////////////
Similar code is used for modifying the other 6 datasets.
The Donation dataset has quantitative data pertaining to donations as well as the donations message
sent by a donor while donating to a project. We have separated the textual information (donation
message) from the quantitative data by creating two separate datasets as per below.
cut -d "," -f1-22 opendata_donations.txt > donations.txt
cut -d "," -f1,23 opendata_donations.txt > donations_message.txt
1.2: Code used to extract files from and push files into HDFS
11
1.3: Code used to clean the datasets and convert variables into appropriate data types
12
APPENDIX B: EXPLORATORY ANALYSIS 2.1: Distribution of project status in projects dataset
2.2: Average total price excluding optional for different project status
13
2.3: Cross tab between the resource_type needed for project and the project funding status
2.4: One-way frequencies of different secondary focus area in the project
14
2.5: Summary of missing values of different categorical variables in projects dataset
2.6: Summary statistics of different continuous variables in dataset
15
APPENDIX C: PREDICTIVE MODELING AND SEGMENTATION RESULTS 3.1: Data partition result for modeling purpose (60:20:20)
3.2: Imputation results of different categorical variables
3.3: Filtering bounds for different continuous variables
16
3.4: iteration plot and ROC index of logistic regression, used to predict if a given project meets funding
goal
17
3.5: Model comparison of different models built
3.6: Concept links for word help present in essays variable of essays dataset
18
3.7: Different words present in text topics built for essays text variable in essays dataset
3.8: Number of documents by topics and number of terms by topics analysis
19
3.9: Different rules for rule-based models built to predict the funding status of different projects
4.0: Fit statistics of rule-based models
20
4.1: Segmentation of donors output
4.2: Variable worth for of different variables in various segments of donors
21
4.3: Donor segment profiling output
4.4: SAS Enterprise Miner modeling diagram
top related