content analysis on healthy dish recipes
Post on 15-Apr-2017
Content Analysis on
Healthy Main Dish Recipes
Professor Yilu Zhou
Xiayu Zeng, Hong Zhang, Feifei Chen, Yixi Zhang
Introduction
Background
● Allrecipes.com provides a platform and forum where anyone can search for recipes, post recipes and comment on them. Because people are now more health-conscious and prefer a healthy, balanced diet, the website wants to attract more visits by improving the quantity and quality of its recipes.
Problem Statement
● The research studies 718 healthy main dish recipes and explores:
1. Which ingredients and/or cooking methods are most popular for a healthy diet?
2. Which factors contribute most to the popularity of a recipe?
3. What writing style of recipes attracts the most reviews?
Methodology
Data Source
● Website: allrecipes.com
Data Preprocessing
● Variables Selection
● Data Cleansing
Analytics
Text Mining
● Weka (Ingredients)
● Stanford Parser (Directions)
● Python
Data Mining
● R
● SPSS Modeler
● Tableau
Application
● Top influence factors
Variable lists:
● Number of reviews
● Video (Y/N)
● Photo (Y/N)
* Number of reviews is used to measure the popularity of the recipe
● Number of Ingredients
● Ingredients (text)
● Prep time
● Cook time
● Ready time
* Ready time does not necessarily equal (Prep time + Cook time)
● Directions (text)
● Length of directions
● Number of steps
● Average length of directions
● Calories
● Cholesterol
● Fiber
● Sodium
● Carbohydrates
● Fat
● Protein
[Figure: Example of a recipe]
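The three direction-length variables in the list can be derived from the raw directions text. The following is a minimal sketch only: it assumes one step per line and lengths measured in words, neither of which the slides state, and the function name `direction_features` is hypothetical.

```python
def direction_features(directions: str) -> dict:
    """Compute length, step count and average step length (in words)."""
    # Assumption: each non-empty line of the directions text is one step.
    steps = [line.strip() for line in directions.splitlines() if line.strip()]
    # Assumption: length is measured in whitespace-separated words.
    total_words = sum(len(step.split()) for step in steps)
    n_steps = len(steps)
    avg = total_words / n_steps if n_steps else 0.0
    return {
        "length_of_directions": total_words,
        "number_of_steps": n_steps,
        "average_length_of_directions": avg,
    }


# Toy directions text, made up for illustration.
example = "Preheat oven to 350 degrees F.\nBake chicken for 30 minutes.\n"
features = direction_features(example)
```

Under these assumptions the average length is simply the total word count divided by the number of steps.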
Analysis on Directions
Tool: Stanford Parser, Python
Steps:
● Keep tokens tagged as verbs: VB (including VBD, VBG, VBN, VBP, VBZ)
● Ignore uninformative verbs, e.g. cook, use, take
● Combine similar verbs related to cooking methods
● Calculate total count of verbs
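The steps above can be sketched in Python roughly as follows. This is an illustration, not the authors' script: it assumes the directions have already been POS-tagged (for example by the Stanford Parser) into (word, tag) pairs, and the `CANONICAL` mapping used to merge similar verbs is hypothetical.

```python
from collections import Counter

# Verb tags kept in the analysis (Penn Treebank tag set).
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
# Verbs the slides list as uninformative about the cooking method.
IGNORED_VERBS = {"cook", "use", "take"}
# Hypothetical mapping that merges variants into one cooking method.
CANONICAL = {"boiling": "boil", "baked": "bake", "sauteed": "saute"}


def count_cooking_verbs(tagged_tokens):
    """Count verbs after filtering by tag, merging variants, dropping noise."""
    counts = Counter()
    for word, tag in tagged_tokens:
        if tag not in VERB_TAGS:
            continue  # keep verbs only
        verb = word.lower()
        verb = CANONICAL.get(verb, verb)  # combine similar verbs
        if verb in IGNORED_VERBS:
            continue  # skip verbs that say nothing about the method
        counts[verb] += 1
    return counts


# Toy tagged input, made up for illustration.
tagged = [("Bake", "VB"), ("chicken", "NN"), ("Use", "VB"),
          ("baked", "VBN"), ("Boiling", "VBG"), ("water", "NN")]
verb_counts = count_cooking_verbs(tagged)
```

Calling `verb_counts.most_common(20)` would then give a ranked list comparable to a "top 20 verbs" chart.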
Most Popular Cooking Methods
● Top 1: “Bake”
● Healthy Cooking Methods: e.g. “Simmer” (108), “Saute” (93), “Boiling” (56)
● Unhealthy Cooking Methods: e.g. “Grill” (37), “Fry” (25), “Burn” (7)
Top 20 Verbs of Cooking
Top 30 Verbs
Analysis on Ingredients
Tool: Weka
Steps:
● Split each ingredient string into n-grams (1- to 4-grams)
● Output word count
● Delete meaningless words or phrases
● Calculate total word frequency
● Drill-down analysis
● Associations within ingredients
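The n-gram splitting and counting steps were done in Weka. As a plain-Python stand-in (not the Weka workflow itself), a minimal sketch might look like this, assuming simple whitespace tokenization and lowercase matching; the helper names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=4):
    """Yield every contiguous n-gram with n_min <= n <= n_max as a string."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def ngram_counts(ingredient_lines):
    """Count 1- to 4-grams over all ingredient lines."""
    counts = Counter()
    for line in ingredient_lines:
        tokens = line.lower().split()  # assumption: whitespace tokenization
        counts.update(ngrams(tokens))
    return counts


# Toy ingredient lines, made up for illustration.
lines = ["2 tablespoons olive oil", "1 tablespoon olive oil", "3 cloves garlic"]
counts = ngram_counts(lines)
```

Meaningless n-grams such as quantities and units ("2 tablespoons") would still need to be removed afterwards, as the steps describe.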
Word Frequency

Top main ingredients:
ingredient      Freq
onion           384
chicken         328
bell pepper     149
pasta           131
mushroom        97
tomato          95
bean            81
carrot          58
rice            58
potato          54
orange          53
flour           50
pork            49
zucchini        44

Top seasonings:
seasoning       Freq
garlic          425
olive oil       298
black pepper    214
cheese          195
lemon           147
basil           97
ginger          92
vegetable oil   92
wine            92
lemon juice     88
parsley         88
vinegar         80
cilantro        71
cumin           68
● 65% of the top main ingredients are vegetables and fruit, 21% are staple foods and 14% are meat
● Chicken is a healthier source of protein than other kinds of meat (such as beef and pork)
● Chicken breast is about 3 times lower in fat and 25% lower in calories than drumsticks and wings
● Skin and internal organs are the fattiest parts, so 87% of the chicken breast used is skinless and boneless
Drill-Down Analysis
Brown sugar:
● more minerals
● contains vitamin B
● more flavors and textures
Honey isn’t just Sugar!
● energy source
● source of vitamins and minerals
● weight loss
Cheese
● Parmesan cheese has much more calcium than any other cheese, and it is also low in lactose
● 67% of the Cheddar cheese used is reduced-fat
Oil
● Olive oil has benefits for the heart, skin and hair
● Canola oil, which is not healthy for the cardiovascular system, accounts for only 3%
Association
- The overall associations are categorized into 4 groups: chicken, garlic, onion and pasta
- Offer ideas for users to cook healthy food by providing a list of common ingredients
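The slides do not say how the ingredient associations were computed. One common way to look for them is to count how often two ingredients appear in the same recipe and report support and confidence for each pair. The sketch below assumes that approach; the function name, the 0.5 support threshold and the toy recipes are all made up for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_rules(recipes, min_support=0.5):
    """Return {(a, b): (support, confidence of a -> b)} for frequent pairs."""
    n = len(recipes)
    item_counts = Counter()
    pair_counts = Counter()
    for ingredients in recipes:
        items = sorted(set(ingredients))  # one vote per ingredient per recipe
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = {}
    for (a, b), c in pair_counts.items():
        support = c / n  # share of recipes containing both ingredients
        if support >= min_support:
            rules[(a, b)] = (support, c / item_counts[a])
            rules[(b, a)] = (support, c / item_counts[b])
    return rules


# Toy recipes, made up for illustration (not taken from the dataset).
recipes = [
    ["chicken", "garlic", "onion"],
    ["chicken", "garlic"],
    ["pasta", "garlic", "onion"],
    ["chicken", "onion"],
]
rules = pair_rules(recipes)
```

Grouping the resulting rules by their left-hand ingredient would give lists of commonly paired ingredients for chicken, garlic, onion or pasta.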
Data Mining - Data Preprocessing (target variable)
● Number of reviews is a continuous variable
● Most recipes have a number of reviews in the range [0, 600]
● Outliers exist
Solution: Divide the variable evenly into three categories
Distribution of Number of reviews
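"Divide the variable evenly into three categories" could mean equal-width or equal-frequency bins; the slides do not say which. Because the distribution has outliers, the sketch below assumes equal-frequency bins (tertiles). The function name and the category labels are hypothetical.

```python
def tertile_labels(values):
    """Label each value 'low', 'medium' or 'high' by equal-frequency thirds."""
    n = len(values)
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(n), key=lambda i: values[i])
    labels = [None] * n
    for rank, i in enumerate(order):
        bucket = min(rank * 3 // n, 2)  # 0, 1 or 2
        labels[i] = ("low", "medium", "high")[bucket]
    return labels


# Toy review counts, made up for illustration.
reviews = [5, 1200, 40, 300, 12, 90]
labels = tertile_labels(reviews)
```

With this ranking-based split an outlier such as 1200 simply lands in the top category instead of stretching the bin boundaries. Tied values can fall into different categories, which a real analysis would need to handle.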
Variables that are strongly correlated with each other:
● Number of ingredients & length of directions
● Ready time & Cooking time
● Length of directions & Average length of directions
● Calories & fat
● Calories & carbohydrate
● Calories & protein
● Protein & Cholesterol
Solution:
● Remove the following input variables: cooking time, length of directions, calories
Data Mining - Data Preprocessing (Correlation between Numeric variables)
[Figure: Average Length vs. Ready Time (no correlation); Fat vs. Calories (strong correlation)]
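A correlation screen of this kind can be sketched with the Pearson coefficient. The slides do not state which coefficient or cutoff was used, so the 0.8 threshold below is an assumption, and the column values are toy numbers made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def strongly_correlated_pairs(columns, threshold=0.8):
    """List (name_a, name_b, r) for every pair with |r| >= threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:  # assumed cutoff for "strong"
                pairs.append((a, b, r))
    return pairs


# Toy data, not taken from the recipe dataset.
columns = {
    "fat": [5, 10, 15, 20, 25],
    "calories": [150, 240, 330, 400, 510],
    "ready_time": [30, 20, 45, 25, 40],
}
pairs = strongly_correlated_pairs(columns)
```

For each flagged pair, one of the two variables can then be dropped from the model inputs, as was done with calories.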
Data Mining
Neural Networks (3 layers):
Decision tree (C5.0):
Top 5 important predictors:
● Recipe photo (Y/N)
● Video (Y/N)
● Sodium/mg
● Ready Time/min
● Carbohydrate/mg
* Results in R are very similar
Recommendations
To attract more website traffic, healthy main dish recipes should have:
● video/photos
● sodium: 200 ~ 800 mg
● # of ingredients: 7-12
● <80-minute ready time
● avg. length of directions: 20~40 words
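The recommendations can be turned into a simple checklist for a new recipe. The sketch below is only an illustration: the dictionary keys and the function name are hypothetical, range endpoints are treated as inclusive, and "video/photos" is read as "at least one of the two", which the slides do not spell out.

```python
def check_recipe(recipe):
    """Return the list of recommendations the recipe does not meet."""
    issues = []
    # Assumption: having either a photo or a video is enough.
    if not (recipe["has_photo"] or recipe["has_video"]):
        issues.append("add a photo or video")
    if not 200 <= recipe["sodium_mg"] <= 800:
        issues.append("sodium should be 200 to 800 mg")
    if not 7 <= recipe["n_ingredients"] <= 12:
        issues.append("use 7 to 12 ingredients")
    if not recipe["ready_time_min"] < 80:
        issues.append("ready time should be under 80 minutes")
    if not 20 <= recipe["avg_direction_words"] <= 40:
        issues.append("average direction length should be 20 to 40 words")
    return issues


# Toy recipe, made up for illustration.
recipe = {
    "has_photo": True,
    "has_video": False,
    "sodium_mg": 450,
    "n_ingredients": 9,
    "ready_time_min": 95,
    "avg_direction_words": 25,
}
issues = check_recipe(recipe)
```

An empty list means the recipe meets every recommendation on the slide.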