overnourishment & undernourishment data analysis

Post on 29-Jun-2020


Overnourishment & Undernourishment Data Analysis

Sevval Boylu Selin Eyupoglu Sena Necla Cetin

What causes malnutrition?

● Overnourishment and undernourishment: both major health issues and considered malnutrition

● Aim: To find out factors that trigger both problems.

● Period: 1999-2013

● Region: USA and Africa → the two extremes

We know that obesity is one of the critical health issues that cause a wide range of diseases, whereas hunger and undernutrition are detrimental to human development in African countries. Thus, a correlation between the two could point to a set of efficient, mediating solutions.

Undernourishment Data

Year  Prevalence (%)
1991  27.6
1992  27.3
1993  27.2
1994  27.1
1995  26.8
1996  26.6
1997  26.4
1998  26.2
1999  25.9
2000  25.7
2001  25.4
2002  25.1
2003  24.7
2004  24.2
2005  23.5
2006  22.7
2007  22.3
2008  21.9
2009  21.6
2010  21.1
2011  20.7
2012  20.2
2013  20.0
2014  19.8
2015  19.8

Undernourishment Graphs

Overnourishment Data

Period  Overweight (%)  Obese (%)  Extremely Obese (%)
1999    34.0            30.5       4.7
2001    35.1            30.5       5.1
2003    34.1            32.2       4.8
2005    32.6            34.3       5.9
2007    34.3            33.7       5.7
2009    33.0            35.7       6.3
2011    33.6            34.9       6.4
2013    32.5            37.7       7.7

Overnourishment Data (obesity + extreme obesity)

Period  Obesity (%)
1999    35.2
2001    35.6
2003    37.0
2005    40.2
2007    39.4
2009    42.0
2011    41.3
2013    45.4
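The combined column is simply the sum of the "Obese" and "Extremely Obese" columns for each period. A minimal sketch of that step, assuming the values are held in plain dictionaries (the variable names are illustrative, not from the slides):

```python
# Obese and extremely obese prevalence (%) per survey period, from the table above.
obese = {1999: 30.5, 2001: 30.5, 2003: 32.2, 2005: 34.3,
         2007: 33.7, 2009: 35.7, 2011: 34.9, 2013: 37.7}
extremely_obese = {1999: 4.7, 2001: 5.1, 2003: 4.8, 2005: 5.9,
                   2007: 5.7, 2009: 6.3, 2011: 6.4, 2013: 7.7}

# Combined obesity = obese + extremely obese, rounded to one decimal place
# to avoid floating-point noise such as 45.400000000000006.
combined = {year: round(obese[year] + extremely_obese[year], 1) for year in obese}

for year, value in combined.items():
    print(year, value)
```

Rounding to one decimal keeps the result in the same precision as the source columns.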

Overnourishment Graphs (obesity + extreme obesity)

The Correlation Matrix (before filling in missing data)


Filling in Missing Data

● Filled missing data points between 1999 and 2013 by using means of the previous and next data points

● Eliminated data before and after these years
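The fill rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper name `fill_with_neighbor_means` is hypothetical, and the input is the odd-year combined obesity series. The values it produces follow from applying the rule to that series and are not copied from the slides' own filled-in table.

```python
# Combined obesity prevalence (%), available only for odd years in 1999-2013.
known = {1999: 35.2, 2001: 35.6, 2003: 37.0, 2005: 40.2,
         2007: 39.4, 2009: 42.0, 2011: 41.3, 2013: 45.4}


def fill_with_neighbor_means(series, start, end):
    """Return a year -> value dict covering start..end (inclusive).

    A missing year is filled with the mean of the previous and next
    known years; years outside start..end are dropped.
    """
    filled = {}
    for year in range(start, end + 1):
        if year in series:
            filled[year] = series[year]
        elif year - 1 in series and year + 1 in series:
            filled[year] = round((series[year - 1] + series[year + 1]) / 2, 2)
        else:
            filled[year] = None  # a neighbor is missing, so the gap stays
    return filled


filled = fill_with_neighbor_means(known, 1999, 2013)
for year, value in filled.items():
    print(year, value)
```

Restricting the range to 1999-2013 also implements the second bullet: years before and after the period never enter the result.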

Filling in Missing Obesity Data Using Means

Year  Obesity After  Obesity Before
1991  NaN            NaN
1992  NaN            NaN
1993  NaN            NaN
1994  NaN            NaN
1995  NaN            NaN
1996  NaN            NaN
1997  NaN            NaN
1998  NaN            NaN
2000  35.40          35.40
2001  35.60          NaN
2002  36.10          35.40
2003  37.00          NaN
2004  37.90          35.85
2005  40.20          NaN
2006  38.20          36.30
2007  39.40          NaN
2009  42.00          NaN
2010  40.35          38.60
2011  41.30          NaN
2012  43.70          38.05
2013  45.40          NaN
2014  NaN            NaN
2015  NaN            NaN

Spearman’s and Pearson’s Correlation

● Temperature (Celsius)
● GDP_USA (American dollar)
● Agriculture_Africa (percentage of agriculture in African economy)
● Africa_GDP (American dollar)

● Our null hypothesis: ‘There is at least one pair of variables which is statistically insignificant.’
● Our goal was to eliminate one of the potential factors. However, all pairwise p-values were below 0.05, so we consider all pairs statistically significant.

Spearman’s Correlation (1999-2013 only)

Strong negative correlation, makes sense

This does not suggest dependence between the two variables: the USA is almost always developing and undernourishment in Africa is mostly decreasing, so there is no sensible one-to-one correlation between the two.

● Temperature is poorly correlated with the other variables, which also makes sense.

● It is not a good indicator of obesity in the USA or undernourishment in Africa, but we still kept it as a variable in our model because its correlations were around -0.5 or 0.5.

Pearson’s Correlation


Pearson vs. Spearman

Pearson’s correlation evaluates the linear relationship between two continuous variables, where the rate of change is constant, whereas Spearman’s correlation evaluates monotonic relationships. Our data was therefore evaluated better with Spearman’s correlation.

We did not know in advance what the relationships were or what kind of function they resemble. Thus we chose Spearman’s because it is more comprehensive.

Moreover, in real life situations we don’t expect to have a linear relationship between the variables.
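To make the difference concrete, here is a small standard-library sketch of both coefficients. The function names are illustrative; Spearman’s coefficient is computed as Pearson’s coefficient on the ranks, which is its standard definition. The first example uses the odd-year obesity and undernourishment values from the tables; the second uses a made-up monotonic but non-linear series.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """Ranks starting at 1 for the smallest value; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Odd years 1999-2013, taken from the tables in this deck.
obesity = [35.2, 35.6, 37.0, 40.2, 39.4, 42.0, 41.3, 45.4]
undernourishment = [25.9, 25.4, 24.7, 23.5, 22.3, 21.6, 20.7, 20.0]
print("Pearson :", round(pearson(obesity, undernourishment), 3))
print("Spearman:", round(spearman(obesity, undernourishment), 3))

# Illustrative monotonic but non-linear data (not from the slides).
x = [1, 2, 3, 4, 5]
cubes = [v ** 3 for v in x]
print("Pearson :", round(pearson(x, cubes), 3))
print("Spearman:", round(spearman(x, cubes), 3))
```

On the cubic series Spearman’s coefficient is 1 because the ordering is perfectly preserved, while Pearson’s coefficient is below 1 because the relationship is not a straight line.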

Obesity Data Prediction

Undernourishment Prediction

Decision Tree Regressor Model

● Specifically used Regressor functions because our data is numerical.

● 0.909 accuracy score for predicting Undernourishment percentage

● 0.86 accuracy score for Obesity in USA

● Mean ≈ 0.88
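The slides call these values accuracy scores. If they were obtained with scikit-learn's `score` method on a regressor (an assumption, since the slides do not say which function was used), the number is the coefficient of determination R², not a classification accuracy. A minimal sketch of how R² is computed, using the 2010-2013 undernourishment values from the table and hypothetical predictions invented for illustration:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Undernourishment prevalence (%) for 2010-2013, from the data table.
actual = [21.1, 20.7, 20.2, 20.0]
# Hypothetical model predictions, made up for this example only.
predicted = [21.0, 20.9, 20.2, 19.9]

score = r_squared(actual, predicted)
print(round(score, 3))
```

An R² of 1 means the predictions match the data exactly, and 0 means the model does no better than always predicting the mean.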

Decision Tree Regressor Model

● We also tried adding Undernourishment as a predictor to see whether all columns together make a good model and give a proper estimate of Obesity values. We reached the highest accuracy score for Obesity with this model (0.86). This does not suggest that Obesity and Undernourishment percentages are strongly dependent.

● Our conclusion: With the Decision Tree for Regression model, Obesity and Undernourishment values work well together. This may also suggest that the model could predict future values better.

Training Our Model with Random Forest

We found an accuracy score of 0.77, which is not as good as the Decision Tree's. However, it can still be considered successful in estimation.

Logistic Regression

Lastly, we tried Logistic Regression as a machine learning technique and found F1 scores of 0.86 and 0.8.

Although high precision and low recall is preferable, our model gives the opposite. In fact, all of the recall values are 1.

The F1 scores suggest that our model predicts with an average F1 of 0.82.
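Because F1 is the harmonic mean of precision and recall, a recall of exactly 1 pins down the precision behind each reported F1 score. The sketch below works that out. The function names are illustrative, and the implied precisions are derived from the formula rather than reported in the slides.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def precision_given_recall_one(f1):
    """Solve F1 = 2P / (P + 1) for P, which holds when recall is 1."""
    return f1 / (2 - f1)


for reported_f1 in (0.86, 0.8):
    implied_precision = precision_given_recall_one(reported_f1)
    # Round-trip check: the implied precision reproduces the reported F1.
    assert abs(f1_score(implied_precision, 1.0) - reported_f1) < 1e-9
    print(reported_f1, round(implied_precision, 3))
```

Under that assumption the implied precisions are roughly 0.75 and 0.67, which is consistent with the observation that the model favors recall over precision.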

Conclusion

In the early stages of our project, our aim was to suggest a single factor that possibly causes both problems. However, giving single predictors to the models didn’t yield good results (i.e. high accuracy scores). When we included all the predictors that we suggested, the accuracy score got higher. Thus we conclude that our model with 4 predictors and 2 targets gives useful results.

Comparing the scores of several machine learning techniques, we think Decision Tree for Regression is the best model for our data because it has the highest accuracy score (0.88).

Failures

● We failed to build a proper decision tree with nodes and leaves.

● We spent too much time trying to fill the missing obesity data (1991-1998) and ended up dropping these NaN values.
