
Romanian Statistical Review nr. 4 / 2019

ROMANIAN STATISTICAL REVIEW www.revistadestatistica.ro

CONTENTS 4/2019

IMPACT OF REMITTANCES ON THE COUNTRY OF ORIGIN. MULTIDIMENSIONAL ANALYSIS AT MACRO AND MICROECONOMIC LEVEL. CASE STUDY ROMANIA AND MOLDOVA
Valentina Vasile, Professor dr.
Institute of National Economy, Romanian Academy
Elena Bunduchi, Teaching Assistant drd.
University of Medicine, Pharmacy, Sciences and Technology of Tîrgu Mureş, Romania
Ștefan Daniel, Associate Professor dr.
Călin-Adrian Comes, Associate Professor dr.

ESTIMATION OF NUMBER OF PERSONS PER HOUSEHOLD BASED ON CHARACTERISTICS OF CONSUMPTION ITEMS - UTILIZATION OF BIG-DATA TO IMPROVE THE CONSUMPTION TREND INDEX IN JAPAN -
Anri Mutoh
National Statistics Center, Japan
Masayo Yamashita
Yoshiyasu Tamura
Masahiro Matsumoto

R TOOLS FOR ILOSTAT: RILOSTAT AND SMART
M. Villarreal-Fuentes
Department of Statistics, International Labour Organization (ILO)
S. Ding

MACROECONOMIC STATISTICAL FORECASTING FOR ENGINE DEMAND
Ankit Kamboj
Cummins Technologies India Pvt. Ltd, Pune, India
Debojyoti Samadder
Ambica Rajagopal
Sarat Sindhu Mukhopadhyay

UNDERSTANDING PATTERNS IN THE CONSUMPTION OF AGRO-FOOD PRODUCTS IN ROMANIA - AN ANALYSIS AT REGIONAL LEVEL
Andreea MIRICĂ, PhD. Assistant Lecturer, Bucharest University of Economic Studies
Roxana-Ionela GLĂVAN, PhD. Assistant Lecturer, Bucharest University of Economic Studies
Iulia Elena TOMA, PhD. Candidate, Bucharest University of Economic Studies
Lucian PĂTRAȘCU, PhD., Bucharest University of Economic Studies

Impact of remittances on the country of origin. Multidimensional analysis at macro and microeconomic level. Case study Romania and Moldova

Valentina Vasile, Professor dr. Institute of National Economy, Romanian Academy

Elena Bunduchi, Teaching Assistant drd. University of Medicine, Pharmacy, Sciences and Technology of Tîrgu Mureş, Romania

Ștefan Daniel, Associate Professor dr. University of Medicine, Pharmacy, Sciences and Technology of Tîrgu Mureş, Romania

Călin-Adrian Comes, Associate Professor dr. University of Medicine, Pharmacy, Sciences and Technology of Tîrgu Mureş, Romania

ABSTRACT

This research investigates the impact of remittances, from the perspective of the country of origin, on economic growth at the macro level and at the micro level of the household in Romania and Moldova. We carried out a comparative analysis because of the importance of these external financial flows to the economy. Although the share of remittances in the GDP of the two states differs due to the level of economic development, constantly increasing labor migration is a common characteristic. We applied time-series regression models using the tseries package in R. The expected results are to highlight the indicators influenced by remittances in Romania compared to Moldova at macro and microeconomic level, as well as the type and intensity of the impact generated. This research demonstrates that remittance-based economic growth is unsustainable and highlights the long-term negative impact of these financial flows on the country of origin.

Key words: Remittances, Time-Series Models, R packages, Romania, Moldova

JEL Classification: F24, C22, O52


INTRODUCTION

Researchers' opinions are divided regarding the impact of migration and remittances on the country of origin. Some consider that remittances generate economic growth (Meyer et al, 2017; Matuzeviciute et al, 2016; Imai et al, 2014), others say there is no connection between the two variables (Lim et al, 2015; Barajas et al, 2009), and a third category of experts argue that these flows have a negative impact (Lartey et al, 2008).

The first group considers that remittances contribute to a better allocation of resources in the country of origin, thus stimulating aggregate demand for goods and services by increasing productivity generated by consumption and investment (Kumar et al, 2018). Others argue that remittances contribute to increased income and productivity by reducing the unemployment rate in the country of origin as a result of the mobility of the unemployed (Boboc et al, 2012).

The third group, however, sees remittances as a factor stimulating the entry into the home market of imports that substitute for domestic products (Javed et al, 2017). Moreover, consumption of imported products is higher than "indigenous consumption" of similar products in these countries (Bayar, 2015).

As a result of inequality in resource distribution, employment opportunities and income levels, migration and remittances can act as mechanisms for adjusting labor resource flows between countries of origin and destination. On the one hand, migration and remittances are the consequence of the failure of national policy in the country of origin to meet individual needs in terms of decent employment opportunities and labor income (Bunduchi et al, 2019). On the other hand, remittances can be a tool for supporting economic policy in the development process by enhancing demand for consumption and / or stimulating entrepreneurship. The economic and social impact of remittances for countries of origin is significantly positive, at least from the perspective of the beneficiary households. The level of poverty, inequality and the structure of households' expenditures are some of the channels through which the flow of migration transfers reveals its effects on growth and on economic and social development.

The free movement of people and the opening of the labor market (globalization and the need to cover the demographic deficit in developed countries with an aging population) have stimulated labor mobility for the working population in less developed countries. Mobility for work and emigration aimed both at improving the worker's employment status (mainly in terms of labor income) and at financially supporting the household in the country of origin through remittances. Statistical data show that the relationship between people working abroad and remittances received in the country of origin is not homogeneous and / or balanced. There are countries of origin with a large share of labor migrants in the population and a significant share of remittances in GDP (such as Moldova), and countries with a large number of migrant workers and a low share of remittances in GDP (such as Romania). As an indicator, the average remittance gives a distorted picture because: a) not all labor migrants remit money to the family left behind; b) not all migrants choose official channels, so some remittances remain unregistered; c) the amount remitted varies widely, being determined by the level of earnings, the cost of living in the host country, the mobility model (alone or with the family) and the amount actually transferred through official channels. In addition, the pattern of remittances in terms of amount, period and frequency is strongly influenced by the individual's occupational and human development plan and mobility expectations (post-repatriation, naturalization in the destination country, long-term and very long-term mobility, continuing the search for the optimal mobility solution in terms of income and / or profession, associated with a later decision on utility, etc.). In view of the potentially diversified impact of labor mobility, with economic, social, behavioral and other effects, in this research we try to identify the impact of remittances received in terms of the importance of their total volume as a share of GDP. We examine whether there is a link between the level of development of the recipient country and the impact of remittance flows on economic growth, as origin countries are facing significant mobility.

CONSIDERATIONS IN THE LITERATURE ON THE IMPACT OF REMITTANCES IN THE COUNTRY OF ORIGIN

So far, there has been substantial research on the importance of remittances for the country of origin, addressing each field affected by remittances. Below we have compiled a synthesis of the most recent research results.


Recent research on the impact of remittances on the country of origin

Table 1

| Authors | Database | Empirical approach | Research findings |
|---|---|---|---|
| Economic development | | | |
| (Eggoh, Bangake, & Semedo, 2019) | 49 developing countries | Panel Smooth Transition Regression | Remittances have a positive impact on the level of economic development. |
| (Fromentin, 2017) | 102 developing countries | Pooled Mean Group | In the short term, the study finds that remittances have a positive impact on financial development (except for low-income countries). In the long run, the assumption is that households receiving remittances from abroad are more likely to use official financial services for their transactions and payments. |
| (Meyer & Shera, 2017) | Albania, Bulgaria, Macedonia, R. Moldova, Romania and Bosnia Herzegovina | OLS with fixed effects | A positive relationship between remittances and GDP growth in the countries studied. |
| (Imai et al., 2014) | 24 Asian Pacific countries | Panel vector autoregressive model | Remittances generate economic growth in the analyzed countries and contribute to poverty reduction. |
| (Giuliano & Ruiz-Arranz, 2009) | 100 developing countries | Generalized method of moments | Remittances contribute to GDP growth in the analyzed countries. |
| Labor market | | | |
| (Vadean, Randazzo, & Piracha, 2019) | Tajikistan | 3SLS | Remittances lead to a reduction in the number of employees, in favor of the self-employed, especially in agriculture. At the same time, they generate small-scale family investments, which could have positive household effects, without effects at national level. |
| (Azizi, 2018) | 122 developing countries | Dynamic panel data with fixed effects | Remittances generate a reduction in women's participation in work but do not affect men. |
| (Boboc et al., 2012) | Romania | Risk assessment for each mobility profile | Migration and remittances have positive effects on the reduction of unemployment and generate a reduction in employment. |
| (Leon-Ledesma & Piracha, 2004) | Central and Eastern Europe | Panel data with fixed effects; Generalized method of moments | Remittance inflows positively influence the employment rate of the population in the country of origin as a result of investing these flows in the development of entrepreneurship. |
| Consumption | | | |
| (Beaton et al., 2017) | Latin America and the Caribbean | Dynamic panel data | Remittances contribute to increased consumption, especially as a result of facilitating access to funding sources. |
| (Lim & Simmons, 2015) | CARICOM Member States except the Bahamas and Montserrat | Cointegration tests of panel data | No link was found between remittances and GDP per capita, but remittances had a positive influence on consumption, which means that remittances are directed towards consumption rather than productive investments. |
| (Medina & Cardona, 2010) | Colombia | Panel data model | Remittances had no impact on current consumption, but a positive influence on the living standards of the beneficiary households was observed. |
| Health | | | |
| … | … countries | Dynamic panel data with fixed effects | Households receiving remittances register increases in health expenditure. At the same time, the mortality rate decreases as remittances increase. |
| (Jr, Cuecuecha, & Tlaxcala, 2013) | Ghana | Two-stage multinomial selection model | Remittances cause an increase in health expenditure. |
| (Zhunio, Vishwasrao, & Chiang, 2012) | 69 developed and developing countries | GLS with side effects | Remittance-receiving households experience an increase in life expectancy and an improvement in living standards. |
| Education | | | |
| … | … countries | Dynamic panel data with fixed effects | Remittances help increase school enrollment in both public and private institutions, as well as the graduation rate. |
| (Ambler, Aycinena, & Yang, 2015) | El Salvador | Panel data with fixed effects | For each $1 of remittances received by beneficiary households, education spending increased by $3.72. |
| (Jr et al., 2013) | Ghana | Two-stage multinomial selection model | Remittances increase spending on education. |
| (Zhunio et al., 2012) | 69 developed and developing countries | GLS with side effects | Remittances in households increased the tuition rate. |
| (Adams & Cuecuecha, 2010) | Guatemala | Two-stage multinomial model | Households receiving remittances recorded much higher spending on education compared to the period when they did not benefit from such financial resources. |

Research has, therefore, shown that remittances are an important

source of income, especially in poor households and the main directions

of spending are improving living conditions, current consumption, health

expenditure and small investment in housing. An important part of remittances

goes to the education of children, especially as education provides a greater

degree of opportunity to have higher labor income.

METHODOLOGY

The present research uses an OLS model to analyze the impact of remittances from the perspective of the country of origin, for Romania and Moldova. The dependent variables used in the research are:

a) at macroeconomic level - active population, employed population, employment rate, number of unemployed, unemployment rate, total consumption of the population, imports, trade balance, population savings and entrepreneurship development;

b) at microeconomic level - household consumption expenditure, endowment with durable goods, ICT implementation, schooling, and expenditure on education and health.

In order to analyze the relationships between the variables included in the research, we formulated the following economic hypotheses:

• H1 - the presence of a positive correlation between remittances and the unemployment rate;

• H2 - the presence of a negative correlation between remittances and employment indicators;

• H3 - the presence of a … correlation between remittances and imports;

• H4 - the presence of a positive correlation between remittances and household expenditure indicators;

• H5 - the presence of a … correlation between remittances and the schooling rate.

Using the OLS model, we will identify whether there is a direct and statistically significant relationship between remittance flows and the dependent variables by estimating several equations.

The general model has the following form:

$$Y_i = \beta_0 + \beta_1 X_i + e_i \qquad (1)$$

where:

$Y_i$ represents the dependent variable,

$X_i$ represents the independent variable - remittances,

$\beta_0$ is a parameter that shows the mean value of the variable Y when the independent variable X is equal to 0,

$\beta_1$ represents the slope and shows the mean change in the dependent variable Y for a one-unit absolute change in the variable X,

$e_i$ is the residual variable.

The parameters of the OLS model are estimated using the statistical software R and its lm() function.
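To make model (1) concrete, the following is a minimal sketch of the closed-form OLS estimator for a single regressor, written in plain Python rather than with the R lm() function the paper uses. The function name `ols_simple` and the sample numbers are hypothetical and serve only as illustration; they are not data or code from the study.

```python
import math


def ols_simple(x, y):
    """Fit y = b0 + b1 * x by ordinary least squares (closed form).

    Returns the intercept, slope, R-squared, and the standard error
    and t-statistic of the slope.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have the same length, at least 3")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of squares and cross-products around the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x has no variation; the slope is undefined")

    b1 = sxy / sxx               # slope (beta_1 in equation 1)
    b0 = mean_y - b1 * mean_x    # intercept (beta_0 in equation 1)

    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    ssr = sum(e ** 2 for e in residuals)          # residual sum of squares
    sst = sum((yi - mean_y) ** 2 for yi in y)     # total sum of squares
    r2 = 1.0 - ssr / sst if sst > 0 else float("nan")

    # Standard error of the slope with n - 2 degrees of freedom.
    se_b1 = math.sqrt(ssr / (n - 2) / sxx)
    t_b1 = b1 / se_b1 if se_b1 > 0 else float("inf")

    return {"b0": b0, "b1": b1, "r2": r2, "se_b1": se_b1, "t_b1": t_b1}


# Made-up illustrative numbers, NOT data from the study.
remittances = [1.0, 2.0, 3.0, 4.0, 5.0]
consumption = [2.1, 3.9, 6.2, 7.8, 10.1]
fit = ols_simple(remittances, consumption)
print(fit)
```

The sign and t-statistic of the slope are what a hypothesis such as H1 or H2 would be checked against, while the R-squared value indicates how much of the variation in the dependent variable the regressor accounts for.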

DATA

The databases used in the research are those provided by the National Institute of Statistics of Romania, the National Bureau of Statistics of the Republic of Moldova, the World Bank, the National Bank of Romania and the National Bank of Moldova. The analysis period is 1997-2017.

We decided to carry out a comparative analysis due to the importance of these external financial flows for any economy and especially for the least developed economies, as previously demonstrated in the relevant research literature. Although the share of remittances in the GDP of the two countries differs due to the level of economic development and the prevailing pattern of remittances, the increasing number of labor migrants is a common feature. Since the purpose of the research is to highlight the impact of remittances on economic growth, and the main motivation for migration in the two countries is to supplement the incomes of households in the country of origin, we can consider the two countries homogeneous in terms of how received remittances are directed to consumption at household level.


Migrants' share in the total population of the origin country in 1995-2017, %

Chart 1

Source: Author's calculations based on World Bank data. Available: http://www.worldbank.org/en/topic/migrationremittancesdiasporaissues/brief/migration-remittances-data.

The free movement of people and the opening of the labor market (globalization and the need to cover the demographic deficit in developed countries with an aging population) have stimulated the mobility of the working population from less developed countries, such as Romania and Moldova. Thus, the number of those who left has increased considerably from year to year, reaching in 2017 20% of the total population in Moldova and 15% of that in Romania (Chart 1).

Share of remittances in GDP in 1995-2017, %

Chart 2

Source: Author's calculations based on World Bank data. Available: http://www.worldbank.org/en/topic/migrationremittancesdiasporaissues/brief/migration-remittances-data. Retrieved on 17.04.2019


The rise in the number of migrant workers increased remittances in these two countries, and implicitly their share in GDP. Remittances are an important source of external financial flows, which generate changes at both macroeconomic and household level (Chart 2).

RESULTS AND DISCUSSIONS

Remittances are the expected outcome of migration to supplement revenue, generating a series of effects at country level and at household / individual level.

At the level of the country of origin, it is stated that remittances generate significant positive effects on the labor market, reducing the imbalances registered in the form of a high unemployment rate (Boboc, Vasile, and Todose, 2012). Our tests give different results for Romania and the Republic of Moldova for the period 1996-2017.

Remittances impact on labor market indicators in Romania and Moldova

Chart 3

The results indicate that remittances have a stronger influence on the labor market indicators in Moldova than in Romania, as shown by the R² values obtained. For 2017, this is explained by the share of remittances in GDP, which is almost 8 times higher in Moldova than in Romania (16.1% compared to 2.1%), and by the population's greater involvement in migration (the share of labor migrants in the total population is more than 1.5 times higher in the Republic of Moldova, 29% compared to only 19% in Romania).
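For reference, the R² values compared here can be read through the standard coefficient of determination for model (1). The symbols $\hat{Y}_i$ (fitted value) and $\bar{Y}$ (sample mean) are standard notation introduced for this sketch and do not appear in the paper.

```latex
R^2 = 1 - \frac{\sum_i \left(Y_i - \hat{Y}_i\right)^2}{\sum_i \left(Y_i - \bar{Y}\right)^2},
\qquad
\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i
```

With a single regressor, $R^2$ equals the squared sample correlation between X and Y, so a higher value for Moldova means that remittances account for a larger share of the variation in the labor market indicator there.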


Remittances exert a statistically insignificant influence on the unemployment rate and only a modest one on the number of unemployed: each 1% increase in remittances reduces the number of unemployed by 0.05% in Romania and by 0.35% in the Republic of Moldova.
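Reading a coefficient as "a change of so many percent for each 1% increase in remittances" corresponds to an elasticity. The text does not state whether the variables entered the regressions in logarithms, so the following is only a sketch under the assumption of a log-log variant of model (1):

```latex
\ln Y_i = \beta_0 + \beta_1 \ln X_i + e_i
\quad\Longrightarrow\quad
\beta_1 = \frac{\partial \ln Y}{\partial \ln X} \approx \frac{\%\Delta Y}{\%\Delta X}
```

Under that assumption, the coefficient of -0.3524 reported in Table 2 for the number of unemployed in Moldova means that a 1% rise in remittances is associated with a fall of roughly 0.35% in the number of unemployed.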

The results suggest that labor migration is not primarily driven by the unemployed, but rather by inactive or even employed people. Moreover, the results could be significant if we analyzed remittances in relation to the underground economy, which employs over 1.2 million Romanians (European Commission, 2017) and accounted for over 22% of Romania's GDP in 2017 (European Commission, 2018) and over 23.2% of Moldova's GDP (BNS, 2018), but data are not available.

With respect to the employment indicators, remittances have the same direction of influence in Romania and in the Republic of Moldova. A 1% increase in remittance inflows in Romania contributes to an average reduction of the active population by 0.02%, of the employed population by 0.012% and of the employment rate by 1.69%. In Moldova the influence of remittances is more noticeable: a 1% increase leads to an average decrease of the active population by 0.13%, of the employed population by 0.11% and of the employment rate by 6.71%. Therefore, migrants leave from employed status, especially in the Republic of Moldova, because the income differential is high and migration responds to the need for additional household income, which cannot be adequately met by employment in the country of origin.

Thus, the negative impact on the employment rate and the lack of a statistically significant influence on the unemployment rate suggest that labor migrants were not only unemployed persons (Vasile et al, 2013; Caragea et al, 2013). While for the unemployed the main reason for mobility is the lack of a job, behind the decision of employed persons from Romania / R. Moldova to migrate and remit are the attractive salaries in the country of destination, precarious working conditions in the country of origin, career opportunities, etc. At the same time, labor mobility motivated by remittances accelerates the aging of the active population and raises the average age in the country of origin, because the persons involved in labor mobility are predominantly young.


Remittances impact on macroeconomic indicators

Chart 4

Remittances generate considerable effects on consumption growth. With an increase of remittances by 1%, the total consumption of households increases on average by 0.328% in Romania and 0.357% in R. Moldova (Chart 4). In the absence of detailed data on the origin of consumption (imported or indigenous), the similar evolution of total consumption and imports can be used as a proxy, so we analyze the impact of remittances on imports during the period 1995-2017.

With an increase of remittances by 1%, imports increased on average by 0.39% in Romania and 0.33% in Moldova. Therefore, Romanians consume more imported products than Moldovans, and national consumption of consumer goods seems to be better supported by the demand associated with remittance spending in Moldova than in Romania. This can be explained by the lack of supermarket chains in the Republic of Moldova, in contrast to Romania. If households had a consumption model based predominantly on goods and services of national origin, consumption would have contributed to the development of the local and national business environment, and implicitly to economic growth. However, significant import-intensive consumption has negative effects on both the balance of payments and the economy. At the same time, the increase in substitution imports has an adverse effect on demand for indigenous products, which indirectly and negatively affects the employment rate (Castles, 2010).
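As a hedged illustration of how the reported coefficients scale beyond a 1% change, the sketch below applies the consumption and import coefficients from Table 2 to a hypothetical 10% rise in remittances. It assumes the coefficients are constant elasticities from a log-log specification, which the paper does not state explicitly; the helper name `pct_change_from_elasticity` and the 10% scenario are invented for the example.

```python
def pct_change_from_elasticity(elasticity, remittance_pct_change):
    """Percentage change in Y implied by a constant elasticity.

    Under ln(Y) = b0 + elasticity * ln(X), scaling X by a factor k
    scales Y by k ** elasticity.
    """
    if remittance_pct_change <= -100.0:
        raise ValueError("remittances cannot fall by 100% or more")
    factor = (1.0 + remittance_pct_change / 100.0) ** elasticity
    return (factor - 1.0) * 100.0


# Coefficients as reported in Table 2 of the paper; treating them as
# elasticities is an assumption of this sketch.
ELASTICITIES = {
    ("Romania", "total consumption"): 0.3285,
    ("Moldova", "total consumption"): 0.3572,
    ("Romania", "imports"): 0.3974,
    ("Moldova", "imports"): 0.3272,
}

SCENARIO_PCT = 10.0  # hypothetical rise in remittances, not from the paper

for (country, indicator), beta in ELASTICITIES.items():
    change = pct_change_from_elasticity(beta, SCENARIO_PCT)
    print(f"{country:8s} {indicator:18s} {change:5.2f}%")
```

For small changes the result is close to the simple product of the elasticity and the percentage change, which is the reading used in the text. The exact formula matters only for larger shocks.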


Remittances impact on household expenditures

Graph 5

We note that Romanians and Moldovans tend to consume more as remittances increase. Between 1996 and 2017, both consumption and remittances in Romania had an upward trend, with remittances explaining 84% of the variation in current consumption expenditure. An increase of remittances in the household budget by 1% allowed current consumption expenditure to grow by 0.5%. A positive influence of remittances on current consumption expenditure is also registered in the Republic of Moldova, where it increased on average by 0.73% for each 1% rise in remittances in the period 2006-2017.

An increase in the endowment with durable goods can also be observed, but the impact is smaller. On average, a 1% rise in remittances leads to an increase of only 0.09% in the stock of such goods in Romania, while in Moldova we find no link between the two variables. This can be explained, on the one hand, by the fact that some people no longer consider returning home, and the remittances received by the parents remaining in the country are spent on health, education or current consumption. On the other hand, we may be witnessing a plateau in the stock of durable goods, which is natural for a household that receives medium and long-term remittances from multi-annual migration. In the case of the Republic of Moldova, a further explanation is that remittances are directed mainly to the consumption of current goods and services, in order to improve the current standard of living.

Another category of household spending influenced by remittances, according to Azizi (2018) and Ratha (2013), is health expenditure. In the analyzed period, there is an increase in health care expenditure in Romania and Moldova, which can also be attributed to remittances in beneficiary households. Thus, 84.73% and 34.35% of the variation in health expenditure is explained by the change in remittance inflows in Romania and Moldova, respectively (in the case of Moldova the result must be treated as more limited because a shorter data series, 2006-2017, was used). From an economic point of view, the increase in health expenditure is positively associated with the motivation of migrants to remit. Both for single-member households, usually involved in short-term or medium-term mobility with a possibility of return, and for multiannual and / or permanent migrants who have left their parents or other family members at home, an important purpose of remittances is to cover the costs of improving quality of life and of health care services. A 1% rise in remittances facilitates, on average, an increase in the amounts spent on health of over 0.67% in Romanian households and of 0.91% in Moldovan households. So we can argue that the increase in households' net disposable income due to remittances contributes to the quality of life.

Education expenditure is another category of spending that is important for the quality of life of the population and, indirectly, for the economic benefits of the country of origin. The migration phenomenon and the remittance decision also have implications for education, both positive and negative. On the one hand, there is the amount that the family is willing to spend on the education of their children in order to reach a certain level of education. On the other hand, education expenditure is influenced by the number of students who decide to attend high school and / or university / postgraduate studies in the country. In a family that receives remittances, net available income increases, with a positive impact on the resources available for the children's studies. However, rising spending on education can reflect two situations:

- studies in the country of origin, with positive effects on the development of human capital, the financing of educational institutions and the likelihood of young graduates being integrated into their home country;

- studies abroad, which have a negative impact on the development of the education system by reducing the demand for initial education, but also on the economic and social development of the origin countries, if graduates take up employment abroad.

Studying abroad creates the possibility of integration into the labor market of the country of destination, decreasing human capital in Romania and Moldova respectively. In addition, the state will not be able to recover the amounts it invested in those students' primary or high school education, where applicable. The same negative effect also arises when the migration of a household member is followed by family reunification in the country of destination, through the migration of children who have completed all or part of their compulsory education, financed by the state.

The results obtained confirm, for the cases of Romania and Moldova, the findings of Adams et al (2010) and Ambler et al (2015), according to which remittances in the country of origin contribute to increasing household spending on education. As a result of a 1% increase in remittances, household spending on education in Romania increased on average by 0.275%, while in the Republic of Moldova the impact is higher, this expenditure increasing on average by 1.68%. The variation in remittances explains 63% of this evolution of expenditure in Romania and 47% in the Republic of Moldova.

In addition, the stronger effect of remittances on household spending on education in Moldova compared to Romania can also be explained by the following:

- remittances are used in the Republic of Moldova more for financing compulsory secondary education than for tertiary education, which is optional. In addition, the enrollment rate in tertiary education is lower in Moldova than in Romania, partly because of the shared language of the two countries: some future students prefer to pursue university studies in Romania rather than in Moldova, for reasons of quality and / or different opportunities that are more attractive for employment after graduation;

- the cost of completing compulsory education borne by the household is significantly higher in Moldova than in Romania;

- the intention to migrate after completing compulsory education is higher among Moldovan youth than among Romanians, the potential income differential being higher for the medium and low skilled jobs to which labor / graduate migrants have access in destination countries.

At the same time, the increase in remittances results in a drop in the school population of 0.04% in Romania and 0.07% in Moldova, in contrast to the results obtained by Zhunio et al (2012) and Azizi (2018), who studied the effects of remittances in underdeveloped and developing countries. Our results are, on the other hand, in line with those obtained by Amuedo-Dorantes et al (2010) and Mckenzie et al (2006), who analyzed the Dominican Republic and Mexico, countries with an average level of economic development. The results can be explained by the differences in the level of economic development of the analyzed states, the dynamics of integration in the EU space, the free movement facilities between Romania and Moldova, as well as Romania's policy of support for the development of Moldova (scholarships for Moldovan students, aid for R. Moldova from public funds in Romania, etc.). Although Moldova is not a member of the European Union, the large number of Moldovans with Romanian citizenship leads to the same migration behavior and preference for the EU space. The reduction in the number of students may also be generated by an emerging trend among young people whose family members have not been mobile to abandon further studies in favor of migration, which is presented as generating financial resources for them and their family members.

The synthesis of the research results confirms hypotheses H2, H3, H4 and H5, highlights the specificities of the development conditions at national level and the stage reached in economic performance and social inclusion, and justifies analyzing the impact of remittances on the country of origin at both macroeconomic and microeconomic level.

Synthesis of the results of the analysis of the effect of remittances on economic variables in Romania and Moldova, 1995-2017

Table 2.

| Dependent variables | Romania: macroeconomic | Romania: microeconomic | Moldova: macroeconomic | Moldova: microeconomic |
|---|---|---|---|---|
| Active population | -0.0224 *** | x | -0.1282 *** | x |
| Employed | -0.0126 ** | x | -0.1137 *** | x |
| Employment rate | -1.6954 *** | x | -6.7123 *** | x |
| Unemployed | -0.0488 *** | x | -0.3524 *** | x |
| Unemployment rate | - | - | - | - |
| Total population consumption | 0.3285 *** | x | 0.3572 *** | x |
| Import | 0.3974 *** | x | 0.3272 *** | x |
| Trade balance | -0.0001 | x | -3.561 | x |
| Current consumption expenditure | x | 0.51373 *** | x | 0.7345 |
| Endowment of durable goods | x | 0.09468 *** | - | - |
| Implementation of ICT services (Internet access) | x | 13.59 * | x | 9.709 *** |
| Health expenditure | x | 0.67551 *** | x | 0.9064 |
| Education expenditure | x | 0.27555 *** | x | 1.6798 ** |
| Enrollment rate | x | -0.0426 *** | x | -0.0782 *** |

Thus, following the comparative analysis carried out in Romania and the
Republic of Moldova on the impact of remittances on the country of origin,
we can see that the influence generated by these external financial flows differs
according to the variables included in the research (Table 2), as follows:
- a positive influence on household savings in Romania and on the
implementation of ICT products and services, with a direct impact at micro
level and an indirect impact at macroeconomic level;
- a strong influence with a negative impact on the employment rate of
the population, with a direct impact at macroeconomic level and an indirect
impact at microeconomic level;
- a moderate influence with a positive impact on total consumption
of the population and on imports, with a direct impact at macroeconomic level
and an indirect impact at microeconomic level; and on current consumption
expenditure, health and education, with a direct impact at micro level and an
indirect impact at macroeconomic level;
- a weak influence with a negative impact on the active and employed
population and on the enrollment rate, with a direct impact at macro level and
an indirect impact at microeconomic level;
- a lack of significant influence on the unemployment rate, which
indicates that labor mobility draws mainly on the employed and only
marginally on the unemployed in the country of origin, and points to the
potential impact of the underground economy.

At the same time, we note the lack of any statistically significant
influence of remittances on the development of entrepreneurship for the entire
analyzed period.


CONCLUSIONS

Remittances are the result of labor mobility and mainly emerge as

a motivation for migration for categories of low and middle-class people in

economically less developed and emerging middle-income countries.

Remittances have both macroeconomic and microeconomic effects,
through the effects they produce and through the destination of these amounts.
The often-divergent results reported in the literature on migration effects
raise the question of which specific causes and/or conditions generate such
conflicting findings. In the present research stage, we have tested the impact
of remittances on two former socialist countries, one of them an EU member
since 2007, with a low (Romania) and a high (Republic of Moldova) level of
remittance flows as a share of GDP.

At the macroeconomic level, remittances balance the labor market
by reducing the number of unemployed, which helps reduce the demand for
social services, but they also have a negative influence on the size of the
employed population. External labor mobility is a much more attractive option
for young people in training. This assessment is also confirmed by the
declining number of school and college students in both countries, given the
possibility of mobility for studies and/or work, which generates potential
human capital losses for the country of origin and a total or partial loss
of public investment in education.

At the same time, remittance inflows stimulate consumption and,
through multi-annual employment abroad, lead to the emergence of a more
expensive consumption pattern that favors imports. In Romania and Moldova,
the trend of consumption follows that of imports, which negatively
affects the balance of payments and domestic production. Besides creating
macroeconomic imbalances, this means that the initiatives taken by private
entrepreneurs to differentiate the supply of goods and services are constrained
by competition from imported substitute products priced below those of
domestic entrepreneurs. Internal market competition is necessary and beneficial
in the medium and long term, as it supports the increasing competitiveness
of domestic products. However, a pattern of current consumption based
predominantly on imported substitute products, explained not by clear
qualitative differences but rather by small price differences or mere
preferences, does not help the development of indigenous companies,
which should be supported through public policy. Such policies should also
stimulate entrepreneurship among people belonging to households with migrant
workers, encouraging their return and the development of businesses in
Romania and Moldova.


In this way, the benefits at the micro level can materialize in the
employment of graduates in the country of origin, the return of migrant workers
and the start-up of entrepreneurial businesses, rising living standards
in households, better health of household members, and the possibility
of raising the level of education and promoting the continuous training of
active people in the household and/or youth, etc. At macroeconomic level, the
following benefits may arise: the development of the business environment and
the increase of the working-age population, the stimulation of consumption of
indigenous products and services, tax revenues on production and consumption,
the reduction of pressure for aid and social assistance for poor households,
and the development of the health and education sectors through demand
for quality services (including preventive health segments and, respectively,
continuing tertiary education and lifelong learning/specialization). In addition
to these direct benefits, we can identify and develop opportunities to spend
remittance savings on complementary purchases: cultural consumption,
increased access to ICT goods and services, recreational activities, housing
construction (holiday homes), etc.

A limitation of the research with respect to the analyzed period 1997-2017
is that some indicators are available only for shorter periods: for Romania,
the value of household savings and the share of households with Internet and
computer access are available only for 2007-2017, and for the Republic of
Moldova, household spending types are available only from 2006 to 2017.

This research is exploratory, which is why we have selected only
Romania (high share of international labor mobility and low share of
remittances in GDP) and Moldova (high share of international labor mobility
and high share of remittances in GDP). Our further research will include the
former communist countries from Europe and Asia (former USSR countries
and the COMECON area), which, after the transition to a market economy
and extensive economic restructuring, faced strong labor migration, driven
mainly by differences in earnings and working conditions relative to the
country of origin. In many cases the lack of decent employment opportunities
also explains the propensity to move towards more developed countries. We
will aim to highlight the extent to which a typology of the impact of remittances
on the country of origin can be developed for the former communist space.

References
1. Adams, R. H., & Cuecuecha, A., 2010, Remittances, Household Expenditure and Investment in Guatemala. World Development, 38(11), 1626–1641. https://doi.org/10.1016/J.WORLDDEV.2010.03.003
2. Ambler, K., Aycinena, D., & Yang, D., 2015, Channeling Remittances to Education: A Field Experiment among Migrants from El Salvador. American Economic Journal: Applied Economics, 7(2), 207–232. https://doi.org/10.1257/app.20140010
3. Amuedo-Dorantes, C., & Pozo, S., 2010, Accounting for Remittance and Migration Effects on Children’s Schooling. World Development, 38(12), 1747–1759. https://doi.org/10.1016/J.WORLDDEV.2010.05.008
4. Azizi, S., 2018, The impacts of workers’ remittances on human capital and labor supply in developing countries. Economic Modelling, 75, 377–396. https://doi.org/10.1016/J.ECONMOD.2018.07.011
5. Barajas, A., Chami, R., Fullenkamp, C., Gapen, M., & Montiel, P., 2009, Do Workers’ Remittances Promote Economic Growth? Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.600.6354&rep=rep1&type=pdf
6. Bayar, Y., n.d., Economic Insights-Trends and Challenges Impact of Remittances on the Economic Growth in the Transitional Economies of the European Union. Retrieved from http://www.upg-bulletin-se.ro/archive/2015-3/1.Bayar.pdf
7. Beaton, K., Cerovic, S., Galdamez, M., Hadzi-Vaskov, M., Loyola, F., Koczan, Z., … Wong, J., 2017, Migration and Remittances in Latin America and the Caribbean: Engines of Growth and Macroeconomic Stabilizers? In IMF Working Papers (Vol. 17). https://doi.org/10.5089/9781484303641.001
8. Boboc, C., Vasile, V., & Todose, D., 2012, Vulnerabilities Associated to Migration Trajectories from Romania to EU Countries. Procedia - Social and Behavioral Sciences, 62, 352–359. https://doi.org/10.1016/J.SBSPRO.2012.09.056
9. Bunduchi, E., Vasile, V., Comes, C.-A., & Stefan, D., 2019, Macroeconomic determinants of remittances: evidence from Romania. Applied Economics, 51(35), 3876–3889. https://doi.org/10.1080/00036846.2019.1584386
10. Caragea, N., Dobre, A. M., & Alexandru, A. C., 2013, Profile Of Migrants In Romania – A Statistical Analysis Using "R"; Working Papers. Retrieved from https://ideas.repec.org/p/eub/wpaper/2013-04.html
11. Castles, S., 2010, Understanding Global Migration: A Social Transformation Perspective. Journal of Ethnic and Migration Studies, 36(10), 1565–1586. https://doi.org/10.1080/1369183X.2010.489381
12. Eggoh, J., Bangake, C., & Semedo, G., 2019, Do remittances spur economic growth? Evidence from developing countries. The Journal of International Trade & Economic Development, 1–28. https://doi.org/10.1080/09638199.2019.1568522
13. European Commission, 2017, Country Report Romania 2017. Retrieved from https://ec.europa.eu/info/sites/info/files/2017-european-semester-country-report-romania-en.pdf
14. European Commission, 2018, Country Report Romania 2018. Retrieved from https://ec.europa.eu/info/sites/info/files/2018-european-semester-country-report-romania-en.pdf
15. Fromentin, V., 2017, The long-run and short-run impacts of remittances on financial development in developing countries. Quarterly Review of Economics and Finance, 66, 192–201. https://doi.org/10.1016/j.qref.2017.02.006
16. Giannetti, M., Federici, D., & Raitano, M., 2009, Migrant Remittances and Inequality in Central-Eastern Europe. International Review of Applied Economics, 23(3), 289–307. https://doi.org/10.1080/02692170902811710
17. Giuliano, P., & Ruiz-Arranz, M., 2009, Remittances, financial development, and growth. Journal of Development Economics, 90(1), 144–152. https://doi.org/10.1016/j.jdeveco.2008.10.005
18. Imai, K. S., Gaiha, R., Ali, A., & Kaicker, N., 2014, Remittances, growth and poverty: NEW evidence from Asian countries. Journal of Policy Modeling, 36(3), 524–538. https://doi.org/10.1016/j.jpolmod.2014.01.009
19. Javed, M., Awan, M. S., & Waqas, M., 2017, International Migration, Remittances Inflow and Household Welfare: An Intra Village Comparison from Pakistan. Social Indicators Research, 130(2), 779–797. https://doi.org/10.1007/s11205-015-1199-8
20. Jr, R. H. A., Cuecuecha, A., & Tlaxcala, E. C. De., 2013, The Impact of Remittances on Investment and Poverty in Ghana. World Development, 50, 24–40. https://doi.org/10.1016/j.worlddev.2013.04.009
21. Kumar, R. R., Stauvermann, P. J., Kumar, N. N., & Shahzad, S. J. H., 2018, Revisiting the threshold effect of remittances on total factor productivity growth in South Asia: a study of Bangladesh and India. Applied Economics, 50(26), 2860–2877. https://doi.org/10.1080/00036846.2017.1412074
22. Lartey, E. K. K., Mandelman, F., & Acosta, P. A., 2008, Remittances, Exchange Rate Regimes, and the Dutch Disease: A Panel Data Analysis. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.1109206
23. Leon-Ledesma, M., & Piracha, M., 2004, International Migration and the Role of Remittances in Eastern Europe. International Migration, 42(4), 65–83. https://doi.org/10.1111/j.0020-7985.2004.00295.x
24. Lim, S., & Simmons, W. O., 2015, Do remittances promote economic growth in the Caribbean Community and Common Market? Journal of Economics and Business, 77, 42–59. https://doi.org/10.1016/j.jeconbus.2014.09.001
25. Matuzeviciute, K., & Butkus, M., 2016, Remittances, Development Level, and Long-Run Economic Growth. Economies, 4(4), 28. https://doi.org/10.3390/economies4040028
26. Mckenzie, D., Rapoport, H., Bauer, T., Hanson, G., Jouneau, F., Licandro, O., & Lopez, E., 2006, Can migration reduce educational attainment? Evidence from Mexico (No. 3952). Retrieved from http://siteresources.worldbank.org/DEC/Resources/Can_Migration_reduce_Educational_Attainment.pdf
27. Medina, C., & Cardona, L., 2010, The Effects of Remittances on Household Consumption, Education Attendance and Living Standards: the Case of Colombia. In Lecturas de Economía (Vol. 72). Retrieved from http://aprendeenlinea.udea.edu.co/revistas/index.php/lecturasdeeconomia/article/viewFile/6498/5960
28. Meyer, D., & Shera, A., 2017, The impact of remittances on economic growth: An econometric model. EconomiA, 18(2), 147–155. https://doi.org/10.1016/J.ECON.2016.06.001
29. Ratha, D., 2013, The Impact of Remittances on Economic Growth and Poverty Reduction. Retrieved from www.knomad.org/powerpoints/
30. Vadean, F., Randazzo, T., & Piracha, M., 2019, Remittances, Labour Supply and Activity of Household Members Left-Behind. Journal of Development Studies, 55(2), 278–293. https://doi.org/10.1080/00220388.2017.1404031
31. Vasile, V., Boboc, C., Pisica, S., & Cramarenco, R. S., 2013, The estimation of the impact of free movement of Romanian workers in EU region from 01.01.2014; realities and trends from economic, employment, and social perspectives, at national and European level, Study no 3 / SPOS. Retrieved from www.ier.ro
32. Zhunio, M. C., Vishwasrao, S., & Chiang, E. P., 2012, The influence of remittances on education and health outcomes: a cross country study. Applied Economics, 44(35), 4605–4616. https://doi.org/10.1080/00036846.2011.593499


Estimation of Number of Persons Per Household Based on Characteristics of Consumption Items - Utilization of Big-data to Improve the Consumption Trend Index in Japan -

Anri Mutoh ([email protected])


Masayo Yamashita ([email protected])


Yoshiyasu Tamura ([email protected])


Masahiro Matsumoto ([email protected])


ABSTRACT

The article suggests the possibility of utilizing big-data held by companies
by integrating it with the data of official statistics. Official statistics agencies
in Japan have sought to develop a Consumption Trend Index (CTI) in cooperation
with academic researchers and with companies as providers of big-data. One of
the important roles of the CTI is to indicate the trend of one-person household
consumption more accurately; the big-data is therefore expected to reinforce
existing official micro-data, especially for one-person households. However, the
obtainable big-data seldom includes the number of household members, so this
missing value needs to be imputed. We therefore estimate the number of members
in each household from the characteristics of consumption items in the
traditional Japanese household expenditure survey. We used logistic regression
with an L1 penalty (Lasso regression) for the analysis, with each type of
household as the response variable and purchase items as the explanatory
variables. As a result, one-person households and two-or-more-person households
can be identified by their purchasing tendencies, and the characteristics of
each household type become evident.

Keywords: Consumption trend, household accounts, statistical imputation,

logistic regression, LASSO, R package ‘glmnet’

JEL classification: D13, D16, D90, P44, Z13

1. INTRODUCTION

1.1 Big-data for Consumption Trend Index

We researched the utilization of big-data for official statistics.
Since 2017, in Japan, the Statistics Bureau, Ministry of Internal Affairs
and Communications, the Statistical Research and Training Institute, and the
National Statistics Center have been researching the development of a
novel Consumption Trend Index (CTI) in cooperation with professors and with
commercial companies as the data holders (Statistics Bureau, 2017, 2018a).
The CTI is an index that enables consumption trends to be grasped quickly and
comprehensively. There are two types of CTI, one at macro level (CTI Macro)
and one at micro level (CTI Micro) (The Consumer Statistics Division of
Statistics Bureau, 2018). The CTI Macro provides an early estimate of the
monthly trend in the Household Final Consumption Expenditure of GDP. In
contrast, the CTI Micro indicates the monthly trend in household average
expenditure by major consumption items. In order to further improve the CTI,
our research group plans to utilize big-data held by companies as part of the
input data of the CTI. In this paper we deal in particular with the fusion of
big-data and the source data of CTI Micro.

The utilization of big-data as a source of CTI Micro is expected to
reflect the consumption tendency of one-person households more accurately.
In Japan, even though one-person households account for about one third of the
population (Statistics Bureau, 2018b), it is difficult to survey them in the
Family Income and Expenditure Survey (FIES). One piece of published evidence
for this difficulty, although somewhat old, is the paradata research on the
FIES by Hamasuna (1980). According to that research, one-person households
tended to be absent and required many revisits for the survey. This tendency
is considered to have persisted into the 2010s. To deal with this difficulty,
the source of CTI Micro consists of the Single Household Expenditure Monitor
Survey in addition to the FIES and the Survey of Household Economy. The
big-data becomes a source of information to reinforce these surveys.


1.2 Details and issues of big-data for the CTI Micro

Data from loyalty programs and from online personal finance software
are considered usable big-data for the CTI Micro. Their advantages are 1)
that enormous amounts of data can be obtained automatically and instantly, 2)
that the data items correspond to a part of the consumption items in the FIES
(namely, they form a proper subset), and 3) that the data consist of samples
whose unit is a user of a loyalty program or of personal finance software.

These big-data include the user’s individual age and sex; however, they
rarely include household information, so the number of household members
of each sample is unclear. Since the input data of CTI Micro consist of
samples whose unit is the household, the big-data need imputation of the
missing value, namely the number of household members. As mentioned above,
it is important but difficult to survey one-person households for the FIES;
thus at least the data on one-person households have to be identified and used.

1.3 Purpose

The purpose of this paper is to estimate the number of persons per
household from the consumption items and to clarify which consumption items
characterize each household type, in order to impute the big-data and to
suggest the possibility of its utilization for the CTI Micro.

2. RELATED STUDY

2.1 Big-data and Official Statistics

Research on economic and social systems using big-data has increased
in recent years (Japec et al., 2015). The same trend is seen in official
statistics. Struijs et al. (2014) noted that opportunities for collaboration
between official statistics agencies, businesses and universities have increased
in association with big-data research at National Statistical Institutes (in the
Netherlands), and reviewed the issues and challenges of using big-data in
official statistics.

Research on Consumer Price Indices (CPI) is especially active
among the studies using big-data for official statistics. For example, the
Office for National Statistics (in the UK) has published several articles that
estimated an experimental CPI using web-scraped data, and in the Federal
Statistical Office (in Germany) 10,000 prices on the web are collected
automatically per month and used for the harmonized index of consumer prices
(Blaudow and Burg, 2018). However, few studies use big-data as a part of
official micro-data.


2.2 Stochastic regression imputation methods

Statistical imputation is one of the most important fields in official
statistics. In recent years, multiple imputation of missing values has been
commonly used, and a wide variety of software is available for it (Takahashi
and Ito, 2013). In this paper we deal not with multiple imputation but with
stochastic regression imputation, because it is possible to design regression
models for the imputation. Unlike ordinary missing values, the values here are
missing for a fully known reason, and highly reliable reference data are
available in the form of the FIES. Imputation by stochastic regression is
therefore appropriate for the purpose of complementing the structure of the
FIES.

In this paper, we use logistic regression with the L1 norm as the model
for imputation, but there are few previous studies using such a model for
stochastic regression imputation.
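To make the distinction concrete, the following is a minimal sketch of stochastic versus deterministic imputation of a binary household indicator from fitted logistic probabilities. The linear predictor values, the function names and the 0.5 threshold are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def one_person_probability(linear_predictor):
    """Inverse logit: map a fitted linear predictor to P(one-person household)."""
    return 1.0 / (1.0 + math.exp(-linear_predictor))


def impute_stochastic(probabilities, rng):
    """Draw each missing indicator at random with its fitted probability."""
    return [1 if rng.random() < p else 0 for p in probabilities]


def impute_deterministic(probabilities, threshold=0.5):
    """Assign the more likely class; shown only for comparison."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical linear predictors for four big-data users (not from the paper).
scores = [-3.0, -0.2, 0.1, 2.5]
probs = [one_person_probability(s) for s in scores]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
stochastic = impute_stochastic(probs, rng)
deterministic = impute_deterministic(probs)
```

The stochastic draw keeps the imputed share of one-person households close to the average fitted probability, whereas thresholding pushes every borderline case to the same side.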

3. METHODOLOGY

3.1 FIES data

The data for analysis were retrieved from the January 2010 FIES
conducted in Japan. The FIES had two types of survey, one for one-person
households and one for two-or-more-person households. There was a total of 700
one-person households, along with approximately 7,800 two-or-more-person
households. Although the two types of survey were different, their contents
were almost the same, comprising the demographic characteristics of the
householder and family members, and the purchased items as represented by
price amount or frequency.

We take the classes of the response variable to be one-person households,
two-person households, three-person households, and four-or-more-person
households, because about 90 percent of the two-or-more-person households
consisted of two to four persons. Five-person households accounted for only
9 percent of the two-or-more-person households (see Table 1), and they
show little difference from the four-person households in terms of the total
spent.

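Number and percentage of each type of household in the FIES data
Table 1

| Household type | Number | Percentage of two-or-more-person households |
|---|---|---|
| one person | 700 | |
| two-or-more persons | 7801 | 100% |
| two persons | 3165 | 41% |
| three persons | 2019 | 26% |
| four persons | 1719 | 22% |
| five persons | 676 | 9% |
| six persons | 182 | 2% |
| seven-or-more persons | 40 | 1% |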
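As a quick check on how the four response classes follow from Table 1, the sketch below recomputes the shares from the published counts. Only the counts come from the table; the variable names are illustrative.

```python
# Household counts among two-or-more-person households, as listed in Table 1.
counts = {
    "two": 3165,
    "three": 2019,
    "four": 1719,
    "five": 676,
    "six": 182,
    "seven_or_more": 40,
}

total = sum(counts.values())
shares = {size: n / total for size, n in counts.items()}

# Share of two-or-more-person households with two to four members.
share_two_to_four = shares["two"] + shares["three"] + shares["four"]

# Size of the merged "four-or-more-person" response class.
four_or_more = sum(counts[k] for k in ("four", "five", "six", "seven_or_more"))
```

The counts sum to the 7801 two-or-more-person households in the table, and the merged four-or-more-person class contains 2617 households.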


Although there is a positive correlation between the number of household
members and the total amount spent, the histograms of total spending by
household size overlap considerably. Figure 1 shows the histograms and density
plots with a uniform number of households. We therefore seek to identify the
items that overlap less across household types.

Histogram and density plot of total spent per household

Figure 1 [image omitted]

3.2 Lasso regression

The FIES data contain almost 600 purchase items as explanatory
variables, yet the actual observations contain many zero values. Therefore,
we conducted the regression analysis proposed by Tibshirani (1996), so-called
Lasso regression. It performs variable selection and minimization of the
prediction error simultaneously by adding the L1 norm as a penalty. Since the
L1 penalty causes some of the parameters to be estimated as exactly zero, the
variables for the regression can be selected automatically. Let
$y = (y_1, \dots, y_n)^{\top}$ be the response vector and $X = (x_{ij})$ an
$n \times p$ matrix of the explanatory variables, respectively, which together
give the data matrix. The problem thus takes the form of eqn [1]:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t \qquad [1]$$

Here $t \ge 0$ is a tuning parameter. We are considering standardized data,
and hence we omit the intercept $\beta_0$. The sum of the absolute values of
the regression coefficients is the L1 penalty. With $\lambda$ as the
regularization parameter obtained by the method of Lagrange’s undetermined
multipliers, the Lasso regression model is defined as eqn [2]:

$$\hat{\beta} = \arg\min_{\beta} \Bigl\{ \sum_{i=1}^{n} \Bigl( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \Bigr\} \qquad [2]$$

Since our aim here is to estimate the type of household, we use the
logit as the link function. The analysis environment is R 3.4.4 with
the package ‘glmnet’ ver. 2.0-16. The estimation algorithm for the Lasso
regression in this package is coordinate descent, which updates the
coefficients one at a time while the others are held fixed and repeats these
updates until convergence (Friedman et al., 2010). The coefficient of the L1
norm (lambda) was determined by 10-fold cross-validation on the
misclassification error: we selected the largest lambda whose misclassification
error is within one standard error of the minimum.
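The coordinate-wise update can be illustrated with a small, self-contained sketch for the squared-error Lasso, using the 1/(2n) loss scaling of Friedman et al. (2010). This is not the glmnet implementation, and it covers only the linear case rather than the logistic model used in the paper; the function names and the toy data are assumptions made for illustration.

```python
def soft_threshold(z, gamma):
    """Shrink z towards zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * sum(|b_j|) by cyclic updates."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta, kept up to date after each update
    col_scale = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_scale[j] == 0.0:
                continue
            # Covariance of column j with the residual that excludes column j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / col_scale[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


# Toy orthogonal design (hypothetical data): y = 2 * x1 + 0.1 * x2.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.1, -1.9, 1.9, -2.1]

beta_unpenalized = lasso_coordinate_descent(X, y, lam=0.0)
beta_lasso = lasso_coordinate_descent(X, y, lam=0.5)
```

With the penalty switched on, the weak second coefficient is set exactly to zero while the strong first coefficient is only shrunk, which is the variable-selection behavior the section describes.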

3.3 Data preprocessing

As data preprocessing, we first extracted the purchased items common
to one-person households and two-or-more-person households. Next, we
calculated the correlation between each pair of items and listed those pairs
with a correlation coefficient over 0.7. The reason for this preprocessing is
that variable selection by Lasso regression becomes unstable when explanatory
variables are highly correlated. We tried the rank correlation as well as the
linear correlation, but the linear correlation identified more items to
summarize than the rank correlation. There were 100 pairs of highly
correlated items, all of which have class and subclass relationships. For
example, the pair {‘Raw meat’, ‘Beef’} has a correlation coefficient of 0.726;
here ‘Raw meat’ is a larger class that includes ‘Beef’. For pairs in such a
hierarchical relationship, the subclass items are omitted for efficient
modeling. If both the class and the subclass behave similarly, it is
reasonable to keep the larger class, which also reflects the other subclasses.

After the data preprocessing, the data still contained almost 500
purchased items as explanatory variables. This suggests that no class could
fully explain the features of all of its subclasses. Purchased items
represented by the price amount and those represented by the frequency are
processed in the same way.
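The pruning step can be sketched as follows. The sketch assumes that the class and subclass hierarchy is known in advance as a mapping, whereas the paper reports that the highly correlated pairs turned out to be class and subclass pairs. The item values are invented for illustration; only the 0.7 threshold and the item name pair ‘Raw meat’ and ‘Beef’ come from the text.

```python
import math


def pearson(a, b):
    """Linear (Pearson) correlation of two equally long numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (z - mean_b) for x, z in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((z - mean_b) ** 2 for z in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def drop_correlated_subclasses(columns, parent_of, threshold=0.7):
    """Drop a subclass item when it correlates above threshold with its class."""
    dropped = set()
    for child, parent in parent_of.items():
        if child in columns and parent in columns:
            if pearson(columns[child], columns[parent]) > threshold:
                dropped.add(child)
    return {name: values for name, values in columns.items() if name not in dropped}


# Hypothetical spending per household for three items (not FIES data).
columns = {
    "Raw meat": [10, 20, 30, 40, 50],
    "Beef": [4, 9, 16, 18, 26],
    "Pork": [6, 1, 9, 2, 5],
}
parent_of = {"Beef": "Raw meat", "Pork": "Raw meat"}  # assumed hierarchy

kept = drop_correlated_subclasses(columns, parent_of)
```

In this toy example ‘Beef’ tracks ‘Raw meat’ closely and is dropped, while ‘Pork’ is retained because its class does not explain it.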

4. RESULTS & DISCUSSION

4.1 Multinomial model

First, we consider a multinomial model in which the response
variable takes the four types of households: one-person, two-person,
three-person and four-or-more-person households. Table 2 shows that the
prediction accuracy of the multinomial model is 0.657, which is poor. A
similar level of accuracy is obtained whether we use data represented by the
price amount or by the frequency. This result suggests that it is difficult to
identify items with less overlap even when using the multinomial model.

The confusion matrix and the prediction accuracy of the multinomial model
Table 2

| actual \ predicted | one-person | two-person | three-person | four-or-more-person |
|---|---|---|---|---|
| one-person | 288 | 409 | 2 | 1 |
| two-person | 65 | 2833 | 173 | 94 |
| three-person | 18 | 990 | 495 | 516 |
| four-or-more-person | 12 | 379 | 253 | 1973 |

accuracy: 0.657
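The reported accuracy can be recomputed from the confusion matrix, and the per-class recall shows where the multinomial model struggles. The sketch below uses only the counts in Table 2; the function names are illustrative.

```python
labels = ["one-person", "two-person", "three-person", "four-or-more-person"]

# Rows are actual classes, columns are predicted classes (Table 2).
confusion = [
    [288, 409, 2, 1],
    [65, 2833, 173, 94],
    [18, 990, 495, 516],
    [12, 379, 253, 1973],
]


def accuracy(matrix):
    """Share of households on the diagonal, i.e. classified correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


def recall_by_class(matrix):
    """For each actual class, the share of its households predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


overall = accuracy(confusion)
recalls = dict(zip(labels, recall_by_class(confusion)))
```

The overall value rounds to 0.657. By these counts, fewer than a quarter of the three-person households and fewer than half of the one-person households are recovered, with most errors falling into the two-person class.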

4.2 Binomial model

As noted in Section 1, we have to identify the data of one-person
households. We therefore consider binomial models whose response variable is
dichotomous, one-person households versus the others, in order to indicate the
items that are simply affected by the purchasing activity of multiple persons.

As a result, all the binomial models have a prediction accuracy over
0.9, and the accuracy is similar whether the price amount or the frequency is
used. Table 3 shows the confusion matrices and prediction accuracies. The
columns of each matrix show the predicted class, while the rows show the
actual class.

The confusion matrix and the prediction accuracy of the binomial model
Table 3

| Binomial model | Actual class | Predicted: multiple-person class | Predicted: one-person | Accuracy |
|---|---|---|---|---|
| two-or-more-person vs one-person | two-or-more-person | 7547 | 254 | 0.942 |
| | one-person | 240 | 460 | |
| three-or-more-person vs one-person | three-or-more-person | 4552 | 84 | 0.969 |
| | one-person | 84 | 616 | |
| four-or-more-person vs one-person | four-or-more-person | 2572 | 45 | 0.972 |
| | one-person | 48 | 652 | |
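Because one-person households are a minority in each binomial data set, it can help to read the accuracies in Table 3 next to a majority-class baseline and the recall for one-person households. The sketch below derives both from the published counts; it is offered as a reading aid and is not a result reported by the authors.

```python
# Rows: actual (multiple-person class, one-person); columns: predicted likewise.
models = {
    "two-or-more vs one-person": [[7547, 254], [240, 460]],
    "three-or-more vs one-person": [[4552, 84], [84, 616]],
    "four-or-more vs one-person": [[2572, 45], [48, 652]],
}


def summarize(matrix):
    """Return accuracy, majority-class baseline and one-person recall."""
    total = sum(sum(row) for row in matrix)
    acc = (matrix[0][0] + matrix[1][1]) / total
    baseline = max(sum(matrix[0]), sum(matrix[1])) / total
    one_person_recall = matrix[1][1] / sum(matrix[1])
    return acc, baseline, one_person_recall


summary = {name: summarize(m) for name, m in models.items()}
```

The recomputed accuracies round to the values in the table. By these counts the two-or-more-person model recovers about two thirds of the one-person households, and the recall rises as two-person and then three-person households are removed from the comparison class.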

Figure 2 shows the lambda coefficient plots and the solution paths of the
one-person versus two-or-more-person binomial model. The lambda coefficient
plots show the cross-validated misclassification error for each lambda;
the solution paths show the coefficients along the lambda sequence. There
are two sets of plots: plot a is for the data represented by the price amount,
and plot b is for the frequency. The solid line in each plot indicates
the lambda that minimizes the misclassification error. The dashed line
indicates the largest lambda whose misclassification error is within one
standard error of the minimum. We select as the optimal lambda the value
indicated by the dashed line, for both the model with the purchase price
amount and the model with the frequency of purchased items. The two models
retained 84 and 114 variables, respectively.
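The selection rule behind the dashed line can be written down in a few lines. The sketch below applies the one-standard-error rule to a hypothetical cross-validation curve; the lambda grid, the error values and the standard errors are invented for illustration and are not the paper's values.

```python
def lambda_min_and_one_se(lambdas, cv_mean, cv_se):
    """Return (lambda at minimum CV error, largest lambda within one SE of it)."""
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    bound = cv_mean[i_min] + cv_se[i_min]
    within = [lambdas[i] for i in range(len(lambdas)) if cv_mean[i] <= bound]
    return lambdas[i_min], max(within)


# Hypothetical 10-fold cross-validation summary (not from the paper).
lambdas = [0.1, 0.05, 0.02, 0.01, 0.005]
cv_mean = [0.080, 0.064, 0.060, 0.058, 0.059]  # mean misclassification error
cv_se = [0.003, 0.003, 0.003, 0.003, 0.003]    # standard error across folds

lam_min, lam_1se = lambda_min_and_one_se(lambdas, cv_mean, cv_se)
```

Choosing the larger of the two values trades a statistically negligible increase in error for a stronger penalty and therefore a sparser model.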

The 10 largest positive and 10 largest negative coefficients of the
one-person versus two-or-more-person binomial models are shown in Tables 4 and
5. Table 4 shows the coefficients for the purchase price amount and Table 5
those for the frequency of the purchased items. The dummy variable for the
response is coded as 0 for two-or-more-person households and 1 for one-person
households. Therefore, a positive coefficient marks an item that characterizes
a one-person household, while a negative coefficient marks an item that
characterizes a multiple-person household.


Lambda coefficient plots and solution paths in the one-person and two-or-more-person binomial model
Figure 2
a. The lambda coefficient plots (left) and the solution paths (right) in the model with purchase price amount as the explanatory variable [image omitted]
b. The lambda coefficient plots (left) and the solution paths (right) in the model with frequency of purchased items as the explanatory variable [image omitted]

In Table 4, the items in third place and lower for the one-person household
have coefficients of less than 0.1. This suggests that the characteristics
identifying a one-person household are less obvious in the purchase price
amount per item. However, focusing on the items with high coefficients,
‘Drinking’ has the largest coefficient in terms of the price amount for a
one-person household, followed by ‘Taxi fares’. This indicates that services
and foods with relatively high unit prices affect the identification.

On the other hand, ‘Pocket money’, ‘Fuel, light & water charges’
and ‘Food’ have large purchase price amount coefficients (in absolute value)
for the two-or-more-person household, and ‘Paper diapers’ and ‘Communication’
also have large coefficients. This indicates that items whose consumption is
proportional to the number of people, and items corresponding to particular
life stages, affect the identification of multi-person households. For
example, the variable ‘Communication’ reflects the tendency for the number of
contracts to increase with the number of household members, since
communication charges are fixed amounts per line and thus proportional to the
number of contract lines.

The coefficients of the binomial model (one-person vs. two-or-more-person household) with the purchase price amount as the explanatory variable

Table 4

Corresponding item (one-person household) | Coefficient | Corresponding item (two-or-more-person household) | Coefficient
Drinking | 0.15 | Meat | -0.91
Taxi fares | 0.11 | Pocket money (Unexplained expenditure) | -0.88
Coffee beverages | 0.08 | Fuel, light & water charges | -0.62
Other admission fees & game charges | 0.07 | Paper diapers | -0.35
Women's nightwear | 0.05 | Gasoline | -0.33
Salad | 0.05 | Food | -0.33
Tea | 0.04 | Communication | -0.32
Other refreshments (Cafe) | 0.03 | Eggs | -0.30
Haircut charges | 0.03 | Oil, fats & seasonings | -0.30
Contact lenses | 0.02 | Soybean products | -0.26

According to Table 5, ‘Rents for dwelling & land’ and ‘Rents for dwelling, issued houses’ have large coefficients in terms of the frequency of items purchased by one-person households. This indicates the low rate of house ownership among one-person households. In the food category, ‘Coffee & cocoa’, ‘Salad’ and ‘Beer’ have the larger coefficients. These are nonessential grocery items with high unit prices in Japan.

On the other hand, for two-or-more-person households, the items with large purchase price amount coefficients show similarly large coefficients for purchase frequency. Daily necessities and items relating to child care contribute most to the identification.

These variables are only a subset of the 84 variables in the model with the purchase price amount and the 114 variables in the model with the frequency of purchased items. This means that at least 84 items are required to obtain the above estimation accuracy. Moreover, the set of variables whose coefficients are estimated to be 0 by Lasso regression is unstable, so simply collecting these 84 variables would not be sufficient. Practical use is still a long way off.


The coefficients of the binomial model (one-person vs. two-or-more-person household) with the frequency of purchased items as the explanatory variable

Table 5

Corresponding item (one-person household) | Coefficient | Corresponding item (two-or-more-person household) | Coefficient
Rents for dwelling & land | 0.23 | Pocket money (Unexplained expenditure) | -1.83
Coffee & cocoa | 0.22 | Meat | -1.46
Hospital charges | 0.21 | Food | -0.84
Cut flowers | 0.19 | Education | -0.82
Rents for dwelling, issued houses | 0.17 | Paper diapers | -0.60
Salad | 0.12 | Eggs | -0.46
Beer | 0.11 | Fish & shellfish | -0.27
"Onigiri" & others (rice ball) | 0.09 | Medical care | -0.26
Obligation fees related to dwelling | 0.07 | Furniture & household utensils | -0.23
Taxi fares | 0.07 | Private transportation | -0.21

4.3 Effect of age

In Tables 3 and 4, the one-person household items ‘Taxi fares’, ‘Cut flowers’, and ‘Hospital charges’ tend to be consumed more by elderly people. This reflects the tendency observed empirically in the FIES.

In fact, the over-65 category accounts for a large percentage of one-person households in the FIES (Statistics Bureau, 2005-2015). Among two-or-more-person households, middle-aged householders account for a large percentage. Table 6 shows the age class of householders in the FIES. The proportion of elderly householders grows over time.

Householder distribution by age class in FIES (data from ‘e-Stat’ provided by Statistics Bureau (2005-2015))

Table 6

Year | One-person household: under 35 | 35-59 | 60 or more | Two-or-more-person household: under 35 | 35-59 | 60 or more
2005 | 26% | 29% | 45% | 9% | 50% | 41%
2010 | 21% | 28% | 51% | 7% | 47% | 45%
2015 | 18% | 27% | 55% | 6% | 42% | 52%


One type of big data that the cooperating companies plan to provide is data from online personal finance software. Utilization of online personal finance software is low among elderly people. Therefore, the age class distribution particularly needs to be adjusted when matching the FIES data to the big data.
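One simple way to think about such an adjustment is post-stratification reweighting by age class. The sketch below is only an illustration of that idea, not the procedure adopted in the CTI project: the target shares are the 2015 one-person household figures from Table 6, while the age mix of the software users (`app_share`) is a made-up assumption.

```r
# Target shares: one-person households in the FIES, 2015 (Table 6).
fies_share <- c(under_35 = 0.18, age_35_59 = 0.27, age_60_plus = 0.55)

# Hypothetical age mix of personal finance software users (assumption).
app_share <- c(under_35 = 0.45, age_35_59 = 0.40, age_60_plus = 0.15)

# Post-stratification weight per age class = target share / observed share.
w <- fies_share / app_share
print(round(w, 2))

# After weighting, the age distribution of the big data matches the FIES.
adjusted <- app_share * w
print(adjusted)
```

Classes that are rare in the big data, here the elderly, receive large weights, so any estimate for them rests on few observations and should be read with caution.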

From the above, it is possible to suggest that age has the potential to affect specific purchased items as strongly as the household type does. Finally, we describe below how the estimation accuracy of multinomial models can be improved by using the demographic items that influence specific purchased items.

4.4 Improvement of prediction accuracy for the multinomial model

The binomial model of one-person and four-or-more-person households has acceptable accuracy, but the multinomial model does not. It is difficult to estimate household size from purchased items because the characteristics of one household size overlap with those of others. For example, some of the items bought by one-person households are also bought by two-or-more-person households. As a potential solution to this problem, we propose using the demographic items that are included in the big data and that have an antagonistic effect on the consumption items. Namely, we attempt to improve prediction accuracy by using not only consumption items with a small degree of overlap among household sizes but also other items that are influenced by the demographic items included in the big data, which are age and sex. Although age appeared in the previous section to be one of the influential demographic items for specific purchased items, in this section we discuss sex because it is distributed equally.

Variable selection here amounts to selecting the particular purchased items that are affected by demographic items. Therefore, a generalized linear mixed model (GLMM) was used for the variable selection, with sex as the response variable and the consumption items that loaded on a single-person household in the multinomial model as the explanatory variables. Here, age is used as the random effect.
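A model of this shape can be sketched as follows. This is a minimal illustration on simulated data, assuming a random intercept for age class and the `glmer` function of the `lme4` package; the variable names, the age bands and the direction of the simulated gender effects are hypothetical and are not taken from the FIES.

```r
# Simulated stand-in for the micro-data (all values are hypothetical).
set.seed(42)
n <- 600
ages <- c("20s", "30s", "40s", "50s", "60s", "70s")
age_class  <- factor(sample(ages, n, replace = TRUE), levels = ages)
age_effect <- setNames(rnorm(length(ages), sd = 0.5), ages)

drinking <- rexp(n)   # purchase amount of one item
apples   <- rexp(n)   # purchase amount of another item

# Response: 1 = female, 0 = male. The two items get opposite-signed effects.
eta    <- -1 * drinking + 1 * apples + age_effect[as.character(age_class)]
female <- rbinom(n, 1, plogis(eta))
d <- data.frame(female, drinking, apples, age_class)

if (requireNamespace("lme4", quietly = TRUE)) {
  # Fixed effects for the consumption items, random intercept for age class.
  fit <- lme4::glmer(female ~ drinking + apples + (1 | age_class),
                     data = d, family = binomial)
  print(lme4::fixef(fit))
}
```

Items whose fixed effects are clearly different from zero are the ones affected by sex; opposite signs correspond to what the text calls antagonistic coefficients.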

As a result of performing the variable selection with the GLMM using the R package ‘lme4’, the items selected by the binomial model (one-person and two-or-more-person) that were significantly affected by gender were ‘Drinking’ and ‘Apples’. Moreover, their coefficients are antagonistic by gender. A high purchase price amount for both Drinking and Apples therefore indicates that multiple individuals purchased the antagonistic products. These results may be useful when the probabilities of one-person and two-person households are similar in the multinomial logistic Lasso regression model that simply estimates the number of people per household.

5. CONCLUSIONS

The purpose of this paper was to estimate household size and then to indicate the consumption items that represent household characteristics, in order to impute the missing information in the provided big data and integrate it into the source of the CTI Micro.

We analyzed the FIES micro-data using logistic Lasso regression analysis. The estimation using the multinomial model, which distinguishes between one-, two-, three-, and four-or-more-person households, does not have good prediction accuracy. In contrast, the binomial model that distinguishes one-person from multiple-person households does have good accuracy. According to the coefficients of the binomial model, one-person households tend to consume high-unit-price nonessential grocery items and services, while four-or-more-person households tend to consume foods and daily necessities corresponding to different life stages.

Though it has been difficult to survey one-person household expenditures, the result implies that it is possible to obtain one-person household consumption data from the big data of loyalty programs and online personal finance software. Moreover, items such as sex and age that are included in the big data and have an antagonistic effect on the consumption items could improve the poor prediction accuracy of the multinomial model.

However, variable selection by Lasso regression is unstable. In future work, we should investigate in detail the relationship between the variables and prediction errors in order to improve stability. We are considering machine learning methods such as decision trees to handle interaction terms and the stability of variables. We should also treat carefully the semi-continuous data used as explanatory variables in sparse estimation. Although few studies have treated semi-continuous data as explanatory variables, such studies are important because almost all consumption item data are semi-continuous.

Official statistics agencies in Japan have summarized and combined official survey data into economic indicators, but they have done little analysis of the data for modeling. This study is rare among them because the FIES, which has so far mostly been used for descriptive statistics, is analyzed with a mathematical model in anticipation of application to the big data. Thus, the CTI project is also meaningful as an attempt to develop official statistics in Japan. Since the FIES has a huge volume of data, and surveying and summarizing are its first priority, it is difficult to identify consistent effects in that data. However, the above analysis suggests the possibility of identifying characteristics that are important for merging the big data and the FIES.

Acknowledgements:

We are grateful to members of the CTI project and our research department for helpful discussions and thoughtful comments. The authors wish to thank the editors and referees for their fruitful suggestions. The views expressed here are those of the authors and not necessarily those of other members of the institute.

References:

1. Blaudow, C., Burg, F., 2018, “Dynamic Pricing as a Challenge for Consumer Price Statistics”, EUROSTAT REVIEW ON NATIONAL ACCOUNTS AND MACROECONOMIC no. 1, 79-93.

2. Breton, R., Flower, T., Mayhew, M., Metcalfe, E., Milliken, N., Payne, C., ... & Woods, A., 2016, “Research indices using web scraped data: May 2016 update”, Newport: Office for National Statistics.

3. Friedman, J., Hastie, T., Tibshirani, R., 2010, “Regularization paths for generalized linear models via coordinate descent”, Journal of Statistical Software, 33(1), 1.

4. Hamasuna, K., 1980, “Current Status of Statistical Survey”, Hosei University Japan Statistics Research Institute report, 05, 18-53. (Japanese only)

5. Japec, L., Kreuter, F., Berg, M., Biemer, P., Decker, P., Lampe, C., ... & Usher, A., 2015, “Big data in survey research: AAPOR task force report”, Public Opinion Quarterly, 79(4), 839-880.

6. [electronic sources] Statistics Bureau, 2005-2015, “Family Income and Expenditure Survey”, one-person household annual data available from: https://www.e-stat.go.jp/stat-search/files?page=1&layout=datalist&toukei=00200561&tstat=000000330001&cycle=7&tclass1=000000330001&tclass2=000000330022&tclass3=000000330023 (Accessed 10.06.2019), two-or-more-person household annual data available from: https://www.e-stat.go.jp/stat-search/files?page=1&layout=datalist&toukei=00200561&tstat=000000330001&cycle=7&tclass1=000000330001&tclass2=000000330004&tclass3=000000330005&result_back=1, e-Stat. Statistics Bureau, Ministry of Internal Affairs and Communications.

7. Statistics Bureau, Ministry of Internal Affairs and Communications, 2017, “Establishment of Consumption Trend Index Research Council”, available from: https://www.stat.go.jp/data/cti/pdf/ho20170728.pdf (Accessed 10.06.2019), Statistics Bureau, Ministry of Internal Affairs and Communications. (Japanese only)

8. [web page] Statistics Bureau, Ministry of Internal Affairs and Communications, 2018a, “‘Statistics for Japan’s Future’ - A Quick Reference”, available from: https://www.stat.go.jp/english/info/guide/2018guide.html#p0201 (Accessed 10.06.2019), Statistics Bureau, Ministry of Internal Affairs and Communications.

9. [web page] Statistics Bureau, Ministry of Internal Affairs and Communications, 2018b, “Telecommunications Annual Report 2018 – Part 1: Sustainable growth by ICT in the era of population decline”, available from: http://www.soumu.go.jp/johotsusintokei/whitepaper/ja/h30/html/nd141110.html (Accessed 10.08.2019), Statistics Bureau, Ministry of Internal Affairs and Communications. (Japanese only)

10. Struijs, P., Braaksma, B., & Daas, P. J., 2014, “Official statistics and big data”, Big Data & Society, 1(1), 2053951714538417.

11. Takahashi, M., Ito, T., 2013, “Multiple imputation of missing values in economic surveys: Comparison of competing algorithms”, In Proceedings of the 59th World Statistics Congress of the International Statistical Institute (ISI). Hong Kong, China, 3240-3245.

12. The Consumer Statistics Division of Statistics Bureau, 2018, “The orientation of developing Consumption trend index (CTI)”, the 8th Consumption Research Council, document No. 2, available from: https://www.stat.go.jp/info/kenkyu/skenkyu/pdf/20190122_02.pdf (Accessed 10.06.2019), The Consumer Statistics Division of Statistics Bureau. (Japanese only)

13. Tibshirani, R., 1996, “Regression shrinkage and selection via the lasso”, Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.


R tools for ILOSTAT: Rilostat and SMART

M. Villarreal-Fuentesᵃ ([email protected])

S. Dingᵇ ([email protected])


ABSTRACT

This article presents Rilostat and SMART, two statistical tools developed by the Department of Statistics of the International Labour Organization (ILO) to facilitate user interaction with ILOSTAT, the largest repository of labour-related indicators. The package Rilostat allows data users around the world to access, extract and manipulate information from ILOSTAT. This document presents a description of the package, including detailed explanations of all its functionalities, examples of reproducible data visualization and a Principal Component Analysis application carried out using information extracted with Rilostat from the Sustainable Development Goals (SDGs) collection available in the database. The Statistics Metadata-driven Analysis and Reporting Tool (SMART) allows National Statistical Offices worldwide to easily generate and automate the production of analytical reports (such as national SDG reporting) defined by means of an SDMX Data Structure Definition (DSD), either from processing micro-level data or from aggregated data by means of transcoding. It is a hybrid application that employs the .NET framework to build the user interface and R as the computational and reporting engine. These two R-based tools for ILOSTAT take advantage of all the benefits of the R software to give ILOSTAT data users simplified access to what they need.

Keywords: Official Statistics, data dissemination, data visualization, analytical reporting automation, GUI programming.

JEL Classification: C81, C88

1 Introduction

The Department of Statistics of the International Labour Organization (ILO) is the focal point for labour statistics within the United Nations System and the primary reference for all statistics-related issues within the ILO. As such, it has three fundamental mandates: (i) providing relevant, timely and comparable statistics on as many labour market topics as possible; (ii) developing international standards with a view to improving the measurement of labour issues and enhancing international comparability; (iii) supporting member States in developing and improving their labour statistics via training programs, capacity building and technical assistance.

1. The responsibility for opinions expressed in this article rests solely with its authors, and publication does not constitute an endorsement by the International Labour Office of the opinions expressed in it.

In order to achieve its goals, the ILO Department of Statistics produces a wide range of indicators that are related to the world of work, and then disseminates them through ILOSTAT¹, the largest and most comprehensive international repository of labour statistics in the world. The ILOSTAT database provides a large set of country-specific indicators, covering numerous labour-related topics. It assembles in one place national figures on the main labour market topics such as employment, unemployment, working time and earnings, but also on additional labour-related subjects such as social protection and industrial relations, proving to be instrumental in creating a broader and more detailed picture of the labour market situation.

ILOSTAT provides the public with annual, quarterly and monthly time series data, some of which cover periods of over half a century². It includes country-level, regional and global estimates and projections of the main labour market indicators³ as well as ad-hoc data collections on specific topics (e.g. international labour migration).

Occasional or basic users looking for a specific piece of information can get immediate access to the data and related metadata via the table viewer, or by downloading an Excel summary table. Indicators are presented on the home page of ILOSTAT grouped by “subjects”⁴. The table viewer shows data for the chosen indicator in a customizable table where users can select what to display, in terms of reference areas (countries, regions, etc.), time periods, sex, classification categories (such as age bands for age disaggregation, economic sectors for disaggregation by economic activity, etc.) and sources. Tables can be downloaded in different formats (Excel, CSV or SDMX). For regular or more advanced users, especially those who wish to consult broader information (covering several indicators, areas, etc.), ILOSTAT provides two tools which enable data extraction, handling and analysis: the ILOSTAT SDMX web service⁵, and the ILOSTAT bulk download facility. Both retrieve the information in machine-readable files that can then be imported into the user’s preferred tool⁶.

Maintenance of ILOSTAT involves multiple stages, from the data collection and production that populate the database, to dissemination and accessibility, each of which uses various software for data handling and analysis. R (R Core Team, 2019) plays a major role in every part of this process. This paper presents two R tools developed by the ILO Department of Statistics and used to help improve users' interactions with the database: Rilostat, the ILOSTAT R package, and SMART, the Statistics Metadata-driven Analysis and Reporting Tool.

Rilostat takes advantage of the bulk download facility to allow users to search, rearrange, analyse, visualize and download labour market data disseminated on ILOSTAT, benefiting also from all the potential that the R software offers to the community. SMART is a statistical processor and transcoding tool able to produce datasets by processing microdata or aggregate data in several formats, based on the structural metadata read from an SDMX Dataflow or Data Structure Definition (DSD), with the purpose of reporting or exchanging data in SDMX.

¹ Available at https://ilostat.ilo.org/.

² For instance, ILOSTAT includes data from the Current Employment Statistics Survey (an Establishment survey) of the United States from 1938 and from the USA Current Population Survey as from the first quarter of 1948.

³ More information can be found on https://www.ilo.org/ilostat-files/Documents/LFEP.pdf.

⁴ For instance, the indicator on employment by sex and age is found under the subject employment.

⁵ More information is available at https://www.ilo.org/ilostat-files/Documents/SDMX_User_Guide.pdf.

⁶ More information available at https://www.ilo.org/ilostat-files/Documents/ILOSTAT_BulkDownload_Guidelines.pdf.

This document is structured as follows: section 2 explains the main features of the Rilostat package in detail, including all its functionalities, and presents three examples of data visualization. Section 3 uses data retrieved from the ILOSTAT database using Rilostat to perform a principal component analysis of the SDG labour market indicators collection. Section 4 provides an overview of SMART and demonstrates its main functionalities with two use cases. Section 5 concludes.

2 Rilostat - ILOSTAT’s R package

During the past few years, the statistical information collected by the ILO Department of Statistics and disseminated through ILOSTAT has grown exponentially. The significant increase in data available was mainly due to improvements to the data compilation, data production and data dissemination processes. Efforts made by national statistical systems to report data to the ILO in a timely and regular manner have gone hand-in-hand with the ILO’s household survey microdata processing, which derives comparable indicators following international standards and definitions to the extent possible⁷.

Casual users can access the required statistical information directly by identifying and selecting the corresponding tables from the ILOSTAT website. However, more frequent or advanced data users may wish to avoid having to select on-screen tables, especially if they are involved in research projects with a strong computational component, and would like to easily replicate their actions⁸ (Gandrud, 2013). ILOSTAT provides users with two services that allow the programmatic extraction of large sets of information: the SDMX web service, and the bulk download facility.

In September 2017, ILOSTAT released the first version of Rilostat⁹, the package for R which provides a new way of accessing the ILOSTAT database. Its source code is largely based on the algorithm and documentation developed for accessing the Eurostat open database, the eurostat R package¹⁰ (Lahti et al., 2017); it uses the existing architecture of the ILOSTAT bulk download facility and the related file structure to fetch individual datasets or the complete ILOSTAT database.

The package is maintained by the ILO’s Department of Statistics¹¹, and gives data users with knowledge of R the ability to access the ILOSTAT database, along with all its built-in functions to search for data, rearrange it and download it in the desired format, while benefiting from the vast amount of functionality already available in R for data formatting, visualization, analysis and results reporting.

⁷ In 2016, the ILO Department of Statistics started to systematically process labour-related household surveys (HS), mainly labour force survey (LFS) micro datasets, in order to improve the quality and coverage of data published on ILOSTAT. More information on this can be found in the “ILOSTAT Microdata Processing Quick Guide: Principles and methods underlying the ILO’s processing of anonymized household survey microdata”.

⁸ Moreover, as pointed out by Gandrud (2013) and later by Lahti et al. (2017): “Availability of algorithmic tools [. . . ] can greatly benefit reproducible research, as complete analytical workflows spanning from raw data to final publications can be made fully replicable and transparent”.

⁹ Available on CRAN (https://cran.r-project.org/web/packages/Rilostat/index.html).

¹⁰ http://ropengov.github.io/eurostat/.

¹¹ Issues can be reported through: https://github.com/ilostat/Rilostat.


2.1 Main uses of Rilostat

The Rilostat package has numerous uses, namely:

• Providing access to ILOSTAT annual, quarterly and monthly time series via the ILOSTAT bulk download facility;

• Allowing users to search and download ILOSTAT data and related metadata in the three ILO official languages: English, French and Spanish;

• Giving the ability to return POSIXct dates for easier integration into plotting and time series analysis packages available for R;

• Returning data in long format for better interaction with widely used packages such as ggplot2 and dplyr;

• Providing access to the most recent updates of the ILOSTAT database;

• Allowing for grep-style search of data descriptions and names;

• Providing access to the ILOSTAT catalogue and related descriptive metadata.

2.2 Getting started with Rilostat

2.2.1 Installation

The CRAN release version of Rilostat is installed by executing the usual commands¹²:

install.packages("Rilostat")

library(Rilostat)

The package works with an “imports” directive that loads its necessary packages¹³. As stated in Rilostat’s reference manual, there are several other packages which could be useful to have installed in order to handle, visualize and analyse data¹⁴. All the functions that are part of the package are listed as a data frame after running the following command:

as.data.frame(ls("package:Rilostat"))

¹² Similarly, the user can install the development version via Github: https://github.com/ilostat/Rilostat.

¹³ plyr, dplyr, stringr, readr, tibble, haven, xml, data.table, RCurl, DT.

¹⁴ A non-exhaustive list of suggested packages is: shiny, plotly, ggplot2, knitr, rmarkdown, roxygen2, rsdmx, plotrix, Cairo, testthat, tidyr, devtools and covr.


2.2.2 Searching for data

Just like the bulk download facility it is built on, Rilostat gives access to ILOSTAT datasets through two different directories, based on two different ways of presenting the information: organizing them by 'indicator' (and frequency) or by 'ref_area' (and frequency). The 'indicator' refers to the title of each specific table, including the represented variable and any disaggregations used for it (for instance, 'labour force by sex and age', 'employment by sex and economic activity' and 'unemployment rate by sex, age and rural/urban areas' are ILOSTAT indicators). The 'ref_area' (i.e. reference area) refers to the geographic areas for which data are available. Since ILOSTAT includes both country-level data and regional and global estimates, the reference area can refer to countries, to regions (geographic regions such as Africa, Americas or Arab States, income groups such as low-income countries, or other groups such as the BRICS or the G20) or to the world as a whole¹⁵. The frequency refers to whether the data points are annual, quarterly or monthly.

Taking this into account, a first step in searching for data is to get the code of the 'indicator' or 'ref_area' the user is looking for. The function get_ilostat_toc() provides grep-style searching that returns all the data files available for consultation in the corresponding directory, and provides summary information on each data file matching the query. The following line gives access to the table of contents of all available indicators in ILOSTAT by indicator (default):

toc_ind <- get_ilostat_toc()

The arguments available for this function allow the user to set the segment required ('indicator' (default) or 'ref_area'), the preferred language among the three ILO official languages: English ('en'; default), French ('fr') and Spanish ('es'), the pattern within the description to be searched ('none' by default) and the filters on the variables ('none' by default) in order to get parts of the table. For instance, a narrower search would be to look for (1) all available indicators containing the word 'unemployment', or (2) to get the label of the reference area by frequency for all available datasets in two countries:

# (1) indicators whose description contains 'Unemployment'
toc_une <- get_ilostat_toc(search = 'Unemployment')

# (2) datasets by reference area for two countries (regular-expression search)
toc_cou <- get_ilostat_toc(segment = 'ref_area', search = c('Philippines|Thailand'), fixed = FALSE)
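The role of `fixed = FALSE` in a grep-style search can be seen on a toy table in base R. This is only a sketch of regular-expression versus literal matching using `grepl`; the rows and column names below are made up and do not reproduce the table returned by the package.

```r
# Toy table of contents (hypothetical rows and column names).
toc <- data.frame(
  id    = c("PHL_A", "THA_A", "COL_A"),
  label = c("Philippines", "Thailand", "Colombia"),
  stringsAsFactors = FALSE
)

pattern <- "Philippines|Thailand"

# fixed = FALSE: the pattern is a regular expression, so "|" means "or".
hits_regex <- toc[grepl(pattern, toc$label, fixed = FALSE), ]

# fixed = TRUE: the pattern is matched literally, so no label matches.
hits_fixed <- toc[grepl(pattern, toc$label, fixed = TRUE), ]

print(hits_regex$id)
print(nrow(hits_fixed))
```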

The codes or identifiers used in the table of contents for the indicators and reference areas in the first column ('id') are unique and allow for the unequivocal identification of the corresponding item to be consulted. For reference, note that code names all follow the same structure. The indicator code names include, in this order, the code of the topic, the represented variable, the disaggregations included ('NOC' for 'no classification' if there is no disaggregation), the unit ('NB' for absolute values or numbers and 'RT' for percentages or rates) and the frequency ('A' for annual data, 'Q' for quarterly data and 'M' for monthly data). Similarly, the code names of the files by reference area refer to the country (ISO Alpha-3 country code) or the region (codes starting with X) and the frequency ('A', 'Q' and 'M').
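As an illustration of this naming structure, an indicator code can be split into its parts in base R. The helper `parse_indicator_id` below is hypothetical and not part of Rilostat; it assumes that the parts are separated by underscores and that each disaggregation is a single token.

```r
# Hypothetical helper: split an indicator code into its documented parts.
parse_indicator_id <- function(id) {
  parts <- strsplit(id, "_", fixed = TRUE)[[1]]
  n <- length(parts)
  stopifnot(n >= 5)
  list(
    topic     = parts[1],
    variable  = parts[2],
    classif   = parts[3:(n - 2)],  # one or more disaggregations, or "NOC"
    unit      = parts[n - 1],      # "NB" (numbers) or "RT" (rates)
    frequency = parts[n]           # "A", "Q" or "M"
  )
}

p <- parse_indicator_id("UNE_DEAP_SEX_AGE_RT_A")
str(p)
```

Read this way, the example code is an annual (A) rate (RT) on unemployment, disaggregated by sex and age.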

¹⁵ It is important to note that global and regional estimates are only available for some indicators, and thus most datasets would only include country-level data.


2.2.3 Downloading data

The function get_ilostat() explores ILOSTAT and returns single or multiple datasets by indicator (default, segment = 'indicator') or by reference area (segment = 'ref_area'), using the code obtained at the identification step. The following code lines return: (1) unemployment rate by sex and age (%), annual; and (2) all available annual data points for Afghanistan and Trinidad and Tobago:

# (1) annual unemployment rate by sex and age
dat_une <- get_ilostat(id = 'UNE_DEAP_SEX_AGE_RT_A', segment = 'indicator')

# (2) all annual data for Afghanistan and Trinidad and Tobago
dat_att <- get_ilostat(id = c('AFG_A', 'TTO_A'), segment = 'ref_area')

In addition to the arguments available for the function to search for data, the user can also find within this function options to set the type of the variables ('code' (default), 'label' or 'both'), which allows for getting codes and/or human-readable labels, the format in which the time column is to be returned ('raw' (default), 'date', 'date_last' and 'num'), filters that can be applied to the dataset (explained in more detail in the following section), the option to do caching (TRUE by default), whether the cache generated is to be updated (FALSE by default) and the desired format of the file to be stored as cache ('rds' (default), 'csv', 'dta', 'sav', 'sas7bdat') (see Section 7 for more information).

Since datasets are downloaded using the ILOSTAT bulk download facility, the structure of the tibble obtained with this function mirrors the structure of the CSV file which would have been extracted using the bulk download. That is, the rows after the header names present the data records, consisting of the key of the record (the 'names' of the dimensions used to identify each record, including the data collection, the reference area, the source of the data, the classifications used, etc., referring to all fields from 'collection' to 'time'), the observation value ('obs_value') and any other metadata available (such as the geographical coverage of the source or the specific definitions used for some concepts, referring to all fields from 'obs_status' to 'note_source').

2.2.4 Filtering data

The option 'filters' is available as an argument within the function get_ilostat(). It offers the possibility of retrieving a subset of the requested dataset by passing a named list ('none' by default). The names of this list are the variable codes (code names used as headers of the dataset retrieved), and the values are vectors of predefined disaggregations. The user can access an extensive list of these disaggregations, known as dictionary files, through the function get_ilostat_dic(). For instance, it is possible to obtain the annual unemployment rate (%) for women in Colombia by executing the following code:

dat_une_col <- get_ilostat(id = 'UNE_DEAP_SEX_AGE_RT_A', segment = 'indicator', filters = list(ref_area = 'COL', sex = 'SEX_F'))
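What such a filter list expresses can be shown on a toy data frame in base R. The helper `apply_filters` below is hypothetical and uses exact matching; it illustrates the idea of a named list of allowed codes, not how the package implements filtering, and the observation values are made up.

```r
# Toy dataset with the kind of columns described above (values are made up).
dat <- data.frame(
  ref_area  = c("COL", "COL", "PER", "PER"),
  sex       = c("SEX_F", "SEX_M", "SEX_F", "SEX_M"),
  time      = c("2018", "2018", "2018", "2018"),
  obs_value = c(12.1, 7.3, 4.5, 3.2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose value in every named column
# belongs to the vector of allowed codes for that column.
apply_filters <- function(df, filters) {
  keep <- rep(TRUE, nrow(df))
  for (col in names(filters)) {
    keep <- keep & df[[col]] %in% filters[[col]]
  }
  df[keep, , drop = FALSE]
}

sub <- apply_filters(dat, list(ref_area = "COL", sex = "SEX_F"))
print(sub)
```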


2.3 Data Visualization with data extracted using Rilostat

Taking advantage of all the potential that R offers to the community, the user can handlethe information extracted directly in R and use the available functions and packages fordata visualization. Some of the most widely used packages for data handling and visualiza-tion are already loaded as an “import” directive when installing Rilostat. Other packagescan be installed for more advanced plotting manipulation.

Figures 1, 2 and 3 show three different visualization examples using information fetchedfrom the ILOSTAT database and packages viridis (Garnier et al., 2018), scales (Wickham,2018), plotly (Sievert et al., 2019) and ggridges (Wilke, 2018) among others. For theseexamples, the data used is taken from the compilation “ILO modelled estimates, Novem-ber 2018”, which is methodologically robust and consistent across countries and thereforeensures international comparability. It also includes regional and global aggregates. TheR code to produce them can be found in section 7.

Figure 1: Evolution of the global employment distribution by occupation, 1991-2023 (ILO modelled estimates, November 2018)

Figure 2: Share of youth not in employment, education or training (NEET), 2017 (ILO modelled estimates, November 2018)

Figure 3: Distribution of the labour force participation rate by sex, 1990, 2000, 2010, 2018 and 2030 (ILO modelled estimates, July 2018)

3 An application: A principal component analysis of the SDG indicators

3.1 The Sustainable Development Goals

In January 2016, the international community adopted a set of 17 Sustainable Development Goals and 169 targets meant to take on the unfinished aspects of the Millennium Development Goals (MDGs) agenda and the new global challenges. They cover three key elements: economic growth, social inclusion and environmental protection.16

Multiple institutions at the national and international level monitor the development of the SDGs across all regions of the world and over multiple follow-up stages. Moreover, the achievement of all goals, set to be accomplished by 2030, is meant to ensure at the same time their sustainability in the long run. As stated in ILO (2018), the global goals "promote prosperity while protecting the planet, putting forward the idea that ending poverty must be aligned with strategies for economic growth and addressing at the same time social needs and environmental concerns". Thus, a continuous analysis of each indicator individually, as well as of the set of indicators and their interactions, constitutes an essential part of the reviewing process.

16 More information can be found at https://www.ilo.org/wcmsp5/groups/public/---dgreports/---stat/documents/publication/wcms_647109.pdf.


3.2 Principal component analysis (PCA)

Factorial analysis, or multivariate data analysis, comprises descriptive and exploratory statistical methods commonly used to summarize large sets of data and to produce a simpler picture of their structure. Their main objective is to describe the relationships between variables (dimensions) in terms of a potentially lower number of unobserved variables (factors).

Principal Component Analysis examines the linear relationship between correlated quantitative variables by creating uncorrelated synthetic factors (i.e. principal components) that belong to a lower dimensional space, and therefore allows for a more direct interpretation. These components (or factors) are linear combinations of the initial variables that retain most of their variance, guaranteeing a proper representation of the individuals' interactions and the existing heterogeneity between them (Lebart et al., 1995; Escofier and Pages, 2008).

The interpretation of the PCA results is mainly based on two measures: 1) the quality of the representation that can be achieved when reducing dimensions, and 2) the distance, in terms of the created synthetic factors, between a pair of individuals.
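As a minimal sketch of these two measures, the following base R code runs a standardized PCA on the built-in USArrests dataset. That dataset merely stands in for a table of indicators by reference area; it is not the SDG data analysed in this paper. The sketch uses prcomp() rather than the packages employed later in this section, and the names cos2_plane and dist_12 are illustrative.

```r
# Standardized PCA on a built-in dataset (a stand-in, not the SDG data).
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

eigenvalues <- pca$sdev^2                      # variance carried by each component
explained   <- eigenvalues / sum(eigenvalues)  # share of the total variance

# 1) Quality of representation of each individual on the first factor plane:
#    squared cosine = squared coordinates on axes 1-2 / squared distance to the origin.
scores     <- pca$x
cos2_plane <- rowSums(scores[, 1:2]^2) / rowSums(scores^2)

# 2) Distance between two individuals measured on the retained components.
dist_12 <- sqrt(sum((scores["Alabama", 1:2] - scores["Alaska", 1:2])^2))

print(round(explained, 3))
print(head(round(cos2_plane, 3)))
print(dist_12)
```

Individuals with a squared cosine close to 1 are well represented on the plane, so their relative positions can be interpreted with more confidence.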

3.3 Data

The SDG data collection available on ILOSTAT contains a set of SDG labour market indicators for which the ILO is either the custodian agency or one of the partner agencies responsible for reporting at the global level. The information retrieved using Rilostat consists of 11 quantitative indicators17: (SDG_0111) working poverty rate; (SDG_0131) proportion of population covered by social protection floors/systems; (SDG_0552) female share of employment in managerial positions; (SDG_0821) annual growth rate of output per worker; (SDG_0831) proportion of informal employment in non-agricultural employment; (SDG_0852) unemployment rate; (SDG_0861) proportion of youth (aged 15-24 years) not in education, employment or training; (SDG_N881) non-fatal occupational injuries per 100'000 workers; (SDG_F881) fatal occupational injuries per 100'000 workers; (SDG_0922) manufacturing employment as a proportion of total employment; (SDG_1041) labour income share as a percent of GDP. A detailed description of the dataset can be found in the appendix (section 7).

The availability of information for each of the SDG indicators can vary from one reference area to another for multiple reasons18. Given that the PCA needs a complete input dataset, in what follows we (1) keep only reference areas with 70% or more of the indicators for the analysis, that is, 54 of the 183 reference areas, and (2) treat the remaining missing values following the method proposed in Josse and Husson (2012) and available in the package missMDA (Husson and Josse, 2019).
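Step (1) can be sketched in base R as follows. The data frame below is a toy example with made-up values and hypothetical names (ind_a to ind_d, areas AAA to DDD); only the 70% threshold comes from the text. The imputation in step (2) is not reproduced here.

```r
# Toy indicator table: rows are reference areas, columns are indicators.
x <- data.frame(
  ind_a = c(1.2, NA, 3.1, NA),
  ind_b = c(0.5, NA, 0.7, 0.9),
  ind_c = c(10, 12, NA, NA),
  ind_d = c(4, NA, 6, NA),
  row.names = c("AAA", "BBB", "CCC", "DDD")
)

# Share of indicators available for each reference area.
coverage <- rowMeans(!is.na(x))

# Keep only the reference areas with 70% or more of the indicators.
kept <- x[coverage >= 0.70, , drop = FALSE]

print(coverage)
print(kept)
```

The remaining missing cells in the kept rows are the ones that would then be passed to the imputation method.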

3.4 PCA Results

As previously mentioned, the PCA gives us useful information on the differences in the underlying structure of a dataset. However, it must be emphasized that this type of analysis does not aim at making any statistical inference, but rather at carrying out a multidimensional exploratory analysis without distributional assumptions on the variables under analysis.

17 Due to the amount of missing values in indicator 8.5.1 (Average hourly earnings of female and male employees) and indicator 8.7.1 (Proportion of children engaged in economic activity and household chores (%)), these indicators are not part of the analysis.

18 For instance, the lack of sources of information to collect, process and estimate the indicators.


Nine out of the 11 indicators enter the analysis as active variables, whereas SDG 1.1.1 (working poverty rate) and 8.3.1 (proportion of informal employment in non-agricultural employment) are set as supplementary variables, i.e. they are not part of the process of building the synthetic factors19. The analysis uses the packages FactoMineR (Husson et al., 2018) and factoextra (Kassambara and Mundt, 2017) for the extraction of the results and their visualization, respectively. Three factors are kept for analysis (those whose related eigenvalue is greater than unity20), summarizing 66.1% of the total heterogeneity.

Figure 4 presents the first factor map (dimensions one and two) with the projections of the SDG indicators and the reference areas simultaneously. The indicators associated with each factor are those that contributed the most to its construction. The quality of the representation of reference areas on the map is established by their squared cosine, represented by the colour of their label. Thus, red, pink and purple reference areas are considered for interpretation.

The first principal axis explains 35.7% of the total variance and is associated with the proportion of youth (aged 15-24 years) not in education, employment or training (SDG_0861), the number of fatal occupational injuries (SDG_F881), the proportion of population covered by social protection floors/systems (SDG_0131) and labour income share as a percent of GDP (SDG_1041). The second factor is characterized by the annual growth rate of output per worker (SDG_0821) and the unemployment rate (SDG_0852), and explains 17.7% of the total variance. Finally, the third dimension accounts for 12.7% and has high contributions from the youth in NEET (SDG_0861) and non-fatal occupational injuries (SDG_N881).
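The retention rule and the cumulative share can be checked with a few lines of arithmetic. The sketch below assumes that the nine active variables are standardized, so that the total inertia equals 9. This is an assumption made for illustration, not something stated in the text; the three shares are the ones reported above.

```r
# Shares of the total variance reported for the three retained factors.
share <- c(dim1 = 0.357, dim2 = 0.177, dim3 = 0.127)
p <- 9  # number of active variables

# Assumption: standardized variables, so the eigenvalues sum to p.
eigenvalue <- share * p

print(eigenvalue)           # implied eigenvalues of the retained factors
print(all(eigenvalue > 1))  # eigenvalue-greater-than-one rule
print(sum(share))           # cumulative share of the retained factors
```

Under that assumption, each retained factor carries more variance than a single original variable, which is the rationale for the eigenvalue threshold.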

Figure 4: Factor map with variables and individuals projections.

19 These variables are left as supplementary given that most of the European countries do not report information on them.

20 The total variance (inertia) is given by the sum of all eigenvalues related to the covariance matrix (Lebart et al., 1995).


Projections of reference areas onto the first factor map help in understanding the relationship between a pair of points (reference areas) and between each point and the components built. Reference areas projected along the first dimension on the positive side, e.g. Armenia, Bolivia, Colombia and South Africa, are identified with a high proportion of youth not in education, employment or training and a high level of occupational injuries. On the contrary, reference areas projected onto the negative side, e.g. Austria, Belgium, Estonia, Germany, Finland, Netherlands, Norway, Sweden and Switzerland, are associated with a high proportion of the population covered by social protection floors/systems and high levels of labour income share as a percent of their GDP. Greece and Spain are projected onto the positive side of the second axis, associated with high unemployment levels, in contrast to the Philippines, which is projected on the side associated with a positive annual growth rate of output per worker.

These results show evidence of the variability between the set of SDG labour indicators in multiple reference areas. A broader multivariate description can be achieved by including information on more reference areas and more SDG indicators as data become available. Some of the results, e.g. the new coordinates of the variables and observations obtained after reducing dimensions, could also be used as the initial step in a further statistical analysis of labour market indicators at the global level.

4 SMART: Statistics Metadata-driven Analysis & Reporting Tool

In order to strengthen the capacity of countries to report labour statistics to ILOSTAT, the ILO Department of Statistics is developing a toolkit that facilitates the table production procedures. SMART21 receives as input a micro dataset from an LFS (or an aggregated dataset) and the specification of the tables to be produced by means of a DSD. The output files can be generated in diverse formats and used for analysis, data reporting or to feed a dissemination platform. In particular, it is a useful tool to produce SDMX datasets for reporting SDG (or any other) data in the absence of a proper reporting platform able to produce SDMX datasets.

As shown in the SMART Concept Map (Figure 5), there are three major modules involved in performing an analysis:

• Data and DSD inputs

SMART has two main inputs: a dataset with the source information (in Stata, SPSS, CSV or SDMX-ML format) and an XML file/message containing an SDMX-ML DSD with one or multiple data structures to define the cross tabulations to be generated. This DSD can be a local file or a message queried from an SDMX registry online.

In processing the input data, SMART can count cases, summarize, compute means and filter records based on complex conditions. However, it is not advisable to attempt to follow complex questionnaire sequences in the calculation of the indicators, but rather to pre-process the micro data to compute and add derived variables using more powerful statistical packages such as R, Stata, SPSS or SAS. These variables can then be used in the production of the output cross tabulations using SMART.

21 Available at https://ilostat.github.io/smart/.


Figure 5: SMART Concept Map

• Mapping

The mapping process links the concepts in the DSD with the variables from the input data. Usually a DSD defines three major roles for concepts, namely Dimension, Primary Measure and Attribute. The Primary Measure and all the Dimensions must be mapped to input variables or assigned a constant value, while the mapping of Attributes is optional most of the time, as they refer to descriptive metadata (notes). However, attributes defined as mandatory must be mapped.

For the categorical variables in the input data whose codes differ from the classification items in the DSDs, a mapping for each category must be created. This makes it possible, for example, to process a dataset which codes the variable gender as Male=1 and Female=2. If in the DSD this variable is named SEX and uses the labels "M" and "F", it is necessary to generate a mapping that assigns the variable gender to SEX and items 1 and 2 to M and F, respectively. Some classification items from the DSD can be left unmapped, in which case they will not be included in the tabulation. Similarly, some categories in the dataset can be left unmapped, and these records will not be counted in the tables that include such a classification.

The attributes in the DSD are presented to the user with a list of valid options to select from. If the DSD includes one or more free-text attributes for open text notes, they can be added at this stage.

• Generate

When all the data has been entered and the variables in the dataset are properly mapped to the concepts in the DSD (Dimensions, Primary Measure and Attributes, if any), the user is requested to select the format(s) for the output report. The available options are: .csv for ILOSTAT, .Stat v7 data and dimension "pipe-separated" files, SDMX data messages (in SDMX-ML, SDMX-JSON or SDMX-CSV) for SDG reporting, and Excel.
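The category mapping described under the Mapping module (variable gender coded 1/2, mapped to SEX with items M/F) can be mimicked outside SMART with a few lines of base R. This is only an illustration of the logic, not SMART's own implementation; the code 9 is a made-up unmapped category.

```r
# Input variable as coded in the dataset; 9 is a category left unmapped.
gender <- c(1, 2, 2, 1, 9)

# Mapping of input categories to the classification items of the DSD.
sex_map <- c("1" = "M", "2" = "F")

# Recode; categories without a mapping become NA.
SEX <- unname(sex_map[as.character(gender)])

# table() drops NA by default, so unmapped records are not counted.
print(table(SEX))
```

Records with an unmapped category are simply absent from the tabulation, which is the behaviour described above for classifications left unmapped.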

4.1 Showcases

4.1.1 SDG Reporting

To track the progress made on all SDGs, the UN has urged all national governments to report annually on SDG indicators. Furthermore, the reporting platforms developed at the national level should support international standards and common formats to facilitate data exchange both within and between countries. This includes using SDMX, a global data exchange initiative, to process and report on the SDGs.

Compiling and reporting SDG data in SDMX messages (XML or JSON) requires the national dissemination platform to be able to handle SDMX artefacts. If such a reporting platform is not yet in place, producing SDG data in SDMX becomes rather challenging. SMART is meant to facilitate this task. As an example, we demonstrate that SDG indicators prepared in Excel can be easily converted to SDMX by using the SMART transcoding feature. This example is embedded in SMART and can be loaded via Project → Open Example Project.

The project folder contains the input data file 1.3.1 NINE INDICATOR.csv22 (as well as its original Excel file 1.3.1 NINE INDICATORS.xlsx), the desired output table structures in DSD SDG DSD(0.3).xml, the output.zip and the project file Example SDG Reporting.smart.

Using this example, suppose the reporting agency has properly aggregated from its microdata source the data contained in 1.3.1 NINE INDICATOR.csv and needs to convert the Excel file into SDMX in order to report it. To do so, the agency needs to undertake the following tasks:

• Obtain a DSD or dataflow for SDGs. Since 2016, the Inter-agency and Expert Group on SDGs (IAEG-SDGs) has been developing the SDMX solution for SDG Indicator data and metadata exchange and dissemination. The pilot SDG Dataflow DF_UNDATA_SDG_PILOT developed by the IAEG-SDGs is available for use.

• Allocate the variables in the input data to the Dimensions, Attributes and Primary Measure in the DSD. This assignment task can be handled inside the SMART Mapping module.

• Recode the mismatched items between the input CSV and the DSD definition. This can be handled in the Mapping module as well.

• Write in SDMX-ML or SDMX-JSON. Various SDMX outputs can be generated in the Generate module.

22 Proportion of population covered by social protection floors/systems, by sex, distinguishing children, unemployed persons, older persons, persons with disabilities, pregnant women, newborns, work-injury victims and the poor and the vulnerable.


These tasks translate into the following steps23:

1. Upload the input data 1.3.1 NINE INDICATOR.csv by clicking the button Add... or just drag-and-drop it into the Datasets area.

2. In Data Structures, click Online Query and select SDMX UNSD in the dropdown menu of SDMX Registry. Choose the dataflow DF_UNDATA_SDG_PILOT and then click the Load button.

3. Once both the input data and the output data structures have been added, click the Next button to advance, which brings you to the Mapping module.

4. Map the input variables to the different SDMX concepts, i.e., Dimensions, Primary Measure and Attributes. The detailed mapping guidelines can be found on the SMART Reference page. In general, this exercise requires the user to have good knowledge of the input data. For example, the variable "Observation.Value" should be recognized as the primary measure and is therefore assigned in the Measures panel. In the Dimensions panel, some dimensions may not be found in the input data, e.g. FREQ. Based on the characteristics of the data, the user understands that these are annual aggregated data, so FREQ needs to be set to the constant value "Annual (A)".

5. Go to the Generate module and click Process. The results are presented in the Table Viewer according to the DSD definition. To export them in SDMX, the user can specify the output directory and the desired SDMX formats from the options and then click the Export button. Within a few seconds, the user can see the SDMX outputs in the Export Folder (or click Open Export Folder).

4.1.2 Microdata Processing

Besides converting aggregated data, SMART is also able to handle microdata24 directly. In particular, it can count cases, summarize, compute means and filter records based on complex conditions. For ease of reference, the embedded project "Process Microdata: Unemployment" (loaded via Project → Open Example Project) is used here for demonstration.

The project folder contains the input microdata file Miranda Eng.sav (in SPSS format, a dataset derived from a household survey), two input DSDs, YI_X01_UNE_TUNE_SEX_AGE_NB.xml and YI_X01_UNE_TUNE_SEX_AGE_GEO_NB.xml (which can be downloaded from the ILOSTAT SDMX web portal), the output.zip and the project file Example UNE.smart. The objective of this project is to calculate and report the unemployment level by two types of breakdowns: by sex and age, and by sex, age and rural/urban areas. The following steps explain how to use this microdata with SMART.

1. Add the data and the two DSDs in the "Datasets and table structures" module. Notice that these two tables (DSDs) are reported jointly because they share some common concepts in dimensions and attributes, such as CLASSIF_SEX and CLASSIF_AGE. The mapping of these common concepts is only required once.

2. Press the Next button to advance to the Mapping module.

23 Notice that Open Example Project automatically prepares the relevant inputs in Data and DSD inputs and Mapping, so that you may directly advance to the Generate module to process.

24 SMART doesn't provide any data processes for cleaning, validation and editing; the microdata it handles should be ready for the aggregation analysis.

3. Map the input data variables and items to the SDMX output concepts, namely Primary Measures, Dimensions and Attributes.

4. Press the Next button to advance to the Generate module and press the Process button to generate the tables.

A few remarks on the mapping procedure:

• The data in this exercise doesn't have any real-world interpretation and only serves the purpose of demonstration.

• The Primary Measure isn't mapped to any variable but is rather based on counting cases (tallying) under a filtering condition. That is, the individuals whose main activity during the last year (MAJACTYR) is either "Looked for work" (3) or "Wanted work and available" (4) are considered unemployed and are thus counted as the measure of unemployment.

• The example data doesn't contain sample weights, but in practice it is mandatory to include them to allow data aggregation from the micro level. To map the sample weights in SMART, go to the Others tab and assign the corresponding input variable. The selection of sample weights is only enabled when data aggregation is needed.

• Quality measures based on the number of observations are also considered in SMART. By default, if the number of observations is less than 5, the calculated value is marked as "Not Available"; if the number of observations is between 5 and 15, the value is marked as "Unreliable". The default criteria can be altered in the Others tab.

• The observation status criteria and the repetitive runs feature allow users to perform multiple procedures to determine the best level of breakdown. For example, the mapping of CLASSIF_AGE to 10YRBANDS (ten-year age bands) would result in many domains with observations marked "Not Available" or "Unreliable". To ensure enough observations in each age group, we could report the statistics based on a wider age breakdown, such as YTHADULT (youth and adult groups), or even replace the breakdown with a total. The decision on the breakdown level can be assessed using these observation criteria.

10YRBANDS   YTHADULT   No Breakdown
15-24       15-24      Total
25-34       25+
35-44
45-54
55-64
65+

Table 1: Breakdown assessment on CLASSIF_AGE

• Rate/ratio calculation is also supported in SMART. In the mapping of the Primary Measure, press the button Specify Denominator (+/-) to specify the quotients between variables. An example project "Process Microdata: Labor Force Participation Rate" can be found via Project → Open Example Project.

• Once the mapping is set, users can preview the output layout in the Table Structure Preview (press the Table Viewer button in the section tool bar). As an example, a screenshot of the structure preview for table YI_X01_UNE_TUNE_SEX_AGE_GEO_NB is captured in Figure 6.

Figure 6: Screenshot of the structure preview
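The tallying rule and the observation-status criteria from the remarks above can be sketched in base R. The microdata below are made up, the function name flag_status and the label "OK" are hypothetical, and treating exactly 15 observations as "Unreliable" is an assumption, since the text does not say whether that bound is inclusive. This is not SMART's own code.

```r
# Toy microdata; MAJACTYR codes 3 and 4 count as unemployed, as in the example project.
micro <- data.frame(
  SEX      = c("M", "F", "F", "M", "F", "M", "F", "F"),
  MAJACTYR = c(3, 4, 1, 3, 3, 2, 4, 4)
)

# Primary measure: number of cases satisfying the filtering condition, by sex.
unemployed <- micro[micro$MAJACTYR %in% c(3, 4), ]
counts <- table(unemployed$SEX)

# Observation status from the number of observations (default thresholds from the text).
flag_status <- function(n, min_n = 5, unreliable_n = 15) {
  ifelse(n < min_n, "Not Available",
         ifelse(n <= unreliable_n, "Unreliable", "OK"))
}

result <- data.frame(sex    = names(counts),
                     n      = as.vector(counts),
                     status = flag_status(as.vector(counts)))
print(result)
```

Rerunning such a check with a coarser breakdown is one way to see how many cells move out of the "Not Available" or "Unreliable" status.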

4.2 Highlight features

Beyond the key functionalities that SMART provides, some other features and tools are also worth highlighting, as they make it more user friendly and sustainable:

• Online data and metadata Query: The input data and DSD (or dataflow) can be pulled directly from an SDMX API into SMART without having to download them as local files first.

• Command line utility: SMARTcmd.exe is a command line version of SMART intended to be used for batch process automation. SMARTcmd reads a project file (.smart) previously saved from the normal GUI-based version and executes either the aggregations or the transformations. Besides the project file, it is possible to specify several parameters in the command line which will supersede the values in the project file for this run. For example, the input and output file names can be changed for repeated transformations of the same type of files; or the parameter "-append" can be used to create a single output file in several runs with different input files of the same type (e.g. different quarterly data of the same household survey). To use SMARTcmd, just store it in any folder by clicking on Tools → Send SMARTcmd.exe to...

• Reusable mapping: The mapping can be saved in CSV format for further re-use. This is a useful feature, as mappings can be partially uploaded from different saved ones. SMART looks at the names of concepts and variables and checks if they match the mappings that are uploaded. If a concept already mapped in memory is found in the mapping file being uploaded, the whole mapping for this concept will be updated.

• Repetitive Runs: SMART can take as many runs as possible (until it reaches the memory limit) for the calculation process. For example, if there are multiple input data files to be reported on the same DSD, we can run one data file at a time and then go back to process another. In the end, the results generated from these data files can be reported jointly in a single SDMX data message.

• DSD Constructor: This is a SMART companion tool and a standalone application which is able to create and edit DSDs and their components (i.e. dimensions, attributes, measures and code lists) in order to generate the DSD which fits a given user's needs. The DSDs can then be used by SMART to obtain the required output dataset. It can grab concepts and associated code lists from any SDMX registry and also allows users to create and load them from scratch. It can also edit an existing DSD and save it with a different id after making some changes.

4.3 Architecture Design

This section is intended for application developers, in particular for those who are interested in developing GUI applications with R.

From the design perspective, SMART is a standalone desktop application using a GUI in .NET on top of the R statistical processor. That is, on the front end C#.NET is used to build its user interface, and on the back end the R processor serves as its computational and reporting engine. Compared to pure R-based solutions, this hybrid design makes SMART available to a wider audience, specifically users who have fewer programming skills in R and depend on a GUI. Furthermore, it allows SMART to benefit from the powerful features and libraries of both languages. In the R engine, SMART employs the package foreign to interpret input datasets in SPSS, Stata and SAS format and the package data.table to achieve fast data manipulation. In the .NET framework, it uses the standard NuGet package SdmxSource (Eurostat, 2018) to read and write any SDMX artefacts.

Of course, the key to this hybrid design is the bridging, that is, being able to access the R runtime from .NET. We use the .NET interoperability library R.NET (Perraud and Abe, 2017) to achieve fast data exchange. The connection from .NET to R using R.NET is fairly straightforward (for the detailed configuration, the tutorial post by Perraud (2015) can be followed). In the initialization of the C# code, a single REngine object instance is retrieved, which will seek the R home path based on the system environment variables. If a valid version of R (32bit, < 3.4.1) has been found locally, REngine will trigger a hidden R console which can send and receive data. To interact with R, only the method Evaluate of the REngine instance is needed. For example, the following C# code defines x = 15 in the R runtime environment:

REngine engine = REngine.GetInstance();
engine.Evaluate("x <- 15");

The dependency on R, on the other hand, brings extra complexity to the application deployment. In order to run SMART, a proper version of R installed in advance becomes a prerequisite. Not only that, all the employed R packages (i.e., foreign and data.table) must be installed as well. This becomes quite cumbersome, especially for users who barely know R. To resolve this, we build a portable R together with the necessary packages and embed it inside the installation package of SMART. In this way, the pre-built portable R travels together with the deployment. By default, REngine searches for the R path in the system environment variables, but the portable R cannot register its path there. Therefore, we need to redirect the R path manually to the portable R folder, as in the following C# code:

REngine.SetEnvironmentVariables(rPath: rpathPortable + @"\bin\i386", rHome: rpathPortable);

5 Conclusions

This document provides a broad description of the package Rilostat for searching, rearranging, analyzing and downloading labour market data from the ILOSTAT database. It shows how R users can take advantage of all the functionalities that this software offers to the community to access labour-related information, by explaining all the built-in functions of the package, giving examples of data visualization with actual information and analyzing a set of indicators from the SDG collection.

This paper also presents the functionality of SMART, the Statistics Metadata-driven Analysis and Reporting Tool, a hybrid application of R (as the processor) and .NET (as the Graphical User Interface (GUI)) that can perform either aggregations or transformations of data or metadata from Stata, SPSS, CSV or SDMX-ML formats to generate reports in Excel, CSV or SDMX formats.

6 Acknowledgments

The authors are grateful to Rafael Diez de Medina for his encouragement to pursue this work. Special thanks go to David Bescond, the author and maintainer of the package Rilostat, Rosina Gammarano, Edgardo Greising, Steven Kapsos and Yves Perardel for their support and valuable comments.

7 Appendix

7.1 More about the time format option while getting data using Rilostat

The function get_ilostat() will return, by default, the variable 'time' in a raw time format. That is, a vector of characters with the following syntax:

• Yearly data: 'YYYY', where YYYY is the year.

• Quarterly data: 'YYYYQq', where YYYY is the year and q is the quarter (taking the value of the corresponding quarter between 1 and 4), e.g. 2017Q2.

• Monthly data: 'YYYYMmm', where YYYY is the year and mm is the month (taking the value of the corresponding month between 01 and 12), e.g. 2017M12.


However, users may find that this format is not appropriate when there is a need to interact with other functions or packages available in R, especially those created to perform data visualization or time-series analysis. For this reason, the function includes the option to change the format of the variable 'time' in order to return POSIXct dates (time_format = 'date'; e.g. 2017M12 equals 2017-12-01) or numeric dates (time_format = 'num'; e.g. 2017Q2 equals 2017.25).
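To make the three representations concrete, the following base R sketch converts raw time strings into a date and a numeric value. The helper parse_ilostat_time is hypothetical and is not part of Rilostat. It reproduces the quarterly example given above (2017Q2 becomes 2017.25); the numeric value used for monthly data (year plus (month - 1)/12) is an assumption of this sketch, and it returns Date objects for simplicity.

```r
# Hypothetical helper: parse 'YYYY', 'YYYYQq' and 'YYYYMmm' strings.
parse_ilostat_time <- function(x) {
  year <- as.integer(substr(x, 1, 4))
  kind <- toupper(substr(x, 5, 5))  # "" yearly, "Q" quarterly, "M" monthly
  sub  <- suppressWarnings(as.integer(substr(x, 6, nchar(x))))

  # First month of the period.
  month <- ifelse(kind == "Q", (sub - 1) * 3 + 1,
                  ifelse(kind == "M", sub, 1))

  # Numeric date; the monthly rule is an assumption of this sketch.
  num <- ifelse(kind == "Q", year + (sub - 1) / 4,
                ifelse(kind == "M", year + (sub - 1) / 12, year))

  data.frame(raw  = x,
             date = as.Date(sprintf("%04d-%02d-01", year, as.integer(month))),
             num  = num)
}

res <- parse_ilostat_time(c("2017", "2017Q2", "2017M12"))
print(res)
```

In practice the conversion is requested directly through the time_format argument; the helper only illustrates what the two alternative formats encode.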

7.2 More about the options available for caching data when using Rilostat

The function get_ilostat() stores cached data by default in 'rds' binary format in file.path(tempdir(), "ilostat"), so the information fetching process is faster. It is possible to choose the directory where the data will be saved, as well as the desired format of the file, by changing the default options in the arguments cache_dir and cache_format, respectively. The name of the stored file is the concatenation of: the 'segment' used to consult the database (either by 'indicator' or 'ref_area'), the 'id' of the table extracted, the type of the variables contained, the time format, and the date of the latest version of the dataset (taken from the latest version of the table of contents used). Finally, the option of quietly getting the information is also available by setting the argument back = FALSE.

7.3 R codes for graphs in section 2

• Figure 1: Evolution of the global employment distribution by occupation, 1991-2023 (ILO modelled estimates, November 2018)

library(Rilostat)
library(tidyverse)
library(viridis)
library(hrbrthemes)
library(scales)
library(stringr)

# -- Relative distribution

dat_emp1 <- get_ilostat(id = 'EMP_2EMP_SEX_OCU_DT_A',
                        segment = 'indicator',
                        type = "both",
                        time_format = "num",
                        filters = list(ref_area = 'X01',
                                       sex = 'SEX_T')) %>%
  filter(classif1 != 'OCU_DETAILS_TOTAL') %>%
  mutate(distribution = obs_value / 100) %>%
  select(time, classif1, distribution)

# Plot (with the relative distribution)

dat_emp1 %>%
  ggplot(aes(x = time,
             y = distribution,
             fill = classif1,
             color = classif1,
             text = classif1)) +
  geom_area() +
  scale_fill_viridis(discrete = TRUE) +
  scale_color_viridis(discrete = TRUE) +
  labs(x = "", y = "") +
  scale_y_continuous(breaks = pretty_breaks(n = 10),
                     labels = percent, expand = c(0.01, 0.01)) +
  scale_x_continuous(breaks = seq(1991, 2023, 2),
                     limits = c(1991, 2023), expand = c(0.01, 0.01)) +
  theme_ipsum() +
  theme(axis.text.x = element_text(size = 8),
        axis.text.y = element_text(size = 8),
        legend.position = "none") +
  annotate("text",
           x = 1992,
           y = c(0.989, 0.95, 0.89, 0.83, 0.75, 0.40, 0.12, 0.04),
           label = c("Managers",
                     "Professionals",
                     "Technicians and associate professionals",
                     "Clerical support workers",
                     "Service and sales workers",
                     "Craft and related trades workers",
                     "Plant and machine operators, and assemblers",
                     "Elementary occupations and skills agricultural, forestry and fishery workers"),
           hjust = 0, size = I(3)) +
  annotate("rect", xmin = 2018, xmax = 2023, ymin = 0, ymax = 1,
           alpha = 0.3, fill = "gray") +
  annotate("text", label = "Projections", x = 2019, y = 0.9, vjust = 1,
           hjust = 0, size = I(4)) +
  geom_vline(xintercept = 2018, colour = "red") +
  labs(caption = "Source: ilostat") +
  theme(plot.title = element_text(size = 12, face = "bold.italic"))

• Figure 2: Share of youth not in employment, education or training (NEET), 2017 (ILO modelled estimates, November 2018)

library(Rilostat)
library(tidyverse)
library(plotly)

X <- get_ilostat(id = 'EIP_2EET_SEX_RT_A', segment = 'indicator',
                 filters = list(time = '2018', sex = 'T')) %>%
  filter(str_sub(ref_area, 1, 1) != 'X') %>%
  select(ref_area, obs_value) %>%
  left_join(Rilostat:::ilostat_ref_area_mapping %>%
              select(ref_area, ref_area_plotly) %>%
              label_ilostat(code = 'ref_area'),
            by = "ref_area") %>%
  filter(!obs_value %in% NA) %>%
  mutate(tot_obs_value = cut(obs_value,
                             quantile(obs_value, na.rm = TRUE),
                             include.lowest = TRUE))

X %>% plot_geo(width = 900, height = 600) %>%
  add_trace(
    z = ~obs_value,
    color = ~obs_value,
    colors = c("green", "blue"),
    text = ~ref_area.label,
    locations = ~ref_area_plotly,
    marker = list(line = list(color = toRGB("grey"),
                              width = 0.5)),
    showscale = TRUE) %>%
  colorbar(title = '(%)', len = 0.5, ticksuffix = "%") %>%
  layout(
    font = list(size = 18),
    geo = list(showframe = TRUE,
               showcoastlines = TRUE,
               projection = list(type = 'Mercator'),
               showCountries = TRUE,
               resolution = 110),
    annotations =
      list(x = 1, y = 1,
           text = "Source: ilostat",
           showarrow = F, xref = 'paper', yref = 'paper',
           xanchor = 'right', yanchor = 'auto', xshift = 0, yshift = 0,
           font = list(size = 15, color = "blue"))
  )

• Figure 3: Distribution of the labour force participation rate by sex, 1990, 2000, 2010, 2018 and 2030 (ILO modelled estimates, July 2018)

library(Rilostat)
library(tidyverse)
library(scales)

dat_emp4 <- get_ilostat(id = 'EAP_2WAP_SEX_AGE_RT_A', segment = 'indicator',
                        filters = list(time = c('1990', '2000', '2010', '2018', '2030'),
                                       sex = c('F', 'M'),
                                       classif1 = 'AGGREGATE_TOTAL')) %>%
  filter(str_sub(ref_area, 1, 1) != 'X') %>%
  mutate(sex_lab = ifelse(sex == 'SEX_F', 'Women',
                   ifelse(sex == 'SEX_M', 'Men', NA))) %>%
  mutate(sex_year = ifelse((sex == 'SEX_F' & time == '2030'), 'Women - Projection',
                    ifelse((sex == 'SEX_M' & time == '2030'), 'Men - Projection',
                    ifelse((sex == 'SEX_F' & time != '2030'), 'Women - Estimate',
                    ifelse((sex == 'SEX_M' & time != '2030'), 'Men - Estimate', NA))))) %>%
  mutate(lfpr = obs_value / 100) %>%
  select(ref_area, sex_lab, sex_year, time, lfpr) %>%
  group_by(sex_year, time) %>%
  mutate(MD = median(lfpr)) %>%
  ungroup() %>%
  mutate(MD = as.character(MD))

ggplot(dat_emp4, aes(x = time, y = lfpr, fill = sex_year, alpha = sex_year)) +
  geom_boxplot() +
  facet_wrap(~sex_lab) +
  theme_bw() +
  theme(legend.position = "none") +
  scale_alpha_manual(values = c(0.7, 0.2, 0.7, 0.2)) +
  scale_fill_manual(values = c("yellow", "yellow", "turquoise4", "turquoise4")) +
  labs(x = "",
       y = "LFPR (%)",
       caption = "Source: ilostat") +
  scale_y_continuous(labels = percent)

| Variable label | Description of the variable | Unit | Year |
|---|---|---|---|
| Active variables | | | |
| SDG 0131 | Proportion of population covered by social protection floors/systems | Percentage | 2016 |
| SDG 0552 | Female share of employment in managerial positions | Percentage | 2017 |
| SDG 0821 | Annual growth rate of output per worker (measured as GDP in constant 2011 international $ in PPP) | Percentage | 2017 |
| SDG 0852 | Unemployment rate | Percentage | 2017 |
| SDG 0861 | Proportion of youth (aged 15-24 years) not in education, employment or training (NEET) | Percentage | 2017 |
| SDG N881 | Non-fatal occupational injuries per 100’000 workers | Frequency rate | 2015 |
| SDG F881 | Fatal occupational injuries per 100’000 workers | Frequency rate | 2015 |
| SDG 0922 | Manufacturing employment as a proportion of total employment | Percentage | 2017 |
| SDG 1041 | Labour income share as a percent of GDP | Percentage | 2015 |
| Supplementary variables | | | |
| SDG 0111 | Working poverty rate (percentage of employed living below US$1.90 PPP) | Percentage | 2017 |
| SDG B831 | Proportion of informal employment in non-agricultural employment - Harmonized series | Percentage | 2017 |

Table 2: Description of the variables


References

Escofier, B. and Pages, J. (2008). Analyses factorielles simples et multiples; objectifs, methodes et interpretation. Dunod Paris.

Eurostat (2018). Sdmxsource.net nuget package.

Gandrud, C. (2013). Reproducible Research with R and R Studio. Chapman & Hall/CRC.

Garnier, S., Ross, N., Rudis, B., Sciaini, M., and Scherer, C. (2018). Package ’Viridis’: Default Color Maps from ’matplotlib’.

Husson, F. and Josse, J. (2019). Package ’missMDA’: Handling Missing Values with Multivariate Data Analysis.

Husson, F., Josse, J., Le, S., Mazet, J., and Husson (2018). Package FactoMineR: Multivariate Exploratory Data Analysis and Data Mining.

ILO (2018). Decent work and the sustainable development goals: A guidebook on sdg labour market indicators. Department of Statistics.

Josse, J. and Husson, F. (2012). Handling missing values in exploratory multivariate data analysis methods. Journal de la Societe Francaise de Statistique, 153(2):79–99.

Kassambara, A. and Mundt, F. (2017). Package ’factoextra’: Extract and Visualize the Results of Multivariate Data Analyses.

Lahti, L., Huovari, J., Kainu, M., and Biecek, P. (2017). Retrieval and analysis of eurostat open data with the eurostat package. The R Journal, 9(1):385–392.

Lebart, L., Morineau, A., and Piron, M. (1995). Statistique exploratoire multidimensionnelle, volume 3. Dunod Paris.

Perraud, J.-M. (2015). Getting started with R.NET.

Perraud, J.-M. and Abe, K. (2017). R.net nuget package.

R Core Team (2019). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Sievert, C., Parmer, C., Hocking, T., Chamberlain, S., Ram, K., Corvellec, M., Despouy, P., and Inc., P. T. (2019). Package ’plotly’: Create Interactive Web Graphics via ’plotly.js’.

Wickham, H. (2018). Package ’scales’: Scale Functions for Visualization.

Wilke, C. O. (2018). Package ’ggridges’: Ridgeline Plots in ’ggplot2’.


Macroeconomic Statistical Forecasting for Engine Demand

Ankit Kamboj

Debojyoti Samadder

Ambica Rajagopal

Sarat Sindhu Mukhopadhyay


ABSTRACT

Forecasting demand is critical for running efficient operations in a manufacturing firm. Firms therefore plan their operations carefully and strive to improve their forecasting methods to gain an edge over competitors in the market. The purpose of this paper is to evaluate various shrinkage methods for data containing a large number of features. We focus on the Class 8 Group 2 North America Heavy Duty (NAHD) market and use macroeconomic indicators from the ACT Research economic database to forecast engine shipments a full 3 months out. Various pre-processing techniques are applied to all the variables, which are then decomposed into trend, seasonal and remainder components by Seasonal and Trend decomposition using Loess (STL). For each pre-processing technique the decomposition is analysed visually. The relative significance of the variance associated with each decomposed component is then used to select the appropriate pre-processing technique for each variable, so as to ensure stationarity and reliable forecasting accuracy. We applied several statistical and machine learning methods and combined them in an ensemble to minimise forecasting error. We also observed hardly any increase in accuracy when the number of features is increased beyond 15. The main R packages used in our analysis are forecast, forecastHybrid, tseries, readxl, xts, quantmod, e1071 and lars.

Keywords: Box-Cox Transformation, Stationarity, STL Decomposition, Least Angle Regression, Shrinkage, Lasso, Support Vector Regression, Hybrid Forecast


1. INTRODUCTION

Forecasting future values of an observed time series plays an important role in nearly all fields, such as economics, finance, meteorology and telecommunication. In manufacturing companies, a systematic demand forecasting framework leads to effective decision-making processes such as sales budgeting and production planning. Yet most firms still rely on subjective and intuitive judgments for product demand forecasts, which is one of the factors behind less reliable production planning. It is of great significance for manufacturing enterprises to predict product demand effectively. Firms adopting a structured forecasting framework have observed constructive impacts on operational performance, and they are devoting considerable research and innovation effort to achieving it.

Statistical time series models are of fundamental importance to various practical domains, and the subject has seen active research for many years. To attain higher forecasting accuracy, many statistical time series models have been suggested in the literature. The list of statistical forecasting methods starts with basic models such as exponential smoothing and its variants, Holt’s and Holt-Winters’ methods [10], followed by the Box-Jenkins methodology for ARIMA models [1]. Multivariate GARCH models were later made available [31], expanding the field of statistical models. A further approach creates forecasts from the lags of other macro indicators, which involves extensive use of regression and econometric models [11].

In the last two decades, machine learning (ML) models such as Support Vector Machines [41] and shrinkage methods, both ridge regression and the LASSO [42], have gained popularity for forecasting high-dimensional datasets and now compete seriously against statistical models. Further study of shrinkage methods showed that LARS [28] gives the most accurate results for selecting important variables in the model. ML and statistical methods differ in how they minimize the sum of squared errors to achieve higher forecast accuracy: ML methods use non-linear algorithms, whereas statistical ones depend on linear processes. ML methods, being at the junction of statistics and computer science, are computationally more demanding than statistical ones [34].

The research studies relevant to the work presented here have been thoroughly analysed. Firstly, because of the inherent trend and seasonality in most time series data, major emphasis is laid on different pre-processing techniques, since both statistical time series models and machine learning models require adequate pre-processing to remove the non-stationarity in the data [43]. To select the relevant pre-processing technique, the time series is decomposed into three components (trend, seasonality and remainder) and the relative variance of each component is analysed [2,14]. The transformation should also make the time series stationary, stabilizing both the mean and the variance [27]. A few univariate models, such as ARIMA [21,26] and the Error, Trend & Seasonal Model (ETS) [24], are then used to understand the predictive nature of the series without any covariates.

For multivariate forecasting, the lags of the macro-indicators and of the shipment series are used as predictors [9]. Since the number of predictors is higher than the number of samples, the LARS [28] shrinkage method is used to identify the top predictors. These important predictors are then used in multivariate forecasting with Dynamic Regression and Support Vector Regression models [3,20], and finally an ensemble of multivariate and univariate models is built. The experimental evaluation of forecasting error, in terms of Mean Absolute Percentage Error (MAPE), is presented in tabular format for each forecasting model and pre-processing transformation used.

A time series is simply a series of data observed over time. When the observation intervals are equally spaced, the series is called regularly spaced. In this paper we deal only with regularly spaced time series: the data are observed every month.

2. RELEVANT DEFINITIONS AND LITERATURE REVIEWS:

2.1 Box-Cox Transformation:

In many statistical analyses it is desirable that two assumptions hold: (a) the variables are normally distributed, and (b) the variance of a variable does not change across the values of the independent variables, i.e. the variable is homoscedastic [37]. If these assumptions are violated, a transformation needs to be applied [15]. Suppose the observations are $y_1, \ldots, y_n$ and the transformed observations are denoted by $y_1', \ldots, y_n'$; then primarily the following basic transformations may be used:

(i) Square root, $y_i' = \sqrt{y_i}$

(ii) Cube root, $y_i' = y_i^{1/3}$

(iii) Logarithm, $y_i' = \log(y_i)$

These basic transformations generally help to resolve violations of the above assumptions. Various efforts have been made to generalize them. Tukey (1957) initially proposed that a transformation can be thought of as a class or family of similar mathematical functions. Finally, the Box-Cox transformation (1964) [37] is given by the equation below:

$$y_i^{(\lambda)} = \begin{cases} \dfrac{y_i^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log(y_i), & \lambda = 0 \end{cases} \qquad (1)$$

The value of λ plays a key role: for λ = 1 the transformation reduces to the identity (up to a constant shift), for λ = 0 it is the logarithm, and intermediate values give something in between. An important task is to choose the appropriate value of λ. If λ is restricted to the interval [0, 1], the method of Guerrero (1993) may be used to choose it [18,38].
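As a minimal sketch of the transformation described above, the following base R functions apply the Box-Cox formula and its inverse for a given λ. The function names `box_cox` and `inv_box_cox` are illustrative and are not taken from the paper; the choice of λ itself (for example by Guerrero's method) is not shown.

```r
# Box-Cox transformation for strictly positive data (illustrative sketch).
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Inverse transformation, e.g. to bring forecasts back to the original scale.
inv_box_cox <- function(z, lambda) {
  if (abs(lambda) < 1e-8) exp(z) else (lambda * z + 1)^(1 / lambda)
}

y <- c(1, 2, 4, 8, 16)
box_cox(y, 1)     # equals y - 1: the identity up to a constant shift
box_cox(y, 0)     # equals log(y)
box_cox(y, 0.5)   # something in between
```

The inverse is needed later, when forecasts made on the transformed scale are returned to the original scale.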

2.2. Seasonal-Trend Decomposition:

In a time series with a seasonal component, STL (seasonal-trend decomposition based on loess) may be used to decompose the series into trend, seasonal and remainder components. If the data, the trend component, the seasonal component and the remainder component are denoted by $Y_u$, $T_u$, $S_u$ and $R_u$ respectively, for u = 1 to N, then [5]

$$Y_u = T_u + S_u + R_u \qquad (2)$$

To carry out loess, a neighbourhood is defined for every data point. Each point in the neighbourhood is given a weight (the neighbourhood weight) based on its distance from the data point in question. A polynomial, usually linear or quadratic, is then fitted to these weighted points, and its fitted value at the data point is the trend value. The steps of STL are: (1) detrending; (2) smoothing of the cycle-subseries: a subseries is formed for each seasonal position and each is smoothed separately; (3) low-pass filtering of the smoothed cycle-subseries: the subseries are recombined and smoothed; (4) detrending of the seasonal series; (5) de-seasonalizing of the original series, using the seasonal component obtained in the previous steps; (6) evaluation of the trend component by smoothing the de-seasonalized series [14].

The relative significance of the variance associated with each decomposed component can be identified by the ratio of the variance of that component to the variance of the original series [2]. As an example, for the remainder component:

$$\frac{\mathrm{Var}(R_u)}{\mathrm{Var}(Y_u)} \qquad (3)$$
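The decomposition and the variance ratios can be sketched with the `stl` function from base R. The built-in `nottem` series (monthly temperatures) is used purely as a stand-in, since the shipment data are not available here, and `s.window = "periodic"` is an assumed setting rather than the authors' choice.

```r
# Decompose a monthly series with STL and compute the variance ratios (sketch).
y    <- nottem                            # built-in monthly series (stand-in data)
fit  <- stl(y, s.window = "periodic")     # loess-based decomposition
comp <- fit$time.series                   # columns: seasonal, trend, remainder

# Ratio of each component's variance to the variance of the series.
var_ratio <- apply(comp, 2, var) / var(as.numeric(y))
round(var_ratio, 3)
```

The three components add back to the original series, which is a quick sanity check on any decomposition of this additive form.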


2.3. Least Absolute Shrinkage and Selection Operator:

Suppose we have a dependent variable and a collection of independent variables that might affect it. Ordinary least squares (OLS) estimates are obtained by minimizing the residual sum of squares. Regression with OLS estimates has two major problems: (i) OLS estimates often have low bias but very high variance; (ii) with a large set of independent variables, the model loses interpretability.

The technique that handles both of these shortcomings is the LASSO (least absolute shrinkage and selection operator). It reduces the variance of prediction at the cost of a small increase in bias, which as a result increases prediction accuracy. It also shrinks some of the coefficients and sets others exactly to zero, and in that way performs variable selection [42].

Suppose we have standardized predictors $x_{ij}$ and centered response values $y_i$, for $i = 1, \ldots, N$ and $j = 1, \ldots, p$. The LASSO regression problem is to find $\beta = (\beta_1, \ldots, \beta_p)$ which minimizes the following:

$$\sum_{i=1}^{N} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \qquad (4)$$

The LASSO thus uses an $L_1$ penalty, where λ is a shrinkage parameter [30].
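To make the shrinkage and selection behaviour concrete, here is a small base R sketch that minimizes the penalized criterion by cyclic coordinate descent with soft-thresholding. This is one common way to solve the LASSO problem and is not necessarily the solver used by the authors; the function names and the simulated data are illustrative assumptions.

```r
# Soft-thresholding operator used in each coordinate update.
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Cyclic coordinate descent for: sum((y - X b)^2) + lambda * sum(abs(b)).
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  X <- scale(X)                 # standardized predictors
  y <- y - mean(y)              # centered response
  beta <- rep(0, ncol(X))
  for (it in seq_len(n_iter)) {
    for (j in seq_len(ncol(X))) {
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]   # partial residual
      z   <- sum(X[, j] * r_j)
      beta[j] <- soft_threshold(z, lambda / 2) / sum(X[, j]^2)
    }
  }
  beta
}

set.seed(1)
X <- matrix(rnorm(50 * 3), 50, 3)          # simulated predictors
y <- 2 * X[, 1] + rnorm(50)                # only the first predictor matters

lambda_max <- 2 * max(abs(crossprod(scale(X), y - mean(y))))
beta_ols  <- lasso_cd(X, y, lambda = 0)                  # no penalty: OLS fit
beta_mid  <- lasso_cd(X, y, lambda = 0.5 * lambda_max)   # partial shrinkage
beta_zero <- lasso_cd(X, y, lambda = 1.01 * lambda_max)  # all coefficients zero
```

With no penalty the result coincides with least squares on the standardized predictors, and once λ exceeds `lambda_max` every coefficient is set to zero; intermediate values trace the path between the two.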

2.4. Least Angle Regression (LARS):

In many practical problems we have a large dataset at our disposal and the number of features of interest is also huge. In macroeconomic forecasting, for example, many time series variables are available. Each variable may be an indicator of some economic factor and hence a potentially important predictor. But if all the predictors enter the model, the prediction becomes less accurate because of the large variance of the estimates, and the model also becomes overly complex. In this situation we need to find the features that affect the forecast substantially and isolate them from the noise variables, which results in improved forecast accuracy [8].

Efron et al. (2004) presented a technique called Least Angle Regression (LARS), which can choose the most informative predictors and is inspired by forward stage-wise methods for selecting regression models. An advantage of the LARS algorithm is that it gives a ranking of all the predictors, which is helpful in many situations [25].

In our context we applied the moving-window cross-validation technique, which is widely used in this area, to determine the optimum number of features.


2.5 Hybrid Forecast:

The forecasting methods available in Hyndman’s forecast package in R can be combined into an ensemble with a separate package called forecastHybrid. The models that can be used in this package are ARIMA (auto.arima), Error, Trend and Seasonality (ets), the Theta model (thetam), a feed-forward neural network with a single hidden layer and lagged inputs (nnetar), the STL model (stlm), the TBATS model (tbats) and the seasonal naïve model (snaive). The package can combine the forecasts either with equal weights or with weights based on in-sample errors [6,13].

The advantage of an ensemble forecast is that it provides improved forecasting accuracy compared with the individual models [7].

2.6 Ljung-Box test:

The Ljung-Box test is an important test for checking the absence of autocorrelation up to a certain lag. The null hypothesis of this test is that the time-series model does not exhibit lack of fit; in other words, that the errors behave as white noise. The test statistic for time lag m is:

$$Q = n(n+2) \sum_{k=1}^{m} \frac{\hat{\rho}_k^2}{n-k} \qquad (5)$$

where $\hat{\rho}_k$ is the sample autocorrelation at lag k of a time series with n points. Under the null hypothesis, Q follows a central chi-square distribution [32].
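The statistic can be computed directly from the sample autocorrelations and checked against `Box.test` from base R. The helper name `ljung_box` and the simulated series are illustrative; the degrees of freedom are set to m, which assumes the test is applied to a raw series rather than to residuals of a fitted model.

```r
# Ljung-Box statistic computed by hand (sketch, for a series without missing values).
ljung_box <- function(x, m) {
  n   <- length(x)
  rho <- acf(x, lag.max = m, plot = FALSE)$acf[-1]     # drop lag 0
  q   <- n * (n + 2) * sum(rho^2 / (n - seq_len(m)))
  c(statistic = q, p.value = pchisq(q, df = m, lower.tail = FALSE))
}

set.seed(1)
x <- rnorm(120)                       # simulated white noise, 10 years of months
manual  <- ljung_box(x, m = 24)       # 24 lags = twice a seasonal period of 12
builtin <- Box.test(x, lag = 24, type = "Ljung-Box")
```

A large p-value means white noise cannot be rejected, which in this paper's workflow would disqualify a transformation as too strict.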

2.7 Augmented Dickey Fuller Test (ADF):

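The ADF test is the augmented version of the Dickey-Fuller (DF) test with p lags. In its general form the test is applied to the model

$$\Delta y_t = \alpha + \beta t + \gamma\, y_{t-1} + \sum_{i=1}^{p} \delta_i\, \Delta y_{t-i} + \varepsilon_t$$

with the null hypothesis that the data are non-stationary ($\gamma = 0$, i.e. a unit root is present). If the test statistic is less than the critical value, or the p-value is less than 0.05, the null hypothesis is rejected and no unit root is present [27].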

2.8 Z-score Normalization:

Z-score normalisation is used when the maximum and minimum values of the time series are unknown and the series is stationary. If X is a variable taking values $x_1, \ldots, x_n$, then the normalized variable X’ takes values $x_1', \ldots, x_n'$, where

$$x_i' = \frac{x_i - \bar{x}}{s} \qquad (6)$$

with $\bar{x}$ and $s$ the mean and standard deviation of X, so that X’ has mean 0 and unit variance.

The main drawback of this method is that it cannot deal with non-stationary data, because the mean and variance of such a time series change over time [35].

2.9 Autoregressive Integrated Moving Average (ARIMA):

A seasonal ARIMA model includes six terms, p, d, q and P, D, Q, where p, d, q represent the non-seasonal part of the model and P, D, Q the seasonal part. Here p and P are the autoregressive orders, q and Q are the moving-average orders, and d and D are the orders of differencing. We used the forecast library in R, which can select the best p, d, q, P, D, Q using the AIC values of the candidate models [21,26].
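Order selection by AIC can be sketched with the base R `arima` function and a small grid search. This only illustrates the idea behind automatic selection: it is not the search procedure of the forecast package, and the data (`AirPassengers`), the fixed seasonal part and the grid limits are assumptions made for the example.

```r
# Choose non-seasonal orders p and q by AIC over a small grid (sketch).
y <- log(AirPassengers)                  # built-in monthly series (stand-in data)

grid <- expand.grid(p = 0:2, q = 0:2)
grid$aic <- NA_real_
for (i in seq_len(nrow(grid))) {
  fit <- tryCatch(
    arima(y, order = c(grid$p[i], 1, grid$q[i]),
          seasonal = list(order = c(0, 1, 1), period = 12)),
    error = function(e) NULL)            # skip orders that fail to fit
  if (!is.null(fit)) grid$aic[i] <- AIC(fit)
}

best <- grid[which.min(grid$aic), ]      # smallest AIC wins
best
```

A full search would also vary d, P, D and Q; they are fixed here only to keep the example short.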

2.10 Error, Trend & Seasonal Model (ETS):

Using exponential smoothing, ETS decomposes a time series into error, trend (additive or multiplicative) and seasonal components. ETS models are estimated either by minimising the sum of squared errors or by maximizing the likelihood, subject to the smoothing parameters lying between 0 and 1 [24]. Among the different combinations of ETS models, the best model is chosen by Akaike’s Information Criterion (AIC) or the Bayesian Information Criterion (BIC): the smaller the AIC or BIC, the better the model [29].
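The estimation principle can be illustrated with the simplest member of the family, simple exponential smoothing with a single smoothing parameter. The sketch below estimates that parameter by minimising the sum of squared one-step errors over the interval (0, 1). It is a base R illustration on the built-in `Nile` series, with the level initialised at the first observation as an assumption; it is not the `ets` function of the forecast package.

```r
# Sum of squared one-step-ahead errors for simple exponential smoothing.
ses_sse <- function(alpha, y) {
  level <- y[1]                      # assumed initial level: first observation
  sse <- 0
  for (t in 2:length(y)) {
    err   <- y[t] - level            # one-step-ahead forecast error
    sse   <- sse + err^2
    level <- level + alpha * err     # exponential smoothing update
  }
  sse
}

y   <- as.numeric(Nile)              # built-in annual series (stand-in data)
opt <- optimize(ses_sse, interval = c(0, 1), y = y)
alpha_hat <- opt$minimum             # estimated smoothing parameter
```

Setting the parameter to 1 reproduces the naive forecast, so the optimised fit should never have a larger sum of squared errors than that benchmark.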

2.11 Support Vector Regression(SVR):

SVR is a modified version of the support vector classification problem in which the model returns a continuous value as output, which makes it a regression problem. For a given tolerance level, SVR attempts to find the narrowest tube centered around the regression surface [3].

2.12 Dynamic Regression:

Regression models usually assume that the error term is uncorrelated, but there are scenarios where the errors from the regression are allowed to be autocorrelated, under the assumption that they follow an ARIMA process. If all the variables are stationary, only ARMA errors need to be considered for the residuals [20]. Hence in Dynamic Regression we fit a regression model with ARIMA errors. For example, if $y_t$ is a linear function of the k predictor variables ($x_{1,t}, \ldots, x_{k,t}$), $\eta_t$ is the error term, which follows an ARIMA(1,1,1) model, $\varepsilon_t$ is white noise and B is the backward shift operator, we can write:

$$y_t = \beta_0 + \beta_1 x_{1,t} + \cdots + \beta_k x_{k,t} + \eta_t, \qquad (1 - \phi_1 B)(1 - B)\,\eta_t = (1 + \theta_1 B)\,\varepsilon_t \qquad (7)$$

where $\phi_1$ and $\theta_1$ are the first-order coefficients of the AR (autoregressive) and MA (moving average) parts respectively.
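A regression with autocorrelated errors can be fitted in base R by passing the predictors to `arima` through its `xreg` argument. The sketch below uses simulated, stationary data with AR(1) errors, so only ARMA-type errors are needed; the sample size, coefficients and variable names are assumptions chosen for illustration.

```r
# Regression with AR(1) errors on simulated data (sketch).
set.seed(42)
n   <- 200
x   <- rnorm(n)                                    # one simulated predictor
eta <- arima.sim(model = list(ar = 0.6), n = n)    # autocorrelated error term
y   <- 1 + 2 * x + eta                             # true slope is 2

fit <- arima(y, order = c(1, 0, 0), xreg = cbind(x = x))
coef(fit)                                          # ar1, intercept, x
```

The fitted slope on `x` and the AR coefficient should land close to the simulated values of 2 and 0.6.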

2.13 First Differenced Series:

The properties of a stationary time series do not depend on the time at which the series is observed. Time series with trend or seasonality are therefore not stationary, as the trend or seasonality affects the value of the series at each instant; a white noise process, intuitively, is stationary [19]. For a non-stationary process, differencing may be a way to handle the situation, since it helps to stabilize the mean. The differenced series is given by

$$y_t' = y_t - y_{t-1} \qquad (8)$$

The issue with this transformation is that the transformed series has one observation fewer than the original series, as no difference can be computed for the first observation [22]. This is called the first difference, and in most cases first differencing alone is enough to obtain stationarity. Sometimes higher-order differencing is necessary to obtain stationarity [36].
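A short base R illustration of first and second differencing, and of undoing a first difference, is given below; the numbers are made up for the example.

```r
# First and second differences of a trending series (toy numbers).
y   <- c(10, 12, 15, 19, 24)
dy  <- diff(y)                       # y[t] - y[t-1]; one observation is lost
d2y <- diff(y, differences = 2)      # second-order difference

# Undo the first difference, given the first original value.
y_back <- diffinv(dy, xi = y[1])
```

The inverse step is what allows a forecast of the differenced series to be converted back to the original scale.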

3. METHODOLOGY:

The purpose of this study is twofold. Our first objective is to carry out descriptive analyses of the relationships between the shipment demand for various engines and the NAHD (North America Heavy Duty) macroeconomic indicators. Our second objective is to capitalize on the knowledge gained through these analyses by developing multivariate models that can generate forecasts of engine shipments, i.e. to build a demand forecasting model for engine shipments that incorporates all key demand planning inputs and macroeconomic indicators to produce full 3-month-out forecasts for the North America Heavy Duty market.

3.1 Modelling Framework:

In this paper we worked on data from January 2011 to May 2019. Both the shipment data and the macroeconomic indicators are monthly, but at any current date the values of the past two months are not yet available. Hence, to forecast 3 months out from the current date, we need to forecast 5 points ahead from the last date of data availability (Figure 1). The observations used to build forecasts form the training set and the remaining observations form the test set. With limited data at hand, due care was taken not to overfit the ML models, and we used ML models with minimal parameter tuning. Overfitting shows up when a very complicated model fits the training set well but does not produce reliable forecasts on the test set [39].

Forecasting Timelines

Figure 1

To obtain reliable forecast accuracy, time series cross-validation is performed [16], since conclusions on accuracy drawn from a relatively small test set might not be reliable for future periods. In time series cross-validation we have a series of training and test sets, where each test set consists of five observations and the fifth observation is taken as the 3-months-out forecast value. For each test set there is a corresponding training set containing all the observations prior to the test set. Hence the model is tested on data previously unknown to it when computing the multi-step errors. This is illustrated in Figure 2, where in each row the blue dots show the training set and the red dot shows the test set. Each training set contains just one more observation than the previous training set, which yields many more test sets. Finally, the average error over all the test sets represents the overall forecasting performance of the model.


Time Series Cross Validation

Figure2
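The rolling-origin scheme can be sketched in base R as follows. The forecasting rule is a seasonal naive placeholder, standing in for whichever model is being evaluated, and the names `snaive_fn`, `rolling_mape` and the minimum training length of 36 months are assumptions made for the example. Only the fifth step of each test window is scored, mirroring the 3-months-out target.

```r
# Placeholder forecaster: repeats the value observed one season earlier.
snaive_fn <- function(train, h, period = 12) {
  n <- length(train)
  train[n - period + ((seq_len(h) - 1) %% period) + 1]
}

# Rolling-origin evaluation of the h-step-ahead forecast, scored by MAPE.
rolling_mape <- function(y, forecast_fn, h = 5, min_train = 36) {
  origins <- min_train:(length(y) - h)        # last index of each training set
  ape <- sapply(origins, function(o) {
    fc     <- forecast_fn(y[1:o], h)          # fit on data up to the origin only
    actual <- y[o + h]                        # the h-th observation of the test set
    abs((actual - fc[h]) / actual)
  })
  list(n_folds = length(origins), mape = 100 * mean(ape))
}

y   <- as.numeric(AirPassengers)              # built-in monthly series (stand-in)
res <- rolling_mape(y, snaive_fn)
```

Each fold extends the training set by one month, so a series of 144 months with a 36-month minimum yields 104 scored forecasts.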

3.2 Data Pre-processing:

Most economic time series contain seasonal and trend variations. Pre-processing of the data is a major factor affecting forecasting accuracy, as the time series often contain a stochastic trend along with seasonal variations. For machine learning models, as for time series models, it is important to remove the non-stationarity in the variables before building any forecasting model [43]. If machine learning models are used without adequate pre-processing of the data, the forecasting output might be unstable, with suboptimal results [43].

We tested four forms of pre-processing for each time series: (i) the original series; (ii) the Box-Cox transformed series, to achieve stationarity in variance; (iii) the first differenced series, which helps to stabilize the mean; and (iv) the Box-Cox transformed and then first differenced series, which helps to stabilize both mean and variance. An appropriate transformation must be chosen to make the time series stationary. After transformation, we used the Augmented Dickey-Fuller test to check whether the transformation produces a stationary series [33]. We observed that both the shipment series and the predictors become stationary under one or more of the above transformations.

Where more than one transformation yields a stationary series, we visualized the candidate transformations after decomposing the series by STL (seasonal-trend decomposition based on loess) and then examined the variance associated with each decomposed component [2]. For a particular time series, the ratio of the variance explained by each decomposed component (trend, seasonal and remainder) to the variance of the transformed series is analysed. The four transformations are then sorted in decreasing order of the ratio of variance explained by the remainder component, to make sure that the time series has no inherent trend and seasonal components. To check further that the transformations are not so strict as to turn the time series into white noise, the Ljung-Box test is used with a number of lags equal to twice the seasonal period, i.e. 24 lags for a monthly series with a seasonality of 12 [17].

Since white noise is not linearly forecastable, we did not use transformations that convert the time series into white noise [12]. Finally, the transformation chosen is the one whose remainder explains most of the variation and whose output is stationary but not statistically white noise. The appropriate transformation for each predictor is saved for later use when building the forecasting model and is referred to as the Optimized transformation. All the time series are then Z-score normalized using their mean and standard deviation.
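The comparison of candidate transformations can be sketched in base R as below. The built-in `AirPassengers` series stands in for a real indicator, the Box-Cox parameter is fixed at 0 (the logarithm) as an assumption, and the stationarity check by the Augmented Dickey-Fuller test is left out because it needs an add-on package.

```r
# Compare four candidate pre-processings of one monthly series (sketch).
y <- AirPassengers                          # stand-in for a monthly indicator
candidates <- list(
  original    = y,
  boxcox      = log(y),                     # Box-Cox with lambda = 0 (assumed)
  diff        = diff(y),
  boxcox_diff = diff(log(y))
)

summary_tab <- t(sapply(candidates, function(s) {
  comp <- stl(s, s.window = "periodic")$time.series
  c(remainder_ratio = var(comp[, "remainder"]) / var(as.numeric(s)),
    lb_pvalue = Box.test(s, lag = 24, type = "Ljung-Box")$p.value)
}))

# Sort by the share of variance left in the remainder, largest first.
summary_tab[order(-summary_tab[, "remainder_ratio"]), ]
```

A candidate with a high remainder ratio and a small Ljung-Box p-value has little trend or seasonality left but is still distinguishable from white noise, which is the profile the selection rule above looks for.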

3.3 Setting up the forecasting problem

The forecasting framework is set up using lags of the macro-indicators and of the shipment series to form an autoregressive distributed lag model [9]. Since we need to forecast 3 months out, i.e. 5 steps ahead from the last date of data availability, we took predictors with lags greater than or equal to 5 for both the macroeconomic indicators and the shipment series. Moreover, four different lags of each series enter the autoregressive distributed lag equation to account for multivariate interaction in the model (Figure 3).

$$y_t = \beta_0 + \sum_{l=5}^{8} \alpha_l\, y_{t-l} + \sum_{j=1}^{k} \sum_{l=5}^{8} \beta_{j,l}\, x_{j,t-l} + \varepsilon_t \qquad (9)$$

where $y_t$ is the shipment series at time t, $x_{1,t}, \ldots, x_{k,t}$ are the macroeconomic indicators, $\varepsilon_t$ is a random disturbance and $t-5, \ldots, t-8$ index the different lags of a specific series with present time t.

In the autoregressive distributed lag equation we formed 204 predictors, as there are four different lags for each of the 51 initial predictors.
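The construction of the lagged predictors can be sketched in base R as follows. The helper name `make_lags` and the toy data frame are illustrative; with 51 series and four lags each, the same construction gives 51 x 4 = 204 columns.

```r
# Build lag-5 to lag-8 copies of every column of a data frame (sketch).
make_lags <- function(data, lags = 5:8) {
  n <- nrow(data)
  out <- lapply(names(data), function(v) {
    m <- sapply(lags, function(l) c(rep(NA, l), data[[v]][seq_len(n - l)]))
    colnames(m) <- paste0(v, "_lag", lags)
    m
  })
  as.data.frame(do.call(cbind, out))
}

set.seed(7)
toy <- data.frame(shipment = rnorm(24), ind1 = rnorm(24), ind2 = rnorm(24))
X <- make_lags(toy)
dim(X)          # 24 rows; 3 series x 4 lags = 12 columns
```

The first rows contain NA wherever a lagged value does not exist yet, so the first month with a complete set of predictors is month 9.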


Forecasting Framework

Figure 3

[Figure: layout of the lagged design matrix. Columns: Month, then Lag 5 to Lag 8 of the shipment series y and of each predictor. The rows for the first months contain NA where lagged values are not yet available, and the rows t to t+5 mark the forecast period.]

3.4 Identification of significant Lags

Turning to our first objective, the descriptive analysis of the relationships between the shipment demand and the macroeconomic indicators, we take seven years of monthly data, from January 2011 to December 2017, and look for predictors with appropriate lags that provide better prediction of the shipment series.

After pre-processing both the shipment series and all the predictors, we need to find the top predictors and their respective lags. For situations where the count of predictors is significantly higher than the number of samples, George Box coined the term Effect Sparsity [44] to describe that only a small fraction of the features affects the response and most of the features have zero effect [40]. Using the cross-validation technique described above, the LARS [28] shrinkage method is used to identify the top predictors, as the number of predictors is large relative to the sample size. Several experiments were performed taking only the top n features from the LARS model (n from 1 to 204), and their average cross-validation errors in terms of MAPE (Mean Absolute Percentage Error) were analysed. The cross-validation error is at its minimum when the model takes only the top 15 features as its predictors. Hence the subsequent forecasting models are built using only these top 15 predictors.

3.5 Building up the forecasting model

After obtaining the top predictors, various univariate and multivariate forecasting models are implemented to produce the 3-months-out forecast of the shipment series. The five models used in the analysis are “Arima”, “Error, Trend & Seasonal Model”, “Support Vector Regression”, “Dynamic Regression” and “Hybrid Forecast”. Among these five models, Support Vector Regression and Dynamic Regression are multivariate models, whereas Arima and Error, Trend & Seasonal Model are univariate models. The Hybrid Forecast model is a combination of multivariate and univariate models (“Dynamic Regression”, “Error, Trend & Seasonal Model” and “Arima”) and is implemented using the forecastHybrid package in R. In addition, an ensemble of the five models is implemented by giving a weight of 30% to each of the three multivariate models (Support Vector Regression, Dynamic Regression and Hybrid Forecast) and a weight of 5% to each of the two univariate models. These weights were chosen after analysing different combinations of weights, and they can be further fine-tuned for optimal performance.

Since we are forecasting the 5-points-ahead value using time series cross-validation, during each iteration both the shipment series and the predictors are transformed according to the relevant pre-processing technique identified during the data pre-processing step. We have kept the time span from May 2018 to May 2019 as the out-of-sample period, i.e. the first forecasted value is for May 2018, and the values for the subsequent months are forecasted by time series cross-validation as described in the modelling framework. The corresponding inverse transformation is then applied to the forecasted value of the shipment series to bring it back to the original scale, which is then used in computing the accuracy measures.
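The weighted ensemble described in this section can be sketched as below. The weights are the ones stated in the text; the function name and the example forecasts are hypothetical, and the sketch assumes every model supplies point forecasts on the original scale for the same horizon steps.

```python
# Weights from the text: 30% for each of the three multivariate models,
# 5% for each of the two univariate models.
ENSEMBLE_WEIGHTS = {
    "Support Vector Regression": 0.30,
    "Dynamic Regression": 0.30,
    "Hybrid Forecast": 0.30,
    "Arima": 0.05,
    "Error, Trend & Seasonal Model": 0.05,
}


def ensemble_forecast(forecasts, weights=ENSEMBLE_WEIGHTS):
    """Combine per-model point forecasts into one weighted average per step."""
    if set(forecasts) != set(weights):
        raise ValueError("forecasts and weights must cover the same models")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    lengths = {len(values) for values in forecasts.values()}
    if len(lengths) != 1:
        raise ValueError("all models must forecast the same number of steps")
    horizon = lengths.pop()
    return [
        sum(weights[model] * forecasts[model][step] for model in weights)
        for step in range(horizon)
    ]


# Hypothetical forecasts, for illustration only (not values from the paper).
example_forecasts = {
    "Support Vector Regression": [100.0, 120.0],
    "Dynamic Regression": [100.0, 120.0],
    "Hybrid Forecast": [100.0, 120.0],
    "Arima": [200.0, 120.0],
    "Error, Trend & Seasonal Model": [200.0, 120.0],
}
combined = ensemble_forecast(example_forecasts)
print(combined)
```

A fixed-weight average keeps the combination transparent; tuning the weights, as the text suggests, would only change the dictionary.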

4. RESULTS:

In time series cross-validation, each training set consists of just one more observation than the previous training set, so we obtain many more test sets on which to compute the errors. The average error over all the test sets then represents the overall forecasting performance of the model. Econometricians often call this concept “forecast evaluation on a rolling origin” [16]. The forecast origin is the time at the end of the training data, and it rolls forward in time.
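The rolling-origin scheme can be sketched as a generator of training and test indices. This is an illustration only: the function name, the size of the first training set and the series length are assumptions, not details given in the paper.

```python
def rolling_origin_splits(n_obs, initial_train, horizon):
    """Yield (train_indices, test_index) pairs for rolling-origin evaluation.

    Each training set holds one more observation than the previous one, and
    the test point lies `horizon` steps after the end of the training data.
    """
    for origin in range(initial_train, n_obs - horizon + 1):
        train_indices = list(range(origin))   # observations 0 .. origin-1
        test_index = origin + horizon - 1     # h-step-ahead target
        yield train_indices, test_index


# Illustrative sizes: 10 observations, first training set of 3, 5 steps ahead.
splits = list(rolling_origin_splits(n_obs=10, initial_train=3, horizon=5))
for train_indices, test_index in splits:
    print(len(train_indices), test_index)
```

The error of each split is computed on its single test point, and the errors are then averaged over all splits.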

Here we need a 5-step-ahead forecast horizon, hence all the cross-validated error measures are computed for that horizon. Forecast errors are the differences between the actual test set observations and the point forecasts, and they are different from residuals: residuals are computed on the training set, while forecast errors are computed on the test set. For an observation $y_t$ in the test set and its corresponding forecasted value $\hat{y}_t$, the forecast error is given by:

$$e_t = y_t - \hat{y}_t$$

We compute the accuracy of our method using the forecast errors calculated on the test data. There are a number of ways to compute the forecast accuracy: we can take the average absolute error, the average squared error, the average percentage error or the average absolute percentage error.


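Accuracy Measures

Table 1

| Accuracy Measure | Formula |
|---|---|
| Mean Absolute Error (MAE) | $\operatorname{mean}(|e_t|)$ |
| Mean Squared Error (MSE) | $\operatorname{mean}(e_t^2)$ |
| Mean Percentage Error (MPE) | $\operatorname{mean}(p_t)$ |
| Mean Absolute Percentage Error (MAPE) | $\operatorname{mean}(|p_t|)$ |

Here $e_t$ is the forecast error (actual observation minus point forecast) and $p_t = 100\, e_t / y_t$ is the percentage error relative to the actual observation $y_t$.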

The MAE and MSE depend on the scale of the data, whereas the MPE and MAPE are more robust in this respect: they only require all the data to be positive, with no zeros or small values, and they assume that there is a natural zero [23]. Since the shipment series used as the dependent variable is positive and has an absolute zero, and in accordance with the business requirement, MAPE is used to compare the accuracy of the different forecasting models.
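The four accuracy measures can be computed as in the sketch below. It assumes the forecast error is defined as actual minus forecast and the percentage error as 100 times the error divided by the actual value; the function name and the example numbers are illustrative only.

```python
def accuracy_measures(actual, forecast):
    """Return MAE, MSE, MPE and MAPE for paired actual and forecast values."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and equally long")
    if any(y == 0 for y in actual):
        raise ValueError("percentage errors are undefined for zero actual values")
    errors = [y - f for y, f in zip(actual, forecast)]           # e_t
    pct_errors = [100.0 * e / y for e, y in zip(errors, actual)]  # p_t
    n = len(errors)
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": sum(e * e for e in errors) / n,
        "MPE": sum(pct_errors) / n,
        "MAPE": sum(abs(p) for p in pct_errors) / n,
    }


# Illustrative values only (not shipment data from the paper).
measures = accuracy_measures(actual=[100.0, 200.0, 400.0],
                             forecast=[110.0, 180.0, 400.0])
print(measures)
```

In this example the positive and negative percentage errors cancel in the MPE while the MAPE stays positive, which is one reason the absolute version is the usual choice for comparing models.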

We have compared the average MAPE value over the out-of-sample period from May 2018 to May 2019. The comparison is made among the five models (“Arima”, “Error, Trend & Seasonal Model”, “Support Vector Regression”, “Dynamic Regression” and “Hybrid Forecast”) and among five different transformations applied to all the predictors: Original Series, Box-Cox Transformed Series, First Differenced Series, Box-Cox Transformed Then First Differenced Series, and the Optimized Transformed Series obtained from the data pre-processing step. The transformations are applied to both the shipment series and the predictors.


MAPE (%) Comparison:

Table 2

| Model | Original Series | Box-Cox Transformed Series | First Differenced Series | Box-Cox Transformed Then First Differenced Series | Optimized Transformed Series |
|---|---|---|---|---|---|
| Arima | 21.66 | 20.33 | 21.05 | 20.54 | 20.14 |
| Error, Trend & Seasonal Model | 19.54 | 19.04 | 19.34 | 17.61 | 18.44 |
| Support Vector Regression | 13.31 | 12.57 | 15.09 | 15.85 | 11.23 |
| Lagged Regression | 12.36 | 11.53 | 16.41 | 16.13 | 10.48 |
| Hybrid Forecast | 12.95 | 12.28 | 15.61 | 16.12 | 11.04 |
| Ensemble Model | 11.96 | 11.17 | 14.48 | 15.07 | 10.21 |

MAPE (%) Comparison

Figure 4

[Figure: MAPE (%) by Transformation and Forecasting Model]

5. CONCLUSIONS:

In summary, we have proposed a full framework for demand forecasting using a combination of statistical and machine learning methods. We have observed that data pre-processing and the selection of significant features with their respective lags are very important in a multivariate forecasting framework. Moreover, we have noticed that the transformation whose remainder from the STL decomposition explains most of the variation, and which is stationary but not statistically white noise, gave the best performance in terms of average out-of-sample MAPE (Figure 4). Also, the three multivariate models (“Hybrid Forecast”, “Dynamic Regression” and “Support Vector Regression”) performed better than the univariate models, implying that the ACT macroeconomic indicators have predictive power for the shipment series. Additionally, the ensemble model gave the best accuracy metrics.

Future work could use more optimized weights in the ensemble model. Other internal indicators can also be used along with the macroeconomic indicators to improve the forecast accuracy. The proposed forecasting framework can be extended to any industry for forecasting demand or sales, in support of better financial and supply chain planning.

References:

1. Anderson, O.D., 1977, The Box-Jenkins approach to time series analysis. RAIRO - Operations Research - Recherche Opérationnelle, Volume 11 (1977) no. 1, p. 3-29. http://www.numdam.org/item/?id=RO_1977__11_1_3_0

2. Lafare, Antoine E. A. and Peach, Denis W., 2015, Use of seasonal trend decomposition to understand groundwater behaviour in the Permo-Triassic Sandstone aquifer, Eden Valley, UK. Hydrogeology Journal, 24 (1). 141-158. http://nora.nerc.ac.uk/id/eprint/512086/1/art%253A10.1007%252Fs10040-015-1309-3.pdf

3. Awad, M., Khanna, R., 2015, Support Vector Regression. In: Efficient Learning Machines. Apress, Berkeley, CA. https://link.springer.com/content/pdf/10.1007%2F978-1-4302-5990-9_4.pdf

4. Box, G., Meyer, D., 1986, An Analysis for Unreplicated Fractional Factorials. Technometrics Vol. 28, No. 1, pp 11-18.

5. Cleveland, R. B., Cleveland, W. S., McRae, J. E., and Terpenning, I., 1990, STL: A seasonal-trend decomposition procedure based on loess. Journal of Official Statistics, 6(1):3–73. http://www.nniiem.ru/file/news/2016/stl-statistical-model.pdf

6. Cran-R-Project: forecastHybrid, VIGNETTES. https://cran.r-project.org/web/packages/forecastHybrid/vignettes/forecastHybrid.html [Accessed 20 June 2019]

7. Ellis, Peter, 2016, Error, trend, seasonality - ets and its forecast model friends. http://freerangestats.info/blog/2016/11/27/ets-friends [Accessed 20 June 2019]

8. Gelper, Sarah and Croux, Christophe, 2008, Least angle regression for time series forecasting with many predictors. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.516.6505&rep=rep1&type=pdf

9. Giles, Dave, 2013, ARDL Models - Part I. Econometrics Beat. University of Victoria, Canada. https://davegiles.blogspot.com/2013/03/ardl-models-part-i.html [Accessed 20 June 2019]

10. Holt, C.C., 1957, Forecasting trends and seasonals by exponentially weighted moving averages. Carnegie Institute of Technology, Pittsburgh ONR memorandum no. 52.

11. Howrey, E. P., 1980, The Role of Time Series Analysis in Econometric Model Evaluation. http://www.nber.org/chapters/c11706

12. Hurvich, Clifford, Chapter 3: Forecasting from Time Series Models. Forecasting Handouts, NYU Stern School of Business, New York. http://people.stern.nyu.edu/churvich/Forecasting/Handouts/Chapt3.1.pdf [Accessed 20 June 2019]


13. Hyndman, R.J., Gooijer, Jan G. De, 2006, 25 Years of Time Series Forecasting. https://robjhyndman.com/papers/ijf25.pdf

14. Hyndman, R.J., 2012, Measuring time series characteristics. https://robjhyndman.com/hyndsight/tscharacteristics [Accessed 20 June 2019]

15. Hyndman, R.J., 2014a, Forecasting using R. https://robjhyndman.com/talks/RevolutionR/7-Transformations.pdf [Accessed 20 June 2019]

16. Hyndman, R.J., 2014b, Measuring forecast accuracy. https://pdfs.semanticscholar.org/af71/3d815a7caba8dff7248ecea05a5956b2a487.pdf

17. Hyndman, R.J., 2014c, Thoughts on the Ljung-Box test. https://robjhyndman.com/hyndsight/ljung-box-test/ [Accessed 20 June 2019]

18. Hyndman, R.J. and Bergmeir, Christoph, 2015, Bagging Exponential Smoothing Methods using STL Decomposition and Box-Cox Transformation, International Journal of Forecasting, Volume 32, Issue 2, April–June 2016, Pages 303-312. https://robjhyndman.com/papers/BaggedETSForIJF_rev1.pdf

19. Hyndman, R.J., 2016a, Stationarity and differencing, Otexts, Forecasting: Principles and Practice, Monash University, Australia. https://www.otexts.org/fpp/8/1 [Accessed 20 June 2019]

20. Hyndman, R.J., 2016b, Dynamic regression models, Otexts, Forecasting: Principles and Practice, Monash University, Australia. https://otexts.com/fpp2/dynamic.html [Accessed 20 June 2019]

21. Hyndman, R.J., 2016c, Seasonal ARIMA models, Otexts, Forecasting: Principles and Practice, Monash University, Australia. https://otexts.com/fpp2/seasonal-arima.html [Accessed 20 June 2019]

22. Hyndman, R.J., 2016d, Forecasting: principles and practice. https://robjhyndman.com/uwa2017/2-3-Differencing.pdf [Accessed 20 June 2019]

23. Hyndman, R.J., 2016e, Chapter 3.4: Evaluating forecast accuracy. Otexts, Forecasting: Principles and Practice, Monash University, Australia. https://otexts.org/fpp2/accuracy.html [Accessed 20 June 2019]

24. Hyndman, R.J., 2016f, Estimation and model selection, Otexts, Forecasting: Principles and Practice, Monash University, Australia. https://otexts.com/fpp2/estimation-and-model-selection.html [Accessed 20 June 2019]

25. Hyndman, R.J., Jiang, B., Athanasopoulos, George, 2018, Macroeconomic forecasting for Australia using a large number of predictors. https://robjhyndman.com/papers/ausmacrofcastR1.pdf

26. Hyndman, R.J., 2019, Package ‘forecast’, Version: 8.7, Title: Forecasting Functions for Time Series and Linear Models. https://cran.r-project.org/web/packages/forecast/forecast.pdf [Accessed 20 June 2019]

27. Imam, Akeyede, Habiba, D., and Atanda, B.T., 2016, “On Consistency of Tests for Stationarity in Autoregressive and Moving Average Models of Different Orders.” https://pdfs.semanticscholar.org/f128/d0d72f70d0a94ecf329a9363fc1ef0abfd9e.pdf

28. Iturbide, Eric, 2013, A Comparison between LARS and LASSO for Initialising the Time-Series Forecasting Auto-Regressive Equations. Procedia Technology 7 (2013) 282-288. https://www.sciencedirect.com/science/article/pii/S2212017313000364

29. Jofipasi, Chesilia A., Miftahuddin and Hizir, 2017, Selection for the best ETS (error, trend, seasonal) model to forecast weather in the Aceh Besar District, IOP Conference Series: Materials Science and Engineering, Volume 352, conference 1. https://iopscience.iop.org/article/10.1088/1757-899X/352/1/012055/pdf

30. Kourentzes, N. and Petropoulos, F., 2017, Forecasting with R: A practical workshop, International Symposium on Forecasting. https://kourentzes.com/forecasting/wp-content/uploads/2017/06/Forecasting-with-R-notes.pdf


31. Laurent, R. and Violante, 2012, On the forecasting accuracy of multivariate GARCH models. Journal of Applied Econometrics, vol. 27, no. 6, pp. 934-955.

32. Ljung, G.M. and Box, G.E.P., 1978, “On a Measure of a Lack of Fit in Time Series Models”, Biometrika, 65.2, 297–303.

33. Lyócsa, Š., 2011, Unit-root and stationarity testing with empirical application on industrial production of CEE-4 countries. Munich Personal RePEc Archive Paper No. 29648. https://mpra.ub.uni-muenchen.de/29648/

34. Makridakis, Spyros and Spiliotis, Evangelos, 2018, Statistical and Machine Learning forecasting methods: Concerns and ways forward. PLoS One 13(3): e0194889. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870978/

35. Nayak, S., Misra, Bijan B. and Behera, Himansu Sekhar, 2013, “Impact of Data Normalization on Stock Index Forecasting.” https://pdfs.semanticscholar.org/f412/4953553981e32c39273bb2745a140311d160.pdf

36. Baumöhl, Eduard and Lyócsa, Štefan, 2009, Stationarity of time series and the problem of spurious regression. https://mpra.ub.uni-muenchen.de/27926/1/Stationarity_of_time_series_and_the_problem_of_spurious_regression.pdf

37. Osborne, Jason W., 2010, Improving your data transformations: Applying the Box-Cox transformation, ISSN 1531-7714. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.470.7417&rep=rep1&type=pdf

38. Shang, Han Lin, 2015, Selection of the optimal Box-Cox transformation parameter for modelling and forecasting age-specific fertility, Journal of Population Research, 2015, 32(1), 69-79. https://arxiv.org/abs/1503.02344v1

39. Ben Taieb, Souhaib, 2014, Machine learning strategies for multi-step-ahead time series forecasting. Computer Science department, University of Brussels, Belgium. http://souhaib-bentaieb.com/pdf/2014_phd.pdf

40. Stodden, Victoria, 2008, Model Selection with Many More Variables than Observations. Microsoft Research Asia, Stanford University. https://web.stanford.edu/~vcs/talks/MicrosoftMay082008.pdf

41. Thissen, U. and Brakel, R., 2003, Using support vector machines for time series prediction. Chemometrics and Intelligent Laboratory Systems, Volume 69, Issues 1–2. https://www.sciencedirect.com/science/article/abs/pii/S0169743903001114

42. Tibshirani, R., 1996, Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society Vol. 58, No. 1 (1996), pp. 267-288. https://www.jstor.org/stable/2346178?seq=1#metadata_info_tab_contents

43. Zhang, G.P., Qi, M., 2005, Neural network forecasting for seasonal and trend time series. European Journal of Operational Research, 160(2):501–514. https://doi.org/10.1016/j.ejor.2003.08.037


Understanding Patterns in the Consumption of Agro-Food Products in Romania - An Analysis at Regional Level

Andreea MIRICĂ, PhD. Assistant Lecturer

Bucharest University of Economic Studies

Roxana-Ionela GLĂVAN, PhD. Assistant Lecturer

Iulia Elena TOMA, PhD. Candidate

Lucian PĂTRAȘCU, PhD.


ABSTRACT

The agro-food sector has faced several challenges since the end of World War II. This article analyses the sector from the consumer perspective. More precisely, it aims to find patterns in the consumption of agro-food products. To this end, quarterly data on the average consumption of agro-food products at regional level are used. The data are provided by the Romanian National Institute of Statistics. In order to find patterns in consumer behaviour, JDemetra+ version 2.2.2 was used to analyse the time series with regard to seasonal patterns and calendar effects (Trading days, Julian Easter). TRAMO-SEATS and X13 were assessed as seasonal adjustment methods for all series that showed significant seasonality. Only the automatic procedure was used in all cases. The X13 procedure provided the best results in most of the cases.

Keywords: agro-food products, seasonal adjustment, JDemetra+

JEL Classification: Q17

1. INTRODUCTION

The latest release of the “EU agricultural outlook for 2018-2030” report, published in December 2018 by the European Commission, shows that France, Germany, the UK and Romania are projected to account for about 55% of EU main cereal production in 2030. According to recent food trends, consumers are more inclined to look closely at the origin, environmental friendliness and organic certification of the food products they select. This has an important economic impact on the overall production chain. Understanding such factors can increase competitiveness and bring in the technologies required to develop agro-food products that are better suited to the new, more demanding trends.

After World War II, the development of economic and commercial relations became a priority for Europe. Historical studies show that Romania has experience in exporting various agro-food products. Over the last decades, Romania has lost the capacity to sell goods and agro-food products, in the context of the great changes of the late 1980s. Since Romania joined the European Union in 2007, agriculture, the main component of the agro-food sector, has moved gradually towards increasing self-consumption (Davidova et al., 2009) and generating new goods with high added value in the market. These developments are observed in the context of EU capital investments in the agricultural sector and related industries.

Considering the important challenges the agro-food sector is facing, especially due to changes in consumption patterns, it is crucial for every country to perform an in-depth analysis of this phenomenon.

Romania has high potential in food production (PWC, 2017). However, for potential investors to be able to exploit the know-how and the natural resources existing in Romania in order to best respond to consumers’ needs, they must understand consumption patterns, as consumers are the main actors of the business environment. Toma and Mirică (2018) show that exploring seasonality at a low disaggregation level is very important for business decision makers to understand the business environment. Therefore, this article explores the seasonal patterns in the consumption of agro-food products in Romania.

2. DATA AND METHODS

To this end, quarterly data on the average consumption per person for several agro-food products were retrieved from the TEMPO Online Database of the Romanian National Institute of Statistics. Data were retrieved at regional level, as this is the lowest level of disaggregation available. The available time frame is 2015-2018, which complies with the minimum standards in official statistics with regard to the length of time series for the purpose of seasonal adjustment (Buono et al. 2018; UNECE, 2012). Also, a time series length of four years is enough for detecting the Easter effect (Findley et al., 2005).

In order to explore the seasonal patterns of these series, the tools provided by JDemetra+ 2.2.2 are used. JDemetra+ 2.2.2 is the latest version of the software officially recommended by Eurostat for seasonal adjustment (Eurostat, 2019). This software provides an easy-to-use tool for detecting seasonality and outliers, as well as an automatic procedure for seasonal adjustment (Grudkowska, 2017). The automatic procedure of this software is very user friendly and provides high quality results (Mirică et al. 2017). However, for problematic time series, the decomposition method and the ARIMA model must be chosen manually, based on the methodology proposed by Mirică et al. (2016).

In order to assess the presence of seasonality, JDemetra+ offers several tests for the raw series, of which the Autocorrelation at seasonal lags test is used here (Mirică et al. 2017). Series are seasonally adjusted only if there is strong evidence of seasonality.

Next, all the series that show a strong seasonal pattern are seasonally adjusted using TRAMO-SEATS and X13, the two methods incorporated in JDemetra+ 2.2.2. In order to perform the seasonal adjustment, the Romanian calendar is defined, comprising all the legal holidays in this country, including the Julian Easter. For the results to be easy to interpret, the information proposed by Andrei et al. (2019) is extracted from the output for each series: the transformation method, the presence of Easter and Trading Days effects, outliers, the result of the residual seasonality tests, the overall quality and the AIC. The seasonal adjustment method is chosen by taking into account the overall quality of the results of each method. In the case of equal quality, the method with the lowest AIC prevails. With regard to the AIC, it is important to note that Motulsky and Christopoulos (2004) show that the sign of this indicator is of no practical importance and that one should choose the model with the lowest AIC.
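The interpretation labels reported alongside the p-values in Table 1 can be reproduced with a simple rule. The cut-offs of 0.01 and 0.05 used below are an assumption inferred from the labels and p-values in Table 1; they are not stated in the text, and the function name is hypothetical.

```python
def seasonality_label(p_value, strong=0.01, weak=0.05):
    """Map a test p-value to the interpretation labels used in Table 1.

    The thresholds are assumed: they are chosen so that the labels agree
    with the p-values reported in Table 1.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    if p_value < strong:
        return "Seasonality present"
    if p_value < weak:
        return "Seasonality perhaps present"
    return "Seasonality not present"


# p-values taken from Table 1.
for p in (0.0025, 0.0282, 0.0503):
    print(p, seasonality_label(p))
```

Under this rule only the series labelled "Seasonality present" would count as showing strong evidence of seasonality and move on to seasonal adjustment.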
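The selection rule described above (overall quality first, lowest AIC as the tie-breaker) can be sketched as follows. The ordering of the quality labels and the function name are assumptions made for illustration; the first example uses the values reported in Table 2 for bread and bakery products in North-West, and the second uses hypothetical values.

```python
# Assumed ordering of the overall quality labels, from best to worst.
QUALITY_RANK = {"good": 3, "uncertain": 2, "bad": 1, "severe": 0}


def choose_method(results):
    """Pick the method with the best overall quality; break ties by lowest AIC.

    Each result is a dict with the keys "method", "quality" and "aic".
    The AIC values are compared as signed numbers, so -12 is lower than -10.
    """
    def sort_key(result):
        return (-QUALITY_RANK[result["quality"]], result["aic"])

    return min(results, key=sort_key)["method"]


# Values from Table 2: bread and bakery products, North-West.
bread_north_west = [
    {"method": "TRAMO-SEATS", "quality": "severe", "aic": -5.3859},
    {"method": "X13", "quality": "good", "aic": -0.1254},
]
print(choose_method(bread_north_west))

# Hypothetical case of equal quality, decided by the lower AIC.
equal_quality = [
    {"method": "TRAMO-SEATS", "quality": "good", "aic": -10.0},
    {"method": "X13", "quality": "good", "aic": -12.0},
]
print(choose_method(equal_quality))
```

Quality dominates the comparison, so a method with a lower AIC but a worse quality label is not selected.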

3. RESULTS

First, all the series, for each agro-food product and region of Romania, are tested for the presence of seasonality. The results of the Autocorrelation at seasonal lags test are displayed in Table 1. As one can observe, the consumption of maize flour, milk, fats, as well as mineral water and soft drinks has no seasonality in any region. On the other hand, there are agro-food products that are consumed on a seasonal basis in all regions: Vegetables and canned vegetables in fresh vegetable equivalent; Confiture, jam, compote, jellies; and Chocolate, sweets, Turkish delight and other sugar confectionery. For fruits and eggs, there is a strong seasonal pattern in consumption in all regions except for Bucharest-Ilfov. The consumption of bread and bakery products presents seasonality only in the North-West region, while the consumption of flour and potatoes presents seasonality only in North-East, and the consumption of rice only in the South-Muntenia region. The consumption of fresh meat has seasonal patterns in South-East and South-West Oltenia, while the consumption of meat products has seasonal patterns in South-East, South-West Oltenia and South-Muntenia. The consumption of cheese and cream presents strong seasonality in the Center, South-West Oltenia and South-Muntenia regions. The consumption of maize, sunflower and soya oil has seasonal patterns in North-West and South-Muntenia. Sugar is consumed on a seasonal basis in South-East and South-Muntenia. The consumption of alcoholic drinks displays strong seasonal patterns in South-Muntenia and Center.

If we analyse the situation by region, one can observe that Bucharest-Ilfov has the lowest number of series that present seasonal patterns, closely followed by the West region. On the other hand, South-Muntenia has the highest number of such series.

Results of the Autocorrelation at seasonal lags test for series concerning

the average consumption per person by agro-food product and region

– P values and interpretation, source: designed by the authors using

JDemetra+ 2.2.2.

Table 1

Each cell gives the interpretation of the test with regard to seasonality (present, perhaps present or not present) and, in parentheses, the p-value.

| Product | North-West | Center | North-East | South-East | Bucharest-Ilfov | South-Muntenia | South-West Oltenia | West |
|---|---|---|---|---|---|---|---|---|
| Bread and bakery products | present (0.0025) | not present (0.2821) | not present (0.1220) | not present (0.2531) | not present (0.9622) | present (0.0026) | not present (0.1424) | not present (0.4537) |
| Maize flour | not present (1.0000) | not present (0.9855) | not present (0.3150) | not present (1.0000) | not present (1.0000) | perhaps present (0.0282) | not present (0.9482) | not present (0.0811) |
| Flour | not present (0.2739) | not present (0.1835) | present (0.0016) | not present (0.0886) | not present (0.9724) | not present (0.1463) | not present (1.0000) | not present (0.4806) |
| Rice | not present (0.3387) | not present (0.1312) | not present (0.8842) | not present (1.0000) | not present (0.0848) | present (0.0045) | not present (0.0615) | not present (1.0000) |
| Fresh meat | not present (0.0755) | not present (0.0535) | not present (0.0694) | present (0.0009) | not present (0.2737) | not present (0.0612) | present (0.0019) | not present (1.0000) |
| Meat products | not present (0.8477) | not present (0.1906) | perhaps present (0.0136) | present (0.0010) | perhaps present (0.0159) | present (0.0003) | present (0.0001) | not present (0.2348) |
| Milk | not present (0.7798) | not present (0.6186) | not present (0.4785) | not present (1.0000) | not present (1.0000) | not present (0.9972) | not present (0.0901) | not present (0.9964) |
| Cheese and cream | not present (0.6547) | present (0.0020) | not present (0.2355) | not present (0.2722) | not present (0.0784) | present (0.0005) | present (0.0003) | not present (1.0000) |
| Eggs | present (0.0011) | present (0.0006) | present (0.0003) | present (0.0002) | perhaps present (0.0323) | present (0.0023) | present (0.0006) | not present (0.3625) |
| Fats | perhaps present (0.0372) | not present (0.5340) | not present (0.3151) | not present (0.0893) | not present (0.8957) | not present (0.3149) | not present (0.3877) | not present (1.0000) |
| Maize, sunflower, soya oil | present (0.0043) | not present (0.2146) | not present (0.2281) | not present (0.0721) | not present (0.1960) | present (0.0015) | not present (0.2343) | not present (1.0000) |
| Fruit | present (0.0005) | present (0.0000) | present (0.0003) | present (0.0007) | perhaps present (0.0174) | present (0.0001) | present (0.0009) | present (0.0014) |
| Potatoes | not present (0.0778) | not present (0.6932) | present (0.0015) | not present (0.4988) | not present (0.0749) | not present (0.0503) | not present (0.1593) | not present (0.7442) |
| Vegetables and canned vegetables in fresh vegetable equivalent | present (0.0000) | present (0.0001) | present (0.0000) | present (0.0000) | present (0.0001) | present (0.0000) | present (0.0001) | present (0.0004) |
| Sugar | perhaps present (0.0141) | not present (0.3574) | not present (0.1104) | present (0.0001) | perhaps present (0.0378) | present (0.0098) | not present (0.1057) | not present (0.7073) |
| Confiture, jam, compote, jellies | present (0.0002) | present (0.0001) | present (0.0001) | present (0.0001) | present (0.0009) | present (0.0000) | present (0.0001) | present (0.0001) |
| Chocolate, sweets, Turkish delight and other sugar confectionery | present (0.0070) | present (0.0022) | present (0.0000) | present (0.0001) | present (0.0057) | present (0.0009) | present (0.0001) | present (0.0026) |
| Mineral water and other soft drinks | not present (0.1143) | not present (0.1610) | not present (0.4609) | not present (0.1571) | not present (0.7285) | not present (0.4895) | perhaps present (0.0458) | not present (0.1050) |
| Alcoholic drinks | perhaps present (0.0289) | present (0.0092) | not present (0.1117) | not present (0.4691) | not present (0.2579) | present (0.0024) | not present (0.9090) | not present (0.0735) |

Next, the automatic procedures of TRAMO-SEATS and X13 were applied to seasonally adjust the time series that present strong evidence of seasonality. The results are displayed in Table 2. For most of the series, the X13 method is more suitable for seasonal adjustment. However, there are some series for which one cannot decide between the two methods: the consumption of fresh meat in South-East and of meat products in South-Muntenia; the consumption of eggs in North-West and South-East; the consumption of fruit in North-West; the consumption of vegetables and canned vegetables in fresh vegetable equivalent in North-West; and the consumption of confiture, jam, compote and jellies in North-West. Moreover, there were some cases in which TRAMO-SEATS provided better results: the consumption of fruit in Center, South-Muntenia and South-East; the consumption of potatoes in North-East; the consumption of vegetables and canned vegetables in fresh vegetable equivalent in South-East and Bucharest-Ilfov; and the consumption of chocolate, sweets, Turkish delight and other sugar confectionery in North-East, South-West Oltenia and West.

With regard to the calendar effects, interesting results were obtained. First, there is no trading days effect, meaning that the consumption of agro-food products is not influenced by the day of the week. Second, for some products there is a significant negative Easter effect of various lengths: for the consumption of cheese and cream, the effect lasts for 15 days in South-Muntenia and for 8 days in South-West Oltenia; for the consumption of eggs, the effect lasts for 15 days in North-East as well as in South-Muntenia; for the consumption of meat products, the effect lasts for 8 days in South-East; for the consumption of maize, sunflower and soya oil, the effect lasts for 8 days in South-Muntenia; for the consumption of fruit, the effect lasts for 15 days both in North-East and in South-Muntenia; for the consumption of vegetables and canned vegetables in fresh vegetable equivalent, the effect lasts for 8 days in the Center as well as in South-West Oltenia; and for the consumption of alcoholic drinks, the effect lasts for 15 days in the Center region. The results are in line with those reported in the scientific literature. For example, analysing US data, McElroy et al. (2018) also obtained a negative pre-Easter effect for groceries.


The results of the seasonal adjustment process for the series concerning

the average consumption per person of various agro-food products using

TRAMO-SEATS and X13 with national calendar

Table 2

Series

transformation

Easter

Eff ect

Trading

days

eff ect

Outlier

detected

and

corrected

Residual

seasonality

Overall

qualityAIC

Bread and bakery products North – West

TRAMO-

SEATS

RSA full

log-

transformedno no 1 yes severe -5.3859

X13

RSA5c

log-

transformedno no no no good -0.1254

Flour North – East

TRAMO-

SEATS

RSA full

No

transformationno no no no good -36.8021

X13

RSA5c

log-


Rice South – MunteniaTRAMO-

SEATS

RSA full

log-


X13

RSA5c

log-


Fresh meat South – East

TRAMO-

SEATS

RSA full

log-


X13

RSA5c

log-


Fresh meat South - West Oltenia

TRAMO-

SEATS

RSA full

log-


X13

RSA5c

log-


Meat products South – Muntenia

TRAMO-

SEATS

RSA full

log-


X13

RSA5c

log-


Meat products South – East

TRAMO-

SEATS

RSA full

log-



Series

transformation

Easter

Eff ect

Trading

days

eff ect

Outlier

detected

and

corrected

Residual

seasonality

Overall

qualityAIC

X13 RSA5c | log-transformed | Yes, 8 days, coef. -0.66 | no | no | no | good | -25.0815

Meat products South - West Oltenia
TRAMO-SEATS RSA full | log-transformed | no | no | 1 | no | good | -41.6206
X13 RSA5c | log-

Cheese and cream South – Muntenia
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-transformed | Yes, 15 days, coef. -0.09

Cheese and cream Center
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-

Cheese and cream South - West Oltenia
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-transformed | Yes, 8 days, coef. -0.1829

Eggs North – West
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | 24.5093
X13 RSA5c | log-

Eggs Center
TRAMO-SEATS RSA full | No transformation | no | no | no | no | good | 15.9171
X13 RSA5c | log-




Eggs North – East
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-transformed | Yes, 15 days, coef. -0.1 | no | no | no | good | 12.4254

Eggs South – East
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-

Eggs South – Muntenia
TRAMO-SEATS RSA full | No
X13 RSA5c | log-transformed | Yes, 15 days, coef. -0.14

Eggs South - West Oltenia
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-

Maize, sunflower, soya oil North – West
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-

Maize, sunflower, soya oil South – Muntenia
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-transformed | Yes, 8 days, coef. -0.22

Fruit North – West
TRAMO-SEATS RSA full | log-




X13 RSA5c | log-

Fruit Center
TRAMO-SEATS RSA full | No
X13 RSA5c | log-

Fruit North – East
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-transformed | Yes, 15 days, coef. -0.11

Fruit South – East
TRAMO-SEATS RSA full | No
X13 RSA5c | log-

Fruit South – Muntenia
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-transformed | Yes, 1 day, coef. -130.4

Fruit South - West Oltenia
TRAMO-SEATS RSA full | No
X13 RSA5c | log-transformed | Yes, 1 day but coef. approx. 0

Fruit West
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-

Potatoes North – East



TRAMO-SEATS RSA full | log-
X13 RSA5c | preprocessing: failed

Vegetables and canned vegetables in fresh vegetable equivalent North – West
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-

Vegetables and canned vegetables in fresh vegetable equivalent Center
TRAMO-SEATS RSA full | No
X13 RSA5c | log-transformed | Yes, 8 days, coef. -0.57

Vegetables and canned vegetables in fresh vegetable equivalent North – East
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-transformed | no | no | 1 | no | good | 3.6528

Vegetables and canned vegetables in fresh vegetable equivalent South – East
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-

Vegetables and canned vegetables in fresh vegetable equivalent Bucharest – Ilfov
TRAMO-SEATS RSA full | No
X13 RSA5c | log-

Vegetables and canned vegetables in fresh vegetable equivalent South – Muntenia
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-




Vegetables and canned vegetables in fresh vegetable equivalent South - West Oltenia
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-transformed | Yes, 8 days, coef. -0.44

Vegetables and canned vegetables in fresh vegetable equivalent West
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-

Sugar South – East
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-transformed | no | no | 1 | no | uncertain | -62.1106

Sugar South – Muntenia
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-

Confiture, jam, compote, jellies North – West
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-

Confiture, jam, compote, jellies Center
TRAMO-SEATS RSA full | No
X13 RSA5c | log-transformed | no | no | 2 | no | severe | -53.0063

Confiture, jam, compote, jellies North – East
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-




Confiture, jam, compote, jellies South – East
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-

Confiture, jam, compote, jellies Bucharest – Ilfov
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-

Confiture, jam, compote, jellies South – Muntenia
TRAMO-SEATS RSA full | No
X13 RSA5c | log-

Confiture, jam, compote, jellies South - West Oltenia
TRAMO-SEATS RSA full | No
X13 RSA5c | log-

Confiture, jam, compote, jellies West
TRAMO-SEATS RSA full | No
X13 RSA5c | log-

Chocolate, sweets, Turkish delight and other sugar confectionery North – West
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-

Chocolate, sweets, Turkish delight and other sugar confectionery Center
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-

Chocolate, sweets, Turkish delight and other sugar confectionery North – East
TRAMO-SEATS RSA full | No




X13 RSA5c | log-

Chocolate, sweets, Turkish delight and other sugar confectionery South – East
TRAMO-SEATS RSA full | No
X13 RSA5c | log-

Chocolate, sweets, Turkish delight and other sugar confectionery Bucharest – Ilfov
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-

Chocolate, sweets, Turkish delight and other sugar confectionery South – Muntenia
TRAMO-SEATS RSA full | log-
X13 RSA5c | log-

Chocolate, sweets, Turkish delight and other sugar confectionery South - West Oltenia
TRAMO-SEATS RSA full | No
X13 RSA5c | log-

Chocolate, sweets, Turkish delight and other sugar confectionery West
TRAMO-SEATS RSA full | No
X13 RSA5c | log-

Alcoholic drinks Center
TRAMO-SEATS RSA full | No
X13 RSA5c | log-transformed | Yes, 15 days, coef. -0.16

Alcoholic drinks South – Muntenia
TRAMO-SEATS RSA full | log-




X13 RSA5c | log-


4. CONCLUSIONS

Understanding the consumption patterns of agro-food products is a necessary step in a fast-changing economy and may contribute to sustainable business growth in this economic sector. Currently, the focus on the origin, traceability and quality of products is seen as a change in consumer behavior that has occurred on the agro-food product market (Opara, 2003).

In the present research, using the most recent quarterly data from the National Institute of Statistics of Romania, we explore seasonal patterns in the consumption of agro-food products at the regional level.

The results reveal that the consumption of some agro-food products shows no seasonality in any region. However, there are products, such as Vegetables and canned vegetables in fresh vegetable equivalent, Confiture, jam, compote, jellies, and Chocolate, sweets, Turkish delight and other sugar confectionery, that are consumed on a seasonal basis in all regions.

The analysis also shows that seasonal patterns in the consumption of fruit and eggs persist in all regions except Bucharest-Ilfov. The South-West Oltenia and South-East regions present seasonal patterns in the consumption of fresh meat and meat products. The South-Muntenia region has seasonal patterns in the consumption of rice, meat, cheese and cream, maize, sunflower and soya oil, and sugar products.

The consumption of alcoholic drinks shows a strong seasonal pattern in the South-Muntenia and Center regions. The North-East region presents seasonality only in the consumption of flour and potatoes, while the North-West region presents it for bread and bakery products and for maize, sunflower and soya oil.

The situation by region shows that the Bucharest-Ilfov and West regions have the lowest number of series that present seasonal patterns, whereas South-Muntenia has the highest number of such series.

When the automatic TRAMO-SEATS and X13 procedures were applied to seasonally adjust the time series, X13 obtained the best results in most cases. Even so, there were some circumstances in which TRAMO-SEATS provided better results, and some cases in which one cannot decide between the two methods.
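As a rough illustration of how two competing adjustments of the same series might be compared, the sketch below ranks them by residual seasonality first, then by the overall quality label, and only then by AIC. This ordering is an assumption made for the example, not the decision rule used in the study, and the names `Fit`, `QUALITY_RANK` and `prefer` are hypothetical. The two records use the values reported in Table 2 for Bread and bakery products in the North – West region. AIC is kept as a last tie-breaker because values obtained from different procedures or transformations may not be directly comparable.

```python
from dataclasses import dataclass


@dataclass
class Fit:
    method: str
    residual_seasonality: bool  # True if seasonality remains after adjustment
    quality: str                # overall quality label from the diagnostics
    aic: float


# Assumed ordering of the quality labels, from best to worst.
QUALITY_RANK = {"good": 0, "uncertain": 1, "severe": 2}


def prefer(a: Fit, b: Fit) -> Fit:
    """Return the preferred fit under the illustrative rule described above."""
    def key(f: Fit):
        # Tuples compare element by element: no residual seasonality wins,
        # then the better quality label, then the lower AIC.
        return (f.residual_seasonality, QUALITY_RANK[f.quality], f.aic)
    return min(a, b, key=key)


# Values taken from Table 2 (Bread and bakery products, North – West).
tramo = Fit("TRAMO-SEATS RSA full", True, "severe", -5.3859)
x13 = Fit("X13 RSA5c", False, "good", -0.1254)
best = prefer(tramo, x13)
```

Under this rule the X13 adjustment is preferred for this series even though its AIC is higher, because the TRAMO-SEATS adjustment leaves residual seasonality.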


When the series are checked for calendar effects, no trading days effect is observed, meaning that the consumption of agro-food products is not influenced by the day of the week. The results also show a negative pre-Easter effect of various lengths for some products and regions. Furthermore, they reveal that unobserved factors may contribute to the current trends in the consumption patterns of agro-food products. The study of such unobserved effects needs to be addressed by using other decision criteria.

REFERENCES

1. Andrei, T., Mirică, A., Glăvan, I. R., Ferariu, G. A., and Mincu-Rădulescu, G. I., 2019, Seasonal adjustment of tourism data for Romania using JDemetra+, paper presented at the http://simpstat.ase.ro/wp-content/uploads/2019/06/ICAS2019-Conference-Program..pdf

2. Buono D., Infante, E., and Mazzi, G. L., 2018, Short versus long time series: An empirical analysis in Handbook on Seasonal Adjustment, Eurostat https://ec.europa.eu/eurostat/documents/3859598/8939616/KS-GQ-18-001-EN-N.pdf

3. Davidova, S., Fredriksson, L., and Bailey, A., 2009, Subsistence and semi-subsistence farming in selected EU new member states, Agricultural Economics, no. 40, pp. 733–744.

4. European Commission, 2018, EU agricultural outlook for markets and income, 2018-2030. European Commission, DG Agriculture and Rural Development, Brussels.

5. Eurostat, 2019, Seasonal Adjustment https://ec.europa.eu/eurostat/cros/content/download_en

6. Findley, D. F., Wills, K., and Monsell, B. C., 2005, Issues in estimating easter regressors using regarima models with x-12-arima. In Proceedings of the American Statistical Association.

7. Grudkowska S., 2017, JDemetra+ User Guide Version 2.2 https://ec.europa.eu/eurostat/cros/system/files/jdemetra_user_guide_version_2.2.pdf

8. McElroy, T. S., Monsell, B. C., and Hutchinson, R. J., 2018, Modeling of Holiday Effects and Seasonality in Daily Time Series, Statistics, 1.

9. Mirică, A., Andrei, T., Dascălu, E. D., Mincu-Rădulescu, G. I., and Glăvan, I. R., 2016, Revision policy of seasonally adjusted series – case study on Romanian quarterly GDP, Economic Computation & Economic Cybernetics Studies & Research, 50(3).

10. Mirică, A., Toma, I. E., and Begu, L. S., 2017, Seasonal Adjustment–Consensus between Direct and Indirect Method. Case Study: Seasonal Adjustment of Romanian National Accounts Using Jdemetra+ 2.1. In 30th International Business Information Management Association Conference (pp. 526-541).

11. Motulsky, H., and Christopoulos, A., 2004, Fitting models to biological data using linear and nonlinear regression: a practical guide to curve fitting. Oxford University Press.

12. Opara, L.U., 2003, Traceability in agriculture and food supply chain: A review of basic concepts, technological implications, and future prospects, WFL Publisher, Science and Technology, Food, Agriculture & Environment, 1(1), 101-106.

13. PWC, 2017, Potenţialul dezvoltării sectorului agricol din România (available only in Romanian at https://www.juridice.ro/wp-content/uploads/2017/03/Raport_PwC-agricultura.pdf)

14. Toma, I. E., and Mirică, A., 2018, Using Statistical Data to Better Understand Business Environment-Case Study on Export and Import Data at County Level. Romanian Statistical Review, (2).

15. UNECE, 2012, Practical Guide to Seasonal Adjustment With Demetra+ http://www.unece.org/index.php?id=40568