IMPACT OF REMITTANCES ON THE COUNTRY OF ORIGIN.
MULTIDIMENSIONAL ANALYSIS AT MACRO AND MICROECONOMIC
LEVEL. CASE STUDY ROMANIA AND MOLDOVA
Valentina Vasile, Professor dr.
Institute of National Economy, Romanian Academy
Elena Bunduchi, Teaching Assistant drd.
University of Medicine, Pharmacy, Sciences and Technology of Tîrgu Mureş, Romania
Ștefan Daniel, Associate Professor dr.
University of Medicine, Pharmacy, Sciences and Technology of Tîrgu Mureş, Romania
Călin-Adrian Comes, Associate Professor dr.
University of Medicine, Pharmacy, Sciences and Technology of Tîrgu Mureş, Romania
ESTIMATION OF NUMBER OF PERSONS PER HOUSEHOLD BASED
ON CHARACTERISTICS OF CONSUMPTION ITEMS - UTILIZATION
OF BIG-DATA TO IMPROVE THE CONSUMPTION TREND INDEX IN
JAPAN-
Anri Mutoh
National Statistics Center, Japan
Masayo Yamashita
National Statistics Center, Japan
Yoshiyasu Tamura
National Statistics Center, Japan
Masahiro Matsumoto
National Statistics Center, Japan
R TOOLS FOR ILOSTAT: RILOSTAT AND SMART
M. Villarreal-Fuentes
Department of Statistics, International Labour Organization (ILO)
S. Ding
Department of Statistics, International Labour Organization (ILO)
Romanian Statistical Review nr. 4 / 2019
CONTENTS 4/2019
ROMANIAN STATISTICAL REVIEW www.revistadestatistica.ro
MACROECONOMIC STATISTICAL FORECASTING FOR ENGINE
DEMAND
Ankit Kamboj
Cummins Technologies India Pvt. Ltd, Pune, India
Debojyoti Samadder
Cummins Technologies India Pvt. Ltd, Pune, India
Ambica Rajagopal
Cummins Technologies India Pvt. Ltd, Pune, India
Sarat Sindhu Mukhopadhyay
Cummins Technologies India Pvt. Ltd, Pune, India
UNDERSTANDING PATTERNS IN THE CONSUMPTION OF
AGRO-FOOD PRODUCTS IN ROMANIA - AN ANALYSIS
AT REGIONAL LEVEL
Andreea MIRICĂ, PhD. Assistant Lecturer
Bucharest University of Economic Studies
Roxana-Ionela GLĂVAN, PhD. Assistant Lecturer
Bucharest University of Economic Studies
Iulia Elena TOMA, PhD. Candidate
Bucharest University of Economic Studies
Lucian PĂTRAȘCU, PhD.
Bucharest University of Economic Studies
Impact of remittances on the country of origin. Multidimensional analysis at macro and microeconomic level. Case study Romania and Moldova
Valentina Vasile, Professor dr. Institute of National Economy, Romanian Academy
Elena Bunduchi, Teaching Assistant drd. University of Medicine, Pharmacy, Sciences and Technology of Tîrgu Mureş, Romania
Ștefan Daniel, Associate Professor dr. University of Medicine, Pharmacy, Sciences and Technology of Tîrgu Mureş, Romania
Călin-Adrian Comes, Associate Professor dr. University of Medicine, Pharmacy, Sciences and Technology of Tîrgu Mureş, Romania
ABSTRACT
This research investigates the impact of remittances, from the perspective
of the country of origin, on economic growth at the macro level and at the
micro level of the household in Romania and Moldova. We decided to carry out
a comparative analysis due to the importance of these external financial
flows to the economy. Although the share of remittances in the GDP of the two
states differs due to the level of economic development, the constantly
increasing labor migration is a common characteristic. In this research we
applied a time series regression model using the tseries package in R. The
expected results of the research are to highlight the indicators influenced
by remittances in Romania compared to Moldova at macro and microeconomic
level, as well as the type and intensity of the generated impact. This
research demonstrates that remittance-based economic growth is unsustainable
and highlights the long-term negative impact of these financial flows on the
country of origin.
Key words: Remittances, Time-Series Models, R packages, Romania, Moldova
JEL Classification: F24, C22, O52
INTRODUCTION
Researchers' opinions are divided regarding the impact of migration and
remittances on the origin country. Some consider that remittances generate
economic growth (Meyer et al, 2017; Matuzeviciute et al, 2016; Imai et al,
2014), others say there is no connection between the two variables (Lim et al,
2015; Barajas et al, 2009), and a third category of experts argues that
these flows have a negative impact (Lartey et al, 2008).
The first group considers that remittances contribute to a better
allocation of resources in the country of origin, thus stimulating aggregate
demand for goods and services by increasing productivity generated by
consumption and investment (Kumar et al, 2018). Other opinions argue that
remittances contribute to increased income and productivity by reducing the
unemployment rate in the country of origin as a result of the mobility of the
unemployed (Boboc et al, 2012).
The third group, however, sees remittances as a factor stimulating
the entry into the home market of imports that substitute for domestic
products (Javed et al, 2017); in addition, consumption of imported products
is higher than “indigenous consumption” of similar products in these countries
(Bayar, 2015).
As a result of inequality in resource distribution, employment
opportunities and income levels, migration and remittances can act as
mechanisms for adjusting labor resource flows between countries of origin
and destination. On the one hand, migration and remittances are a
consequence of the failure of national policy in the country of origin to meet
individual needs in terms of decent employment opportunities and labor
income (Bunduchi et al, 2019). On the other hand, remittances can be a tool
for supporting economic policy in the development process by enhancing
demand for consumption and / or stimulating entrepreneurship. The economic
and social impact of remittances for countries of origin is significantly
positive, at least from the perspective of the beneficiary households. The
level of poverty, inequality and the structure of households’ expenditures are
some of the channels through which the flow of migration transfers reveals its
effects on growth and on economic and social development.
The free movement of people and the opening of the labor market
(globalization and the need to cover the demographic deficit in developed
countries with an aging population) have stimulated labor mobility for the
working population in less developed countries. Mobility for work and
emigration aimed both at improving the worker’s employment status (mainly in
terms of labor earnings) and at financially supporting the household in the
country of origin through remittances. Statistical data research has shown that
the relationship between people working abroad and remittances received in
the country of origin is not homogeneous and/or balanced. There are countries
of origin with a large share of labor migrants in the population and a
significant share of remittances in GDP (such as Moldova), and countries with
a large number of migrant workers and a low share of remittances in GDP (such
as Romania). As an indicator, the average remittance gives a distorted picture
because: a) not all labor migrants remit money to the remaining family; b) not
all migrants choose official channels, so some of the remittances remain
unregistered; c) the amount remitted varies widely, being determined by the
level of earnings, the cost of living in the host country, the mobility model
(alone or with the family) and the amount actually transferred through official
channels. In addition, the remittance pattern (amount, period and frequency) is
strongly influenced by the individual's occupational and human development
plan and mobility expectations (post-repatriation, naturalization in the
destination country, long-term and very long-term mobility, or a continued
search for the optimal mobility solution in terms of income and / or
profession, associated with a later decision on utility, etc.). In view of the
potentially diverse economic, social and behavioral effects of labor mobility,
in this research we try to identify the impact of remittances received in
terms of the importance of their total volume as a share of GDP. We examine
whether there is a link between the level of development of the recipient
country and the impact of remittance flows on economic growth, as origin
countries are facing significant mobility.
CONSIDERATIONS IN THE LITERATURE ON THE IMPACT OF REMITTANCES IN THE COUNTRY OF
ORIGIN
So far, there has been substantial research on the importance of
remittances for the country of origin, addressing each field that remittances
affect. Below we have compiled a synthesis of the most recent research results.
Recent research on the impact of remittances on the country of origin
Table 1
| Area | Authors | Database | Empirical approach | Research findings |
|---|---|---|---|---|
| Economic development | (Eggoh, Bangake, & Semedo, 2019) | 49 developing countries | Panel Smooth Transition Regression | Remittances have a positive impact on the level of economic development. |
| Economic development | (Fromentin, 2017) | 102 developing countries | Pooled Mean Group | In the short term, the study finds that remittances have a positive impact on financial development (except for low-income countries). In the long run, the assumption is that households receiving remittances from abroad are more likely to use official financial services for their transactions and payments. |
| Economic development | (Meyer & Shera, 2017) | Albania, Bulgaria, Macedonia, R. Moldova, Romania and Bosnia Herzegovina | OLS with fixed effects | A positive relationship between remittances and GDP growth in the countries studied. |
| Economic development | (Imai et al., 2014) | 24 Asian Pacific countries | Panel vector autoregressive model | Remittances generate economic growth in the analyzed countries and contribute to poverty reduction. |
| Economic development | (Giuliano & Ruiz-Arranz, 2009) | 100 developing countries | Generalized method of moments | Remittances contribute to GDP growth in the analyzed countries. |
| Labor market | (Vadean, Randazzo, & Piracha, 2019) | Tajikistan | 3SLS | Remittances lead to a reduction in the number of employees, in favor of the self-employed, especially in agriculture. At the same time, they generate small-scale family investments, which could have positive household effects, without effects at national level. |
| Labor market | (Azizi, 2018) | 122 developing countries | Dynamic panel data with fixed effects | Remittances reduce women’s participation in work but do not affect men. |
| Labor market | (Boboc et al., 2012) | Romania | Risk assessment for each mobility profile | Migration and remittances help reduce unemployment but also generate a reduction in employment. |
| Labor market | (Leon-Ledesma & Piracha, 2004) | Central and Eastern Europe | Panel data with fixed effects; Generalized method of moments | Remittance inflows positively influence the employment rate of the population in the country of origin as a result of investing these flows in the development of entrepreneurship. |
| Consumption | (Beaton et al., 2017) | Latin America and the Caribbean | Dynamic panel data | Remittances contribute to increased consumption, especially as a result of facilitating access to funding sources. |
| Consumption | (Lim & Simmons, 2015) | CARICOM Member States except the Bahamas and Montserrat | Cointegration tests of panel data | No link between remittances and GDP per capita, but a positive influence of remittances on consumption, which means that remittances are directed towards consumption rather than productive investments. |
| Consumption | (Medina & Cardona, 2010) | Colombia | Panel data model | No impact of remittances on current consumption, but a positive influence on the living standards of the beneficiary households was observed. |
| Health | (Azizi, 2018) | 122 developing countries | Dynamic panel data with fixed effects | Households receiving remittances register increases in health expenditure. At the same time, the mortality rate decreases as remittances increase. |
| Health | (Jr, Cuecuecha, & Tlaxcala, 2013) | Ghana | Two-stage multinomial selection model | Remittances cause an increase in health expenditure. |
| Health | (Zhunio, Vishwasrao, & Chiang, 2012) | 69 developed and developing countries | GLS with side effects | Remittance-receiving households experience an increase in life expectancy and an improvement in living standards. |
| Education | (Azizi, 2018) | 122 developing countries | Dynamic panel data with fixed effects | Remittances help increase school enrollment in both public and private institutions, as well as the graduation rate. |
| Education | (Ambler, Aycinena, & Yang, 2015) | El Salvador | Panel data with fixed effects | For each $1 of remittances received by beneficiary households, education spending increased by $3.72. |
| Education | (Jr et al., 2013) | Ghana | Two-stage multinomial selection model | Remittances increase spending on education. |
| Education | (Zhunio et al., 2012) | 69 developed and developing countries | GLS with side effects | Remittances in households increased the tuition rate. |
| Education | (Adams & Cuecuecha, 2010) | Guatemala | Two-stage multinomial model | Households receiving remittances recorded much higher spending on education compared to the period when they did not benefit from such financial resources. |
Research has, therefore, shown that remittances are an important
source of income, especially in poor households and the main directions
of spending are improving living conditions, current consumption, health
expenditure and small investment in housing. An important part of remittances
goes to the education of children, especially as education provides a greater
degree of opportunity to have higher labor income.
METHODOLOGY
The current research uses an OLS model to analyze the impact of remittances
from the perspective of the country of origin, for Romania and Moldova. The
dependent variables used in the research are:
a) at macroeconomic level - active population, employed population,
employment rate, number of unemployed, unemployment rate, total
consumption of the population, imports, trade balance, population
savings and entrepreneurship development;
b) at microeconomic level - household consumption expenditure,
endowment with durable goods, ICT implementation, schooling, and
expenditure on education and health.
In order to analyze the impact between the variables included in the
research, we formulated the following economic hypotheses:
• H1 - the presence of a positive correlation between remittances and
the unemployment rate;
• H2 - the presence of a negative correlation between remittances and
employment indicators;
• H3 - the presence of a negative correlation between remittances and
imports;
• H4 - the presence of a positive correlation between remittances and
household expenditure indicators;
• H5 - the presence of a negative correlation between remittances and
the schooling rate.
Using the OLS model, we will identify whether there is a direct and
statistically significant relationship between remittance flows and the
dependent variables by estimating several equations.
The general model has the following form:
$$Y_i = \beta_0 + \beta_1 X_i + e_i \tag{1}$$
where:
$Y_i$ represents the dependent variable,
$X_i$ represents the independent variable (remittances),
$\beta_0$ is the intercept and shows the mean value of the variable Y when the
independent variable X is equal to 0,
$\beta_1$ represents the slope and shows the mean variation of the dependent
variable Y for a one-unit absolute change in the variable X,
$e_i$ is the residual variable.
The parameters of the OLS model are estimated using the statistical
software R and its lm() function.
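To make the estimation step concrete, the following is a minimal sketch of how the intercept, the slope and R² of model (1) can be computed by least squares. It is not the authors' code: the paper uses R's lm(), whereas this illustration uses plain Python, the function name `fit_simple_ols` is hypothetical, and the input values are made up for illustration rather than taken from the study's data.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Least-squares estimates for y_i = beta0 + beta1 * x_i + e_i.

    Returns (beta0, beta1, r_squared). Hypothetical helper for illustration.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    # Sum of squared deviations of x and cross-deviations of x and y.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1 = sxy / sxx                # slope
    beta0 = y_bar - beta1 * x_bar    # intercept
    # R^2 = 1 - SS_res / SS_tot: share of the variation in y explained by x.
    ss_res = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot if ss_tot else float("nan")
    return beta0, beta1, r_squared


# Made-up illustrative values, NOT data from the study.
remittances = [1.0, 2.0, 3.0, 4.0, 5.0]
consumption = [2.1, 2.9, 4.2, 4.8, 6.0]
beta0, beta1, r2 = fit_simple_ols(remittances, consumption)
print(f"beta0={beta0:.3f}, beta1={beta1:.3f}, R^2={r2:.3f}")
```

In this sketch the slope plays the role of β1 in equation (1), and R² corresponds to the share of variation in the dependent variable that the paper reports as "explained" by remittances.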
DATA
The databases used in the research are those provided by the National
Institute of Statistics of Romania, the National Bureau of Statistics of the
Republic of Moldova, the World Bank, the National Bank of Romania, and the
National Bank of Moldova. The analysis period is 1997-2017.
We decided to carry out a comparative analysis due to the importance
of these external financial flows for any economy and especially for the least
developed economies, as previously demonstrated in the relevant
research literature. Although the share of remittances in the GDP of the
two countries differs due to the level of economic development and the
prevailing pattern of remittances, the increasing number of labor migrants is a
common feature. Since the purpose of the research is to highlight the impact
of remittances on economic growth, and the main motivation for migration in
the two countries is to supplement the incomes of households in the country
of origin, we can consider the two countries homogeneous from the
perspective of how received remittances are spent at household
level.
Migrants’ share in the total population of the origin country
in 1995-2017, %
Chart 1
Source: Author’s calculations based on World Bank data. Available: http://www.worldbank.org/
en/topic/migrationremittancesdiasporaissues/brief/migration-remittances-data.
The free movement of people and the opening of the labor market
(globalization and the need to cover the demographic deficit in developed
countries with an aging population) have stimulated the mobility of the working
population from less developed countries, such as Romania and Moldova.
Thus, the number of those who left has increased considerably from year to
year, reaching in 2017 20% of the total population in Moldova and 15% of that
in Romania (Chart 1).
Share of remittances in GDP in 1995-2017, %
Chart 2
Source: Author’s calculations based on World Bank data. Available: http://www.worldbank.org/en/topic/migrationremittancesdiasporaissues/brief/migration-remittances-data. Retrieved
on 17.04.2019
The rise in the number of migrant workers increased remittances
in these two countries, and implicitly their share in GDP, making them
an important source of external financial flows, which generate changes at
both macroeconomic and household level (Chart 2).
RESULTS AND DISCUSSIONS
Remittances are the expected outcome of migration to supplement
income, generating a series of effects at country level and at household /
individual level.
At the origin country level, it is stated that remittances generate
significant positive effects on the labor market, reducing the imbalances
registered in the form of a high unemployment rate (Boboc, Vasile, and
Todose, 2012). Our tests indicate different results for
Romania and the Republic of Moldova for the period 1996-2017.
Remittances impact on labor market indicators
in Romania and Moldova
Chart 3
The results indicate that remittances have a stronger influence on the
labor market indicators in Moldova than in Romania, as highlighted by the
values obtained for R². For 2017, this is explained by the share of
remittances in GDP, which is about 8 times higher in Moldova than in Romania
(16.1% compared to 2.1%), and by the population's greater involvement in
migration (i.e. the share of migration for work in the total population is
more than 1.5 times higher in the Republic of Moldova, 29% compared to only
19% in Romania).
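As a quick arithmetic check, using only the percentages quoted in the paragraph above, the two ratios work out to:

$$\frac{16.1}{2.1} \approx 7.7, \qquad \frac{29}{19} \approx 1.53$$

so "8 times" and "more than 1.5 times" are rounded values of these ratios.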
Remittances exert a statistically insignificant influence on the
unemployment rate; each 1% increase in remittances is associated with only a
modest reduction of 0.04% in the number of unemployed in Romania and of 0.35%
in the Republic of Moldova.
The results suggest that labor migration is not primarily driven by
the unemployed, but rather by inactive or even employed people. Moreover,
the results could be significant if we analyzed remittances in relation
to the underground economy, which employs over 1.2 million Romanians
(European Commission, 2017) and accounted for over 22% of Romania’s GDP in
2017 (European Commission, 2018) and over 23.2% of Moldova’s
GDP (BNS, 2018), but such data are not available.
With respect to the labor market employment indicators, the same direction
of influence of remittances is observed in Romania and in the Republic of
Moldova. A 1% increase in remittance inflows to Romania
contributes to an average reduction of the active population by 0.02%, of
the employed population by 0.012% and of the employment rate by 1.69%. In
Moldova the influence of remittances is more noticeable: a 1% increase
leads to an average decrease of the active population by 0.13%,
of the employed population by 0.11% and of the employment rate of
Moldovans by 6.71%. Therefore, migrants leave from employed status,
especially in the Republic of Moldova, because the income differential
is high and migration responds to the need for additional household income,
which cannot be adequately met by employment in the country of origin.
Thus, the negative impact on the employment rate and the lack of a
statistically significant influence on the unemployment rate suggest that labor
migrants were not only unemployed persons (Vasile et al, 2013; Caragea et
al, 2013). While for the unemployed the main reason for mobility is the lack of
a job, the decision of an employed person from Romania / R. Moldova to migrate
and remit is driven by attractive salaries in the country of destination,
precarious working conditions in the country of origin, career opportunities,
etc. At the same time, labor mobility motivated by remittances
contributes to accelerating the aging of the active population and raising the
average age in the country of origin, because the persons
involved in labor mobility are predominantly young.
Remittances impact on macroeconomic indicators
in Romania and Moldova
Chart 4
Remittances generate considerable effects on consumption growth.
With an increase of remittances by 1%, the total consumption of
households increases on average by 0.328% in Romania and 0.357% in
R. Moldova (Chart 4). In the absence of detailed data on the origin
of consumption (imported or indigenous), the similar evolution of total
consumption and imports can be used as a proxy, so we analyze the impact of
remittances on imports during the period 1995-2017.
With an increase of remittances by 1%, imports increased on
average by 0.39% in Romania and 0.33% in Moldova. Therefore, Romanians
consume more imported products than Moldovans, and national consumption
of consumer goods seems to be better supported by the demand associated with
remittance spending in Moldova than in Romania. This can be explained
by the lack of supermarket chains in the Republic of Moldova, in contrast to
Romania. If households had a consumption model based predominantly on goods
and services of national origin, consumption would have
contributed to the development of the local and national business environment,
and implicitly to economic growth. However, significant import-intensive
consumption has negative effects on both the balance of payments and the
economy. At the same time, the increase of substitution imports has an adverse
effect on demand for indigenous products, which indirectly and negatively
affects the employment rate (Castles, 2010).
Remittances impact on household expenditures
in Romania and Moldova
Graph 5
We note that Romanians and Moldovans tend to consume more
as remittances increase. Between 1996 and 2017, both consumption and
remittances in Romania had an upward trend, with remittances explaining 84% of
the variation in current consumption expenditure. An increase of remittances
in the household budget by 1% allowed current consumption expenditure to grow
by 0.5%. A positive influence of remittances on current consumption
expenditure is also registered in the Republic of Moldova, where it increased
on average by 0.73% for each 1% rise in remittances in the period 2006-2017.
An increase in the endowment with durable goods can also be observed,
but the impact is not as important. On average, a rise of remittances
by 1% leads to an increase of only 0.09% in the endowment with such goods
in Romania, while in Moldova we note the lack of any link between these
two variables. This can be explained, on the one hand, by the fact that
there are people who no longer consider the possibility of returning home,
and the remittances received by the parents remaining in
the country are spent on health, education or current consumption. On the
other hand, we may be witnessing a flattening of the endowment with
durable goods, which is natural for a household that receives medium and long-
term remittances from multi-annual migration. In the case of the Republic of
Moldova we can add as an explanation the fact that remittances are directed
mainly to the consumption of current goods and services, in order to improve
the current standard of living.
Another category of household spending influenced by
remittances, according to Azizi (2018) and Ratha (2013), is health expenditure.
In the analyzed period, there is an increase in health care expenditure in
Romania and Moldova, which can also be attributed to remittances in
beneficiary households. Thus, 84.73% and 34.35%, respectively, of the variation
in health expenditure is explained by the change in remittance inflows in
Romania and Moldova (in the case of Moldova the result must be considered
more limited because of the shorter data series, 2006-2017). From an
economic point of view, the increase in health expenditure is
positively associated with the motivation of migrants to remit. Both for
single-member households, which usually involve short-
term or medium-term mobility with a possibility of return, and for
multiannual and / or permanent migrants who have left their parents or other
family members at home, a particular purpose of remittances is to cover the
costs of improving the quality of life, including health care services. A rise
in remittances by 1% facilitates, on average, an increase of the amounts spent
on health of over 0.67% in Romanian households and
of 0.91% in Moldovan households. So we can argue that the increase
in households' net disposable income due to remittances contributes to the
quality of life.
Education expenditure is another category of spending that is
important for the quality of life of the population and indirectly for the
economic benefits of the country of origin. The migration phenomenon and
the remittance decision also have implications for the educational field, both
positive and negative. On the one hand, education expenditure reflects the
amount that the family is willing to spend on the education of their children
in order to obtain a certain level of education. On the other hand, it is
influenced by the number of students who decide to attend high school and / or
university / postgraduate studies in
the country. In the case of a family that receives remittances, the net
available income increases, with a positive impact on the availability of
resources for the children's studies. However, rising spending
on education can reflect two situations:
- studies in the country of origin, with positive effects on the
development of human capital, the financing of educational institutions
and a greater likelihood of young graduates being integrated into their
home country;
- studies abroad, which have a negative impact on the development of
the education system by reducing the demand for initial education, but
also on the economic and social development of the origin countries,
if post-graduate employment takes place abroad.
Studying abroad opens the possibility of integration into the
labor market in the country of destination, decreasing the human capital in
Romania and Moldova respectively. On the other hand, the state will not be
able to recover the amounts invested in those students' primary or high
school education, if necessary. The same negative effect is also produced
by the migration decision of a household member, followed by family
reunification in the country of destination, through the migration of children
who have completed compulsory education or a part of it, financed by the
state.
The results obtained confirm, for Romania and Moldova, the findings of Adams
(et al, 2010) and Ambler (et al, 2015),
according to which remittances in the country of origin contribute to
increasing household spending on education. As a result of a 1%
increase in remittances, household spending on education in Romania
increased on average by 0.275%, while in the Republic of Moldova the impact
is higher, this expenditure increasing on average by 1.68%. The variation in
remittances explains 63% of this evolution of expenditure in Romania and
47% in the Republic of Moldova.
In addition, the stronger effect of remittances on household
spending on education in Moldova compared to Romania can also be
explained by the following:
- remittances are used in the Republic of Moldova more for
financing compulsory secondary education than for
tertiary education, which is optional. In addition, the enrollment rate in
tertiary education is lower in Moldova than in Romania, also because of the
similarity of language between the two countries. For this reason
some future students prefer to pursue university studies in
Romania rather than in Moldova, for its qualitative advantages
and/or different opportunities that are more attractive for employment after
graduation. The cost of completing compulsory education borne by the
household is significantly higher in Moldova than in
Romania;
- the intention to migrate after completing compulsory
education is higher among Moldovan youth than among
Romanians, the potential income differential being higher for the
medium and low skilled jobs to which labor/graduate migrants
have access in destination countries.
At the same time, the increase in remittances results in a drop in the
school population by 0.04% in Romania and 0.07% in Moldova, as opposed to
the results obtained by Zhunio (et al, 2012) and Azizi (2018) (who studied the
effects of remittances in underdeveloped and developing countries). Our results
are, on the other hand, in line with those obtained by Amuedo-Dorantes (et
al, 2010) and Mckenzie (et al, 2006), who analyzed the Dominican Republic
and Mexico, countries with an average level of economic development. The
results obtained can be explained by the differences in the level of economic
development of the analyzed states, the dynamics of integration in the EU
space, the free movement facilities between Romania and Moldova, as well as
the policy of support for the development of Moldova elaborated by Romania
(scholarships for Moldovan students, aid for R. Moldova from public funds in
Romania, etc.). Although Moldova is not a member of the European Union,
the large number of Moldovans with Romanian citizenship results in the
same migration behavior and preference for the EU space. The reduction in the
number of students may also be generated by the emergence of a trend among
young people whose family members were not in mobility to abandon further
studies in favor of migration, which is presented as generating financial
resources for them and their family members.
The synthesis of the research results confirms hypotheses H2, H3,
H4 and H5, highlights the specificities of the development conditions at
national level and of the stage reached in economic performance and social
inclusion, and justifies analyzing the impact of remittances on the country
of origin at both macroeconomic and microeconomic level.
Synthesis of the results of the analysis of the effect of remittances on
economic variables in Romania and Moldova, 1995-2017
Table 2.
| Dependent variable | Romania, macroeconomic | Romania, microeconomic | Moldova, macroeconomic | Moldova, microeconomic |
|---|---|---|---|---|
| Active population | -0.0224*** | x | -0.1282*** | x |
| Employed | -0.0126** | x | -0.1137*** | x |
| Employment rate | -1.6954*** | x | -6.7123*** | x |
| Unemployed | -0.0488*** | x | -0.3524*** | x |
| Unemployment rate | - | - | - | - |
| Total population consumption | 0.3285*** | x | 0.3572*** | x |
| Import | 0.3974*** | x | 0.3272*** | x |
| Trade balance | -0.0001 | x | -3.561 | x |
| Current consumption expenditure | x | 0.51373*** | x | 0.7345 |
| Endowment of durable goods | x | 0.09468*** | - | - |
| Implementation of ICT services (Internet access) | x | 13.59* | x | 9.709*** |
| Health expenditure | x | 0.67551*** | x | 0.9064 |
| Education expenditure | x | 0.27555*** | x | 1.6798** |
| Enrollment rate | x | -0.0426*** | x | -0.0782*** |
Thus, the comparative analysis of the impact of remittances on the country
of origin in Romania and the Republic of Moldova shows that the influence
of these external financial flows differs according to the variables
included in the research (Table 2), as follows:
- positive influence on household savings in Romania and on the
implementation of ICT products and services, with a direct impact at
microeconomic level and an indirect one at macroeconomic level;
- strong influence with a negative impact on the employment rate of
the population, with a direct impact at macroeconomic level and an indirect
one at microeconomic level;
- moderate influence with a positive impact on total population
consumption and imports, with a direct impact at macroeconomic level and an
indirect one at microeconomic level; and on current consumption, health and
education expenditure, with a direct impact at microeconomic level and an
indirect one at macroeconomic level;
- weak influence with a negative impact on the active and employed
population and on the enrollment rate, with a direct impact at macroeconomic
level and an indirect one at microeconomic level;
- lack of significant influence on the unemployment rate, which
indicates that labor mobility draws mainly on the employed and too
little on the unemployed in the country of origin, and points to the
potential impact of the underground economy.
At the same time, we note the lack of any statistically significant
influence of remittances on the development of entrepreneurship over the
entire analyzed period.
CONCLUSIONS
Remittances are the result of labor mobility and mainly emerge as
a motivation for migration for low- and middle-class people in
economically less developed and emerging middle-income countries.
Remittances have both macroeconomic and microeconomic effects,
through the effects they produce and through the destination of these
amounts (the analysis of the literature and the often-divergent results on
migration effects raised the question of the specific causes and/or
conditions that can generate such conflicting results). In the present
research stage, we tested the impact of remittances on two former socialist
countries, one of which has been an EU member since 2007, with a low
(Romania) and a high (R. Moldova) level of remittance flows as a share of GDP.
At the macroeconomic level, remittances balance the labor market
by reducing the number of the unemployed, which helps reduce
the demand for social services, but they also have a negative influence on
the size of the employed population. External labor mobility is a much
more attractive option for young people in training. This assessment is also
confirmed by the declining number of school and university students in both
countries, with the possibility of mobility for studies and/or work generating
potential human capital losses for the country of origin and a total or partial
loss of public investment in education.
At the same time, remittance inflows stimulate consumption and,
through multi-annual employment abroad, lead to the emergence of a more
expensive consumption pattern, preferably based on imports. In Romania and
Moldova, the consumption trend follows that of imports, which negatively
affects the balance of payments and domestic production. Besides creating
macroeconomic imbalances, this pattern means that the initiatives taken by
private entrepreneurs to differentiate the supply of goods and services are
constrained by competition from imported substitute products, which are
priced below those of domestic entrepreneurs. Internal market competition is
necessary and beneficial in the medium and long term, as it supports the
increasing competitiveness of domestic products. However, a pattern of current
consumption based predominantly on imported substitute products, driven not
by clear qualitative differences but rather by small price differences
or mere preferences, does not help the development of indigenous companies,
which should be supported by public policy. Such policies should also
stimulate entrepreneurship among people belonging to households with migrant
workers, encouraging their return and the development of businesses in
Romania and Moldova.
In this way, the benefits at the micro level can materialize in the
employment of graduates in the country of origin, the return of migrant
workers and the start-up of entrepreneurial businesses, rising living
standards in households, better health of household members, and the
possibility of raising the level of education and promoting the continuous
training of active people in the household and/or youth, etc. At
macroeconomic level, the following benefits may arise: the development of the
business environment and the increase of the working-age population, the
stimulation of consumption of indigenous products/services, tax revenues on
production and consumption, the reduction of pressure for aid and social
assistance for poor households, and the development of the health and
education sectors through demand for quality services (including preventive
health segments and, respectively, continuing tertiary education and lifelong
learning/specialization). In addition to these direct benefits, we can
identify and develop opportunities to spend remittance savings on
complementary purchases: cultural consumption, increased access to ICT goods
and services, recreational activities, housing construction (holiday homes),
etc.
The limitation of the research with respect to the analyzed period 1997-2017
is that, for some indicators, such as the value of household savings and the
share of households with Internet and computer access, datasets for
Romania are available only for the period 2007-2017, and for the Republic of
Moldova household spending types are available only from 2006 to 2017.
This research is exploratory, which is why we selected only
Romania (high share of international labor mobility and low share of
remittances in GDP) and Moldova (high share of international labor mobility
and high share of remittances in GDP). Our further research will include the
former communist countries of Europe and Asia (former USSR countries
and the COMECON area), which, after the transition to a market economy
and extensive economic restructuring, faced strong labor migration, mainly
driven by differences in earnings and working conditions relative to the
country of origin. In many cases the lack of decent employment opportunities
also explains the propensity to move to more developed countries. We
will aim to highlight the extent to which a typology of the impact of
remittances on the country of origin can be developed for the former
communist space.
References
1. Adams, R. H., & Cuecuecha, A., 2010, Remittances, Household Expenditure and
Investment in Guatemala. World Development, 38(11), 1626–1641.
https://doi.org/10.1016/J.WORLDDEV.2010.03.003
2. Ambler, K., Aycinena, D., & Yang, D., 2015, Channeling Remittances to Education:
A Field Experiment among Migrants from El Salvador. American Economic Journal:
Applied Economics, 7(2), 207–232. https://doi.org/10.1257/app.20140010
3. Amuedo-Dorantes, C., & Pozo, S., 2010, Accounting for Remittance and Migration
Effects on Children’s Schooling. World Development, 38(12), 1747–1759.
https://doi.org/10.1016/J.WORLDDEV.2010.05.008
4. Azizi, S., 2018, The impacts of workers’ remittances on human capital and labor
supply in developing countries. Economic Modelling, 75, 377–396.
https://doi.org/10.1016/J.ECONMOD.2018.07.011
5. Barajas, A., Chami, R., Fullenkamp, C., Gapen, M., & Montiel, P., 2009, Do
Workers’ Remittances Promote Economic Growth? Retrieved from
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.600.6354&rep=rep1&type=pdf
6. Bayar, Y., n.d., Economic Insights-Trends and Challenges Impact of Remittances
on the Economic Growth in the Transitional Economies of the European Union.
Retrieved from http://www.upg-bulletin-se.ro/archive/2015-3/1.Bayar.pdf
7. Beaton, K., Cerovic, S., Galdamez, M., Hadzi-Vaskov, M., Loyola, F., Koczan, Z., … Wong, J., 2017, Migration and Remittances in Latin America and the Caribbean:
Engines of Growth and Macroeconomic Stabilizers? In IMF Working Papers (Vol.
17). https://doi.org/10.5089/9781484303641.001
8. Boboc, C., Vasile, V., & Todose, D., 2012, Vulnerabilities Associated to Migration
Trajectories from Romania to EU Countries. Procedia - Social and Behavioral
Sciences, 62, 352–359. https://doi.org/10.1016/J.SBSPRO.2012.09.056
9. Bunduchi, E., Vasile, V., Comes, C.-A., & Stefan, D., 2019, Macroeconomic
determinants of remittances: evidence from Romania. Applied Economics, 51(35),
3876–3889. https://doi.org/10.1080/00036846.2019.1584386
10. Caragea, N., Dobre, A. M., & Alexandru, A. C., 2013, Profile Of Migrants In
Romania – A Statistical Analysis Using "R"; Working Papers. Retrieved from
https://ideas.repec.org/p/eub/wpaper/2013-04.html
11. Castles, S., 2010, Understanding Global Migration: A Social Transformation
Perspective. Journal of Ethnic and Migration Studies, 36(10), 1565–1586.
https://doi.org/10.1080/1369183X.2010.489381
12. Eggoh, J., Bangake, C., & Semedo, G., 2019, Do remittances spur economic
growth? Evidence from developing countries. The Journal of International Trade &
Economic Development, 1–28. https://doi.org/10.1080/09638199.2019.1568522
13. European Commission, 2017, Country Report Romania 2017. Retrieved from
https://ec.europa.eu/info/sites/info/files/2017-european-semester-country-report-romania-en.pdf
14. European Commission, 2018, Country Report Romania 2018. Retrieved from
https://ec.europa.eu/info/sites/info/files/2018-european-semester-country-report-romania-en.pdf
15. Fromentin, V., 2017, The long-run and short-run impacts of remittances on financial
development in developing countries. Quarterly Review of Economics and Finance,
66, 192–201. https://doi.org/10.1016/j.qref.2017.02.006
16. Giannetti, M., Federici, D., & Raitano, M., 2009, Migrant Remittances and
Inequality in Central-Eastern Europe. International Review of Applied Economics,
23(3), 289–307. https://doi.org/10.1080/02692170902811710
17. Giuliano, P., & Ruiz-Arranz, M., 2009, Remittances, financial development,
and growth. Journal of Development Economics, 90(1), 144–152.
https://doi.org/10.1016/j.jdeveco.2008.10.005
18. Imai, K. S., Gaiha, R., Ali, A., & Kaicker, N., 2014, Remittances, growth and
poverty: NEW evidence from Asian countries. Journal of Policy Modeling, 36(3),
524–538. https://doi.org/10.1016/j.jpolmod.2014.01.009
19. Javed, M., Awan, M. S., & Waqas, M., 2017, International Migration, Remittances
Inflow and Household Welfare: An Intra Village Comparison from Pakistan. Social
Indicators Research, 130(2), 779–797. https://doi.org/10.1007/s11205-015-1199-8
20. Jr, R. H. A., Cuecuecha, A., & Tlaxcala, E. C. De., 2013, The Impact of Remittances
on Investment and Poverty in Ghana. World Development, 50, 24–40.
https://doi.org/10.1016/j.worlddev.2013.04.009
21. Kumar, R. R., Stauvermann, P. J., Kumar, N. N., & Shahzad, S. J. H., 2018,
Revisiting the threshold effect of remittances on total factor productivity growth in
South Asia: a study of Bangladesh and India. Applied Economics, 50(26), 2860–
2877. https://doi.org/10.1080/00036846.2017.1412074
22. Lartey, E. K. K., Mandelman, F., & Acosta, P. A., 2008, Remittances, Exchange
Rate Regimes, and the Dutch Disease: A Panel Data Analysis. SSRN Electronic
Journal. https://doi.org/10.2139/ssrn.1109206
23. Leon-Ledesma, M., & Piracha, M., 2004, International Migration and the Role of
Remittances in Eastern Europe. International Migration, 42(4), 65–83.
https://doi.org/10.1111/j.0020-7985.2004.00295.x
24. Lim, S., & Simmons, W. O., 2015, Do remittances promote economic growth in the
Caribbean Community and Common Market? Journal of Economics and Business,
77, 42–59. https://doi.org/10.1016/j.jeconbus.2014.09.001
25. Matuzeviciute, K., & Butkus, M., 2016, Remittances, Development Level, and
Long-Run Economic Growth. Economies, 4(4), 28.
https://doi.org/10.3390/economies4040028
26. Mckenzie, D., Rapoport, H., Bauer, T., Hanson, G., Jouneau, F., Licandro, O., & Lopez, E., 2006, Can migration reduce educational attainment? Evidence
from Mexico (No. 3952). Retrieved from
http://siteresources.worldbank.org/DEC/Resources/Can_Migration_reduce_Educational_Attainment.pdf
27. Medina, C., & Cardona, L., 2010, The Effects of Remittances on Household
Consumption, Education Attendance and Living Standards: the Case of Colombia.
In Lecturas de Economía (Vol. 72). Retrieved from
http://aprendeenlinea.udea.edu.co/revistas/index.php/lecturasdeeconomia/article/viewFile/6498/5960
28. Meyer, D., & Shera, A., 2017, The impact of remittances on economic growth:
An econometric model. EconomiA, 18(2), 147–155.
https://doi.org/10.1016/J.ECON.2016.06.001
29. Ratha, D., 2013, THE IMPACT OF REMITTANCES ON ECONOMIC GROWTH
AND POVERTY REDUCTION. Retrieved from www.knomad.org/powerpoints/
30. Vadean, F., Randazzo, T., & Piracha, M., 2019, Remittances, Labour Supply and
Activity of Household Members Left-Behind. Journal of Development Studies,
55(2), 278–293. https://doi.org/10.1080/00220388.2017.1404031
31. Vasile, V., Boboc, C., Pisica, S., & Cramarenco, R. S., 2013, The estimation of
the impact of free movement of Romanian workers in EU region from 01.01.2014;
realities and trends from economic, employment, and social perspectives, at
national and European level, Study no 3 / SPOS. Retrieved from www.ier.ro
32. Zhunio, M. C., Vishwasrao, S., & Chiang, E. P., 2012, The influence of remittances
on education and health outcomes: a cross country study. Applied Economics,
44(35), 4605–4616. https://doi.org/10.1080/00036846.2011.593499
Estimation of Number of Persons Per Household Based on Characteristics of Consumption Items - utilization of big-data to improve the Consumption Trend Index in Japan -
Anri Mutoh ([email protected])
National Statistics Center, Japan
Masayo Yamashita ([email protected])
National Statistics Center, Japan
Yoshiyasu Tamura ([email protected])
National Statistics Center, Japan
Masahiro Matsumoto ([email protected])
National Statistics Center, Japan
ABSTRACT
The article suggests the possibility of utilizing big-data held by companies
by integrating it with official statistics data. Official statistics agencies in
Japan have sought to develop a Consumption Trend Index (CTI) in cooperation with
academic researchers and with companies as providers of the big-data. One of the
important roles of the CTI is to indicate the trend of one-person household
consumption more accurately; the big-data is therefore expected to reinforce
existing official micro-data, especially for one-person households. However, the
obtainable big-data seldom includes the number of household members, so this
missing value needs imputation. We therefore estimate the number of members in
each household according to the characteristics of consumption items in the
traditional Japanese household expenditure survey. We used logistic regression
with an L1 penalty (Lasso regression) for the analysis, with each type of
household as the response variable and purchase items as the explanatory
variables. As a result, one-person households and two-or-more-person households
are identified by their purchasing tendencies, and the household characteristics
become evident.
Keywords: Consumption trend, household accounts, statistical imputation,
logistic regression, LASSO, R package ‘glmnet’
JEL classification: D13, D16, D90, P44, Z13
1. INTRODUCTION
1.1 Big-data for Consumption Trend Index
We researched the utilization of big-data for official statistics.
Since 2017, in Japan, the Statistics Bureau, Ministry of Internal Affairs
and Communications, the Statistical Research and Training Institute, and the
National Statistics Center have been researching the development of a
novel Consumption Trend Index (CTI) in cooperation with professors and
with commercial companies as the data holders (Statistics Bureau, 2017, 2018a).
The CTI is an index that enables consumption trends to be grasped quickly and
comprehensively. There are two types of CTI: macro-level (CTI Macro)
and micro-level (CTI Micro) (The Consumer Statistics Division of Statistics
Bureau, 2018). The CTI Macro provides an early estimate of the monthly
trend in the Household Final Consumption Expenditure of GDP. In contrast,
the CTI Micro indicates the monthly trend in household average expenditure
by major consumption items. To further improve the CTI, our
research group plans to utilize big-data held by companies as part of the
input data of the CTI. In this paper, we deal in particular with the fusion of
big-data and the source data of the CTI Micro.
The utilization of big-data as a source of the CTI Micro is expected to
reflect the consumption tendency of one-person households more accurately.
In Japan, even though one-person households account for about one third of the
population (Statistics Bureau, 2018b), it is difficult to survey them
in the Family Income and Expenditure Survey (FIES). One piece of
published, if somewhat dated, evidence of this difficulty is the paradata
research on the FIES by Hamasuna (1980). According to that
research, one-person households tended to be absent and required many
revisits for the survey. This tendency is considered to have persisted into
the 2010s. To deal with this difficulty, the source data of the CTI Micro
include the Single Household Expenditure Monitor Survey in addition to the FIES
and the Survey of Household Economy. The big-data becomes a source of
information to reinforce these surveys.
1.2 Details and issues of big-data for the CTI Micro
Data from loyalty programs and from online personal finance software
are considered usable big-data for the CTI Micro. Their advantages are 1)
the ability to obtain enormous amounts of data automatically and instantly, 2)
that the data items correspond to part of the consumption items in the FIES
(namely, they are a proper subset), and 3) that the data consist of samples
whose unit is a user of a loyalty program or of personal finance software.
These big-data include the user’s individual age and sex;
however, they rarely include household information, so
the number of household members of each sample is unclear. Since the input
data of the CTI Micro consist of samples whose unit is the household, the
big-data need imputation of the missing value: the number of household members.
As mentioned above, it is important but difficult to survey one-person
households for the FIES; thus, at least the data on one-person households
have to be identified and used.
1.3 Purpose
The purpose of this paper is to estimate the number of persons per
household from the consumption items and to clarify which consumption items
characterize each household type, in order to impute the big-data and
to suggest the possibility of its utilization for the CTI Micro.
2. RELATED STUDY
2.1 Big-data and Official Statistics
Research on economic and social systems using big-data
has increased in recent years (Japec et al., 2015). The same trend is seen
in official statistics. Struijs et al. (2014) noted that opportunities for
collaboration between official statistics agencies, businesses and
universities have increased in connection with big-data research at National
Statistical Institutes (in the Netherlands), and reviewed the issues and
challenges of using big-data in official statistics.
Research on Consumer Price Indices (CPI) is especially active
among the studies using big-data for official statistics. For example, the
Office for National Statistics (in the UK) has published several articles that
estimated experimental CPIs using web-scraped data, and in the Federal
Statistical Office (in Germany) 10,000 prices on the web are collected
automatically per month and used for the harmonized index of consumer prices
(Blaudow and Burg, 2018). However, few studies use big-data as part of
official micro-data.
2.2 Stochastic regression imputation methods
Statistical imputation is one of the most important fields in official
statistics. In recent years, multiple imputation of missing values has been
commonly used, and a wide variety of software is available (Takahashi and Ito,
2013). In this paper we do not deal with multiple imputation but with
stochastic regression imputation, because it is possible to design regression
models for imputation here: unlike ordinary missing values, the values in
question are missing for a fully known reason, and highly reliable reference
data are available in the FIES. Imputation by stochastic
regression is appropriate for the purpose of complementing the structure of the
FIES.
In this paper, we use logistic regression with the L1 norm as the model
for imputation, but few previous studies have used such a model for
stochastic regression imputation.
3. METHODOLOGY
3.1 FIES data
The data for analysis were retrieved from the January 2010 FIES
conducted in Japan. The FIES had two types of survey, one for one-person
households and one for two-or-more-person households. There were 700
one-person households and approximately 7,800 two-or-more-person
households in total. Although the two types of survey were different, their
contents were almost the same, comprising the demographic characteristics of
the householder and family members, and the purchased items as represented by
price amount or frequency.
We take the classes of the response variable to be one-person
households, two-person households, three-person households,
and four-or-more-person households, because about 90 percent of the
two-or-more-person households have two to four members. Five-person
households account for only 9 percent of the two-or-more-person households
(see Table 1), and they show little difference from four-person households
in terms of the total spent.
Number and percentage of each type of household in the FIES data
Table 1
| | one person | two-or-more persons | two persons | three persons | four persons | five persons | six persons | seven-or-more persons |
|---|---|---|---|---|---|---|---|---|
| (number) | 700 | 7801 | 3165 | 2019 | 1719 | 676 | 182 | 40 |
| (percentage) | | 100% | 41% | 26% | 22% | 9% | 2% | 1% |
Although there is a positive correlation between the
number of household members and the total amount spent, the histograms of the
total spent by household size clearly overlap. Figure 1
shows the histograms and density plots with a uniform number of households.
We therefore aim to identify the items with less overlap among household types.
Histogram and density plot of total spent per household
Figure 1
3.2 Lasso regression
The FIES data contain almost 600 purchase items as explanatory
variables, yet the actual observations contain many zero values. Therefore,
we conducted the regression analysis proposed by Tibshirani (1996), the
so-called Lasso regression. It performs variable selection and minimization
of the prediction error simultaneously by adding the L1 norm as a
penalty. Since the L1 penalty shrinks some of the estimated parameters to
exactly zero, the variables for regression are selected automatically. Let
$y = (y_1, \dots, y_n)^T$ be the response vector and $X = (x_{ij})$ an $n \times p$
matrix of the explanatory variables, respectively, to give an $n \times (p+1)$ data
matrix. The problem thus takes the form of eqn [1]:
$$\hat{\beta} = \underset{\beta}{\arg\min} \sum_{i=1}^{n} \Bigl( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t$$
[1]
Here $t$ is a tuning parameter. We are considering standardized data,
and hence we omit the intercept $\beta_0$. The sum of the absolute values of
the regression coefficients is the L1 penalty. Introducing the regularization
parameter $\lambda$ by the method of Lagrange’s undetermined multipliers, the
Lasso regression model is defined as eqn [2]:
$$\hat{\beta} = \underset{\beta}{\arg\min} \Bigl\{ \sum_{i=1}^{n} \Bigl( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \Bigr\}$$
[2]
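As an illustration of how an L1-penalized least-squares problem of this kind can be solved, the following is a minimal Python sketch of cyclic coordinate descent with soft-thresholding. It is not the glmnet implementation used in this paper: the function names, the toy data and the unscaled objective (sum of squared residuals plus lambda times the L1 norm, with no intercept) are assumptions made for the example.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values in [-gamma, gamma] become exactly 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize sum_i (y_i - sum_j x_ij * b_j)^2 + lam * sum_j |b_j|.

    X is a list of rows (assumed standardized, so no intercept is fitted)
    and y is the list of responses. Illustrative helper, not a library API.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of column j with the partial residual
            norm = 0.0  # squared norm of column j
            for i in range(n):
                fit_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
                norm += X[i][j] ** 2
            # For this unscaled objective the threshold is lam / 2.
            beta[j] = soft_threshold(rho, lam / 2.0) / norm if norm > 0 else 0.0
    return beta


# Hypothetical toy data: y = 3 * x1 + 0.1 * x2 with orthogonal columns.
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [3.1, 2.9, -2.9, -3.1]
beta = lasso_coordinate_descent(X, y, lam=2.0)
print(beta)  # the weak second coefficient is set to exactly zero
```

With the penalty switched off (`lam=0.0`) the same routine returns the ordinary least-squares coefficients, which makes the shrinkage and selection effect of the penalty easy to see on the toy data.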
Because we aim to estimate the types of households, the model uses the
logit as the link function. The analysis environment is R 3.4.4 with
the package ‘glmnet’ ver. 2.0-16. The estimation algorithm for Lasso
regression in this package is coordinate descent, which updates one
coefficient at a time while holding the others fixed and repeats the updates
until convergence (Friedman et al., 2010). The coefficient of the L1 norm
(lambda) was determined by 10-fold cross-validation on the misclassification
error: we selected the largest lambda whose error is within one
standard error of the minimum.
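The selection rule described here (the largest lambda whose cross-validated error lies within one standard error of the minimum) can be sketched as follows. This is a minimal Python illustration; the function name and the cross-validation figures are hypothetical and are not taken from the paper.

```python
def select_lambda_1se(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from per-lambda cross-validation summaries.

    lambdas: candidate penalty values
    cv_mean: mean misclassification error over the folds for each lambda
    cv_se:   standard error of that mean for each lambda
    """
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[best] + cv_se[best]
    within_one_se = [lambdas[i] for i in range(len(lambdas)) if cv_mean[i] <= threshold]
    return lambdas[best], max(within_one_se)


# Hypothetical cross-validation results (illustrative numbers only).
lambdas = [0.001, 0.003, 0.01, 0.03, 0.1]
cv_mean = [0.063, 0.058, 0.060, 0.065, 0.110]
cv_se = [0.004, 0.004, 0.004, 0.005, 0.008]

lam_min, lam_1se = select_lambda_1se(lambdas, cv_mean, cv_se)
print(lam_min, lam_1se)
```

Choosing the larger lambda trades a small, statistically indistinguishable loss in accuracy for a sparser model. In glmnet, `cv.glmnet` reports the two corresponding values as `lambda.min` and `lambda.1se`.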
3.3 Data preprocessing
As data preprocessing, we first extracted the purchased items common
to one-person households and two-or-more-person households. Next, we
calculated the correlation between each pair of items and summarized those
pairs with a correlation coefficient over 0.7. The reason for this
preprocessing is that variable selection by Lasso regression becomes unstable
when explanatory variables are highly correlated. We applied the
rank correlation as well as the linear correlation, but the linear correlation
summarized more items than the rank correlation. This yielded 100 pairs of
highly correlated items, all of which have class and subclass
relationships. For example, the pair {‘Raw meat’, ‘Beef’} has a correlation
coefficient of 0.726. In this case, ‘Raw meat’ is the larger class that
includes ‘Beef’. For pairs in this kind of hierarchical relationship, the
subclass items are omitted for efficient modeling. If the class and the
subclass behave similarly, it is reasonable to keep the larger class, which is
also affected by its other subclasses.
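The pruning step described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the class/subclass map `parent_of` is supplied by hand, the expenditure values are made up, and only the linear (Pearson) correlation is used; the item names are borrowed from the text for readability.

```python
from math import sqrt


def pearson(a, b):
    """Linear (Pearson) correlation between two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (v - mean_b) for x, v in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def drop_correlated_subclasses(columns, parent_of, threshold=0.7):
    """Drop a subclass item when it correlates with its class above the threshold.

    columns:   dict mapping item name -> list of values per household
    parent_of: dict mapping subclass name -> class name (hypothetical hierarchy)
    """
    dropped = set()
    for child, parent in parent_of.items():
        if child in columns and parent in columns:
            if pearson(columns[child], columns[parent]) > threshold:
                dropped.add(child)
    return {name: values for name, values in columns.items() if name not in dropped}


# Made-up expenditures for five households.
columns = {
    "Raw meat": [10, 0, 25, 5, 40],
    "Beef": [6, 0, 15, 0, 30],
    "Taxi fares": [0, 12, 0, 3, 0],
}
parent_of = {"Beef": "Raw meat"}

kept = drop_correlated_subclasses(columns, parent_of)
print(sorted(kept))
```

Keeping the class rather than the subclass preserves the spending signal of the remaining subclasses, which matches the reasoning given in the text.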
After the data preprocessing, the data still contained almost 500
purchased items as explanatory variables. This suggests that no class could
fully explain the features of all of its subclasses. Purchased items as represented
by both the price amount and the frequency are processed in the same way.
4. RESULTS & DISCUSSION
4.1 Multinomial model
First, we consider a multinomial model in which the response
variable takes the four types of households: one-person households, two-person
households, three-person households and four-or-more-person households.
Table 2 shows that the prediction accuracy of the multinomial model is 0.657,
which is poor. A similar level of accuracy is obtained whether we use data
represented by the price amount or by the frequency. This result suggests that
it is difficult to identify items with less overlap using the
multinomial model.
The confusion matrix and the prediction accuracy
of the multinomial model
Table 2
| actual \ predicted | one-person | two-person | three-person | four-or-more-person |
|---|---|---|---|---|
| one-person | 288 | 409 | 2 | 1 |
| two-person | 65 | 2833 | 173 | 94 |
| three-person | 18 | 990 | 495 | 516 |
| four-or-more-person | 12 | 379 | 253 | 1973 |
accuracy: 0.657
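The reported accuracy can be checked directly from the counts in Table 2. The short Python sketch below (function names are ours) computes the overall accuracy and, as an additional illustration not reported in the paper, the share of each actual class that is predicted correctly.

```python
def accuracy(matrix):
    """Share of all households on the diagonal (rows = actual, columns = predicted)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall_per_class(matrix):
    """Share of each actual class (row) that is predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


# Counts from Table 2: one-, two-, three- and four-or-more-person households.
table2 = [
    [288, 409, 2, 1],
    [65, 2833, 173, 94],
    [18, 990, 495, 516],
    [12, 379, 253, 1973],
]

acc = accuracy(table2)
recalls = recall_per_class(table2)
print(round(acc, 3))  # 0.657
print([round(r, 3) for r in recalls])
```

The per-class shares show where the multinomial model struggles: most actual one-person and three-person households are assigned to a neighbouring class.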
4.2 Binomial model
According to Section 1, we have to identify the data of one-person
households. We therefore consider binomial models whose response variable is
a dichotomy between one-person households and the others, in order to indicate
the items that are simply affected by the purchasing activity of multiple
persons.
As a result, all the binomial models have a prediction accuracy over
0.9, and the accuracy is similar between the price amount and the
frequency. Table 3 shows the confusion matrices and prediction accuracies. The
columns of each matrix show the predicted classes, while the rows show the
actual classes.
The confusion matrix and the prediction accuracy of the binomial model
Table 3
| model | actual | predicted: multiple-person class | predicted: one-person | accuracy |
|---|---|---|---|---|
| two-or-more-person vs one-person | two-or-more-person | 7547 | 254 | 0.942 |
| | one-person | 240 | 460 | |
| three-or-more-person vs one-person | three-or-more-person | 4552 | 84 | 0.969 |
| | one-person | 84 | 616 | |
| four-or-more-person vs one-person | four-or-more-person | 2572 | 45 | 0.972 |
| | one-person | 48 | 652 | |
Figure 2 shows the lambda coefficient plots for the one-person and
two-or-more-person binomial model, together with its solution paths. The
lambda coefficient plots represent the misclassification error for each lambda
in the cross-validation; the solution paths show how the coefficients change
with lambda. There are two plots: plot-a is for the data represented by the
price amount, and plot-b is for the frequency. The solid lines in the
respective plots indicate the lambda that minimizes the misclassification
error. The dashed line indicates the largest lambda within one standard error
of the minimum misclassification
error. We select the optimal lambda as indicated by the dashed line, which is
for the model with the purchase price amount and
for the model with the frequency of purchased items. The two models retained
84 and 114 variables, respectively.
The top (or bottom) 10 coefficients of the binomial models for
one-person and two-or-more-person households are shown in Tables 4 and 5. The
tables show the coefficients for the purchase price amount and for the
frequency of the purchased items, respectively. The dummy variable for the
response is coded 0 for two-or-more-person households and 1 for one-person
households. Therefore, a positive coefficient represents an item that
characterizes a one-person household, while a negative coefficient represents
an item that characterizes a multiple-person household.
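To make the reading of the signs concrete, the following Python sketch shows how a positive or negative coefficient moves the predicted probability of a one-person household under a logit link. The two coefficients are taken from Table 4; the intercept, the input values and the assumption that inputs are on a standardized scale are hypothetical, so the probabilities are illustrative only.

```python
from math import exp


def one_person_probability(intercept, coefficients, x):
    """Logistic model: P(one-person household) = 1 / (1 + exp(-score))."""
    score = intercept + sum(
        coefficients[name] * x.get(name, 0.0) for name in coefficients
    )
    return 1.0 / (1.0 + exp(-score))


# Coefficients from Table 4 (price amount model); the intercept is hypothetical.
coefficients = {"Drinking": 0.15, "Meat": -0.91}
intercept = -2.0

baseline = one_person_probability(intercept, coefficients, {})
more_drinking = one_person_probability(intercept, coefficients, {"Drinking": 2.0})
more_meat = one_person_probability(intercept, coefficients, {"Meat": 2.0})

print(baseline, more_drinking, more_meat)
```

Higher spending on an item with a positive coefficient raises the predicted probability of a one-person household, and higher spending on an item with a negative coefficient lowers it.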
Lambda coefficient plots and solution paths in the one-person and two-or-more-person binomial model
Figure 2
a. The lambda coefficient plots (left) and the solution paths (right) in the
model with purchase price amount as the explanatory variable
b. The lambda coefficient plots (left) and the solution paths (right) in the
model with frequency of purchased items as the explanatory variable
Based on Table 4, the items in third place and lower have coefficients
of less than 0.1. This suggests that the characteristics that identify a
one-person household are less pronounced in the purchase price amount per item.
Focusing on the items with high coefficients, however, ‘Drinking’ has the
largest coefficient in terms of the price amount for a one-person household,
followed by ‘Taxi fares’. This indicates that relatively high unit prices for
services and foods affect their identification.
On the other hand, ‘Pocket money’, ‘Fuel, light & water charges’
and ‘Food’ have large purchase price amount coefficients for two-or-more-person
households, while ‘Paper diapers’ and ‘Communication’ also have large
coefficients. This indicates that items consumed in proportion to the number
of people, and items tied to particular life stages, affect the identification
of multi-person households. For example, the variable ‘Communication’ reflects
the tendency for the number of contracts to increase with the number of
household members, since communication charges are fixed amounts per contract
line and are therefore proportional to the number of lines.
The coefficients of the binomial model by one-person and two-or-more-person household with the purchase price amount as the explanatory variable
Table 4
the one-person household                            the two-or-more-persons household
corresponding item                     coefficient  corresponding item                       coefficient
Drinking                               0.15         Meat                                     -0.91
Taxi fares                             0.11         Pocket money (Unexplained expenditure)   -0.88
Coffee beverages                       0.08         Fuel, light & water charges              -0.62
Other admission fees & game charges    0.07         Paper diapers                            -0.35
Women's nightwear                      0.05         Gasoline                                 -0.33
Salad                                  0.05         Food                                     -0.33
Tea                                    0.04         Communication                            -0.32
Other refreshments (Cafe)              0.03         Eggs                                     -0.30
Haircut charges                        0.03         Oil, fats & seasonings                   -0.30
Contact lenses                         0.02         Soybean products                         -0.26
According to Table 5, ‘Rents for dwelling & land’ and ‘Rents for
dwelling, issued houses’ have large coefficients in terms of the frequency of
items purchased by one-person households. This indicates the low rate of
house ownership among one-person households. In the food category, ‘Coffee &
cocoa’, ‘Salad’ and ‘Beer’ have the larger coefficients. These are nonessential
grocery items with high unit prices in Japan.
On the other hand, for two-or-more-person households,
the items with large purchase price amount coefficients show similarly large
coefficients for the purchase frequency. Daily necessities and items relating
to child care have the greater effect on the identification.
These variables are only some of the 84 in the model with the purchase
price amount and the 114 in the model with the frequency of purchased items.
This means that at least 84 items are required to obtain the above estimation
accuracy. Moreover, the set of variables whose coefficients are estimated to
be 0 by Lasso regression is unstable. Simply collecting these 84 variables
would therefore not be sufficient, and the approach is still a long way from
practical use.
The coefficients of the binomial model by one-person and two-or-more-person household with the frequency of purchased items as the explanatory variable
Table 5
the one-person household                            the two-or-more-persons household
corresponding item                     coefficient  corresponding item                       coefficient
Rents for dwelling & land              0.23         Pocket money (Unexplained expenditure)   -1.83
Coffee & cocoa                         0.22         Meat                                     -1.46
Hospital charges                       0.21         Food                                     -0.84
Cut flowers                            0.19         Education                                -0.82
Rents for dwelling, issued houses      0.17         Paper diapers                            -0.60
Salad                                  0.12         Eggs                                     -0.46
Beer                                   0.11         Fish & shellfish                         -0.27
"Onigiri" & others (rice ball)         0.09         Medical care                             -0.26
Obligation fees related to dwelling    0.07         Furniture & household utensils           -0.23
Taxi fares                             0.07         Private transportation                   -0.21
4.3 Effect of age
In Tables 4 and 5, the items ‘Taxi fares’, ‘Cut flowers’, and ‘Hospital
charges’ of one-person households tend to be consumed more by elderly
people. This reflects a known tendency of the FIES sample.
In fact, the over-65 category accounts for a large percentage of the
age class among one-person households in the FIES (Statistics Bureau, 2005-
2015). Among two-or-more-person households, middle-aged householders account
for a large percentage. Table 6 shows the age class of householders
in the FIES. The proportion of elderly householders increases over time.
Householder distribution by age class in FIES (data from ‘e-Stat’ provided by Statistics Bureau (2005-2015))
Table 6
        one-person household                 two-or-more-person household
year    under 35   35-59   60 or more        under 35   35-59   60 or more
2005    26%        29%     45%               9%         50%     41%
2010    21%        28%     51%               7%         47%     45%
2015    18%        27%     55%               6%         42%     52%
One type of big data that the cooperating companies plan to provide
is data from online personal finance software. Utilization of online
personal finance software is low among elderly people.
Therefore, there is a particular need to adjust for age class when matching
the FIES data to the big data.
From the above, age may affect specific purchased items as strongly as
household type does. Finally, we describe below how the estimation accuracy
of multinomial models can be improved by using the demographic items that
influence specific purchased items.
4.4 Improvement of prediction accuracy for the multinomial model
The binomial model of one-person and four-or-more-person
households has acceptable accuracy, but the multinomial model does not. It
is difficult to estimate household size from purchased items because
the characteristics of one household size are also contained in those of the
others. For example, as it stands, some of the items bought by one-person
households are also bought by two-or-more-person households. As a potential
solution to this problem, we propose using the demographic items that are
included in the big data and that have an antagonistic effect on the
consumption items. Namely, we attempt to improve prediction accuracy by using
not only consumption items with a small degree of overlap among household sizes
but also other items that are influenced by the demographic items included in
the big data, which are age and sex. Although age appeared in the previous
section to be one of the influential demographic items for specific purchased
items, in this section we discuss sex because it is more evenly distributed.
For variable selection, this amounts to selecting the particular purchased
items affected by demographic items. Therefore, a generalized linear
mixed model (GLMM) was used for the variable selection, with sex as the
response variable and, as explanatory variables, the consumption items that
loaded on the one-person household in the multinomial model. Here,
age is used as the random effect.
When the variable selection was actually performed with the GLMM using
the R package ‘lme4’, the items selected by the binomial model
(one-person versus two-or-more-person) that were significantly affected by
gender were ‘Drinking’ and ‘Apples’. Moreover, their coefficients are
antagonistic by gender. A high purchase price amount for both Drinking and
Apples therefore indicates that multiple individuals purchased antagonistic
products. These results may be useful when the probabilities of one-person and
two-person households are similar in the multinomial logistic Lasso regression
model that simply estimates the number of people per household.
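The kind of model described in this section can be sketched with `glmer` from lme4. Everything below is an assumption for illustration: the data are simulated, the item names and effect sizes are invented, and treating age as a random intercept over age classes is one possible reading of "age is used as the random effect", not a description of the authors' exact specification.

```r
# Sketch only: simulated households, not the FIES micro-data.
if (requireNamespace("lme4", quietly = TRUE)) {
  set.seed(42)
  n <- 800
  age_class <- factor(sample(paste0("age", 1:8), n, replace = TRUE))
  age_shift <- rnorm(8, sd = 0.5)          # hypothetical age-class effects

  hh <- data.frame(
    age_class = age_class,
    drinking  = rexp(n),                   # hypothetical purchase amounts
    apples    = rexp(n),
    salad     = rexp(n)
  )
  # Arbitrary simulated truth: two items with opposite-signed effects on sex
  eta <- 0.8 * hh$drinking - 0.8 * hh$apples +
    age_shift[as.integer(hh$age_class)]
  hh$sex <- rbinom(n, 1, plogis(eta))

  # Sex as response, items as fixed effects, age class as random intercept
  fit <- lme4::glmer(sex ~ drinking + apples + salad + (1 | age_class),
                     data = hh, family = binomial)

  coefs <- summary(fit)$coefficients
  # Items whose fixed effect differs significantly from zero
  is_item  <- rownames(coefs) != "(Intercept)"
  selected <- rownames(coefs)[is_item & coefs[, "Pr(>|z|)"] < 0.05]
}
```

Items that are selected with coefficients of opposite sign play the role of the "antagonistic" items discussed above.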
5. CONCLUSIONS
The purpose of this paper was to estimate household size and then to
indicate the consumption items that represent household characteristics, in
order to impute the missing information in the provided big data and integrate
it into the sources of the CTI Micro.
We analyzed the FIES micro-data using logistic Lasso regression
analysis. The estimation conducted using the multinomial model, which
distinguishes between one-, two-, three-, and four-or-more-person households,
does not have good prediction accuracy. In contrast, the binomial model that
distinguishes one-person from multiple-person households does have good
accuracy. According to the coefficients of the binomial model, one-person
households tend to consume high-unit-price nonessential grocery items and
services, while four-or-more-person households tend to consume foods and daily
necessities corresponding to different life stages.
Though it has been difficult to survey one-person household expenditures,
the result implies that it is possible to obtain one-person household
consumption data from the big data of loyalty programs and online personal
finance software. Moreover, items such as sex and age, which are included in
the big data and have an antagonistic effect on the consumption items, could
improve the poor prediction accuracy of the multinomial model.
However, variable selection by Lasso regression is unstable. In future work
we should investigate the detailed relationship between the variables and the
prediction errors in order to improve stability. We are considering using
machine learning methods such as decision trees to handle interaction terms
and to stabilize the selected variables. We should also consider carefully the
semi-continuous data used as explanatory variables in sparse estimation.
Although few studies have treated semi-continuous data as explanatory
variables, such studies are important because consumption item data are mostly
semi-continuous.
Official statistics agencies in Japan have summarized and combined
official survey data into economic indicators, but they have done little
analysis of the data for modeling. This study is rare among them because
the FIES, which so far has mostly been used for descriptive statistics, is
analyzed with a mathematical model in anticipation of application to big data.
Thus, the CTI project is also meaningful as an attempt to develop official
statistics in Japan. Since the FIES has a huge volume of data, and treats
surveying and summarizing as its first priority, it is difficult to identify
consistent effects in that data. However, the above analysis suggests the
possibility of identifying characteristics that are important for merging the
big data and the FIES.
Acknowledgements:
We are grateful to members of the CTI project and our research
department for helpful discussions and thoughtful comments. The authors
wish to thank the editors and referees for their fruitful suggestions. The views
expressed here are those of the authors and not necessarily those of other
members of the institute.
References:
1. Blaudow, C., Burg, F., 2018, “Dynamic Pricing as a Challenge for Consumer
Price Statistics”, EUROSTAT REVIEW ON NATIONAL ACCOUNTS AND
MACROECONOMIC no. 1, 79-93.
2. Breton, R., Flower, T., Mayhew, M., Metcalfe, E., Milliken, N., Payne, C.,
... & Woods, A., 2016, “Research indices using web scraped data: May 2016
update”, Newport: Office for National Statistics.
3. Friedman, J., Hastie, T., Tibshirani, R., 2010, “Regularization paths for generalized
linear models via coordinate descent”, Journal of statistical software, 33(1), 1.
4. Hamasuna, K., 1980, “Current Status of Statistical Survey”, Hosei university Japan
statistics research institute report, 05, 18-53. (Japanese only)
5. Japec, L., Kreuter, F., Berg, M., Biemer, P., Decker, P., Lampe, C., ... & Usher,
A. 2015, “Big data in survey research: AAPOR task force report”, Public Opinion
Quarterly, 79(4), 839-880.
6. [electronic sources] Statistics Bureau, 2005-2015, “Family Income and Expenditure
Survey”, one-person household annual data available from:
https://www.e-stat.go.jp/stat-search/files?page=1&layout=datalist&toukei=00200561&tstat=000000330001&cycle=7&tclass1=000000330001&tclass2=000000330022&tclass3=000000330023
(Accessed 10.06.2019), two-or-more-person household annual data available from:
https://www.e-stat.go.jp/stat-search/files?page=1&layout=datalist&toukei=00200561&tstat=000000330001&cycle=7&tclass1=000000330001&tclass2=000000330004&tclass3=000000330005&result_back=1,
e-Stat. Statistics Bureau, Ministry of internal affairs and communications.
7. Statistics Bureau, Ministry of internal affairs and communications, 2017,
“Establishment of Consumption Trend Index Research Council”, available from:
https://www.stat.go.jp/data/cti/pdf/ho20170728.pdf (Accessed 10.06.2019), Statistics
Bureau, Ministry of internal affairs and communications. (Japanese only)
8. [web page] Statistics Bureau, Ministry of internal affairs and communications,
2018a, “Statistics for Japan’s Future’- A Quick Reference.”, available from:
https://www.stat.go.jp/english/info/guide/2018guide.html#p0201 (Accessed 10.06.2019),
Statistics Bureau, Ministry of internal affairs and communications.
9. [web page] Statistics Bureau, Ministry of internal affairs and communications,
2018b, “Telecommunications Annual Report 2018 – Part 1: Sustainable growth
by ICT in the era of population decline”, available from:
http://www.soumu.go.jp/johotsusintokei/whitepaper/ja/h30/html/nd141110.html (Accessed 10.08.2019),
Statistics Bureau, Ministry of internal affairs and communications. (Japanese only)
10. Struijs, P., Braaksma, B., & Daas, P. J., 2014, “Official statistics and big data”, Big
Data & Society, 1(1), 2053951714538417.
11. Takahashi, M., Ito, T., 2013, “Multiple imputation of missing values in economic
surveys: Comparison of competing algorithms”, In Proceedings of The 59th World
Statistics Congress of the International Statistical Institute (ISI). Hong Kong, China,
3240-3245.
12. The Consumer Statistics Division of Statistics Bureau, 2018, “The orientation
of developing Consumption trend index (CTI)”, the 8th consumption Research
Council, document No.2, available from:
https://www.stat.go.jp/info/kenkyu/skenkyu/pdf/20190122_02.pdf (Accessed 10.06.2019), The Consumer Statistics
Division of Statistics Bureau. (Japanese only)
13. Tibshirani, R., 1996, “Regression shrinkage and selection via the lasso”, Journal
of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.
R tools for ILOSTAT: Rilostat and SMART
M. Villarreal-Fuentes ([email protected])
Department of Statistics, International Labour Organization (ILO)
S. Ding ([email protected])
Department of Statistics, International Labour Organization (ILO)
ABSTRACT
This article presents Rilostat and SMART, two statistical tools developed by
the Department of Statistics of the International Labour Organization (ILO) to facilitate
the user interaction with ILOSTAT, the largest repository of labour-related indicators.
The package Rilostat allows data users around the world to access, extract and
manipulate information from ILOSTAT. This document presents a description of the
package, including detailed explanations of all its functionalities, examples of reproducible
data visualization and a Principal Component Analysis application carried out using
information extracted with Rilostat from the Sustainable Development Goals (SDGs)
collection available in the database. The Statistics Metadata-driven Analysis and
Reporting Tool (SMART) allows National Statistical Offices world-wide to easily generate
and automate the production of analytical reports (such as national SDG reporting)
defined by means of an SDMX Data Structure Definition (DSD), either from
processing micro-level data or from aggregated data by means of transcoding. It is a hybrid
application that employs the .NET framework to build the user interface and R as
the computational and reporting engine. These two R-based tools for ILOSTAT take
advantage of all the benefits of the R software to give ILOSTAT data users simplified
access to what they need.
Keywords: Official Statistics, data dissemination, data visualization, analytical
reporting automation, GUI programming.
JEL Classification: C81, C88
1 Introduction
The Department of Statistics of the International Labour Organization (ILO) is the focal point for labour statistics within the United Nations System and the primary reference for all statistics-related issues within the ILO. As such, it has three fundamental mandates: (i) providing relevant, timely and comparable statistics on as many labour market topics as possible; (ii) developing international standards with a view to improving the measurement of labour issues and enhancing international comparability; (iii) supporting member States in developing and improving their labour statistics via training programs, capacity building and technical assistance.
1. The responsibility for opinions expressed in this article rests solely with its authors, and publication does not constitute an endorsement by the International Labour Office of the opinions expressed in it.
In order to achieve its goals, the ILO Department of Statistics produces a wide range of indicators that are related to the world of work, and then disseminates them through ILOSTAT¹, the largest and most comprehensive international repository of labour statistics in the world. The ILOSTAT database provides a large set of country-specific indicators, covering numerous labour-related topics. It assembles in one place national figures of the main labour market topics such as employment, unemployment, working time and earnings, but also of additional labour-related subjects such as social protection and industrial relations, proving to be instrumental to create a broader and more detailed picture of the labour market situation.
ILOSTAT provides the public with annual, quarterly and monthly time series data, some of which cover periods of over half a century². It includes country-level, regional and global estimates and projections of the main labour market indicators³ as well as ad-hoc data collections of specific topics (e.g. international labour migration).
Occasional users or basic users looking for a specific piece of information can get immediate access to the data and related metadata via the table viewer, or by downloading an Excel summary table. Indicators are presented in the home page of ILOSTAT grouped by “subjects”⁴. The table viewer shows data for the chosen indicator in a customizable table where users can select what to display, in terms of reference areas (countries, regions, etc.), time periods, sex, classification categories (such as age bands for age disaggregation, economic sectors for disaggregation by economic activity, etc.) and sources. Tables can be downloaded in different formats (Excel, CSV or SDMX). For regular users or more advanced users, especially those who wish to consult broader information (covering several indicators, areas, etc.), ILOSTAT provides two tools which enable data extraction, handling and analysis: the ILOSTAT SDMX web service⁵, and the ILOSTAT bulk download facility. Both retrieve the information in machine-readable format files that can then be imported into the user’s preferred tool⁶.
Maintenance of ILOSTAT involves multiple stages, from data collection and production that populate the database, to dissemination and accessibility, each of which uses various software for data handling and analysis. R (R Core Team, 2019) plays a major role in every part of this process. This paper aims at presenting two R tools developed by the ILO Department of Statistics, and used to help improve users’ interactions with the database: Rilostat, the ILOSTAT R package, and SMART, the Statistics Metadata-driven Analysis and Reporting Tool.
Rilostat takes advantage of the bulk download facility to allow users to search, rearrange, analyse, visualize and download labour market data disseminated on ILOSTAT, benefiting also from all the potential that the R software offers to the community. SMART is a statistical processor and transcoding tool able to produce datasets by processing microdata or aggregate data in several formats, based on the structural metadata read from a SDMX Dataflow or Data Structure Definitions (DSD), with the purpose of reporting or exchanging data in SDMX.
This document is structured as follows: section 2 explains the main features of the Rilostat package in detail, including all its functionalities, and presents three examples of data visualization. Section 3 uses data retrieved from the ILOSTAT database using Rilostat to perform a principal component analysis of the SDG labour market indicators collection. Section 4 provides an overview of SMART and demonstrates its main functionalities with two use cases. Section 5 concludes.
¹ Available at https://ilostat.ilo.org/.
² For instance, ILOSTAT includes data from the Current Employment Statistics Survey (an Establishment survey) of the United States from 1938 and from the USA Current Population Survey as from the first quarter of 1948.
³ More information can be found on https://www.ilo.org/ilostat-files/Documents/LFEP.pdf.
⁴ For instance, the indicator on employment by sex and age is found under the subject employment.
⁵ More information is available at https://www.ilo.org/ilostat-files/Documents/SDMX_User_Guide.pdf.
⁶ More information available at https://www.ilo.org/ilostat-files/Documents/ILOSTAT_BulkDownload_Guidelines.pdf.
2 Rilostat - ILOSTAT’s R package
During the past few years, the statistical information collected by the ILO Department of Statistics and disseminated through ILOSTAT has grown exponentially. The significant increase in data available was mainly due to improvements to the data compilation, data production and data dissemination processes. Efforts made by national statistical systems to report data to the ILO in a timely and regular manner have gone hand-in-hand with the ILO’s household survey microdata processing, which derives comparable indicators following international standards and definitions to the extent possible⁷.
Casual users can access the required statistical information directly by identifying and selecting the corresponding tables from the ILOSTAT website. However, more frequent or advanced data users may wish to avoid having to select on-screen tables, especially if they are involved in research projects with a strong computational component, and would like to easily replicate their actions⁸ (Gandrud, 2013). ILOSTAT provides users with two services that allow the programmatic extraction of large sets of information: the SDMX web service, and the bulk download facility.
In September 2017, ILOSTAT released the first version of Rilostat⁹, the package for R which provides a new way of accessing the ILOSTAT database. Its source code is largely based on the algorithm and documentation developed for accessing the Eurostat open database, the eurostat R package¹⁰ (Lahti et al., 2017); it uses the existing architecture of the ILOSTAT bulk download facility and the related file structure to fetch individual datasets or the complete ILOSTAT database.
The package is maintained by the ILO’s Department of Statistics¹¹, and gives data users with the knowledge of R the ability to access the ILOSTAT database, along with all its built-in functions to search for data, rearrange it and download it in the desired format, while benefiting from the vast amount of functionalities already available in R for data formatting, visualization, analysis and results reporting.
⁷ In 2016, the ILO Department of Statistics started to systematically process labour-related household surveys (HS), mainly labour force surveys (LFS) micro datasets, in order to improve the quality and coverage of data published on ILOSTAT. More information on this can be found on the “ILOSTAT Microdata Processing Quick Guide: Principles and methods underlying the ILO’s processing of anonymized household survey microdata”.
⁸ Moreover, as pointed out by Gandrud (2013) and later by Lahti et al. (2017): “Availability of algorithmic tools [. . . ] can greatly benefit reproducible research, as complete analytical workflows spanning from raw data to final publications can be made fully replicable and transparent”.
⁹ Available on CRAN (https://cran.r-project.org/web/packages/Rilostat/index.html).
¹⁰ http://ropengov.github.io/eurostat/.
¹¹ Issues can be reported through: https://github.com/ilostat/Rilostat.
2.1 Main uses of Rilostat
The Rilostat package has numerous uses, namely:
• Providing access to ILOSTAT annual, quarterly and monthly time series via the ILOSTAT bulk download facility;
• Allowing to search and download ILOSTAT data and related metadata in the three ILO official languages: English, French and Spanish;
• Giving the ability to return POSIXct dates for an easier integration into plotting and time series analysis packages available for R;
• Returning data in long format for better interaction with widely used packages such as ggplot2 and dplyr;
• Providing access to the most recent updates of the ILOSTAT database;
• Allowing for the grep-style search of data descriptions and names;
• Providing access to the ILOSTAT catalogue and related descriptive metadata.
2.2 Getting started with Rilostat
2.2.1 Installation
The installation of the CRAN release version of Rilostat is done by executing the common lines used for it¹²:
install.packages("Rilostat")
library(Rilostat)
The package works with an “imports” directive that loads its necessary packages¹³. As stated in Rilostat’s reference manual, there are several other packages which could be useful to have installed in order to handle, visualize and analyse data¹⁴. All the functions that are part of the package are listed as a data frame after running the following command:
as.data.frame(ls("package:Rilostat"))
¹² Similarly, the user can install the development version via Github: https://github.com/ilostat/Rilostat.
¹³ plyr, dplyr, stringr, readr, tibble, haven, xml, data.table, RCurl, DT.
¹⁴ A non-exhaustive list of suggested packages is: shiny, plotly, ggplot2, knitr, rmarkdown, roxygen2, rsdmx, plotrix, Cairo, testthat, tidyr, devtools and covr.
2.2.2 Searching for data
Just like the bulk download facility it is built on, Rilostat gives access to ILOSTAT datasets through two different directories, based on two different ways of presenting the information: organizing them by 'indicator' (and frequency) or by 'ref_area' (and frequency). The 'indicator' refers to the title of each specific table, including the represented variable and any disaggregations used for it (for instance, 'labour force by sex and age', 'employment by sex and economic activity' and 'unemployment rate by sex, age and rural/urban areas' are ILOSTAT indicators). The 'ref_area' (i.e. reference area) refers to the geographic areas for which data are available. Since ILOSTAT includes both country-level data and regional and global estimates, the reference area can either refer to countries, to regions (geographic regions such as Africa, Americas or Arab States, income groups such as low-income countries, or other groups such as the BRICS or the G20) or the world as a whole¹⁵. The frequency refers to whether the various data points are annual, quarterly or monthly.
Taking this into account, a first step to search for data is to get the code of the 'indicator' or 'ref_area' the user is looking for. The function get_ilostat_toc() provides grep-style searching that returns all the data files available for consultation in the corresponding directory, and provides summary information on each data file matching the query. The following line gives access to the table of contents of all available indicators in ILOSTAT by indicator (default):
toc_ind <- get_ilostat_toc()
The arguments available for this function allow the user to set the segment required ('indicator' (default) or 'ref_area'), the preferred language among the three ILO official languages: English ('en'; default), French ('fr') and Spanish ('es'), the pattern within the description to be searched ('none' by default) and the filters to the variables ('none' by default) in order to get parts of the table. For instance, a narrower search would be to look for (1) all available indicators containing the word 'unemployment', or (2) to get the label of the reference area by frequency for all available datasets in two countries:
(1) toc_une <- get_ilostat_toc(search = 'Unemployment')
(2) toc_cou <- get_ilostat_toc(segment = 'ref_area', search =
c('Philippines|Thailand'), fixed = FALSE)
The codes or identifiers used in the table of contents for the indicators and reference areas in the first column ('id') are unique and allow for the unequivocal identification of the corresponding item to be consulted. For reference, note that code names all follow the same structure. The indicator code names include, in this order, the code of the topic, the represented variable, the disaggregations included ('NOC' for 'no classification' if there is no disaggregation), the unit ('NB' for absolute values or numbers and 'RT' for percentages or rates) and the frequency ('A' for annual data, 'Q' for quarterly data and 'M' for monthly data). Similarly, the code names of the files by reference area refer to the country (ISO Alpha-3 country code) or the region (codes starting with X) and the frequency ('A', 'Q' and 'M').
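The code structure described above can be made concrete by splitting an identifier into its parts. The two helpers below are a hypothetical base R sketch and not part of Rilostat; they assume that the parts are separated by underscores and that an indicator code always has a topic, a variable, at least one disaggregation, a unit and a frequency.

```r
# Hypothetical helpers, not part of Rilostat: split an ILOSTAT identifier
# into the components described in the text.
parse_indicator_id <- function(id) {
  parts <- strsplit(id, "_", fixed = TRUE)[[1]]
  n <- length(parts)
  stopifnot(n >= 5)
  list(
    topic           = parts[1],
    variable        = parts[2],
    disaggregations = parts[3:(n - 2)],  # 'NOC' when there is none
    unit            = parts[n - 1],      # 'NB' or 'RT'
    frequency       = parts[n]           # 'A', 'Q' or 'M'
  )
}

parse_ref_area_id <- function(id) {
  parts <- strsplit(id, "_", fixed = TRUE)[[1]]
  list(ref_area = parts[1], frequency = parts[2])
}

ind  <- parse_indicator_id("UNE_DEAP_SEX_AGE_RT_A")
area <- parse_ref_area_id("AFG_A")
```

For 'UNE_DEAP_SEX_AGE_RT_A' this gives the topic 'UNE', the variable 'DEAP', the disaggregations 'SEX' and 'AGE', the unit 'RT' and the frequency 'A'.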
¹⁵ It is important to note that global and regional estimates are only available for some indicators, and thus most datasets would only include country-level data.
2.2.3 Downloading data
The function get_ilostat() explores ILOSTAT and returns single or multiple datasets by indicator (default, segment = 'indicator') or by reference area (segment = 'ref_area'), using the code obtained at the identification step. The following code lines return: (1) unemployment rate by sex and age (%), annual; and (2) all available annual data points for Afghanistan and Trinidad and Tobago:
(1) dat_une <- get_ilostat(id = 'UNE_DEAP_SEX_AGE_RT_A', segment = 'indicator')
(2) dat_att <- get_ilostat(id = c('AFG_A', 'TTO_A'), segment = 'ref_area')
In addition to the arguments available for the function to search for data, the user can also find within this function options to set the type of the variables ('code' (default), 'label' or 'both'), which allows for getting codes and/or human-readable labels, the format in which the time column is to be returned ('raw' (default), 'date', 'date_last' and 'num'), filters that can be applied to the dataset (explained in more detail in the following section), and the option to do caching (TRUE by default), whether the cache generated is to be updated (FALSE by default) and the desired format of the file to be stored as cache ('rds' (default), 'csv', 'dta', 'sav', 'sas7bdat') (see Section 7 for more information).
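As an illustration of these options, the call below combines several of them. It is a sketch rather than a recommended setting: the wrapper name `fetch_unemployment_rate` is hypothetical, and the call is wrapped in a function so that nothing is downloaded until it is invoked with the Rilostat package installed and an internet connection available.

```r
# Hypothetical wrapper around get_ilostat(); calling it requires the
# Rilostat package and network access.
fetch_unemployment_rate <- function() {
  Rilostat::get_ilostat(
    id           = "UNE_DEAP_SEX_AGE_RT_A",
    segment      = "indicator",
    type         = "label",  # human-readable labels instead of codes
    time_format  = "num",    # numeric time column instead of raw strings
    cache        = TRUE,     # reuse a local copy on later calls
    cache_update = FALSE,    # do not refresh an existing cache
    cache_format = "rds"     # store the cache as an R data file
  )
}

# dat_une_lab <- fetch_unemployment_rate()
```

Requesting `type = 'label'` is convenient for plotting, while keeping the default codes is usually easier when filtering or joining datasets.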
Since datasets are downloaded using the ILOSTAT bulk download facility, the structure of the tibble obtained with this function mirrors the structure of the CSV file which would have been extracted using the bulk download. That is, the rows after the header names present the data records, consisting of the key of the record (the 'names' of the dimensions used to identify each record, including the data collection, the reference area, the source of the data, the classifications used, etc., referring to all fields from 'collection' to 'time'), the observation value ('obs_value') and any other metadata available (such as the geographical coverage of the source or the specific definitions used for some concepts, referring to all fields from 'obs_status' to 'note_source').
2.2.4 Filtering data
The option 'filters' is available as an argument of the function get_ilostat(). It offers the possibility of retrieving a subset of the requested dataset by passing a named list ('none' by default). The names of this list are the variable codes (the code names used as headers of the dataset retrieved), and the values are vectors of predefined disaggregations. The user can access an extensive list of these disaggregations, known as dictionary files, through the function get_ilostat_dic(). For instance, it is possible to obtain the annual unemployment rate (%) for women in Colombia by executing the following code:
dat_une_col <- get_ilostat(id = 'UNE_DEAP_SEX_AGE_RT_A', segment = 'indicator',
                           filters = list(ref_area = 'COL', sex = 'SEX_F'))
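The effect of the 'filters' argument can be pictured offline as subsetting a data frame with a named list. The following is only a minimal sketch under stated assumptions: apply_filters is a hypothetical helper (it is not part of Rilostat and does not reproduce its internals), and the toy data frame merely borrows ILOSTAT-style column names with made-up values.

```r
# Toy data with ILOSTAT-style columns; the values are made up for illustration.
toy <- data.frame(
  ref_area  = c("COL", "COL", "TTO", "TTO"),
  sex       = c("SEX_F", "SEX_M", "SEX_F", "SEX_M"),
  time      = c("2017", "2017", "2017", "2017"),
  obs_value = c(12.1, 7.2, 5.9, 3.8),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not a Rilostat function): keep the rows whose value
# in each named column belongs to the vector supplied for that column.
apply_filters <- function(df, filters) {
  keep <- rep(TRUE, nrow(df))
  for (col in names(filters)) {
    keep <- keep & df[[col]] %in% filters[[col]]
  }
  df[keep, , drop = FALSE]
}

res <- apply_filters(toy, list(ref_area = "COL", sex = "SEX_F"))
print(res)
```

The names of the list play the role of the column headers and the values the role of the disaggregation codes, which is the same shape as the list passed to get_ilostat() above.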
2.3 Data Visualization with data extracted using Rilostat
Taking advantage of all the potential that R offers to the community, the user can handle the information extracted directly in R and use the available functions and packages for data visualization. Some of the most widely used packages for data handling and visualization are already loaded as an "import" directive when installing Rilostat. Other packages can be installed for more advanced plotting manipulation.
Figures 1, 2 and 3 show three different visualization examples using information fetched from the ILOSTAT database and the packages viridis (Garnier et al., 2018), scales (Wickham, 2018), plotly (Sievert et al., 2019) and ggridges (Wilke, 2018), among others. For these examples, the data used is taken from the compilation "ILO modelled estimates, November 2018", which is methodologically robust and consistent across countries and therefore ensures international comparability. It also includes regional and global aggregates. The R code to produce them can be found in section 7.
Figure 1: Evolution of the global employment distribution by occupation, 1991-2023 (ILO modelled estimates, November 2018)
Figure 2: Share of youth not in employment, education or training (NEET), 2017 (ILO modelled estimates, November 2018)
Figure 3: Distribution of the labour force participation rate by sex, 1990, 2000, 2010, 2018 and 2030 (ILO modelled estimates, July 2018)
3 An application: A principal component analysis of the SDG indicators
3.1 The Sustainable Development Goals
In January 2016, the international community adopted a set of 17 Sustainable Development Goals and 169 targets meant to take on the unfinished aspects of the Millennium Development Goals (MDGs) agenda and the new global challenges. They cover three key elements: economic growth, social inclusion and environmental protection.16
Multiple institutions at the national and international level monitor the development of the SDGs across all regions of the world and over multiple follow-up stages. Moreover, the achievement of all goals, set to be accomplished by 2030, is meant to ensure at the same time their sustainability in the long run. As stated in ILO (2018), the global goals "promote prosperity while protecting the planet, putting forward the idea that ending poverty must be aligned with strategies for economic growth and addressing at the same time social needs and environmental concerns". Thus, a continuous analysis of each indicator individually, as well as of the set of indicators and their interactions, constitutes an essential part of the reviewing process.
16 More information can be found at https://www.ilo.org/wcmsp5/groups/public/---dgreports/---stat/documents/publication/wcms_647109.pdf.
3.2 Principal component analysis (PCA)
Factorial analysis, or multivariate data analysis, comprises descriptive and exploratory statistical methods commonly used to summarize large sets of data and to produce a simpler picture of their structure. Its main objective is to describe the relationships between variables (dimensions) in terms of a potentially lower number of unobserved variables (factors).
Principal Component Analysis examines the linear relationship between correlated quantitative variables by creating uncorrelated synthetic factors (i.e. principal components) that belong to a lower dimensional space, and therefore allows for a more direct interpretation. These components (or factors) are linear combinations of the initial variables that retain most of their variance, guaranteeing a proper representation of the individuals' interactions and the existing heterogeneity between them (Lebart et al., 1995; Escofier and Pages, 2008).
The interpretation of the PCA results is mainly based on two measures: 1) the quality of the representation that can be achieved when reducing dimensions, and 2) the distance between a pair of individuals, measured in terms of the synthetic factors created.
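The two measures can be illustrated with base R alone. The sketch below is not the analysis carried out in this paper: it uses simulated data and the function prcomp() from the stats package, and the object names are chosen for illustration only. The quality of representation is computed as the squared cosine of each individual on the first factor plane, and the distance is the Euclidean distance between two individuals in that plane.

```r
set.seed(1)

# Simulated data: 30 individuals, 4 quantitative variables, 3 of them correlated.
z <- rnorm(30)
X <- cbind(a = z + rnorm(30, sd = 0.3),
           b = z + rnorm(30, sd = 0.3),
           c = -z + rnorm(30, sd = 0.5),
           d = rnorm(30))

# PCA on centred and standardized variables.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
scores <- pca$x  # coordinates of the individuals on the components

# (1) Quality of representation: squared cosine of each individual on each
#     component, then summed over the first factor plane (PC1 and PC2).
cos2 <- scores^2 / rowSums(scores^2)
quality_plane <- rowSums(cos2[, 1:2])

# (2) Distance between individuals 1 and 2 in the first factor plane.
d12 <- sqrt(sum((scores[1, 1:2] - scores[2, 1:2])^2))

# The components are uncorrelated by construction.
print(round(cor(scores), 10))
```

An individual with a quality close to 1 is well described by the first two components, so its position on the factor map can be interpreted; a value close to 0 means that its projection onto the plane is not informative.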
3.3 Data
The SDG data collection available on ILOSTAT contains a set of SDG labour market indicators for which the ILO is either the custodian agency or one of the partner agencies responsible for reporting at the global level. The information retrieved using Rilostat consists of 11 quantitative indicators17: (SDG_0111) working poverty rate; (SDG_0131) proportion of population covered by social protection floors/systems; (SDG_0552) female share of employment in managerial positions; (SDG_0821) annual growth rate of output per worker; (SDG_0831) proportion of informal employment in non-agricultural employment; (SDG_0852) unemployment rate; (SDG_0861) proportion of youth (aged 15-24 years) not in education, employment or training; (SDG_N881) non-fatal occupational injuries per 100'000 workers; (SDG_F881) fatal occupational injuries per 100'000 workers; (SDG_0922) manufacturing employment as a proportion of total employment; (SDG_1041) labour income share as a percent of GDP. A detailed description of the dataset can be found in the appendix (Section 7).
The availability of information on each of the SDG indicators can vary from one reference area to another for multiple reasons18. Given that the PCA needs a complete input dataset, in what follows we (1) keep only reference areas with 70% or more of the indicators for the analysis, that is, 54 of the 183 reference areas; and (2) treat the remaining missing values following the method proposed in Josse and Husson (2012) and available in the package by Husson and Josse (2019).
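Step (1) amounts to a row filter on the share of available indicators. A minimal sketch with made-up values follows; the reference area codes and indicator names are placeholders, and the imputation of step (2) is deliberately not reproduced here.

```r
# Made-up wide table: one row per reference area, one column per indicator.
ind <- data.frame(
  ref_area = c("AAA", "BBB", "CCC", "DDD"),
  i1 = c(1.0, NA, 3.0, NA),
  i2 = c(2.0, NA, 1.5, 4.0),
  i3 = c(0.5, 2.0, NA, NA),
  i4 = c(1.2, NA, 2.2, NA),
  i5 = c(3.1, 0.9, 1.1, NA),
  stringsAsFactors = FALSE
)

# Share of non-missing indicators per reference area.
vals <- as.matrix(ind[, -1])
share_available <- rowMeans(!is.na(vals))

# Keep reference areas with at least 70% of the indicators available.
kept <- ind[share_available >= 0.7, ]
print(kept)

# Missing cells that remain and would still have to be imputed before the PCA.
n_remaining_na <- sum(is.na(kept[, -1]))
print(n_remaining_na)
```

Any cells still missing after the filter are the ones handed to the imputation method in step (2).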
3.4 PCA Results
As previously mentioned, the PCA gives us useful information on the differences in the underlying structure of a dataset. However, it must be emphasized that this type of analysis does not aim at making any statistical inference, but rather at carrying out a multidimensional exploratory analysis without distributional assumptions on the variables under analysis.
17 Due to the amount of missing values in indicator 8.5.1 (Average hourly earnings of female and male employees) and indicator 8.7.1 (Proportion of children engaged in economic activity and household chores (%)), these indicators are not part of the analysis.
18 For instance, the lack of sources of information to collect, process and estimate the indicators.
Nine of the 11 indicators enter the analysis as active variables, whereas SDG 1.1.1 (working poverty rate) and SDG 8.3.1 (proportion of informal employment in non-agricultural employment) are set as supplementary variables, i.e. they are not part of the process of building the synthetic factors19. The analysis uses the packages FactoMineR (Husson et al., 2018) and factoextra (Kassambara and Mundt, 2017) for the extraction of the results and their visualization, respectively. Three factors are kept for analysis (those whose eigenvalue is greater than unity20), summarizing 66.1% of the total heterogeneity.
Figure 4 presents the first factor map (dimensions one and two) with the projections of the SDG indicators and the reference areas simultaneously. The indicators associated with each factor are those that contributed the most to its construction. The quality of the representation of reference areas on the map is given by their squared cosine, represented by the colour of their label. Thus, red, pink and purple reference areas are considered for interpretation.
The first principal axis explains 35.7% of the total variance and is associated with the proportion of youth (aged 15-24 years) not in education, employment or training (SDG_0861), the number of fatal occupational injuries (SDG_F881), the proportion of population covered by social protection floors/systems (SDG_0131) and labour income share as a percent of GDP (SDG_1041). The second factor is characterized by the annual growth rate of output per worker (SDG_0821) and the unemployment rate (SDG_0852), and explains 17.7% of the total variance. Finally, the third dimension accounts for 12.7% and has high contributions from the youth NEET rate (SDG_0861) and non-fatal occupational injuries (SDG_N881).
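The rule used to retain factors (eigenvalue greater than unity) and the share of variance attributed to each axis can be sketched with base R. The example below uses simulated data, not the SDG dataset, and works on the correlation matrix, i.e. it assumes that the variables are standardized; the total inertia is then the sum of the eigenvalues, which equals the number of active variables.

```r
set.seed(42)
n <- 200

# Simulated data: two latent factors drive five variables, a sixth is noise.
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- cbind(v1 = f1 + rnorm(n, sd = 0.4),
           v2 = f1 + rnorm(n, sd = 0.4),
           v3 = f1 + rnorm(n, sd = 0.4),
           v4 = f2 + rnorm(n, sd = 0.4),
           v5 = f2 + rnorm(n, sd = 0.4),
           v6 = rnorm(n))

# Eigenvalues of the correlation matrix (PCA on standardized variables).
eig <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values

share     <- eig / sum(eig)   # share of total variance per component
cum_share <- cumsum(share)    # cumulative share
n_keep    <- sum(eig > 1)     # components with eigenvalue above unity

print(data.frame(component  = seq_along(eig),
                 eigenvalue = round(eig, 3),
                 share_pct  = round(100 * share, 1),
                 cum_pct    = round(100 * cum_share, 1)))
print(n_keep)
```

In the analysis reported here, the same reasoning leads to three retained factors whose shares (35.7%, 17.7% and 12.7%) add up to 66.1% of the total variance.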
Figure 4: Factor map with variables and individuals projections.
19 These variables are left as supplementary given that most of the European countries do not report information on them.
20 The total variance (inertia) is given by the sum of all eigenvalues related to the covariance matrix (Lebart et al., 1995).
Projections of reference areas onto the first factor map help to understand the relationship between a pair of points (reference areas) and between each point and the components built. Reference areas projected on the positive side of the first dimension, e.g. Armenia, Bolivia, Colombia and South Africa, are identified with a high proportion of youth not in education, employment or training and a high level of occupational injuries. In contrast, reference areas projected onto the negative side, e.g. Austria, Belgium, Estonia, Germany, Finland, Netherlands, Norway, Sweden and Switzerland, are associated with a high proportion of the population covered by social protection floors/systems and high levels of labour income share as a percent of their GDP. Greece and Spain are projected onto the positive side of the second axis, which is associated with high unemployment levels; they stand opposite the Philippines, whose projection reflects a positive annual growth rate of output per worker.
These results show evidence of the variability of the set of SDG labour indicators across multiple reference areas. A broader multivariate description can be achieved by including information on more reference areas and more SDG indicators as data become available. Some of the results, e.g. the new coordinates of the variables and observations obtained after reducing dimensions, could also be used as the initial step in a further statistical analysis of labour market indicators at the global level.
4 SMART: Statistics Metadata-driven Analysis & Reporting Tool
In order to strengthen the capacity of countries to report labour statistics to ILOSTAT, the ILO Department of Statistics is developing a toolkit that facilitates the production of tables. SMART21 receives as input a micro dataset from an LFS (or an aggregated dataset) and the specification of the tables to be produced by means of a DSD. The output files can be generated in diverse formats and used for analysis, data reporting or to feed a dissemination platform. In particular, it is a useful tool to produce SDMX datasets for reporting SDG (or any other) data in the absence of a proper reporting platform able to produce SDMX datasets.
As shown in the SMART Concept Map (Figure 5), an analysis is performed through three major modules:
• Data and DSD inputs
SMART has two main inputs: a dataset with the source information (in Stata, SPSS, CSV or SDMX-ML format) and an XML file/message containing an SDMX-ML DSD with one or multiple data structures that define the cross tabulations to be generated. This DSD can be a local file or a message queried from an SDMX registry online.
In processing the input data, SMART can count cases, summarize, compute means and filter records based on complex conditions. However, it is not advisable to attempt to follow complex questionnaire sequences in the calculation of the indicators, but rather to pre-process the micro data to compute and add derived variables using more powerful statistical packages such as R, Stata, SPSS or SAS. These variables can then be used in the production of the output cross tabulations using SMART.
21 Available at https://ilostat.github.io/smart/.
Figure 5: SMART Concept Map
• Mapping
The mapping process links the concepts in the DSD with the variables from the input data. Usually a DSD defines three major roles for concepts, namely Dimension, Primary Measure and Attribute. The Primary Measure and all the Dimensions must be mapped to input variables or assigned a constant value, while the mapping of Attributes is in most cases optional, as they refer to descriptive metadata (notes). However, attributes defined as mandatory must be mapped.
For the categorical variables in the input data whose codes differ from the classification items in the DSDs, a mapping for each category must be created. This makes it possible, for example, to process a dataset which codes the variable gender as Male=1 and Female=2. If in the DSD this variable is named SEX and uses the labels "M" and "F", it is necessary to generate a mapping that assigns the variable gender to SEX and the items 1 and 2 to M and F, respectively. Some classification items from the DSD can be left unmapped, in which case they will not be included in the tabulation. Similarly, some categories in the dataset can be left unmapped, and these records will not be counted in the tables that include such a classification.
The attributes in the DSD will be presented to the user with a list of valid options to select. If the DSD includes one or more free text attributes for open text notes, they can be added at this stage.
• Generate
When all the data has been entered and the variables in the dataset are properly mapped to the concepts in the DSD (Dimensions, Primary Measure and Attributes, if any), the user is requested to select the format(s) for the output report. The available options are: .csv for ILOSTAT, .Stat v7 data and dimension "pipe-separated" files, SDMX data messages (in SDMX-ML, SDMX-JSON or SDMX-CSV) for SDG reporting, and Excel.
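The category mapping described under Mapping (variable gender coded 1 and 2 in the input data, concept SEX with items M and F in the DSD) can be pictured in R as a lookup with a named vector. This is only an illustration of the logic, not SMART code; the object names are hypothetical and the code 9 is an invented unmapped category.

```r
# Made-up microdata: gender coded 1 = Male, 2 = Female, 9 = unmapped category.
micro <- data.frame(gender = c(1, 2, 2, 9, 1))

# Item mapping from input codes to the DSD code list of the concept SEX.
item_map <- c("1" = "M", "2" = "F")

# Variable mapping: gender -> SEX, recoding each item through the lookup.
micro$SEX <- unname(item_map[as.character(micro$gender)])

# Records whose category is left unmapped are not counted in the tables.
mapped <- micro[!is.na(micro$SEX), , drop = FALSE]
print(table(mapped$SEX))
```

The record coded 9 has no counterpart in the mapping and is therefore dropped from the tabulation, which mirrors the behaviour described for unmapped categories.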
4.1 Showcases
4.1.1 SDG Reporting
To track the progress made on all SDGs, the UN has urged all national governments to report annually on SDG indicators. Furthermore, the reporting platforms developed at the national level should support international standards and common formats to facilitate data exchange both within and between countries. This includes using SDMX, a global data exchange initiative, to process and report on the SDGs.
Compiling and reporting SDG data in SDMX messages (XML or JSON) requires the national dissemination platform to be able to handle SDMX artefacts. If such a reporting platform is not yet in place, producing SDG data in SDMX becomes rather challenging. SMART is meant to facilitate this task. As an example, we demonstrate that SDG indicators prepared in Excel can be easily converted to SDMX by using the SMART transcoding feature. This example is embedded in SMART and can be loaded via Project → Open Example Project.
The project folder contains the input data file 1.3.1 NINE INDICATOR.csv22 (as well as its original Excel file 1.3.1 NINE INDICATORS.xlsx), the desired output table structures in the DSD file SDG DSD(0.3).xml, the output.zip and the project file Example SDG Reporting.smart.
Using this example, suppose the reporting agency has properly aggregated from its microdata source the data contained in 1.3.1 NINE INDICATOR.csv and needs to convert the file into SDMX in order to report it. To do so, the agency needs to undertake the following tasks:
• Obtain a DSD or dataflow for SDGs. Since 2016, the Inter-agency and Expert Group on SDGs (IAEG-SDGs) has been developing the SDMX solution for SDG indicator data and metadata exchange and dissemination. The pilot SDG dataflow DF_UNDATA_SDG_PILOT developed by the IAEG-SDGs is available for use.
• Allocate the variables in the input data to Dimensions, Attributes and Primary Measure in the DSD. This assignment task can be handled inside the SMART Mapping module.
• Recode the mismatched items between the input CSV and the DSD definition. This can be handled in the Mapping module as well.
• Write in SDMX-ML or SDMX-JSON. Various SDMX outputs can be generated in the Generate module.
22 Proportion of population covered by social protection floors/systems, by sex, distinguishing children, unemployed persons, older persons, persons with disabilities, pregnant women, newborns, work-injury victims and the poor and the vulnerable.
These tasks translate into the following steps23:
1. Upload the input data 1.3.1 NINE INDICATOR.csv by clicking the button Add... or simply drag and drop it in the Datasets area.
2. In Data Structures, click Online Query and select SDMX UNSD in the dropdown menu of SDMX Registry. Choose the dataflow DF_UNDATA_SDG_PILOT and then click the Load button.
3. Once both the input data and the output data structures have been added, click the Next button to advance, which brings you to the Mapping module.
4. Map the input variables to the different SDMX concepts, i.e., Dimensions, Primary Measure and Attributes. The detailed mapping guidelines can be found on the SMART Reference page. In general, this exercise requires users to have a good knowledge of the input data. For example, the variable "Observation.Value" should be recognized as the primary measure and is therefore assigned in the Measures panel. In the Dimensions panel, a dimension that cannot be found in the input data, e.g. FREQ, must be assigned a constant value: since the data are annual aggregates, FREQ is set to the constant "Annual (A)".
5. Go to the Generate module and click Process. The results are presented in the Table Viewer according to the DSD definition. To export them in SDMX, the user can specify the output directory and the desired SDMX formats from the options and then click the Export button. Within a few seconds, the user can see the SDMX outputs in the Export Folder (or click Open Export Folder).
4.1.2 Microdata Processing
Besides converting aggregated data, SMART is also able to handle microdata24 directly. In particular, it can count cases, summarize, compute means and filter records based on complex conditions. For ease of reference, the embedded project "Process Microdata: Unemployment" (loaded via Project → Open Example Project) is used here for demonstration.
The project folder contains the input microdata file Miranda Eng.sav (in SPSS format, a derived dataset from a household survey), two input DSDs YI_X01_UNE_TUNE_SEX_AGE_NB.xml and YI_X01_UNE_TUNE_SEX_AGE_GEO_NB.xml (which can be downloaded from the ILOSTAT SDMX web portal), the output.zip and the project file Example UNE.smart. The objective of this project is to calculate and report the unemployment level by two types of breakdown: by sex and age, and by sex, age and rural/urban areas. The following steps explain how to process this microdata with SMART.
1. Add the data and the two DSDs in the "Datasets and table structures" module. Notice that these two tables (DSDs) are reported jointly because they share some common concepts in dimensions and attributes, such as CLASSIF_SEX and CLASSIF_AGE. The mapping of these common concepts is only required once.
2. Press the Next button to advance to the Mapping module.
23 Notice that Open Example Project automatically prepares the relevant inputs in Data and DSD inputs and Mapping, so that you may directly advance to the Generate module to process.
24 SMART does not provide any data processes for cleaning, validation and editing; the microdata it handles should be ready for the aggregation analysis.
3. Map the input data variables and items to the SDMX output concepts, namely Primary Measures, Dimensions and Attributes.
4. Press the Next button to advance to the Generate module and press the Process button to generate the tables.
A few remarks on the mapping procedure:
• The data in this exercise does not have any real-world interpretation and only serves the purpose of demonstration.
• Primary Measure is not mapped to any variable but is rather based on counting cases (tallying) under a filtering condition. That is, the individuals whose main activity during the last year (MAJACTYR) is either "Looked for work" (3) or "Wanted work and available" (4) are considered unemployed and are thus counted as the measure of unemployment.
• The example data does not contain sample weights, but in practice it is mandatory to include them to allow data aggregation from the micro level. To map the sample weights in SMART, go to the Others tab and assign the corresponding input variable. The selection of sample weights is only enabled when data aggregation is needed.
• Quality measures based on the number of observations are also considered in SMART. By default, if the number of observations is less than 5, the calculated value is marked as "Not Available"; if the number of observations is between 5 and 15, the value is marked as "Unreliable". The default criteria can be altered in the Others tab.
• The observation status criteria and the repetitive runs feature allow users to perform multiple procedures to determine the best level of breakdown. For example, the mapping of CLASSIF_AGE to 10YRBANDS (ten-year age bands) would result in many domains with observations marked "Not Available" or "Unreliable", because the observations are spread over many age groups. To improve this, we could report the statistics based on a wider age breakdown, such as YTHADULT (youth and adult groups), or even replace the breakdown with a total. The decision on the breakdown level can be assessed using these observation criteria.
10YRBANDS    YTHADULT    No Breakdown
15-24        15-24       Total
25-34        25+
35-44
45-54
55-64
65+
Table 1: Breakdown assessment on CLASSIF_AGE
• Rate/ratio calculation is also supported in SMART. In the mapping of Primary Measure, press the button Specify Denominator (+/-) to specify the quotients between variables. An example project "Process Microdata: Labor Force Participation Rate" can be found via Project → Open Example Project.
• Once the mapping is set, users can preview the output layout in the Table Structure Preview (press the Table Viewer button in the section tool bar). As an example, a screenshot of the structure preview for table YI_X01_UNE_TUNE_SEX_AGE_GEO_NB is captured in Figure 6.
Figure 6: Screenshot of the structure preview
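The tallying rule and the observation-count thresholds mentioned in the remarks above can be sketched in R as follows. This is not SMART code: the data are made up, flag_obs is a hypothetical helper, the label "OK" for reliable cells is an assumption, and so is the treatment of exactly 15 observations as "Unreliable", since the text does not specify the boundary.

```r
# Made-up microdata: 20 men and 10 women with a main-activity code.
micro <- data.frame(
  SEX      = rep(c("M", "F"), times = c(20, 10)),
  MAJACTYR = c(rep(c(3, 4), 8), rep(1, 4),   # men: 16 coded 3 or 4, 4 coded 1
               rep(3, 3), rep(2, 7)),        # women: 3 coded 3, 7 coded 2
  stringsAsFactors = FALSE
)

# Tallying under a filtering condition: codes 3 and 4 count as unemployed.
unemployed <- micro[micro$MAJACTYR %in% c(3, 4), ]
counts <- table(unemployed$SEX)

# Hypothetical helper reproducing the default thresholds described above.
flag_obs <- function(n, min_n = 5, reliable_n = 15) {
  ifelse(n < min_n, "Not Available",
         ifelse(n <= reliable_n, "Unreliable", "OK"))
}

result <- data.frame(SEX    = names(counts),
                     n      = as.vector(counts),
                     status = flag_obs(as.vector(counts)),
                     stringsAsFactors = FALSE)
print(result)
```

These are unweighted sample counts. With real survey data, the reported level would be the sum of the sample weights of the same records, while the status would still depend on the number of observations.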
4.2 Highlight features
Beyond the key functionalities that SMART provides, some other features and tools are also worth highlighting, as they make it more user friendly and sustainable:
• Online data and metadata Query: The input data and DSD (or dataflow) can be pulled directly from an SDMX API into SMART without having to download them as local files first.
• Command line utility: SMARTcmd.exe is a command line version of SMART intended to be used for batch process automation. SMARTcmd reads a project file (.smart) previously saved from the normal GUI-based version and executes either the aggregations or the transformations. Besides the project file, it is possible to specify several parameters in the command line which will supersede the values in the project file for this run. For example, the input and output file names can be changed for repeated transformations of the same type of files; or the parameter "-append" can be used to create a single output file in several runs with different input files of the same type (e.g. different quarterly data of the same household survey). To use SMARTcmd, just store it in any folder by clicking on Tools → Send SMARTcmd.exe to...
• Reusable mapping: The mapping can be saved in CSV format for further re-use. This is a useful feature, as mappings can be partially uploaded from different saved ones. SMART looks at the names of concepts and variables and checks if they match the mappings that are uploaded. If a concept already mapped in memory is found in the mapping file being uploaded, the whole mapping for this concept will be updated.
• Repetitive Runs: SMART can take as many runs as possible (until it reaches the memory limit) for the calculation process. For example, if multiple input data files are to be reported on the same DSD, we can run one data file at a time and then go back to process another. In the end, the results generated from these data files can be reported jointly in a single SDMX data message.
• DSD Constructor: This is a SMART companion tool and a standalone application which is able to create and edit DSDs and their components (i.e. dimensions, attributes, measures and code lists) in order to generate the DSD which fits a given user's needs. The DSDs can then be used by SMART to obtain the required output dataset. It can grab concepts and associated code lists from any SDMX registry and also allows users to create and load them from scratch. It can also edit an existing DSD and save it with a different id after making some changes.
4.3 Architecture Design
This section is intended for application developers, in particular for those who are interested in developing GUI applications with R.
From the design perspective, SMART is a standalone desktop application using a GUI in .NET on top of the R statistical processor. That is, on the front end C#.NET is used to build its user interface and on the back end the R processor serves as its computational and reporting engine. Compared to purely R-based solutions, this hybrid design makes SMART available to a wider audience, specifically users who have fewer programming skills in R and depend on a GUI. Furthermore, it allows SMART to benefit from the powerful features and libraries of both languages. In the R engine, SMART employs the package foreign to interpret input datasets in SPSS, Stata and SAS format and the package data.table to achieve fast data manipulation. In the .NET framework, it uses the standard NuGet package SdmxSource (Eurostat, 2018) to read and write any SDMX artefacts.
The key to this hybrid design is the bridging, i.e. being able to access the R runtime from .NET. We use the .NET interoperability library R.NET (Perraud and Abe, 2017) to achieve fast data exchange. The connection from .NET to R using R.NET is fairly straightforward (for the detailed configuration, the tutorial post by Perraud (2015) can be followed). In the initialization of the C# code, a single REngine object instance is retrieved, which will seek the R home path based on the system environment variables. If a valid version of R (32bit, < 3.4.1) has been found locally, REngine will trigger a hidden R console through which data can be sent and received. To interact with R, only the method Evaluate of the REngine instance is needed. For example, the following C# code defines x = 15 in the R runtime environment:
REngine engine = REngine.GetInstance();
engine.Evaluate("x <- 15");
The dependency on R, on the other hand, brings extra complexity to the application deployment. In order to run SMART, a proper version of R installed in advance becomes a prerequisite. Moreover, all the R packages employed (i.e., foreign and data.table) must be installed as well. This becomes quite cumbersome, especially for users who barely know R. To resolve this, we build a portable R together with the necessary packages and then embed it inside the installation package of SMART. In this way, the pre-built portable R travels together with the deployment. By default, REngine searches for the R path in the system environment variables; the portable R, however, cannot set the R path in the system variables. Therefore, we need to redirect the R path manually to the portable R folder, as in the following C# code:
REngine.SetEnvironmentVariables(rPath: rpathPortable + @"\bin\i386", rHome: rpathPortable);
5 Conclusions
This document provides a broad description of the package Rilostat for searching, rearranging, analyzing and downloading labour market data from the ILOSTAT database. It shows how R users can take advantage of all the functionalities that this software offers to the community to access labour-related information, by explaining all the built-in functions of the package, giving examples of data visualization with actual information and analyzing a set of indicators from the SDG collection.
This paper also presents the functionality of SMART, the Statistics Metadata-driven Analysis and Reporting Tool, a hybrid application of R (as the processor) and .NET (as the Graphical User Interface (GUI)) that can perform either aggregations or transformations of data or metadata from Stata, SPSS, CSV or SDMX-ML formats to generate reports in Excel, CSV or SDMX formats.
6 Acknowledgments
The authors are grateful to Rafael Diez de Medina for his encouragement to pursue this work. Special thanks go to David Bescond, the author and maintainer of the package Rilostat, Rosina Gammarano, Edgardo Greising, Steven Kapsos and Yves Perardel for their support and valuable comments.
7 Appendix
7.1 More about the time format option while getting data using Rilostat
The function get_ilostat() will return, by default, the variable 'time' in a raw time format. That is, a vector of characters with the following syntax:
• Yearly data: 'YYYY', where YYYY is the year.
• Quarterly data: 'YYYYQq', where YYYY is the year, Q is the literal letter Q and q is the quarter (taking the value of the corresponding quarter between 1 and 4), e.g. 2017Q2.
• Monthly data: 'YYYYMmm', where YYYY is the year, M is the literal letter M and mm is the month (taking the value of the corresponding month from 01 to 12), e.g. 2017M12.
However, users may find that this format is not appropriate when there is a need to interact with other functions or packages available in R, especially with those created to perform data visualization or time-series analysis. For this reason, the function includes the option to change the format of the variable 'time' in order to return POSIXct dates (time_format = 'date'; e.g. 2017M12 becomes 2017-12-01) or numeric dates (time_format = 'num'; e.g. 2017Q2 becomes 2017.25).
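The conversions can be reproduced offline with a few lines of base R. The sketch below is not the code used inside Rilostat: parse_ilostat_time is a hypothetical helper, it returns objects of class Date rather than POSIXct, and the numeric value assigned to monthly data (year plus (month - 1)/12) is an assumption; only the quarterly case (2017Q2 equals 2017.25) is taken from the text.

```r
# Hypothetical helper: convert raw time codes such as "2017", "2017Q2" and
# "2017M12" into the first date of the period and into a numeric value.
parse_ilostat_time <- function(x) {
  year <- as.integer(substr(x, 1, 4))
  kind <- substr(x, 5, 5)  # "" for yearly data, "Q" or "M" otherwise
  sub  <- suppressWarnings(as.integer(substr(x, 6, nchar(x))))

  # First month of the period.
  month <- ifelse(kind == "Q", (sub - 1L) * 3L + 1L,
                  ifelse(kind == "M", sub, 1L))

  # Fraction of the year elapsed at the start of the period.
  frac <- ifelse(kind == "Q", (sub - 1) / 4,
                 ifelse(kind == "M", (sub - 1) / 12, 0))

  data.frame(raw  = x,
             date = as.Date(sprintf("%04d-%02d-01", year, month)),
             num  = year + frac,
             stringsAsFactors = FALSE)
}

converted <- parse_ilostat_time(c("2017", "2017Q2", "2017M12"))
print(converted)
```

Dates or numeric values of this kind can be passed directly to plotting functions that expect a continuous time axis.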
7.2 And, more of the options available for caching data when using Rilostat
The function get_ilostat() stores cached data by default in 'rds' binary format in file.path(tempdir(), "ilostat"), so the information fetching process is faster. It is possible to choose the directory where the data will be saved, as well as the desired format of the file, by changing the default options in the arguments cache_dir and cache_format, respectively. The name of the stored file is the concatenation of: the 'segment' used to consult the database (either by 'indicator' or 'ref_area'), the 'id' of the table extracted, the type of the variables contained, the time format, and the date of the latest version of the dataset (taken from the latest version of the table of contents used). Finally, the option of quietly getting the information is also available by setting the argument back=FALSE.
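The caching logic can be pictured as follows. This is a sketch under stated assumptions rather than the package's implementation: cache_name is a hypothetical helper, the "-" separator and the date string "20190101" are invented for illustration, and only the default location file.path(tempdir(), "ilostat") and the 'rds' format are taken from the text.

```r
# Default cache location described in the text.
cache_dir <- file.path(tempdir(), "ilostat")
dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)

# Hypothetical helper: build a file name from the elements listed above.
# The "-" separator is an assumption made for this illustration.
cache_name <- function(segment, id, type, time_format, last_update,
                       cache_format = "rds") {
  paste0(paste(segment, id, type, time_format, last_update, sep = "-"),
         ".", cache_format)
}

cache_file <- file.path(
  cache_dir,
  cache_name("indicator", "UNE_DEAP_SEX_AGE_RT_A", "code", "raw", "20190101")
)

# Store a toy dataset and read it back, as a cache hit would do.
toy <- data.frame(time = "2017", obs_value = 9.3, stringsAsFactors = FALSE)
saveRDS(toy, cache_file)
from_cache <- readRDS(cache_file)

print(basename(cache_file))
print(identical(from_cache, toy))
```

Because the file name encodes the request and the version date of the dataset, a later call with the same arguments can be served from disk until a newer version is published.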
7.3 R code for the graphs in section 2
• Figure 1: Evolution of the global employment distribution by occupation, 1991-2023 (ILO modelled estimates, November 2018)
library(Rilostat)
library(tidyverse)
library(viridis)
library(hrbrthemes)
library(scales)
library(stringr)

# -- Relative distribution
# Download employment by occupation for the world aggregate (X01), both sexes,
# drop the total and express the distribution as a share between 0 and 1.
dat_emp1 <- get_ilostat(id = 'EMP_2EMP_SEX_OCU_DT_A',
                        segment = 'indicator',
                        type = "both",
                        time_format = "num",
                        filters = list(ref_area = 'X01',
                                       sex = 'SEX_T')) %>%
  filter(classif1 != 'OCU_DETAILS_TOTAL') %>%
  mutate(distribution = obs_value / 100) %>%
  select(time, classif1, distribution)
Romanian Statistical Review nr. 4 / 201958
# Plot (With the r e l a t i v e d i s t r i b u t i o n )
dat emp1 %>%ggplot ( aes ( x=time ,
y=d i s t r i b u t i o n ,f i l l =c l a s s i f 1 ,c o l o r=c l a s s i f 1 ,text=c l a s s i f 1 ) ) +
geom area ( ) +scale f i l l v i r i d i s ( d i s c r e t e = TRUE) +scale c o l o r v i r i d i s ( d i s c r e t e = TRUE) +labs ( x=”” , y=”” ) +scale y cont inuous ( breaks = pretty breaks (n = 10) ,labels=percent , expand = c ( 0 . 0 1 , 0 . 0 1 ) ) +scale x cont inuous ( breaks = seq (1991 , 2023 , 2 ) ,l im = c (1991 , 2023) , expand = c ( 0 . 0 1 , 0 . 0 1 ) ) +theme ipsum ( ) +theme ( axis . text . x=element text ( s i z e =8) ,
axis . text . y=element text ( s i z e =8) ,legend . p o s i t i o n=”none” ) +
annotate ( ” t ex t ” ,x=1992 ,y= c ( 0 . 9 89 , 0 . 95 , 0 . 89 , 0 . 83 , 0 . 75 , 0 . 40 , 0 . 12 , 0 . 0 4 ) ,l a b e l =c ( ”Managers” ,
” P r o f e s s i o n a l s ” ,”Technic ians and a s s o c i a t e p r o f e s s i o n a l s ” ,” C l e r i c a l support workers ” ,
” Se rv i c e and s a l e s workers ” ,”Craft and r e l a t e d t rade s workers ” ,”Plant and machine operators , and assemble r s ” ,”Elementary occupat ions and s k i l l s a g r i c u l t u r a l ,
f o r e s t r y and f i s h e r y workers ” ) ,h ju s t = 0 , s i z e=I ( 3 ) ) +
annotate ( ” r e c t ” , xmin = 2018 , xmax = 2023 , ymin = 0 , ymax = 1 ,alpha = 0 . 3 , f i l l = ”gray” ) +annotate ( ” t ex t ” , l a b e l = ” Pro j e c t i on s ” , x=2019 , y=0.9 , v ju s t =1,h ju s t =0, s i z e=I ( 4 ) ) +geom v l i n e ( x i n t e r c ep t = 2018 , co l ou r = ” red ” ) +labs ( capt ion = ”Source : i l o s t a t ” ) +theme (plot . t i t l e = element text ( s i z e =12, f a c e=”bold . i t a l i c ” ) )
• Figure 2: Share of youth not in employment, education or training (NEET), 2017 (ILO modelled estimates, November 2018)

library(Rilostat)
library(tidyverse)
library(plotly)

X <- get_ilostat(id = 'EIP_2EET_SEX_RT_A', segment = 'indicator',
                 filters = list(time = '2018', sex = 'T')) %>%
  # Drop regional aggregates, whose codes start with 'X'
  filter(str_sub(ref_area, 1, 1) != 'X') %>%
  select(ref_area, obs_value) %>%
  left_join(Rilostat:::ilostat_ref_area_mapping %>%
              select(ref_area, ref_area_plotly) %>%
              label_ilostat(code = 'ref_area'),
            by = "ref_area") %>%
  filter(!obs_value %in% NA) %>%
  mutate(tot_obs_value = cut(obs_value,
                             quantile(obs_value, na.rm = TRUE),
                             include.lowest = TRUE))

X %>%
  plot_geo(width = 900, height = 600) %>%
  add_trace(
    z = ~obs_value,
    color = ~obs_value,
    colors = c("green", "blue"),
    text = ~ref_area.label,
    locations = ~ref_area_plotly,
    marker = list(line = list(color = toRGB("grey"),
                              width = 0.5)),
    showscale = TRUE) %>%
  colorbar(title = '(%)', len = 0.5, ticksuffix = "%") %>%
  layout(
    font = list(size = 18),
    geo = list(showframe = TRUE,
               showcoastlines = TRUE,
               projection = list(type = 'Mercator'),
               showCountries = TRUE,
               resolution = 110),
    annotations =
      list(x = 1, y = 1,
           text = "Source: ilostat",
           showarrow = FALSE, xref = 'paper', yref = 'paper',
           xanchor = 'right', yanchor = 'auto', xshift = 0, yshift = 0,
           font = list(size = 15, color = "blue"))
  )
• Figure 3: Distribution of the labour force participation rate by sex, 1990, 2000, 2010, 2018 and 2030 (ILO modelled estimates, July 2018)

library(Rilostat)
library(tidyverse)
library(scales)

dat_emp4 <- get_ilostat(id = 'EAP_2WAP_SEX_AGE_RT_A', segment = 'indicator',
                        filters = list(time = c('1990', '2000', '2010', '2018', '2030'),
                                       sex = c('F', 'M'),
                                       classif1 = 'AGGREGATE_TOTAL')) %>%
  # Drop regional aggregates, whose codes start with 'X'
  filter(str_sub(ref_area, 1, 1) != 'X') %>%
  mutate(sex_lab = ifelse(sex == 'SEX_F', 'Women',
                   ifelse(sex == 'SEX_M', 'Men', NA))) %>%
  mutate(sex_year = ifelse((sex == 'SEX_F' & time == '2030'), 'Women - Projection',
                    ifelse((sex == 'SEX_M' & time == '2030'), 'Men - Projection',
                    ifelse((sex == 'SEX_F' & time != '2030'), 'Women - Estimate',
                    ifelse((sex == 'SEX_M' & time != '2030'), 'Men - Estimate', NA))))) %>%
  mutate(lfpr = obs_value / 100) %>%
  select(ref_area, sex_lab, sex_year, time, lfpr) %>%
  group_by(sex_year, time) %>%
  mutate(MD = median(lfpr)) %>%
  ungroup() %>%
  mutate(MD = as.character(MD))

ggplot(dat_emp4, aes(x = time, y = lfpr, fill = sex_year, alpha = sex_year)) +
  geom_boxplot() +
  facet_wrap(~sex_lab) +
  theme_bw() +
  theme(legend.position = "none") +
  scale_alpha_manual(values = c(0.7, 0.2, 0.7, 0.2)) +
  scale_fill_manual(values = c("yellow", "yellow", "turquoise4", "turquoise4")) +
  labs(x = "",
       y = "LFPR (%)",
       caption = "Source: ilostat") +
  scale_y_continuous(labels = percent)
Variable label | Description of the variable | Unit | Year

Active variables
SDG_0131 | Proportion of population covered by social protection floors/systems | Percentage | 2016
SDG_0552 | Female share of employment in managerial positions | Percentage | 2017
SDG_0821 | Annual growth rate of output per worker (measured as GDP in constant 2011 international $ in PPP) | Percentage | 2017
SDG_0852 | Unemployment rate | Percentage | 2017
SDG_0861 | Proportion of youth (aged 15-24 years) not in education, employment or training (NEET) | Percentage | 2017
SDG_N881 | Non-fatal occupational injuries per 100’000 workers | Frequency rate | 2015
SDG_F881 | Fatal occupational injuries per 100’000 workers | Frequency rate | 2015
SDG_0922 | Manufacturing employment as a proportion of total employment | Percentage | 2017
SDG_1041 | Labour income share as a percent of GDP | Percentage | 2015

Supplementary variables
SDG_0111 | Working poverty rate (percentage of employed living below US$1.90 PPP) | Percentage | 2017
SDG_B831 | Proportion of informal employment in non-agricultural employment - Harmonized series | Percentage | 2017

Table 2: Description of the variables
References
Escofier, B. and Pages, J. (2008). Analyses factorielles simples et multiples; objectifs,methodes et interpretation. Dunod Paris.
Eurostat (2018). Sdmxsource.net nuget package.
Gandrud, C. (2013). Reproducible Research with R and R Studio. Chapman & Hall/CRC.
Garnier, S., Ross, N., Rudis, B., Sciaini, M., and Scherer, C. (2018). Package ’Viridis’:Default Color Maps from ’matplotlib’.
Husson, F. and Josse, J. (2019). Package ’missMDA’: Handling Missing Values with Mul-tivariate Data Analysis.
Husson, F., Josse, J., Le, S., Mazet, J., and Husson (2018). Package FactoMineR: Multi-variate Exploratory Data Analysis and Data Mining.
ILO (2018). Decent work and the sustainable development goals: A guidebook on sdglabour market indicators. Department of Statistics.
Josse, J. and Husson, F. (2012). Handling missing values in exploratory multivariate dataanalysis methods. Journal de la Societe Francaise de Statistique, 153(2):79–99.
Kassambara, A. and Mundt, F. (2017). Package ’factoextra’: Extract and Visualize theResults of Multivariate Data Analyses.
Lahti, L., Huovari, J., Kainu, M., and Biecek, P. (2017). Retrieval and analysis of eurostatopen data with the eurostat package. The R Journal, 9(1):385–392.
Lebart, L., Morineau, A., and Piron, M. (1995). Statistique exploratoire multidimension-nelle, volume 3. Dunod Paris.
Perraud, J.-M. (2015). Getting started with R.NET.
Perraud, J.-M. and Abe, K. (2017). R.net nuget package.
R Core Team (2019). R: A Language and Environment for Statistical Computing. RFoundation for Statistical Computing, Vienna, Austria.
Sievert, C., Parmer, c., Hocking, T., Chamberlain, S., Ram, K., Corvellec, M., Despouy, P.,and Inc., P. T. (2019). Package ’plotly’: DCreate Interactive Web Graphics via ’plotly.js’.
Wickham, H. (2018). Package ’scales’: Scale Functions for Visualization.
Wilke, C. O. (2018). Package ’ggridges’: Ridgeline Plots in ’ggplot2’.
Macroeconomic Statistical Forecasting for Engine Demand
Ankit Kamboj
Cummins Technologies India Pvt. Ltd, Pune, India
Debojyoti Samadder
Cummins Technologies India Pvt. Ltd, Pune, India
Ambica Rajagopal
Cummins Technologies India Pvt. Ltd, Pune, India
Sarat Sindhu Mukhopadhyay
Cummins Technologies India Pvt. Ltd, Pune, India
ABSTRACT
Forecasting demand is critical for driving efficient operations in a manufacturing firm. For this reason, firms plan their operations carefully and strive to improve their forecasting methods in order to gain an edge over competitors in the market. The purpose of this paper is to evaluate various shrinkage methods for data containing a large number of features. We focus on the Class 8 Group 2 North America Heavy Duty (NAHD) market and use macroeconomic indicators from the ACT Research economic database to forecast engine shipments a full 3 months out. Various pre-processing techniques are applied to all the variables, which are then decomposed into their components (trend, seasonality and remainder) by Seasonal and Trend decomposition using Loess (STL). For each pre-processing technique the decomposition is analysed visually. The relative significance of the variance associated with each decomposed component is then used to select the appropriate pre-processing technique for every variable, so as to ensure stationarity and reliable forecasting accuracy. We applied several statistical and machine learning methods and built an ensemble of them to minimize the forecasting error. We also observed hardly any increase in accuracy when the number of features is increased beyond 15. The main R packages used in our analysis are: forecast, forecastHybrid, tseries, readxl, xts, quantmod, e1071 and lars.
Keywords: Box-Cox Transformation, Stationarity, STL Decomposition, Least Angle Regression, Shrinkage, Lasso, Support Vector Regression, Hybrid Forecast
1. INTRODUCTION
Forecasting future values of an observed time series plays an important role in nearly all fields, such as economics, finance, meteorology and telecommunication. In manufacturing companies, a systematic demand forecasting framework leads to effective decision-making processes such as sales budgeting and production planning. Yet most firms still rely on subjective and intuitive judgments for product demand forecasts, which is one of the factors behind less reliable production planning. It is of great significance for manufacturing enterprises to predict product demand effectively. Firms adopting a structured forecasting framework have observed constructive impacts on operational performance, and they are devoting considerable research and innovation effort to achieving it.
Statistical time series models are of fundamental importance in various practical domains, and the subject has been an active area of research for many years. To attain higher forecasting accuracy, many statistical time series models have been suggested in the literature. The list of statistical forecasting methods starts with basic models such as exponential smoothing and its variants, Holt’s and Holt-Winters’ methods [10], followed by the Box-Jenkins methodology for ARIMA models [1]. Multivariate GARCH models were later made available [31], expanding the field of statistical models. A further approach creates forecasts from the lags of other macro indicators, which involves extensive use of regression and econometric models [11].
In the last two decades, machine learning (ML) models such as Support Vector Machines [41] and shrinkage methods, both ridge regression and the LASSO [42], have gained popularity for forecasting with high-dimensional datasets and are competing seriously against statistical models. Further study of shrinkage methods showed that LARS [28] gives the most accurate results for selecting important variables in the model. ML and statistical methods differ in how they minimize the sum of squared errors to achieve higher forecast accuracy: ML methods use non-linear algorithms, whereas statistical ones depend on linear processes. ML methods, being at the junction of statistics and computer science, are computationally more demanding than statistical ones [34].
The research studies relevant to the work presented here have been thoroughly analysed. Firstly, because of the inherent trend and seasonality in most time series data, major emphasis is laid on different pre-processing techniques, since both statistical time series models and machine learning models require adequate pre-processing to remove the non-stationarity in the data [43]. To select the relevant pre-processing technique, the time series is decomposed into three components (trend, seasonality and remainder) and the relative variance of each component is analysed [2,14]. The transformation should also make the time series stationary, in terms of stabilizing both the mean and the variance [27]. A few univariate models, such as ARIMA [21,26] and the Error, Trend & Seasonal model (ETS) [24], are then used to understand the predictive nature of the series without any covariates.
For multivariate forecasting, the lags of the macro-indicators and of the shipment series are used as predictors [9]. Since the number of predictors is higher than the number of samples, the LARS [28] shrinkage method is used to identify the top predictors. These important predictors are then used in multivariate forecasting with Dynamic Regression and Support Vector Regression models [3,20], and finally an ensemble of the multivariate and univariate models is built. The experimental evaluation of the forecasting error, in terms of Mean Absolute Percentage Error (MAPE), is presented in tabular format for each forecasting model and pre-processing transformation used.
A time series is simply a series of data observed over time. When the observation intervals are equal, the series is called regularly spaced. In this paper we deal only with regularly spaced time series: the data are observed every month.
2. RELEVANT DEFINITIONS AND LITERATURE REVIEWS:
2.1 Box-Cox Transformation:
In many statistical analyses it is desirable that the following two assumptions hold: (a) the variables are normally distributed, and (b) the variance of a variable does not change across the values of the independent variables, i.e. the variable is homoscedastic [37]. If these assumptions are violated, a transformation needs to be applied [15]. Suppose the observations are denoted by $y_1, \dots, y_n$ and the transformed observations by $w_1, \dots, w_n$; then primarily the following basic transformations may be used:
(i) Square root, $w_t = \sqrt{y_t}$
(ii) Cube root, $w_t = \sqrt[3]{y_t}$
(iii) Logarithm, $w_t = \log(y_t)$
These basic transformations generally help to resolve violations of the above assumptions. Various efforts have been made to generalize them. Tukey (1957) initially proposed that a transformation can be thought of as a class or family of similar mathematical functions. Finally, the Box-Cox transformation (1964) [37] is given by the equation below:
$$w_t = \begin{cases} \dfrac{y_t^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log(y_t), & \lambda = 0 \end{cases} \qquad (1)$$
The value of λ plays a key role: for λ = 1 the transformation reduces to the identity (up to a shift of one unit), for λ = 0 it is the logarithm, and intermediate values give something in between. An important task is to choose an appropriate value of λ. If λ is restricted to the interval [0, 1], Guerrero’s (1993) method may be used to select it [18,38].
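As an illustration of the Box-Cox family, the following minimal Python sketch applies the transformation and its inverse for a given λ. The function names box_cox and inv_box_cox are hypothetical, λ is supplied by the caller rather than estimated with Guerrero's method, and the observations are assumed to be strictly positive.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of strictly positive observations for a given lambda."""
    if any(v <= 0 for v in y):
        raise ValueError("Box-Cox requires strictly positive observations")
    if lam == 0:
        # Limiting case: natural logarithm
        return [math.log(v) for v in y]
    return [(v ** lam - 1) / lam for v in y]


def inv_box_cox(w, lam):
    """Back-transform values to the original scale."""
    if lam == 0:
        return [math.exp(v) for v in w]
    return [(lam * v + 1) ** (1 / lam) for v in w]


# lambda = 0.5 behaves like a shifted and rescaled square root
transformed = box_cox([1.0, 4.0, 9.0], 0.5)
restored = inv_box_cox(transformed, 0.5)
```

Forecasts produced on the transformed scale would be passed through the inverse function before being compared with actual values.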
2.2. Seasonal-Trend Decomposition:
In a time series with a seasonal component, STL decomposition (seasonal-trend decomposition based on loess) may be used to decompose the series into trend, seasonal and remainder components. That is, if the data, the trend component, the seasonal component and the remainder component are denoted by $Y_u$, $T_u$, $S_u$ and $R_u$, respectively, for u = 1 to N, then [5]
$$Y_u = T_u + S_u + R_u \qquad (2)$$
When carrying out loess, a neighbourhood is defined for every data point. Each point in the neighbourhood is given a weight (the neighbourhood weight) based on its distance from the data point in question. Next, a polynomial (usually linear or quadratic) is fitted to these weighted points, and the fitted value at each data point is taken as the trend value. The steps of STL are: (1) detrending; (2) smoothing of the cycle-subseries: a series is formed for each seasonal position and each is smoothed separately; (3) low-pass filtering of the smoothed cycle-subseries: the subseries are recombined and smoothed; (4) detrending of the seasonal series; (5) de-seasonalizing of the original series, using the seasonal component obtained in the previous steps; (6) evaluation of the trend component by smoothing the de-seasonalized series [14].
The relative significance of the variance associated with each decomposed component can be identified by the ratio of the variance of that component to the variance of the original series [2]. As an example, for the remainder component:
$$\frac{\mathrm{Var}(R_u)}{\mathrm{Var}(Y_u)} \qquad (3)$$
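The variance ratio can be computed directly once a decomposition is available. The sketch below assumes the three components have already been produced by some additive decomposition (it does not implement STL itself); the function name variance_shares and the use of the population variance are assumptions made for illustration.

```python
from statistics import pvariance


def variance_shares(trend, seasonal, remainder):
    """Ratio of each component's variance to the variance of the recomposed series."""
    original = [t + s + r for t, s, r in zip(trend, seasonal, remainder)]
    total = pvariance(original)
    return {
        "trend": pvariance(trend) / total,
        "seasonal": pvariance(seasonal) / total,
        "remainder": pvariance(remainder) / total,
    }


# Toy components, purely illustrative
shares = variance_shares(
    trend=[1.0, 2.0, 3.0, 4.0],
    seasonal=[1.0, -1.0, 1.0, -1.0],
    remainder=[0.0, 0.0, 0.0, 0.0],
)
```

The shares need not sum to one, because the components are generally correlated with each other.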
2.3. Least Absolute Shrinkage and Selection Operator:
Suppose we have a dependent variable and a collection of independent variables that might affect it. Ordinary least squares (OLS) estimates are obtained by minimizing the residual sum of squares. There are two major problems with regression based on OLS estimates: (i) OLS estimates often have low bias but very high variance; (ii) with a large set of independent variables we lose the interpretability of the model.
The technique that can handle both of these shortcomings is the LASSO (least absolute shrinkage and selection operator). It reduces the variance of the predictions at the cost of a small increase in bias, which in turn increases prediction accuracy. It also shrinks some of the coefficients and sets others exactly to zero, and in that way performs variable selection [42].
Suppose we have standardized predictors $x_{ij}$ and centered response values $y_i$, for $i = 1, \dots, N$ and $j = 1, \dots, p$. The LASSO regression problem is to find $\beta = (\beta_1, \dots, \beta_p)$ which minimizes the following:
$$\sum_{i=1}^{N}\Big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \qquad (4)$$
The LASSO thus uses an $L_1$ penalty, where λ is a shrinkage parameter [30].
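To show how the $L_1$ penalty sets coefficients exactly to zero, here is a small coordinate-descent sketch for the penalized residual sum of squares with penalty λ Σ|β_j|. It is an illustrative solver, not the algorithm used in the paper (which relies on the lars package); the names soft_threshold and lasso_cd and the toy data are assumptions.

```python
def soft_threshold(rho, t):
    """Shrink rho towards zero by t, returning exactly zero inside [-t, t]."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise sum_i (y_i - x_i . beta)^2 + lam * sum_j |beta_j| by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of predictor j with the partial residual
            z = 0.0     # sum of squares of predictor j
            for i in range(n):
                others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - others)
                z += X[i][j] ** 2
            # The closed-form coordinate update for this objective uses threshold lam / 2
            beta[j] = soft_threshold(rho, lam / 2) / z
    return beta


# Two orthogonal, centred predictors; y depends strongly on the first only
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.1, 1.9, -1.9, -2.1]
beta_ols = lasso_cd(X, y, lam=0.0)    # no penalty: ordinary least squares
beta_lasso = lasso_cd(X, y, lam=2.0)  # penalty removes the weak predictor
```

In this toy example the weak second predictor is dropped entirely while the strong first coefficient is only shrunk, which is the variable-selection behaviour described above.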
2.4. Least Angle Regression (LARS):
In many practical problems we have a large dataset at our disposal, and the number of features of interest is also huge. In macroeconomic forecasting, for example, many time series variables are available. Each variable may be an indicator of some economic factor and hence potentially important as a predictor in the model. But if all the predictors are included, the prediction will be less accurate because of the large variance of the estimates, and the model will also be very complex. In this situation we need to find the features that affect the forecast substantially and isolate them from the noise variables, which results in improved forecast accuracy [8].
Efron et al. (2004) presented a technique called Least Angle Regression (LARS), which can choose the most informative predictors and is inspired by forward stage-wise methods for selecting regression models. An advantage of the LARS algorithm is that it gives a ranking of all the predictors, which is very helpful in many situations [25].
In our context we have applied a moving-window cross-validation technique, widely used in this area, to determine the optimum number of features.
2.5 Hybrid Forecast:
A separate R package, forecastHybrid, builds ensembles of the forecasting methods available in Hyndman’s forecast package. The models that can be used in this package are ARIMA (auto.arima), Error, Trend and Seasonality (ets), the Theta model (thetam), a feed-forward neural network with a single hidden layer and lagged inputs (nnetar), the STL model (stlm), the TBATS model (tbats) and the seasonal naïve model (snaive). The package offers the flexibility of combining the forecasts either with equal weights or with weights based on in-sample errors [6,13].
The advantage of ensemble forecasts is that they provide improved forecasting accuracy compared with that of the individual models [7].
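The two weighting schemes can be sketched in a few lines. The Python example below is not the forecastHybrid implementation: the function name combine_forecasts is hypothetical, and weighting each model by the inverse of its in-sample error is one plausible reading of "based on in-sample errors", stated here as an assumption.

```python
def combine_forecasts(forecasts, in_sample_errors=None):
    """Combine per-model forecast paths into one ensemble path.

    forecasts: dict mapping model name -> list of forecasts (same horizon).
    in_sample_errors: optional dict mapping model name -> positive error measure.
    """
    names = list(forecasts)
    if in_sample_errors is None:
        # Equal weights
        weights = {m: 1.0 / len(names) for m in names}
    else:
        # Assumption: weight proportional to the inverse of the in-sample error
        inverse = {m: 1.0 / in_sample_errors[m] for m in names}
        total = sum(inverse.values())
        weights = {m: inverse[m] / total for m in names}
    horizon = len(forecasts[names[0]])
    return [sum(weights[m] * forecasts[m][h] for m in names) for h in range(horizon)]


# Illustrative numbers only
paths = {"arima": [100.0, 110.0], "ets": [120.0, 130.0]}
equal = combine_forecasts(paths)
weighted = combine_forecasts(paths, in_sample_errors={"arima": 1.0, "ets": 3.0})
```

With error-based weights the ensemble moves towards the model that fitted the training data better, here the hypothetical ARIMA path.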
2.6 Ljung-Box test:
The Ljung-Box test is an important test for the absence of autocorrelation up to a given lag. Its null hypothesis is that the time-series model does not exhibit lack of fit; in other words, that the errors behave as white noise rather than having some other structure. The test statistic for time lag m is:
$$Q(m) = n(n+2)\sum_{k=1}^{m}\frac{\hat{r}_k^{2}}{n-k} \qquad (5)$$
where $\hat{r}_k$ is the sample autocorrelation at lag k of a time series with n points. The statistic accumulates these autocorrelations up to lag m and, under the null hypothesis, follows a central chi-square distribution [32].
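A direct computation of the statistic is short. The sketch below uses the standard Ljung-Box form, n(n+2) times the sum of squared sample autocorrelations divided by (n - k), and returns only the statistic, not the p-value, since the chi-square distribution function is not in the Python standard library. The function names are hypothetical.

```python
def autocorrelation(x, k):
    """Sample autocorrelation of x at lag k."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n))
    return num / denom


def ljung_box_q(x, m):
    """Ljung-Box Q statistic accumulated over lags 1..m."""
    n = len(x)
    return n * (n + 2) * sum(
        autocorrelation(x, k) ** 2 / (n - k) for k in range(1, m + 1)
    )


# Tiny worked example: lag-1 autocorrelation is 0.4, so Q(1) = 5 * 7 * 0.16 / 4
q1 = ljung_box_q([1.0, 2.0, 3.0, 4.0, 5.0], m=1)
```

In practice the statistic would be compared with a chi-square critical value, with the degrees of freedom determined by the number of lags tested.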
2.7 Augmented Dickey Fuller Test (ADF):
The ADF test is the augmented version of the Dickey-Fuller (DF) test, with a lag order of p. The test is applied to a model of the form
$$\Delta y_t = \alpha + \beta t + \gamma\,y_{t-1} + \sum_{i=1}^{p}\delta_i\,\Delta y_{t-i} + \varepsilon_t$$
with the null hypothesis that the data are non-stationary. If the test statistic is less than the critical value, or the p-value is less than 0.05, the null hypothesis is rejected and no unit root is present [27].
2.8 Z-score Normalization:
Z-score normalisation is used when the maximum and minimum values of the time series are unknown and the time series is stationary. If X is a variable taking the values $x_1, \dots, x_n$, then the normalized variable X′ takes the values $x'_1, \dots, x'_n$, where
$$x'_i = \frac{x_i - \bar{x}}{\sigma_X} \qquad (6)$$
with $\bar{x}$ and $\sigma_X$ the mean and standard deviation of X, so that X′ has mean 0 and unit variance.
The main drawback of this method is that it cannot deal with non-stationary data, because the mean and variance of such a time series change over time [35].
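A minimal sketch of the normalization follows. The function name z_score is hypothetical, and dividing by the population standard deviation (rather than the sample one) is an assumption; either choice yields a series with mean 0.

```python
from statistics import mean, pstdev


def z_score(x):
    """Centre a series on its mean and scale it by its standard deviation."""
    mu = mean(x)
    sigma = pstdev(x)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("a constant series cannot be z-score normalized")
    return [(v - mu) / sigma for v in x]


# Mean 5 and standard deviation 2, so 9 maps to 2.0 and 2 maps to -1.5
normalized = z_score([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

When forecasting, the mean and standard deviation would be stored so that predictions can be mapped back to the original scale.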
2.9 Autoregressive Integrated Moving Average (ARIMA):
A seasonal ARIMA model includes six terms, (p, d, q) and (P, D, Q), where p, d, q represent the non-seasonal part of the model and P, D, Q the seasonal part. Here p and P are the autoregressive orders, q and Q are the moving average orders, and d and D are the orders of differencing. We have used the “forecast” library in R, which selects the best p, d, q, P, D, Q using the model AIC values [21,26].
2.10 Error, Trend & Seasonal Model (ETS):
Using exponential smoothing, ETS decomposes a time series into error, trend (additive or multiplicative) and seasonal components. ETS models are estimated either by minimising the sum of squared errors or by maximizing the likelihood, subject to smoothing parameters that lie between 0 and 1 [24]. Among the different combinations of ETS models, the best model is chosen by Akaike’s Information Criterion (AIC) or the Bayesian Information Criterion (BIC): the smaller the AIC/BIC, the better the model [29].
2.11 Support Vector Regression(SVR):
SVR is a modified version of the support vector classification problem in which the model returns a continuous value as output, which makes it a regression problem. SVR sets a tolerance level and attempts to find the narrowest tube centred around the surface [3].
2.12 Dynamic Regression:
When evaluating regression models it is usually assumed that the error term is uncorrelated, but there are scenarios where the errors from the regression are allowed to be autocorrelated, under the assumption that they follow an ARIMA process. Further, if all the variables are stationary, only ARMA errors need to be considered for the residuals [20]. Hence in Dynamic Regression we fit a regression model with ARIMA errors. For example, if $y_t$ is a linear function of the k predictor variables ($x_{1,t}, \dots, x_{k,t}$), $n_t$ is the error term, which follows an ARIMA(1,1,1) model, $e_t$ is white noise and B is the backward shift operator, we can write:
$$y_t = \beta_0 + \beta_1 x_{1,t} + \dots + \beta_k x_{k,t} + n_t, \qquad (1 - \phi_1 B)(1 - B)\,n_t = (1 + \theta_1 B)\,e_t \qquad (7)$$
where $\phi_1$ and $\theta_1$ are the first-order coefficients of the AR (autoregressive) and MA (moving average) parts, respectively.
2.13 First Differenced Series:
The properties of a stationary time series do not depend on the time at which the series is observed. Time series with trend or seasonality are therefore not stationary, as the trend or seasonality affects the value of the series at each instant; a white noise process, intuitively, is stationary [19]. For a non-stationary process, differencing may be a way to handle the situation, since it helps to stabilize the mean. The differenced series is given by
$$y'_t = y_t - y_{t-1} \qquad (8)$$
The drawback of this transformation is that the transformed series has one observation fewer than the original series, because no difference can be computed for the first observation [22]. This is called the first difference, and in most cases stationarity is obtained with the first difference alone. Sometimes a higher-order difference is needed to obtain stationarity [36].
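Differencing is simple to express in code. The sketch below, with the hypothetical function name difference, applies the first difference repeatedly to obtain higher orders and shows that each pass shortens the series by one observation.

```python
def difference(y, order=1):
    """Apply first differencing `order` times; each pass drops one observation."""
    out = list(y)
    for _ in range(order):
        out = [out[t] - out[t - 1] for t in range(1, len(out))]
    return out


# A quadratic trend becomes linear after one difference and constant after two
first = difference([1, 4, 9, 16])            # [3, 5, 7]
second = difference([1, 4, 9, 16], order=2)  # [2, 2]
```

A forecast made on the differenced scale has to be cumulated back, starting from the last observed level, to recover the original scale.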
3. METHODOLOGY:
The purpose of this study is twofold. Our first objective is to carry out descriptive analyses of the relationships between the shipment demand for various engines and the US NAHD (North America Heavy Duty) macroeconomic indicators. Our second objective is to capitalize on the knowledge gained through these analyses by developing multivariate models that can be used to forecast engine shipments, i.e. to build a demand forecasting model for engine shipments that incorporates all key demand planning inputs and macroeconomic indicators and produces full 3-month-out forecasts for the US North America Heavy Duty market.
3.1 Modelling Framework:
In this paper we have worked with data from January 2011 to May 2019. Both the shipment data and the macroeconomic indicators are monthly, but at any given date the values for the past two months are not yet available. Hence, to forecast 3 months out from the current date we need to forecast 5 points ahead of the last available observation (Figure 1). The observations used to build the forecasts are called the training set, and the remaining observations form the test set. With limited data at hand, due care has been taken not to overfit the ML models, which we have used with minimal parameter tuning. Overfitting shows itself when a very complicated model fits the training set well but does not produce reliable forecasts on the test set [39].
Forecasting Timelines
Figure 1
To obtain a reliable estimate of forecast accuracy, time series cross-validation is performed [16], since conclusions drawn from a relatively small test set might not be reliable for future periods. In time series cross-validation we have a series of training and test sets, where each test set consists of five observations and the fifth observation is treated as the 3-months-out forecast value. For each test set there is a corresponding training set containing all the observations prior to that test set. Hence the model is tested on data previously unknown to it when the multi-step errors are computed. This is illustrated in Figure 2, where in each row the blue dots show the training set and the red dot shows the test set. Each training set contains just one more observation than the previous one, so we obtain many more test observations. Finally, the average error over all the test sets represents the overall forecasting performance of the model.
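The cross-validation scheme can be sketched as follows. This Python example generates expanding training windows followed by five-observation test windows, scores only the fifth step, and averages the MAPE over all folds. The function names, the minimum training length and the naive "last value" forecaster are all stand-ins chosen for illustration; they are not the models used in the paper.

```python
def rolling_origin_splits(n_obs, min_train, horizon=5):
    """Expanding-window splits: each training set grows by one observation."""
    splits = []
    for train_end in range(min_train, n_obs - horizon + 1):
        train_idx = list(range(train_end))
        test_idx = list(range(train_end, train_end + horizon))
        splits.append((train_idx, test_idx))
    return splits


def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def cv_mape(series, min_train, horizon=5):
    """Average MAPE of the last (fifth) step over all folds."""
    errors = []
    for train_idx, test_idx in rolling_origin_splits(len(series), min_train, horizon):
        forecast = series[train_idx[-1]]   # naive stand-in: repeat the last training value
        target = series[test_idx[-1]]      # fifth step ahead = 3 months out
        errors.append(mape([target], [forecast]))
    return sum(errors) / len(errors)


# Twelve illustrative monthly values give two folds when min_train = 6
example = [float(v) for v in range(100, 112)]
folds = rolling_origin_splits(len(example), min_train=6)
score = cv_mape(example, min_train=6)
```

Replacing the naive forecaster with any fitted model leaves the splitting and scoring logic unchanged.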
Time Series Cross Validation
Figure 2
3.2 Data Pre-processing:
Most economic time series show seasonal and trend variations. Pre-processing of the data is a major factor affecting forecasting accuracy, as the time series often contain a stochastic trend along with seasonal variations. Not only for time series models but also for machine learning models, it is important to remove the non-stationarity in the variables before building any forecasting model [43]. The forecasting output may be unstable, with suboptimal results, if machine learning models are used without adequate pre-processing of the data [43].
Here we have tested four forms of pre-processing for each time series: (i) the original series; (ii) the Box-Cox transformed series, to achieve stationarity in variance; (iii) the first differenced series, which helps to stabilize the mean; and (iv) the Box-Cox transformed and then first differenced series, which helps to stabilize both mean and variance. An appropriate transformation must be chosen to make each time series stationary. After transformation we have used the Augmented Dickey-Fuller test to check whether the transformation produces a stationary series [33]. We observed that both the shipment series and the predictors become stationary under one or more of the above transformations.
Where more than one transformation yields a stationary series, we have visualized the candidate transformations after decomposing the series by STL (seasonal-trend decomposition based on loess), and then examined the variance associated with each decomposed component [2]. For a particular time series, the ratio of the variance of each decomposed component (trend, seasonal and remainder) to the variance of the transformed series is analysed. The four transformations are then sorted in decreasing order of the ratio for the remainder component, to make sure that the time series does not retain inherent trend and seasonal components. Further, to check that the transformation is not so strict that it turns the time series into white noise, the Ljung-Box test is used with a number of lags equal to twice the seasonal period, i.e. 24 lags for a monthly series with a seasonality of 12 [17].
Since white noise is not linearly forecastable, we have not used transformations that convert the time series into white noise [12]. Finally, the transformation chosen is the one whose remainder explains the largest part of the variation and whose output is stationary but not statistically white noise. The appropriate transformation for each predictor is saved for later use when building the forecasting model, and is referred to as the Optimized transformation. All the time series are then Z-score normalized using their mean and standard deviation.
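The selection rule can be summarised as a small decision function. In the sketch below the outcomes of the stationarity test, the white-noise test and the remainder variance ratio are supplied as precomputed inputs; the function name choose_transformation, the dictionary keys and the numbers are hypothetical and serve only to illustrate the rule.

```python
def choose_transformation(candidates):
    """Pick the stationary, non-white-noise candidate with the largest remainder share.

    candidates: dict mapping transformation name -> dict with keys
    'stationary' (bool), 'white_noise' (bool) and 'remainder_share' (float).
    Returns the chosen name, or None if no candidate qualifies.
    """
    eligible = {
        name: info
        for name, info in candidates.items()
        if info["stationary"] and not info["white_noise"]
    }
    if not eligible:
        return None
    return max(eligible, key=lambda name: eligible[name]["remainder_share"])


# Made-up diagnostics for the four candidate transformations of one series
diagnostics = {
    "original": {"stationary": False, "white_noise": False, "remainder_share": 0.20},
    "box_cox": {"stationary": False, "white_noise": False, "remainder_share": 0.25},
    "first_difference": {"stationary": True, "white_noise": False, "remainder_share": 0.60},
    "box_cox_then_difference": {"stationary": True, "white_noise": True, "remainder_share": 0.90},
}
chosen = choose_transformation(diagnostics)
```

Here the Box-Cox plus differencing candidate has the largest remainder share but is rejected because it is flagged as white noise, so the first difference is selected.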
3.3 Setting up the forecasting problem
The forecasting framework is set up using lags of the macro-indicators and of the shipment series to form an autoregressive distributed lag model [9]. Since we need to forecast 3 months out, i.e. 5 steps ahead of the last available data, we have taken predictors with lags greater than or equal to 5 for both the macroeconomic indicators and the shipment series. Moreover, four different lags of each series are included in the autoregressive distributed lag equation to account for multivariate interaction in the model (Figure 3).
$$y_t = \beta_0 + \sum_{i=1}^{4}\phi_i\,y_{t-l_i} + \sum_{j=1}^{k}\sum_{i=1}^{4}\beta_{j,i}\,x_{j,t-l_i} + \varepsilon_t \qquad (9)$$
where $y_t$ is the shipment series at time t, $x_1, \dots, x_k$ are the macroeconomic indicators, $\varepsilon_t$ is a random disturbance and $l_1, \dots, l_4$ are the different lags of a specific series, with the present time as t.
In the autoregressive distributed lag equation we have formed 204 predictors, as there are four different lags for each of the 51 initial predictors.
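The construction of the lagged predictors can be sketched as below. The example assumes lags 5 to 8 (the four lags shown in the framework figure), uses None for values that are not yet available, and generates 51 placeholder series to reproduce the count of 204 predictors; the function name lagged_features and the series names are hypothetical.

```python
def lagged_features(series_by_name, lags=(5, 6, 7, 8)):
    """Build one lagged column per (series, lag) pair; unavailable values are None."""
    n = len(next(iter(series_by_name.values())))
    features = {}
    for name, values in series_by_name.items():
        for lag in lags:
            features[f"{name}_lag{lag}"] = [
                values[t - lag] if t - lag >= 0 else None for t in range(n)
            ]
    return features


# 51 placeholder series of 12 monthly values each
initial = {f"series_{j}": [float(j + t) for t in range(12)] for j in range(51)}
predictors = lagged_features(initial)
n_predictors = len(predictors)  # 51 series x 4 lags
```

Rows containing None (the first eight months when the largest lag is 8) would be dropped before fitting a model.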
Forecasting Framework:
Figure 3
[Figure: for each month, the shipment series y and each predictor enter at lags 5, 6, 7 and 8; the earliest rows are NA where the lagged values are not yet available.]
3.4 Identification of significant Lags
Starting with our first objective, to carry out descriptive analyses of the
relationships between the shipment demand and the macroeconomic indicators, we
take seven years of monthly data, from January 2011 to December 2017, and look
for predictors with appropriate lags that provide a better prediction of the
shipment series.
After pre-processing both the shipment series and all the predictors, we
need to find the top predictors and their respective lags. The number of
predictors here is significantly higher than the number of samples. For such
settings, George Box coined the term Effect Sparsity [44] to describe that only
a small fraction of the features affects the response and most of them have
zero effect [40]. Using the cross-validation technique described above, the
LARS [28] shrinkage method is used to identify the top predictors, as the
number of predictors is large relative to the sample size. Several experiments
are performed by taking only the top n features from the LARS model (n from 1
to 204), and their average cross-validation errors in terms of MAPE (Mean
Absolute Percentage Error) are analysed. The cross-validation error is minimal
when the model takes only the top 15 features as its predictors. Hence the
subsequent forecasting models are built using only these top 15 predictors.
3.5 Building up the forecasting model
After we have obtained the top predictors, various univariate and
multivariate forecasting models are implemented to produce the 3-months-out
forecast of the shipment series. The five models used in the analysis are:
“Arima”, “Error, Trend & Seasonal Model”, “Support Vector Regression”,
“Dynamic Regression” and “Hybrid Forecast”. Among these five models, Support
Vector Regression and Dynamic Regression are multivariate models whereas
Arima and Error, Trend & Seasonal Model are univariate models. The Hybrid
Forecast model is a combination of multivariate and univariate models
(“Dynamic Regression”, “Error, Trend & Seasonal Model” and “Arima”) and
is implemented using the forecastHybrid package in R. Also, an ensemble of
the above five models is implemented by giving a weight of 30% to each of
the three multivariate models (Support Vector Regression, Dynamic Regression
and Hybrid Forecast) and 5% to each of the two univariate models. These
weights were chosen after analysing different combinations, and they can be
further fine-tuned for optimal performance. Since we are forecasting the value
5 points ahead using time series cross-validation, during each iteration both
the shipment series and the predictors are transformed according to the
relevant pre-processing technique identified during the data pre-processing
step. We have kept the time span from May 2018 to May 2019 as the
out-of-sample period, i.e. the first forecasted value was for May 2018, and the
values of subsequent months are forecasted according to time series
cross-validation as described above in the modelling framework. After this,
the corresponding inverse transformation is applied to the forecasted value
of the shipment series to bring it back to the original scale, which is then
used in computing the accuracy measures.
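The weighting scheme of the ensemble can be written down as a short sketch. The dictionary keys and the function name are hypothetical labels, and the forecast values in the usage example are made-up numbers for illustration only. Only the 30% / 5% weights come from the text.

```python
# Weights from the text: 30% for each multivariate model, 5% for each univariate one.
ENSEMBLE_WEIGHTS = {
    "support_vector_regression": 0.30,
    "dynamic_regression": 0.30,
    "hybrid_forecast": 0.30,
    "arima": 0.05,
    "ets": 0.05,
}


def ensemble_forecast(forecasts, weights=ENSEMBLE_WEIGHTS):
    """Weighted average of the individual point forecasts for one target month."""
    if set(forecasts) != set(weights):
        raise ValueError("forecasts and weights must cover the same models")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * forecasts[name] for name in weights)


# Illustrative (made-up) point forecasts on the original scale
example = {
    "support_vector_regression": 100.0,
    "dynamic_regression": 110.0,
    "hybrid_forecast": 90.0,
    "arima": 120.0,
    "ets": 80.0,
}
print(ensemble_forecast(example))
```

Keeping the weights in one dictionary makes it straightforward to try other combinations, which is the fine-tuning step mentioned above.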
4. RESULTS:
In time series cross-validation, each training set consists of just one
more observation than the previous training set, and we get many more test sets
for computing the errors. The average error over all the test sets then
represents the overall forecasting performance of the model. Econometricians
often call this concept “forecast evaluation on a rolling origin” [16]. The
forecast origin is the time at the end of the training data, and it rolls
forward in time.
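A rolling forecast origin can be sketched as a generator of train/test index pairs. The function name and the toy sizes are assumptions for illustration. Each split extends the training window by one observation and tests on the single observation that lies `horizon` steps after the origin.

```python
def rolling_origin_splits(n_obs, initial_train, horizon):
    """Yield (train_indices, test_index) pairs for a rolling forecast origin.

    The training set grows by one observation per split; the test point is
    the observation `horizon` steps after the end of the training set.
    """
    for origin in range(initial_train, n_obs - horizon + 1):
        train = list(range(origin))      # observations 0 .. origin-1
        test = origin + horizon - 1      # the horizon-step-ahead target
        yield train, test


# Toy example: 10 observations, at least 3 for training, 5 steps ahead
for train, test in rolling_origin_splits(n_obs=10, initial_train=3, horizon=5):
    print(len(train), test)
```

Averaging the error measure over all yielded test points gives the cross-validated error for that horizon.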
Here we need a 5-step-ahead forecast horizon, hence all the cross-validated
error measures are computed for that horizon. Forecast errors are the
differences between the actual test set observations and the point forecasts,
and they are different from residuals: residuals are computed on the training
set while forecast errors are computed on the test set. For an observation
$y_t$ in the test set and its corresponding forecasted value $\hat{y}_t$, the
forecast error is given by:
$$e_t = y_t - \hat{y}_t$$
We compute the accuracy of our method using the forecast errors
calculated on the test data. There are a number of ways to compute the forecast
accuracy: we can take the average absolute error, average squared error,
average percentage error or average absolute percentage error.
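Accuracy Measures
Table 1
Accuracy Measure | Formula
Mean Absolute Error | $\mathrm{mean}(\lvert e_t \rvert)$
Mean Squared Error | $\mathrm{mean}(e_t^2)$
Mean Percentage Error | $\mathrm{mean}(p_t)$
Mean Absolute Percentage Error | $\mathrm{mean}(\lvert p_t \rvert)$
Here $e_t$ is the forecast error (actual value minus forecast) and
$p_t = 100 \, e_t / y_t$ is the percentage error relative to the actual value $y_t$.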
The MAE and MSE depend on the scale of the data, while the MPE and
MAPE are unit-free; they only require all the data to be positive, with no
zeros or small values, and assume that there is a natural zero [23]. Since the
shipment series used as the dependent variable is positive and has an
absolute zero, and in accordance with the business requirement, MAPE is used
to compare the accuracy of the different forecasting models.
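The four accuracy measures can be computed as in the sketch below, which follows their usual definitions (error = actual minus forecast, percentage error = 100 × error / actual). The function name and the sample numbers are illustrative assumptions, not values from the study.

```python
def accuracy_measures(actual, forecast):
    """Return MAE, MSE, MPE and MAPE for paired test-set values.

    Assumes every actual value is positive, as the percentage measures require.
    """
    errors = [a - f for a, f in zip(actual, forecast)]
    pct_errors = [100.0 * e / a for e, a in zip(errors, actual)]
    n = len(errors)
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": sum(e * e for e in errors) / n,
        "MPE": sum(pct_errors) / n,
        "MAPE": sum(abs(p) for p in pct_errors) / n,
    }


# Illustrative values only
print(accuracy_measures(actual=[100.0, 200.0], forecast=[110.0, 180.0]))
```

In this toy case the signed percentage errors cancel in the MPE while the MAPE does not, which is one reason the absolute version is the more common choice for comparing models.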
We have compared the average MAPE value for the out-of-sample period
from May 2018 to May 2019. The comparison is made among the five models
(“Arima”, “Error, Trend & Seasonal Model”, “Support Vector Regression”,
“Dynamic Regression” and “Hybrid Forecast”) and among five different
transformations (Original Series, Box-Cox Transformed Series, First
Differenced Series, Box-Cox Transformed Then First Differenced Series and the
Optimized Transformed Series obtained from the data pre-processing step)
applied to all the predictors. The transformations are applied to both the
shipment series and the predictors.
MAPE (%) Comparison:
Table 2
Model | Original Series | Box-Cox Transformed Series | First Differenced Series | Box-Cox Transformed Then First Differenced Series | Optimized Transformed Series
Arima | 21.66 | 20.33 | 21.05 | 20.54 | 20.14
Error, Trend & Seasonal Model | 19.54 | 19.04 | 19.34 | 17.61 | 18.44
Support Vector Regression | 13.31 | 12.57 | 15.09 | 15.85 | 11.23
Lagged Regression | 12.36 | 11.53 | 16.41 | 16.13 | 10.48
Hybrid Forecast | 12.95 | 12.28 | 15.61 | 16.12 | 11.04
Ensemble Model | 11.96 | 11.17 | 14.48 | 15.07 | 10.21
MAPE (%) Comparison
Figure 4
[Figure: MAPE (%) by transformation and forecasting model.]
5. CONCLUSIONS:
In summary, we have proposed a full framework for demand forecasting
using a combination of statistical and machine learning methods. We have
observed that data pre-processing and the selection of significant features
with their respective lags are very important in a multivariate forecasting
framework. Moreover, we have noticed that the transformation whose remainder
from the STL decomposition explains the largest part of the variation and
which is stationary but not statistically white noise gave the best
performance in terms of average out-of-sample MAPE for every model except the
Error, Trend & Seasonal Model (Figure 4). Also, the three multivariate models,
“Hybrid Forecast”, “Dynamic Regression” and “Support Vector Regression”,
performed better than the univariate models, implying that the ACT
macroeconomic indicators have predictive power for the shipment series.
Additionally, the ensemble model gave the best accuracy metrics.
Further work would be to use more optimized weights in the ensemble
model. Other internal indicators can also be used along with the
macroeconomic indicators to improve the forecast accuracy. The proposed
forecasting framework can be extended to any industry for forecasting demand
or sales, for better financial and supply chain planning.
References:
1. Anderson, O.D., 1977, The Box-Jenkins approach to time series analysis. RAIRO -
Operations Research - Recherche Opérationnelle, Volume 11 (1977) no. 1, p. 3-29.
http://www.numdam.org/item/?id=RO_1977__11_1_3_0
2. Antoine,E. A. Lafare and Peach,Denis W., 2015, Use of seasonal trend
decomposition to understand groundwater behaviour in the Permo-Triassic Sandstone
aquifer, Eden Valley, UK. Hydrogeology Journal, 24 (1). 141-158.
http://nora.nerc.ac.uk/id/eprint/512086/1/art%253A10.1007%252Fs10040-015-1309-3.pdf
3. Awad, M., Khanna R., 2015, Support Vector Regression. In: Efficient Learning
Machines. Apress, Berkeley, CA.
https://link.springer.com/content/pdf/10.1007%2F978-1-4302-5990-9_4.pdf
4. Box,G., Meyer,D., 1986, An Analysis for Unreplicated Fractional Factorials. Techno-
metrics Vol. 28, No. 1, pp 11-18, 1986.
5. Cleveland, R. B., Cleveland, W. S., McRae, J. E., and Terpenning, I. 1990, STL: A
seasonal-trend decomposition procedure based on loess. Journal of Official
Statistics, 6(1):3–73. http://www.nniiem.ru/file/news/2016/stl-statistical-model.pdf
6. Cran-R-Project: forecastHybrid, VIGNETTES.
https://cran.r-project.org/web/packages/forecastHybrid/vignettes/forecastHybrid.html , [Accessed 20 June 2019]
7. Ellis,Peter., 2016, Error, trend, seasonality - ets and its forecast model friends.
http://freerangestats.info/blog/2016/11/27/ets-friends ,[Accessed 20 June 2019]
8. Gelper, Sarah. & Croux, Christophe. 2008, Least angle regression for time series
forecasting with many predictors. http://citeseerx.ist.psu.edu/viewdoc/download?doi
=10.1.1.516.6505&rep=rep1&type=pdf
9. Giles,Dave. 2013, ARDL Models - Part I. Econometrics Beat. University of Victoria,
Canada.https://davegiles.blogspot.com/2013/03/ardl-models-part-i.html ,[Accessed
20 June 2019]
10. Holt, C.C., 1957, Forecasting trends and seasonals by exponentially weighted
moving averages. Carnegie Institute of Technology, Pittsburgh ONR memorandum
no. 52.
11. HOWREY, E. P., 1980, The Role of Time Series Analysis in Econometric Model
Evaluation. http://www.nber.org/chapters/c11706
12. Hurvich,Clifford., Chapter 3: Forecasting from Time Series Models. Forecasting
Handouts, NYU Stern School of Business, New York.
http://people.stern.nyu.edu/churvich/Forecasting/Handouts/Chapt3.1.pdf ,[Accessed 20 June 2019]
13. Hyndman,R.J., Gooijer, Jan G De., 2006, 25 Years of Time Series Forecasting.
https://robjhyndman.com/papers/ijf25.pdf
14. Hyndman, R.J., 2012, Measuring time series characteristics. https://robjhyndman.
com/hyndsight/tscharacteristics ,[Accessed 20 June 2019]
15. Hyndman, R.J., 2014a, Forecasting using R https://robjhyndman.com/talks/
RevolutionR/7-Transformations.pdf ,[Accessed 20 June 2019]
16. Hyndman,R.J., 2014b, Measuring forecast accuracy.
https://pdfs.semanticscholar.org/af71/3d815a7caba8dff7248ecea05a5956b2a487.pdf
17. Hyndman,R.J., 2014c, Thoughts on the Ljung-Box test. https://robjhyndman.com/
hyndsight/ljung-box-test/ ,[Accessed 20 June 2019]
18. Hyndman, R.J. and Bergmeir, Christoph, 2015, Bagging Exponential Smoothing
Methods using STL Decomposition and Box-Cox Transformation, International
Journal of Forecasting, Volume 32, Issue 2, April–June 2016, Pages 303-312
https://robjhyndman.com/papers/BaggedETSForIJF_rev1.pdf
19. Hyndman, R.J., 2016a, Stationarity and differencing, Otexts, Forecasting:
Principles and Practice, Monash University, Australia.
https://www.otexts.org/fpp/8/1 ,[Accessed 20 June 2019]
20. Hyndman, R.J., 2016b, Dynamic regression models, Otexts, Forecasting: Prin-
ciples and Practice, Monash University, Australia. https://otexts.com/fpp2/dynamic.
html ,[Accessed 20 June 2019]
21. Hyndman, R.J., 2016c, Seasonal ARIMA models, Otexts, Forecasting: Principles
and Practice, Monash University, Australia.
https://otexts.com/fpp2/seasonal-arima.html ,[Accessed 20 June 2019]
22. Hyndman,R.J, 2016d, Forecasting: principles and practice.
https://robjhyndman.com/uwa2017/2-3-Differencing.pdf ,[Accessed 20 June 2019]
23. Hyndman,R.J., 2016e, Chapter: 3.4 Evaluating forecast accuracy. Otexts, Fore-
casting: Principles and Practice, Monash University, Australia. https://otexts.org/
fpp2/accuracy.html ,[Accessed 20 June 2019]
24. Hyndman,R.J., 2016f, Estimation and model selection, Otexts, Forecasting:
Principles and Practice, Monash University, Australia.
https://otexts.com/fpp2/estimation-and-model-selection.html ,[Accessed 20 June 2019]
25. Hyndman,R.J., Jiang, B., Athanasopoulos,George., 2018, Macroeconomic
forecasting for Australia using a large number of predictors https://robjhyndman.
com/papers/ausmacrofcastR1.pdf
26. Hyndman,R.J., 2019, Package ‘forecast’, Version: 8.7, Title: Forecasting Func-
tions for Time Series and Linear Models. https://cran.r-project.org/web/packages/
forecast/forecast.pdf ,[Accessed 20 June 2019]
27. Imam, Akeyede., Habiba, D., and Atanda,B.T., 2016,“On Consistency of Tests
for Stationarity in Autoregressive and Moving Average Models of Different Orders.”
https://pdfs.semanticscholar.org/f128/d0d72f70d0a94ecf329a9363fc1ef0abfd9e.
28. Iturbide,Eric., 2013, A Comparison between LARS and LASSO for Initialising the
Time-Series Forecasting Auto-Regressive Equations. Procedia Technology 7(2013)
282 -288 https://www.sciencedirect.com/science/article/pii/S2212017313000364
29. Jofipasi,Chesilia A., Miftahuddin and Hizir, 2017, Selection for the best ETS
(error, trend, seasonal) model to forecast weather in the Aceh Besar District, IOP
Conference Series: Materials Science and Engineering, Volume 352, conference 1.
https://iopscience.iop.org/article/10.1088/1757-899X/352/1/012055/pdf
30. Kourentzes, N. and Petropoulos,F., 2017, Forecasting with R A practical work-
shop, International Symposium on Forecasting https://kourentzes.com/forecasting/
wp-content/uploads/2017/06/Forecasting-with-R-notes.pdf
31. Laurent, R. and Violante, 2012, On the forecasting accuracy of multivariate
GARCH models. Journal of Applied Econometrics. vol. 27, no. 6, pp. 934-955.
32. Ljung, G.M. and Box, G.P., 1978. “On a Measure of a Lack of Fit in Time Series
Models”, Biometrika, 65.2, 297–303.
33. Lyocsa, S., 2011, Unit-root and stationarity testing with empirical application on
industrial production of CEE-4 countries. Munich Personal RePEc Archive Paper
No. 29648 https://mpra.ub.uni-muenchen.de/29648/
34. Makridakis,Spyros and Spiliotis,Evangelos, 2018, Statistical and Machine
Learning forecasting methods: Concerns and ways forward. PLoS One 13(3):
e0194889. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870978/
35. Nayak, S.,Misra, Bijan B. and Behera,Himansu Sekhar, 2013, “Impact of Data
Normalization on Stock Index Forecasting.” https://pdfs.semanticscholar.org/f412/4
953553981e32c39273bb2745a140311d160.pdf
36. Baumöhl, Eduard and Lyócsa, Štefan, 2009, Stationarity of time series and the
problem of spurious regression.
https://mpra.ub.uni-muenchen.de/27926/1/Stationarity_of_time_series_and_the_problem_of_spurious_regression.pdf
37. Osborne,Jason W., 2010, Improving your data transformations: Applying the Box-
Cox transformation, ISSN 1531-7714 http://citeseerx.ist.psu.edu/viewdoc/downloa
d?doi=10.1.1.470.7417&rep=rep1&type=pdf
38. Shang, Han Lin 2015, Selection of the optimal Box-Cox transformation parameter
for modelling and forecasting age-specific fertility, Journal of Population Research,
2015, 32(1), 69-79 https://arxiv.org/abs/1503.02344v1
39. Souhaib Ben Taieb, 2014, Machine learning strategies for multi-step-ahead time
series forecasting. Computer Science department, University of Brussels, Belgium.
http://souhaib-bentaieb.com/pdf/2014_phd.pdf
40. Stodden,Victoria., 2008, Model Selection with Many More Variables than Ob-
servations. Microsoft Research Asia, Stanford University. https://web.stanford.
edu/~vcs/talks/MicrosoftMay082008.pdf
41. Thissen,U. and Brakel,R. 2003, Using support vector machines for time series
prediction. Chemometrics and Intelligent Laboratory Systems Volume 69, Issues
1–2. https://www.sciencedirect.com/science/article/abs/pii/S0169743903001114
42. Tibshirani,R., 1996, Regression shrinkage and selection via the LASSO. Journal
of the Royal Statistical Society Vol. 58, No. 1 (1996), pp. 267-288. https://www.
jstor.org/stable/2346178?seq=1#metadata_info_tab_contents
43. Zhang,GP., Qi, M., 2005, Neural network forecasting for seasonal and trend time
series. European Journal of Operational Research. 2005; 160(2):501–514. https://
doi.org/10.1016/j.ejor.2003.08.037
Understanding Patterns in the Consumption of Agro-Food Products in Romania - An Analysis at Regional Level
Andreea MIRICĂ, PhD. Assistant Lecturer ([email protected])
Bucharest University of Economic Studies
Roxana-Ionela GLĂVAN, PhD. Assistant Lecturer ([email protected])
Bucharest University of Economic Studies
Iulia Elena TOMA, PhD. Candidate ([email protected])
Bucharest University of Economic Studies
Lucian PĂTRAȘCU, PhD. ([email protected])
Bucharest University of Economic Studies
ABSTRACT
The agro-food sector has faced several challenges since the end of World
War II. This article performs an analysis of this sector from the consumer
perspective. More precisely, it aims to find certain patterns in the consumption
of agro-food products. In this respect, quarterly data with regard to the
average consumption of the agro-food products at regional level are used. The
data are provided by the Romanian National Institute of Statistics. In order to
find patterns in the consumers’ behaviour, JDemetra+ version 2.2.2 was used to
analyse the time series with regard to seasonal patterns and calendar effects
(Trading days, Julian Easter). TRAMO-SEATS and X13 were assessed as seasonal
adjustment methods for all series that showed significant
seasonality. Moreover, only the automatic procedure was used in all cases. The X13
procedure provided the best results in most of the cases.
Keywords: agro-food products, seasonal adjustment, JDemetra+
JEL Classification: Q17
1. INTRODUCTION
The newest release of the “EU agricultural outlook for 2018-2030”
report, published in December 2018 by the European Commission, shows that
France, Germany, the UK and Romania are projected to account for about 55%
of EU main cereal production in 2030. Based on recent food trends, consumers
are more inclined to look closely at the origin, environmental friendliness
and organic certification of the food products they select. This aspect has
an important economic impact on the overall production chain. Understanding
such challenging factors can increase competitiveness and bring the
technologies required to develop agro-food products better suited to the new
demand trends.
After World War II, the development of economic and commercial
relations became a priority for Europe. Historical studies show that Romania
has experience in exporting various agro-food products. Over the last decades,
Romania has lost the capacity to sell goods and agro-food products, in the
context of the great changes of the late 1980s. Since the country joined the
European Union in 2007, agriculture, the main component of the agro-food
sector, has moved slightly towards increasing self-consumption (Davidova et
al., 2009) and generating new goods with high added value in the market. These
developments are observed in the context of EU capital investments in the
agricultural sector and related industries.
Considering the important challenges the agro-food sector is facing
especially due to the changes in the consumption patterns, it is crucial for
every country to perform an in-depth analysis of this phenomenon.
Romania has high potential in food production (PWC, 2017).
However, for potential investors to be able to exploit the know-how and
the natural resources existing in Romania in order to best respond to
consumers’ needs, they must understand consumption patterns, as consumers
are the main actors of the business environment. Toma and Mirică (2018)
show that exploring seasonality at a low disaggregation level is very important
for business decision makers to understand the business environment. Therefore,
this article will explore the seasonal patterns in the consumption of agro-food
products in Romania.
2. DATA AND METHODS
For this purpose, quarterly data on the average consumption per person
for several agro-food products were retrieved from the TEMPO Online Database
of the Romanian National Institute of Statistics. Data were retrieved at
regional level as this is the lowest level of disaggregation available.
Moreover, the available time frame is 2015-2018, which complies with the
minimum standards in official statistics with regard to the length of time
series for the purpose of seasonal adjustment (Buono et al. 2018; UNECE,
2012). Also, a time series length of four years is enough for detecting the
Easter effect (Findley et al., 2005).
In order to explore the seasonal patterns of these series, the tools
provided by JDemetra+ 2.2.2 will be used. JDemetra+ 2.2.2 is the latest version
of the software officially recommended by Eurostat for seasonal adjustment
(Eurostat, 2019). This software provides an easy-to-use tool for detecting
seasonality and outliers, as well as an automatic procedure for seasonal
adjustment (Grudkowska, 2017). The automatic procedure of this software is very
user friendly and provides high quality results (Mirică et al. 2017). However,
for problematic time series, the decomposition method and the ARIMA model must
be chosen manually, based on the methodology proposed by Mirică et al. (2016).
In order to assess the presence of seasonality, JDemetra+ offers several
tests for the raw series, of which the Autocorrelation at seasonal lags test will
be used (Mirică et al. 2017). Series will be seasonally adjusted only if there is
strong evidence of seasonality.
Next, all the series that show a strong seasonal pattern are seasonally
adjusted using TRAMO-SEATS and X13, the two methods incorporated in
JDemetra+ 2.2.2. In order to perform the seasonal adjustment, the Romanian
calendar is defined, comprising all the legal holidays in this country,
including the Julian Easter. For the results to be easy to interpret, the
information proposed by Andrei et al. (2019) will be extracted from the output
for each series: transformation method, the presence of Easter and Trading Days
effects, outliers, the result of the residual seasonality tests, the overall
quality and the AIC. The seasonal adjustment method will be chosen taking into
account the overall quality of the results of each method. In the case of equal
quality, the method with the lowest AIC will prevail. With regard to the AIC,
it is important to note that Motulsky and Christopoulos (2004) show that the
sign of this indicator is of no practical importance and one should choose the
model with the lowest AIC.
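The selection rule described above (overall quality first, then the lowest AIC) can be sketched as follows. The function name and the numeric ranking of the quality labels are assumptions made for illustration. The example values are those reported in Table 2 for bread and bakery products in North-West, flour in North-East and fresh meat in South-East.

```python
# Assumed ordering of the quality labels, from best to worst.
QUALITY_RANK = {"good": 3, "uncertain": 2, "bad": 1, "severe": 0}


def choose_method(results):
    """Pick a seasonal adjustment method from {method: (quality, aic)}.

    The best overall quality wins; ties on quality are broken by the lowest
    AIC (sign included); None is returned when both criteria are tied.
    """
    best_rank = max(QUALITY_RANK[quality] for quality, _ in results.values())
    candidates = {
        method: aic
        for method, (quality, aic) in results.items()
        if QUALITY_RANK[quality] == best_rank
    }
    lowest = min(candidates.values())
    winners = [method for method, aic in candidates.items() if aic == lowest]
    return winners[0] if len(winners) == 1 else None


bread_nw = {"TRAMO-SEATS": ("severe", -5.3859), "X13": ("good", -0.1254)}
flour_ne = {"TRAMO-SEATS": ("good", -36.8021), "X13": ("good", -44.1579)}
meat_se = {"TRAMO-SEATS": ("good", -16.8829), "X13": ("good", -16.8829)}
for series in (bread_nw, flour_ne, meat_se):
    print(choose_method(series))
```

Returning None for a full tie corresponds to the series for which no decision between the two methods can be made.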
3. RESULTS
Firstly, all the series, for each agro-food product and region of Romania,
are tested for the presence of seasonality. The results of the Autocorrelation
at seasonal lags test are displayed in Table 1. As one can observe, the
consumption of maize flour, milk, fats, as well as mineral water and soft
drinks shows no strong evidence of seasonality in any region. On the other
hand, there are agro-food products that are consumed on a seasonal basis in all
regions: Vegetables and canned vegetables in fresh vegetable equivalent,
Confiture, jam, compote, jellies and Chocolate, sweets, Turkish delight and
other sugar confectionery. For fruits and eggs, there is a strong seasonal
pattern in consumption in all regions except for Bucharest-Ilfov. The
consumption of bread and bakery products presents seasonality only in the
North-West region, while the consumption of flour and potatoes presents
seasonality only in North-East and the consumption of rice only in the
South-Muntenia region. The consumption of fresh meat has seasonal patterns in
South-East and South-West Oltenia, while the consumption of meat products has
them in South-East, South-West Oltenia and South-Muntenia. The consumption of
cheese and cream presents strong seasonality in the Center, South-West Oltenia
and South-Muntenia regions. The consumption of Maize, sunflower and soya oil
has seasonal patterns in North-West and South-Muntenia. Sugar is consumed on a
seasonal basis in South-East and South-Muntenia. The consumption of alcoholic
drinks displays strong seasonal patterns in South-Muntenia and Center.
If we analyse the situation by region, one can observe that Bucharest-
Ilfov has the lowest number of series that present seasonal patterns, closely
followed by the West region. On the other hand, South-Muntenia has the
highest number of such series.
Results of the Autocorrelation at seasonal lags test for series concerning
the average consumption per person by agro-food product and region
– P values and interpretation, source: designed by the authors using
JDemetra+ 2.2.2. Each cell gives the interpretation of the test (seasonality
present, perhaps present or not present) followed by the p-value.
Table 1
Product | North-West | Center | North-East | South-East | Bucharest-Ilfov | South-Muntenia | South-West Oltenia | West
Bread and bakery products | present (0.0025) | not present (0.2821) | not present (0.1220) | not present (0.2531) | not present (0.9622) | present (0.0026) | not present (0.1424) | not present (0.4537)
Maize flour | not present (1.0000) | not present (0.9855) | not present (0.3150) | not present (1.0000) | not present (1.0000) | perhaps present (0.0282) | not present (0.9482) | not present (0.0811)
Flour | not present (0.2739) | not present (0.1835) | present (0.0016) | not present (0.0886) | not present (0.9724) | not present (0.1463) | not present (1.0000) | not present (0.4806)
Rice | not present (0.3387) | not present (0.1312) | not present (0.8842) | not present (1.0000) | not present (0.0848) | present (0.0045) | not present (0.0615) | not present (1.0000)
Fresh meat | not present (0.0755) | not present (0.0535) | not present (0.0694) | present (0.0009) | not present (0.2737) | not present (0.0612) | present (0.0019) | not present (1.0000)
Meat products | not present (0.8477) | not present (0.1906) | perhaps present (0.0136) | present (0.0010) | perhaps present (0.0159) | present (0.0003) | present (0.0001) | not present (0.2348)
Milk | not present (0.7798) | not present (0.6186) | not present (0.4785) | not present (1.0000) | not present (1.0000) | not present (0.9972) | not present (0.0901) | not present (0.9964)
Cheese and cream | not present (0.6547) | present (0.0020) | not present (0.2355) | not present (0.2722) | not present (0.0784) | present (0.0005) | present (0.0003) | not present (1.0000)
Eggs | present (0.0011) | present (0.0006) | present (0.0003) | present (0.0002) | perhaps present (0.0323) | present (0.0023) | present (0.0006) | not present (0.3625)
Fats | perhaps present (0.0372) | not present (0.5340) | not present (0.3151) | not present (0.0893) | not present (0.8957) | not present (0.3149) | not present (0.3877) | not present (1.0000)
Maize, sunflower, soya oil | present (0.0043) | not present (0.2146) | not present (0.2281) | not present (0.0721) | not present (0.1960) | present (0.0015) | not present (0.2343) | not present (1.0000)
Fruit | present (0.0005) | present (0.0000) | present (0.0003) | present (0.0007) | perhaps present (0.0174) | present (0.0001) | present (0.0009) | present (0.0014)
Potatoes | not present (0.0778) | not present (0.6932) | present (0.0015) | not present (0.4988) | not present (0.0749) | not present (0.0503) | not present (0.1593) | not present (0.7442)
Vegetables and canned vegetables in fresh vegetable equivalent | present (0.0000) | present (0.0001) | present (0.0000) | present (0.0000) | present (0.0001) | present (0.0000) | present (0.0001) | present (0.0004)
Sugar | perhaps present (0.0141) | not present (0.3574) | not present (0.1104) | present (0.0001) | perhaps present (0.0378) | present (0.0098) | not present (0.1057) | not present (0.7073)
Confiture, jam, compote, jellies | present (0.0002) | present (0.0001) | present (0.0001) | present (0.0001) | present (0.0009) | present (0.0000) | present (0.0001) | present (0.0001)
Chocolate, sweets, Turkish delight and other sugar confectionery | present (0.0070) | present (0.0022) | present (0.0000) | present (0.0001) | present (0.0057) | present (0.0009) | present (0.0001) | present (0.0026)
Mineral water and other soft drinks | not present (0.1143) | not present (0.1610) | not present (0.4609) | not present (0.1571) | not present (0.7285) | not present (0.4895) | perhaps present (0.0458) | not present (0.1050)
Alcoholic drinks | perhaps present (0.0289) | present (0.0092) | not present (0.1117) | not present (0.4691) | not present (0.2579) | present (0.0024) | not present (0.9090) | not present (0.0735)
Next, the automatic procedures of TRAMO-SEATS and X13 were applied
to seasonally adjust the time series that present strong evidence of
seasonality. The results are displayed in Table 2. For most of the series, the
X13 method is more suitable for seasonal adjustment. However, there are some
series for which one cannot decide between the two methods: the consumption of
fresh meat in South-East; meat products in South-Muntenia; eggs in North-West
and South-East; fruit in North-West; vegetables and canned vegetables in fresh
vegetable equivalent in North-West; and confiture, jam, compote, jellies in
North-West. Moreover, there were some cases where TRAMO-SEATS provided better
results: the consumption of fruit in Center, South-Muntenia and South-East;
potatoes in North-East; vegetables and canned vegetables in fresh vegetable
equivalent in South-East and Bucharest-Ilfov; and chocolate, sweets, Turkish
delight and other sugar confectionery in North-East, South-West Oltenia and
West.
With regard to the calendar effect, interesting results were obtained.
Firstly, there is no trading days effect, meaning that the consumption of
agro-food products is not influenced by the day of the week. Secondly, for some
products there is a significant negative Easter effect of various lengths: for
the consumption of cheese and cream, the effect lasts for 15 days in South –
Muntenia and for 8 days in South - West Oltenia; for eggs, 15 days in North –
East as well as South – Muntenia; for meat products, 8 days in South – East;
for maize, sunflower, soya oil, 8 days in South – Muntenia; for fruit, 15 days
in both North – East and South – Muntenia; for vegetables and canned vegetables
in fresh vegetable equivalent, 8 days in the Center as well as South - West
Oltenia; for alcoholic drinks, 15 days in the Center region. The results are in
line with those reported in the scientific literature. For example, analysing
US data, McElroy et al. (2018) also obtained a negative pre-Easter effect for
groceries.
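To illustrate what an Easter effect of a given length means for quarterly data, the sketch below computes the share of the days immediately preceding Easter that fall in each quarter. It assumes that the national calendar uses the Orthodox Easter date and it leaves the regressor uncentred; the exact regressor built by the seasonal adjustment software may differ. Both function names are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def orthodox_easter(year: int) -> date:
    """Orthodox Easter as a Gregorian date (Meeus Julian algorithm).

    The 13-day Julian-to-Gregorian offset is valid for 1900-2099.
    """
    a, b, c = year % 4, year % 7, year % 19
    d = (19 * c + 15) % 30
    e = (2 * a + 4 * b - d + 34) % 7
    month, day = divmod(d + e + 114, 31)
    return date(year, month, day + 1) + timedelta(days=13)


def easter_shares(year: int, length: int) -> dict:
    """Share of the `length` days before Easter that fall in each quarter."""
    easter = orthodox_easter(year)
    quarters = Counter(
        ((easter - timedelta(days=k)).month - 1) // 3 + 1
        for k in range(1, length + 1)
    )
    return {q: n / length for q, n in sorted(quarters.items())}


shares_2018 = easter_shares(2018, 15)  # window straddles the first and second quarter
shares_2019 = easter_shares(2019, 8)   # window lies entirely in the second quarter
```

A longer window is more likely to spill into the first quarter in years with an early Easter, which is why the 8-day and 15-day specifications can lead to different estimates.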
The results of the seasonal adjustment process for the series concerning
the average consumption per person of various agro-food products using
TRAMO-SEATS and X13 with national calendar
Table 2
Series / method | Transformation | Easter effect | Trading days effect | Outliers detected and corrected | Residual seasonality | Overall quality | AIC
Bread and bakery products, North – West
TRAMO-SEATS RSA full | log-transformed | no | no | 1 | yes | severe | -5.3859
X13 RSA5c | log-transformed | no | no | no | no | good | -0.1254
Flour, North – East
TRAMO-SEATS RSA full | no transformation | no | no | no | no | good | -36.8021
X13 RSA5c | log-transformed | no | no | no | no | good | -44.1579
Rice, South – Muntenia
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | -88.34243
X13 RSA5c | log-transformed | no | no | no | no | good | -96.4454
Fresh meat, South – East
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | -16.8829
X13 RSA5c | log-transformed | no | no | no | no | good | -16.8829
Fresh meat, South - West Oltenia
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | -6.6265
X13 RSA5c | log-transformed | no | no | no | no | good | -16.6115
Meat products, South – Muntenia
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | -33.4772
X13 RSA5c | log-transformed | no | no | no | no | good | -33.4772
Meat products, South – East
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | -17.1068
X13 RSA5c | log-transformed | yes, 8 days, coef. -0.66 | no | no | no | good | -25.0815
Meat products, South - West Oltenia
TRAMO-SEATS RSA full | log-transformed | no | no | 1 | no | good | -41.6206
X13 RSA5c | log-transformed | no | no | 1 | no | good | -50.4455
Cheese and cream, South – Muntenia
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | -31.2457
X13 RSA5c | log-transformed | yes, 15 days, coef. -0.09 | no | no | no | good | -45.3041
Cheese and cream, Center
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | -29.53659
X13 RSA5c | log-transformed | no | no | no | no | good | -36.8126
Cheese and cream, South - West Oltenia
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | -37.1563
X13 RSA5c | log-transformed | yes, 8 days, coef. -0.1829 | no | no | no | good | -51.9694
Eggs, North – West
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | 24.5093
X13 RSA5c | log-transformed | no | no | no | no | good | 24.0148
Eggs, Center
TRAMO-SEATS RSA full | no transformation | no | no | no | no | good | 15.9171
X13 RSA5c | log-transformed | no | no | no | no | good | 16.9665
Eggs, North – East
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | 20.3638
X13 RSA5c | log-transformed | yes, 15 days, coef. -0.1 | no | no | no | good | 12.4254
Eggs, South – East
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | 24.8740
X13 RSA5c | log-transformed | no | no | no | no | good | 24.8740
Eggs, South – Muntenia
TRAMO-SEATS RSA full | no transformation | no | no | no | no | good | 30.4485
X13 RSA5c | log-transformed | yes, 15 days, coef. -0.14 | no | no | no | good | 21.1764
Eggs, South - West Oltenia
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | 23.5356
X13 RSA5c | log-transformed | no | no | no | no | good | 20.1071
Maize, sunflower, soya oil, North – West
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | -41.7316
X13 RSA5c | log-transformed | no | no | no | no | good | -42.2766
Maize, sunflower, soya oil, South – Muntenia
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | -53.5028
X13 RSA5c | log-transformed | yes, 8 days, coef. -0.22 | no | no | no | good | -69.7247
Fruit, North – West
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | 3.6303
X13 RSA5c | log-transformed | no | no | no | no | good | 3.6303
Fruit, Center
TRAMO-SEATS RSA full | no transformation | no | no | no | no | good | -9.1804
X13 RSA5c | log-transformed | no | no | no | no | good | -8.8778
Fruit, North – East
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | 2.7937
X13 RSA5c | log-transformed | yes, 15 days, coef. -0.11 | no | no | no | good | 0.5475
Fruit, South – East
TRAMO-SEATS RSA full | no transformation | no | no | no | no | good | 6.7519
X13 RSA5c | log-transformed | no | no | no | no | good | 10.5440
Fruit, South – Muntenia
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | 2.19170
X13 RSA5c | log-transformed | yes, 1 day, coef. -130.4 | no | no | no | good | 2.8940
Fruit, South - West Oltenia
TRAMO-SEATS RSA full | no transformation | no | no | no | no | good | 2.3428
X13 RSA5c | log-transformed | yes, 1 day but coef. approx. 0 | no | no | no | good | -3.1467
Fruit, West
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | 16.7622
X13 RSA5c | log-transformed | no | no | no | no | good | 16.7622
Potatoes, North – East
TRAMO-SEATS RSA full | log-transformed | no | no | 1 | no | good | -4.2044
X13 RSA5c | preprocessing: failed
Vegetables and canned vegetables in fresh vegetable equivalent, North – West
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | 9.9716
X13 RSA5c | log-transformed | no | no | no | no | good | 9.9716
Vegetables and canned vegetables in fresh vegetable equivalent, Center
TRAMO-SEATS RSA full | no transformation | no | no | no | no | good | 14.9162
X13 RSA5c | log-transformed | yes, 8 days, coef. -0.57 | no | no | no | good | 6.7268
Vegetables and canned vegetables in fresh vegetable equivalent, North – East
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | 11.5725
X13 RSA5c | log-transformed | no | no | 1 | no | good | 3.6528
Vegetables and canned vegetables in fresh vegetable equivalent, South – East
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | 15.1868
X13 RSA5c | log-transformed | no | no | no | no | good | 17.9460
Vegetables and canned vegetables in fresh vegetable equivalent, Bucharest – Ilfov
TRAMO-SEATS RSA full | no transformation | no | no | no | no | good | 24.0226
X13 RSA5c | log-transformed | no | no | no | no | good | 26.7319
Vegetables and canned vegetables in fresh vegetable equivalent, South – Muntenia
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | 10.0872
X13 RSA5c | log-transformed | no | no | no | no | good | 9.6269
Vegetables and canned vegetables in fresh vegetable equivalent, South - West Oltenia
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | 26.9600
X13 RSA5c | log-transformed | yes, 8 days, coef. -0.44 | no | no | no | good | 19.8792
Vegetables and canned vegetables in fresh vegetable equivalent, West
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | 30.5993
X13 RSA5c | log-transformed | no | no | no | no | good | 30.5759
Sugar, South – East
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | -40.0591
X13 RSA5c | log-transformed | no | no | 1 | no | uncertain | -62.1106
Sugar, South – Muntenia
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | -45.7166
X13 RSA5c | log-transformed | no | no | no | no | good | -58.6181
Confiture, jam, compote, jellies, North – West
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | -23.8896
X13 RSA5c | log-transformed | no | no | no | no | good | -23.8896
Confiture, jam, compote, jellies, Center
TRAMO-SEATS RSA full | no transformation | no | no | no | no | good | -36.5782
X13 RSA5c | log-transformed | no | no | 2 | no | severe | -53.0063
Confiture, jam, compote, jellies, North – East
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | -28.5528
X13 RSA5c | log-transformed | no | no | no | no | good | -33.1440
Confiture, jam, compote, jellies, South – East
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | -34.2008
X13 RSA5c | log-transformed | no | no | no | no | good | -43.3323
Confiture, jam, compote, jellies, Bucharest – Ilfov
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | -38.4625
X13 RSA5c | log-transformed | no | no | no | no | good | -41.7283
Confiture, jam, compote, jellies, South – Muntenia
TRAMO-SEATS RSA full | no transformation | no | no | no | no | good | -34.8264
X13 RSA5c | log-transformed | no | no | no | no | good | -38.8469
Confiture, jam, compote, jellies, South - West Oltenia
TRAMO-SEATS RSA full | no transformation | no | no | no | no | good | -37.8559
X13 RSA5c | log-transformed | no | no | 1 | no | good | -55.4987
Confiture, jam, compote, jellies, West
TRAMO-SEATS RSA full | no transformation | no | no | no | no | good | -28.5876
X13 RSA5c | log-transformed | no | no | no | no | good | -28.0057
Chocolate, sweets, Turkish delight and other sugar confectionery, North – West
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | -39.8765
X13 RSA5c | log-transformed | no | no | no | no | good | -60.5710
Chocolate, sweets, Turkish delight and other sugar confectionery, Center
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | -53.4022
X13 RSA5c | log-transformed | no | no | 1 | no | good | -71.9031
Chocolate, sweets, Turkish delight and other sugar confectionery, North – East
TRAMO-SEATS RSA full | no transformation | no | no | no | no | good | -56.0022
X13 RSA5c | log-transformed | no | no | no | no | good | -52.4192
Chocolate, sweets, Turkish delight and other sugar confectionery, South – East
TRAMO-SEATS RSA full | no transformation | no | no | no | no | good | -59.79847
X13 RSA5c | log-transformed | no | no | no | no | good | -69.0889
Chocolate, sweets, Turkish delight and other sugar confectionery, Bucharest – Ilfov
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | -34.5570
X13 RSA5c | log-transformed | no | no | no | no | good | -35.54072
Chocolate, sweets, Turkish delight and other sugar confectionery, South – Muntenia
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | -58.54067
X13 RSA5c | log-transformed | no | no | no | no | good | -70.6273
Chocolate, sweets, Turkish delight and other sugar confectionery, South - West Oltenia
TRAMO-SEATS RSA full | no transformation | no | no | no | no | good | -53.7011
X13 RSA5c | log-transformed | no | no | no | no | good | -46.5465
Chocolate, sweets, Turkish delight and other sugar confectionery, West
TRAMO-SEATS RSA full | no transformation | no | no | no | no | good | -40.5711
X13 RSA5c | log-transformed | no | no | no | no | good | -34.8893
Alcoholic drinks, Center
TRAMO-SEATS RSA full | no transformation | no | no | no | no | good | 0.6393
X13 RSA5c | log-transformed | yes, 15 days, coef. -0.16 | no | no | no | good | -1.2334
Alcoholic drinks, South – Muntenia
TRAMO-SEATS RSA full | log-transformed | no | no | no | no | good | -14.7629
X13 RSA5c | log-transformed | no | no | no | no | good | -24.7765
4. CONCLUSIONS
Understanding the consumption patterns of agro-food products is a
necessary step in a fast-changing economy and may contribute to sustainable
business growth in this economic sector. Currently, the focus on the
traceability of origin and on the quality of products is seen as a change in
consumer behavior that has occurred on the agro-food product market (Opara, 2003).
In the present research, using the most recent quarterly data from the
National Institute of Statistics of Romania, we explore seasonal patterns in the
consumption of agro-food products at the regional level.
The results reveal that the consumption of some agro-food products
shows no seasonality in any region. However, there are products, such as
vegetables and canned vegetables in fresh vegetable equivalent, confiture, jam,
compote, jellies, and chocolate, sweets, Turkish delight and other sugar
confectionery, that are consumed on a seasonal basis in all regions.
Also, the analysis shows that seasonal patterns in the consumption of
fruits and eggs persist in all regions except for Bucharest-Ilfov. The South-West
Oltenia and South-East Regions present seasonal patterns in the consumption
of fresh meat and meat products. The South-Muntenia Region has seasonal patterns
in the consumption of rice, meat, cheese and cream, maize, sunflower and
soya oil, and sugar products.
The consumption of alcoholic drinks shows a strong seasonal pattern
in the South-Muntenia and Center Regions. The North-East Region presents
seasonality only in the consumption of flour and potatoes, while the North –
West Region does so for bread and bakery products and maize, sunflower and soya oil.
The situation by region shows that the Bucharest-Ilfov and West regions have
the lowest number of series that present seasonal patterns, whereas
South-Muntenia has the highest number of such series.
When the automatic TRAMO-SEATS and X13 procedures were
applied to seasonally adjust the time series, the X13 procedure generally
obtained the best results. Even so, there were some circumstances when
TRAMO-SEATS provided better results and some cases where one cannot decide
between the two methods.
When the series are checked for calendar effects, no trading days effect
is observed, meaning that the consumption of agro-food products is not
influenced by the day of the week. Also, the results show a negative
pre-Easter effect of various lengths for some products and regions. Furthermore,
they reveal that unobserved factors may contribute to the current trends in the
consumption patterns of agro-food products. The study of such unobserved
effects needs to be addressed by using other decision criteria.
REFERENCES
1. Andrei, T., Mirică, A., Glăvan, I. R., Ferariu, G. A., and Mincu-Rădulescu, G. I.,
2019, Seasonal adjustment of tourism data for Romania using JDemetra+, paper
presented at the http://simpstat.ase.ro/wp-content/uploads/2019/06/ICAS2019-Conference-Program..pdf
2. Buono D., Infante, E., and Mazzi, G. L., 2018, Short versus long time series: An
empirical analysis in Handbook on Seasonal Adjustment, Eurostat
https://ec.europa.eu/eurostat/documents/3859598/8939616/KS-GQ-18-001-EN-N.pdf
3. Davidova, S., Fredriksson, L., and Bailey, A., 2009, Subsistence and semi-
subsistence farming in selected EU new member states, Agricultural Economics, no.
40, pp. 733–744.
4. European Commission, 2018, EU agricultural outlook for markets and income, 2018-
2030. European Commission, DG Agriculture and Rural Development, Brussels.
5. Eurostat, 2019, Seasonal Adjustment
https://ec.europa.eu/eurostat/cros/content/download_en
6. Findley, D. F., Wills, K., and Monsell, B. C., 2005, Issues in estimating Easter
regressors using regARIMA models with X-12-ARIMA. In Proceedings of the American
Statistical Association.
7. Grudkowska S., 2017, JDemetra+ User Guide Version 2.2
https://ec.europa.eu/eurostat/cros/system/files/jdemetra_user_guide_version_2.2.pdf
8. McElroy, T. S., Monsell, B. C., and Hutchinson, R. J., 2018, Modeling of Holiday
Effects and Seasonality in Daily Time Series, Statistics, 1.
9. Mirică, A., Andrei, T., Dascălu, E. D., Mincu-Rădulescu, G. I., and Glăvan, I. R.,
2016, Revision policy of seasonally adjusted series – case study on Romanian quarterly
GDP, Economic Computation & Economic Cybernetics Studies & Research, 50(3).
10. Mirică, A., Toma, I. E., and Begu, L. S., 2017, Seasonal Adjustment–Consensus
between Direct and Indirect Method. Case Study: Seasonal Adjustment of
Romanian National Accounts Using Jdemetra+ 2.1. In 30th International Business
Information Management Association Conference (pp. 526-541).
11. Motulsky, H., and Christopoulos, A., 2004, Fitting models to biological data using linear
and nonlinear regression: a practical guide to curve fitting. Oxford University Press.
12. Opara, L.U., 2003, Traceability in agriculture and food supply chain: A review of
basic concepts, technological implications, and future prospects, WFL Publisher,
Science and Technology, Food, Agriculture & Environment, 1(1), 101-106.
13. PWC, 2017, Potenţialul dezvoltării sectorului agricol din România (available only
in Romanian at https://www.juridice.ro/wp-content/uploads/2017/03/Raport_PwC-agricultura.pdf)
14. Toma, I. E., and Mirică, A., 2018, Using Statistical Data to Better Understand
Business Environment-Case Study on Export and Import Data at County Level.
Romanian Statistical Review, (2).
15. UNECE, 2012, Practical Guide to Seasonal Adjustment With Demetra+
http://www.unece.org/index.php?id=40568