Data Processing
Meaning
MIS is an information system which provides the management of an organization with the information used for decision making.
A management information system (MIS) is a subset of the overall internal controls of a
business, covering the application of people, documents, technologies, and procedures by management accountants to solve business problems such as costing a product, service or a
business-wide strategy. Management information systems are distinct from regular information
systems in that they are used to analyze other information systems applied in operational
activities in the organization. Academically, the term is commonly used to refer to the group of
information management methods tied to the automation or support of human decision making,
e.g. Decision Support Systems, Expert systems, and Executive information systems.
Before the industrial revolution, most data processing was done manually; it was only after the industrial revolution that computers slowly started replacing manual labour. The modern digital computer was basically designed to handle scientific calculations. During the period 1940 to 1960, computers were used commercially for census and payroll work, which involved large amounts of data and its processing. Since then, commercial applications have exceeded the scientific applications for which the computer was mainly intended.
The Basic characteristics of an effective Management Information System are as follows:
I. Management-oriented: The basic objective of MIS is to provide information support to the management of the organization for decision making. An effective MIS should therefore start its journey from an appraisal of management needs and the mission and goals of the business organization; these may be individual or collective goals of the organization. The MIS should serve all levels of management in the organization, i.e. top, middle and lower.
II. Management-directed: Because MIS is management-oriented, it should be directed by the management, since it is the management who can state their needs and requirements more effectively than anybody else. Managers should guide the MIS professionals not only at the planning stage but also through the development, review and implementation stages, so that an effective system is the end product of the whole exercise.
III. Integrated: This means a comprehensive or complete view of all the subsystems in the organization. Development of information must be integrated so that all the operational and functional information subsystems work together as a single entity. This integration is necessary because it leads to the retrieval of more meaningful and useful information.
IV. Common data flows: The integration of different subsystems leads to a common data flow, which in turn helps avoid duplication and redundancy in data collection, storage and processing. For example, customer orders are the basis for many activities in an organization, viz. billing, sales forecasting, etc. Data is collected by a systems analyst from its original source only once; the analyst then processes the data with the minimum number of processing procedures, uses the information to produce output documents and reports in small numbers, and eliminates undesirable data. This eliminates duplication, simplifies operations and produces an efficient information system.
Data processing is, broadly, "the collection and manipulation of items of data to produce meaningful information." In this sense it can be considered a subset of information processing, "the change (processing) of information in any manner detectable by an observer."
The term is often used more specifically in the context of a business or other organization to refer to the class of commercial data processing applications.
Data-Processing System
a system of interrelated techniques and means of collecting and processing the data needed to organize control of some system. Automatic, or electronic, data-processing systems make use of electronic computers or other modern information-processing equipment. Without a computer, a data-processing system can be constructed for only a small controlled system. The use of a computer means that the data-processing system can perform not just individual information-processing and computing tasks but a set of tasks that are interconnected and can be carried out in a single sequence of operations.
Data-processing systems should be distinguished from automated control systems. The primary function of the latter is the performance of calculations associated, for example, with the solution of problems of control and with the selection of optimal variants of plans on the basis of models and the techniques of mathematical economics. The chief purpose of automated control systems is to increase the efficiency of control or management. The functions of data-processing systems, however, are to collect, store, retrieve, and process the data needed to carry out the calculations at the lowest possible cost. When an automatic data-processing system is constructed, efforts are made to identify and automate laborious, regularly repeating routine operations on large files of data. A data-processing system is usually a part of an automated control system and represents the first stage in the development of an automated control system. Data-processing systems, however, also function as independent systems. In many cases it is more efficient to use a single system to process similar data for a large number of control problems handled by different automated control systems; that is, to use a shared-access data-processing system.
The first data-processing systems were constructed in the USA in the 1950s, when it became clear that the use of a computer to solve individual problems, such as calculating wages or keeping track of stocks of goods and materials, was inefficient. It was seen at that time that integrated processing of the data fed to the computer was necessary.
The USSR has a number of large data-processing systems, most of which are the bases of automated control systems. Examples are the systems that have been set up at such large industrial enterprises as Frezer, Kalibr, the Likhachev Automotive Plant, the L’vov Television Plant, and the 15th Anniversary of the Ukrainian Lenin Komsomol Donetsk Plant. Data-processing systems are coming into use not only in industrial enterprises but also in planning bodies, statistical agencies, ministries, and banking institutions. They are finding application in trade and in the supply of materials and equipment. The introduction of a data-processing system is a prerequisite for the development of an automated control system.
The experience that has been gained with data-processing systems permits identification of the basic principles of the construction and techniques of development of such systems. The most important principle is the principle of integration, which requires that the raw data undergoing processing be fed to the data-processing system once. The problems being solved in the data-processing system are coordinated in such a way that the raw data and the data resulting from the solution of some problems are used as the initial data for as many of the other problems as possible. This coordination eliminates duplication of the operations of collection, preparation, and checking of data and ensures the integrated use of the data. As a result, the costs of obtaining the necessary information are reduced, and the efficiency of the data-processing system is increased.
Closely related to the principle of integration is the principle of centralization of data processing. When a data-processing system is constructed, many information-processing tasks are removed from the control of the respective subdivisions and are concentrated at a single computing center or at a small number of such centers. Large data files are established at these centers; the files are available for integrated processing. Special information retrieval systems, called automatic data banks, are set up in the data-processing system to manage and make optimal use of the files. The automatic data bank receives data that are subject to repeated use, and, in conformity with the operating schedule of the data-processing system, the data are used to form the work files for the problems being solved. The data bank also supplies information in response to inquiries. The centralization of data processing in constructing a data-processing system usually assumes a reorganization of the structure of control.
The principle of the systems approach to the organization of the sequence of operations consists in the following: When the data-processing system is constructed, there must be integrated mechanization and automation of the operations at all stages of data collection and processing, and the hardware used must be self-consistent with respect to throughput and other parameters. If this is not done, the unity of the sequence of operations is disrupted, and the efficiency of the data-processing system drops sharply.
Before a data-processing system is constructed, the following are subjected to thorough investigation and analysis: the controlled system, the control problems, the structure of control, the content of the information, and the information flows. On the basis of the analysis of the results of the investigation, an information model of the data-processing system is developed that establishes new information flows and the relation between the data-processing tasks. The hardware is chosen, and the sequence of operations of the data-processing system is worked out, according to the volumes of data being processed, stored, and transmitted as determined from the information model of the data-processing system. The successful construction of a data-processing system requires the participation not only of specialists but also of managers and other personnel who are directly involved in the solution of control problems at all stages of the development and introduction of the system.
Manual data processing
Although widespread use of the term data processing dates only from the nineteen-fifties, data processing functions have been performed manually for millennia. For example, bookkeeping involves functions such as posting transactions and producing reports like the balance sheet and the cash flow statement. Completely manual methods were augmented by the application of mechanical or electronic calculators. A person whose job it was to perform calculations manually or using a calculator was called a "computer."
The 1850 United States Census schedule was the first to gather data by individual rather
than household. A number of questions could be answered by making a check in the appropriate
box on the form. From 1850 through 1880 the Census Bureau employed "a system of tallying,
which, by reason of the increasing number of combinations of classifications required, became
increasingly complex. Only a limited number of combinations could be recorded in one tally, so
it was necessary to handle the schedules 5 or 6 times, for as many independent tallies. It took
over 7 years to publish the results of the 1880 census" using manual processing methods.
Automatic data processing
The term automatic data processing was applied to operations performed by means of unit record equipment, such as Herman Hollerith's application of punched card equipment for the 1890 United States Census. "Using Hollerith's punchcard equipment, the Census Office was able to complete tabulating most of the 1890 census data in 2 to 3 years, compared with 7 to 8 years for the 1880 census. ... It is also estimated that using Herman Hollerith's system saved some $5 million in processing costs"[5] (in 1890 dollars), even with twice as many questions as in 1880.
Electronic data processing
Computerized data processing, or electronic data processing, represents the further evolution, with the computer taking the place of several independent pieces of equipment. The Census Bureau first made limited use of electronic computers for the 1950 United States Census, using a UNIVAC I system delivered in 1952.
Further evolution
"Data processing (DP)" has also previously been used to refer to the department within an
organization responsible for the operation of data processing applications. The term data
processing has mostly been subsumed under the newer and somewhat more general
term information technology (IT). "Data processing" has acquired a negative connotation,
suggesting use of older technologies. As an example, in 1996 the Data Processing Management Association (DPMA) changed its name to the Association of Information Technology Professionals. Nevertheless, the terms are roughly synonymous.
Processing of data: editing, coding, classification and tabulation
Editing
What is editing?
Editing is the process of correcting faulty data, in order to allow the production of reliable
statistics.
Data editing does not exist in isolation from the rest of the collection processing cycle and the
nature and extent of any editing and error treatment will be determined by the aims of the
collection. In many cases it will not be necessary to pay attention to every error.
Errors in the data may have come from respondents or have been introduced during data entry
or data processing. Editing aims to correct a number of non-sampling errors, which are those
errors that may occur in both censuses and sample surveys; for example, non-sampling errors
include those errors introduced by misunderstanding questions or instructions, interviewer bias,
miscoding, non-availability of data, incorrect transcription, non-response and non-contact. But
editing will not reveal all non-sampling errors - for example, while an editing system could be
designed to detect transcription errors, missing values and inconsistent responses, other
problems such as interviewer bias may easily escape detection.
Editing should aim:
to ensure that outputs from the collection are mutually consistent, for example, a component
should not exceed an aggregate value; two different methods of deriving the same value should
give the same answer;
to detect major errors, which could have a significant effect on the outputs;
to find any unusual outputs and their causes.
The required level of editing
The function of editing is to help achieve the aims of a collection so, before edits are created or
modified, it is important to know these aims - since these have a major say in the nature of the
editing system created for the given collection. We need to know about features such as:
the outputs from the collection
the level at which outputs are required
their required accuracy
how soon after the reference period the outputs are needed
the users and uses of the collection. A collection may be simple (with limited data collected) and
designed to meet the requirements of only one type of user (e.g., Survey of New Motor Vehicle Registration and Retail Trade) or it may collect more complex data and aim to meet the
needs of many different types of users (e.g. Agricultural Finance Survey, Household Expenditure
Survey, etc.). If there are many types of users there is a likelihood of conflicting requirements
amongst the users, which can lead to stresses on the collection.
the reliability of each item (e.g., is the definition easily understood, or is the item sensitive?)
While the goal of editing is to produce data that represent as closely as possible the activity
being measured, there are usually a number of constraints (such as the time and number of
people available for data editing) within which editing is conducted. These constraints will also
influence the design of the editing system for the given collection.
The structure of an edit
An edit is defined by specifying:
the test to be applied,
the domain, which is a description of the set of data that the test should be applied to, and
the follow-up action if the test is failed.
The test
This is a statement of something that is expected to be true for good data. A test typically
consists of data items connected by arithmetic or comparison operators. Ideas for suitable tests
may arise from people with a knowledge of the subject matter, the aims of the collection or
relationships that should hold between items.
Examples
this item should not be missing
the sum of these items equals that item
The Domain
The domain is defined by specifying the conditions which the data must satisfy before the test
can be applied.
Example
A test may only be relevant to those businesses in a certain industry and the domain will
therefore consist of all records belonging to that industry.
The Follow-up
The edit designer must also consider the appropriate follow-up action if a test is failed. Some edit failures are minor and simply require human attention, but no amendment; others are major failures that require both human attention and an amendment. The treatment given to an edit failure is commonly determined by classifying edits into grades of severity, such as fatal, query and warning.
Example
Where a record lacks critical information essential for further processing, a fatal error should be displayed.
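The three parts of an edit (test, domain and follow-up severity) map naturally onto a small data structure. The sketch below is illustrative only, not taken from the source: the record fields ("industry", "income") and the way severity grades are attached are assumptions for the example.

```python
# A minimal sketch of an edit rule with a test, a domain and a severity
# grade (fatal / query / warning), as described above. Field names are
# hypothetical.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Edit:
    name: str
    domain: Callable[[dict], bool]   # which records the test applies to
    test: Callable[[dict], bool]     # statement expected to be True for good data
    severity: str                    # "fatal", "query" or "warning"

    def apply(self, record: dict) -> Optional[str]:
        # Return the severity grade if the record is in the domain and fails the test.
        if self.domain(record) and not self.test(record):
            return self.severity
        return None

# Example: for retail-industry records, income must not be missing (a fatal error).
income_reported = Edit(
    name="income reported",
    domain=lambda r: r.get("industry") == "retail",
    test=lambda r: r.get("income") is not None,
    severity="fatal",
)

print(income_reported.apply({"industry": "retail", "income": None}))  # -> fatal
print(income_reported.apply({"industry": "mining", "income": None}))  # -> None (outside domain)
```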
It is important to note that even if we go through comprehensive editing processes, errors may still occur, as editing can identify only noticeable errors. Information wrongly given by respondents or wrongly transcribed by interviewers can only be corrected when there are clues that point to the error and provide the solution. Thus, the final computer file will not be error-free, but it should at least be internally consistent.
Generally different levels of editing are carried out at several stages during data processing.
Some of the stages involved are provided below.
Clerical Coding
This stage includes mark-in of the forms as they are returned, all manual coding (e.g., country and industry coding) and manual data conversion (e.g., miles to kilometres).
Clerical Editing
This stage includes all editing done manually by clerks before the unit data are loaded into a
computer file.
Input Editing
Input editing deals with each respondent independently and includes all "within record" edits
which are applied to unit records. It is carried out before any aggregates for the production of
estimates are done. An important consideration in input editing is the setting of tolerances for responses: setting low tolerances will generate large numbers of edit failures and will directly impact resources and the meeting of timetables.
Ideally, an input edit system has been designed after carefully considering and setting edit
tolerances, clerical scrutiny levels, resource costs (against benefits), respondent load and timing
implications.
Output Editing
Output editing includes all edits applied to the data once it has been weighted and aggregated in
preparation for publication. If a unit contributes a large amount to a cell total, then the response
for that unit should be checked with a follow-up.
Output editing is not restricted to examination of aggregates within the confines of a survey. A
good output edit system will incorporate comparisons against other relevant statistical indicators.
Types of Edits Commonly Used
Validation Edit
Checks the validity or legality of basic identification or classificatory items in unit data.
Examples
the respondent's reference number is of a legal form
state code is within the legal range
sex is coded as either M or F
Missing Data Edit
Checks that data that should have been reported were in fact reported. An answer to one question
may determine which other questions are to be answered and the editing system needs to ensure
that the right sequence of questions has been answered.
Examples
in an employment survey a respondent should report a value for employment
a respondent who has replied NO to the question: Do you have any children? should not have
answered any of the questions about the ages, sexes and education of any children
Logical Edit
Ensures that two or more categorical items in a record do not have contradictory values.
Example
a respondent claiming to be 16 years old and receiving the age pension would clearly fail an edit
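To make the three edit types above concrete, here is a minimal sketch of within-record checks built from the examples given. All field names, the legal state-code range and the pension age threshold are assumptions for illustration, not from the source.

```python
# Illustrative validation, missing-data and logical edits. All field names
# and legal ranges are assumptions.

def validation_edit(record):
    """Validation edit: sex must be M or F; state code within an assumed legal range."""
    errors = []
    if record.get("sex") not in ("M", "F"):
        errors.append("invalid sex code")
    if not 1 <= record.get("state_code", 0) <= 8:   # assumed range of legal state codes
        errors.append("state code out of legal range")
    return errors

def missing_data_edit(record):
    """Missing-data edit: child questions must match the has-children answer."""
    errors = []
    if record.get("has_children") == "NO" and record.get("child_ages"):
        errors.append("child questions answered despite NO")
    if record.get("has_children") == "YES" and not record.get("child_ages"):
        errors.append("child questions unanswered despite YES")
    return errors

def logical_edit(record):
    """Logical edit: a 16-year-old should not receive the age pension (threshold assumed)."""
    errors = []
    if record.get("receives_age_pension") and record.get("age", 0) < 65:
        errors.append("age inconsistent with age pension")
    return errors

record = {"sex": "F", "state_code": 3, "has_children": "NO",
          "child_ages": [4], "age": 16, "receives_age_pension": True}
for check in (validation_edit, missing_data_edit, logical_edit):
    print(check.__name__, check(record))
```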
Consistency (or reconciliation) edits
Checks that precise arithmetical relationships hold between continuous numeric variables that
are subject to such relationships. Consistency edits could involve the checking of totals or
products.
Examples
totals: a reported total should equal the sum of the reported components
totals of different breakdowns of the same item should be equal (e.g., the Australian estimate should be the same whether obtained by summing state estimates or industry estimates)
products: if one item is a known percentage of another, then this can be checked (e.g., land tax paid should equal the product of the taxable value and the land tax rate)
income from the sales of a commodity should equal the product of the unit price and the amount
sold
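As a sketch, the two consistency checks described above (totals and products) can be written directly; the sample values and the small rounding tolerance in the product check are illustrative assumptions.

```python
# Consistency (reconciliation) edits: a reported total must equal the sum of
# its components, and a product relationship must hold (a small tolerance is
# allowed here for rounding in monetary values). Values are illustrative.

def total_edit(components, reported_total):
    return sum(components) == reported_total

def product_edit(taxable_value, land_tax_rate, land_tax_paid, tol=0.005):
    return abs(taxable_value * land_tax_rate - land_tax_paid) <= tol

print(total_edit([120, 80, 50], 250))          # True: components sum to the total
print(product_edit(200_000, 0.015, 3_000.0))   # True: tax paid = value x rate
```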
Range Edit
Checks that approximate relationships hold between numeric variables that are subject to such
relationships. A range edit can be thought of as a loosening of a consistency edit, and its definition will include a description of the range of acceptable values (or tolerance range).
Examples
If a company's value for number of employees changes by more than a certain predefined amount (or proportion), then the unit will fail. Note that both the absolute change and the proportional change should be considered, since a change from 500 to 580 may be less significant than a change from 6 to 10. So, if the edit were defined to accept the record when the current value is within 20% of the previous value, a change from 500 to 580 would be accepted while the change from 6 to 10 would be queried.
In a survey which collects total amount loaned for houses and total number of housing loans
from each lending institution it would probably be sensible to check that the derived item
average housing loan is within an acceptable range.
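The 500-to-580 versus 6-to-10 example can be expressed as a small proportional-tolerance check. The sketch below assumes the 20% tolerance used in the text; in practice an absolute tolerance would be combined with it, since small bases inflate proportional changes.

```python
# A range edit with a proportional tolerance, mirroring the example above.

def range_edit(previous, current, max_prop=0.20):
    """Accept the record if the current value is within 20% of the previous value."""
    if previous == 0:
        return current == 0
    return abs(current - previous) / abs(previous) <= max_prop

print(range_edit(500, 580))  # True:  a 16% change, accepted
print(range_edit(6, 10))     # False: a 67% change, queried
```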
After collecting data comes the process of converting raw data into meaningful statements; this includes data processing, data analysis, and data interpretation and presentation.
Data reduction or processing mainly involves various manipulations necessary for preparing the
data for analysis. The process (of manipulation) could be manual or electronic. It involves
editing, categorizing the open-ended questions, coding, computerization and preparation of
tables and diagrams.
Editing data:
Information gathered during data collection may lack uniformity. For example, data collected through questionnaires and schedules may have answers that are not ticked in the proper places, or some questions may be left unanswered. Sometimes information may be given in a form which needs reconstruction into a category designed for analysis, e.g., converting daily/monthly income into annual income, and so on. The researcher has to decide how to edit it.
Editing also ensures that data are relevant and appropriate and that errors are corrected. Occasionally, the investigator makes a mistake and records an impossible answer. "How much red chili do you use in a month?" The answer is written as "4 kilos". Can a family of three members use four kilos of chili in a month? The correct answer could be "0.4 kilo".
Care should be taken in editing (re-arranging) answers to open-ended questions. Example:
Sometimes a "don't know" answer is edited as "no response". This is wrong. "Don't know" means that the respondent is not sure, is in two minds about his reaction, or considers the question personal and does not want to answer it. "No response" means that the respondent is not familiar with the situation/object/event/individual about which he is asked.
Coding of data:
Coding is translating answers into numerical values or assigning numbers to the various
categories of a variable to be used in data analysis. Coding is done by using a code book, code
sheet, and a computer card. Coding is done on the basis of the instructions given in the
codebook. The code book gives a numerical code for each variable.
Nowadays, codes are assigned before going to the field, while constructing the questionnaire/schedule. Post data collection, pre-coded items are fed to the computer for processing and analysis. For open-ended questions, however, post-coding is necessary: all answers to open-ended questions are placed in categories and each category is assigned
a code.
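A minimal sketch of post-coding with a codebook follows; the variable, its categories and the numerical codes are hypothetical.

```python
# Post-coding open-ended answers using a codebook that maps each category
# of a variable to a numerical code. The codebook contents are hypothetical.

CODEBOOK = {
    "occupation": {"farmer": 1, "teacher": 2, "trader": 3, "other": 9},
}

def code_answer(variable, answer):
    """Translate a verbatim answer into its numerical code (unlisted answers -> other)."""
    codes = CODEBOOK[variable]
    return codes.get(answer.strip().lower(), codes["other"])

print(code_answer("occupation", "Teacher"))    # 2
print(code_answer("occupation", "fisherman"))  # 9 (coded as other)
```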
Manual processing is employed when qualitative methods are used or when in quantitative
studies, a small sample is used, or when the questionnaire/schedule has a large number of open-
ended questions, or when accessibility to computers is difficult or inappropriate. However,
coding is done in manual processing also.
Data editing and coding
Many data processing activities that are typically completed when collecting data via paper
questionnaires were unnecessary in these studies because the questionnaires were computer
administered. For example, making sure that questions were asked in the correct sequence,
checking for out-of-range or inconsistent responses, and filling in the appropriate question text
based on a respondent's previous answers were all controlled by the interviewing application
software. Inconsistent responses that failed the programme's edit checks were brought to the
attention of the interviewer who could resolve the inconsistency with the respondent during the
interview, improving the quality of data and minimizing the need for back-end editing. Although
these software programs automatically performed many of the decisions formerly made by
interviewers using paper questionnaires, the data for each study did require some additional
editing and coding. Editing operations included processing each interview through a series of
programming routines that evaluated question responses and assigned codes to indicate the
presence or absence of each mental health disorder assessed by the study. A number of other
summary variables based on individual question items were also created in preparation for the
project's analysis phase. In addition, each study included several open-ended questions, which
were coded.
Data classification/distribution:
Sarantakos (1998: 343) defines distribution of data as a form of classification of scores obtained for the various categories of a particular variable. There are four types of distributions:
1. Frequency distribution
2. Percentage distribution
3. Cumulative distribution
4. Statistical distributions
Frequency distribution:
In social science research, frequency distribution is very common. It presents the frequency of
occurrences of certain categories. This distribution appears in two forms:
Ungrouped: Here, the scores are not collapsed into categories, e.g., distribution of ages of the
students of a BJ (MC) class, each age value (e.g., 18, 19, 20, and so on) will be presented
separately in the distribution.
Grouped: Here, the scores are collapsed into categories, so that 2 or 3 scores are presented together as a group. For example, in the above age distribution, groups like 18-20, 21-22, etc., can be formed.
Percentage distribution:
It is also possible to give frequencies not in absolute numbers but in percentages. For instance, instead of saying that 200 respondents out of a total of 2,000 had a monthly income of less than Rs. 500, we can say that 10% of the respondents had a monthly income of less than Rs. 500.
Cumulative distribution:
It tells how often the value of the random variable is less than or equal to a particular reference
value.
Statistical data distribution:
In this type of data distribution, some measure of average is found for a sample of respondents. Several kinds of averages are available (mean, median, mode) and the researcher must decide which is most suitable for his purpose. Once the average has been calculated, the question arises: how representative a figure is it, i.e., how closely are the answers bunched around it? Are most of them very close to it, or is there a wide range of variation?
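The four distributions can be computed directly from a list of scores. The sketch below uses a small, made-up sample of ages in the spirit of the ungrouped example above.

```python
# Frequency, percentage, cumulative and a simple statistical distribution
# (the mean) for an illustrative sample of ages.

from collections import Counter

ages = [18, 19, 19, 20, 18, 21, 19, 20, 22, 19]   # made-up data
n = len(ages)

freq = dict(sorted(Counter(ages).items()))        # frequency distribution
pct = {k: 100 * v / n for k, v in freq.items()}   # percentage distribution

cum, running = {}, 0                              # cumulative distribution
for k, v in freq.items():
    running += v
    cum[k] = running

mean = sum(ages) / n                              # one measure of average

print(freq)   # {18: 2, 19: 4, 20: 2, 21: 1, 22: 1}
print(pct)    # {18: 20.0, 19: 40.0, 20: 20.0, 21: 10.0, 22: 10.0}
print(cum)    # {18: 2, 19: 6, 20: 8, 21: 9, 22: 10}
print(mean)   # 19.5
```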
Tabulation of data:
After editing, which ensures that the information on the schedule is accurate and categorized in a
suitable form, the data are put together in some kinds of tables and may also undergo some other
forms of statistical analysis.
Tables can be prepared manually and/or by computer. For a small study of 100 to 200 persons, there may be little point in tabulating by computer, since this necessitates putting the data on punched cards. But for a survey analysis involving a large number of respondents and requiring cross tabulation involving more than two variables, hand tabulation would be inappropriate and time consuming.
Usefulness of tables:
Tables are useful to the researchers and the readers in three ways:
1. They present an overall view of findings in a simpler way.
2. They identify trends.
3. They display relationships in a comparable way between parts of the findings.
By convention, the dependent variable is presented in the rows and the independent variable in
the columns.
Data Processing/Tabulation
Data Processing/Tabulation - Services offered by a company that has over the years earned a reputation as the industry's most experienced service provider, offering accurate data processing services to market research companies. These services are solicited by leading agencies and independent market researchers alike who require tabulation using Quantum, SPSS, etc.
Data processing
Experts in data processing/tabulation services! We offer high-quality services with minimum hassle and within the client's budget. We provide data processing/tabulation services for various market research clients and management consultancies across the globe. Our most esteemed clients are from Europe, North America, the Middle East, South East Asia and Japan.
Our competitive research designs, analytics and presentations help our clients focus on business development, client servicing and project management, enabling them to be more focused and business oriented. We are highly experienced in helping clients with basic tabulations as well as complex ones like bivariate and multivariate analysis.
Data tabulation refers to generating tables from data (collected in a survey, for example) after it has been validated and analyzed. This helps researchers and businesses to interpret survey results, or any other data.
Capital Typing has extensive experience handling hundreds of data tabulation projects for clients across the globe every year. Our data analysts have expertise in data tabulation tools such as Quantum, SPSS, SPSS Dimensions, Wincross and others. These applications support all tabulation requirements, from cross tabulation and generating pivot tables to complex weighting. We can provide all your tabulation services, from simple projects to large-scale multi-country, multi-wave data collected in different file formats.
By outsourcing your data tabulation needs to Capital Typing, you can free valuable resources from a rather time consuming process which requires uncompromising attention to detail.
Our data processing professionals will create tabulations for analysis from physical documents such as mail or onsite surveys, or from respondent data collected over the phone or online.
Depending on client requirements, our Cross-Tabulation Reports may include:
Table of Contents
Banner tables (including labeling, flexible basing and statistics)
Volumetric and Sigma bases
Up to 20 columns per banner
Descriptive statistics (mean, median, standard error, standard deviation, mode)
Minimum and maximum values
Data weighting
Vertical and/or horizontal percentages
Rankings
Selection of statistical testing (t-test/chi-square, ANOVA …)
Customized formatting options for final output (Word, Excel or PowerPoint)
Reports are delivered as hard copy or are e-mailed immediately as files (Word, PDF, Excel, etc.)
Some of the benefits of entrusting data tabulation requirements to Capital Typing are:
All data tabulation is cross-referenced and quality checked (quality control is an integral part of our processes and procedures).
We are flexible enough to work with your schedule and meet your delivery time needs. No matter how many changes and/or corrections to the tables or the specifications, our
rates remain unchanged.
Data Processing and Tabulation
Constructing a frequency distribution table:
The procedure we use to describe a set of data in a frequency table is as follows:
1. Decide on the number of classes.
2. Determine the class interval or width.
3. Set the individual class limits.
4. Tally the classes.
5. Count the number of items in each class.
EXAMPLE:
The following data represent the heights in centimeters of 50 patients in an outpatient clinic.
145 95 148 112 132
140 162 118 170 144
145 127 148 165 138
173 113 104 141 142
116 178 123 141 138
127 143 134 136 137
155 93 102 154 142
134 165 123 124 124
138 160 157 138 131
114 135 151 138 157
1. What are the lowest and highest heights?
Lowest height = 93 = L; highest height = 178 = H
2. How many classes are you going to use?
Since n = 50, we use the 2^k rule: choose k such that 2^k ≥ n.
If k = 5, then 2^5 = 32 < 50; if k = 6, then 2^6 = 64 ≥ 50.
Choose k = 6 as the number of classes.
3. What is the width of each class?
Class interval = width = (H – L) / k = (178 – 93) / 6 = 14.2,
rounded up to a width of 15.
4. Tally the heights into the classes:
Class Tally Frequency
____________________________________________
91 – 105 //// 4
106 – 120 ///// 5
121 – 135 ///// ///// / 11
136 – 150 ///// ///// ///// /// 18
151 – 165 ///// //// 9
166 – 180 /// 3
(each ///// represents a tally group of five)
5. Prepare tabular summaries of the height data (frequency table):
Class interval Frequency R. Frequency P. Frequency C. Frequency
_________________________________________________________________
91 – 105 4 0.08 8 4
106 – 120 5 0.10 10 9
121 – 135 11 0.22 22 20
136 – 150 18 0.36 36 38
151 – 165 9 0.18 18 47
166 – 180 3 0.06 6 50
_________________________________________________________________
Total 50 1.00 100
R = Relative, P = Percentage, C = Cumulative
6. What percentage of the heights are more than 150?
24% (12 of the 50 heights fall in the classes above 150)
7. What proportion of the heights are 135 or less?
20/50 = 40%
8. Determine class boundaries for your classes.
Class interval Class boundaries
_________________________________
91 – 105 90.5 – 105.5
106 – 120 105.5 – 120.5
121 – 135 120.5 – 135.5
136 – 150 135.5 – 150.5
151 – 165 150.5 – 165.5
166 – 180 165.5 – 180.5
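The whole worked example can be checked in a few lines of code. This is a sketch, not part of the original exercise; starting the first class at 91 (two below the minimum of 93) follows the choice made in the text.

```python
# Reproduce the height example: choose k with 2**k >= n, compute the class
# width, and tally the 50 heights into the six classes used above.

import math

heights = [145, 95, 148, 112, 132, 140, 162, 118, 170, 144,
           145, 127, 148, 165, 138, 173, 113, 104, 141, 142,
           116, 178, 123, 141, 138, 127, 143, 134, 136, 137,
           155, 93, 102, 154, 142, 134, 165, 123, 124, 124,
           138, 160, 157, 138, 131, 114, 135, 151, 138, 157]

n = len(heights)
k = math.ceil(math.log2(n))                           # smallest k with 2**k >= n -> 6
width = math.ceil((max(heights) - min(heights)) / k)  # (178 - 93) / 6 = 14.2 -> 15

low = min(heights) - 2   # start the first class at 91, as the text does
classes = [(low + i * width, low + i * width + width - 1) for i in range(k)]
freq = [sum(lo <= h <= hi for h in heights) for lo, hi in classes]

for (lo, hi), f in zip(classes, freq):
    print(f"{lo} - {hi}: {f}")
# 91 - 105: 4, 106 - 120: 5, 121 - 135: 11,
# 136 - 150: 18, 151 - 165: 9, 166 - 180: 3
```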
EXPERIMENT:
The following data are the hours of personal computer usage during one week for a sample of
30 persons.
4 1 10 5 3 5 1 6 3 3
3 4 2 14 5 4 3 4 11 3
4 4 8 5 4 3 7 10 6 7
Construct a frequency table showing:
1. What are the highest and lowest weekly usages of personal computers in this sample? What is the range?
2. How many classes are you going to choose? Why?
3. What is the width of each class? Why?
4. Tally the usage hours of personal computers into the classes.
5. Construct the frequency distribution table, showing the relative frequency, percentage frequency, cumulative frequency, and class boundaries.
6. What proportion of weekly personal computer usage is 3.5 hours or more?
7. What percentage of weekly personal computer usage is less than 2 hours?
8. What did you understand from the word frequency (define)?
How to start the process of data classification
TITUS was the first entrant into the data classification industry and continues to be the market and thought leader, with a wide range of solutions ranging from data classification in Microsoft Outlook to data classification for iOS and Android mobile devices.
In the field of data management, data classification as a part of the Information Lifecycle Management (ILM) process can be defined as a tool for the categorization of data, to enable/help an organization to effectively answer the following questions:
What data types are available?
Where are certain data located?
What access levels are implemented?
What protection level is implemented and does it adhere to compliance regulations?
When implemented, it provides a bridge between IT professionals and process or application owners. IT staff are informed about the value of the data and, on the other hand, management (usually the application owners) understands better which segments of the data centre need investment to keep operations running effectively. This can be of particular importance in risk management, legal discovery, and compliance with government regulations. Data classification is typically a manual process; however, there are many tools from different vendors that can help gather information about the data.
Note that this classification structure is written from a data management perspective and therefore focuses on text and text-convertible binary data sources. Images, videos, and audio files are highly structured formats built for industry-standard APIs and do not readily fit within the classification scheme outlined below.
The first step is to evaluate and divide the various applications and data into their respective categories, as follows:
Relational or tabular data (around 15% of non-audio/video data)
Generally describes proprietary data which is accessible only through an application or application programming interfaces (APIs).
Applications that produce structured data are usually database applications.
This type of data usually brings complex procedures of data evaluation and migration between storage tiers.
To ensure adequate quality standards, the classification process has to be monitored by subject matter experts.
Semi-structured or poly-structured data (all other non-audio/video data that does not conform to a system- or platform-defined relational or tabular form)
Generally describes data files that have a dynamic or non-relational semantic structure (e.g. documents, XML, JSON, device or system log output, sensor output).
Data classification is a relatively simple process of criteria assignment.
Data migration between assigned segments of predefined storage tiers is a simple process.
Types of data classification: note that this designation is entirely orthogonal to the application-centric designation outlined above. Regardless of the structure inherited from the application, data may be of the types below:
1. Geographical: i.e. according to area (e.g., the rice production of a state or country)
2. Chronological: i.e. according to time (e.g., sales of the last 3 months)
3. Qualitative: i.e. according to distinct categories (e.g., population on the basis of poor and rich)
4. Quantitative: i.e. according to magnitude: (a) discrete and (b) continuous
Basic criteria for semi-structured or poly-structured data classification
Time criteria are the simplest and most commonly used, where different types of data are evaluated by time of creation, time of access, time of update, etc.
Metadata criteria such as type, name, owner, location and so on can be used to create a more advanced classification policy.
Content criteria, which involve the use of advanced content classification algorithms, are the most advanced form of unstructured data classification.
Note that any of these criteria may also apply to tabular or relational data as "basic criteria". These criteria are application specific, rather than inherent aspects of the form in which the data is presented.
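As one concrete illustration of a time criterion, the sketch below buckets files into storage tiers by age of last access; the tier names and day thresholds are assumptions, not from the source.

```python
# Time-criteria classification for semi-structured data: assign a storage
# tier from the file's last-access time. Thresholds are illustrative.

import os
import time

def classify_by_age(path, hot_days=30, warm_days=365):
    """Return a storage-tier label based on how recently the file was accessed."""
    age_days = (time.time() - os.stat(path).st_atime) / 86400
    if age_days <= hot_days:
        return "hot"    # keep on fast storage
    if age_days <= warm_days:
        return "warm"   # candidate for a cheaper tier
    return "cold"       # candidate for archive

# Usage (hypothetical path):
# print(classify_by_age("/data/logs/app.log"))
```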
Basic criteria for relational or tabular data classification
These criteria are usually initiated by application requirements such as:
Disaster recovery and Business Continuity rules
Data centre resources optimization and consolidation
Hardware performance limitations and possible improvements by reorganization
Note that any of these criteria may also apply to semi/poly structured data as "Basic Criteria".
These criteria are application specific, rather than inherent aspects of the form in which the data
is presented.
Benefits of data classification
Effective implementation of appropriate data classification can significantly improve the ILM process and save data centre storage resources. If implemented systemically, it can generate improvements in data centre performance and utilization. Data classification can also reduce costs and administration overhead. "Good enough" data classification can produce these results:
Data compliance and easier risk management: data are located where expected, on a predefined storage tier, at a given point in time.
Simplification of data encryption, because not all data need to be encrypted; this saves valuable processor cycles and the associated overhead.
Data indexing to improve user access times.
Data protection is redefined so that the RTO (Recovery Time Objective) is improved.
Research streamlines data processing to solve problems more efficiently
Researchers at North Carolina State University have developed a new analytical method that
opens the door to faster processing of large amounts of information, with applications in fields as
diverse as the military, medical diagnostics and homeland security.
"The problem we address here is this: When faced with a large amount of data, how do you
determine which pieces of that information are relevant for solving a specific problem," says Dr.
Joel Trussell, a professor of electrical and computer engineering at NC State and co-author of a
paper describing the research. "For example, how would you select the smallest number of
features that would allow a robot to differentiate between water and solid ground, based on
visual data collected by video?"
This is important, because the more data you need to solve a problem, the more expensive it is to
collect the data and the longer it will take to process the data. "The work we've done here allows
for a more efficient collection of data by targeting exactly what information is most important to
the decision-making process," Trussell says. "Basically, we've created a new algorithm that can
be used to determine how much data is needed to make a decision with a minimal rate of error."
One application for the new algorithm, discussed in the paper, is for the development of
programs that can analyze hyperspectral data from military cameras in order to identify potential
targets. Hyperspectral technology allows for finer resolution of the wavelengths of light that are
visible to the human eye, though it can also collect information from the infrared spectrum,
which can be used to identify specific materials, among other things. The algorithm could be
used to ensure that such a program would operate efficiently, minimizing data storage needs and
allowing the data to be processed more quickly.
But Trussell notes that "there are plenty of problems out there where people are faced with a vast
amount of data, visual or otherwise, such as medical situations, where doctors may have the
results from multiple imaging tests. For example, the algorithm would allow the development of
a more efficient screening process for evaluating medical images -- such as mammograms --
from a large group of people."
Another potential application would be for biometrics, such as homeland security efforts to
identify terrorists and others on the Department of Homeland Security watchlist based on video
and camera images.
The research, "Constrained Dimensionality Reduction Using A Mixed-Norm Penalty Function
With Neural Networks," was funded by the U.S. Army Research Office and co-authored by
Trussell and former NC State Ph.D. student Huiwen Zeng. The work is published in the March
issue of IEEE Transactions on Knowledge and Data Engineering.
NC State's Department of Electrical and Computer Engineering is part of the university's College
of Engineering.