datameet 4: data cleaning & census data

UNDERSTANDING INDIA … CENSUS 2011 Bhavin Dalal Datameet 4 DATA CLEANING & PROFILING

Upload: pykih-software-llp

Post on 16-Jul-2015


UNDERSTANDING INDIA … CENSUS 2011

Bhavin Dalal

Datameet 4

DATA CLEANING & PROFILING

What is Data Quality?

Data quality is a perception or an assessment of data's fitness to serve its purpose in a given context. Aspects of data quality include:

• Accuracy – How accurate is the data?
• Completeness – Is all the data present?
• Update status – How old is the data?
• Relevance – Is the data relevant to the purpose?
• Consistency – Is the data consistent across different sources?
• Reliability – How much can we rely on the data?
• Appropriate presentation – Is the data presented in a way that makes it usable?
• Accessibility – Is the data accessible to all those who require it?

Data Quality Problems

• Referential integrity
• Use of NULL
• Value checking for reasonableness (e.g. date values)
• Values constrained to a pre-defined domain (e.g. salutation)
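The problem types listed above can be turned into simple record-level checks. The following is a minimal sketch only: the field names (`customer_id`, `salutation`, `date_of_birth`), the reference set of IDs and the allowed salutations are all hypothetical and not taken from the slides.

```python
from datetime import date

# Hypothetical reference data, for illustration only.
KNOWN_CUSTOMER_IDS = {101, 102, 103}
ALLOWED_SALUTATIONS = {"Mr.", "Ms.", "Mrs.", "Dr."}


def find_quality_problems(record, today=None):
    """Return a list of data quality problems found in one record (a dict)."""
    today = today or date.today()
    problems = []
    # Referential integrity: the key must exist in the parent table.
    if record.get("customer_id") not in KNOWN_CUSTOMER_IDS:
        problems.append("unknown customer_id")
    # Use of NULL: required fields must not be missing or empty.
    for field in ("customer_id", "date_of_birth"):
        if record.get(field) in (None, ""):
            problems.append(f"{field} is NULL")
    # Reasonableness: a birth date cannot lie in the future or before 1900.
    dob = record.get("date_of_birth")
    if isinstance(dob, date) and not (date(1900, 1, 1) <= dob <= today):
        problems.append("date_of_birth out of range")
    # Domain constraint: salutation must come from a pre-defined list.
    if record.get("salutation") not in ALLOWED_SALUTATIONS:
        problems.append("salutation outside allowed domain")
    return problems
```

An empty list means the record passed all four checks; in practice each rule would be driven by the actual schema rather than hard-coded.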

Before doing data quality

• Profiling of data
• Conformity check
• Standardization (e.g. Gender -> M/F, Male/Female, Unknown or Null?)
• Duplicate values
• Survivorship: best quality set from different records
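As an illustration of the standardization step, the gender example could be handled with a small lookup. This is a sketch under assumptions: the mapping table and the choice of "Unknown" as the fallback value are hypothetical.

```python
# Hypothetical mapping of raw spellings to a standard code.
GENDER_MAP = {
    "m": "M",
    "male": "M",
    "f": "F",
    "female": "F",
}


def standardize_gender(value):
    """Map free-form gender values to 'M', 'F' or 'Unknown'."""
    if value is None:  # NULL in the source data
        return "Unknown"
    key = str(value).strip().lower()
    return GENDER_MAP.get(key, "Unknown")
```

Anything not in the table, including empty strings, falls back to "Unknown" so that unexpected spellings are visible for later review.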

Basic Data Cleaning Steps

• Removing spaces and nonprinting characters
• Fixing numbers and number signs
• Fixing dates and times
• Merging and splitting columns, e.g. names (First Name + Last Name / Full Name)
• Need for transformation
• Checking data quality through joining and matching
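A few of the basic steps above can be sketched with the Python standard library. The function names and the number formats handled (thousands separators, accounting-style parentheses, trailing minus) are assumptions chosen for illustration.

```python
import re
import unicodedata


def clean_text(value):
    """Drop nonprinting characters and collapse runs of whitespace."""
    value = value.replace("\u00a0", " ")  # non-breaking space -> plain space
    # Unicode categories starting with "C" are control/format characters.
    printable = "".join(
        " " if unicodedata.category(ch).startswith("C") else ch
        for ch in value
    )
    return re.sub(r"\s+", " ", printable).strip()


def parse_number(value):
    """Parse numbers such as '1,234.50', '(250)' or '300-' into a float."""
    text = clean_text(value).replace(",", "")
    negative = False
    if text.startswith("(") and text.endswith(")"):  # accounting style
        negative, text = True, text[1:-1]
    elif text.endswith("-"):  # trailing minus sign
        negative, text = True, text[:-1]
    number = float(text)
    return -number if negative else number


def split_full_name(full_name):
    """Split a full name into (first, last); deliberately naive."""
    parts = clean_text(full_name).split(" ")
    first = parts[0]
    last = parts[-1] if len(parts) > 1 else ""
    return first, last
```

Real names do not always follow a first/last pattern, so a split like this is only a starting point before manual review.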

Finding duplicate values

The following algorithms help find duplicates based on string similarity or phonetics:

• Hamming()
• Jaro-Winkler()
• Levenshtein()
• Damerau-Levenshtein() (advanced version of Levenshtein)
• Q-gram()
• Cosine()
• Soundex()

Hamming

Number of positions at which the two strings have different symbols. Only defined for strings of equal length.

distance('abcdd', 'abbcd') = 2
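A minimal sketch of the Hamming distance, using the conventional definition (count of positions where the symbols differ). The function name is illustrative.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))
```

For 'abcdd' and 'abbcd' the third and fourth positions differ, so the distance under this definition is 2.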

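Jaro-Winkler

This distance is computed from five quantities determined by the two compared strings (A, B, m, t, l) and a weight p chosen from [0, 0.25]. In the usual formulation, A and B are the lengths of the two strings, m is the number of matching characters, t is the number of transpositions and l is the length of the common prefix (at most four characters).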
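The following sketch shows one common formulation of the Jaro-Winkler distance. It assumes the usual reading of the symbols (string lengths, matches m, transpositions t, common prefix length l capped at 4, prefix weight p); the function names and the default p = 0.1 are choices made here, not taken from the slides.

```python
def jaro_similarity(a, b):
    """Jaro similarity in [0, 1]; 1 means identical strings."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only if they are equal and not too far apart.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    m = 0
    for i, ch in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    a_seq = [ch for ch, flag in zip(a, a_matched) if flag]
    b_seq = [ch for ch, flag in zip(b, b_matched) if flag]
    t = sum(x != y for x, y in zip(a_seq, b_seq)) / 2
    return (m / len(a) + m / len(b) + (m - t) / m) / 3


def jaro_winkler_distance(a, b, p=0.1):
    """1 minus the Jaro-Winkler similarity; p is the prefix weight."""
    if not 0 <= p <= 0.25:
        raise ValueError("p must lie in [0, 0.25]")
    sim = jaro_similarity(a, b)
    prefix = 0  # length of the common prefix, capped at 4 characters
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return 1.0 - (sim + prefix * p * (1.0 - sim))
```

With p = 0 the result reduces to the plain Jaro distance; larger p rewards strings that share a prefix.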

Levenshtein

Minimal number of insertions, deletions and replacements needed for transforming string a into string b.
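A minimal sketch of the Levenshtein distance using the standard dynamic-programming recurrence, keeping only two rows of the table. The function name is illustrative.

```python
def levenshtein_distance(a, b):
    """Minimal number of insertions, deletions and replacements from a to b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # replacement (free if equal)
            ))
        previous = current
    return previous[-1]
```

For example, turning "kitten" into "sitting" takes two replacements and one insertion.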

N-gram / Q-gram

Sum of absolute differences between N-gram vectors of both strings.
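A minimal sketch of the q-gram distance as described above, with q-gram vectors stored as counters. The function names and the default q = 2 are assumptions.

```python
from collections import Counter


def qgram_counts(text, q=2):
    """Count every substring of length q in text."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def qgram_distance(a, b, q=2):
    """Sum of absolute differences between the q-gram counts of a and b."""
    ca, cb = qgram_counts(a, q), qgram_counts(b, q)
    return sum(abs(ca[g] - cb[g]) for g in set(ca) | set(cb))
```

With q = 2, "abcd" and "abce" share "ab" and "bc" but differ in "cd" versus "ce", giving a distance of 2.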


Cosine

1 minus the cosine similarity of both N-gram vectors.
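A minimal sketch of the cosine distance over q-gram count vectors. The function names, the default q = 2 and the decision to raise an error for strings shorter than q are assumptions.

```python
import math
from collections import Counter


def qgram_counts(text, q=2):
    """Count every substring of length q in text."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def cosine_distance(a, b, q=2):
    """1 minus the cosine similarity of the q-gram count vectors."""
    ca, cb = qgram_counts(a, q), qgram_counts(b, q)
    dot = sum(ca[g] * cb[g] for g in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("each string must contain at least one q-gram")
    return 1.0 - dot / (norm_a * norm_b)
```

Strings with no q-grams in common are at distance 1, and identical strings are at distance 0 (up to floating-point rounding).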


Soundex

SOUNDEX converts an alphanumeric string to a four-character code that is based on how the string sounds when spoken. The letters A, E, I, O, U, H, W, and Y are ignored unless they are the first letter of the string. Zeroes are added at the end if necessary to produce a four-character code.

SOUNDEX('Ahmedabad') = A531
SOUNDEX('Amdavad') = A531
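A minimal sketch of Soundex. It follows the commonly used American Soundex rules, in which H and W do not separate repeated codes; this is an assumption, and a particular database's SOUNDEX function may differ in edge cases.

```python
# Letter groups that share a Soundex digit; all other letters get no digit.
SOUNDEX_CODES = {}
for letters, digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(name):
    """Return the four-character Soundex code of a name."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0]  # the first letter is always kept
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != previous:  # skip repeats of the same code
            code += digit
        if ch not in "HW":  # vowels reset the previous code, H and W do not
            previous = digit
    return (code + "000")[:4]  # pad with zeroes, keep four characters
```

Both spellings from the slide, "Ahmedabad" and "Amdavad", produce A531, which is why Soundex is useful for matching names that sound alike.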

Steps to Data Cleaning

Original record:

Sujit Joshi
88 Ashoka Appts
Juhu
Bombay
Tel: 6201670
Cell: 998054046
Email: [email protected]

Problems in the original record:

• Missing salutation
• Abbreviated house name
• Missing postcode
• Old telephone number
• Incorrect email id

Cleaned record:

Mr. Sujit Joshi
88 Ashoka Apartments
Gandhigram Road
Juhu
Mumbai – 400 049
India
Tel: (22) 26201670
Cell: 998054046X
Email: [email protected]

Corrections applied:

• Salutation added
• House name standardised
• Postcode & Country added
• Telephone number corrected for known changes (add 2 to 7-digit numbers; include STD code for the city)
• Cell number tagged as being of invalid format
• Email id typo corrected

Components of Address

Steps in Data Cleansing

• Parsing
• Correcting
• Standardizing
• Matching
• Consolidating

Parsing

Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.

Input Data from Source File:

Beth Christine Parker, SLS MGR
Regional Port Authority
Federal Building
12800 Lake Calumet
Hedgewisch, IL

Parsed Data in Target File:

First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: Lake Calumet
City: Hedgewisch
State: IL
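The parsing example above could be sketched as follows. This assumes the input always has exactly the five-line layout shown on the slide (name and title, firm, location, street line, city and state); the function name and dictionary keys are hypothetical, and real parsers need far more robust rules.

```python
import re


def parse_contact_block(block):
    """Split a five-line contact block into named fields."""
    lines = [line.strip() for line in block.strip().splitlines() if line.strip()]
    # Raises ValueError if the block does not have exactly five lines.
    name_title, firm, location, street_line, city_state = lines
    name, _, title = name_title.partition(",")
    name_parts = name.split()
    # Leading digits are the house number, the rest is the street name.
    match = re.match(r"(\d+)\s+(.+)", street_line)
    number, street = match.groups() if match else ("", street_line)
    city, _, state = city_state.partition(",")
    return {
        "First Name": name_parts[0],
        "Middle Name": " ".join(name_parts[1:-1]),
        "Last Name": name_parts[-1],
        "Title": title.strip(),
        "Firm": firm,
        "Location": location,
        "Number": number,
        "Street": street,
        "City": city.strip(),
        "State": state.strip(),
    }
```

Calling it on the five input lines from the slide yields the same fields as the parsed target record.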

Correcting

Corrects parsed individual data components using sophisticated data algorithms and secondary data sources.

Parsed Data:

First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: Lake Calumet
City: Hedgewisch
State: IL

Corrected Data:

First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: South Butler Drive
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398

Standardizing

Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.

Corrected Data:

First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: South Butler Drive
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398

Standardized Data:

Pre-name: Ms.
First Name: Beth
1st Name Match Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr.
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
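Standardizing usually comes down to applying lookup tables of business rules. The sketch below reproduces two of the conversions from the example (job title and street abbreviations); the table contents and function names are hypothetical.

```python
# Hypothetical business rules, covering only the values in the example.
TITLE_STANDARDS = {"SLS MGR": "Sales Mgr."}
STREET_ABBREVIATIONS = {"South": "S.", "Drive": "Dr."}


def standardize_title(title):
    """Replace a known job-title variant with its preferred form."""
    cleaned = title.strip()
    return TITLE_STANDARDS.get(cleaned.upper(), cleaned)


def standardize_street(street):
    """Abbreviate known street words one by one."""
    return " ".join(STREET_ABBREVIATIONS.get(word, word) for word in street.split())
```

Values that are not in the tables pass through unchanged, so new variants can be added to the rules as they are discovered.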

Matching

Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications.


Match Patterns

[Table: match patterns. Columns: Business Name, Street, Branch Type, Customer#/Tax ID, City, Vendor Code, Pattern, Pattern I.D. Each field comparison is graded as Exact, VClose, Close or Blanks.]

Pattern   Pattern I.D.
AAAAAA    P110
ABAAA-    P115
ABA-AA    P120
ABCCAA    S300
BBACAA    S310

Corrected Data (Data Source #1):

Pre-name: Ms.
First Name: Beth
1st Name Match Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr.
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398

Corrected Data (Data Source #2):

Pre-name: Ms.
First Name: Elizabeth
1st Name Match Standards: Beth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker-Lewis
Title:
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr., Suite 2
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
Phone: 708-555-1234
Fax: 708-555-5678
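One way to build a match pattern for two records is to grade each field and join the grades into a string. This is a sketch under several assumptions: the letter codes (A for exact, B for very close, C for close, "-" for blanks, X for no match), the thresholds and the use of `difflib.SequenceMatcher` as the similarity measure are all choices made here, not rules from the slides.

```python
from difflib import SequenceMatcher


def grade_field(a, b, very_close=0.9, close=0.75):
    """Grade one field comparison with a single letter."""
    a = (a or "").strip().lower()
    b = (b or "").strip().lower()
    if not a or not b:
        return "-"  # at least one side is blank
    if a == b:
        return "A"  # exact match
    ratio = SequenceMatcher(None, a, b).ratio()
    if ratio >= very_close:
        return "B"  # very close
    if ratio >= close:
        return "C"  # close
    return "X"  # no usable match


def match_pattern(record1, record2, fields):
    """Join the per-field grades into a pattern string such as 'AA-'."""
    return "".join(grade_field(record1.get(f), record2.get(f)) for f in fields)
```

The resulting pattern string can then be looked up in a table of predefined business rules to decide whether the two records count as duplicates.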

Consolidating

Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.

Inputs: Corrected Data (Data Source #1) and Corrected Data (Data Source #2)

Consolidated Data:

Name: Ms. Beth (Elizabeth) Christine Parker-Lewis
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Address: 12800 S. Butler Dr., Suite 2 Chicago, IL 60633-2398
Phone: 708-555-1234
Fax: 708-555-5678
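Consolidation needs a survivorship rule that decides which value wins for each field. The sketch below uses a deliberately simple, hypothetical rule (keep the longest non-empty value); the field names are illustrative, and real rules often also consider source reliability or recency.

```python
def consolidate(records):
    """Merge matched records, keeping the longest non-empty value per field."""
    merged = {}
    for record in records:
        for field, value in record.items():
            value = (value or "").strip()
            # A longer value is assumed to carry more information.
            if len(value) > len(merged.get(field, "")):
                merged[field] = value
    return merged
```

With this rule, "Parker-Lewis" survives over "Parker" and "S. Butler Dr., Suite 2" over "S. Butler Dr.", while a title that is blank in one source is filled from the other. Fields that are empty in every record are left out of the result.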

Sometimes such algorithms don't work! So we do manual cleaning.

Example of Manual Cleaning

Car Name       Correct Name
Waganer        Wagon R
Sujhuki        Suzuki
Benj           Mercedes Benz
Faurtuner      Fortuner
Scopeio        Scorpio
Sevrole        Chevrolet
Furrarree      Ferrari
Landcrusher    Land Cruiser
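Manual corrections like the ones above are typically stored as a lookup table and applied in bulk. A minimal sketch, with a hypothetical function name and the entries taken from the example:

```python
# Manually curated corrections, keyed by the lower-cased raw value.
CAR_NAME_CORRECTIONS = {
    "waganer": "Wagon R",
    "sujhuki": "Suzuki",
    "benj": "Mercedes Benz",
    "faurtuner": "Fortuner",
    "scopeio": "Scorpio",
    "sevrole": "Chevrolet",
    "furrarree": "Ferrari",
    "landcrusher": "Land Cruiser",
}


def correct_car_name(raw):
    """Return the curated correction, or the trimmed input if none exists."""
    cleaned = raw.strip()
    return CAR_NAME_CORRECTIONS.get(cleaned.lower(), cleaned)
```

Values without an entry are returned unchanged, which makes it easy to list what is still uncorrected and extend the table.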

Car Cleaning Approach

Other data that we have cleaned:

• Occupation
• Marital Status
• Gender
• And many other fields …

Data Capture Tips

Top Ten Data Capture Tips

• Every contact is a data capture opportunity
• Make it easy for the end user to give you information
• Incentivise your end user to part with their details
• Collect data in line with privacy regulations
• Decide what data you need and prioritise
• Don't ask everything at once; build it over time
• Set targets for breadth, depth and quality
• Collect data in a standardized format
• Streamline the data from point of capture to storage
• If you can't collect it, BUY it!

End of Part 1

Understanding India … Census 2011

Census in India

The first census in India in modern times was conducted in 1872.

Population census has been carried out every 10 years.

The census is carried out by the office of the Registrar General and Census Commissioner of India, Delhi, an office in the Ministry of Home Affairs, Government of India, under the 1948 Census of India Act.

CENSUS

The 15th Indian National Census was conducted in two phases:

• House listing
• Population enumeration

The Census covered:

• 640 districts
• 5767 tehsils
• 7742 towns
• More than 6 lakh villages

2.7 million officials visited households in 7,742 towns and 6,40,867 villages, classifying the population in different segments.

POPULATION COMPARISON

[Chart labels: 2001, 2011, 2021]

The population of India increased by more than 181 million during the decade 2001-2011. This addition is slightly lower than the population of Brazil, the fifth most populous country in the world!

India as compared to the world

The gap between India, the country with the second largest population in the world, and China, the country with the largest population, narrowed from 238 million in 2001 to nearly 131 million in 2011. On the other hand, the gap between India and the United States of America, which has the third largest population, has widened to about 902 million from 741 million in 2001.

State wise population 2001


State wise population 2011


Census report of 2011

Sex Ratio

The sex ratio of India is 940 females per 1000 males. The sex ratio at the national level has risen by seven points since the last Census in 2001. This is the highest since 1971.

Sex Ratio Trend in India

The sex ratio in India has historically been negative or, in other words, unfavourable to females. The sex ratio reached its lowest point in 1991 but has kept rising since then.

State-wise Sex Ratios

Census Facts 2011

• Thane district of Maharashtra is the most populated district of India.
• Dibang Valley of Arunachal Pradesh is the least populated.
• Kurung Kumey of Arunachal Pradesh registered the highest population growth rate of 111.01 percent.
• Longleng district of Nagaland registered a negative population growth rate of (-58.39).
• Mahe district of Puducherry has the highest sex ratio of 1176 females per 1000 males.
• Daman district has the lowest sex ratio of 533 females per 1000 males.
• Serchhip district of Mizoram has the highest literacy rate of 98.76 percent.
• Alirajpur of MP is the least literate district of India with a figure of only 37.22 percent.
• North East Delhi has the highest density with a figure of 37346 persons per square kilometer.
• Dibang Valley has the least density of 1 person per sq. km.

States having highest population

Uttar Pradesh (19.96 crore) increased at the rate of 20% from 2001.

Maharashtra (11.24 crore) increased at the rate of 15% since the last census.

Delhi is the most densely populated, with a density of 11297 per sq km (an increase of 21% from 2001).

Bihar is the most densely populated state, with a density of 1102 per sq km (an increase of 25% from 2001).

States with highest literacy

Interesting Facts

Interesting Facts: Telecom

“More phones than toilets” Census 2011 sheds light on changing India.

63.2 per cent of households in India now have a telephone/mobile facility (82 per cent in urban and 54 per cent in rural areas).

The penetration of mobile phone is 59 per cent and landline is 10 per cent.

More than half of Indian households (some 53.1 per cent) do not have access to something as basic as a toilet.

Facts: Communication

The penetration of computers and laptops in India is only 9.4 per cent, or less than one out of 10 households, with only 3 per cent having an internet facility.

The penetration of internet is 8 per cent in urban areas as compared to less than 1 per cent in rural areas.

Maharashtra is the biggest Indian Internet market with 18%.

• 47.2% of Indians own a television
• 19.9% of Indians own a radio/transistor
• 13.42 million broadband connections (homes and offices combined)

Facts: Literacy and Population

Uttar Pradesh is the most populous state and the combined population of Uttar Pradesh and Maharashtra is more than that of the USA.

Ten states and union territories have attained literacy rate of above 85 per cent.

According to the Census report India's population is now bigger than the combined population of USA, Indonesia, Brazil, Pakistan and Bangladesh.

74% of Indians can now read, write and do basic maths (like adding, subtracting) — that means that 3 out of every 4 Indians are literate.

Facts: General

Females outnumber males in Goa.

Population:

• 50% <= 25 yrs of age
• 65% <= 35 yrs of age

It is anticipated that the median age of an Indian citizen will be 29 years in 2020, in comparison to 48 for Japan and 37 for China.

India covers 2.4% of the land territory of the world and represents more than 17.5% of the population of the world.

Facts: General

Total expenditure and materials used:

• Cost: Rs. 2200 crore
• Cost per person: Rs. 18.33
• No. of Census functionaries: 2.7 million
• No. of languages in which schedules were canvassed: 16
• No. of languages in which training manuals were prepared: 18
• No. of schedules printed: 340 million
• No. of training manuals printed: 5.4 million
• Paper utilised: 12,000 MTs
• Material moved: 10,500 MTs

What do we do with Census?

Census is more than population, literacy and sex ratio.

Census can provide insights about various dimensions!

• The data is available in the xls format
• The data is available free of cost
• The data is clean
• It has proper database architecture with codes in place

Two stages of Census

• Houselisting
• Population Enumeration

Houselisting questionnaire

Population enumeration

References

http://www.census2011.co.in/
http://articles.timesofindia.indiatimes.com/2011-03-31/india/29365558_1_uts-percentage-decadal-growth-rates-census
http://censusindia.gov.in/
http://en.wikipedia.org/wiki/2011_census_of_India
http://www.mapsofindia.com/census2011/

THANK YOU