
Upload: hubert-johnson

Post on 31-Dec-2015


Problem Statement

This research project has developed five methods for unstructured text processing; however, the best combination of methods for extracting homogeneous POS clusters has not been determined.

Methods

Splitting (S): assign words to multiple clusters by splitting Centers according to their associative and non-associative Surrounds

Objective: To place multi-POS words into every correct POS (sub) cluster

Bidirectional Clustering: cluster Surrounds before Centers

Objective: To reduce feature dimensionality and improve the precision/recall of POS clustering by aggregating a large number of raw Surrounds features by similar Centers

Methods

Chunking: extract multi-word centers using high-frequency surrounds

Objective: To find multi-word centers using previously learned surrounds

Distance Function (D): using correlation with correlation confidence, or correlation only, as the measure of cluster similarity

Objective: To measure the distance of surrounds or centers based on co-occurrence counts

Methods

Expression start tag (E): Using surrounds where the ‘fore’ = <expression start>

Objective: To enable extracting features from the first word of an expression

Performance Criteria

Achieve no less than 70% correctness in assigning Centers to POS clusters

Task Objective

Evaluate clustering performance for various combinations of methods using the Children’s Corpus and determine which combination is most suitable for continued research.

Splitting Goals and Hypotheses

G1. Determine if POS splitting can accurately place multi-POS words into correct clusters

H1.1: For any pair of Centers, using occurrence count correlation, each Surrounds feature can be classified as associative or non-associative

H1.2: By incrementally associating only the Center’s associative features with clusters, a multi-POS center can be assigned to multiple subclusters representing homogeneous POS centers
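As a concrete illustration of H1.1, a minimal sketch, under one plausible reading of "associative" (not necessarily the author's exact rule): a Surrounds feature counts as associative for a Center pair when both Centers' occurrence counts deviate from their means in the same direction, i.e. when the feature contributes positively to the pair's Pearson correlation.

```python
def classify_surrounds(x, y):
    """H1.1 sketch: x and y are occurrence-count vectors for a Center pair
    over the same ordered list of Surrounds features. A feature is called
    'associative' when both counts deviate from their means in the same
    direction (a positive contribution to the pair's Pearson correlation).
    This is an assumed interpretation, not the document's exact rule."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return ["associative" if (a - mx) * (b - my) > 0 else "non-associative"
            for a, b in zip(x, y)]
```

Under H1.2, only the associative features of a Center would then be added to a candidate subcluster, so a multi-POS Center can join several homogeneous subclusters.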

Bidirectional Clustering Goals and Hypotheses

G2: Determine if clustering Surrounds before Centers (bidirectional clustering) produces a better Center clustering, by POS, than clustering Centers only (unidirectional clustering)

H2.1 Clustering Surrounds aggregates Surrounds that capture Centers of the same POS

H2.2 Bidirectional clustering increases the distance between clusters and decreases the distance within clusters when compared to unidirectional clustering

Chunking Goals and Hypotheses

G3: Determine if high-occurrence-count Surrounds can be used effectively to extract multi-word entities/chunks

H3.1 Surrounds that are effective in extracting a single POS for one-word chunks can also be effective for extracting multi-word chunks (induction)

H3.2 High-occurrence Surrounds indicate a single POS with statistical significance
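A minimal sketch of how G3 could be operationalized, assuming (as the Methods slide suggests) that a chunk is any short word span bracketed by a previously learned high-frequency (fore, aft) Surround pair; `reference_surrounds` and `max_len` are illustrative names, with `max_len` corresponding to the factor v < 4.

```python
def extract_chunks(tokens, reference_surrounds, max_len=3):
    """G3 sketch: slide over the token stream and, wherever a known
    high-frequency (fore, aft) Surround pair brackets a span of at most
    max_len words, emit that span as a candidate multi-word Center."""
    chunks = []
    for i, fore in enumerate(tokens):
        for span in range(1, max_len + 1):
            j = i + span + 1          # index of the candidate aft word
            if j < len(tokens) and (fore, tokens[j]) in reference_surrounds:
                chunks.append(tuple(tokens[i + 1:j]))
    return chunks
```

For example, with the Surround ("the", "ran") learned from one-word chunks, the span "big red dog" in "the big red dog ran" is extracted as a candidate chunk.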

Distance Function Goals and Hypotheses

G4: Determine which product-moment correlation distance function yields the most accurate POS clustering

D1: Correlation only: cluster X, Y : argmax_{X,Y} { pearson(X, Y) }

D2: Correlation + Confidence: cluster X, Y : argmax_{X,Y} { pearson(X, Y) : confidence(pearson(X, Y), |X ∩ Y|) > conf_threshold }

H4.1 Pearson Product Moment Correlation yields an improved POS clustering when applying some correlation confidence threshold
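The two distance functions can be sketched as follows. The confidence computation is an assumption: the document does not specify one, so a Fisher z-transform (a standard normal approximation for the significance of a Pearson r) is used here, with the shared feature count standing in for |X ∩ Y|.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two co-occurrence count vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mx) ** 2 for a in x))
           * math.sqrt(sum((b - my) ** 2 for b in y)))
    return num / den if den else 0.0

def confidence(r, n):
    """Confidence that the true correlation is positive, via the Fisher
    z-transform (an assumed stand-in for the document's confidence measure)."""
    if r >= 1.0:
        return 1.0
    if r <= -1.0 or n <= 3:
        return 0.0
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - 3)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF

def d2_similar(x, y, r_min=0.8, conf_threshold=0.9):
    """D2 merge test: correlation floor (r=0.8) plus confidence threshold
    (π=0.9); D1 is the same test with the confidence clause dropped."""
    r = pearson(x, y)
    return r >= r_min and confidence(r, len(x)) > conf_threshold
```

D1 would merge any pair clearing r_min; D2 additionally rejects high correlations computed over too few shared features.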

Open question: should Brown Clustering be applied as an alternative policy?

Start Tag Goals and Hypotheses

G5: Determine if clusters influenced by Surrounds features having fore = <expression start> (start surrounds) produce less consistent POS clusters than clusters dominated by Surrounds where the fore is a distinct word

H5.1 Start Surrounds tend to influence a small proportion of final clusters

H5.2 The information loss associated with a start tag produces inconsistencies in POS clustering

Governing Propositions

1. A corpus with short sentences, simple grammar, and limited vocabulary will produce the best results, but the results will inform further research with more advanced linguistic constructs.

2. Not all Surrounds that are discovered can and should be used for this experiment. The experiment will be restricted to using the 2,000 surrounds with the largest number of occurrence counts.

3. Not all words in the Lexicon can and should be clustered by POS. The experiment will be restricted to using the 1,000 Centers with the largest number of occurrence counts.

4. The clustering algorithm will be hierarchical, agglomerative, using a correlation-based distance metric.

5. For each experiment, only one co-occurrence matrix will be applied; there is no incremental enhancement of the co-occurrence counts.
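Propositions 4 and 5 (hierarchical agglomerative clustering over a single, fixed co-occurrence matrix, with a correlation-based metric) can be sketched as follows. The average-linkage choice and the r_min stopping rule are assumptions; the document fixes only the metric family.

```python
import math

def pearson(x, y):
    """Pearson correlation of two co-occurrence count vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mx) ** 2 for a in x))
           * math.sqrt(sum((b - my) ** 2 for b in y)))
    return num / den if den else 0.0

def agglomerate(vectors, r_min=0.8):
    """Greedy agglomerative clustering: repeatedly merge the pair of
    clusters with the highest mean pairwise correlation (assumed
    average linkage), stopping when no pair exceeds r_min."""
    clusters = [[name] for name in vectors]
    while True:
        best, best_r = None, r_min
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                rs = [pearson(vectors[a], vectors[b])
                      for a in clusters[i] for b in clusters[j]]
                r = sum(rs) / len(rs)
                if r > best_r:
                    best, best_r = (i, j), r
        if best is None:
            return clusters
        i, j = best
        clusters[i] += clusters.pop(j)
```

With count vectors over the same Surrounds features, Centers whose contexts co-vary (e.g. two nouns) end up in one cluster while uncorrelated Centers stay apart.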

Performance Metrics

• Cluster Correctness:

• Cluster POS: most commonly occurring POS in the cluster (winner takes all)

• Correct: number of Centers in the cluster whose POS matches the cluster POS

• Incorrect: number of Centers in the cluster whose POS differs from the cluster POS
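The winner-takes-all grading above can be computed directly; `gold_pos` (a map from Center to its true POS) is an illustrative name.

```python
from collections import Counter

def cluster_correctness(clusters, gold_pos):
    """Winner-takes-all grading: a cluster's POS is the most common POS
    among its members; members matching it are Correct, the rest Incorrect.
    Returns the overall fraction correct (target: >= 0.70)."""
    correct = total = 0
    for cluster in clusters:
        tags = [gold_pos[c] for c in cluster]
        winner, _ = Counter(tags).most_common(1)[0]
        correct += sum(t == winner for t in tags)
        total += len(tags)
    return correct / total
```

For example, a clustering of {dog, cat, run} and {jump, walk} against gold tags N, N, V, V, V scores 4/5 = 0.8, clearing the 70% criterion.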

Solution Pipeline

Corpus → Parsing → N-gram extraction → Chunking → Center-Surrounds Counting → Co-Occurrence Matrix → Surrounds Clustering → Center Clustering → Evaluation → Results (POS, …)

(Intermediate artifacts from the diagram: Word Lex, Centers, Surrounds, Center Clusters. Stage condition labels: D=1, E=T, S=T.)
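The Center-Surrounds Counting stage can be sketched as follows, assuming the D=1 label means one context word on each side, so a Surround is a (fore, aft) pair; the <expression end> pad token is an assumption, since the document names only <expression start>.

```python
from collections import Counter

def count_center_surrounds(tokens, start_tag="<expression start>",
                           end_tag="<expression end>"):
    """Counting-stage sketch: each token is a Center whose Surround is its
    (fore, aft) word pair; an expression-initial Center gets start_tag as
    its fore (the E factor). end_tag is an assumed symmetric pad."""
    counts = Counter()
    padded = [start_tag] + tokens + [end_tag]
    for i in range(1, len(padded) - 1):
        center = padded[i]
        surround = (padded[i - 1], padded[i + 1])
        counts[(center, surround)] += 1
    return counts
```

Summing these counts over a corpus yields the Centers-by-Surrounds co-occurrence matrix that the clustering stages consume.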

Factors

Splitting: {Yes, No}
  Minimum split ratio: ξ = 0.3
  Tolerance angle: ϑ = 30 degrees

Bidirectional Clustering: {Yes, No}
  Surrounds minimum confidence value: ψ = 0.9

Chunking: {Yes, No}
  Max center word length: v < 4
  Number of reference surrounds: τ = 200

Distance Function: {Correlation Value, Correlation Value + Confidence}
  Center minimum correlation value: r = 0.8
  Center minimum correlation confidence: π = 0.9
  Open question: Brown Clustering as an alternative?

Start Tag Use: {Yes, No}

Experiment Configurations (1-16)

Ex ID | Splitting | Bidirectional Clustering | Chunking | Distance Function | Expression Start
 1 | No | No    | No    | r=0.8        | No
 2 | No | No    | No    | r=0.8        | Yes
 3 | No | No    | No    | r=0.8, π=0.9 | No
 4 | No | No    | No    | r=0.8, π=0.9 | Yes
 5 | No | No    | τ=200 | r=0.8        | No
 6 | No | No    | τ=200 | r=0.8        | Yes
 7 | No | No    | τ=200 | r=0.8, π=0.9 | No
 8 | No | No    | τ=200 | r=0.8, π=0.9 | Yes
 9 | No | ψ=0.9 | No    | r=0.8        | No
10 | No | ψ=0.9 | No    | r=0.8        | Yes
11 | No | ψ=0.9 | No    | r=0.8, π=0.9 | No
12 | No | ψ=0.9 | No    | r=0.8, π=0.9 | Yes
13 | No | ψ=0.9 | τ=200 | r=0.8        | No
14 | No | ψ=0.9 | τ=200 | r=0.8        | Yes
15 | No | ψ=0.9 | τ=200 | r=0.8, π=0.9 | No
16 | No | ψ=0.9 | τ=200 | r=0.8, π=0.9 | Yes

Experiment Configurations (17-32)

Ex ID | Splitting | Bidirectional Clustering | Chunking | Distance Function | Expression Start
17 | ξ=0.3, ϑ=30 | No    | No    | r=0.8        | No
18 | ξ=0.3, ϑ=30 | No    | No    | r=0.8        | Yes
19 | ξ=0.3, ϑ=30 | No    | No    | r=0.8, π=0.9 | No
20 | ξ=0.3, ϑ=30 | No    | No    | r=0.8, π=0.9 | Yes
21 | ξ=0.3, ϑ=30 | No    | τ=200 | r=0.8        | No
22 | ξ=0.3, ϑ=30 | No    | τ=200 | r=0.8        | Yes
23 | ξ=0.3, ϑ=30 | No    | τ=200 | r=0.8, π=0.9 | No
24 | ξ=0.3, ϑ=30 | No    | τ=200 | r=0.8, π=0.9 | Yes
25 | ξ=0.3, ϑ=30 | ψ=0.9 | No    | r=0.8        | No
26 | ξ=0.3, ϑ=30 | ψ=0.9 | No    | r=0.8        | Yes
27 | ξ=0.3, ϑ=30 | ψ=0.9 | No    | r=0.8, π=0.9 | No
28 | ξ=0.3, ϑ=30 | ψ=0.9 | No    | r=0.8, π=0.9 | Yes
29 | ξ=0.3, ϑ=30 | ψ=0.9 | τ=200 | r=0.8        | No
30 | ξ=0.3, ϑ=30 | ψ=0.9 | τ=200 | r=0.8        | Yes
31 | ξ=0.3, ϑ=30 | ψ=0.9 | τ=200 | r=0.8, π=0.9 | No
32 | ξ=0.3, ϑ=30 | ψ=0.9 | τ=200 | r=0.8, π=0.9 | Yes
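The 32 configurations form a full 2^5 factorial over the five factors, so they can be generated rather than enumerated by hand. The dictionary keys and the ordering convention (Splitting varies slowest, Expression Start fastest, matching the tables) are illustrative.

```python
from itertools import product

# Each factor is either off ("No") or on at its fixed level from the
# Factors slide; the distance function always includes r=0.8.
FACTORS = {
    "splitting":     ["No", "ξ=0.3, ϑ=30"],
    "bidirectional": ["No", "ψ=0.9"],
    "chunking":      ["No", "τ=200"],
    "distance":      ["r=0.8", "r=0.8, π=0.9"],
    "start_tag":     ["No", "Yes"],
}

def configurations():
    """Generate the 2^5 = 32 experiment configurations, first factor
    varying slowest (so Ex 1-16 have Splitting off, Ex 17-32 on)."""
    names = list(FACTORS)
    return [dict(zip(names, combo), ex_id=i + 1)
            for i, combo in enumerate(product(*FACTORS.values()))]
```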

Corpus

title | words | surrounds | example
A BOOK OF NONSENSE | 4407 | 3194 | There was a Young Person of Smyrna, Whose Grandmother threatened to burn her; But she seized on the Cat, and said, "Granny, burn that! You incongruous Old Woman of Smyrna!"
HARRYS LADDER TO LEARNING | 13832 | 9857 | Humpty Dumpty sat on a wall, Humpty Dumpty had a great fall, Not all the king's horses, nor all the king's men, Could set Humpty Dumpty up again.
LITTLE STORIES FOR LITTLE CHILDREN | 1484 | 1127 | Had each a nice doll, and they took care of them. One day Tom call-ed them to play at ball, and they ran a-way to play, and left the two dolls on a chair. By and by the cat came in the room, and pull-ed the dolls to pieces, think-ing I dare say, that it was fine fun to tear them to bits, and scam-per round the room with poor dol-ly's nose in her mouth.
MCGUFFEYS FIRST ECLECTIC READER | 7347 | 5174 | Ned is on the box. He has a pen in his hand. A big rat is in the box. Can the dog catch the rat? Come with me, Ann, and see the man with a black hat on his head. The fat hen has left the nest. Run, Nat, and get the eggs.
MCGUFFEYS SECOND ECLECTIC READER | 18515 | 13028 | 1. A little play does not harm any one, but does much good. After play, we should be glad to work. 2. I knew a boy who liked a good game very much. He could run, swim, jump, and play ball; and was always merry when out of school. 3. But he knew that time is not all for play; that our minutes, hours, and days are very precious.
MOTHER GOOSE | 16944 | 12487 | Little Bo-Peep has lost her sheep, And can't tell where to find them; Leave them alone, and they'll come home, And bring their tails behind them.
MY DOG TRAY | 1976 | 1528 | At last the cruel woman said She had no bones to throw away; She could not keep a useless cur, She really must drive off old Tray. And, with a broomstick in her hand, She hunted the poor dog about, Until, with many a cruel blow, From his old home she drove him out.
NEW NATIONAL FIRST READER | 7498 | 4730 | Boy, come down from that tree! Come away, and soon there will be little birds in the nest. What a bad boy, to take the eggs of a bird! Did you see the boys with the drum and gun, Ned? Yes. I saw Roy beat his drum, rub-a-dub, rub-a-dub! I am glad the boys have a drum. It is fun to march, march, march. Will you give me the apple you have in your hand, Ned?
OLD-TIME STORIES, FAIRY TALES AND MYTHS RETOLD BY CHILDREN | 8790 | 6275 | Splash! Splash! The mother duck was in the water. Then she called the ducklings to come in. They all jumped in and began to swim. The big, ugly duckling swam, too. The mother duck said, "He is not a turkey. He is my own little duck. He will not be so ugly when he is bigger."
THE GREAT BIG TREASURY OF BEATRIX POTTER | 30902 | 24080 | For behind the wooden wainscots of all the old houses in Gloucester, there are little mouse staircases and secret trap-doors; and the mice run from house to house through those long, narrow passages.
THE TALE OF MRS. TITTLEMOUSE | 1588 | 1097 | He sat such a while that he had to be asked if he would take some dinner? First she offered him cherry-stones. "Thank you, thank you, Mrs. Tittlemouse! No teeth, no teeth, no teeth!" said Mr. Jackson. He opened his mouth most unnecessarily wide; he certainly had not a tooth in his head.
THE TALE OF PETER RABBIT | 1036 | 734 | He found a door in a wall; but it was locked and there was no room for a fat little rabbit to squeeze underneath. An old mouse was running in and out over the stone doorstep, carrying peas and beans to her family in the wood. Peter asked her the way to the gate but she had such a large pea in her mouth she could not answer. She only shook her head at him.
THE TALE OF TOM KITTEN | 887 | 616 | When the three kittens were ready, Mrs. Tabitha unwisely turned them out into the garden, to be out of the way while she made hot buttered toast. "Now keep your frocks clean, children! You must walk on your hind legs. Keep away from the dirty ash-pit, and from Sally Henny Penny, and from the pig-stye and the Puddle-Ducks."
THE TINY PICTURE BOOK | 1710 | 1211 | FROGS! frogs! I hear their merry croak From river, pond, and stream; O, now I know that Spring has come, And all will soon be green.

General Pipeline

Corpus → Parsing → N-gram extraction → Chunking → Center-Surrounds Counting → Co-Occurrence Matrix → Surrounds Clustering → Center Clustering → Evaluation → Center Clusters, POS Grade

(Intermediate artifacts from the diagram: Word Lex, Centers, Surrounds. Stage condition labels: D=1, E=T, S=T.)