My lightblue: Course Project Presentation

COMP4431 Artificial Intelligence

Department of Computing, The Hong Kong Polytechnic University

MA Mingyu [email protected], BSc (Hons) Computing, 14110562D

derek.ma

December 1, 2017

Contents

• Preparation
• Training Strategies and Observations
• Reflections

Preparation

Training Data is the Heart

Training data lays a solid foundation for smallblue to grow up.


Challenges for Data

• What kind of data is needed?

• Where can I get the proper data?

• How to pre-process the data?

• What kind of topics should be trained?


What kind of data is needed?

Pre-training and literature review

1. Fully structured sentences with consistent grammar: decreases the complexity of the training samples
2. Sentences that are not too long (guess: the chatbot uses a generation model)
3. Wider coverage of vocabulary: handles more topics
4. Commutative Q&A interactions: the chatbot links neighboring sentences


Where can I get the proper data?

Common datasets in academia

Dataset               | Sentence structure and consistent grammar | Not-too-long sentences | Wide coverage of vocabulary | Commutative Q&A interactions
NUS SMS Corpus        | No                                        | Yes                    | Yes                         | Yes
Cornell Movie Dialogs | Yes                                       | No                     | Yes                         | Yes
Cornell Court Dialogs | Yes                                       | No                     | Yes                         | No
UCSB Spoken English   | No                                        | No                     | Yes                         | Yes
Eslfast               | Yes                                       | Yes                    | Yes                         | Yes
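
The comparison above can be read as a simple filter: keep only the datasets that satisfy all four criteria. The following is a minimal Python sketch of that reading. The Yes/No values are transcribed from the table, while the identifiers `CRITERIA`, `DATASETS`, `meets_all`, and `failed_criteria` are illustrative assumptions, not part of the original project.

```python
# Yes/No values are transcribed from the comparison table.
# All identifiers here are illustrative, not taken from the project.
CRITERIA = (
    "structure_and_grammar",
    "not_too_long",
    "wide_vocabulary",
    "commutative_qa",
)

DATASETS = {
    "NUS SMS Corpus": (False, True, True, True),
    "Cornell Movie Dialogs": (True, False, True, True),
    "Cornell Court Dialogs": (True, False, True, False),
    "UCSB Spoken English": (False, False, True, True),
    "Eslfast": (True, True, True, True),
}


def meets_all(flags):
    """True when a dataset satisfies every criterion."""
    return all(flags)


def failed_criteria(flags):
    """Names of the criteria a dataset does not satisfy."""
    return [name for name, ok in zip(CRITERIA, flags) if not ok]


suitable = [name for name, flags in DATASETS.items() if meets_all(flags)]
print(suitable)
```

Under this reading, Eslfast is the only dataset in the table that passes all four checks.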


How to pre-process the data?

Clean and polite data keep the chatbot's ethics under control

• Processes
  • Remove out-of-vocabulary words
  • Shorten sentence length
  • Enhance commutative elements

• Methods
  • Preliminary check with a Python program
  • Manual double check
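
The preliminary Python check could look roughly like the sketch below, which flags sentences that contain out-of-vocabulary words or exceed a length limit so they can be double-checked by hand. The function name `preliminary_check`, the tokenizing regular expression, the sample vocabulary, and the `max_words=12` threshold are all assumptions made for illustration; the original program is not shown in the slides.

```python
import re


def preliminary_check(sentences, vocabulary, max_words=12):
    """Return (sentence, unknown_words, too_long) for each flagged sentence.

    `max_words` is an assumed threshold, not a value from the slides.
    """
    flagged = []
    for sentence in sentences:
        # Lowercase and keep alphabetic tokens (apostrophes allowed).
        words = re.findall(r"[a-z']+", sentence.lower())
        unknown = [w for w in words if w not in vocabulary]
        too_long = len(words) > max_words
        if unknown or too_long:
            flagged.append((sentence, unknown, too_long))
    return flagged


# Hypothetical vocabulary and sentences, for demonstration only.
vocabulary = {"how", "are", "you", "i", "am", "fine", "thanks"}
sentences = ["How are you?", "I am fine, thanks.", "I am kinda tired."]

for sentence, unknown, too_long in preliminary_check(sentences, vocabulary):
    print(sentence, unknown, too_long)
```

Only the flagged sentences need a manual look, which keeps the double check manageable.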


What kind of topics should be trained?

Topic selection by vocabulary analysis



Word2vec (Mikolov et al., 2013)
• Learns word vectors from context
• Computationally efficient predictive model

Capture the semantic meaning of the vocabulary and check relationships

Processes
• Train relationships on a dataset of the 50,000 most common words, with 50,000 iterations
• Test the model using our vocabulary
• Plot the relationships by semantic meaning
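
To make "learn word vectors from context" concrete: assuming the skip-gram variant of Word2vec, the model is trained on (target, context) word pairs taken from a sliding window over each sentence. The sketch below shows only that pair-generation step, not the training itself. The function name `skipgram_pairs` and the window size of 1 are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=1):
    """Return (target, context) pairs for words within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


print(skipgram_pairs(["how", "are", "you"]))
# [('how', 'are'), ('are', 'how'), ('are', 'you'), ('you', 'are')]
```

Words that share many context words end up with similar vectors, which is what the relationship plot is meant to reveal.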


No significant clusters

11 common daily topics
• Each topic may have multiple dialogs
• Each dialog has three mutated versions
• The versions differ slightly in wording

Training Strategies and Observations

• Dialogs: 47
• Unique sentences: 270+
• Sentences: 8219


Simulated Annealing and Training Flow

[Figure: training flow over topic 1 to topic 6]

RNN/LSTM structure
• Common approach for chatbots
• Still not good at memory

Simulated annealing
• Proper repetition can “cool down” the high-entropy incoming conversations and let the chatbot settle its structure and knowledge.
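
For readers unfamiliar with the analogy, the sketch below shows generic textbook simulated annealing on a toy one-dimensional cost function. It is not the chatbot's internal algorithm, which the slides do not describe. The cooling schedule, step size, seed, and the name `simulated_annealing` are assumptions chosen for illustration.

```python
import math
import random


def simulated_annealing(cost, start, steps=2000, t_start=1.0, t_end=1e-3, seed=0):
    """Minimize `cost` over the real line; returns (best_x, best_cost)."""
    rng = random.Random(seed)
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / (steps - 1))
        candidate = current + rng.uniform(-1.0, 1.0)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept a worse move with
        # probability exp(-delta / t), which shrinks as t cools.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost


best_x, best_cost = simulated_annealing(lambda x: (x - 3.0) ** 2, start=-5.0)
print(best_x, best_cost)
```

At high temperature the search accepts many worse moves and wanders; as the temperature drops it settles, which is the "cooling down" the slide attributes to repeated training.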


Repeating and Its Effect

[Figure: repeated training over topic 1 to topic 6]

Repeating ratio: 15

Sequence matters
• RNN/LSTM can reflect the sequence of the input
• <1,2,3,…,1,2,3> takes a shorter rumination time than <1,2,3,…,3,2,1>
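
The two orderings compared above can be generated with a small helper, sketched here. The name `build_schedule` and its parameters are hypothetical, and `repeats` is only a stand-in: how it maps to the reported repeating ratio of 15 is not specified in the slides.

```python
def build_schedule(topics, repeats, reverse_on_repeat=False):
    """Return the topic order for `repeats` passes over `topics`.

    With reverse_on_repeat=True, every second pass runs backwards.
    """
    schedule = []
    for pass_index in range(repeats):
        ordered = list(topics)
        if reverse_on_repeat and pass_index % 2 == 1:
            ordered.reverse()
        schedule.extend(ordered)
    return schedule


print(build_schedule([1, 2, 3], repeats=2))
# [1, 2, 3, 1, 2, 3]
print(build_schedule([1, 2, 3], repeats=2, reverse_on_repeat=True))
# [1, 2, 3, 3, 2, 1]
```

The first schedule corresponds to the same-order repetition that was observed to need less rumination time.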


Significant Rumination Effect and Long Rumination Time

After several repetitions, a rumination can significantly improve performance

Long rumination time
• 1h for 1000 inputs

Possible explanation
• Rumination is the period during which the data is still in a high-entropy state


Training Tools

An automatic Chrome extension (JavaScript + Chrome)
• Input data
• Modify response
• Like modified response
• Open new session for next topic
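
The four steps above form a loop per topic. The original tool is a JavaScript Chrome extension whose code is not shown, so the sketch below only mirrors the control flow in Python against a hypothetical `RecordingSession` stub that records actions instead of driving a browser. Every class, method, and sample dialog here is an assumption for illustration.

```python
class RecordingSession:
    """Hypothetical stand-in for one chat session; it only records actions."""

    def __init__(self, topic):
        self.topic = topic
        self.log = []

    def send(self, text):
        self.log.append(("input", text))

    def modify_response(self, text):
        self.log.append(("modify", text))

    def like_response(self):
        self.log.append(("like", None))


def train(dialogs_by_topic, open_session=RecordingSession):
    """Run the input / modify / like loop, one new session per topic."""
    sessions = []
    for topic, turns in dialogs_by_topic.items():
        session = open_session(topic)  # open a new session for the next topic
        for question, desired_answer in turns:
            session.send(question)                   # input data
            session.modify_response(desired_answer)  # modify response
            session.like_response()                  # like the modified response
        sessions.append(session)
    return sessions


# Hypothetical sample dialog.
sessions = train({"greeting": [("How are you?", "I am fine, thanks.")]})
print(sessions[0].log)
```

Swapping the stub for a class that actually drives the chat page would leave the loop itself unchanged.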

Reflections


• Utilize “Ruminate” operations: faster training and better results
• Avoid a “big data” strategy: a crowded history is hard to ruminate

Thanks!

My lightblue: Course Project Presentation

derek.ma

MA Mingyu Derek
