My lightblue
Course Project Presentation
COMP4431 Artificial Intelligence
Department of Computing, The Hong Kong Polytechnic University
MA Mingyu [email protected], BSc (Hons) Computing, 14110562D
derek.ma
December 1, 2017
Preparation > Training Strategies and Observations > Reflections
MA Mingyu Derek
Contents
Preparation
Training Strategies and Observations
Reflections
Preparation
Training Data is the Heart
Training data lays the solid foundation for smallblue to grow up.
Challenges for Data
• What kind of data is needed?
• Where can I get the proper data?
• How to pre-process the data?
• What kind of topics should be trained?
What kind of data is needed?
Pre-training and literature review
1. Fully structured sentences with consistent grammar: decreases the complexity of training samples
2. Sentences that are not too long (guess: generation model)
3. Wider coverage of vocabulary: handles more topics
4. Commutative Q&A interactions: the chatbot links surrounding sentences
Where can I get the proper data?
Common datasets in academia
Dataset                  Sentence Structure and Consistent Grammar   Not Too Long Sentences   Wide Coverage of Vocabulary   Commutative Q&A Interactions
NUS SMS Corpus           No                                          Yes                      Yes                           Yes
Cornell Movie Dialogs    Yes                                         No                       Yes                           Yes
Cornell Court Dialogs    Yes                                         No                       Yes                           No
UCSB Spoken English      No                                          No                       Yes                           Yes
Eslfast                  Yes                                         Yes                      Yes                           Yes
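The comparison above can also be read mechanically. As a minimal sketch (the dictionary layout and the helper name `meets_all` are illustrative and not part of the project), the snippet below encodes the Yes/No answers and lists the datasets that satisfy all four criteria:

```python
# Each tuple: (consistent grammar, not too long, wide vocabulary, commutative Q&A),
# copied from the dataset comparison above. The structure itself is an assumption.
DATASETS = {
    "NUS SMS Corpus": (False, True, True, True),
    "Cornell Movie Dialogs": (True, False, True, True),
    "Cornell Court Dialogs": (True, False, True, False),
    "UCSB Spoken English": (False, False, True, True),
    "Eslfast": (True, True, True, True),
}


def meets_all(datasets):
    """Return the names of datasets that satisfy every criterion."""
    return [name for name, flags in datasets.items() if all(flags)]


print(meets_all(DATASETS))  # ['Eslfast']
```

Only Eslfast passes every column, which matches the row of four "Yes" entries.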
How to pre-process the data?
Clean and polite data keeps ethics under control
Processes
• Remove out-of-vocabulary words
• Shorten sentence length
• Enhance commutative elements
Methods
• Preliminary check with a Python program
• Manual double check
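The preliminary Python check could look roughly like the sketch below. The function name `check_sentence`, the `max_words` threshold, and the toy vocabulary are assumptions made for illustration; the slides do not show the actual program.

```python
import re


def check_sentence(sentence, vocabulary, max_words=12):
    """Return a list of problems found in one training sentence.

    `vocabulary` is a set of lowercase words the chatbot is assumed to know;
    `max_words` is a hypothetical length limit.
    """
    words = re.findall(r"[a-z']+", sentence.lower())
    problems = []
    unknown = [w for w in words if w not in vocabulary]
    if unknown:
        problems.append("out-of-vocabulary: " + ", ".join(unknown))
    if len(words) > max_words:
        problems.append(f"too long: {len(words)} words")
    return problems


vocab = {"how", "are", "you", "i", "am", "fine"}
print(check_sentence("How are you?", vocab))    # []
print(check_sentence("I am splendid.", vocab))  # ['out-of-vocabulary: splendid']
```

Sentences that come back with a non-empty list would then go to the manual double check.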
What kind of topics should be trained?
Topic selection by vocabulary analysis
Word2vec (Mikolov et al., 2013)
• Learns word vectors from context
• Computationally efficient predictive model
Capture the semantic meaning of the vocabulary and check relationships
Processes
• Train relationships on a dataset of the 50,000 most common words, with 50,000 iterations
• Test the model using our vocabulary
• Plot the relationships by semantic meaning
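To illustrate what "learning from context" means, here is a minimal sketch of how training pairs are formed in the skip-gram variant of Word2vec. The slides do not say which variant was used, so skip-gram is an assumption, and `skipgram_pairs` and `window` are illustrative names.

```python
def skipgram_pairs(tokens, window=1):
    """Return (target, context) pairs for words within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


print(skipgram_pairs(["how", "are", "you"]))
# [('how', 'are'), ('are', 'how'), ('are', 'you'), ('you', 'are')]
```

A model trained to predict the context word from the target word ends up placing words that share contexts close together, which is what the relationship plot inspects.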
What kind of topics should be trained?
Topic selection by vocabulary analysis
No significant clusters
11 common daily topics
• Each topic may have multiple dialogs
• Each dialog has three mutated versions
• The versions differ slightly in wording
Training Strategies and Observations
47 dialogs
270+ unique sentences
8219 sentences
Simulated Annealing and Training Flow
[Figure: topics 1 to 6]
RNN/LSTM Structure
• Common approach for chatbots
• Still not good at memory
Simulated Annealing
• Proper repetition can “cool down” the high-entropy incoming conversations and let the chatbot settle its structure and knowledge.
Repeating and Its Effect
[Figure: topics 1 to 6]
Repeating ratio: 15
Sequence matters
• RNN/LSTM can reflect the sequence of the input
• <1,2,3,…,1,2,3> takes a shorter rumination time than <1,2,3,…,3,2,1>
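The two orderings compared above can be generated with a small helper. This is only a sketch under one possible reading of the slide, namely that a schedule is the topic list repeated several times; `build_schedule`, `repeats`, and `reverse_last` are hypothetical names, and the exact meaning of the repeating ratio is not defined in the slides.

```python
def build_schedule(topics, repeats, reverse_last=False):
    """Repeat the topic list `repeats` times; optionally reverse the final pass."""
    schedule = []
    for k in range(repeats):
        one_pass = list(topics)
        if reverse_last and k == repeats - 1:
            one_pass.reverse()
        schedule.extend(one_pass)
    return schedule


print(build_schedule([1, 2, 3], repeats=2))                     # [1, 2, 3, 1, 2, 3]
print(build_schedule([1, 2, 3], repeats=2, reverse_last=True))  # [1, 2, 3, 3, 2, 1]
```

The first schedule keeps the same order in every pass, which is the case the slide reports as needing less rumination time.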
Significant Rumination Effect and Long Rumination Time
After several repetitions, a rumination can significantly improve performance
Long rumination time
• 1 h for 1000 inputs
Possible explanation
• The rumination time is spent while the data is still in a high-entropy state
Training Tools
An automated Chrome extension
JavaScript + Chrome
• Input data
• Modify response
• Like modified response
• Open new session for next topic
Reflections
• Utilize “Ruminate” operations: faster training and better results
• Avoid a “big data” strategy: the history gets crowded and is hard to ruminate
Thanks!
My lightblue: Course Project Presentation
derek.ma
MA Mingyu Derek