dictionary-based data generation for fine-tuning bert for
TRANSCRIPT
University of Wisconsin Milwaukee University of Wisconsin Milwaukee
UWM Digital Commons UWM Digital Commons
Theses and Dissertations
August 2020
Dictionary-based Data Generation for Fine-Tuning Bert for Dictionary-based Data Generation for Fine-Tuning Bert for
Adverbial Paraphrasing Tasks Adverbial Paraphrasing Tasks
Mark Anthony Carthon University of Wisconsin-Milwaukee
Follow this and additional works at: https://dc.uwm.edu/etd
Part of the Applied Mathematics Commons, Computer Sciences Commons, and the Mathematics
Commons
Recommended Citation Recommended Citation Carthon, Mark Anthony, "Dictionary-based Data Generation for Fine-Tuning Bert for Adverbial Paraphrasing Tasks" (2020). Theses and Dissertations. 2476. https://dc.uwm.edu/etd/2476
This Thesis is brought to you for free and open access by UWM Digital Commons. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of UWM Digital Commons. For more information, please contact [email protected].
DICTIONARY-BASED DATA GENERATION FOR FINE-TUNING BERT FOR
ADVERBIAL PARAPHRASING TASKS
by
Mark Carthon III
A Thesis Submitted in
Partial Fulfillment of the
Requirements for the Degree of
Master of Science
in Mathematics
at
The University of Wisconsin-Milwaukee
August 2020
ii
ABSTRACT
DICTIONARY-BASED DATA GENERATION FOR FINE-TUNING BERT FOR
ADVERBIAL PARAPHRASING TASKS
by
Mark Carthon III
The University of Wisconsin-Milwaukee, 2020
Under the supervision of Professor Istvan Lauko
Recent advances in natural language processing technology have led to the emergence of
large and deep pre-trained neural networks. The use and focus of these networks are on transfer
learning. More specifically, retraining or fine-tuning such pre-trained networks to achieve state
of the art performance in a variety of challenging natural language processing/understanding
(NLP/NLU) tasks. In this thesis, we focus on identifying paraphrases at the sentence level using
the network Bidirectional Encoder Representations from Transformers (BERT). It is well
understood that in deep learning the volume and quality of training data is a determining factor
of performance. The objective of this thesis is to develop a methodology for algorithmic
generation of high-quality training data for paraphrasing task, an important NLU task, as well as
the evaluation of the resulting training data on fine-tuning BERT to identify paraphrases. Here
we will focus on elementary adverbial paraphrases, but the methodology extends to the general
case. In this work, training data for adverbial paraphrasing was generated utilizing an Oxford
iii
synonym dictionary, and we used the generated data to re-train BERT for the paraphrasing task
with strong results, achieving a validation accuracy of 96.875%.
iv
© Copyright by Mark Carthon, 2020
All Rights Reserved
v
To
Yahweh,
Sam,
my loved ones,
and all those who supported me
vi
TABLE OF CONTENTS
List of Figures………………………………………………………………………………......vii
List of Abbreviations..................................................................................................................viii
1 History of NLP………………………………………………………………………………....1
2 Applications of NLP…………………………………………………………………………...9
3 Motivation for Data Generation……………...................................…...................................11
4 Regular Expressions………………………………………………………………………….16
5 Structure of Transformer Networks………………………………………………………...22
6 Data Generation…………………………………………………………………...................30
7 Implementation of Training……………………………………………………………....…40
References.....................................................................................................................................42
Appendix.......................................................................................................................................43
vii
LIST OF FIGURES
Figure 1. The Transformer – model architecture.....................................................................22
Figure 2. BERT architecture......................................................................................................25
Figure 3. BERT performing spam classification......................................................................25
Figure 4: BERT input representation........................................................................................28
viii
LIST OF ABBREVIATIONS
AI Artificial Intelligence
UWM University of Wisconsin - Milwaukee
MT Machine Translation
NLP Natural Language Processing
NLU Natural Language Understanding
BERT Bidirectional Encoder Representations from Transformers
IBM International Business Machines
ALPAC Automatic Language Processing Advisory Committee
ETAOINSHRDLU Frequency order of letters in the English Language
ATN Augmented Transition Network
LSTM Long short-term memory
SYSTRAN System Analysis Translator
RNN Recurrent Neural Network
FFNN Feed Forward Neural Network
MARGIE Memory Analysis Response Generation in English
APE(X)C All Purpose Electric (X) Computer
OTAZ Oxford Thesaurus A to Z
OT Oxford Thesaurus
RE Regular Expressions
GPU Graphics Processing Unit
ix
TPU Tensor Processing Unit
CoLab Collaboratory
NER Named Entity Recognition
NEL Named Entity Labeling
1
History of Natural Language
Processing (NLP)
The field of NLP comes from a field called Machine Translation (MT). Machine Translation is
the study of how to algorithmically translate one language to another. One of the first people to
work in Machine Translation was a Persian intellectual named Al-Kindi (circa 801 AD - 873
AD). Al-Kindi was born in a city named Kufa but was educated in Bagdad. He was a prominent
figure in the Grand Library of Bagdad. This was a public academy and the intellectual center of
Bagdad during the Islamic Golden Age. Several Abbasid caliphs appointed Al-Kindi to oversee
the translation of Greek scientific and philosophical texts into the Arabic language. Al-
Kindi developed various methods to do this. These methods relied on the frequency of the
characters that were used in the texts. He developed techniques for
systematic language translation including, but not limited to: cryptanalysis, frequency analysis,
probability theory, and statistics, all of which are used in modern machine translation.
In 1623, Rene Descartes (and others) proposed an artificial universal language; in this language,
equivalent ideas in different languages would share the same symbol. Both Leibniz and
Descartes put forward proposals for codes which would relate words between languages. Of
course, all these proposals were purely theoretical, and none of them resulted in the development
of an actual algorithm for language translation. During the mid-1930s when the first patents for
”translating machines” were applied for, one proposal by Georges Antsrouni was simply an
2
automatic bilingual dictionary using paper tape. Paper tape was a form of data storage that
consisted of a long strip of paper in which holes were punched. Paper tape was used mostly
during the 19th and 20th centuries. They were effective with teleprinter communication as input
for computers of the 1950s and 1960s, and later as a storage medium for minicomputers and
CNC machine tools. Another proposal was made by Peter Troyanski, a Russian who included a
bilingual dictionary, and a method for dealing with grammatical roles between languages based
on Esperanto.
Esperanto was the most widely spoken constructed international auxiliary language. It was
created by a Polish ophthalmologist called Ludwik Lejzer Zamenhof in 1887. The purpose of
this universal language Esperanto was to create a flexible language that would serve as a
universal second language to foster world peace, international understanding, and to build a
”community of speakers”. Peter Troyanski’s proposal was a system that was separated into three
stages. The first stage consisted of a native-speaking editor in the source language to organize the
words into their logical forms and to exercise the syntactic functions. A logical form of a
syntactic expression is a precisely specified semantic version of that expression in some formal
system. Think of this first stage as tokenization for BERT. The second stage
of Troyanski’s proposal was for the machine (such a machine did not exist when this proposal
was initially made) to translate these logical forms into the target language. Stage three required
a native-speaking editor to normalize this output. This system did not become relevant on a big
scale until the 1950s when computers became widely utilized.
3
In 1946, Andrew Donald Booth, a British electrical engineer, physicist, and computer scientist,
proposed the idea of using digital computers for translation of natural languages. Warren Weaver
was a researcher at the Rockefeller Foundation, and he presented one of the first set of computer-
based machine translation proposals in 1949; it was called ”Translation Memorandum”. Warren
Weaver proposals came about thanks to advancements in information theory, successes in
cryptography during World War II, and theories about invariants at the foundation of natural
language. In 1950, a breakthrough was made by Alan Turing when he published his
famous paper ”Computing Machinery and Intelligence” which proposed the ”Turing Test” as a
criterion of intelligence. This criterion depends on the ability of a computer program to
impersonate a human in a real-time written conversation with a human judge, sufficiently well
that the judge is unable to reliably distinguish between the computer program and a real
human on the basis of conversational content alone.
In 1951, Yehosha Bar-Hillel began research in MT and later claimed that without a ”universal
encyclopedia” a machine would never be able to deal with the problem of one word having
multiple definitions because during this time, semantic ambiguity could only be solved by
writing source texts for machine translation in a controlled language that uses a vocabulary in
which each word has exactly one meaning. Also, in 1954, Georgetown University created an MT
research team. On January 7, 1954 Georgetown and IBM teamed up to give a public
demonstration of its Georgetown-IBM experiment system which involved fully automatic
translation of more than 60 Russian sentences into English. This demonstration was widely
reported in the newspapers and gained public interest. However, this feat was limited because it
had only 250 words and translated 49 carefully selected Russian sentences into English. Most of
4
the words were related to the field of Chemistry. Despite its shortcomings, this demonstration
encouraged others to put more resources into furthering MT on a worldwide scale.
Andrew Donald Booth designed the All Purpose Electronic (X) Computer, APE(X)C, at
Birkbeck College (now the University of London). In 1954 a demonstration was made on the
APE(X)C of a rudimentary translation of English into French. In 1955, Japan and Russia began
to develop their own MT research programs and in 1956 the first ever MT conference was held
in London. In 1957, Noam Chomsky, an American linguist developed the idea of syntactic
structures which was a work that sought to prove that semantics and syntax are two separate
concepts. He did this by forming the sentence ”Colorless green ideas sleep furiously”, which
shows that semantics and syntax really are different.
In the 1960s, some NLP processing systems were developed, such as: SHRDLU which was a
natural language system that worked in restricted ”block worlds” with restricted vocabularies.
This block world was a breakthrough in Artificial Intelligence (AI) which made it easier for
systems to model and reason about abstract symbols. Also, ELIZA was another NLP processing
system which was a simulation of a psychotherapist. This was written by
Joseph Weizenbaum who was a German American computer scientist from MIT. ELIZA was
able to provide human-like interaction whenever the patient was saying things
within ELIZA’s small database. This was quite remarkable considering that ELIZA was given no
information about human thought or emotion.
5
During the 1960s, The United States and the Soviet Union did research in Russian - English
translation of scientific journals and technical articles. This endeavor was successful in that the
rough translations helped to give a basic understanding of the material. In 1962, the Association
for Machine Translation and Computational Linguistics was formed in the U.S.A. and many
researchers joined it. Research in MT continued until the US Government commissioned the
Automatic Language Processing Advisory Committee (ALPAC) report. This committee was
made up of seven scientists who decided that there was a lack of progress in the last ten years of
MT research despite significant funds being allocated toward this research area.
This ALPAC report concluded that MT was more expensive, less accurate, and slower than a
human translator. This report also concluded that MT was unlikely to make very much progress
towards human-level translation. Consequently, MT funding was greatly reduced.
There was a beacon of hope though because this ALPAC report recommended that tools (i.e.
automatic dictionaries) be developed to aid human translators and that some research in
computational linguistics should continue to receive funding and support. Because of this report,
MT research was practically abandoned for over a decade. This had a ripple effect as the Soviet
Union and the United Kingdom decided to cut back on their MT research efforts as well.
However, Canada, France, and Germany continued to research in MT. In 1969, Roger Schank,
an American A.I. theorist, introduced the conceptual dependency theory for NLU; this model
was partially inspired by the work of Sydney Lamb, an American linguist. The biggest efforts in
MT in the US, during the period after the ALPAC, was given by Systran and Logos; these
companies did MT work for the US Department of Defense.
6
In 1970, the Systran system was installed for the Us Air Force, it was used by the Commission of
the European Communities in 1976, and Xerox used Systran to translate technical manuals in
1978. Also, in 1970, William A. Woods developed the augmented transition network (ATN)
to represent natural language input. These ATNs were graph-theoretic structures that are used in
formal languages. Instead of phrase structure rules, ATNs used an equivalent set of recursively
called finite state automata. Eventually, these ATNs were generalized into so-called
’generalized ATNs”.
In 1970, The French Textile Institute used MT to translate abstracts between French, English,
German, and Spanish. In 1971, Brigham Young University (BYU) started a project to translate
Mormon texts automatically. During the 1970s, it was common for computer programmers to
write ’conceptual ontologies’, which structured real-world information into data that was of such
a form that a computer would be able to understand what it was. Some examples of this is:
MARGIE (Schank, 1975), SAM (Cullingford, 1978), and many more. In the 1960s, MT research
concentrated on limited language pairs and input, but in the 1970s, the concentration was on low-
cost systems that could translate technical and commercial documents. Up to the 1908s, most
NLP systems were based on complex sets of handwritten rules.
By the 1980s, both the diversity and the number of installed systems for machine translation had
increased. Several systems that relied on mainframe technology were in use (i.e. Systran and
Logos). Systran’s first implementation system was implemented in 1988 by the French Postal
Service’s online service called Mintel. There was a market for lower-end machine translation
systems because of the much-improved availability of microcomputers. Thus, microcomputers
7
became more common in Europe, Japan, USA, China, Korea, and the Soviet Union. Japan had
a fifth-generation computer and tried to create software for translating between English and
many other Asian languages.
During the 1980s, translation relied on intermediary linguistic representation such as: syntactic,
morphological, and semantic analysis. In the late 1980s, computational power increased in line
with Moore’s Law and the fact that researchers were relying on Chomskyan theories less
because its theory discouraged the kind of corpus linguistics that is at the foundation of the
machine learning approach to language processing. As a result, computational power also
became less expensive. Also, more research and attention was allocated towards statistical
models for machine translation. Some of the earliest used machine learning algorithms (i.e.
decision trees) produced systems with difficult if-then rules to resolve and maintain which
were similar to existing hand-written rules, but the statistical models made soft, probabilistic
decisions based on attaching real-valued weights to the features of the input data.
In the 1990s, MT began to move away from large mainframe computers and towards personal
computers and workstations. Recent research has focused more on unsupervised and semi-
supervised learning algorithms. These algorithms learn from data that has not been annotated or
a combination of non-annotated and annotated data with the desired answers given and with the
creation of the World Wide Web, lots of non-annotated data is available. In the 2010s,
representation learning and deep neural network machine learning methods became widespread
in NLP. These methods have been shown to produce state-of-the-art results in many NLP tasks
such as: language modeling, parsing, etc. Many state-of-the-art techniques use word embeddings
8
to capture semantic properties of words and an increase in end-to-end learning of higher-
level tasks.
9
Applications of NLP
Natural Language Processing (NLP) is a sub-topic of Artificial Intelligence and its purpose is to
train computers to understand, interpret, read, hear, and communicate using human (natural)
languages. Neural networks have long been associated with NLP research, but their use has
become increasingly practical, widespread, and dominant in applications in both image and
language processing from the early 2000s. In 1997, a variety of neural network architectures
such as Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) models were
introduced and found great success with NLP tasks, voice and text recognition, and text and
speech generation. RNN are iterative models used in sequence to sequence translation,
use previous outputs as inputs for future iterations of the algorithm. In 2001, Yoshio Bengio and
his team proposed the first neural language model using a feed-forward neural network (FFNN).
The big difference between RNNs and FFNNs are that RNNs use previous outputs in an iterative
fashion as future input, but the simpler architecture FFNNs do not, the data is feed forward, an
output is produced, then the new input is fed forward and so on.
In 2011, Apple released Siri which became known as the world’s first successful NLP assistant
to be used by the public. The Automated Speech Recognition module in Siri can translate the
user’s words into digitally interpreted tokens/concepts. Siri has predefined commands within its
software and uses the tokens it receives from the user’s speech and matches it up with one or
more of its predefined commands. Then it carries out the said commands. Machine Learning
techniques are necessary because NLP engines need to recognize a command even if the
10
command is not expressed using the exact sequence of tokens that the engine has been
programmed with; it has to allow for some freedom of expression while still capturing the main
idea of the user’s words and matching it to the predefined set of commands.
In November 2014, Amazon announced Alexa alongside Echo. The inspiration for Alexa stems
came from the computer voice and conversational system on the Starship Enterprise in Sci-fi
works such as Star Trek. One of the reasons the Amazon developers chose the name Alexa is
because x is a hard consonant; hard consonants are recognized with higher precision than other
types of letters. The Amazon developers also liked the name Alexa because it was reminiscent of
the Library of Alexandria, which is where the field of MT began (the precursor to NLP).
Google Assistant was unveiled during Google’s developer conference on May 18, 2016. It was
part of the unveiling of the Google Home smart speaker and a new messaging app called Allo.
Google’s CEO claimed that Google Assistant was designed to be a conversational experience, a
two-way experience, and ”an ambient experience that extends across devices”.
11
Motivation for Data Generation
The goal of this thesis is to investigate the facilitation of the training of large neural networks to
learn and understand (i.e.: develop the capacity to identify and differentiate) grammatical
structure and extended vocabulary, including a wide range of idiomatic language use in the
English language. More specifically, we would aim to re-train a Bidirectional Encoder
Representations from Transformers (BERT) neural network to enhance its capacity to understand
the language well enough to be able to identify paraphrases for a large variety of sentences. The
task of computational paraphrase identification, a high-level artificial intelligence task in the
language understanding domain (NLU), is to evaluate whether the two sentences in a sentence
pair are semantically equivalent.
BERT is constructed as a large and deep neural network with novel network structures (attention
mechanism) to be able to encode longer range morphological and semantic relations within a
text. It has several variants in terms of network size as well as training language data used, but in
all variants it has over 100 million “learnable “ network parameters to be adjusted in an
optimization process. BERT has been trained in a supervised training fashion on two tasks for
which generations of annotated data is easy to produce: Masked Language Modeling and Next
Sentence Prediction, and it has been demonstrated that the network has a strong capacity to be
trained of fine-tuned to produce state of the art performance in a wide range of NLP tasks
(transfer learning).
12
To either train from scratch or to retrain/fine tune for close to human performance on a complex
task such large pre-trained networks in a supervised training regime, extensive amount of labeled
training data is needed. For strong performance of the targeted trained network in a general NLP
setting the training data deployed need to represent a rich variety of the targeted language use.
Training such large systems is expected to require millions of annotated training samples that
show a representative distribution of the targeted language use. Such richness in the training data
would be key to improve the performance of state-of-the-art systems in automated language
processing and language understanding, but currently for most tasks it is not available in the
necessary volume and quality.
Development of the needed training data sets with the current methods (i.e.: based on hand-
labeling by humans) are prohibitively expensive: it would require hundreds of thousands of
expert human workhours or possibly substantially more. Thus, current industry standard data sets
for training and evaluation for several NLP tasks are relatively small. For instance the General
Language Understanding Evaluation (GLUE) benchmark’s MRPC component to evaluate a
network’s capacity to train/perform for the task of paraphrasing is relatively small (on the order
of 5000 sentences) and it is also restricted to a limited a (journalistic) language domain. One of
the objectives in this thesis is to explore a method to generate a large annotated training dataset
for fine tuning a trained network in a supervised learning setting for the task of paraphrase
identification, i.e. to decide if two sentences are semantically identical, or correspond to each
other as paraphrases.
13
We base our paraphrase sample generation method on content extracted from an extensive
synonym dictionary: The Oxford Thesaurus, An A-Z Dictionary of Synonyms, Oxford
University Press (1994) (OTAZ). This dictionary puts a focus on compiling complete synonymic
word and expression sets with an emphasis on representing rarer word and expressions uses,
associated with semantic content of synonym sets. That information is very hard for networks to
learn from corpus (even from billion-word corpora) due to very low frequency use even in very
large text collections (corpus). However, some common synonyms may have been omitted from
the collection to limit the volume of the original publication. We expect that the most common
synonymic uses are well-represented in the training data that was originally used for training
BERT; thus, they are already represented in the models we aim to fine-tune.
Another attractive option as a source for the adverbial, adjective, noun and verbal synonym sets
could be WordNet database, however WordNet does not contain rarer part of speech sets (such
as Interjections, Conjunctions,...) of words and expressions and is more limited in the number of
example sentences listed to illustrate semantic function. Additionally, to its polysemous
synonymic organization of its headwords, OTAZ provides additional stylistic and usage
information on the synonym sets and importantly a rich set of example sentences (1-4 example
sentences per synonym sets) using the headword/expression of each synonym set entry. The idea
of sentence pair generation using lexicographical resources for enhanced training of neural
language networks for classification of sentence pairs relies on substitution of synonymic words
and expressions into example sentences from thesauruses, synonym dictionaries and similar
lexicographical resources.
14
In OTAZ fundamentally such a substitution task poses challenges of varying complexities
depending on the part of speech categories the synonym sets belong to. The difficulties are
ranging from the simple, (e.g.: for the case of adverbial synonym sets with very little to
no morphological variation induced with substitution), to the sophisticated in the case of
synonymic verbal expressions that are required to be adjusted in number, tense and attracted
noun cases during a substitution step to provide correct sentence pairs. In this work, for
simplicity, we limit our investigation for generation and training use of adverbial synonym sets
only. There was not a substantial amount of data online for download, so we had to create our
own training data. We used text data from the Oxford synonym dictionary cited above.
In the format of this dictionary assigned to a given word or expression heading was its part-of-
speech, a list of synonyms or synonymic expressions, and one or more (up to four) example
sentences. Using regular expression as our editing tool, the data for adverbial synonyms was
converted to a text file with each example sentence on a separate line paired with its head
expression’s synonymic set -. Then we used regular expressions to organize the text, and to
generate alternate matching sentence paraphrases by substitution of the marked expression
heading with its synonymic alternatives. This provided on average about 10 paraphrase
alternatives for each example sentence. As network training would require a balanced set of
negative as well as positive examples, we also produced non-matching paraphrases from each
example sentence, by associating it with a synonym set that were not matching the marked
(head) expression in the example and performing the substitutions indicated above.
15
The substitution steps were performed with an iterative application of a complex regular
expressions; the matching and replacement steps are as outlined below. These test processing
methods are computationally highly efficient. After substitution we formed all possible sentence
pairing in both paraphrase and non-paraphrase sentence matchings and attached an appropriate
label to each resulting sentence pair (1 for paraphrase and 0 for non-paraphrase pairs).
This procedure generated approximately 45000 paraphrase pairs and with a similar amount of
non-paraphrase pairs we obtained a set of close to 90000 tagged training sentence pairs. These
pairs were generated from the approximately 1000 adverbial synonym set and example sentence
combinations that are found in OTAZ. We emphasize that these samples are produced varying
adverbial expressions only. This paraphrase dataset is rich enough to finetune large pretrained
neural networks (e.g. BERT) for the recognition of adverbial paraphrasing.
In the generation process, we used a set of special characters to delineate the structure of the text
database that we formulated from OTAZ. After completing the substitution, pairing steps, and
eliminating the auxiliary characters from the results, we obtained our training data. This dataset
is structured as a pair of sentences with a binary label to indicate whether the sentences are
paraphrases of each other; this dataset was in a tab-separated format. This text document was
loaded with the base BERT network model to perform fine-tuning.
16
Regular Expressions
Regular expressions are powerful text pattern recognition codes/software tools that can be very
efficiently implemented by programs (so-called Regex engines) based on the theory of finite
automata. A regular expression is an algebraic formula whose value is a pattern identifying of a
set of strings that are matched by the expression, called the language of the expression. Regular
expression engines are a built-in part of most programming languages as well as more advanced
text editors and are essential components of software that performs text mining and high volumes
of sophisticated text processing.
A regular expression defines a character sequence search pattern. Usually such patterns are used
by string searching algorithms for ”find” or ”find and replace” operations on strings, or for input
validation. It originated in the 1950s and a mathematician from the U.S., Stephen Cole Keene, is
given credit for creating it. Here are some basic components that are use in regular expressions:
Normal text or character sequences:
abc matches the string abc; Note: regex (language) special characters ()[]*^${}.?\- need to
be escaped, i.e. preceded by the character “\”, in a regular expression to be matched within a
normal text
Anchors : ^ and $
17
^ matches the beginning of a line (a zero-length string at the beginning of a line)
$ matches the end of a line (a zero-length string at the end of a line)
Repetition Quantifiers : * + ? and {}
* zero or more times: abc* matches a string that has ab followed by zero or more c
+ one or more times: abc+ matches a string that has ab followed by one or more c
? zero or one times: abc? matches a string that has ab followed by zero or one c
{n} n times: abc{2} matches a string that has ab followed by 2
{n} n or more times: cabc{2,} matches a string that has ab followed by 2 or more c
{n,m} n to m times: abc{2,5} matches a string that has ab followed by 2 up to 5 c
(text) grouping/capturing: a(bc)* matches a string that has a followed by zero or more
copies of the sequence bc and captures the sequence bc
OR operator: | or []
a(b|c) alternate expressions: matches a string that has a followed by b or c (and captures b or c)
a[bc] alternate characters: same as previous, but without capturing b or c
a[^bc] character exclusion: matches a string that has a followed one character that is not b or c
Character classes: \d \w \s and .
\d matches a single character that is a digit; same as [0-9]
18
\w matches a word character (alphanumeric character plus underscore); same as [a-z0-9_]
\s matches a whitespace character (includes tabs and line breaks); same as [\t\r\n ]
. matches any (one) character
For example, if you have a text document and you want to search for a word character followed
by a digit one or more times, you will use the regexp: \w\d+. We used regular expressions to
”clean up” the text documents that we wanted to use for training. We deleted all unnecessary and
undesirable characters in the text, we rearranged words in the text, and we created
multiple examples sentences from the given example sentence using regular expressions. Regexp
made this task easier because it allowed us to make complex edits on very large text documents
without having to manually go through single lines. We were able to check that everything
worked in the desired way by changing a few lines at a time, first, then once we were confident
that we tested and verified the correctness a specific edit, we would change the entire document.
While editing large text files with regular expressions it is imperative that one is very familiar
with the document and to follow best practices. Our original data to be edited is well-structured
text database organized on a line per record basis and special characters are used to delimit fields
within each line, thus it is well-suited for editing using Regex.
A snippet of the reformulated lexicographical database obtained from OT is represented below.
The database stored in a text file is line based with Unicode character sets. Here D, @, &, %, #
and ∼ represent specially selected, special separator Unicode characters not appearing in the
text-based data of our database which are used as special markers. Each paragraph in the
19
representation below, headed by a lead expression and ending with one or more example
sentences, shows an adverbial synonym set and its use. To a single lead expression there may be
associated several alternative use sense or synonym set. Each paragraph below is in fact a (long)
line in the database ending in a newline character. This representation allows us to use regular
expressions for complex data generation very efficiently.
Below are several lines of the reformulated data from the OTAZ with special characters used as
markers:
Đabove+♥adv.♦1*╕above, overhead, on_high, aloft, in_the_sky,
in_the_heavens, <Tab>: Far aboveΣ, the clouds scudded swiftly
by.♪
Đabroad+♥adv.♦1*╕abroad, overseas, in_foreign_lands,
in_foreign_parts, <Tab>: We were abroadΣ on assignment for a few
years.♪
Đabsolutely+♥adv.♦1*╕absolutely, unqualifiedly, unconditionally,
unreservedly, unexceptionally, unequivocally, unquestionably,
positively, definitely, really, genuinely, decidedly, surely,
truly, certainly, categorically, <Tab>: She is absolutelyΣ the
best dancer I have ever seen.♪
20
Đabsolutely+♥adv.♦1*╕absolutely, unqualifiedly, unconditionally,
unreservedly, unexceptionally, unequivocally, unquestionably,
positively, definitely, really, genuinely, decidedly, surely,
truly, certainly, categorically, <Tab>∷ I absolutelyΣ refuse to
go.♪
Đabsolutely+♥adv.♦2*╕absolutely, totally, utterly, completely,
entirely, fully, quite, altogether, wholly, <Tab>: It is
absolutelyΣ necessary that you undergo surgery.♪
Đaccordingly+♥adv.♦1*╕accordingly, hence, therefore,
consequently, thus, in_consequence_of, in_consequence_whereof,
so, and_so, <Tab>: Smoking was forbidden; accordinglyΣ, we put
out our cigars.♪
Đaccordingly+♥adv.♦2*╕accordingly, suitably, in_conformity,
in_compliance,÷ conformably, appropriately, compliantly, <Tab>:
Dinner-jackets were required, and the men dressed accordinglyΣ.♪
Đactually+♥adv.♦1*╕actually, really, in_reality, in_fact,
in_actuality, in_point_of_fact, in_truth, absolutely,
as_a_matter_of_fact, indeed, truly, literally, <Tab>: The
interest rates actuallyΣ charged by banks may vary from those
quoted publicly.♪
21
In later discussion, we will refer to the replacement procedure based on the following entry as an
example; the delineating characters are shown in blue:
Đabout+♥adv.♦4* ╕about, here_and_there, far_and_wide, hither_and_
yon, hither_and_thither, helter-skelter, <Tab>: My papers were
scattered aboutΣ as if a tornado had struck.♪
22
Structure of Transformer
Networks
23
The BERT network is a type of Transformer network. A Transformer network is a large neural
network that has both encoders and decoders (BERT only has encoders) and was introduced in
2017 by Vaswani et al. in the well-known paper titled ”Attention is All You Need”.
Transformer networks are designed to efficiently identify association patterns between distant
terms in sequences and are designed to be trained for sentence translation and other NLP tasks.
The input could be a French sentence, for example, and the Transformer based on a large set of
training examples that is usually referred as a corpus, would be trained to translate the French
sentence into English, and then output the English translation. The architecture of the
Transformers is composed of a stack of encoders and a stack of decoder network components.
The encoders basically take the French sentence and transform it into the language of
the network , and then the last encoder feeds this information into the stack of decoders. The
decoders take the information from the last encoder and they turn the information from the
language internal to the network into the desired language output for the user (which is the
corresponding English language sentence in this example). The underlying network architecture
in a distributed fashion (i.e. assigned to each artificial neuron) contains a very large set of
parameters (numbering in the millions for large and deep state-of-the-art networks) that are
identified in a large-scale optimization exercise, called training, based on available training
examples. This demanding computation exercise requires highly parallelized hardware
architectures such as GPUs or TPUs.
Encoders in a Transformer can turn the input into language the model understands via a
technique called word embeddings. These word embeddings are the process that the network
takes to take a given word and turn it into a vector of a certain size. When using the Transformer
24
network, the word embeddings all have the same size. These vectors contain the word and some
of its meaning. For example, the word ’King’ and the word ’Queen’ each have their own vector
from the word embeddings. If you take the vector for King and subtract the vector for Queen, the
difference will be a vector that is ”small” in magnitude. These word embeddings are possible
after a network has been trained. So, the encoders use word embeddings to gather the necessary
information from the input.
Each of the encoders has a structure to it. In each encoder, there is a Self-Attention mechanism
and a feed-forward network. The self-attention mechanism was invented in the paper
”Attention is All You Need”. The goal of 11 attention is to help the network to understand the
connection that each of the words have with each other. For example, in the sentence: The dog
took a bone a buried it in the ground. What does the word ’it’ refer to in this sentence? Someone
who is fluent in English would know that the word ’it’ would refer to the bone. The goal of
attention is to help the network to understand that the word it refers to the
bone. Therefore, attention is so important; for the network to properly understand the sentences,
it must understand how each word in a sentence relates to all the other words.
How does the self-attention work? Self-attention requires a few things: word embeddings,
queries, keys, and values. For each word embedding, a query, key, and value vector is created
for it. Once a network has been trained, the network will have a weight matrix 𝑊𝑄 for the
queries, a weight matrix for the keys 𝑊𝐾, and a weight matrix 𝑊𝑉 for the values. Each word
embedding vector is multiplied by each of the weight matrices. This would produce a query
vector 𝑞𝑖, a key vector 𝑘𝑖, and a value vector 𝑣𝑖 corresponding to the word embedding 𝑥𝑖. Next
25
you take 𝑞𝑖 ∏ 𝑘𝑖𝑖 . Then you do 𝑞1𝑘1
+√𝑑𝑖𝑚(𝑘) and take the “SoftMax” of it. This SoftMax normalizes
the constant so that it is between 0 and 1 (a probability). Then you take each value vector times
the SoftMax score, then take the sum of all of them. This will give you a weighted sum of all the
value vectors. This calculation is very similar for matrices. Here it is:
𝑋 × 𝑊𝑄 = 𝑄
𝑋 × 𝑊𝐾 = 𝐾
𝑋 × 𝑊𝑉 = 𝑉
𝑠𝑜𝑓𝑡𝑚𝑎𝑥 (𝑄 × 𝐾𝑇
+√𝑑𝑘
) × 𝑉 = 𝑍
In this calculation, 𝑋 is the input, 𝑄 is the query, 𝐾 is the key, 𝑉 is the value, 𝑍 is the
output/head.
There is also the notion of multi-headed attention. This notion helped the network’s attention
layer to improve its performance. With multi-headed attention, the network produces multiple
attention heads or multiple 𝑍 matrices. In this process we have multiple 𝑄/𝑉/𝐾 matrices. In the
Transformer model, we have 8 attention heads. This will give us
𝑍0, 𝑍1, … , 𝑍7.
Since the model only expects one attention head, we concatenate all the Z’s together to get Z
then multiply Z by another weight matrix W which was trained by the model to give us a Z of the
desired dimensions. Why would multi-head attention be useful? Consider the sentence,
”Dave bought a phone, but didn’t know how to operate it.” What is the word ”it” referring to? Of
course, we humans are fluent enough in language to know that the word ”it” refers to the phone,
but the model is not as fluent in language as we are. The multi-head attention aids the network to
26
learn that the word ”it” refers to the phone. As we all know, the order of words in a sentence is
important. The network has a way of understanding the position of each token in a sentence by
using a ”positional encoding” that it learns through training. There are some residuals in the
encoders. What happens is that the input X is fed into the encoder, self-attention is performed,
the result of the self-attention Z is taken then the network computes X + Z, normalizes it, then
feeds it into the feed-forward part of the encoder. The result of the feed-forward part of the
encoder 𝑍′ is added to Z and normalized. Then this result 𝑍′′ is fed into the next encoder and the
process repeats. Once our original input has made its way through all the encoders, it proceeds to
the decoder.
The structure of the decoder is as follows: The output of the encoder is fed into the decoder.
The decoder takes this through self-attention, adds and normalizes, then does an encoder-decoder
attention, then adds and normalizes, then goes through the feed-forward, then adds and
normalizes, then repeats this process as it gets pushed into the next decoder. Once the original
output of the encoder layers gets pushed through all the decoders, it goes through a linear layer,
then gets its SoftMax score. This linear layer is what takes the vector that the decoders spit out
and turns it into one of the tokens in its dictionary. It does this by projecting the output vector
from the decoders into a logits vector whose length is the number of words in the given language
that it knows. The SoftMax is used to help the network decide which of the words is the most
probable.
BERT, a transformer neural network is the product of several NLP advancements, such as, Semi-
Supervised Sequence Learning, ELMo, ULMFiT, the OpenAI transformer, and the Transformer.
It is a transformer that only has encoders and no decoders.
27
Figure 2: BERT architecture
BERT has been pre-trained on massive datasets, such as, all of Wikipedia and many books.
BERT is mostly used for sentence classification and sentiment classification. As an example, we
can see BERT used for spam classification outlined in Figure 3.
Figure 3: BERT performing spam classification
BERT takes a sentence as input, performs a word-embedding technique on all of the tokens,
feeds these tokens through its encoders (see Figure 2), then sends it through a classifier to output
a classification based on the probability of the input being spam or not spam.
28
In addition to spam classification, BERT can also do fact-checking at a high level. The
tokenization process for BERT means that it takes words or parts of words that it recognizes
from its training and turns them into word-embeddings. A word embedding is a vectorized text
encoding, i.e. when a token is turned into a relatively high-dimensional vector representation.
Usually, a word embedding is a vector of size 300 for a given token; a token is usually a word,
but a token could also be one part of a word, or a hyphenated set of words, etc. These word
embedding vectors have been developed internally by BERT itself for the purpose of identifying
words (or pairs of words) that can be assembled to represent words from its dictionary. This
word-embedding process allows it to perform vector/matrix calculations for the test data
mentioned previously in this chapter. BERT’s input, along with word/token embedding also
represents the token’s position within the text as well as representations of sentences; this
concept is relevant for multi-sentence tasks. Our paraphrase classification problem relies on this
multi-sentence tokenization.
Figure 4: BERT input representation
29
The input is compressed into a sequence of vectors. These vectors enter the network at its first
parametrized layer of encoders. The encoders process these vectors in a series of highly
parallelizable numerical linear algebra steps within the layer. Then this result is fed into a second
layer of an identically structured parametrized layer. This process is repeated several times (12
times in the base BERT network). This leads to progressive information processing by the
network. The output of the last layer is typically processed by a FFNN component that is trained
to assign probabilities to the token positions. These probabilities measure the likeliness that a
given token in the input belongs to a defined ranged of output classes. This process is well-suited
for sequence translation, Named Entity labeling, and Named Entity Recognition (NER).
In classification tasks, the last encoder produces a special token called the [CLS] token or
classification token. This token is fed into a simple FFNN component that is trained to assign
probabilities to the input tokens. These probabilities measure the likeliness of a given input token
belonging to a defined range of output classes.
The training of such a network is a (large) optimization exercise. The aim of this exercise is to
identify a set of network parameters that maximize the network’s performance. The network
wants to maximize its performance in classifying the training data while ensuring that it has the
capacity to generalize to examples it has not seen before. In other words, the network wants to
maximize its classification performance on the data it has seen without overfitting. If this is
achieved, then BERT has been shown to perform very well in classification tasks.
30
Data Generation
The goal of this project was to train a BERT network to be able to recognize when a pair of
sentences that contained alternatives of adverbial words or expressions were paraphrases of each
other. The first task was to create data for a BERT Transformer, as outlined above. This data
needed to be example sentence pairs such that a sentence pair would be labelled with a 1 if the
sentences presented were a paraphrase of one another. If one has a sentence pair such that neither
sentence is a paraphrase of the other, this pair is labelled with 0. In other words, the goal is to
create a large number of sentence pairings of substantial linguistic variety as data to train a
neural network to learn to recognize paraphrases.
Such data is not readily available in the required quality and quantity and is very expensive to be
produces by humans. The Oxford synonym dictionary (OT) references earlier contains heading
words or expressions, synonyms to those headings, and example sentences using those words.
An example of a line given in the dictionary is as follows:
about adv.: about, here and there, far and wide, hither and yon, hither and thither, helter-skelter:
My papers were scattered about as if a tornado had struck.
The presentation of every word in this synonym dictionary follows this basic structure. This
looked like a promising source for data. As such, this entire dictionary was scanned and entered
31
in a text editor for cleaning. Various special characters were inserted to make the cleaning and
generation process easier. For instance, the above example was changed to:
Đabout+♥adv.♦4* ╕about, here_and_there, far_and_wide, hither_and_yon, hither_and_thither, h
elter-skelter, : My papers were scattered aboutΣ as if a tornado had struck.♪
As you can see, the Đ marks the beginning of a new word, the + marks the end of the word,
the ♥ marks the beginning of the part of speech, the ♦ marks the end of the part of speech, the
number four denotes how many times this has been used (because many words have multiple
meanings and each word that is in the dictionary has a separate synonym sequence for each
different meaning it has), the * marks the end of the number, the ╕marks the beginning of the
first synonym. Multiword expressions have underscores to separate each word (replace spaces
within the expression), although some terms are separated by hyphens, for example, helter-
skelter. Each synonym is followed by a comma, even if it is the last synonym in the list. Each
example sentence follows a colon and ends with a punctuation mark and a character ♪ (musical
note). A tab always separates the synonyms and the first example sentence. These characters
make it easier to do regular expressions on the text because many of the symbols used are rarely
seen in dictionaries and thus, are unlikely to clash with anything in the text originally. If ! was
used instead of one of the special characters, there would have been a higher probability of error
with the regular expressions because ! is a more commonly used character in dictionaries.
After the dictionary has been cleaned up and organized by the special characters, sample
sentences are created. For this, a regular expression search pattern was created:
32
^([^\n╕\t]+ )╕([^\n╕:÷\t,]+)([,\t]÷* *)([^\n╕:\t]+)(\t:+)([^\n:]* )([^\n:]+Σ+)([^\n]+)♪([^\n]*)\n
along with the following regular expression replacement pattern:
\1\2\3╕\4\5\6\7\8♪\9\5\6\2τ\8\n
The following explains the search pattern:
^ means to start at the beginning of the line.
The captured expression ([^\n╕\t]+ ) means that the aim is to collect everything that is not an end
of line, nor a ╕, nor a tab if it occurs one or more times. This unit is referenced by \1 that appears
in the replacement pattern.
╕([^\n╕:÷\t,]+) means that a ╕will be matched preceding the regexp in the parenthesis, which is
referenced by 2 in the replacement pattern. \2, matches everything after ╕ that is not an end of
line, nor a ╕, nor a :, nor a ÷, nor a tab, nor a comma if it occurs one or more times. This pattern
captures the first synonym in the list, while \1 captures everything up until the ╕.
([,\t]÷* *) is denoted by \3 and it collects everything until the next comma or tab, whichever
occurs first. In this instance, \3 captures the next synonym in the list.
([^\n╕:\t]+) means \4 will matches everything after the second synonym that is not an end of
line, nor a ╕, nor a :, nor a tab if it occurs one or more times. In other words, \4 will match
everything after the second synonym up until the tab preceding the example sentence.
33
(\t:+) means \5 will collect the tab and colon that occurs one or more times, then stop preceding
the example sentence.
([^\n:]* ) means \6 will collect everything that is not an end of line, nor a colon if it occurs zero
or more times. It will stop immediately before the original synonym including the whitespace
preceding the first character of the original synonym. In other words, \6 collects everything in the
example sentence preceding the first character of the “keyword”.
([^\n:]+Σ+) means \7 will collect the keyword, which conveniently has a sigma at the end of it.
Since it is possible for some keywords to have more than one sigma, Σ+ is used to
collect sigmas if it occurs one or more times. It is important that Σ marks the keyword only in the
original example sentence.
([^\n]+) means \8 will collect everything after the “keyword” but stops before the character ♪. In
other words, \8 collects the rest of the example sentence.
♪([^\n]*)\n indicates the search pattern that locates the character ♪, which marks the end of the
example sentence. \9 collects everything after the character ♪ that is not an end of line (\9 is most
likely empty), and \n finds the end of the line.
The replacement pattern is as follows: \1\2\3╕\4\5\6\7\8♪\9\5\6\2τ\8\n
34
This means that everything up to the first synonym is the same, but the ╕ gets moved to the
position after the next synonym. Everything then continues as aforesaid until the original end of
line is reached. It is here that the new example sentence is placed. The example
sentence remains the same, except that the initial keyword is substituted for the next keyword in
the list of synonyms that had the ╕ after it in the previous iteration. \n then creates the end of
line. One should continue to apply this search and replacement pattern until it can no longer be
matched. After it matches no more, each synonym in the list should be in its own example
sentence. Then, the number of example sentences on each line is equal to the number of
synonyms on each line plus one since the original example sentence is copied.
It is important to note that τ marks the keyword in each example sentence after the original
example sentence. Also, for ease of application, the original example sentence is copied by the
regexp. Thus, it appears twice on each line where the keyword in the copy is denoted by a τ,
while the Σ denotes the same keyword in the original example sentence. Moreover, each example
sentence is separated by a tab, followed by a colon, followed by a whitespace.
The first letter in any sentence should be capitalized. In order to assume this is the case, the
following search pattern can be achieved by the matching pattern (\t:+ )([a-z]), (as the example
sentences in each row are preceded by a Tab character followed by a series of semicolons and a
space) along with this replacement pattern: \1\U\2\E
Above, (\t:+ ) means that \1 will collect a tab followed by a colon that occurs one or more times.
([a-z]) means that \2 will collect the first letter of a sentence only if it is lowercase. If the first
35
letter of a sentence is not lowercase, then this search pattern will not match, so there is no risk of
corrupting sentences that are already correct. It is important to note that this is functional only
because each example sentence is separated by one or more tabs and one or more colons. As
such, this is effectual for each example sentence on every line.
The replacement pattern \1\U\2\E means that the one or more tabs remain in place in front of
each example sentence. If ([a-z]) finds a match, then \U\2 converts it to uppercase (hence, the
capital U). \E ends the uppercase replacement mode.
At this point, the sentences have been created and rendered grammatically correct (excluding the
special characters). It is now time to create the sentence pairs that are required as data for the
BERT Transformer. A nested for-loop is used for this task. The algorithm is broken up into four
steps.
Step #1. Search pattern: ♪, and the replacement pattern is: ♪♫. The point of step one is to use
the character ♫ to separate the two sentences that will be paired.
Step #2. Search pattern is as follows:
(╕ \t:+ )([^\n\t]+♪)([^\n]*)(♫)(\t)([^\n\t]+)(\t)([^\n]*\n)
The component parts will now be addressed.
36
(╕ \t:+ ) means that \1 will search for the ╕, which is in between the last synonym and the first
example sentence, since all the synonyms have been used for sentence generation in the first
regex algorithm. \1 collects the ╕, the whitespace, the one or more tabs followed by the one or
more colons, followed by a whitespace.
([^\n\t]+♪) means that \2 will start after the colon in \1 and collect everything that is not an end of
line, nor a tab, one or more times, then it will collect the character ♪. In other words, \2 will
collect the first example sentence on the line in addition to the character ♪.
([^\n]*) means that \3 will collect everything in between the character ♪ in \2 and
the character ♫ in \4, although it is very likely that \3 will be empty on most lines.
(♫) as \4 is self-explanatory. (\t), which is \5 is also self-explanatory and is the tab in between the
first two example sentences.
([^\n\t]+) means that \6 will search for and collect anything after the tab that is not an end of line,
nor a tab, in that order, one or more times. This, in practice, is the second example sentence on
the line.
(\t) as \7 is the tab between the second and third example sentence.
([^\n]*\n) means that \8 collects everything else after the tab at the end of the second example
sentence. This may include other example sentences.
37
The replacement pattern for step #2 is \1\2\3\5\6\4\7\8\2\6\n
The replacement pattern will take the first example sentence, a tab, then the second example
sentence, then the character ♫, then the tab, then all other example sentences, then the first and
second example sentence, then the end of the line. In other words, the pairing of the first and
second example sentences is pushed to after the last example sentence and our sentence pairing
marker, character ♫, is pushed in between the second and third example sentences. The purpose
of this is that the second and third example sentences can be paired and pushed to the end in the
second iteration of this nested for-loop.
Step #3: search for character ♫ and replace it with whitespace because it needs to be available to
separate the second and third sentences in the next iteration.
For step #4, the search pattern is as follows: (╕ )(\t)(:+ [^\n\t]+)(♪)(\t)([^\n\t]+)(\t)
The replacement pattern is as follows: \2\3\1\5\6\4\7
(╕ ) means that \1 searches for the ╕ followed by a whitespace, which can be found where it
remained after the previous synonym.
(\t) means that \2 searches for the tab before the first example sentence.
38
(:+ [^\n\t]+) means that \3 searches for and collects the one or more colons before the first
example sentence, the whitespace, then everything that is not an end of line, nor a tab, one or
more times.
\4 is the character ♪ that separates the first and second example sentences.
\5 is the tab before the second example sentence.
([^\n\t]+) means that \6 will take the second example sentence.
(\t) is \7, which takes the tab after the second example sentence.
The replacement pattern is the following: \2\3\1\5\6\4\7
This takes the tab before the first example sentence, then the first example sentence, then
the ╕ to mark the beginning of the sentence pair for the next iteration, then the tab, then the
second example sentence, then the ta ♪ to separate the second example sentence from the third
example sentence. This creates a separation for the next pair of sentences in the next iteration,
then the tab is inserted to separate the second example sentence from the third example
sentence.
Repeat steps 1-4 until the search pattern finds no matches. This will create sentence pairs at the
end of the example sentences that were created. Once this is completed, one can create a new
text document with only the sentence pairs: one pair per line. Then, create the labels based on
whether both sentences in each pair come from the same line in the original text document that
was just created. If both sentences come from the same line, label that pair with a one.
Otherwise, label that pair with a zero. Once this is completed, and all auxiliary characters have
39
been removed, the outcome is annotated training data for training data for fine-tuning the BERT
Transformer for the task of paraphrase identification.
As the training data was generated from dictionary entries of exclusively adverbial expressions,
the retrained network is expected to be able to identify adverbial paraphrases only, however, in
this task, it is expected to perform well, since the training data contains a relatively large set of
examples with a very thorough coverage of adverbial phrase usage.
40
Implementation of Training
We trained BERT in Google CoLab. Google CoLab is a free cloud service that allows one to
write and execute Python code in your browser with free access to GPUs and popular libraries
such as Keras, TensorFlow, PyTorch, OpenCV. It is great for machine learning, data analysis,
and education. CoLab is a hosted Jupyter notebook service that requires no setup to use. The
resources for CoLab are not guaranteed and limited, and the usage limits sometimes fluctuate,
but the usage limits can be extended and more dependable with a ColabPro subscription.
The code can be found in the appendix, and it has been adopted from code on the GitHub site
(https://github.com/naveenjafer/BERT_Amazon_Reviews). It uses the Pytorch package and
Hugging Face’s popular Transformer Library to provide a solution to fine-tuning the BERT base
model applied for a one sentence binary classification problem. We changed the code so that we
could use it for solving a two-sentence binary classification problem. We can consider our
paraphrasing problem as such a classification exercise. It was run in the Google CoLab
environment which allowed us to access GPU compute facilities remotely for free.
We split our training data into 80% used for training and the remaining 20% for validation. This
gave us 72210 training samples and 18053 validation samples. We had a max length of 100
units, our Batch Size was 64, and we used 2 epochs to retrain. It is recommended that one uses 2
or 3 epochs for finetuning in Reference number 7. We used the Negative Log Likelihood Loss
41
Function and the Adam optimization. The training loop did 1129 iterations for each epoch and 10
validation iterations after each epoch. After the first epoch, the training accuracy was over 98%
and the validation accuracy was over 96%. After the second and final epoch, the training
accuracy was 100% and the validation accuracy was over 97%. The program ran for 22 minutes
total.
Since this thesis only addressed sentence pairs with adverbial expressions, in the future, I would
love to apply this methodology of generating annotated paraphrasing sentence pairs with other
parts of speech, such as, adjectives, nouns, and verbal expressions. Approximately 2.5 million
sentences can be generated for other parts of speech by following the methodology outlined
above. This data can be used for thorough retraining. Some of the objectives that can be achieved
with such data are alternate uses of words and expressions in natural (human) language. This
dataset is expected to boost the performance of Transformer-based networks in NLU. In general,
dictionary annotations and style can also be exploited; transfer learning methods will make
identification more efficient and will increase performance in differentiating stylistic language
(formal, colloquial, slang, etc.)
42
REFERENCES
1. “History of Natural Language Processing.” Wikipedia, Wikimedia Foundation, 3 June
2020, en.wikipedia.org/wiki/History_of_natural_language_processing.
2. “Natural Language Processing.” Wikipedia, Wikimedia Foundation, 14 July 2020,
en.wikipedia.org/wiki/Natural_language_processing.
3. Foote, Keith D. “A Brief History of Natural Language Processing (NLP).”
DATAVERSITY, 17 June 2019, www.dataversity.net/a-brief-history-of-natural-language-
processing-nlp/.
4. “Machine Translation.” Wikipedia, Wikimedia Foundation, 14 July 2020,
en.wikipedia.org/wiki/Machine_translation#:~:text=Machine%20translation%2C%20som
etimes%20referred%20to,speech%20from%20one%20language%20to.
5. “History of Machine Translation.” Wikipedia, Wikimedia Foundation, 30 Jan. 2020,
en.wikipedia.org/wiki/History_of_machine_translation.
6. Alammar, Jay. “The Illustrated Transformer.” The Illustrated Transformer – Jay
Alammar – Visualizing Machine Learning One Concept at a Time., 2018,
jalammar.github.io/illustrated-transformer/.
7. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, BERT: “Pre-training
of Deep Bidirectional Transformers for Language Understanding“, Proceedings of
NAACL-HLT 2019, pages 4171–4186
8. Alammar, Jay. “The Illustrated BERT, ELMo, and Co. (How NLP Cracked Transfer
Learning).” The Illustrated BERT, ELMo, and Co. (How NLP Cracked Transfer
Learning) – Jay Alammar – Visualizing Machine Learning One Concept at a Time.,
2018, jalammar.github.io/illustrated-bert/.
9. Naveenjafer. “Naveenjafer/BERT_Amazon_Reviews.” GitHub, 2020,
github.com/naveenjafer/BERT_Amazon_Reviews.
43
APPENDIX:
Python Code for Implementation
The code below was implemented in Google CoLab because it allowed us to use a high-power
machine remotely for free. Here is the code:
! pip install pandas
! pip install torch
! pip install transformers
from google.colab import drive
drive.mount('/content/drive')
projectFolder = "./drive/My Drive/Bert/"
import pandas as pd
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer
from torch.utils.data import DataLoader
import torch.optim as optim
import os
from torch.utils.data import Dataset
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def read_and_shuffle(file):
df = pd.read_csv(file, delimiter='\t')
# Random shuffle.
df.sample(frac=1)
return df
44
def get_train_and_val_split(df, splitRatio=0.8):
train=df.sample(frac=splitRatio,random_state=200)
val=df.drop(train.index)
print("Number of Training Samples: ", len(train))
print("Number of Validation Samples: ", len(val))
return(train, val)
def get_max_length(reviews):
return len(max(reviews, key=len))
def get_accuracy(logits, labels):
# get the index of the max value in the row.
predictedClass = logits.max(dim = 1)[1]
# get accuracy by averaging over entire batch.
acc = (predictedClass == labels).float().mean()
return acc
def trainFunc(net, loss_func, opti, train_loader, test_loader, config):
best_acc = 0
for ep in range(config["epochs"]):
for it, (seq, attn_masks, labels) in enumerate(train_loader):
opti.zero_grad()
#seq, attn_masks, labels = seq.cuda(args.gpu),
attn_masks.cuda(args.gpu), labels.cuda(args.gpu)
seq, attn_masks, labels = seq.to(device), attn_masks.to(device),
labels.to(device)
logits = net(seq, attn_masks)
loss = loss_func(m(logits), labels)
loss.backward()
opti.step()
45
print("Iteration: ", it+1)
if (it + 1) % config["printEvery"] == 0:
acc = get_accuracy(m(logits), labels)
if not os.path.exists(config["outputFolder"]):
os.makedirs(config["outputFolder"])
# Since a single epoch could take well over hours, we
regularly save the model even during evaluation of training accuracy.
torch.save(net.state_dict(), os.path.join(projectFolder,
config["outputFolder"], config["outputFileName"]))
print("Iteration {} of epoch {} complete. Loss : {} Accuracy
: {}".format(it+1, ep+1, loss.item(), acc))
print("Saving at", os.path.join(projectFolder,
config["outputFolder"], config["outputFileName"]))
# perform validation at the end of an epoch.
val_acc, val_loss = evaluate(net, loss_func, val_loader, config)
print(" Validation Accuracy : {}, Validation Loss :
{}".format(val_acc, val_loss))
if val_acc > best_acc:
print("Best validation accuracy improved from {} to {}, saving
model...".format(best_acc, val_acc))
best_acc = val_acc
torch.save(net.state_dict(), os.path.join(projectFolder,
config["outputFolder"], config["outputFileName"] + "_valTested_" +
str(best_acc)))
def evaluate(net, loss_func, dataloaderq, config):
net.eval()
mean_acc, mean_loss = 0, 0
count = 0
with torch.no_grad():
46
for seq, attn_masks, labels in dataloaderq:
#seq, attn_masks, labels = seq.cuda(args.gpu),
attn_masks.cuda(args.gpu), labels.cuda(args.gpu)
seq, attn_masks, labels = seq.to(device), attn_masks.to(device),
labels.to(device)
logits = net(seq, attn_masks)
mean_loss += loss_func(m(logits), labels)
mean_acc += get_accuracy(m(logits), labels)
print("Validation iteration", count+1)
count += 1
'''
The entire validation set was around 0.1 million entries,
the validationFraction param controls what fraction of the
shuffled
validation set you want to validate the results on.
'''
if count > config["validationFraction"] * len(val_set):
break
return mean_acc / count, mean_loss / count
config = {
"splitRatio" : 0.8,
"maxLength" : 100,
"printEvery" : 100,
"outputFolder" : "Models",
"outputFileName" : "ParaphraseClassifier.dat",
"threads" : 4,
"batchSize" : 64,
"validationFraction" : 0.0005,
"epochs" : 2,
"forceCPU" : False
47
}
if config["forceCPU"]:
device = torch.device("cpu")
config["device"] = device
class SentimentClassifier(nn.Module):
def __init__(self, num_classes, device, freeze_bert = True):
super(SentimentClassifier, self).__init__()
self.bert_layer = BertModel.from_pretrained('bert-base-uncased')
self.device = device
if freeze_bert:
for p in self.bert_layer.parameters():
p.requires_grad = False
self.cls_layer = nn.Linear(768, num_classes)
def forward(self, seq, attn_masks):
'''
Inputs:
-seq : Tensor of shape [B, T] containing token ids of sequences
-attn_masks : Tensor of shape [B, T] containing attention masks
to be used to avoid contibution of PAD tokens
'''
#Feeding the input to BERT model to obtain contextualized
representations
cont_reps, _ = self.bert_layer(seq, attention_mask = attn_masks)
#Obtaining the representation of [CLS] head
cls_rep = cont_reps[:, 0]
48
#Feeding cls_rep to the classifier layer
logits = self.cls_layer(cls_rep)
return logits.to(self.device)
from torch.utils.data import Dataset
from transformers import BertTokenizer
import torch
class AmazonReviewsDataset(Dataset):
def __init__(self, df, maxlen):
self.df = df
# A reset reindexes from 1 to len(df), the shuffled df frames are
sparse.
self.df.reset_index(drop=True, inplace=True)
self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
self.maxlen = maxlen
def __len__(self):
return(len(self.df))
def __getitem__(self, index):
sent1 = self.df.loc[index, 'Sentence1']
sent2 = self.df.loc[index, 'Sentence2']
# Classes start from 0.
label = int(self.df.loc[index, 'Score'])
# Use BERT tokenizer since it needs to be able to match the tokens to
the pre trained words.
token1 = self.tokenizer.tokenize(sent1)
token2 = self.tokenizer.tokenize(sent2)
49
# BERT inputs typically start with a '[CLS]' tag and end with a
'[SEP]' tag. For
tokens = ['[CLS]'] + token1 + ['[SEP]'] + token2 + ['[SEP]']
if len(tokens) < self.maxlen:
# Add the ['PAD'] token
tokens = tokens + ['[PAD]' for item in range(self.maxlen-
len(tokens))]
else:
# Truncate the tokens at maxLen - 1 and add a '[SEP]' tag.
tokens = tokens[:self.maxlen-1] + ['[SEP]']
# BERT tokenizer converts the string tokens to their respective IDs.
token_ids = self.tokenizer.convert_tokens_to_ids(tokens)
# Converting to pytorch tensors.
tokens_ids_tensor = torch.tensor(token_ids)
# Masks place a 1 if token != PAD else a 0.
attn_mask = (tokens_ids_tensor != 0).long()
return tokens_ids_tensor, attn_mask, label
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
do_lower_case=True)
print("Configuration is: ", config)
# Read and shuffle input data.
df = read_and_shuffle(os.path.join(projectFolder, "PARAPHRASE-
DATASET/datap.csv"))
df
50
num_classes = df['Score'].nunique()
print("Number of Target Output Classes:", num_classes)
totalDatasetSize = len(df)
# Group by the column Score. This helps you get distribution of the Review
Scores.
symbols = df.groupby('Score')
scores_dist = []
for i in range(num_classes):
scores_dist.append(len(symbols.groups[i])/totalDatasetSize)
train, val = get_train_and_val_split(df, config["splitRatio"])
val.to_csv(os.path.join(projectFolder, "PARAPHRASE-DATASET/Validations.csv"))
train.to_csv(os.path.join(projectFolder, "PARAPHRASE-DATASET/Train.csv"))
# You can set the length to the true max length from the dataset, I have
reduced it for the sake of memory and quicker training.
#T = get_max_length(reviews)
T = config["maxLength"]
train_set = AmazonReviewsDataset(train, T)
val_set = AmazonReviewsDataset(val, T)
train_loader = DataLoader(train_set, batch_size = config["batchSize"],
num_workers = config["threads"])
val_loader = DataLoader(val_set, batch_size = config["batchSize"],
num_workers = config["threads"])
# We are unfreezing the BERT layers so as to be able to fine tune and save a
new BERT model that is specific to the Sizeable food reviews dataset.
net = SentimentClassifier(num_classes, config["device"], freeze_bert=False)
51
net.to(config["device"])
weights = torch.tensor(scores_dist).to(config["device"])
# Setting the Loss function and Optimizer.
loss_func = nn.NLLLoss(weight=weights)
opti = optim.Adam(net.parameters(), lr = 2e-5)
m = nn.LogSoftmax(dim=1)
torch.cuda.set_device(0)
trainFunc(net, loss_func, opti, train_loader, val_loader, config)