on the di culties of using machine learning on industrial texts...

On the Difficulties of Using Machine Learningon Industrial Texts

Ozlem Ozgobek1, Ole Selvig1, Jon Atle Gulla1, Lemei Zhang1, and CristinaMarco1

Department of Computer Science, NTNU, [email protected], [email protected], [email protected],

[email protected], [email protected]

Abstract. In oil and gas sector, maintenance reports from a floating oiland gas production, storage and offloading (FPSO) unit play an impor-tant role for fixing the failures on time and for learning from previousexperiences. Documenting these reports in a structured way is an im-portant task for such units. In this work, we experiment with differentmachine learning techniques to classify the maintenance reports. Howeverthe unstructured text from these reports combined with domain specificterms, abbreviations, misspellings and multiple language use makes theautomatic classification of these reports quite challenging. Our findingsshow that how challenging this task is with the existing report structureand content.

1 Introduction

Offshore oil & gas installations are complex structures that are expensive tobuild and need continuous monitoring for maintenance and upgrading. It is vitalto avoid any unnecessary production stops or accidents, as this would have seri-ous implications for both the companies themselves and their staff. Financiallyjust a short temporary halt of production comes at a substantial cost. Moreimportantly, however, is the risk of serious life-threatening accidents if partsbreak down or malfunctions arise. Offshore oil & gas production is by nature achallenging task, and the dependence on technology calls for proper proceduresfor reducing the risk of equipment failures and other accidents that affect thesecurity of people and climate.

Maintenance reports document maintenance activities that are either carriedout on a regular basis or triggered by particular events. Whereas some of themaddress small repairs like the change of broken light bulbs or screws, others con-cern more time-critical problems that need to be taken very seriously by themanagement. The intention is to use these maintenance reports to structure thedocumentation of these activities for later use and facilitate learning and ex-change of competence over time. Engineers acquire valuable experiences as they

Copyright held by the authors. NOBIDS 2018

take part in maintenance activities, but not all of them are equally experiencedand no individual engineer can be expected to fully understand the structuresand functions of a complete oil & gas installation.

Maintenance reports follow a particular format that forces the engineers toprovide some basic information about the maintenance work. There are a numberof structured fields that identify for example a piece of equipment, a location ora date, but the engineers are also encouraged to write a short textual descriptionof the work they are carrying out. Even though these descriptions are fairly shortand of variable quality, they are crucial when someone wants to check if theirpresent problem has been faced and solved at some time in the past.

Failure codes are of particular interest in maintenance reports. Unfortunately,many reports do not have any failures codes entered, since they were not knownat the time the report was created. For later maintenance activities this is prob-lematic for two reasons: (i) A maintenance task is not treated correctly or timelybecause they fail to realize that similar tasks have been successfully dealt with inthe past, and (ii) they are not able to generalize - or learn - from previous casesbecause a critical piece of data is missing in the documentation of old cases.

In principle, we can approach the missing failure codes as a classificationproblem. Experiments show, unfortunately, that the structured fields of mainte-nance reports are not sufficient to produce a classifier with a very high accuracy.The question is whether the textual descriptions may help us fill in the failurecodes instead. These texts were not written to support later classification andare probably quite typical for internal documentation in many companies. Theywere quickly written and with limited knowledge of the problem at hand, andthey are best understood by their fellow engineers. The descriptions are neitherprecise nor complete, but they do reveal something that is potentially relevantfor classifying the maintenance work. The contribution of this paper is two-fold:(i) we analyze the challenges of extracting the content of internal industrial textsand assess the usefulness of various linguistic operations, and (ii) we evaluate towhat extent machine learning techniques can classify the maintenance reportson the basis of their textual descriptions.

The structure of the paper is as follows. After presenting the dataset in Sec-tion 2, we discuss the methods for analyzing the textual descriptions in Section3. We then use some machine learning techniques to classify the maintenancereports and discuss the results of the experiments at the end.

2 Dataset

The dataset used in this work stems from a floating production, storage andoffloading (FPSO) unit in the Norwegian oil and gas industry in the NorthSea. It contains a total of 4968 failure reports on various maintenance activitiesin the oil and gas field. Failure reports are labelled with different code groupsaccording to their association with different parts of the FPSO unit, such aselectric generators, steam turbines etc. In the dataset there are 40 code groups. Ineach code group there are various fault types where each report can be associated

with. However the fault type field that may or may not be filled by the reporter.In the dataset the fault type field is not consistently filled. In general, the reportsconsist of a number of structured fields and one text describing the failure. InFig.1 the general structure of the failure reports can be seen.

Report texts describe asset failures from a FPSO unit in the North Sea. Theaverage length of the reports is 75 words, where the minimum length is 7 andthe maximum length is 1580 words. An interesting property of the reports inthe dataset is that the usage of multiple languages. As it is shown in Fig.2,the languages used in the reports are mostly Norwegian (58.7%) and English(22.6%), however it is possible to see the usage of Danish (3.7%) and Swedish(0.4%) as well. In addition, in 13.2% of the reports multi-language use occurswithin the same report. In total there are 32,011 unique words in the reports.

Fig. 1. The general structure and fields of failure reports.

2.1 Challenging properties of the dataset

Dealing with the automatic analysis of the free, unstructured text has challenges.When we consider the technical failure reports at FPSO units, the data comeswith even more challenges:

– Misspellings: In the failure reports there are several misspelling of wordsand abbreviations in addition to the usage of incorrect grammar and unfin-ished sentences.

– Domain specific terminology and jargon: The usage of domain specificterms and acronyms is available in the failure reports.

Fig. 2. Different languages used in maintenance reports.

– Multi-language use: As mentioned earlier, 13.2% of the failure reports hasthe usage of multiple languages at once.

– Abbreviations: For the ease of reporting there are several abbreviationsavailable in the reports. Several of those abbreviations are not explainedand it is only possible for the human domain experts to understand.

– Non-standard structures: The structure of reports is not standardized.Even though it seems that there is a form of structure in some of the reports,many of the reports do not follow the same structure.

– Imbalance in report types: The distribution of 4968 failure reports among40 different code groups is not balanced. As it is shown in Fig. 3 the differencebetween the distribution of reports vary a lot.

Fig. 3. Imbalance in report types.

3 Methods

3.1 Text simplification and pre-processing

Prior to conduct the classification experiments we conducted some normalizationof the texts. This normalization includes some text simplification strategies aswell as general NLP pre-processing techniques.

Sentence boundaries were standardized as much as possible, in order to en-hance the performance of the classifiers. In particular, newlines and non-standardsentence boundaries were normalized by substituting the symbols with a basicdot. Regular expressions were used to do these transformations.

Besides, most reports include specific dates and times describing when thereport was made or edited and by whom. This information was also transformedinto general named entities labels so as to reduce the vocabulary space the clas-sifier has to deal with as much as possible. Figure 4 shows an example of anoriginal maintenance report and Figure 5 illustrates the same text after sim-plification. This example illustrates how actual dates, times and proper namesfor equipment, location, people and measurement numbers, were mapped intogeneral labels during normalization. Protocol questions included in most reportsas templates for employees were also removed, with the purpose of excludingunnecessary information.

Fig. 4. Maintenance report example. Original text.

Fig. 5. Maintenance report example, after text simplification.

After these simplification, standard NLP pre-processing was performed, in-cluding tokenization, normalization of words with capital letters, stop-wordsremoval and stemming, whereby words are reduced into their base form. NLTKwas used for these purposes.

3.2 Feature extraction

Two different feature models were explored in this study. A baseline model,consisting on a bag-of-words model including TF-IDF frequencies, and two wordembedding models, in particular distributed memory (DM) and distributed bagof words (DBOW) were explored. Gensim and scikit-learn were correspondinglyused for these purposes.

The approach for obtaining such features varies slightly. TF-IDF models arefitted individually with text data from each code group or class. In contrast,document embedding models are fitted with the text data from all classes, andthen each of the selected class infers embeddings from the same global model.

3.3 Classification

Text classification is a fundamental field that has been well studied by researchersand practitioners of a variety of fields, including artificial intelligence, statis-tics, pattern recognition, cognitive psychology, computer vision and medicine.In this section, we first formally define the classification task. Then differenttypes of classification algorithms, specifically, K nearest neighbor, MultinomialNaive Bayes, Random Forest and Support Vector Machine, are adopted andexplained.

Problem definition Let X = {x1,x2, ,xN} be the input set after pre-processing,and y be the output being assigned to a class c ∈ C after generated by infer-ence model f . xi is the input vector of the i-th document, and N represents thenumber of documents in training set. The classification problem can be denotedas: y = f(X ). Our goal is to assign y to a class c which is the same with groundtruth class c.

K Nearest Neighbor The fundamental mechanism behind nearest neighbormethod is to find K training samples which are closest in distance to the newlyarrived input point [4]. The distance can be any types of metric measurementsaccording to specific applications. In this paper, euclidean distance is adoptedto find the closest K points in training set.

d(x,x) = ||(x− x)|| =

√√√√ D∑i

(x2i − xi

2) (1)

https://www.nltk.org/https://radimrehurek.com/gensim/https://scikit-learn.org/

where x and x are the representations of training and target points respectively.d(·) is the distance between x and x and D is the dimension of input vector.The predicted class y can be achieved according to the conditional probabilitygiven the set of the nearest K neighbors Z:

y = argmacc∈Cp(y = c|Z) =1

k

∑i∈Z

I(yi = y) (2)

where I(·) is the indicator function having the value of 1 if y belongs to a specificclass else 0.

Naive Bayes Naive Bayes classifier classifies newly input documents usingBayes rule under the assumption of the independence between input features[7]:

y = argmacc∈Cp(c)

D∏i=1

p(xi|c) (3)

Multiple models can be used to calculate the posterior probability of the rele-vance of input x given class c, among which includes Multinomial [6] and Guas-sian models [8]. In our case, Multinomial model is used since it can cover bothdiscreet and continuous feature spaces [9].

p(x|c) =(∑

i xi)!∏i xi!

∏i

pxici

(4)

where xi is the relative value of the i-th feature of input document.

Random Forest Random Forest is an ensemble of decision tree algorithms [2],where each decision tree participates in the class prediction process through re-sult aggregation. Decision trees embody the classification approach [14] througha tree-like structure, which is made up of a root and internal nodes, branchesand leaves. Branches represent paths denoted by a range of values and internalnodes control splitting rules represented by selected features from input space.Information Gain is one of the most popular feature selection methods adoptedin this paper at each internal node during splitting process.

InformationGain(Xp, xi) = H(Xp)−m∑j=1

NiNp

H(Xj) (5)

where Xp is the set of training documents before split, and Xj ∈ Xp is a subsetof documents disjoint with each other after split. m is the number of values offeature xi appeared in training set. Ni andNp are the number of documents in Xpand Xj respectively. H(·) specifies the entropy. The feature and value are selectedwith the highest Information Gain value recursively until the splitting processreaches the leaf node and is assigned to a class. Thus, the final classificationresult of Random Forests is given as the averaged prediction of the individualdecision trees.

Support Vector Machine Support Vector machine (SVM) is another methodthat we use to categorize fault type labels of text descriptions [5]. SVM sets outto fit a hyperplane with largest margin to nearest examples of any class. Theseexamples are known as support vectors. The optimization process of SVM is tosupport vectors with largest margin between arbitrary classes while minimizingthe lost function L. In our paper, the extension version of hinge loss [13] isadopted to achieve the best margin. Specifically, for

L =∑j 6=c

max(0,WTj x−WT

c x + δ) + λ · R(θ)

where Wj and Wc are model parameters. δ is the tolerance threshold to betuned during training process. R(·) represents the regularization term used toprevent over-fitting problems and model complexity, which can be either L1

or L2 regularization. θ represents the set of model parameters and λ is theregularization parameter.

4 Experiments

Different experiments run in different phases: 1- NLP techniques in preprocessing2- Machine learning algorithms for classification

4.1 Evaluation Metrics on Classifications

Here we enumerate the evaluation metrics on classification tasks adopted in thispaper. The formulae are shown in Table 1.

– Accuracy (ACC): Despite its popular property in evaluation of classifica-tion tasks, some issues occur in an imbalanced domain. For instance, considera problem where only 1% of the examples belong to the positive class, a accu-racy of 99% is achievable by predicting negative class for all examples. Thus,ACC is at best misleading, when the ratio between positive and negativeexamples is significantly skewed.

– Harmonic Mean of Precision & Recall (Fβ):The definition is basedon two metrics, recall and precision. Recall captures the completeness ofhow many positive examples are predicted positive, while precision capturesthe exactness of how many predicted positive examples are true positive.Because of an imbalanced multi-class data distribution in our experiments,we adopt macro averaging in [10] to summarize the classifiers performanceover a set of classes C. Thus, the example size of a class does not affect themeasurement. For a given class c, Fβ is defined as the harmonic mean ofprecision and recall [12]. β is introduced to adjust the relative importanceof recall with respect to precision and is set to 1 in our case.

– Cohens Kappa (κ): It compares an observed accuracy with and expectedaccuracy [3]. Specifically, κ provides a score ranging from -1 to 1, from whichthe performance of a classifier can be evaluated with respect to random

chance. The bigger the value is, the better the performance shows. Theinterpretation of κ score follows the work in [1].

– Geometric Mean Accuracy (GMA): The metric takes the geometricaverage (GMA) of the recall score in each class and it benefit from the useof geometric mean when there is a multiplicative or exponential relationshipbetween class scores averaged especially for imbalanced data distribution[11].

Besides, two guessing strategies are chosen as baselines:

– Stratified (STRAT): It generates predictions by respecting training setsclass distribution.

– Most Frequent (MFREQ): It always predicts the most frequent label inthe training set.

Table 1. Classification evaluation metrics

Evaluation Metrics Definition

Accuracy (ACC) ACC =tp+ tn

tp+ tn+ fp+ fn

Fβ Fβ =(1 + β)2 · PrecM ·RecMβ2 · PrecM ·RecM

Cohens Kappa (κ) κ = 1 − 1 − po1 − pe

Geometric Mean Accuracy (GMA) GMA = |C|√∏

c∈C recall(c)

4.2 Parameter Settings

Hyper parameters are selected through grid-search for different tasks and classi-fiers. Then values with best performance of classification tasks are selected. Eachsampled experiment for a particular feature extraction method will be scored interms of its exhibited Fβ macro averaged score with K = 5 in K Nearest Neigh-bor predictions.

where tp, tn, fp and fn denote true positive, true negative, false positive and falsenegative respectively

PrecM =

∑c∈C

precision(c)

|C| where precision(c) =

∑iI(yi=c)I(y=c)∑

iI(y=c)

. RecM =∑c∈C

recall(c)

|C| where recall(c) =

∑iI(yi=c)I(yi=c)∑

iI(yi=c)

where po denotes the observed accuracy and pe denotes the expected accuracy ofclassification performance

5 Evaluation

As described before, there are several properties in oil and gas maintenance re-ports which make them particularly challenging to be automatically understood:specialized terminology and expressions, multiple languages, misspelled and non-standard words, inconsistent syntax and lack of proper sentence boundaries.

In order to assess the impact of the different simplification and pre-processingtechniques, as well as feature extraction models, a careful evaluation was per-formed. This evaluation was implemented in three steps: (i) assessment of thefeature extraction methods; (ii) evaluation of singular simplification and pre-processing techniques; (iii) evaluation of simplification and pre-processing meth-ods in combination.

5.1 Assessment of the feature extraction methods

As results from experiments with each of the feature extraction methods show,each method performs differently on different models. Figures 6, 7 and 8 show thedistributional graphs containing all the models sampled during hyperparametersearch. In these figures, the vertical axis represents the total number of modelsthat achieved any of the F scores along the horizontal axis. Note that axis scalesare not consistent among the graphs.

The graphs show that the performance of document embeddings (DM andDBOW) varies significantly. Although TF-IDF show a more consistent perfor-mance across various hyperparameters, and exhibited the best performance ofall the three methods, it also includes the worst performing model.

Fig. 6. Distributed memory (DM)

5.2 Evaluation of individual text simplification and pre-processingtechniques

Table 9 shows classifier performance for the text simplification techniques ex-plored here. As can be seen from this table, both SVM and MNB perform belowa simple baseline, questioning the validity of using text simplification techniquesfor the classification of maintenance reports. Specially it is low in all cases where

Fig. 7. Distributed bag of words (DBOW)

Fig. 8. Term Frequency-Inverse Document Frequency (TF-IDF)

Fig. 9. Classifier performance by text simplification technique. Classifiers in each groupare ranked in terms of F score exhibited with each technique.

the performance of classifiers with person-name filtering. Also it is possible tosee that different classifiers perform better on different experiments.

Table 10 shows the results of the classifiers when performing standard NLPpre-processing. Interestingly, the SVM experiment including stemming yieldedthe best results, while the worst results were obtained for the MNB classifierunder the same conditions. The reverse can be seen with non-alphanumeric fil-tering. In general, these pre-processing methods had a much a bigger impacton the performance of SVM, yielding results above the baseline. In contrast,MNB got results over the baseline only when non-alphanumeric filtering wasperformed.

Fig. 10. Classifier performance by NLP pre-processing technique. Classifiers in eachgroup are ranked in terms of F score exhibited with each technique.

5.3 Evaluation of individual text simplification and pre-processingtechniques in combination

Lastly, we evaluated the performance of the different pre-processing methods incombination. The top five performing combinations of pre-processing methodsfor SVM and MNB can be seen in Tables 11 and 12.

qs: question structure filtering; sw: stop-words filtering; l: lowercase (no capitalizationletters); st (stemming); dnt: datetime-number-tag filtering; nlta: non-alphanumericfiltering; dt: datetime filtering

Fig. 11. SVM: The top 5 best performing pre-processing and simplification techniquesin combination, ranked by F score.

Fig. 12. MNB: The top 5 best performing pre-processing and simplification techniquesin combination, ranked by F score.

Although there is no consistent ranking between the two classifiers, however,lowercase, stop words filtering and non-alphanumeric filtering appear amongstthe techniques with most impact on their performance. This is not surprising,as these techniques are generally used as pre-processing steps in most NLPpipelines. Slightly more unusual is the fact that stemming does not seem tobe very relevant for the classification algorithm’s performance. We suggest thatthis might be due to the fact that there is a high number of spelling errors andspecialized vocabulary which can make the vocabulary space simplification morechallenging.

6 Discussion and Conclusion

In this work we looked into the maintenance reports from a floating production,storage and offloading (FPSO) unit in the Norwegian oil and gas industry. Thesereports are used to structure the documentation of these activities for later useand facilitate learning and exchange of competence over time.

Failure codes which some of the reports in the dataset include, help theengineers to classify the type of failures in order to treat the failures on timeand to learn from the similar failures happened in the past. However, not all thereports include failure codes. In this work, we use machine learning in order topredict the missing failure codes. By using the textual description of the failure,after pre-processing of the text and feature extraction, we experimented withthree different machine learning classifiers: K Nearest Neighbor, Naive Bayesand Support Vector Machine. As explained in the previous sections, the resultsshow that different classifiers outperforms other methods on different tasks. Butunfortunately none of the methods are good enough to be useful in this taskand tested methods do not solve challenges. The challenges of the dataset, asexplained in detail in Section 2.1, makes data preparation and analysis very hard.Maintenance reports have special characteristics. Usage of multiple languages atonce, unstructured text with misspellings, technical terms, abbreviations arequite challenging for the existing natural language processing methods to dealwith.

References

1. DG Altman. Practical statistics for medical research chapman & hall london googlescholar. Haung, et al [16] USA (Black), 1991.

2. Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.3. Jean Carletta. Assessing agreement on classification tasks: the kappa statistic.

Computational linguistics, 22(2):249–254, 1996.4. Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE

transactions on information theory, 13(1):21–27, 1967.5. Mark M Kornfein and Helena Goldfarb. A comparison of classification techniques

for technical text passages. In World congress on engineering, pages 1072–1075,2007.

6. David D Lewis and William A Gale. A sequential algorithm for training textclassifiers. In Proceedings of the 17th annual international ACM SIGIR conferenceon Research and development in information retrieval, pages 3–12. Springer-VerlagNew York, Inc., 1994.

7. Andrew McCallum, Kamal Nigam, et al. A comparison of event models for naivebayes text classification. In AAAI-98 workshop on learning for text categorization,volume 752, pages 41–48. Citeseer, 1998.

8. Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel,Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss,Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal ofmachine learning research, 12(Oct):2825–2830, 2011.

9. Jason D Rennie, Lawrence Shih, Jaime Teevan, and David R Karger. Tacklingthe poor assumptions of naive bayes text classifiers. In Proceedings of the 20thinternational conference on machine learning (icml-03), pages 616–623, 2003.

10. Marina Sokolova and Guy Lapalme. A systematic analysis of performance measuresfor classification tasks. Information Processing & Management, 45(4):427–437,2009.

11. Yanmin Sun, Mohamed S Kamel, and Yang Wang. Boosting for learning multipleclasses with imbalanced class distribution. In null, pages 592–602. IEEE, 2006.

12. CJ Van Rijsbergen. Information retrieval. dept. of computer science, university ofglasgow. URL: citeseer. ist. psu. edu/vanrijsbergen79information. html, 14, 1979.

13. Jason Weston, Chris Watkins, et al. Support vector machines for multi-class pat-tern recognition. In Esann, volume 99, pages 219–224, 1999.

14. Yongheng Zhao and Yanxia Zhang. Comparison of decision tree methods for findingactive objects. Advances in Space Research, 41(12):1955–1959, 2008.

on the di culties of using machine learning on industrial texts...

Documents