02 Oct 2024

[Dissertation] Learning industrial descriptions : NLP tasks for acronym expansion

Master's dissertation by Shaun Johnson, Faculty of Engineering, Built Environment and Information Technology, University of Pretoria, Pretoria

Members

Shaun Johnson, MITC Big Data Science

Supervisor(s)

  • Dr. Vukosi Marivate

Abstract

Human language is cryptic, since words can be interpreted differently depending on the context in which they occur. The exact meaning of a particular word in its context might be trivial for humans, who are generally unaware of language ambiguities. Machines, on the other hand, are required to process, transform and analyse unstructured textual information to determine the underlying meaning. “Acronyms” are shortened versions of phrases; they save time and space compared with writing or typing out their full “expansions or meanings”. The main disadvantage of acronyms is confusion: if misunderstood, they can unintentionally cause damage, have a negative effect, or offend the receiver. An acronym that is appropriate in one context might not be appropriate for an audience in another context. Solving acronym disambiguation could help reduce the negative effects of using acronyms. In this project we apply natural language processing (NLP) technologies in a case study at a particular organisation in the Mining, Metals & Minerals (MMM) sector. The organisation's plant sensor tags (the acronyms) are derived from technical programmable logic controller (PLC) names and expanded by domain experts into pseudo-English (metallurgical) descriptions, these being the ground-truth expansions, to describe the sensors adequately for multiple stakeholders (including non-domain experts). Human input varies, which leads to inconsistency in creating “tag names (acronyms)” and, in turn, to varying degrees of uncertainty when trying to derive an “accurate description from the tags (acronym expansions)”. The aim of this research is to gauge to what extent transfer learning can be applied between similar domains using large language models. For example, scientific document understanding could possibly explain some Mining, Metals & Minerals acronyms.
This leads us to the research question: can pre-trained NLP transformers be applied to the MMM industry, a low-resource setting with few (or no) acronym dictionaries? We present SciAD/SDU fine-tuned transformers that disambiguate acronyms within the scientific document understanding (SDU) context very well and are a stepping stone towards use in the Mining, Metals & Minerals (MMM) domain in future. We foresee that there is still an opportunity to unlock the benefits of other pre-trained language models (PLMs). We also note the value that a small model could have for the MMM domain.

Dissertation

Publications