Generative NLP Models in Customer Service: Evaluation, Challenges, and Lessons Learned in Banking

Editor’s note: The authors are speakers for ODSC Europe this June. Be sure to check out their talk, “Generative NLP models in customer service. How to evaluate them? Challenges and lessons learned in a real use case in banking,” there!
With the increasing use of digital communication, daily interactions between customer service agents and clients have shifted from traditional phone calls to chat and text messages. The banking industry is no exception, as customers reach out to agents for various reasons, such as reporting a lost card, seeking advice on investment plans, or requesting clarification on account details.
To save time for our financial advisors, our team decided to experiment with generative natural language processing (NLP) models to assist them in their daily conversations with clients. These models would allow us to automate simple and repetitive tasks, freeing up our agents to focus on more complex and valuable interactions with customers.
Although natural language generation (NLG) has gained widespread popularity in recent months due to the public release of ChatGPT, it has been used for some time now for tasks such as machine translation, question answering, and summarization. While the accuracy of NLG models has improved, there are still challenges that remain, regardless of how sophisticated the technology is.
The challenge of evaluation: the need for human criteria
When using generative methods to build systems that respond to clients’ requests, one of the biggest challenges is the evaluation phase. This stage raises many questions, such as: What does “correct” mean? Can an answer be syntactically and grammatically correct, yet fail to meet the customer’s needs? What about a suggestion that saves the manager time but still requires editing? How can the system’s overall performance be assessed? How can we prevent the model from hallucinating? What is the best metric for optimizing the network?
Various automatic metrics, such as BLEU, ROUGE, accuracy, and precision, exist in the literature for measuring and optimizing system performance. However, the chosen metric should align with human evaluation criteria. Designing this evaluation process is not straightforward and must be adjusted to each use case.
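As a toy illustration of what these metrics capture (not the implementations we used in production), ROUGE-1 recall and BLEU-style unigram precision can be computed from bare word counts; the reference and candidate sentences below are invented:

```python
from collections import Counter

def rouge_1_recall(reference, candidate):
    """ROUGE-1 recall: fraction of reference unigrams that appear in the candidate."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum(min(c, cand_counts[w]) for w, c in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)

def unigram_precision(reference, candidate):
    """BLEU-style unigram precision: fraction of candidate unigrams found in the reference."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    return overlap / max(sum(cand_counts.values()), 1)

reference = "your card has been blocked and a replacement will arrive in five days"
candidate = "your card is blocked and a replacement arrives in five days"
print(round(rouge_1_recall(reference, candidate), 2))    # 0.69
print(round(unigram_precision(reference, candidate), 2)) # 0.82
```

Both answers would read as acceptable to a human, yet neither metric scores the candidate perfectly, which hints at why automatic scores alone can diverge from human judgment.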
Our use case within the banking industry
To assist financial managers in responding to customer requests, we trained a sequence-to-sequence deep learning neural network with more than one million query-answer pairs. The network’s encoder and decoder were implemented using two LSTMs.
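As a highly simplified sketch of how such an encoder-decoder produces a reply at inference time: the encoder compresses the query into a context, and the decoder then emits tokens one by one until an end-of-sequence marker. In the real system both parts are trained LSTMs; here they are stubbed with toy functions, and the query, reply, and lookup table are invented for the demo:

```python
def encode(tokens):
    # A real LSTM encoder would return its final hidden/cell state;
    # in this stub the "context" is just the token sequence itself.
    return tuple(tokens)

# Canned replies standing in for a trained decoder's predictions.
REPLY_STUB = {
    ("card", "lost"): ["we", "have", "blocked", "your", "card", "<eos>"],
}

def decode_step(context, generated):
    # A real LSTM decoder predicts the next token from its state and
    # the tokens generated so far; this stub reads from the lookup table.
    reply = REPLY_STUB.get(context, ["<eos>"])
    return reply[len(generated)]

def generate_reply(query, max_len=20):
    context = encode(query)
    generated = []
    while len(generated) < max_len:
        token = decode_step(context, generated)
        if token == "<eos>":  # stop when the decoder signals end of sequence
            break
        generated.append(token)
    return generated

print(generate_reply(["card", "lost"]))
# ['we', 'have', 'blocked', 'your', 'card']
```

The token-by-token loop is the key structural point: each decoding step conditions on everything generated so far, which is also where errors and hallucinations can compound.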
Our results show that the network scored above 75% on automatic NLG accuracy metrics. However, these metrics alone do not always reflect the quality of the generated text, leading us to realize that the evaluation phase needed to include human-centered annotations to align automatic metrics with human criteria.
To achieve this, we invested significant time and effort in designing a strategy for our specific use case. This strategy involved several stages, such as understanding the problem, categorizing the landscape of questions, and writing clear guidelines for annotators.
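One common way to check whether annotation guidelines are clear enough is to measure inter-annotator agreement. The sketch below computes Cohen's kappa for two hypothetical annotators rating generated answers on an invented three-label scale; this is an illustration of the idea, not our actual annotation scheme:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label at random.
    expected = sum(counts_a[lbl] * counts_b[lbl] for lbl in counts_a) / n**2
    if expected == 1:  # degenerate case: both always use one identical label
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of 8 generated answers: "usable", "needs_edit", "wrong".
a = ["usable", "usable", "needs_edit", "wrong", "usable", "needs_edit", "usable", "wrong"]
b = ["usable", "needs_edit", "needs_edit", "wrong", "usable", "needs_edit", "usable", "usable"]
print(round(cohens_kappa(a, b), 2))  # 0.6
```

Low agreement is usually a signal that the guidelines, not the annotators, need revising, which is why guideline design deserves the time it takes.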
Following this strategy paid off: we found correlations between these metrics and human evaluation, along with many other insights that are independent of the neural network architecture or NLG technique employed.
If you want to learn more, we will be presenting our findings at ODSC Europe 2023. We will also share other valuable lessons learned from using this type of model in customer service, based on our specific banking use case.
About the authors:


She has worked in several analytical domains, ranging from Retail and Urban Analysis to Customer Intelligence. Now, she is trying to enhance the customers’ relationship with the bank through Natural Language Processing and Text Analytics. María focuses on understanding business challenges and developing the best analytical solution for each problem.
The authors would like to thank Mariana Bercowsky for her contribution to the writing of this post.