When the hardware gets chilled: monitoring NFC modules of ATMs

Tomcat · Nov 23, 2024

Hello! I develop technologies for automating the maintenance of various hardware, including ATMs.

Many engineering and technical fields have a notion of "monitoring". In IT, people distinguish "ordinary monitoring", which usually means tracking how software behaves and how that affects the user experience, from "technical monitoring", which watches over the operation of equipment. We have already briefly described how we monitor the health of ATMs (see the list of articles below). Now we will share our experience of tracking the status of one specific component of these "money robots": NFC readers.

Our previous articles about ATMs and hardware:

We install ATMs in severe frost, heat, the metro and on all-terrain vehicles (about the unexpected technical difficulties of installing ATMs in a variety of, sometimes very unusual, places)
We fix ATMs: we fix hardware and software (about the daily routine of engineers who service ATMs; monitoring is touched on only briefly)
We are fixing ATMs again (a continuation of the topic, including monitoring)
Call Kuzya: how we wrote a FAQ for engineers (about voice automation)

How our technical monitoring works

As written in our documentation:

Technical monitoring is a set of activities for tracking the state of SDs, detecting failures, controlling devices remotely and generating requests to restore functionality, aimed at improving both functional availability and CTM availability.

This needs some explaining in plain language.

SD stands for "self-service device", commonly known as an ATM: a complex hardware and software system that provides a secure, continuous and convenient set of online services, from accepting cash to electronic payments and personalized offers for the client.

CTM stands for "comprehensive technical maintenance". It is handled by a dedicated system, SCTM, which we developed in-house. With it we:

detect and diagnose ATM failures;
organize the full restoration cycle and carry out project work;
automate and optimize incident resolution processes;
exchange information with all participants in the process and coordinate their work.

In other words, SCTM is our central system for managing the ATM life cycle. It collects and processes information from all sources and covers the full cycle of work needed to eliminate failures. Here is a high-level diagram of SCTM operation:

Our technical monitoring system uses several important concepts:

An incident is an entity that arises during ATM operation or project work and requires either resolving a problem or completing planned work. An incident can be detected by sensors or event monitoring tools, raised from a request, or scheduled.
An event is an entity generated by any change within an incident: creation, start or closure of a request, a status change within a request, the incident becoming ready for closure, etc. An event has no long life cycle: it occurs, is registered and is processed.
A request for on-site or remote work is an entity within an incident that is assigned to a specific performer. It has a set of attributes: type, scheduled time, overdue status, current execution time, and much more.
An SD (ATM) is a configuration unit. Its attributes include the technical parameters of the device, its installation location and its service conditions.
A robot is a module within the system that automates a specific operational function and makes life easier for operations specialists. We have many robots, and we give them individual names: Inna, Ilana, Katyusha or, for example, Marusya, who controls the software. The reason is that it was awkward to say and write: "What changed in the first-decision robot for automatic processing in the classic incident-handling module?" It is much easier to ask: "How has Wall-E changed?" So each robot's name has its own story. Wall-E was the first; on the business analytics side, its launch was handled by an employee named Valentin, who also had the cute cartoon character in mind.
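The relationships between these entities can be sketched as a small data model. This is a minimal illustration assuming Python dataclasses; all class, field and event names (SelfServiceDevice, Request, Incident, "request_opened" and so on) are hypothetical and not taken from SCTM. It shows the two ideas from the list: an incident owns its requests, and every change inside it is registered as a short-lived event.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class RequestType(Enum):
    ON_SITE = "on_site"
    REMOTE = "remote"


@dataclass
class SelfServiceDevice:
    """An SD (ATM) as a configuration unit; attribute names are hypothetical."""
    device_id: str
    location: str
    service_conditions: str
    technical_params: dict = field(default_factory=dict)


@dataclass
class Event:
    """A short-lived record: it occurs, is registered and is processed."""
    kind: str
    at: datetime


@dataclass
class Request:
    """On-site or remote work inside an incident, assigned to one performer."""
    request_type: RequestType
    performer: str
    scheduled_at: datetime
    status: str = "open"


@dataclass
class Incident:
    sd: SelfServiceDevice
    requests: list = field(default_factory=list)
    events: list = field(default_factory=list)

    def _register(self, kind: str, at: datetime) -> None:
        # Every change within the incident is registered as an event.
        self.events.append(Event(kind, at))

    def open_request(self, request: Request, at: datetime) -> None:
        self.requests.append(request)
        self._register("request_opened", at)

    def close_request(self, request: Request, at: datetime) -> None:
        request.status = "closed"
        self._register("request_closed", at)
        # Once no request is left open, the incident is ready for closure.
        if all(r.status == "closed" for r in self.requests):
            self._register("ready_for_closure", at)
```

Closing the last open request is what emits the "ready for closure" event; closing one of several requests only records the status change.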

Incident Resolution Stages

The life cycle of any incident, in terms of its handling, consists of four stages:

Creation: the moment when a system or a person determines that an ATM has a problem and an incident must be registered. The main sources of information are changes in SD sensor states on the host, events about problems with the main and backup communication channels, transaction monitoring, event monitoring, requests and projects.
First decision making (FDM): the period during which the incident is diagnosed (what kind of failure, on which ATM, what events are in the logs, which other incidents are open in parallel, etc.), a remote restoration is attempted, and the first restoration request is opened. Our system has many FDM-class robots.
Information exchange: the period from the opening of the first request to the closing of the last one. Participants mainly exchange statuses and comments. Our system has only one information exchange robot.
Closing: the moment when the results of the work are assessed. The system checks that the requests are closed, the state of the ATM nodes, whether known incidents exist and what their status is, and whether transactions are going through. Our system has only one closing-class robot.
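One way to picture the four stages is as an ordered enumeration, with the current stage derived from a few facts about the incident. The sketch below is a hypothetical simplification, not SCTM logic: it assumes the stage depends only on whether the incident has been registered and on how many requests have been opened and closed, following the boundaries described above (FDM ends when the first request opens; information exchange ends when the last one closes).

```python
from enum import IntEnum


class Stage(IntEnum):
    CREATION = 1              # problem detected, incident being registered
    FIRST_DECISION = 2        # FDM: diagnosis, remote restore attempt, first request
    INFORMATION_EXCHANGE = 3  # from the first request opened to the last one closed
    CLOSING = 4               # results are checked and the incident is closed


def stage_of(registered: bool, opened: int, closed: int) -> Stage:
    """Derive the handling stage from hypothetical incident counters."""
    if not registered:
        return Stage.CREATION
    if opened == 0:
        return Stage.FIRST_DECISION
    if closed < opened:
        return Stage.INFORMATION_EXCHANGE
    return Stage.CLOSING


# Number of robot classes per stage, as stated in the text.
ROBOTS_PER_STAGE = {
    Stage.FIRST_DECISION: "many",
    Stage.INFORMATION_EXCHANGE: "one",
    Stage.CLOSING: "one",
}
```

Using IntEnum keeps the stages ordered, so "later than" comparisons between stages work directly.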

Automatic incident handling

We have implemented an automatic mechanism for processing claims that is built on SCTM, event monitoring and an event provider. It runs online, 24x7. Event monitoring is the job of Jarvis, the SCTM module for first decision making. Its input is a stream of events of different types. For each type, a profile defines how often the event must recur, what incident it raises, and how it is handled afterwards. When a profile is triggered, Jarvis creates an incident, performs a remote error reset on the ATM and, if necessary, opens a request for an engineer visit. Then Inna, the information exchange module, takes over.
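The profile mechanism can be illustrated with a toy event handler. This is a minimal sketch under stated assumptions: the Profile fields, the event type name and the action strings are hypothetical, and occurrences are simply counted per ATM and event type with no time window. It follows the flow described above: a profile fires after the required number of repeats, an incident is created, a remote reset and an engineer request are added if the profile asks for them, and Inna takes over.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Per-event-type rule; all field names are hypothetical."""
    event_type: str
    repeats: int          # occurrences needed before the profile fires
    remote_reset: bool    # try a remote error reset on the ATM
    engineer_visit: bool  # also open an on-site request


class Jarvis:
    def __init__(self, profiles):
        self.profiles = {p.event_type: p for p in profiles}
        self.counts = defaultdict(int)  # (atm_id, event_type) -> occurrences

    def on_event(self, atm_id: str, event_type: str) -> list:
        """Return the actions taken for this event (empty if nothing fired)."""
        profile = self.profiles.get(event_type)
        if profile is None:
            return []  # no profile for this event type: ignore it
        key = (atm_id, event_type)
        self.counts[key] += 1
        if self.counts[key] < profile.repeats:
            return []
        self.counts[key] = 0  # start counting again after the profile fires
        actions = [f"create_incident:{atm_id}:{event_type}"]
        if profile.remote_reset:
            actions.append(f"remote_reset:{atm_id}")
        if profile.engineer_visit:
            actions.append(f"engineer_request:{atm_id}")
        actions.append("handoff:Inna")  # information exchange takes over
        return actions
```

For example, with a hypothetical profile `Profile("dispenser_error", 2, True, False)`, the first such event from an ATM returns no actions, and the second returns an incident, a remote reset and the hand-off to Inna.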

Examples of profiles:

"3 claims in two days". If three claims about an ATM's operation accumulate within 48 hours, the system takes the ATM out of service and creates an urgent incident. It then opens a log analysis request and an SLM request. Once the work is done, it puts the ATM back into service and closes the incident.
"2 claims in 14 days". After work has been carried out on an ATM, the system keeps tracking claims about it for another two weeks. If two claims arrive in that period, the system creates a regular incident and opens a log analysis request, an escalation and an SLM request. Once the work is done, it puts the ATM back into service and closes the incident.
"4 claims per month". If four claims are filed against an ATM within a month, the system creates a regular incident and opens a log analysis request and an SLM request. Once the work is done, it puts the ATM back into service and closes the incident.

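The three claim profiles are all counts over a time window, so they can be checked with a simple sliding-window count. In this sketch a "month" is assumed to be 30 days, the profiles are checked in the order listed above with the first match winning, and the returned labels are hypothetical; the follow-up actions (log analysis, escalation, SLM request) are left out.

```python
from datetime import datetime, timedelta


def count_in_window(claims, now, window):
    """Count claim timestamps in the interval (now - window, now]."""
    return sum(1 for t in claims if now - window < t <= now)


def check_claim_profiles(claims, now, last_work_at=None):
    """Return the label of the first claim profile that fires, or None."""
    # "3 claims in two days": urgent incident, ATM taken out of service.
    if count_in_window(claims, now, timedelta(hours=48)) >= 3:
        return "3_in_48h"
    # "2 claims in 14 days": applies only within two weeks after work on the ATM.
    if last_work_at is not None and now - last_work_at <= timedelta(days=14):
        after_work = [t for t in claims if last_work_at < t <= now]
        if len(after_work) >= 2:
            return "2_in_14d_after_work"
    # "4 claims per month": a month is assumed to be 30 days here.
    if count_in_window(claims, now, timedelta(days=30)) >= 4:
        return "4_per_month"
    return None
```

Claims are passed as plain timestamps, so the same list can be re-evaluated whenever a new claim arrives.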
NFC technology in ATMs

ATMs are becoming smarter and more convenient thanks to new technologies. One of them is NFC (Near Field Communication), a short-range contactless communication technology. NFC makes it possible to perform various operations without physical contact with the device; ATMs use NFC readers for this.

NFC readers are devices that read information from mobile devices or bank cards equipped with an NFC chip. They exchange data at distances of up to a few centimeters. This means you can withdraw and transfer money or pay for services without inserting a card: just hold your phone or card to the corresponding panel on the ATM. It is faster and easier.

View from the inside

In the dark, the module can be illuminated so that it is easy to spot.

Of course, NFC modules fail too. The catch is that they report very little diagnostic information, so their failures cannot be caught by a "bad status". That is why we built a service to detect undiagnosed failures in ATMs that otherwise look technically sound. It tracks anomalies in the NFC transaction load of each individual ATM. The service is based on machine learning: once an hour it identifies devices with abnormal transaction behavior and sends events to Jarvis. Jarvis creates incidents, and then Wall-E, the automatic-processing FDM robot, picks them up for resolution. One interesting challenge was monitoring ATMs with few NFC transactions, where the expected load is hard to predict.
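The article does not describe the ML model, so the sketch below swaps in a much simpler statistical check purely to illustrate why low-traffic ATMs are hard. It assumes hourly NFC transaction counts are Poisson-distributed and flags an ATM when the count observed over a window is implausibly low given its history. The function names, the threshold alpha and the Poisson assumption are all hypothetical. The CDF is computed in log space so that busy ATMs with large expected counts do not underflow.

```python
import math


def poisson_cdf(k: int, lam: float) -> float:
    """P(X <= k) for X ~ Poisson(lam), summed in log space for stability."""
    if lam == 0:
        return 1.0
    log_terms = [i * math.log(lam) - lam - math.lgamma(i + 1) for i in range(k + 1)]
    m = max(log_terms)
    return min(1.0, math.exp(m) * sum(math.exp(t - m) for t in log_terms))


def nfc_silence_suspicious(history, observed, hours=1, alpha=0.001):
    """Flag an ATM whose NFC count over `hours` is implausibly low.

    history  -- past hourly NFC transaction counts for this ATM
    observed -- NFC transactions seen during the last `hours` hours
    """
    if not history:
        return False  # no baseline yet, so no verdict
    lam = sum(history) / len(history) * hours  # expected count over the window
    return poisson_cdf(observed, lam) < alpha
```

Under these assumptions, an ATM averaging 20 NFC transactions per hour is flagged after a single silent hour, while one averaging 0.5 per hour is not flagged after a silent hour but is flagged after a silent 24-hour window. A quiet ATM simply needs a longer observation window before silence means anything.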

As a result, the modules are now monitored for deviations around the clock. Right from the start, we reached an automation level of more than 70%, and more than 80% of failures were eliminated remotely.

Overall, NFC-related incidents are only a small part of our technical monitoring. For scale: every 3 to 5 minutes (depending on the source) we check the sensor states of equipment nodes on the host, key inventory parameters of each device, and states reported by agents of various client-server systems. That adds up to about 200 states per device across a network of about 20,000 devices, or up to 55 billion checks per month. These checks produce about 300,000 incidents a month, which in turn generate about 430,000 signals about changes in the state of SDs, requests and incidents; these are processed at an 80% automation level. But that is a story for another time.
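The monthly check count follows from simple arithmetic. The sketch assumes about 200 checked states per device, about 20,000 devices, a 30-day month and that every state is polled at the given interval; the constant and function names are hypothetical.

```python
DEVICES = 20_000                  # approximate network size
STATES_PER_DEVICE = 200           # approximate number of checked states per device
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


def checks_per_month(interval_minutes: int) -> int:
    """Total state checks in a month if every state is polled at this interval."""
    cycles = MINUTES_PER_MONTH // interval_minutes
    return DEVICES * STATES_PER_DEVICE * cycles


def automated(signals: int, automation_pct: int) -> int:
    """Number of signals handled automatically at a given automation level."""
    return signals * automation_pct // 100


print(checks_per_month(3))      # 57600000000
print(checks_per_month(5))      # 34560000000
print(automated(430_000, 80))   # 344000
```

A 3-minute interval gives 57.6 billion checks and a 5-minute interval gives 34.56 billion, which brackets the "up to 55 billion" figure; at 80% automation, roughly 344,000 of the 430,000 monthly signals are handled without a human.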
