We are fixing ATMs again



ATMs break down periodically. Sometimes on their own, simply from wear and tear of mechanical parts; more often, with the help of bank customers. Crumpled bills, paper clips, or duct tape get stuck in them. The Windows they run on can crash, too. In short, they break. But as the saying goes, if you pick it up in time, it doesn't count as dropped, so we fix them fast.

More precisely, a robot repairs the ATM first. Typical sensor readings trigger an incident, and the robot starts a recovery program, usually a reboot or an error reset on a specific module. If the condition persists after the reboot, or if the failure recurs more often than statistically expected, an alert goes to an engineer or operator.
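The first-line logic described above can be sketched roughly like this. Module names, the look-back window, the recurrence limit, and the reboot stub are all illustrative assumptions, not our actual code:

```python
from collections import defaultdict, deque
import time

RECURRENCE_LIMIT = 3        # assumed: alerts within the window before escalating
WINDOW_SECONDS = 24 * 3600  # assumed: look-back window for "too frequent" failures

# Per-(ATM, module) timestamps of recent failures.
failure_log = defaultdict(deque)

def reboot_module(atm_id, module):
    """Stand-in for a remote reboot/error reset; returns True if the module recovered."""
    print(f"[{atm_id}] rebooting {module}")
    return True  # stubbed out for the sketch

def handle_sensor_alert(atm_id, module, now=None):
    """First-line automatic recovery: reboot, or escalate if failures recur too often."""
    now = time.time() if now is None else now
    log = failure_log[(atm_id, module)]
    log.append(now)
    # Drop failures that fell out of the look-back window.
    while log and now - log[0] > WINDOW_SECONDS:
        log.popleft()
    if len(log) >= RECURRENCE_LIMIT:
        return "escalate_to_engineer"   # failing more often than expected
    if reboot_module(atm_id, module):
        return "recovered"
    return "escalate_to_engineer"       # reboot did not help
```

The point of the sketch is the two escalation conditions: a reboot that does not help, and a failure that keeps coming back inside the window even though each individual reboot "worked".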

If physical repair is needed, the robot runs diagnostics, writes a report, and tells the engineer which spare parts to bring.

My name is Pavel Slyusar, and I head the Service Development Department at Multicarta. We service about twenty thousand ATMs and have learned to return them to operation almost at lightning speed. Today I will tell you how we arrived at our "rapid response system" and what tools we use in it.

What can break in an ATM


Of all ATM problems, roughly 60% are related to money: its movement, jamming, shortage or, conversely, oversupply. These are resolved with the help of cash collectors.

In general, the amount of cash in an ATM is predicted very well. Collection is a complex and expensive operation, and banks have learned to schedule it for the days when cash is already running low. In most cases they do it perfectly. But from time to time, the task of swapping the banknote cassettes in time is complicated by additional circumstances.

First, sometimes the ATM is accessible to customers but not to cash collectors and engineers. For example, on weekends, the buildings and offices whose security perimeter contains the ATM's service area are often closed. You can withdraw money with a card, but you cannot take the ATM off the alarm system to load or unload it. And then it doesn't matter how well the service schedule was drawn up.

Second, customer behavior occasionally becomes unpredictable. On the back of some piece of news, people may all rush to withdraw (or deposit) cash on the same day. That, of course, wrecks the projected collection plan almost entirely.

The remaining roughly 40% of failures involve other modules and functions.

For example, an ATM may lose its connection to the host or the router.

Often the rollers break, or the chip contacts of the card reader that the card is inserted into fail (contactless readers are safer in this respect).

You may also run into problems with:
  • The receipt printer and its consumables.
  • Reading the customer's card.
  • The complex machinery for accepting and dispensing cash.
  • The keypad.
  • Communications.
  • Software, and so on.

Who services ATMs


There are five key roles:
  1. A center where coordinators, incident managers, robots, and admins monitor the software installed on each ATM. It is a kind of support line that helps engineers "in the field" and makes sure the software on every device is deployed and up to date. Engineers are coordinated and collector routes are planned here as well.
  2. Cash collectors: mostly brave guys in an armored car with a limited set of actions. What they do best is protect the money, and if anything happens, they act strictly by the book.
  3. Service engineers: highly specialized professionals with highly specialized equipment. They know exactly how to test an ATM, how to fix it on the spot, and how to remove a module for replacement so they can tinker with it back at the office if they can't figure it out in the field. An engineer going out to fix a loaded ATM is not always accompanied by a cash collector, because joint visits are very expensive. Most often, the collectors first unload the money from the ATM, and then an engineer arrives who can safely spend several hours repairing it.
  4. Device managers. These are employees in the regions (mostly bank staff) who use a special program to pick the optimal spot for an ATM. They then visit the location to see how the ATM will look and how visible it will be, agree on the installation with the landlord, and handle complaints from the customers who use that ATM.
  5. A centralized regional structure of the security department, responsible for selecting and installing the means that limit tampering with the ATM: anti-skimming overlays, alarms, video surveillance, and so on.

ATM diagnostics are the most important thing for incident management


Unlike many other types of equipment, ATMs can be diagnosed remotely.

In other words, at any given time, it is known what state each of the nodes is in: issuing, receiving, card readers, and so on.

This means that we can update the software and fix many errors remotely.

It also lets us build routes and instructions for field engineers as precisely as possible, so they can eliminate the errors we couldn't handle remotely.

When we first started working with ATMs in 2008, we used the classic scheme


In that scheme, the ATM was treated like an ordinary user who periodically breaks something.

When we had only a couple of hundred devices in service, this principle worked just fine. As soon as an incident occurred, a specially trained person saw all the information about it in a list; if they missed it, every change in the device's status was duplicated to their email. They could then open the incident and do whatever was needed: resolve, fix, restart, and so on.

But then our installed base grew dramatically, and the number of ATMs we support reached twenty thousand. We split them up by regions and lines, but the time a person spent searching for changes still exceeded the time needed to process them by a factor of three or four, and that is clearly wrong.

So we sat down and started thinking about how to make the machine do all the searching, freeing people for more intellectual work.

We thought and thought and came up with the idea of switching to an event model. This is not the most popular solution: almost no service desk uses it as its primary one, and the classic incident management interfaces (the ticket, the list of requests, the list of incidents, and so on) remain central everywhere. Which is a great pity, because for us it was a literal lifesaver.

The essence of the solution is this: the list of incidents ceased to exist for our employees, and an event feed appeared instead. It most resembles a messenger for communicating with the system: in real time, a person receives a message with a short description of an incident that needs a quick response.

And it works like this


Three key concepts of the system:
  1. An incident is a long-term entity that exists from the moment a problem occurs until it is resolved.
  2. An event is an instant signal that occurs when any changes occur within an incident and should attract the person's attention. It doesn't have a life cycle — it occurs, is captured, processed, and disappears. There can be multiple events within a single incident.
  3. A request — it also exists within the incident and is directed at a contractor who will deal with the ATM on site or remotely. Each request has certain attributes: type, scheduled time, overdue status, current completion time, and so on.
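A minimal sketch of these three entities as data structures; the field names and types are assumptions chosen for illustration, not our actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class Request:
    """Work directed at a contractor; lives inside an incident."""
    request_type: str
    scheduled_for: datetime
    completed_at: Optional[datetime] = None

    @property
    def overdue(self) -> bool:
        return self.completed_at is None and datetime.now() > self.scheduled_for

@dataclass
class Event:
    """Instant signal: occurs, is captured, processed, and disappears."""
    incident_id: str
    description: str
    priority: int  # lower number = more urgent

@dataclass
class Incident:
    """Long-lived entity: exists from the moment of failure until resolution."""
    incident_id: str
    atm_id: str
    opened_at: datetime
    resolved: bool = False
    requests: list = field(default_factory=list)

# Example: an incident with one request that is already an hour overdue.
inc = Incident("INC-1", "ATM-11222", datetime.now())
inc.requests.append(Request("engineer_visit", datetime.now() - timedelta(hours=1)))
```

Note the asymmetry the text describes: the incident and its requests carry state over time, while an event has no life cycle of its own.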

As soon as any change occurs in the service desk (to an incident, a request, etc.), a message about it is pushed in condensed form into the employees' feed, which shows events sorted by priority, time of occurrence, complexity, focus, and so on.

The employee selects one of them, goes to a separate page, receives a fully prepared message about the incident, performs the action that needs to be done right now, without digging into what happened before, and closes the window.
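The feed itself can be pictured as a priority queue: events are published as they occur and captured most-urgent-first. A toy sketch, where the numeric priority scheme and the sample event texts are assumptions:

```python
import heapq
from itertools import count

class EventFeed:
    """Events are pushed as they occur and popped in (priority, arrival) order."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tiebreaker: among equal priorities, earlier first

    def publish(self, priority, description):
        heapq.heappush(self._heap, (priority, next(self._order), description))

    def next_event(self):
        """The employee captures the most urgent pending event."""
        return heapq.heappop(self._heap)[2]

feed = EventFeed()
feed.publish(2, "ATM 11222: receipt printer out of paper")
feed.publish(1, "ATM 11222: no engineer assigned for a long time")
```

Whatever an employee pops next is, by construction, the thing most worth their attention right now, which is exactly what removes the "searching" step.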

Suppose the ATM's card reader has stopped working. The system receives the signal, knows that the first step is a restart, and performs it on its own. The incident is generated, the signal is sent, the reboot is done. Now suppose the card reader did not recover.

Then the system sends a request to the engineers. Suppose something went wrong in the program and the engineers did not confirm that someone would come to fix it soon. Then an event about our service partner is sent to the operator. It looks like a line in a separate interface that reads something like: "ATM 11222, no engineer has been assigned for a long time; call the service partner's coordinators and find out what is blocking the assignment of this request."

The operator clicks the link, opens the card where some actions have already been performed, and sees the latest comment. That information is enough to call the service partner and ask them to assign an engineer who will go and find out what is happening with the ATM, without diving into deep research on completed and outstanding work. The partner opens their program, sees that the error was caused by human factors, and tells the operator: "Please excuse us, that was our mistake. We have already assigned Petrov, he will head to the site shortly, and you will receive the details through the data exchange." The operator can now close the window and forget about this incident.

Then either engineer Petrov fixes the ATM and no one ever thinks about this story again, or he fails, and an event with that information comes back to the interface. The operator opens it in the usual order and works it through, again without reading the historical references.

We can also configure a business process so that a person only has to react to, say, the third failure. The card reader glitched, the robot rebooted it, that didn't help; it rebooted again, no luck; it rebooted a third time and created an event. A person opens this event, carefully reads the logs, finds the anomalies among them, and possibly makes a decision on the incident. They can also leave a comment for the business process automation team suggesting it would be a good idea to add a branch here.
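The escalation chain in this walkthrough (robot reboot, then engineer assignment, then an operator call) can be sketched as an ordered list of handlers, each given a chance to resolve the incident before the next level is involved. The handler names and context flags below are invented for illustration:

```python
def robot_reboot(ctx):
    """Step 1: the robot's automatic reboot."""
    return ctx.get("reboots_help", False)

def engineer_assignment(ctx):
    """Step 2: a request to the service partner's engineers."""
    return ctx.get("engineer_assigned", False)

def operator_call(ctx):
    """Step 3: the operator phones the partner's coordinators."""
    return True  # a human always moves the incident forward somehow

ESCALATION_CHAIN = [robot_reboot, engineer_assignment, operator_call]

def resolve(ctx):
    """Walk the chain until some level handles the incident."""
    for handler in ESCALATION_CHAIN:
        if handler(ctx):
            return handler.__name__
    return "unresolved"
```

The "react only on the third failure" rule from the paragraph above is just a variation of the same idea: the robot's level retries several times before it gives up and lets the next level see an event.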

We have incidents in which a person never appears at all: the entire flow is processed automatically. There are incidents where a single employee handles everything.

And there are also those that connect up to five or six operators independently of each other during different incident life cycles.

The most important advantage of such a system is that employees no longer spend time searching


And thanks to this, we have cut operators' labor costs by about 60%.

We no longer have a situation where a person sits watching a dashboard or a ticket list in the service desk for things flipping from green to red and back. We passed that stage back in 2012.

In addition, we can now see the real picture of human labor costs, based not on the incoming flow of signals into the system but on the operators' actual actions. We know how many events the system produced, how many of them each employee captured, and who spent how much time.

And this was the first global step of our optimization.

And then it's time for total (well, almost) automation


When we optimized the search, most of our events were not yet automated.

At first, the system only fired hardcoded signals that something had happened somewhere and that a person needed to step in and figure out what it was.

The next step was this: we hardcoded where these events came from at the moment of specific changes, and determined the sequence and scope of the events.

Then we built the so-called automation tree: a binary decision tree that routes different data sources down different branches, along which the incident or request is analyzed step by step. The system walks the tree and generates events for processing at the appropriate points.

As soon as we had built the tree and measured the key event volumes, we started automating these points from the largest to the smallest. Each branch of the tree that initially ended in manual events gradually became an automated business process.
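A minimal model of such a tree: each node asks a yes/no question about the incident and routes it down a branch until a leaf names the resulting action or event. The questions and actions below are illustrative assumptions, and automating a branch means replacing a "manual event" leaf with an automated one:

```python
class Node:
    """A binary decision node: either a leaf action or a yes/no question."""

    def __init__(self, question=None, yes=None, no=None, action=None):
        self.question, self.yes, self.no, self.action = question, yes, no, action

    def decide(self, incident):
        if self.action is not None:  # leaf: emit an action or event
            return self.action
        branch = self.yes if incident.get(self.question) else self.no
        return branch.decide(incident)

tree = Node(
    question="reboot_helped",
    yes=Node(action="close_incident"),
    no=Node(
        question="engineer_assigned",
        yes=Node(action="wait_for_engineer"),
        no=Node(action="event_for_operator"),  # a manual branch, automated later
    ),
)
```

Walking the tree with an incident's known facts deterministically yields the next step, which is what lets the system "work it out and make comments" without an operator once a branch is described in enough detail.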

Where there were many manual events, we described the business process in more detail and added it to the tree, so that the system could work the event through and leave comments on its own, without an operator.

Of all the events that were initially handled by humans, 84% are now processed automatically. People are left with only 16% of the requests, and these are the most complex stories, involving expensive collection services and the most professional of professional engineers. As a rule, they contain free text that cannot be reduced to a common denominator. If a request says something like "What should I do if an ATM bit me?", a human will have to deal with it. But you don't see requests like that very often.

And this is the right approach, because many formalized requests still contain free text, and it is cheaper for a person to read that text and react correctly than to rely on robots alone.

By the way, about our robots


They perform human functions, we love them very much, and we know each of them by name. Besides, each one has its own story.

No, we're not going crazy; it really is more convenient. It is rather hard to say: "What change did we make to the robot responsible for the first decision in automatic processing of the classic incident handling module?" It is much easier to ask: "How has Wall-E changed?"

By the way, he was the first robot we launched. On the business intelligence side, it was built by the head of the group, Valentin Serdyukov. So the robot is named, on the one hand, after a cute cartoon character, and on the other, after its creator.

We also have Marusya, Ilana, Mila, Thor, Katyusha, Inna, Vika, Julia, Sigma, Vadik and Jarvis working for us.

What we ultimately want to achieve


Now we are gradually introducing machine learning. We have already applied it on the intake side, where free text from requests and complaints is converted into formalized tags. We are gradually training the system to formalize engineers' free-form write-ups based on all the available reports. Then our binary tree of ordered information will let us automate the last layer of events, the ones containing free text.
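As a deliberately simplified stand-in for that ML step, turning free text into formalized tags can be pictured as keyword matching. The real system trains a model; the tag names and keywords below are invented purely for illustration:

```python
# Hypothetical tag vocabulary: which words in a request map to which tag.
TAG_KEYWORDS = {
    "card_reader": ["card reader", "card stuck", "chip"],
    "cash": ["banknote", "cash", "cassette"],
    "printer": ["receipt", "printer", "paper"],
}

def tag_request(text):
    """Map free text to a set of formalized tags by keyword matching."""
    text = text.lower()
    return {tag for tag, words in TAG_KEYWORDS.items()
            if any(w in text for w in words)}
```

Once free text is reduced to tags like these, it can be fed into the same binary tree as any other structured signal, which is the whole point of the last automation layer.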
 