How the bank “broke”

Tomcat

A failed IT infrastructure migration corrupted 1.3 billion bank customer records. The cause was insufficient testing and a cavalier attitude towards complex IT systems. Cloud4Y explains how it happened.

In 2018, the British bank TSB realized that its two-year-old “divorce” from the Lloyds banking group (the two companies had merged in 1995) was costing too much. TSB was still tied to its former partner through hastily cloned Lloyds IT systems. Worst of all, the bank had to pay “alimony”: annual license fees of $127 million.

Few people like paying money to their exes, so at 18:00 on April 22, 2018, TSB began the final stage of an 18-month plan that was supposed to change everything. The plan was to transfer billions of customer records to the IT system of the Spanish company Banco Sabadell, which had bought TSB for $2.2 billion back in 2015.

Banco Sabadell chairman Josep Oliu announced the upcoming event two weeks before Christmas 2017, at a festive staff meeting in a prestigious conference hall in Barcelona. The key migration tool was to be a new version of Proteo, the system developed by Banco Sabadell. It was even renamed Proteo4UK specifically for the TSB migration project.

At the presentation of Proteo4UK, Banco Sabadell chief executive Jaime Guardiola Romojaro boasted that the new system was a large-scale project with no equivalent in Europe, on which more than 1,000 specialists had worked, and that its implementation would give a significant boost to Banco Sabadell's growth in the UK.

April 22, 2018 was set as migration day. It was a quiet Sunday evening in the middle of spring. The bank's IT systems were taken offline while records were transferred from one system to the other. Public access to accounts was to be restored late on Sunday, after which the bank was expected to return to normal service slowly and smoothly.

But while Oliu and Guardiola Romojaro were happily talking up Proteo4UK from the stage, the employees responsible for the migration were very nervous. The project, which had taken 18 months, was seriously behind schedule and over budget. There was no time left for additional tests. Yet transferring all of a company's data (billions of records, remember) to another system is a Herculean task.

It turned out that the engineers were nervous for good reason.

A stub on the site that customers saw for too long.

Twenty minutes after TSB reopened access to accounts, fully confident that the migration had gone smoothly, the first reports of problems arrived.

People's savings suddenly disappeared from their accounts. Small purchases were incorrectly recorded as expenses of several thousand. Some people logged in and saw not their own bank accounts but the accounts of complete strangers.

At 21:00, TSB representatives informed the UK financial regulator, the Financial Conduct Authority (FCA), that the bank was in trouble. But the FCA had already noticed: TSB had clearly made a serious mess, and its customers were the ones paying for it. Naturally, they began to complain on social networks (these days it takes no effort to post a few lines on Twitter or Facebook). At 23:30, the FCA was contacted by another financial regulator, the Prudential Regulation Authority (PRA), which had also sensed that something was wrong.

It was well after midnight before the regulators managed to get through to one of the bank's representatives and ask the only question that mattered: “What the hell is going on?”

It took time to grasp the scale of the disaster, but we now know that 1.3 billion records belonging to 5.4 million customers were corrupted during the migration. For at least a week, customers could not manage their money from computers or mobile devices. Those who could not make loan payments ended up with late fees and, in many cases, a blemish on their credit history.

This is what TSB's online customer banking looked like

When the glitches first appeared, and for some time afterwards, bank representatives insisted that the problems were “intermittent.” Three days later, the bank issued a statement that all systems were back to normal. But customers continued to report problems. It was not until April 26, 2018 that the bank's chief executive, Paul Pester, admitted that TSB was "on its knees": the bank's IT infrastructure still had a "capacity problem" that was preventing around a million customers from using its online banking services.

Two weeks after the migration, the online banking application was reportedly still throwing internal errors related to the SQL database. Payment difficulties, especially with business and mortgage accounts, continued for up to four weeks. Journalists also found out that TSB had rejected an offer of help from Lloyds Banking Group at the very beginning of the crisis. Overall, problems with logging into online services and transferring money persisted until September 3.

A little history

The first ATM opened on June 27, 1967, outside a Barclays branch in Enfield

Banking IT systems grow more complex as customers' needs and expectations grow. Some 40 to 60 years ago, we were happy to visit the local branch during business hours to deposit or withdraw cash with a teller.

The amount of money in an account corresponded directly to the notes and coins handed over to the bank. Household accounts could be kept with pen and paper, and computer systems were not accessible to customers. Bank employees entered data from passbooks and other media into the machines that kept count of the money.

But in 1967, for the first time, an ATM was installed in north London that was not located inside the bank's premises. That event changed banking. User experience became a benchmark for the development of financial institutions, and it pushed banks to become more sophisticated in how they handle customers and their money. As long as computer systems were available only to bank employees, the old “paper” way of dealing with customers was good enough. It was only with the advent of ATMs, and later online banking, that the general public gained direct access to bank IT systems.

ATMs were just the beginning. Soon people could skip the queue at the counter by simply calling the bank by phone. This relied on special cards inserted into a reader and on equipment able to decode the dual-tone multi-frequency (DTMF) signals sent when the user pressed the “1” key (withdraw money) or the “2” key (deposit funds).

Internet and mobile banking have brought customers even closer to the core systems that power banks. Despite their differing limitations and configurations, all of these systems must interact reliably with each other and with the central mainframe to check account balances, make money transfers, and so on.

Few customers think about how complex the path of their information is when they, for example, log into online banking to view or update details of the money in their account. On login, the data passes through a set of servers. On each transaction, the system duplicates the data to the backend infrastructure, which then does the heavy lifting: moving money from one account to another to pay bills, make payments, and renew subscriptions.

Now multiply this process by several billion. According to data compiled by the World Bank with the help of the Bill and Melinda Gates Foundation, 69 percent of adults worldwide have a bank account. Each of these people has bills to pay. One pays a mortgage or transfers money for children's clubs, another pays for a Netflix subscription or rents a cloud server. And these people are not all customers of the same bank.

Numerous internal IT systems of one bank (mobile banking, ATMs, etc.) must not simply interact with each other. They need to interact with other banking systems in Brazil, China, and Germany. A French ATM should be able to dispense money that is on a bank card issued somewhere in Bolivia.

Money has always been global, but the system has never been this complex. The number of ways to use bank IT systems keeps growing, while the old ways remain in use. A bank's success depends largely on how maintainable its IT infrastructure is and on how well the bank copes with a sudden failure that takes the system offline.

No tests - prepare for problems

Banco Sabadell CEO Jaime Guardiola (left) was confident that everything would go smoothly. It did not work out.

TSB's computer systems were not good at recovering from problems quickly. There were software glitches, of course, but in reality the bank “broke” because of the excessive complexity of its IT systems. According to a report prepared in the early days of the massive outage, “the combination of new applications, increased use of microservices combined with the use of two Active/Active data centers led to complex risk in production.”

Some banks, such as HSBC, operate globally and therefore also have very complex, interconnected systems. But those systems are regularly tested, migrated and updated, according to one HSBC IT manager in Lancaster. He sees HSBC as a model for how other banks should manage their IT systems: by dedicating staff and time to the job. At the same time, he admits that for a smaller bank, especially one with no migration experience, getting this right is very difficult.

The TSB migration was difficult, and according to experts the bank's staff may simply not have had the skills to match that level of complexity. On top of that, they did not bother to verify their solution or rehearse the migration in advance.
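To make "testing a migration in advance" concrete, here is a minimal sketch of one check a rehearsal could include: reconciling the records in the old system against the records in the new one. It is an illustration only, not a description of TSB's or Banco Sabadell's actual tooling. The names `record_fingerprint` and `reconcile`, the record layout, and the sample account data are all hypothetical.

```python
import hashlib
import json


def record_fingerprint(record: dict) -> str:
    """Return a stable hash of a record, independent of key order."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(source: dict, target: dict) -> dict:
    """Compare two {record_id: record} mappings and list the discrepancies."""
    source_ids, target_ids = set(source), set(target)
    return {
        # Records that exist in the old system but never arrived.
        "missing": sorted(source_ids - target_ids),
        # Records that appeared in the new system from nowhere.
        "unexpected": sorted(target_ids - source_ids),
        # Records present on both sides whose content differs.
        "changed": sorted(
            rid
            for rid in source_ids & target_ids
            if record_fingerprint(source[rid]) != record_fingerprint(target[rid])
        ),
    }


# Hypothetical sample data: one record is lost and one amount is corrupted.
old_system = {
    "acc-1": {"owner": "A. Smith", "balance_pence": 120_000},
    "acc-2": {"owner": "B. Jones", "balance_pence": 4_550},
    "acc-3": {"owner": "C. Patel", "balance_pence": 0},
}
new_system = {
    "acc-1": {"balance_pence": 120_000, "owner": "A. Smith"},
    "acc-2": {"owner": "B. Jones", "balance_pence": 455_000},
}

report = reconcile(old_system, new_system)
print(report)
```

Run against a copy of production data before the real cut-over, a check like this would flag lost or altered records while the old system can still be restored. A real migration would also need to account for deliberate schema changes between the two systems, which this sketch ignores.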

Speaking in the British Parliament about the bank's problems, Andrew Bailey, chief executive of the FCA, confirmed this suspicion. Bad code probably caused only the initial problems at TSB, but because the systems of the global financial network are interconnected, the errors spread and could not be reversed. The bank kept running into unexpected errors elsewhere in its IT architecture. Customers received messages that were meaningless or had nothing to do with their problems.

Regression testing could have helped prevent the disaster by catching bad code before it reached production and caused damage that could not be rolled back. Instead, the bank chose to run across a minefield it did not even know was there. The consequences were predictable. Another problem was cost “optimization.” How did it show itself? The bank had earlier decided to do away with the backup copies stored at Lloyds, because they “ate up” too much money.

British banks (and others too) aim for four-nines availability, that is, 99.99%. In practice, this means the IT system must be available at all times, with no more than about 52 minutes of downtime per year. A “three nines” system, at 99.9%, does not look much different at first glance. But in reality it allows almost 9 hours of downtime per year. For a bank, “four nines” is good, but “three nines” is not.
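The arithmetic behind those figures is easy to check. The short sketch below assumes a 365-day year and treats availability as a simple fraction of total time. The function name `allowed_downtime_minutes` is made up for this illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for label, percent in [("three nines", 99.9), ("four nines", 99.99)]:
    minutes = allowed_downtime_minutes(percent)
    print(f"{label} ({percent}%): {minutes:.1f} min/year, {minutes / 60:.2f} h/year")
```

Each extra nine cuts the downtime budget by a factor of ten: roughly 525.6 minutes (about 8.76 hours) per year at 99.9%, against roughly 52.6 minutes at 99.99%.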

But every time a company changes its IT infrastructure, it takes a risk, because something can always go wrong. Making fewer changes helps avoid problems, and the changes that are unavoidable need careful testing. This is the point British regulators have focused on.

Perhaps the easiest way to avoid downtime is simply to make fewer changes. But every bank, like any other company, has to keep introducing useful features for its customers and its own business in order to stay competitive. At the same time, banks are still obliged to look after their customers: to protect their savings and personal data and to keep services comfortable to use. As a result, organizations have to spend a great deal of time and money keeping their IT infrastructure healthy while also rolling out new services.

The number of reported technology failures in the UK financial services sector rose by 187 percent between 2017 and 2018, according to data released by the UK's Financial Conduct Authority. The most common cause is problems with new functionality. At the same time, it is critical for banks to keep all services running without interruption and to report transactions almost instantly. Customers always get nervous when their money is left hanging somewhere, and a customer who is nervous about money is always a sign of trouble.

A few months after the TSB failure (by which time the bank's chief executive had resigned), UK financial regulators and the Bank of England issued a discussion paper on operational resilience. In it they raised the question of how far banks have gone in the pursuit of innovation, and whether they can guarantee the stable operation of the systems they have now.

The document also proposed changes to legislation that would hold individuals within a company accountable for what goes wrong in that company's IT systems. British parliamentarians explained it this way: “When you are personally responsible, and you can go bankrupt or go to prison, this will greatly change the attitude towards work, including increasing the amount of time devoted to the issue of reliability and safety.”

Results

Every update and patch comes down to risk management, especially when hundreds of millions of dollars are involved. If something goes wrong, it can be costly in both money and reputation. All of this seems obvious, and the failed migration should have taught the bank a great deal.

It should have. But it did not. In November 2019, TSB, which had returned to profitability and was slowly repairing its reputation, “delighted” customers with yet another IT failure. This second blow meant that the bank would be forced to close 82 branches in 2020 to cut costs. It could simply have chosen not to skimp on IT specialists.

Stinginess with IT comes at a cost in the end. TSB reported a loss of $134 million in 2018, compared with a profit of $206 million in 2017. Post-migration costs, including customer compensation, the correction of fraudulent transactions (which rose sharply during the chaos), and third-party assistance, totaled $419 million. The bank's IT provider was also billed $194 million for its role in the crisis.

Whatever lessons are learned from the TSB failure, disruptions will still occur. They are inevitable. But with testing and good code, crashes and downtime can be greatly reduced. Cloud4Y, which often helps large companies migrate to cloud infrastructure, understands how important it is to move quickly from one system to another. That is why we can carry out load testing, use a multi-level backup system, and offer other options that let you check everything possible before the migration starts.