How to make a payment system with your own hands

From scratch. Well, isn't it a dream?

True, as always on the way to a dream, we had to swim most of the route through rivers full of hidden rocks, and ride part of the way on bicycles we had built ourselves, reinventing a wheel or two. Along the way we gained a lot of interesting and useful knowledge that we would like to share with you.

We will tell you how we wrote the entire processing core of RBKmoney Payments, as we named it: how we made it resilient to load and hardware failures, and how we achieved almost linear horizontal scaling.

And, finally, how we got all of this off the ground without forgetting about the comfort of those inside: our payment system was built with the idea that it should be interesting first of all to the developers, the people who create it.

With this post we open a series of articles in which we will share both specific technical details, approaches, and implementations, and our experience of developing large distributed systems in general. This first article is an overview; in it we outline the milestones that we will later cover in detail, and sometimes in great detail.

Disclaimer
No less than five years have passed since the last publication on our blog. Since then our development team has changed significantly, and new people are now at the helm of the company.

When you create a payment system, you need to take a great many things into account and develop many solutions: from a processing core able to handle thousands of simultaneous parallel requests to debit money, down to interfaces that are friendly and clear to users. Trivial, if you ignore the small nuances.

The harsh reality is that behind payment processing stand payment organizations that do not welcome such traffic with open arms and sometimes even ask us "to send no more than 3 requests per second." And the interfaces are used by people who may be paying for something on the Internet for the very first time, so any UX flaw, ambiguity, or delay is a reason for them to panic.

A shopping cart you can put your groceries in even during a tornado

Our approach to building payment processing is to make it possible to always start a payment. It does not matter what is going on inside: a server burned down, an admin tangled up the network, the power went out in the building/district/city, or we, hmm... ran out of diesel. It doesn't matter. The service will still let you start the payment.

The approach sounds familiar, doesn't it?

Yes, we were inspired by the concept described in the Amazon Dynamo paper. The folks at Amazon also built everything around the idea that a user must be able to put a book in the cart no matter what horrors were happening on the other side of the monitor.

Of course, we do not violate the laws of physics and have not figured out how to disprove the CAP theorem. There is no guarantee a payment will be processed immediately: there may be problems on the banks' side. But the service will create the request, and the user will see that everything worked. Admittedly, we are still a dozen backlog entries of technical debt away from the ideal; to be honest, we do occasionally answer with a 504.
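
To make the idea concrete, here is a minimal sketch, in Erlang like the rest of our online processing, of the "always accept" pattern: the request is acknowledged immediately, and the actual work with the bank happens asynchronously. The module, function, and message names are illustrative, not our real code.

-module(payment_intake).
-behaviour(gen_server).

-export([start_link/0, start_payment/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Returns an id right away; the authorization with the bank happens
%% later, so the caller always gets an answer even when downstream is slow.
start_payment(Params) ->
    PaymentId = erlang:unique_integer([positive, monotonic]),
    gen_server:cast(?MODULE, {process, PaymentId, Params}),
    {ok, PaymentId}.

init([]) ->
    {ok, #{}}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

%% In real life the intent would be persisted and then retried against
%% the bank; here we merely record it in the server state.
handle_cast({process, PaymentId, Params}, State) ->
    {noreply, State#{PaymentId => Params}}.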

Let's look into the bunker, since there's a tornado outside the window

Our payment gateway had to be made always available. Whether the peak load has grown, something has crashed, or hardware has gone off to the DC for maintenance, the end user should not notice it at all.

We solved this by minimizing the number of places where system state is stored: stateless applications, obviously, are easy to scale out horizontally.

The applications themselves run in Docker containers, whose logs we reliably ship to a central Elasticsearch store; they find each other through Service Discovery and exchange data over IPv6 inside the Macroservice.

All the microservices, assembled and working together with the supporting services, form the Macroservice, which ultimately provides the payment gateway you see from the outside in the form of our public API.
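
As a small illustration of the service discovery piece: we use Consul (more on that below), and its HTTP API lets any application ask for the healthy instances of a service. A hedged sketch follows; the service name and addresses are assumptions for the example, and decoding the JSON body is left to whatever library you prefer, jsx for instance.

-module(discovery_sketch).
-export([healthy_instances/1]).

%% Ask the local Consul agent (HTTP API on its default port 8500) for
%% all passing instances of a service. Requires the inets application:
%% application:ensure_all_started(inets).
healthy_instances(ServiceName) ->
    Url = "http://127.0.0.1:8500/v1/health/service/" ++ ServiceName ++ "?passing",
    {ok, {{_Version, 200, _Reason}, _Headers, Body}} =
        httpc:request(get, {Url, []}, [], []),
    %% Body is a JSON array describing the healthy instances; decode it
    %% with a JSON library to extract each node's address and port.
    Body.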

Order is kept by SaltStack, which describes the entire state of the Macroservice.

We will come back with a detailed description of this whole setup.

It's easier with apps.

But if you do have to store state somewhere, it must be in a database where the cost of losing part of the nodes is minimal, where no master nodes hold the data, and which answers requests with predictable latency. Dreaming, are we? On top of that, it should not require much maintenance, and Erlang developers should like it.

Yes, did we mention that the entire online part of our processing is written in Erlang?

As many have probably already guessed, we didn’t have a choice as such.

All the state of the online part of our system is stored in Basho Riak. We will tell you how to cook Riak without breaking your fingers (your brain you will break for sure), but for now let's move on.
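
For a first taste, a minimal sketch of what talking to Riak from Erlang looks like with the official riak-erlang-client (riakc); the bucket, key, and value are made up for the example, and the client library is assumed to be on the code path.

%% Connect to a Riak node over protocol buffers (default port 8087).
{ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087),

%% Store the state of a (hypothetical) payment under a bucket/key pair.
Obj = riakc_obj:new(<<"payments">>, <<"payment-42">>, <<"pending">>),
ok = riakc_pb_socket:put(Pid, Obj),

%% Read it back; since there is no master, any node can serve this request.
{ok, Fetched} = riakc_pb_socket:get(Pid, <<"payments">>, <<"payment-42">>),
<<"pending">> = riakc_obj:get_value(Fetched).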

Where's the money, Lebowski?

If you had an infinite amount of money, you could probably build infinitely reliable processing. But that's not certain. What is certain is that nobody gave us much money: just enough for servers of the "decent quality, but made in China" grade.

Fortunately, this had its upsides. When you realize that, as a developer, you will have some difficulty getting hold of 40 physical cores with 512 GB of RAM, you have to improvise and write small applications. But you can deploy as many of them as you like: the servers are inexpensive anyway.

And in a world like ours, servers of any kind tend not to come back to life after a reboot, or to suffer a power supply failure at the most inopportune moment.

With all these horrors in mind, we learned to build the system on the assumption that any part of it can suddenly break. It is hard to recall this approach ever causing any inconvenience in developing the online part of the processing. Perhaps that has something to do with the Erlangists' philosophy and their famous Let It Crash concept?
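
For readers who have not met that philosophy: rather than defending against every possible failure, an Erlang supervisor simply restarts a worker whenever it dies. A canonical minimal sketch, with a hypothetical worker module:

-module(payment_sup).
-behaviour(supervisor).

-export([start_link/0]).
-export([init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% one_for_one: if a child crashes, restart just that child,
    %% at most 5 times within 10 seconds before giving up.
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    Worker = #{id => payment_worker,
               start => {payment_worker, start_link, []}, % hypothetical module
               restart => permanent},
    {ok, {SupFlags, [Worker]}}.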

But it’s easier with servers.

So: we figured out where to run the applications; there are many of them and they scale. The database is distributed too: it has no master, burned-out nodes do not bother us, and we can quickly load a cart with servers, drive to the DC, and pitchfork them into the racks.

But you can't do that with disk arrays! The failure of even a small storage unit is a failure of part of the payment service, and that we cannot afford. Duplicate the storage systems? Too impractical.

Nor did we want to pay for expensive branded disk arrays. Even out of a simple sense of aesthetics, they would not look right next to racks packed with neat rows of no-names. And all of it is unreasonably expensive.

In the end we decided not to use disk arrays at all. All our block devices run under CEPH on identical inexpensive servers, which we can rack in whatever quantity we need.

With network hardware the approach is not much different. We take mid-range devices and get equipment that is good enough for the task at a very low cost. If a switch fails, a second one runs in parallel, and OSPF configured on the servers ensures convergence.

Thus we got a convenient, fault-tolerant, and universal building block: a rack full of simple cheap servers and a few switches. The next rack is the same. And so on.

Simple, convenient and overall very reliable.

Listen to the rules of conduct on board

We never wanted to simply come to the office, do the work, and get paid. The financial side is very important, but it cannot replace the satisfaction of a job well done. We had written payment systems before, including at previous jobs, and we had a rough idea of what we did not want. We did not want standard, albeit proven, solutions; we did not want boring enterprise.

So we decided to bring maximum freshness into the work. In payment system development, new solutions are often restricted: why do you need Docker at all, they say, let's do without it. And in general. Insecure. Ban it.

We decided not to ban anything and, on the contrary, to encourage everything new. This is how production came to hold a Macroservice built from a huge pile of applications in Docker containers managed through SaltStack, Riak clusters, Consul as Service Discovery, our own implementation of request tracing in a distributed system, and many other wonderful technologies.
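
We will cover the request tracing in detail separately; as a rough sketch of the general idea (not our actual implementation), the service at the edge mints a trace id, and every inter-service call carries it along, here as an HTTP header. The URL and header name are illustrative.

%% Mint a trace id once at the edge of the Macroservice...
TraceId = integer_to_list(erlang:unique_integer([positive])),

%% ...and pass it with every call to a downstream microservice, so the
%% whole distributed request can later be stitched together in the logs.
Headers = [{"x-trace-id", TraceId}],
{ok, _Response} = httpc:request(post,
    {"http://invoicing.example/v1/operations", Headers,
     "application/json", <<"{}">>},
    [], []).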

And all of this is secure enough that we can publish a bug bounty program on hackerone.com without shame.

Of course, the very first steps along this road turned out to be strewn with a completely indecent number of rakes to step on. We will definitely tell you how we made it past them; we will also explain, for example, why we have no test environment, and how the entire processing can be deployed on a developer's laptop with a simple make up. As well as a bunch of other interesting things.

Thank you for choosing our airline!

PS: Original content! All photos in the post are scenes from the life of our office.
 