My name is Nikita, and I am a backend developer on the antifraud team at Citymobil. Today I will share the story of how we moved our service out of the monolith into a separate service, how we came to that decision in the first place, and what problems we ran into along the way.
To begin with, I'll tell you a little about our service.
Antifraud 101
Our anti-fraud service is a set of rules for detecting orders that show signs of fraud or match known fraudulent patterns.
Example of driver fraud
Drivers who work with taxi aggregators can receive bonuses for short trips, and fraudsters try to earn these bonuses dishonestly. For example, we may see that one driver has n consecutive orders with the same customer. What exactly they were doing there is not clear to us, but we can say with high confidence that this is fraud and cancel those orders.
Example of client fraud
Customers receive bonuses for inviting new customers to the app. Fraudulent clients register several "new" clients on the same device, for which they can lose all of their accrued bonuses.
To check a driver for fraud, we run all the checks and receive a set of events as a result, each of which indicates that the corresponding pattern was found in the orders passed as input.
Checks can be divided into several types:
- Checking the customer/driver for any changes (for example, a new credit card was added).
- Checking 1..n recent orders.
- Special: checking the correct operation of drivers participating in specific promotions.
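To make the "checks produce events" model above more concrete, here is a minimal sketch of how such checks could be shaped in Go. The Check interface, the Event and Order structs, and every field name here are assumptions for illustration only, not the actual service code.

```go
package antifraud

import "context"

// Order is a simplified view of a completed order (hypothetical fields).
type Order struct {
	ID       int64
	DriverID int64
	ClientID int64
}

// Event is produced when a check finds its fraud pattern in the input orders.
type Event struct {
	RuleName string  // which rule fired
	DriverID int64
	OrderIDs []int64 // orders that matched the pattern
}

// Check inspects a driver's recent orders and returns the events it found.
type Check interface {
	Name() string
	Run(ctx context.Context, driverID int64, orders []Order) ([]Event, error)
}
```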
The rules are configured through a web interface, the "admin panel". And to visually monitor the triggered rules, we created a web page with various reports and a large set of filters.
Adding a new check works as follows: we describe the fraud pattern, implement it in the service, run the new rule in test mode, and observe. If necessary, we adjust the rule and then enable it.
Problems with the previous architecture
Previously, a partner company could only receive its money after all of its drivers had been checked for fraud.
Antifraud ran inside PHP in a single thread. It didn't scale without crutches, and queues of unchecked orders built up during peak hours. The checks themselves were not parallelized in any way, and every new rule inevitably increased the processing time.
The old antifraud had "outgrown" its database model, and working with the database became impossible: it was periodically brought down under load, which in a monolithic architecture caused problems not only for antifraud but for the entire business.
Reports were built slowly. To look up seemingly simple things by hand in the database, you sometimes had to JOIN five or more tables, not to mention more complex queries.
The business was growing, and these problems needed to be solved quickly. We also wanted to check drivers for fraud on the fly (after each trip).
What options did we have:
- Improve what we already had and migrate it to a new data model.
- Write a service from scratch, with the ability to scale out of the box.
The choice fell on moving antifraud into a separate service. Go was chosen as the main tool for parallelizing the checks, and the company has solid expertise in it.
It was decided to move in two stages.
First stage: migration to a new language
We stayed on the old data model (yes, it creaks, but it still works). We started building the service from scratch, and within a couple of months we had moved over the main functionality and most of the checks, bringing the service to a fully operational state.
Now the checks for a single order can be run in parallel, which significantly reduces the processing time.
For comparison: previously it took 6 hours to analyze all drivers, now it takes 25 minutes.
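As an illustration of running the checks for one order in parallel, here is a rough sketch using golang.org/x/sync/errgroup and the hypothetical Check/Event types from the sketch above; it is not the service's real code.

```go
package antifraud

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// runChecks runs every check for one driver's orders concurrently and
// collects the resulting events.
func runChecks(ctx context.Context, checks []Check, driverID int64, orders []Order) ([]Event, error) {
	g, ctx := errgroup.WithContext(ctx)
	results := make([][]Event, len(checks))

	for i, c := range checks {
		i, c := i, c // capture loop variables for the goroutine
		g.Go(func() error {
			events, err := c.Run(ctx, driverID, orders)
			if err != nil {
				return fmt.Errorf("check %s: %w", c.Name(), err)
			}
			results[i] = events
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}

	var all []Event
	for _, events := range results {
		all = append(all, events...)
	}
	return all, nil
}
```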
Second stage: selecting a storage model
For day-to-day work we needed both an OLTP-like database for fraud analysis and an OLAP database for building reports. The existing data schema did not fit the anti-fraud scenarios at all.
The choice was between:
- A new SQL model (properly denormalized) for current work, as well as ClickHouse for reports.
- Elasticsearch for both.
We chose Elasticsearch. It scales easily, and it indexes every field out of the box, which lets us customize report filters to our heart's content. We denormalized the model so that we don't have to do JOINs across Elasticsearch indexes.
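For illustration only, a denormalized order document might look roughly like this; all field names are hypothetical and are chosen just to show the idea of embedding driver and client attributes into the order document so that no cross-index JOINs are needed.

```go
package antifraud

import "time"

// OrderDoc is a hypothetical denormalized document stored in a single
// Elasticsearch index: driver and client attributes are embedded in the
// order instead of living in separate indexes.
type OrderDoc struct {
	OrderID    int64     `json:"order_id"`
	FinishedAt time.Time `json:"finished_at"`
	Price      float64   `json:"price"`
	City       string    `json:"city"`

	DriverID    int64  `json:"driver_id"`
	DriverPhone string `json:"driver_phone"`

	ClientID     int64  `json:"client_id"`
	ClientDevice string `json:"client_device"`
}
```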
Warning
If you also decide to use Elasticsearch as a database, be careful. With the default settings, Elasticsearch may start returning partial search results under load: the request fails on several shards, yet the response code is still 2xx. If this behavior does not suit you and you would rather get a search error (for example, so that you can retry the request later), you can adjust it with a search parameter.
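The parameter is not named in the text; most likely it is allow_partial_search_results, which makes Elasticsearch return an error instead of partial results when shards fail (this is my assumption about what the author meant). A minimal sketch of passing it with a search request over plain HTTP:

```go
package antifraud

import (
	"bytes"
	"net/http"
)

// searchOrders queries a hypothetical "orders" index and asks Elasticsearch
// to fail the whole request instead of silently returning partial results
// when some shards time out or fail (assumes the cluster supports the
// allow_partial_search_results parameter).
func searchOrders(esURL string, query []byte) (*http.Response, error) {
	url := esURL + "/orders/_search?allow_partial_search_results=false"
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(query))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return http.DefaultClient.Do(req)
}
```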
Current anti-fraud scheme
Let me remind you that the main logic related to trips lives in a monolith, and all information about orders is still stored in MySQL. At the end of the trip, the monolith transfers the order from the active orders table to the closed orders table and sends a message to our service via RabbitMQ so that we can check the specific order.
When receiving a message from RabbitMQ, we could simply spawn a goroutine to process it and move on to the next message, but that approach gives no control over the number of goroutines. Therefore, the number of handlers in the service is adjusted dynamically using worker pools.
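A rough sketch of such a bounded worker pool, assuming the github.com/rabbitmq/amqp091-go client; the queue name and handler are placeholders, and the dynamic resizing of the pool that the service actually does is omitted here.

```go
package antifraud

import (
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

// startWorkers consumes messages from the incoming queue with a fixed pool
// of workers, so the number of in-flight goroutines stays bounded instead
// of spawning one goroutine per message.
func startWorkers(ch *amqp.Channel, workers int, handle func(amqp.Delivery) error) error {
	msgs, err := ch.Consume(
		"antifraud.incoming", // queue name (placeholder)
		"",                   // consumer tag
		false,                // auto-ack off: we ack only after handling
		false, false, false, nil,
	)
	if err != nil {
		return err
	}
	for i := 0; i < workers; i++ {
		go func() {
			for msg := range msgs {
				if err := handle(msg); err != nil {
					log.Printf("handle message: %v", err)
					msg.Nack(false, false) // retries are handled via the retry queue
					continue
				}
				msg.Ack(false)
			}
		}()
	}
	return nil
}
```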
When processing a message, the anti-fraud service goes to the MySQL slave, reads all the order data we need from different tables, writes it to Elasticsearch, and then sends itself a message to check that same order. During the check it acquires a distributed lock in Redis to prevent parallel processing of the same object under particularly intensive load, for example when a driver or client is being updated frequently. If fraud is found, the service sends a message about it to the monolith.
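A minimal sketch of acquiring such a distributed lock with the github.com/redis/go-redis client (SET NX plus a TTL); the key format, token, and timeout are assumptions, and a production version would verify the token before releasing.

```go
package antifraud

import (
	"context"
	"fmt"
	"strconv"
	"time"

	"github.com/redis/go-redis/v9"
)

// acquireDriverLock tries to take a short-lived distributed lock for one
// driver so that two workers do not check the same driver at the same time.
// It returns a release function when the lock was taken.
func acquireDriverLock(ctx context.Context, rdb *redis.Client, driverID int64) (release func(), ok bool, err error) {
	key := fmt.Sprintf("antifraud:lock:driver:%d", driverID)
	token := strconv.FormatInt(time.Now().UnixNano(), 10)

	ok, err = rdb.SetNX(ctx, key, token, 30*time.Second).Result()
	if err != nil || !ok {
		return nil, ok, err
	}
	release = func() {
		// Best-effort unlock; a safer version would check that the stored
		// token is still ours (e.g. with a Lua script) before deleting.
		rdb.Del(ctx, key)
	}
	return release, true, nil
}
```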
For building reports in the admin panel, the service is called via its REST API.
All of this keeps the impact on the monolith minimal.
Now in more detail
The reader may have noticed a couple of problems:
- Elasticsearch does not guarantee that written data is immediately available for search; it becomes searchable only after an index refresh, which Elasticsearch performs in the background at some interval. So how do we check an order that we have just written?
- What if the MySQL slave is lagging behind and the order isn't there yet?
Let's start by solving the second problem
Inside, our RabbitMQ setup consists not of two queues (incoming and outgoing) but of three: there is also a retry queue.
This queue has a producer but no consumer. It has a dead-letter policy configured: when a message's TTL expires, it is sent back to the incoming queue, and we process it again.
In other words, if we have received a message to check an order but the slave does not have that order yet, we simply put the message into the retry queue each time until the order appears. This approach lets us ride out temporary errors, and if the number of processing attempts for a message is exceeded, we discard it and log an error.
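For illustration, such a retry queue can be declared with a per-queue message TTL and a dead-letter policy that routes expired messages back to the incoming queue. The queue names and the one-minute TTL below are placeholders (the real TTL just has to exceed the Elasticsearch refresh interval, as discussed further on); this is a sketch using the amqp091-go client, not our actual configuration.

```go
package antifraud

import amqp "github.com/rabbitmq/amqp091-go"

// declareRetryQueue declares a queue that nobody consumes from: messages sit
// in it until their TTL expires and are then dead-lettered back into the
// incoming queue via the default exchange.
func declareRetryQueue(ch *amqp.Channel) error {
	_, err := ch.QueueDeclare(
		"antifraud.retry", // name (placeholder)
		true,              // durable
		false,             // auto-delete
		false,             // exclusive
		false,             // no-wait
		amqp.Table{
			"x-message-ttl":             int32(60000),         // 1 minute; must exceed the ES refresh interval
			"x-dead-letter-exchange":    "",                   // default exchange
			"x-dead-letter-routing-key": "antifraud.incoming", // send expired messages back for reprocessing
		},
	)
	return err
}
```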
Now back to the first problem
The quickest and worst option is to refresh the index on every write operation. The Elasticsearch developers recommend being extremely careful with this approach, as it can degrade performance.
There is another option: send all the information about the order in the message itself rather than reading it from the slave. But then the message size would grow by several orders of magnitude, which would increase the load on our RabbitMQ, and we try to protect it. In addition, the structure of the data we read changes quite often, and we would like to avoid having to change the model in both the monolith and the service.
Maybe we should check the order right when we read it from the slave? That is possible, but most of our checks draw conclusions from several orders, so we would still have to go to the database for the other orders. Why complicate the logic when we can reuse the same retry queue mechanism?
By setting the TTL of messages in the retry queue longer than the Elasticsearch index refresh interval, we forget about the first problem once and for all.
You can read more about the dead-letter mechanism in the RabbitMQ documentation.
A little bit about our tests
It is dangerous to make mistakes in the logic of anti-fraud rules: they can lead to massive incorrect write-offs of money. That is why we strive for 100% test coverage of the important parts of the code. For this we use the testify library, mocking external dependencies and verifying that the rules behave correctly. We also have functional tests that cover the main flow of order processing and verification.
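As an illustration, a unit test for a rule with a mocked external dependency might look like this with testify. The OrderStorage interface, the SameClientRule, and the test data are all hypothetical; only the testify APIs (mock, require, assert) are real.

```go
package antifraud

import (
	"context"
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/mock"
	"github.com/stretchr/testify/require"
)

// OrderStorage is the external dependency a rule needs; it is mocked in tests.
type OrderStorage interface {
	RecentOrders(ctx context.Context, driverID int64) ([]Order, error)
}

type mockStorage struct{ mock.Mock }

func (m *mockStorage) RecentOrders(ctx context.Context, driverID int64) ([]Order, error) {
	args := m.Called(ctx, driverID)
	return args.Get(0).([]Order), args.Error(1)
}

func TestSameClientRuleFires(t *testing.T) {
	storage := &mockStorage{}
	// Three consecutive orders with the same client should trigger the rule.
	storage.On("RecentOrders", mock.Anything, int64(42)).Return([]Order{
		{ID: 1, DriverID: 42, ClientID: 7},
		{ID: 2, DriverID: 42, ClientID: 7},
		{ID: 3, DriverID: 42, ClientID: 7},
	}, nil)

	rule := NewSameClientRule(storage, 3) // hypothetical rule: n same-client orders in a row
	events, err := rule.Run(context.Background(), 42)

	require.NoError(t, err)
	assert.Len(t, events, 1)
	storage.AssertExpectations(t)
}
```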
Instead of conclusions
By rewriting all the anti-fraud code, we ended up with a service we can be confident in and with room for further business growth over the next few years. We have solved an important business problem, thanks to which an honest driver reliably receives their honest money right after the trip.
Of course, some of the tasks our service performs remain behind the NDA curtain, and some simply would not fit into one article.
Maybe next time I'll come back with a story about how we analyze user actions for fraud, where the load is orders of magnitude higher.