Business intelligence solutions use microservice systems for better scalability and flexibility. Optimizing their performance is a major challenge for a development team. So, I've decided to describe how a microservice architecture can be improved with the help of a reporting module. The article covers its technical scheme, estimates, and the pros and cons of possible tools. It will be useful both for tech professionals and business owners.
Microservice architecture: pros and cons
Since 2010, the popularity of microservice design has grown with the rise of DevOps and Agile development. Nowadays, Airbnb, Netflix, Uber, LinkedIn and other big companies benefit from microservices.
A monolith system has a single process for all the implemented logic. In contrast, a microservice architecture consists of several independent processes. Each of them usually includes the common parts of an enterprise application:
In a monolith, any change in the system leads to the deployment of a new version of the whole server part of the system. Let's consider the concept in detail.
What does microservices architecture really mean?
Microservice design means a set of services, but the definition is vague. I can single out 4 features that a microservice usually has:
responsibility for a specific business need
usage of endpoints
the decentralized control of languages and data
In the picture below, you can see microservice design compared to a monolith app.
Monolith vs microservice design
What is scalability in microservices?
One of the main benefits of the microservice style is its scalability. You can scale individual services without changing the whole system. So, you save resources and keep the app less complex. One of the most famous cases that proves this is the Netflix user base. The company had to cope with a growing subscriber database, and the microservice design was a great solution for scaling it.
Each microservice needs its own database. Otherwise, you can't use all the benefits of the modularization pattern. But the variety of databases leads to challenges in the reporting process. We will discuss the problem later.
Microservice design speeds up app development and lets you launch the product earlier. Each part can be rolled out separately. So, the deployment of microservices is quicker and easier.
What are other advantages of microservices?
The ability to work in smaller teams and use an Agile approach
Flexibility in continuous integration and deployment
The possibility of convenient horizontal system scaling
Increased productivity of development team members
Simplification of the debugging and maintenance processes
What are the disadvantages of using microservices?
Despite all these benefits, microservice architecture has its own drawbacks. It requires operating many systems and completing various tasks in a distributed environment. So, the main microservice pitfalls are:
The complexity of microservice design makes developers plan and act more carefully.
The external API communication in microservice architecture leads to more significant risks of attacks.
The diversity of programming languages sometimes makes it difficult to switch between them in the development and deployment processes.
BI project details: the problem of custom reports
The FreshCode team worked on a legacy EdTech project. The application consisted of over 10,000 files written in ColdFusion. It was a 7-year-old app from the USA that ran on an MS SQL database. The system was very complex and included many microservices. Its main parts were:
sophisticated financial and billing system
multi-organisation structure for large group entities
workflow management tool for business processes
integrated bulk email, SMS and live chat
online system for surveys, quizzes, examination
flexible assessment and learning management system
FreshCode worked on the project at the stage of migrating to a new interface. The product was being prepared for the global launch. The microservice system was supposed to process large amounts of data. As for the app's target audience, it was developed for:
large education networks that manage 100s of campuses
governments that have up to 200k schools, colleges and universities
Meanwhile, the EdTech app design was convenient both for large education networks and for a small school of about 100 students.
So, the FreshCode development team faced the problem of managing and improving the performance of a complex microservice architecture. It should be mentioned that the client wanted to build both SaaS and self-hosted systems. So, we chose the technical solutions with this fact in mind.
How to improve microservices performance?
The process of generating reports required engagement with different services, which caused performance issues. That's why the FreshCode team decided to optimize the app architecture by creating a separate reporting microservice. It received data from all the databases, saved it, and transformed it into custom reports.
In the picture below, you can see the scheme of the reporting microservice system and the technologies for its implementation.
microservice reporting module
Yellow color marks all microservices in the system. Each of them has its own database. The reporting module tracks all changes in them with the help of a messaging system. Then, it stores the new data in its own report database.
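The flow described above can be sketched in a few lines of Python. This is a minimal in-memory sketch under stated assumptions, not the project's code: `queue.Queue` stands in for the messaging system, a dictionary stands in for the report database, and the event fields (`service`, `table`, `op`, `key`, `row`) are hypothetical names chosen for illustration.

```python
import queue

# Stand-in for the messaging system (an assumption; not a real broker).
change_events = queue.Queue()

# Stand-in for the report database: {(service, table): {primary_key: row}}.
report_db = {}


def publish_change(service, table, op, key, row=None):
    """Called on the microservice side whenever one of its rows changes."""
    change_events.put(
        {"service": service, "table": table, "op": op, "key": key, "row": row}
    )


def apply_pending_changes():
    """Reporting module: drain the queue and mirror every change locally."""
    applied = 0
    while not change_events.empty():
        event = change_events.get()
        table = report_db.setdefault((event["service"], event["table"]), {})
        if event["op"] == "delete":
            table.pop(event["key"], None)
        else:  # "insert" and "update" are both handled as an upsert
            table[event["key"]] = event["row"]
        applied += 1
    return applied


publish_change("billing", "invoices", "insert", 1, {"student_id": 7, "amount": 120})
publish_change("billing", "invoices", "update", 1, {"student_id": 7, "amount": 150})
publish_change("lms", "students", "insert", 7, {"name": "Ann"})
publish_change("lms", "students", "delete", 7)
applied_count = apply_pending_changes()
```

Treating inserts and updates as the same upsert keeps the reporting side simple, because it only needs the latest state of every row.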
Reporting module implementation in 6 steps
Let's look at the 6 main parts of the reporting system, the technologies that can be used for each of them, and the solutions we chose.
Step №1: Change Data Capturing (CDC)
Change data capture (CDC) tracks every single change (insert, update, delete) and performs some logic on it. We considered 3 possible tools for the first step of implementing the microservice reporting system.
1. Apache NiFi. It allows creating simple CDC pipelines without coding at all. Apache NiFi has a lot of built-in processors and supports data routing, transformation and system mediation logic.
Support of cluster mode and easy scaling
Built-in activities for publishing to Kafka and Kinesis (such as PublishKafka and PutKinesisStream)
Implementation of custom activities on any JVM language
No predefined data format for messaging between activities
Supports only JVM languages
The quality of default activities isn't perfect
No Oracle CDC activity
2. StreamSets Data Collector. A popular open source solution for continuous big data ingestion in a microservice reporting system. Its main advantages are simple creation of data pipelines and support of many widespread technologies.
Open source software can be adjusted for your needs
Simple and convenient UI
Support of most of the popular tools
It's a relatively new solution that is still under active development
It's a little bit difficult to get started with StreamSets Data Collector
3. Matillion. Its ELT architecture comes with an easy-to-use interface. It is built specifically for Amazon Redshift, Google BigQuery and Snowflake.
A proprietary tool
Support of the development team
Only several databases can be used with this tool
ELT architecture doesn't suit all projects
Oracle was the main database of our microservice reporting system. So, we chose StreamSets Data Collector because it supports Oracle CDC out of the box.
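To make the idea of CDC more concrete, here is a minimal sketch in Python. It uses a different and much simpler technique than the tools above: database triggers that write every insert, update and delete into a change log table, which a reader then polls. SQLite is used only because it ships with Python; the table names, the `change_log` layout and the `read_changes` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);

    -- Hypothetical change log that a CDC reader would poll.
    CREATE TABLE change_log (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        op TEXT,
        row_id INTEGER,
        name TEXT
    );

    CREATE TRIGGER students_ins AFTER INSERT ON students BEGIN
        INSERT INTO change_log (op, row_id, name) VALUES ('insert', NEW.id, NEW.name);
    END;
    CREATE TRIGGER students_upd AFTER UPDATE ON students BEGIN
        INSERT INTO change_log (op, row_id, name) VALUES ('update', NEW.id, NEW.name);
    END;
    CREATE TRIGGER students_del AFTER DELETE ON students BEGIN
        INSERT INTO change_log (op, row_id, name) VALUES ('delete', OLD.id, NULL);
    END;
    """
)


def read_changes(after_seq):
    """Return all changes recorded after the given sequence number."""
    return conn.execute(
        "SELECT seq, op, row_id, name FROM change_log WHERE seq > ? ORDER BY seq",
        (after_seq,),
    ).fetchall()


conn.execute("INSERT INTO students (id, name) VALUES (1, 'Ann')")
conn.execute("UPDATE students SET name = 'Anna' WHERE id = 1")
conn.execute("DELETE FROM students WHERE id = 1")

changes = read_changes(after_seq=0)
```

The reader remembers the last `seq` it has processed and asks only for newer rows, so every change is picked up once and in order.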
Step №2: Messaging System
A messaging system allows sending messages between computer systems, as well as setting publishing standards for them.
1. Apache Kafka. One of the most famous tools for real-time analytics. Apache Kafka has high throughput and reliability characteristics.
High throughput, fault tolerance, durability
Great scalability, high concurrency
Batch mode, native computation over streams
A great choice for on-premise microservice reporting system
Requires DevOps knowledge for correct setup
No built-in monitoring tool
2. AWS Kinesis. It simplifies collecting, processing, and analyzing streaming data. Amazon Kinesis offers key capabilities for cost-effective processing at any scale.
Easy to manage and scale
Great integration with other AWS services
Almost no DevOps effort
Built-in monitoring and alert system
Needs some cost optimizations
No way to use for on-premise software
Although Apache Kafka required a bit more effort to deploy and set up, we used it as a cost-efficient on-premise solution.
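The core idea behind a log-based messaging system can be shown with a small in-memory sketch in Python. It is only an imitation of the concept (an append-only log per topic plus a read offset per consumer group), not a client for any real broker; the `TopicLog` class and the topic and group names are hypothetical.

```python
from collections import defaultdict


class TopicLog:
    """Tiny in-memory imitation of a topic-based, append-only message log."""

    def __init__(self):
        self._topics = defaultdict(list)  # topic name -> list of messages
        self._offsets = defaultdict(int)  # (group, topic) -> next offset to read

    def publish(self, topic, message):
        """Append a message and return its offset within the topic."""
        self._topics[topic].append(message)
        return len(self._topics[topic]) - 1

    def poll(self, group, topic, max_messages=10):
        """Return unread messages for a consumer group and advance its offset."""
        start = self._offsets[(group, topic)]
        batch = self._topics[topic][start:start + max_messages]
        self._offsets[(group, topic)] = start + len(batch)
        return batch


log = TopicLog()
log.publish("billing.invoices", {"op": "insert", "id": 1})
log.publish("billing.invoices", {"op": "update", "id": 1})

first_batch = log.poll("reporting", "billing.invoices")
second_batch = log.poll("reporting", "billing.invoices")
audit_batch = log.poll("audit", "billing.invoices")
```

Because messages stay in the log and each group keeps its own offset, a new consumer such as the hypothetical `audit` group can read the full history without affecting the reporting module.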
Step №3: Streaming Computation Systems
A streaming computation system analyzes multiple data streams from many sources. It helps to prepare data before ingestion, so it's possible to denormalize or join the streams and add any extra info if needed.
1. Spark Streaming. It brings Apache Spark's language-integrated API to stream processing, so streaming jobs can be written the same way as batch jobs.
Stateful exactly-once semantics out of the box
Pretty expensive to use
No built-in state management
2. Apache Flink. It is useful for stateful computations over unbounded and bounded data streams. Apache Flink runs in all common cluster environments and performs computations at in-memory speed.
Exactly once state consistency
SQL on Stream & Batch Data
Low latency, scalability, fault-tolerance
Support of very large state
Requires high programming skills
The Flink community is smaller than Spark's, but it is growing
3. Apache Samza. A scalable data processing engine for real-time analytics that can be used in a microservice reporting system.
Can maintain a large state
Low latency, high throughput, mature and tested at scale
Fault-tolerant and high performance
At-least-once processing guarantee
Lack of advanced streaming features (watermarks, sessions, triggers)
4. AWS Kinesis Services. The set of tools includes Data Firehose, Data Analytics, and Data Streams. It helps to build powerful stream processing without writing any custom code.
Pay only for what you use
The easiest way to process data streams in real time with SQL
Handle any amount of streaming data
No way to use on-premise
The cost in a high-load environment will be higher compared to other solutions, but development and maintenance cost may be less
Complicated to customize
AWS provides a great set of tools for ETL and data processing. It's a good starting point. But there is no way to deploy it on custom servers. That's why it doesn't fit on-premise solutions.
Apache Flink is the most feature-rich and performant solution. It allows storing large application state (multi-terabyte). But it requires more developers to be involved, and you have to deploy it yourself.
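The denormalization step mentioned at the start of this section can be illustrated with a plain Python generator. It is a sketch of the idea only (keyed state plus enrichment of incoming events), not code for any of the engines above; the stream names and fields are hypothetical.

```python
def denormalize(events):
    """Join invoice events with the latest known student record.

    `events` is an iterable of dicts from two hypothetical streams,
    distinguished by the "stream" field.
    """
    students = {}  # keyed state: student id -> latest student row
    for event in events:
        if event["stream"] == "students":
            students[event["id"]] = event
        elif event["stream"] == "invoices":
            student = students.get(event["student_id"], {})
            yield {
                "invoice_id": event["id"],
                "amount": event["amount"],
                "student_name": student.get("name"),  # None if not seen yet
                "campus": student.get("campus"),
            }


stream = [
    {"stream": "students", "id": 7, "name": "Ann", "campus": "North"},
    {"stream": "invoices", "id": 1, "student_id": 7, "amount": 120},
    {"stream": "invoices", "id": 2, "student_id": 9, "amount": 80},
]
report_rows = list(denormalize(stream))
```

The second invoice arrives before its student record, so it is emitted with empty student fields. Handling such late or out-of-order data properly is one of the reasons to use a real streaming engine with managed state.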
Step №4: Data Lake
A data lake is a central repository of integrated data from one or more disparate sources. It stores current and historical data in a single place, so we can use it for analytical reports, machine learning, etc.
1. AWS S3. The object storage service offers industry-leading scalability, data availability, security and performance.
Easy to integrate with other AWS services
Designed for 99.999999999% (11 9's) of data durability
Cost-effective for rarely accessed data
Has an open source implementation with full API support
High network pricing
S3 has had availability issues in the past, but that is not a problem for a data lake
2. Apache Hadoop. Its file system, HDFS, is the primary data storage system used by Hadoop applications. It allows storing and processing large amounts of data.
Efficiently works with huge amounts of data
Integration with many analytical and operational tools (Impala, Hive, HBase, Kudu, Kylin, etc.)
Complicated to deploy and manage
Needs to set up monitoring and high availability
We decided to start with AWS S3. It has an open source implementation. That's why we could integrate it into the on-premise microservice reporting system.
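One common way to organize change events in object storage is to partition the keys by service, table and date. The Python sketch below shows this layout with an in-memory stand-in for a bucket. The key format, the `InMemoryObjectStore` class and the sample date are assumptions for illustration and do not describe the project's actual layout.

```python
import json
from datetime import datetime, timezone


def object_key(service, table, event_time, seq):
    """Build a date-partitioned key so reports can read one day at a time."""
    day = event_time.strftime("%Y-%m-%d")
    return f"{service}/{table}/dt={day}/{seq:012d}.json"


class InMemoryObjectStore:
    """Stand-in for an S3-style bucket: put objects and list keys by prefix."""

    def __init__(self):
        self._objects = {}

    def put(self, key, body):
        self._objects[key] = body

    def list_keys(self, prefix):
        return sorted(k for k in self._objects if k.startswith(prefix))


store = InMemoryObjectStore()
ts = datetime(2020, 3, 1, 12, 0, tzinfo=timezone.utc)  # arbitrary sample time
key = object_key("billing", "invoices", ts, seq=42)
store.put(key, json.dumps({"op": "insert", "id": 1, "amount": 120}))
```

With such keys, listing the prefix `billing/invoices/dt=2020-03-01/` returns only the events of that day, which keeps historical reprocessing cheap.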
Step №5: Report Databases
1. AWS Aurora. According to AWS, it is up to 5 times faster than standard MySQL databases and 3 times faster than standard PostgreSQL databases.
Pretty fast SQL database
High Availability and Durability
Easy to scale
Bad performance for analytical reports in case of big data projects
The smallest available instance is too big, but it can easily be replaced with plain PostgreSQL
2. AWS Redshift. According to AWS, Redshift delivers up to 10 times faster performance than other data warehouses. It uses machine learning, massively parallel query execution and columnar storage on high-performance disks.
May run queries on external S3 files
Easy to set up, use and manage
Doesn't enforce uniqueness
Can't be used as a live app database
It's mostly useful for running aggregations on large amounts of data
3. Kinetica. A vectorized, columnar, memory-first database designed for analytical (OLAP) workloads. Kinetica automatically distributes any workload across CPUs and GPUs for optimal results.
Pretty fast aggregation performance, run on GPU and CPU
Supports materialized join views, and can update them incrementally
GPU instances still cost a lot
No way to join data between different partitions
4. Apache Druid. It generally works well with any event-oriented, clickstream, time series, or telemetry data, especially streaming datasets from Apache Kafka. Druid provides exactly once consumption semantics from Apache Kafka and is commonly used as a sink for event-oriented Kafka topics.
Druid can be deployed in any *NIX environment on commodity hardware
Best for interactive dashboards with full drill-down capabilities
Stores only pre-aggregated data
Isn't perfect for custom reports that may be built by users
Works only on time series data
No full join support
All of these databases are amazing. But our client's goal was to create reports based on all data from all microservices. So, the development team considered AWS Aurora as the best choice for this task. It simplified the workflow a lot.
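The benefit of a single relational report database is that one SQL query can span data from several microservices. Here is a minimal sketch in Python, with SQLite standing in for the report database; the table names, columns and sample rows are invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    -- Tables mirrored from two different microservices (hypothetical schema).
    CREATE TABLE lms_students (id INTEGER PRIMARY KEY, name TEXT, campus TEXT);
    CREATE TABLE billing_invoices (id INTEGER PRIMARY KEY, student_id INTEGER, amount REAL);

    INSERT INTO lms_students VALUES (1, 'Ann', 'North'), (2, 'Bob', 'South'), (3, 'Eve', 'North');
    INSERT INTO billing_invoices VALUES (10, 1, 120.0), (11, 2, 80.0), (12, 3, 50.0);
    """
)

# One SQL query now answers a question that spans both services.
revenue_by_campus = db.execute(
    """
    SELECT s.campus, SUM(i.amount) AS total
    FROM billing_invoices AS i
    JOIN lms_students AS s ON s.id = i.student_id
    GROUP BY s.campus
    ORDER BY s.campus
    """
).fetchall()
```

Without the report database, the same question would require calling two services and joining their responses in application code.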
Step №6: Report Microservice
The report microservice was responsible for storing information about data objects and the relations between them. It also managed security and generated the reports themselves, since these reports were based on the chosen data objects.
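A rough sketch of these responsibilities is shown below in Python (chosen for brevity; it is not the project's code). The metadata describes data objects and relations, a simple role table models security, and a helper builds the SQL for the chosen data objects. All names, roles and the `build_report_sql` function are hypothetical.

```python
# Hypothetical metadata: data objects, their columns, and relations between them.
DATA_OBJECTS = {
    "students": {"table": "lms_students", "columns": ["name", "campus"]},
    "invoices": {"table": "billing_invoices", "columns": ["amount"]},
}
RELATIONS = {
    ("invoices", "students"): "billing_invoices.student_id = lms_students.id",
}
# Hypothetical security rule: which roles may read which data objects.
ALLOWED = {"accountant": {"students", "invoices"}, "teacher": {"students"}}


def build_report_sql(role, base, joined, columns):
    """Build a SELECT for the chosen data objects, or raise PermissionError."""
    for name in (base, joined):
        if name not in ALLOWED.get(role, set()):
            raise PermissionError(f"{role} may not read {name}")
    selected = []
    for obj_name, column in columns:
        obj = DATA_OBJECTS[obj_name]
        if column not in obj["columns"]:
            raise ValueError(f"unknown column {obj_name}.{column}")
        selected.append(f'{obj["table"]}.{column}')
    return (
        f'SELECT {", ".join(selected)} '
        f'FROM {DATA_OBJECTS[base]["table"]} '
        f'JOIN {DATA_OBJECTS[joined]["table"]} ON {RELATIONS[(base, joined)]}'
    )


sql = build_report_sql(
    "accountant", "invoices", "students",
    [("students", "name"), ("invoices", "amount")],
)
```

Only table and column names that exist in the metadata can reach the generated query, so user input never ends up in the SQL text directly.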
SaaS and self-hosted technological stacks
We prepared 2 variants of the technological stack for the microservice reporting system. As for the SaaS product on AWS, we used:
StreamSets for CDC
Apache Kafka as a messaging system
AWS S3 as a data lake
AWS Aurora as a report database
AWS ElastiCache as an in-memory data store
The reporting microservice was written in Node.js. You can see rough estimates for the SaaS solution in the table below.
Note: These are calculations for production deployment. The development process required much smaller infrastructure.
Such infrastructure was the most appropriate for the client's requirements. Its main advantage was the easy way to replace AWS services with self-hosted solutions. It allowed us to avoid code/logic duplication for different deployment schemas.
For the on-premise variant, we used MinIO, PostgreSQL and Redis in place of S3, Aurora and ElastiCache respectively. Their APIs were fully compatible. So, we didn't have any significant problems in the microservice reporting system at all.
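Because the services on both sides expose compatible APIs, switching between the two stacks can come down to configuration. The following Python sketch shows that idea; the component names come from this article, while the `Stack` class, the deployment keys and the `select_stack` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stack:
    object_storage: str   # S3-compatible API
    report_database: str  # PostgreSQL-compatible API
    cache: str            # Redis-compatible API


# Component names come from the article; the mapping itself is illustrative.
STACKS = {
    "saas": Stack("AWS S3", "AWS Aurora", "AWS ElastiCache"),
    "on_premise": Stack("MinIO", "PostgreSQL", "Redis"),
}


def select_stack(deployment):
    """Return the backing services for a deployment type."""
    try:
        return STACKS[deployment]
    except KeyError:
        raise ValueError(f"unknown deployment type: {deployment}") from None
```

The application code depends only on the three roles (object storage, report database, cache), so the same logic runs unchanged in both deployments.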
The bottom line: custom reporting in microservices
All in all, our team solved the client's technical challenges. The reporting microservice module was effective and convenient. It was capable of:
Reporting module capabilities
FreshCode's client improved the microservice reporting system and achieved these goals:
to update the app's architecture and design
to improve the product by adding new features
to optimize performance, increase flexibility and scalability
For more project details, check out the full EdTech case study. It includes the full list of the customer's requirements and the app's features.
If you are interested in solving the same problem or have any other technical challenges, contact our team. We provide free expert advice for startups, small businesses and enterprises. Check the FreshCode portfolio to find other interesting projects.
Would you like to read more case-based articles? Let me know in the comments below and stay in touch with FreshCode team!