
Reporting in microservices. How to optimize performance?

June 25, 2019
ARTEM BARMIN
Business intelligence solutions use microservice systems for better scalability and flexibility, but optimizing their performance is a serious challenge for a development team. So, I've decided to describe how we improved a microservice architecture with the help of a dedicated reporting module. The article includes its technical scheme, cost estimates, and the pros and cons of the tools we considered. It will be useful both for tech professionals and business owners.

Microservice architecture: pros and cons

Since around 2010, the popularity of microservice design has grown along with the rise of DevOps and Agile development. Nowadays, Airbnb, Netflix, Uber, LinkedIn and other big companies benefit from microservices.

A monolithic system handles all the implemented logic in a single process, while a microservice architecture splits it into several independent services. A monolithic enterprise application is usually built as a single unit that includes the common parts:

  • user interface
  • database
  • server
In a monolith, any change to the system leads to building and deploying a new version of the whole server-side application. Let's consider the microservice concept in detail.

What does microservices architecture really mean?

Microservice design means a set of services, but the definition is vague. I can single out 4 features that a microservice usually has:

  • responsibility for a specific business need
  • automatic deployment
  • usage of endpoints
  • the decentralized control of languages and data
In the picture below, you can see microservice design compared to a monolithic app.
Monolith and microservice architecture
Monolith vs microservice design

What is scalability in microservices?

One of the main benefits of the microservice style is its scalability. You can scale individual services without changing the whole system, so you save resources and keep the app less complex. One of the most famous examples is Netflix: the company had to cope with a rapidly growing subscriber base, and microservice design turned out to be a great solution for scaling it.

Each microservice needs its own database. Otherwise, you can't use all the benefits of the modularization pattern. But the variety of databases leads to challenges in the reporting process. We will discuss the problem later.

Microservice design speeds up app development and allows you to launch the product earlier. Each part can be rolled out separately, so the deployment of microservices is quicker and easier.

What are other advantages of microservices?

  • The ability to work in smaller teams and use an Agile approach
  • Flexibility in continuous integration and deployment
  • The possibility of convenient horizontal system scaling
  • Increased productivity of development team members
  • Simplification of the debugging and maintenance processes

What are the disadvantages of using microservices?

Despite all these benefits, microservice architecture has its own drawbacks, mainly the need to operate many systems and complete various tasks in a distributed environment. So, the main microservice pitfalls are:
  • Management issues
The complexity of microservice design makes developers plan and act more carefully.

  • Security risks
Communication over external APIs in a microservice architecture increases the risk of attacks.

  • Diversity of programming languages
Sometimes it's difficult to switch between them in the development and deployment processes.

BI project details: the problem of custom reports

The FreshCode team worked on a legacy EdTech project: a 7-year-old application from the USA consisting of over 10,000 ColdFusion files and running on an MS SQL database. The system was very complex and included many microservices. Its main parts were:
  • sophisticated financial and billing system
  • multi-organisation structure for large group entities
  • workflow management tool for business processes
  • integrated bulk email, SMS and live chat
  • online system for surveys, quizzes, examination
  • flexible assessment and learning management system
FreshCode joined the project at the stage of migrating to a new interface. The product was being prepared for a global launch, and the microservice system was expected to process large amounts of data. As for the target audience, the app was developed for:
  • large education networks that manage hundreds of campuses
  • governments that oversee up to 200k schools, colleges and universities
Meanwhile, the EdTech app design was convenient both for large education networks and small schools of about 100 students.

So, the FreshCode development team faced the problem of managing and improving the performance of a complex microservice architecture. It should be mentioned that the client wanted both SaaS and self-hosted versions of the system, so we chose the technical solutions with this requirement in mind.

How to improve microservices performance?

The process of generating reports required requests to many different services, which caused performance issues. That's why the FreshCode team decided to optimize the app architecture by creating a separate reporting microservice. It received data from all the databases, stored it, and transformed it into custom reports.

In the picture below you can see the scheme of the reporting microservice system and the technologies chosen for its implementation.
Microservice reporting module
Yellow marks all the microservices in the system, each with its own database. The reporting module tracks all changes in them with the help of a messaging system and then stores the new data in its own report database.
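Before diving into the implementation steps, here is a minimal TypeScript sketch of the idea: every change captured in a source database becomes an event that the reporting module applies to its own store. The field names and the in-memory store are illustrative assumptions, not the project's actual schema.

```typescript
// Illustrative shape of a change event flowing from a source database,
// through the messaging system, into the reporting module.
// All names here are assumptions for the sketch, not the project's schema.
type ChangeOperation = "INSERT" | "UPDATE" | "DELETE";

interface ChangeEvent {
  service: string;                   // which microservice owns the source table
  table: string;                     // source table name
  operation: ChangeOperation;        // what happened
  primaryKey: string;                // identifier of the affected row
  payload: Record<string, unknown>;  // new column values (empty for DELETE)
  occurredAt: string;                // ISO timestamp of the change
}

// The reporting module consumes such events and upserts them into its own
// report store (a Map stands in for the report database in this sketch).
function applyToReportStore(
  event: ChangeEvent,
  store: Map<string, Record<string, unknown>>,
): void {
  const key = `${event.service}:${event.table}:${event.primaryKey}`;
  if (event.operation === "DELETE") {
    store.delete(key);
  } else {
    store.set(key, { ...store.get(key), ...event.payload });
  }
}
```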

Reporting module implementation in 6 steps

Let's look at the 6 main parts of the reporting system, the technologies that can be used, and the best solutions.

Step №1: Change Data Capturing (CDC)

CDC tracks every single change (insert, update, delete) in a source database and applies custom logic to it. We considered 3 possible tools for this first step of implementing the microservice reporting system.

1. Apache NiFi
It allows creating simple CDC without coding at all. Apache NiFi has a lot of built-in processors and supports data routing, transformation and system mediation logic.
    Pros:
    • Support of cluster mode and easy scaling
    • Built-in PutToKafka and PutToKinesis activities
    • Implementation of custom activities on any JVM language
    • User-friendly UI
    Cons:
    • No predefined data format for messaging between activities
    • Supports only JVM languages
    • The quality of default activities isn't perfect
    • No Oracle CDC activity
2. StreamSets Data Collector
A popular open source solution for continuous big data ingestion in a microservice reporting system. Its main advantages are the simple creation of data pipelines and support for many widespread technologies.
    Pros:
    • Built-in AWS S3, Kinesis, Kafka, Oracle, Postgres processors
    • Open source software that can be adjusted to your needs
    • Simple and convenient UI
    • Support for most of the popular tools
    Cons:
    • A young solution that is still under active development
    • It takes some effort to start working with StreamSets Data Collector
3. Matillion
An ELT tool with an easy-to-use interface, built specifically for Amazon Redshift, Google BigQuery and Snowflake.
    Pros:
    • A proprietary tool with support from its development team
    • Well-tested solution
    Cons:
    • Only a few databases can be used with this tool
    • The ELT architecture doesn't suit every project
Oracle was the main database in our microservice reporting system, so we chose StreamSets Data Collector because of its out-of-the-box Oracle CDC support.
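StreamSets pipelines are configured mostly through the UI rather than code, so here is only a conceptual TypeScript sketch of what change data capturing does. Real CDC tools read the database redo/transaction log; this simplified version polls an assumed `updated_at` column instead, purely to illustrate the idea.

```typescript
// Naive, timestamp-based CDC sketch. Real tools (StreamSets, NiFi) read the
// database redo/transaction log, but the idea is the same: detect changed
// rows and emit them as events. The table name ("students") and column
// ("updated_at") are illustrative assumptions.
interface Row {
  id: string;
  updated_at: Date;
  [column: string]: unknown;
}

type QueryFn = (sql: string, params: unknown[]) => Promise<Row[]>;

async function pollChanges(
  query: QueryFn,                    // any SQL client wrapped in this signature
  lastCheckpoint: Date,              // where the previous poll stopped
  emit: (row: Row) => Promise<void>, // e.g. publish to the messaging system
): Promise<Date> {
  const rows = await query(
    "SELECT * FROM students WHERE updated_at > $1 ORDER BY updated_at",
    [lastCheckpoint],
  );
  for (const row of rows) {
    await emit(row); // downstream: messaging system -> report database
  }
  // New checkpoint = timestamp of the last processed row (or the old one).
  return rows.length > 0 ? rows[rows.length - 1].updated_at : lastCheckpoint;
}
```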

Step №2: Messaging System

A messaging system allows services to exchange messages and sets a common standard for publishing them.

1. Apache Kafka
One of the most famous tools for real-time analytics. Apache Kafka offers high throughput and reliability.
    Pros:
    • High throughput, fault tolerance, durability
    • Great scalability, high concurrency
    • Batch mode, native computation over streams
    • A great choice for an on-premise microservice reporting system
    Cons:
    • Requires DevOps knowledge for a correct setup
    • No built-in monitoring tool
2. AWS Kinesis
It simplifies collecting, processing and analyzing streaming data. Amazon Kinesis offers key capabilities for cost-effective processing of streaming data at any scale.
    Pros:
    • Easy to manage and scale
    • Great integration with other AWS services
    • Almost no DevOps effort
    • Built-in monitoring and alert system
    Cons:
    • Needs some cost optimization
    • Can't be used for on-premise software
Although Apache Kafka required a bit more effort to deploy and set up, we used it as a cost-efficient on-premise solution.
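For illustration, here is a minimal sketch of publishing and consuming change events with Kafka using the kafkajs client for NodeJS. The broker address, topic name and consumer group are assumptions, not the project's actual configuration.

```typescript
// A minimal kafkajs sketch: a CDC stage publishes changes, the reporting
// module consumes them. Broker, topic and group id are illustrative.
import { Kafka } from "kafkajs";

const kafka = new Kafka({
  clientId: "reporting-module",
  brokers: ["kafka:9092"],
});

// Producer side: publish each captured change as a message keyed by table.
export async function publishChange(table: string, change: object): Promise<void> {
  const producer = kafka.producer();
  await producer.connect();
  await producer.send({
    topic: "db-changes",
    messages: [{ key: table, value: JSON.stringify(change) }],
  });
  await producer.disconnect();
}

// Consumer side: the reporting module reads changes and updates its own database.
export async function consumeChanges(
  handle: (change: object) => Promise<void>,
): Promise<void> {
  const consumer = kafka.consumer({ groupId: "reporting-module" });
  await consumer.connect();
  await consumer.subscribe({ topic: "db-changes", fromBeginning: false });
  await consumer.run({
    eachMessage: async ({ message }) => {
      if (message.value) {
        await handle(JSON.parse(message.value.toString()));
      }
    },
  });
}
```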

Step №3: Streaming Computation Systems

A streaming computation system analyzes multiple data streams from many sources. It helps to prepare data before ingestion, so it's possible to denormalize or join streams and enrich them with additional info if needed.

1. Spark Streaming
It brings Apache Spark's language-integrated API to stream processing, so you can write streaming jobs the same way you write batch jobs.
    Pros:
    • Stateful exactly-once semantics out of the box
    • Fault tolerance, scalability
    • In-memory computation
    Cons:
    • Pretty expensive to use
    • Manual optimization
    • No built-in state management
2. Apache Flink
It is useful for stateful computations over unbounded and bounded data streams. Apache Flink suits all common cluster environments and performs computations at in-memory speed.
    Pros:
    • Exactly-once state consistency
    • SQL on stream and batch data
    • Low latency, scalability, fault tolerance
    • Support for very large state
    Cons:
    • Requires strong programming skills
    • Complicated architecture
    • The Flink community is smaller than Spark's, but growing
3. Apache Samza
The scalable data processing engine for real-time analytics that can be used in a microservice reporting system.
    Pros:
    • Can maintain a large state
    • Low latency, high throughput, mature and tested at scale
    • Fault-tolerant and high performance
    Cons:
    • At-least-once processing guarantee
    • Lack of advanced streaming features (watermarks, sessions, triggers)
4. AWS Kinesis Services
This set of tools includes Kinesis Data Firehose, Data Analytics and Data Streams. It helps to build powerful stream processing without implementing any custom code.
    Pros:
    • Pay only for what you use
    • The easiest way to process data streams in real time with SQL
    • Handles any amount of streaming data
    Cons:
    • Can't be used on-premise
    • The cost in a high-load environment will be higher compared to other solutions, but development and maintenance costs may be lower
    • Complicated to customize
AWS provides a great set of tools for ETL and data processing, and it's a good starting point. But there is no way to deploy these services on custom servers, so they don't fit on-premise solutions.

Apache Flink is the most feature-rich and performant solution. It allows storing a very large application state (multi-terabyte). But it requires more developers to be involved and has to be deployed and maintained by your own team.
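Flink jobs are written in JVM languages, so the sketch below only illustrates in plain TypeScript what this streaming computation step does conceptually: keep reference data in keyed state and join it with incoming events to produce denormalized records. The entities (students, enrollments) are illustrative assumptions, not the project's actual data model.

```typescript
// Conceptual sketch (plain TypeScript, not Flink code) of the streaming step:
// join incoming change events with reference data and emit denormalized
// records ready for the report database.
interface StudentChanged { studentId: string; name: string; campus: string }
interface EnrollmentCreated { studentId: string; courseId: string; createdAt: string }
interface DenormalizedEnrollment extends EnrollmentCreated {
  studentName: string;
  campus: string;
}

// Keyed state: latest known student record, similar to what a streaming
// engine keeps in its managed state for a keyed stream.
const studentState = new Map<string, StudentChanged>();

export function onStudentChanged(event: StudentChanged): void {
  studentState.set(event.studentId, event);
}

export function onEnrollmentCreated(
  event: EnrollmentCreated,
): DenormalizedEnrollment | null {
  const student = studentState.get(event.studentId);
  if (!student) {
    return null; // a real job would buffer the event until the student arrives
  }
  return { ...event, studentName: student.name, campus: student.campus };
}
```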

Step №4: Data Lake

A data lake is a central repository of integrated data from one or more disparate sources. It stores current and historical data in a single place, so we can use it for creating analytical reports, machine learning, etc.

1. AWS S3
The object storage service offers industry-leading scalability, data availability, security and performance.
    Pros:
    • Easy to integrate with other AWS services
    • Designed for 99.999999999% (11 9's) data durability
    • Cost-effective for rarely accessed data
    • Has an open source implementation with full API support
    Cons:
    • High network pricing
    • S3 has had availability issues in the past, but this is not a problem for a data lake
2. Apache Hadoop
HDFS, the primary data storage system used by Hadoop applications, allows storing and processing large amounts of data.
    Pros:
    • Works efficiently with huge amounts of data
    • Integration with many analytical and operational tools (Impala, Hive, HBase, Kudu, Kylin, etc.)
    Cons:
    • Complicated to deploy and manage
    • Requires setting up monitoring and high availability yourself
We decided to start with AWS S3. Since its API has an open source implementation, we could also integrate it into the on-premise microservice reporting system.
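Here is a minimal sketch of archiving raw change events into the data lake with the AWS SDK for JavaScript (v3). The bucket name and key layout are assumptions; the commented-out endpoint line shows how the same client can be pointed at an S3-compatible store such as MinIO for the on-premise variant.

```typescript
// A minimal sketch of writing raw change events into the data lake.
// Bucket name, key layout and endpoint are illustrative assumptions.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({
  region: "us-east-1",
  // For the on-premise variant, point the same client at MinIO instead of AWS:
  // endpoint: "http://minio:9000", forcePathStyle: true,
});

export async function archiveChange(table: string, change: object): Promise<void> {
  const timestamp = new Date().toISOString();
  await s3.send(
    new PutObjectCommand({
      Bucket: "reporting-data-lake",
      Key: `raw/${table}/${timestamp}.json`, // partitioned by source table
      Body: JSON.stringify(change),
      ContentType: "application/json",
    }),
  );
}
```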

Step №5: Report Databases

1. AWS Aurora
According to AWS, it is up to 5 times faster than standard MySQL databases and 3 times faster than standard PostgreSQL databases.
    Pros:
    • Pretty fast SQL database
    • High availability and durability
    • Fully managed
    • Easy to scale
    Cons:
    • Poor performance for analytical reports on big data projects
    • The smallest available instance is quite big, but it can easily be replaced by plain PostgreSQL
2. AWS Redshift
Redshift delivers up to 10 times faster performance than other data warehouses by using machine learning, massively parallel query execution and columnar storage on high-performance disks.
    Pros:
    • Can run queries on external S3 files
    • Easy to set up, use and manage
    • Columnar storage
    Cons:
    • Doesn't enforce uniqueness
    • Can't be used as a live app database
    • It's mostly useful for running aggregations on large amounts of data
3. Kinetica
A vectorized, columnar, memory-first database designed for analytical (OLAP) workloads. Kinetica automatically distributes any workload across CPUs and GPUs for optimal results.
    Pros:
    • Pretty fast aggregation performance, running on both GPU and CPU
    • Supports materialized join views and can update them incrementally
    Cons:
    • GPU instances still cost a lot
    • No way to join data between different partitions
4. Apache Druid
It generally works well with any event-oriented, clickstream, time series or telemetry data, especially streaming datasets from Apache Kafka. Druid provides exactly-once consumption semantics from Apache Kafka and is commonly used as a sink for event-oriented Kafka topics.
    Pros:
    • Can be deployed in any *NIX environment on commodity hardware
    • Best for interactive dashboards with full drill-down capabilities
    • Stores only pre-aggregated data
    Cons:
    • Isn't perfect for custom reports that may be built by users
    • Works only with time series data
    • No full join support
All of these databases are great in their own niche. But our client's goal was to create reports based on all data from all microservices, so the development team considered AWS Aurora the best choice for this task. It simplified the workflow a lot.
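For illustration, here is a minimal sketch of upserting denormalized records into the report database with node-postgres. The same code works against Aurora (PostgreSQL-compatible edition) and plain PostgreSQL, which is what the on-premise variant relies on. The table and column names are assumptions made for the example.

```typescript
// A minimal sketch of upserting denormalized records into the report database.
// The report_enrollments table and its columns are illustrative assumptions.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.REPORT_DB_URL });

export async function upsertEnrollment(record: {
  studentId: string;
  courseId: string;
  studentName: string;
  campus: string;
}): Promise<void> {
  await pool.query(
    `INSERT INTO report_enrollments (student_id, course_id, student_name, campus)
     VALUES ($1, $2, $3, $4)
     ON CONFLICT (student_id, course_id)
     DO UPDATE SET student_name = EXCLUDED.student_name, campus = EXCLUDED.campus`,
    [record.studentId, record.courseId, record.studentName, record.campus],
  );
}
```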

Step №6: Report Microservice

The report microservice was responsible for storing information about data objects and the relations between them. It also managed security and generated the reports themselves, since each report was based on the data objects chosen by the user.
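Here is a rough sketch of what such a report endpoint could look like in Express. The route, request shape and the simple column whitelist are assumptions made for the example; the real service also stored object metadata and relations, which is omitted here.

```typescript
// A minimal Express sketch of a report endpoint: the client chooses which
// data objects (columns) to include, and the service builds the report from
// the report database. All names here are illustrative assumptions.
import express from "express";
import { Pool } from "pg";

const app = express();
app.use(express.json());
const pool = new Pool({ connectionString: process.env.REPORT_DB_URL });

app.post("/reports/enrollments", async (req, res) => {
  // Very simple security filter: only whitelisted columns can be requested.
  const allowed = ["student_name", "campus", "course_id"];
  const requested: string[] = Array.isArray(req.body.columns) ? req.body.columns : [];
  const columns = requested.filter((c) => allowed.includes(c));
  if (columns.length === 0) {
    return res.status(400).json({ error: "No valid columns requested" });
  }
  const result = await pool.query(
    `SELECT ${columns.join(", ")} FROM report_enrollments LIMIT 1000`,
  );
  res.json(result.rows);
});

app.listen(3000);
```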

SaaS and self-hosted technological stacks

We prepared 2 variants of the technological stack for the microservice reporting system. For the SaaS product on AWS, we used:
• StreamSets for CDC
• Apache Kafka as a messaging system
• AWS S3 as a data lake
• AWS Aurora as a report database
• AWS ElastiCache as an in-memory data store

The reporting microservice was written in NodeJS. You can see rough estimates for the SaaS solution in the table below.
AWS estimates
Note: these are calculations for a production deployment. The development process required a much smaller infrastructure.

Such infrastructure was the most appropriate for the client's requirements. Its main advantage was that AWS services could easily be replaced with self-hosted solutions, which allowed us to avoid code/logic duplication for different deployment schemes.

For the on-premise version we used MinIO, PostgreSQL and Redis respectively. Their APIs were fully compatible with the AWS services, so we didn't have any significant problems with the microservice reporting system at all.
On-premise reporting module
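Here is a minimal sketch of how one codebase can serve both deployment schemes by selecting backends through configuration, relying on that API compatibility (S3/MinIO, Aurora PostgreSQL/plain PostgreSQL, ElastiCache/Redis). The environment variable names are assumptions for the example.

```typescript
// One codebase, two deployment schemes: backends are selected through
// environment configuration. Variable names are illustrative assumptions.
import { S3Client } from "@aws-sdk/client-s3";
import { Pool } from "pg";
import { createClient } from "redis";

const onPremise = process.env.DEPLOYMENT === "on-premise";

// Data lake: AWS S3 in SaaS, MinIO on-premise (same S3 API).
export const dataLake = new S3Client(
  onPremise
    ? { region: "us-east-1", endpoint: process.env.MINIO_URL, forcePathStyle: true }
    : { region: "us-east-1" },
);

// Report database: Aurora (PostgreSQL-compatible) in SaaS, plain PostgreSQL on-premise.
export const reportDb = new Pool({ connectionString: process.env.REPORT_DB_URL });

// In-memory store: ElastiCache in SaaS, plain Redis on-premise (same protocol).
export const cache = createClient({ url: process.env.REDIS_URL });
```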
If you are interested in more information about AWS infrastructure, read the article "How-To Configure Amazon Redshift for Performance".

The bottom line: custom reporting in microservices

All in all, our team solved the client's technical challenges. The reporting microservice module was effective and convenient. Its capabilities are shown below.
Reporting module capabilities
The FreshCode client improved the microservice reporting system and achieved these goals:
• to update the app's architecture and design
• to improve the product by adding new features
• to optimize performance and increase flexibility and scalability
For more project details, check out the full EdTech case study. It includes the full list of the customer's requirements and the app's features.

If you are interested in solving a similar problem or have any other technical challenges, contact our team. We provide free expert advice for startups, small businesses and enterprises. Check the FreshCode portfolio to find out about other interesting projects.

Would you like to read more case-based articles? Let me know in the comments below and stay in touch with the FreshCode team!