Wednesday, February 28, 2024

Spring Boot Observability - Distributed Tracing, Metrics

In this post we'll see how to set up observability, which covers distributed tracing, metrics and logs, for Spring Boot microservices. Spring Boot 3 added observability support to Spring applications, and that's what we'll use here: setting up observability in Spring Boot microservices with Micrometer and the Grafana stack (Loki, Prometheus, Tempo).

What is Observability

Observability is the ability to understand the internals of your system by examining its outputs, where the outputs are logs, metrics and distributed traces. By observing these outputs you can get an idea of the state of the system, such as exceptions thrown or latency in the applications.

Spring Boot Observability with Grafana Stack

You can observe a Spring Boot application using the applications provided as part of the Grafana stack:

  1. Traces with Tempo and OpenTelemetry or Zipkin
  2. Metrics with Prometheus, Spring Boot Actuator, and Micrometer
  3. Logs with Loki and Logback

Let's get some idea about these applications.

Loki- Loki is a horizontally-scalable, highly-available log aggregation system. Since the stack we are using puts Micrometer Tracing on the classpath, the logs are automatically correlated (i.e. contain a unique trace identifier). Loki is the server for storing the logs and processing queries.

Logback- Logback is a logging framework. Using it you can configure how logs are sent to Loki and the log pattern (which includes the trace ID).

Tempo- Grafana Tempo is an open source, easy-to-use and high-scale distributed tracing backend.

Tempo is Jaeger, Zipkin, Kafka, OpenCensus and OpenTelemetry compatible. It ingests batches in any of the mentioned formats, buffers them and then writes them to Azure, GCS, S3 or local disk.

Prometheus- Prometheus is a metrics aggregation system. A metric is a measurement of something, for example the time taken by a web server to process a request, or the number of active connections in a database.

Prometheus collects and stores its metrics as time series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels.

Micrometer- Micrometer provides a simple facade over the instrumentation clients, allowing you to instrument your JVM-based application code without vendor lock-in.

For metrics, Micrometer provides registry libraries for different monitoring systems; here we use the Prometheus registry (micrometer-registry-prometheus).

For tracing context propagation, Micrometer Tracing provides bridges to Brave (Zipkin) and to OpenTelemetry.
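As an illustration of the facade idea, application code records metrics only against Micrometer's MeterRegistry; the registry implementation on the classpath (Prometheus in our case) decides how they are exposed. A minimal sketch, with a hypothetical metric name and class:

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

@Component
public class CustomerLookupMetrics {

  private final Counter lookupCounter;

  public CustomerLookupMetrics(MeterRegistry registry) {
    // Code depends only on the Micrometer facade; the Prometheus registry
    // on the classpath determines how the metric is actually published.
    this.lookupCounter = Counter.builder("customer.lookups")
        .description("Number of customer lookups")
        .register(registry);
  }

  public void incrementLookups() {
    lookupCounter.increment();
  }
}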

Grafana- Grafana acts as a one stop shop for connecting to the data sources. There is Loki with logging data, Prometheus with metrics data and Tempo with tracing data, all these data sources are connected to Grafana to get meaningful insight at a single location.

With Grafana you can easily query, visualize and analyze the data, and set up alerts, all in one place.

Spans and Traces

With observability, two terms you will hear a lot are span and trace, so let's get some clarity about them.

To serve a request from an end user, processing may span several microservices, where one microservice calls another and so on.

Each call from one service to another is called a span.

A trace is the sum total of all the spans involved in processing a single request.

For example, if an end user requests customer data, the call goes to customer-service to get the customer data and then to account-service to get the associated accounts.

Span and Trace in distributed tracing

As you can see from the image, a span ID and a trace ID are assigned to each request. When one microservice calls another, that call (span) is assigned a different span ID but the trace ID remains the same. Each service passes the trace ID on to the next service in order to facilitate distributed tracing.

Some other terms related to distributed tracing are as given below-

Tracer- A library that handles the lifecycle of a span.

Tracing context- For distributed tracing to work the tracing context (trace id, span id, etc.) must be propagated through the process and over the network.

Log correlation- Parts of the tracing context (e.g. trace id, span id) can be populated to the logs of a given application. One can then collect all logs in a single storage and group them via trace id. That way one can get all logs, for a single business operation (trace) from all services put in a chronological order.

Latency analysis tools- A tool that collects exported spans and visualizes the whole trace.
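To make the tracing context a bit more concrete: with the setup shown later in this post, Spring Boot auto-configures a Micrometer Tracer bean, and the current trace ID can be read from it programmatically. A minimal sketch (the endpoint and class name are hypothetical):

import io.micrometer.tracing.Span;
import io.micrometer.tracing.Tracer;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class TraceDebugController {

  private final Tracer tracer;

  public TraceDebugController(Tracer tracer) {
    this.tracer = tracer;
  }

  // Returns the trace ID of the current request, handy for verifying
  // that the tracing context is present and propagated.
  @GetMapping("/trace-id")
  public String currentTraceId() {
    Span span = tracer.currentSpan();
    return span != null ? span.context().traceId() : "no active span";
  }
}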

Spring Boot observability example

To set up observability, which involves distributed tracing, logs and metrics, we'll take the example from Spring Boot Microservice - Service Registration and Discovery With Eureka as our base and change it to add observability.

In the example we'll have two services, CustomerService and AccountService, to cater to customer and account related functionality respectively. When customer information is retrieved it should also include the accounts associated with that customer, so CustomerService has to call AccountService to get the associated accounts.

Adding maven dependencies

We need to add the following dependencies in the pom.xml of both customer-service and account-service.

<dependency>
  <groupId>com.github.loki4j</groupId>
  <artifactId>loki-logback-appender</artifactId>
  <version>1.4.2</version>
</dependency>
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-tracing-bridge-brave</artifactId>
</dependency>
<dependency>
  <groupId>io.zipkin.reporter2</groupId>
  <artifactId>zipkin-reporter-brave</artifactId>
</dependency>

com.github.loki4j:loki-logback-appender dependency is required to send the logs to Loki.

io.micrometer:micrometer-tracing-bridge-brave adds Zipkin Brave, a tracer bridge used by Micrometer Tracing for tracing context propagation.

There is also an OpenTelemetry tracer bridge available in case you are using OpenTelemetry. In that case the dependency would be-

<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>

The io.zipkin.reporter2:zipkin-reporter-brave dependency is needed to report finished spans in the Zipkin format, which is what we use for latency visualization. By default the spans are sent to http://localhost:9411/api/v2/spans (configurable through the management.zipkin.tracing.endpoint property), which is why the Tempo container in the Docker setup later exposes port 9411. In case you are using OpenTelemetry, the dependency would be-

<dependency>
  <groupId>io.opentelemetry</groupId>
  <artifactId>opentelemetry-exporter-zipkin</artifactId>
</dependency>

The io.micrometer:micrometer-registry-prometheus dependency is needed to expose Micrometer metrics in the Prometheus format.

Adding Configuration

Add the following properties in the application.yml file.

management:
  endpoint:
    health:
      show-details: always
  endpoints:
    web:
      exposure:
        include: health, metrics, prometheus
  metrics:
    distribution:
      percentiles-histogram:
        http:
          server:
            requests: true
  tracing:
    sampling:
      probability: 1.0

logging:
  pattern:
    level: "%5p [${spring.application.name:},%X{traceId:-},%X{spanId:-}]"

management.endpoints.web.exposure.include property exposes the mentioned endpoints (health, metrics and prometheus) through the actuator.

The logging.pattern.level property is needed to add the application name, trace ID and span ID to the logs.

management.tracing.sampling.probability is set to 1 so that all the traces are sent to latency analysis tool. By default, Spring Boot samples only 10% of requests to prevent overwhelming the trace backend. This property switches it to 100% so that every request is sent to the trace backend.

Setting management.metrics.distribution.percentiles-histogram.http.server.requests=true means Micrometer accumulates values into an underlying histogram and ships a predetermined set of buckets to the monitoring system. This helps in collecting the data needed to observe histogram-based percentile distributions.
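The percentiles-histogram property applies to the built-in http.server.requests metric. If you want the same behaviour for your own timers, Micrometer's builder API can enable it programmatically; a small sketch with a hypothetical metric name and class:

import java.time.Duration;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;

@Component
public class AccountCallTimer {

  private final Timer timer;

  public AccountCallTimer(MeterRegistry registry) {
    // Publish histogram buckets so Prometheus can calculate percentiles,
    // just as the percentiles-histogram property does for http.server.requests.
    this.timer = Timer.builder("account.client.calls")
        .publishPercentileHistogram()
        .register(registry);
  }

  public void record(Duration elapsed) {
    timer.record(elapsed);
  }
}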

/src/main/resources/logback-spring.xml

This XML file is needed to configure the loki-logback-appender to send logs to the local instance of Loki.

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <include resource="org/springframework/boot/logging/logback/base.xml" />
  <springProperty scope="context" name="appName" source="spring.application.name"/>

  <appender name="LOKI" class="com.github.loki4j.logback.Loki4jAppender">
    <http>
      <url>http://localhost:3100/loki/api/v1/push</url>
    </http>
    <format>
      <label>
        <pattern>app=${appName},host=${HOSTNAME},traceID=%X{traceId:-NONE},level=%level</pattern>
      </label>
      <message>
        <pattern>${FILE_LOG_PATTERN}</pattern>
      </message>
      <sortByTime>true</sortByTime>
    </format>
  </appender>

  <root level="INFO">
    <appender-ref ref="LOKI"/>
  </root>
</configuration>

RestTemplate Config

We need a RestTemplate bean that is automatically instrumented by Spring Boot, so construct the RestTemplate instance from the auto-configured RestTemplateBuilder rather than creating it with new.

import org.springframework.boot.web.client.RestTemplateBuilder;
import org.springframework.cloud.client.loadbalancer.LoadBalanced;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

@Configuration
public class TemplateConfig {

  @Bean 
  @LoadBalanced
  RestTemplate restTemplate(RestTemplateBuilder builder) { 
    return builder.build(); 
  }
}
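If your services call each other with WebClient (from spring-webflux) instead of RestTemplate, the same idea applies: build the WebClient from the auto-configured WebClient.Builder so that Spring Boot's instrumentation is carried over. A minimal sketch, without load balancing:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.reactive.function.client.WebClient;

@Configuration
public class WebClientConfig {

  // The injected builder is the auto-configured, instrumented one.
  @Bean
  WebClient webClient(WebClient.Builder builder) {
    return builder.build();
  }
}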

Adding log statements

Add a few log statements in the controller and service classes.

For example, in CustomerController add the @Slf4j annotation at the class level and then add a log statement in the getCustomerById() method.

@GetMapping("/{id}")
public ResponseEntity<CustomerDto> getCustomerById(@PathVariable("id") Long customerId){
  log.info("Getting data for customer with id {}", customerId);
  
  CustomerDto customer = customerService.getCustomerById(customerId);
  
  return ResponseEntity.ok(customer);
}
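For reference, the class-level part mentioned above could look like the following sketch; the request mapping and constructor injection are assumptions based on the base Eureka example.

import lombok.extern.slf4j.Slf4j;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@Slf4j
@RestController
@RequestMapping("/customer")
public class CustomerController {

  private final CustomerService customerService;

  public CustomerController(CustomerService customerService) {
    this.customerService = customerService;
  }

  // getCustomerById() shown above goes here
}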

Adding log statements in CustomerServiceImpl

@Override
public CustomerDto getCustomerById(Long id) {
  Customer customer = customerRepository.findById(id)
               .orElseThrow(()-> new CustomerServiceException("No customer found for the given id - " + id));
  log.info("Calling Account Service to get account data for customer with id {}", id);
  ResponseEntity<List<AccountDto>> response = restTemplate.exchange("http://ACCOUNT-SERVICE/account/customer/"+id, HttpMethod.GET, null, 
      new ParameterizedTypeReference<List<AccountDto>>(){});
  List<AccountDto> accounts = response.getBody();
  CustomerDto customerDto = CustomerDto.builder()
         .id(customer.getId())
         .name(customer.getName())
         .age(customer.getAge())
         .city(customer.getCity())
         .accounts(accounts)
         .build();
  return customerDto;
}

In the same way you can add log statements in AccountController and AccountServiceImpl in account-service, for example as sketched below.
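A sketch of what that could look like in AccountController; the class and method names are assumptions, while the endpoint matches the /account/customer/{id} call made from CustomerServiceImpl and the log message matches the account-service log shown later.

import java.util.List;

import lombok.extern.slf4j.Slf4j;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@Slf4j
@RestController
@RequestMapping("/account")
public class AccountController {

  private final AccountService accountService;

  public AccountController(AccountService accountService) {
    this.accountService = accountService;
  }

  @GetMapping("/customer/{customerId}")
  public ResponseEntity<List<AccountDto>> getAccountsByCustomerId(@PathVariable("customerId") Long customerId) {
    log.info("Getting account data for customer with id {}", customerId);
    return ResponseEntity.ok(accountService.getAccountsByCustomerId(customerId));
  }
}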

Installing Loki, Tempo, Prometheus and Grafana

It's easy to install and run these applications using Docker. Here is a docker-compose.yml file that downloads the images and runs these applications.

version: "3"

services:
  tempo:
    image: grafana/tempo:latest
    command: [ "-config.file=/etc/tempo.yaml" ]
    volumes:
      - ./docker/tempo/tempo-local.yml:/etc/tempo.yaml:ro
      - ./tempo-data:/tmp/tempo
    ports:
      - "3200:3200" # tempo
      - "9411:9411" # zipkin
          
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    command: [ "-config.file=/etc/loki/local-config.yaml" ]
  
  prometheus:
    image: prom/prometheus:latest
    command:
      - --enable-feature=exemplar-storage
      - --config.file=/etc/prometheus/prometheus.yml
    volumes:
      - ./docker/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
    ports:
      - "9090:9090"
      
  grafana:
    image: grafana/grafana:latest
    volumes:
      - ./docker/grafana/provisioning/datasources:/etc/grafana/provisioning/datasources:ro
      - ./docker/grafana/provisioning/dashboards:/etc/grafana/provisioning/dashboards:ro
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
      - GF_AUTH_DISABLE_LOGIN_FORM=true
    ports:
      - "3000:3000"

If you notice, certain configuration files are mounted under volumes in the docker-compose file, so we need to create those files. Within the current directory (the directory where the docker-compose file resides) create a directory named "docker" and then create the following files.

Also notice that for Grafana, authentication is disabled through the environment variables passed to the container, which is fine for local testing but not otherwise.

/docker/tempo/tempo-local.yml

server:
    http_listen_port: 3200

distributor:
    receivers:
        zipkin:

storage:
    trace:
        backend: local
        local:
            path: /tmp/tempo/blocks

/docker/grafana/provisioning/datasources/datasource.yml

This configuration file gives information about the available data sources like Loki, Prometheus, Tempo. Note the use of host.docker.internal, which resolves to the internal IP address used by the host. Using it you can connect from a container to a service on the host.

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://host.docker.internal:9090
    editable: false
    jsonData:
      httpMethod: POST
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo
  - name: Tempo
    type: tempo
    access: proxy
    orgId: 1
    url: http://tempo:3200
    basicAuth: false
    isDefault: true
    version: 1
    editable: false
    apiVersion: 1
    uid: tempo
    jsonData:
        httpMethod: GET
        tracesToLogs:
            datasourceUid: 'loki'
  - name: Loki
    type: loki
    uid: loki
    access: proxy
    orgId: 1
    url: http://loki:3100
    basicAuth: false
    isDefault: false
    version: 1
    editable: false
    apiVersion: 1
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: \[.+,(.+?),
          name: TraceID
          url: $${__value.raw}

/docker/grafana/provisioning/dashboards/dashboard.yml

apiVersion: 1

providers:
  - name: dashboards
    type: file
    disableDeletion: true
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards
      foldersFromFilesStructure: true

/docker/prometheus/prometheus.yml

global:
  scrape_interval: 2s
  evaluation_interval: 2s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
        - targets: ['host.docker.internal:9090']
  #- job_name: 'apps'
    #metrics_path: '/actuator/prometheus'
    #static_configs:
      #- targets: ['host.docker.internal:8081','host.docker.internal:8082']
      
  - job_name: 'eureka'
    metrics_path: '/actuator/prometheus'
    # Scrape Eureka itself to discover new services.
    eureka_sd_configs:
      - server: http://host.docker.internal:8761/eureka

A scrape_config section specifies a set of targets and parameters describing how to scrape them. Targets may be statically configured via the static_configs parameter or dynamically discovered using one of the supported service-discovery mechanisms.

In the above .yml file, services are discovered dynamically since they are already registered with the Eureka server, so there is no need to hardcode the service URLs. Eureka service discovery is configured with the eureka_sd_configs parameter. A static_configs section is also given (commented out) for comparison.

Testing the Spring Boot Observability

Running the Grafana Stack

To test our setup, the first thing is to start all these applications. To do that, go to the location of your docker-compose file and run the following command.

docker compose up -d

The -d switch runs the containers in detached mode.

This command builds and starts the containers. After that you can use docker compose stop and docker compose start to stop and start these containers.

Just ensure that all the containers are up by running the following command.

docker ps -a
docker compose Grafana stack

If there is any problem with any of the containers, check the Docker logs for that container.

Running the Microservices

Also start the eureka-discovery-service, customer-service and account-service. Try to access customer data- http://localhost:8081/customer/1

Accessing Grafana

To access Grafana go to http://localhost:3000/

Select Explore by clicking the toggle menu on the left-hand side.

From the dropdown select Loki to view the aggregated logs.

Grafana Explore

In the label filter you have options to filter by app, host, level and trace ID.

Select the app label, choose customer-service and then run the query.

Grafana Loki

Verify the logs and notice the trace ID.

2023-12-19T11:02:07.917+05:30  INFO [customer-service,65812ad785d60e30c67cc73d94e07aa0,c67cc73d94e07aa0] 22188 --- [http-nio-8081-exec-2] c.n.c.service.CustomerServiceImpl        : Calling Account Service to get account data for customer with id 1
2023-12-19T11:02:07.743+05:30  INFO [customer-service,65812ad785d60e30c67cc73d94e07aa0,c67cc73d94e07aa0] 22188 --- [http-nio-8081-exec-2] c.n.c.controller.CustomerController      : Getting data for customer with id 1

For me the trace ID is 65812ad785d60e30c67cc73d94e07aa0.

If I select app as account-service and run the query, then on verifying the logs the trace ID is the same for the same request though the span ID differs. That way logs from different microservices can be correlated by trace ID.

2023-12-19T11:06:29.800+05:30  INFO [account-service,,] 16160 --- [AsyncResolver-bootstrap-executor-0] c.n.d.s.r.aws.ConfigClusterResolver      : Resolving eureka endpoints via configuration
2023-12-19T11:02:08.342+05:30  INFO [account-service,65812ad785d60e30c67cc73d94e07aa0,3434618310164a97] 16160 --- [http-nio-8082-exec-1] c.n.a.controller.AccountController       : Getting account data for customer with id 1

Copy the trace ID, go back to Explore and select Tempo. Paste the copied trace ID and you can see the information about the spans within that trace and the time spent in each.

Grafana Tempo

In the same way you can go to Prometheus and select the job "eureka" to see metrics for the services discovered through Eureka.

By accessing http://localhost:8081/actuator/prometheus you can get the list of metrics exposed by actuator.

Reference: https://spring.io/blog/2022/10/12/observability-with-spring-boot-3

That's all for this topic Spring Boot Observability - Distributed Tracing, Metrics. If you have any doubt or any suggestions to make please drop a comment. Thanks!

>>>Return to Spring Tutorial Page


Related Topics

  1. Spring Boot Microservice + API Gateway + Resilience4J
  2. Spring Boot Microservice - Externalized Configuration With Spring Cloud Config
  3. Spring Boot Microservice - Load-Balancing With Spring Cloud LoadBalancer
  4. Spring Boot Event Driven Microservice With Kafka
  5. Spring Boot - spring-boot-starter-parent

You may also like-

  1. Spring MVC Excel Generation Example
  2. Spring MVC - Binding List of Objects Example
  3. Connection Pooling Using C3P0 Spring Example
  4. BeanFactoryAware Interface in Spring Framework
  5. How to Use ngFor and ngIf in Same Element in Angular
  6. Java CyclicBarrier With Examples
  7. Convert Numbers to Words Java Program
  8. React Declarative Approach
