Pattern: Distributed tracing
pattern observability service designContext
You have applied the Microservice architecture pattern. Requests often span multiple services. Each service handles a request by performing one or more operations, e.g. database queries, publishes messages, etc.
Problem
How to understand the behavior of an application and troubleshoot problems?
Forces
- External monitoring only tells you the overall response time and number of invocations - no insight into the individual operations
- Any solution should have minimal runtime overhead
- Log entries for a request are scattered across numerous logs
Solution
Instrument services with code that
- Assigns each external request a unique external request id
- Passes the external request id to all services that are involved in handling the request
- Includes the external request id in all log messages
- Records information (e.g. start time, end time) about the requests and operations performed when handling a external request in a centralized service
This instrumentation might be part of the functionality provided by a Microservice Chassis framework.
Examples
The Microservices Example application is an example of an application that uses client-side service discovery. It is written in Scala and uses Spring Boot and Spring Cloud as the Microservice chassis. They provide various capabilities including Spring Cloud Sleuth, which provides support for distributed tracing. It instruments Spring components to gather trace information and can delivers it to a Zipkin Server, which gathers and displays traces.
The following Spring Cloud Sleuth dependencies are configured in build.gradle
:
dependencies {
compile "org.springframework.cloud:spring-cloud-sleuth-stream"
compile "org.springframework.cloud:spring-cloud-starter-sleuth"
compile "org.springframework.cloud:spring-cloud-stream-binder-rabbit"
RabbitMQ is used to deliver traces to Zipkin.
The services are deployed with various Spring Cloud Sleuth-related environment variables set in the docker-compose.yml
:
environment:
SPRING_RABBITMQ_HOST: rabbitmq
SPRING_SLEUTH_ENABLED: "true"
SPRING_SLEUTH_SAMPLER_PERCENTAGE: 1
SPRING_SLEUTH_WEB_SKIPPATTERN: "/api-docs.*|/autoconfig|/configprops|/dump|/health|/info|/metrics.*|/mappings|/trace|/swagger.*|.*\\.png|.*\\.css|.*\\.js|/favicon.ico|/hystrix.stream"
This properties enable Spring Cloud Sleuth and configure it to sample all requests.
It also tells Spring Cloud Sleuth to deliver traces to Zipkin via RabbitMQ running on the host called rabbitmq
.
The Zipkin server is a simple, Spring Boot application:
@SpringBootApplication
@EnableZipkinStreamServer
public class ZipkinServer {
public static void main(String[] args) {
SpringApplication.run(ZipkinServer.class, args);
}
}
It is deployed using Docker:
zipkin:
image: java:openjdk-8u91-jdk
working_dir: /app
volumes:
- ./zipkin-server/build/libs:/app
command: java -jar /app/zipkin-server.jar --server.port=9411
links:
- rabbitmq
ports:
- "9411:9411"
environment:
RABBIT_HOST: rabbitmq
Resulting Context
This pattern has the following benefits:
- It provides useful insight into the behavior of the system including the sources of latency
- It enables developers to see how an individual request is handled by searching across aggregated logs for its external request id
This pattern has the following issues:
- Aggregating and storing traces can require significant infrastructure
Related patterns
- Log aggregation - the external request id is included in each log message
See also
- Open Zipkin - service for recording and displaying tracing information
- Open Tracing - standardized API for distributed tracing