In a microservice architecture like the containerized Sitecore XP1 topology, the ability to track the progression of a single request through different services is very important to investigate problems, optimize our application code and build more reliable services. This method, called distributed tracing, helps to clearly show the dependencies and relationships among various services and components of a system. In complex distributed systems, where each service generates its own logs, it is also important to be able to correlate application logs to specific web request traces. This method, called contextual logging, automates the association of logs to traces and spans and adds new logging properties (like error messages or error classes) in the group of tracing dimensions, that can be used in querying and troubleshooting processes when trying to identify a specific pattern or clue.
In the Azure cloud world, these two methods can be implemented instrumenting the Sitecore services with Application Insights, a feature of Azure Monitor that provides extensible application performance management and monitoring capabilities.
In this blog post, I will describe how to procure Application Insights in an existing AKS cluster and the steps needed to instrument all services of a containerized Sitecore 10.2 XP1 solution to collect distributed traces and correlated Sitecore logs, targeting the latest Application Insights SDK version (2.20).
There are several ways to collect logs in Kubernetes. For Sitecore applications running in a Kubernetes cluster, the official Kubernetes specifications and the Sitecore images support the following two approaches to expose logs outside of containers: using the default stdout container logs or using a persistent volume to share logs outside containers.
The first approach relies on the fact that Sitecore images have been instrumented with LogMonitor, a Microsoft container tool responsible to monitor log sources inside the container and to pipe a formatted output to the stdout container output. Log sources can include Windows system event logs, IIS logs and Sitecore application logs. The collected logs are all piped together and it can be challenging to filter and analyze them, because LogMonitor doesn’t enrich the logs with a source property. This challenge becomes even bigger when the Sitecore application generates multiline logs, for example when an error with its stack trace is logged: each line of a multiline log record is collected separately and they can get separated and mingled in the piped stdout output with other source logs (for example IIS logs).
The second approach consists instead in storing logs in a persistent volume, so that can be made available and stored outside the containers, in an Azure File storage resource. This configuration has been introduced recently with Sitecore 10.2 Kubernetes specifications and it has been configured for Sitecore application logs only. The same configuration could be applied for other file sources inside the containers, like for example IIS logs. This approach helps to keep different source logs separated outside the containers.
Once the logs are available outside the containers, there are different solutions that can be implemented to collect them in the desired log analytics platform. In Azure Kubernetes Service, the simplest solution is to rely on the out-of-the-box capabilities of Container Insights, that automatically collects stdout and stderr output container logs in the associated Azure Log Analytics workspace resource thanks to its containerized monitoring agents. Alternative solutions to collect and forward logs outside Azure to other analytics platforms (like for example Elasticsearch) are more complex. They usually require to use a log forwarder application (like for example FluentBit) that can be setup to run as daemon set with a pod in each cluster node to collect and forward stdout logs or as deployment to read logs from a persistent volume storage and forward them.
In the next sections I am going to describe how to enhance the collection of stdout logs of AKS cluster containers to avoid a fragmented collection of multiline logs and how to query them in the Azure Monitor targeting each source separately.
In the previous post of this series I described the steps to create a new monitoring namespace in an existing Sitecore AKS cluster and to install two important tools for metrics collection and data visualization: Prometheus and Grafana. In this blog post I will describe how to enhance and configure Grafana data sources to collect metrics from internal and external AKS cluster resources. I will also describe and share a comprehensive collection of Grafana dashboards to visualize the collected metrics.
When working with a containerized distributed platform like Sitecore, it is important to be able to monitor metrics of all critical resources in its AKS cluster. The Container Insights monitoring feature available in AKS provides valuable data for the platform and the containerized application, but lacks deeper metric collection at workload level. For this reason, a couple of years ago Microsoft launched a native capability to integrate Azure Monitor with Prometheus, a popular open-source metric monitoring solution.
Prometheus collects metrics as time series: a series of timestamped values belonging to the same metric and indexed in time order. The metrics collection occurs via a pull model over HTTP from collection targets identified dynamically through service discovery or via static configuration. Metrics from a target can be exposed in Prometheus metrics data format implementing a custom HTTP endpoint using a client library (there are many available in different languages, including .NET / C#) or using a third-party exporter that collects, converts and exposes the metrics over HTTP.
Prometheus metrics data can be queried in the Prometheus Web UI using its own PromQL query language. Or, for a much more intuitive and robust experience, they can be visualized in Grafana, an open source data visualization solution, that allows to query, visualize and alert on data collected from multiple sources, including the Prometheus API.
Both Prometheus and Grafana are two important must-have additions to the monitoring stack of an AKS cluster. In the next sections, I will describe how to install and configure Prometheus and Grafana in an existing AKS cluster, how to export host metrics from a Windows node in the cluster and how to access them in the Azure Log Analytics workspace of the cluster and in Grafana.
With the release of the Sitecore 10 platform a couple of years ago, Sitecore started to support running applications in containers and provided guidance on how to deploy a containerized Sitecore application to a Kubernetes cluster. Since then, Kubernetes specification files are now included in each new Sitecore release and a full example of the deployment process has been shared in the newly launched Sitecore MVP site GitHub repository.
The containerized approach is meant to compete with the established Sitecore PaaS approach, that instead consists in running the Sitecore platform in Microsoft Azure cloud platform using PaaS resources (like for example app services and app service plans). The path to adopt Sitecore containerized applications in production is still in its early phase, but it is gradually growing and progressing.
An area that could speed up the adoption and help with building confidence in running Sitecore on Kubernetes in production is monitoring. For Sitecore PaaS, monitoring has always been an integrated part of the delivered solution, thanks to the Sitecore Application Level Monitoring module (distributed as ARM templates on the Sitecore Azure Quickstart Templates repository) and the embedded instrumentation for Application Insights for all Sitecore role instances. With the introduction of containers, monitoring assets have been removed (or disabled) from the Sitecore application and there is no direct documentation support on how to procure and configure monitoring tools in Kubernetes. This choice has made the Sitecore containerized application agnostic of a particular hosting platform (AKS, EKS, GKE, …), but at the same time has removed a must-have feature for managing a complex application in production.