1 Scaling and Reliability

2 Reliablity and fault tolerance

Fault tolreance for monitoring services is addressed by making the monitoring service highly availably usually by clustering the implementation. Clustering solution however require complex networking and management of state between nodes in the cluster.
The recommended fault toloreant solution for Prometheus is to run two identically configured Prometheus servers in parallel, both active at the same time. Duplicate alerts are handled upstream in Alertmanager using it's grouping and inhibits capacity.
Alertmanager is made fault tolerant by creating a cluster of Alertmanagers. All prometheus servers send alerts to all Alertmanagers.

2.1 Setting up Alertmanager clustering

Cluster capability provided by Hashicorp's memberlist library which uses a gossip based protocol.

Let's say we have three hosts am1, am2 and am3. We will use the am1 host to initiate the cluster.

am1$ alertmanager --config.file alertmanager.yml --cluster.listen-address 172.19.0.10:8001
am2$ alertmanager --config.file alertmanager.yml --cluster.listen-address 172.19.0.20:8001 --cluster.peer 172.19.0.10:8001
am3$ alertmanager --config.file alertmanager.yml --cluster.listen-address 172.19.0.30:8001 --cluster.peer 172.19.0.10:8001

You can check that they are indeed clustered at https://127.0.0.1:9000/status

2.2 Configuring Prometheus for an Alertmanager cluster

Edit prometheus.yml:

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - am1:9093
      - am2:9093
      - am3:9093

The above configuration assumes that the Prometheus server can resolve DNS entries for each of the alertmanager.

3 Scaling

Scaling usually takes two forms:

Functional scaling
Horizontal scaling

3.1 Functional scaling

Splits monitoring concerns onto separate Prometheus servers.

3.2 Horizontal shards

Horizontal sharding uses a series of worker prometheus servers each of which scrapes a subset of targets. We then aggregate specific time series we're interested in on the worker servers.
The proimary server not only pulls in the aggregated metrics but now also acts as the default source for graphing or exposing metrics to tools like Grafana.

4 Remote storage

Prometheus has the capability to write to remote stores of metrics.