For me, the move from manual metrics shipping to Prometheus was magical. It can easily scrape all the services in your cluster dynamically, without any static configuration. But, like any other technology we use, Prometheus needs special care and love. If not handled properly, it can easily get out of shape. Why does that happen, and how can we keep it in shape? Let's first do a quick recap of how Prometheus works.

Prometheus works differently from other monitoring systems – it uses a pull model rather than a push model. The push model is simple: just push metrics from your code directly to the monitoring system, for example – Graphite. The pull model is fundamentally different – the service exposes metrics on a specific endpoint, and Prometheus scrapes them once in a while (the scrape interval – see here how to configure it).

While there are reasons to prefer the pull model over push, it has its own challenges: each scrape operation takes time. What happens if the scrape takes longer than the scrape interval? For example, let's say Prometheus is configured to scrape its targets (that's how services are called in Prometheus language) once every 20 seconds. What will happen if one scrape takes more than 20 seconds? The result is out-of-order metrics: instead of having a data point every 20 seconds, you get one whenever the scrape completes.

The first step is getting visibility into the scrape duration, using a very simple query.
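As a minimal sketch (the exact query shape here is illustrative), Prometheus automatically records a scrape_duration_seconds series for every target, so something along these lines works:

    # Scrape duration per target, in seconds
    scrape_duration_seconds

    # The ten slowest targets right now
    topk(10, scrape_duration_seconds)

Graphing this per job and instance quickly shows which targets are creeping toward the scrape interval.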
The next step is setting an alert – for example, an alerting rule named ScrapeDuration, routed through Prometheus Alertmanager, with the summary "Prometheus Scrape Duration is getting near the limit". Adjust the threshold based on the scrape_interval configuration – the threshold should be lower than that value. For example, if the scrape interval is 20 seconds, raise an alert when the scrape duration reaches 15 seconds.
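A sketch of such an alerting rule, reusing the ScrapeDuration name and summary from above – the 15-second threshold matches the 20-second interval example, while the for duration and severity label are illustrative assumptions:

    groups:
      - name: prometheus-self-monitoring
        rules:
          - alert: ScrapeDuration
            expr: scrape_duration_seconds > 15
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Prometheus Scrape Duration is getting near the limit"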
Taking Action

Now that we have an alert, it's time to ask: what can we do when the alert is raised? When the scrape duration is getting near the limit, we need to take action. One option is to increase the scrape interval. This is a valid solution, but it has one downside – losing metrics. The scrape interval defines the metrics resolution – you cannot have a finer resolution than the scrape interval. For example, if the scrape interval is one minute, you will have one data point per minute. So increasing the scrape interval is possible, but limited – at some point you can't increase it any more.

An alternative is to investigate what makes the scrape duration increase. One possible reason is a resource limit – for example, not enough CPU. You can use queries like rate(process_cpu_seconds_total[5m]) for CPU usage and process_resident_memory_bytes for memory usage. Use these queries to check for spikes, and consider adding resources if needed.

Another reason could be a single scrape target with a lot of time series (the built-in scrape_samples_scraped series can help spot such targets). This can be the culprit, as it increases the load on Prometheus. For example, an API that uses a user identifier as a label – which has high cardinality (read more about it here). The solution is to drop the problematic metrics – or modify the code, when possible. This will solve the issue – but not the root cause: a single service can do real damage to Prometheus, and this is dangerous. All that is required is one unaware developer (and I have made similar mistakes in the past) who decides to use a user identifier as a label (it makes sense!). How can we protect Prometheus from such a mistake?

Limiting the Scrape Size

Prometheus allows us to limit the maximum scrape size. If one of the APIs sends too many metrics and breaches this limit, Prometheus will not scrape it.
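A sketch of a scrape configuration that applies such a limit – the job name, target, limit values, and dropped metric name are illustrative assumptions, and body_size_limit is only available in newer Prometheus versions:

    scrape_configs:
      - job_name: example-api              # illustrative job name
        scrape_interval: 20s
        static_configs:
          - targets: ["example-api:8080"]  # illustrative target
        # If a single scrape returns more samples than this, the whole
        # scrape fails and nothing from it is ingested.
        sample_limit: 10000
        # Newer Prometheus versions can also cap the uncompressed
        # response body size of a scrape.
        body_size_limit: 10MB
        # Drop a known high-cardinality metric before it is stored
        # (illustrative metric name).
        metric_relabel_configs:
          - source_labels: [__name__]
            regex: api_requests_by_user_total
            action: drop

With a limit like this in place, a single misbehaving target fails its own scrape instead of dragging the whole Prometheus server down, and the remaining targets keep being scraped normally.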