LWN: Comments on "Monitoring with Prometheus 2.0" https://lwn.net/Articles/744410/ This is a special feed containing comments posted to the individual LWN article titled "Monitoring with Prometheus 2.0". en-us Tue, 11 Nov 2025 01:14:57 +0000 Tue, 11 Nov 2025 01:14:57 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net
Monitoring with Prometheus 2.0 https://lwn.net/Articles/746706/ https://lwn.net/Articles/746706/ anarcat And for what it's worth, <a href="https://gnocchi.xyz/">Gnocchi 4.2</a> was just released with support for remote Prometheus writes, which means you could, in theory, use Prometheus only for discovery, collection and alerts, and store long-term trending data in Gnocchi, which can then be used by Grafana for graphing. Tue, 06 Feb 2018 16:50:08 +0000
Monitoring with Prometheus 2.0 https://lwn.net/Articles/746262/ https://lwn.net/Articles/746262/ faxm0dem Long-term storage is very important to us, but RRDTool doesn't scale unless you DIY. The solution we came up with is to pre-aggregate the data into multiple resolutions using configurable consolidation functions, just like RRDTool does, but on top of modern storage. We wrote a <a rel="nofollow" href="http://riemann.io">riemann</a> plugin that does the aggregation in real time, and then indexes the results into <a rel="nofollow" href="http://elastic.co/products/elasticsearch">Elasticsearch</a>. It also handles the aliases so that access to the data is transparent to the user (the highest resolution available gets higher priority) and curates old data automatically so that storage usage doesn't increase over time. Fri, 02 Feb 2018 07:53:28 +0000
Monitoring with Prometheus 2.0 https://lwn.net/Articles/745276/ https://lwn.net/Articles/745276/ aowi <div class="FormattedComment"> munin-cgi-graph will graph any time period you tell it to. Just click on the statically generated graphs to get to it. It'll let you zoom in and out as much as you'd like.
The interface is crude, but perfectly workable.<br> <p> We're not auto-generating the ten-year graphs, but then again, we're also not auto-generating the three-month graphs, or the 'what happened between 18:00 and 18:30 last Tuesday'-graphs either. But the data is there, and the graphs are three mouse-clicks away when needed. As a tool for exploration, for planning and for the occasional reporting it's quite serviceable.<br> <p> If you do need to auto-generate the graphs for your purposes, then it'd require a bit more work, yes.<br> </div> Thu, 25 Jan 2018 09:22:11 +0000
Monitoring with Prometheus 2.0 https://lwn.net/Articles/744761/ https://lwn.net/Articles/744761/ bangert <div class="FormattedComment"> RRD only aggregates data if you tell it to - and yes, it is annoying that you have to specify that up front.<br> <p> A big issue for most (all?) other TSDBs is that they use much more storage per saved data byte than RRD.<br> This is not only a question of the amount of storage but also of I/O performance.<br> </div> Fri, 19 Jan 2018 09:53:51 +0000
Monitoring with Prometheus 2.0 https://lwn.net/Articles/744760/ https://lwn.net/Articles/744760/ barryascott <div class="FormattedComment"> Once you have created a couple of Grafana dashboards it then<br> becomes a short task to craft custom dashboards for any purpose.<br> The hard bit is working out what metrics we need to show to help understand<br> the behaviour we are curious about.<br> <p> That RRD reduces the detail of the metrics it stores over time is<br> a problem.<br> <p> Often I want to compare a metric from this week with the one from a<br> few weeks ago. That's often not possible with RRD once the detail<br> has gone.<br> <p> We are hoping to be able to keep full metrics for long enough with<br> Prometheus.
The trade-off is the storage needs.<br> <p> Barry<br> </div> Fri, 19 Jan 2018 08:55:35 +0000
Monitoring with Prometheus 2.0 https://lwn.net/Articles/744732/ https://lwn.net/Articles/744732/ ken <div class="FormattedComment"> I gave up trying to get munin to show more than 1 year. <br> <p> I was adding support to read out temperature from multiple temp sensors using a tellstick duo, and it would be nice to see multiple years, but in the end I could not figure out how to do it. <br> <p> I have not had time to research what to use instead, but Prometheus does not look to be the proper solution.<br> </div> Thu, 18 Jan 2018 17:27:34 +0000
Monitoring with Prometheus 2.0 https://lwn.net/Articles/744729/ https://lwn.net/Articles/744729/ fwiesweg <div class="FormattedComment"> We found this to be a valid argument, too. Our applications are running behind an nginx reverse proxy anyway, so forwarding the exporters via https (with client authentication for the scraper, too) was a straightforward and simple step.<br> </div> Thu, 18 Jan 2018 17:07:34 +0000
Monitoring with Prometheus 2.0 https://lwn.net/Articles/744728/ https://lwn.net/Articles/744728/ jcpunk <div class="FormattedComment"> Any thoughts on this vs Performance CoPilot (<a href="http://pcp.io/">http://pcp.io/</a>)?<br> </div> Thu, 18 Jan 2018 16:42:56 +0000
Monitoring with Prometheus 2.0 https://lwn.net/Articles/744726/ https://lwn.net/Articles/744726/ spaetz <div class="FormattedComment"> Thanks for showing the list of filed bugs. I find that a very nice feature.<br> </div> Thu, 18 Jan 2018 16:22:31 +0000
Monitoring with Prometheus 2.0 https://lwn.net/Articles/744722/ https://lwn.net/Articles/744722/ anarcat <div class="FormattedComment"> sure, there's a way to hack munin to keep more results. but how do you instrument graphs on top of that?
it quickly becomes a mess, unfortunately.<br> <p> but yeah, 10 years is a timespan i'd like to see...<br> </div> Thu, 18 Jan 2018 16:03:37 +0000
Monitoring with Prometheus 2.0 https://lwn.net/Articles/744690/ https://lwn.net/Articles/744690/ nim-nim <div class="FormattedComment"> If you're monitoring over untrusted networks, it's much better to add a trusted apache (or whatever) layer to perform public auth and crypto over a loopback http service such as prometheus, rather than trust every component to talk https and get the crypto aspects right.<br> <p> Sure, it takes a bit longer to set up than if it were built in, but do you trust someone to actually audit all the built-in https stacks out there? Especially given how fast https security moves nowadays?<br> </div> Thu, 18 Jan 2018 10:51:11 +0000
Monitoring with Prometheus 2.0 https://lwn.net/Articles/744688/ https://lwn.net/Articles/744688/ aowi <div class="FormattedComment"> <font class="QuotedText">&gt; Therefore, retaining samples for more than a year (which is a Munin limitation I was hoping to overcome)</font><br> <p> You can have munin use arbitrary data retention with the "graph_data_size custom" setting, though it doesn't seem to be well documented.<br> <p> <a href="https://github.com/munin-monitoring/munin/blob/ce9e01172a17269ae1f2c1d1fe60bf958b219473/lib/Munin/Master/UpdateWorker.pm#L1155">https://github.com/munin-monitoring/munin/blob/ce9e01172a...</a><br> <p> This is the default retention we use, which keeps five-minute samples for two days (5m*576), 30-minute samples (5m*6) for nine days (30m*432), et cetera, up to a 1d (5m*288) sampling for ten years (1d*3660):<br> <p> graph_data_size custom 576, 6 432, 24 540, 288 3660<br> </div> Thu, 18 Jan 2018 10:12:28 +0000
Monitoring with Prometheus 2.0 https://lwn.net/Articles/744660/ https://lwn.net/Articles/744660/ anarcat <p>Bugs filed or worked on while writing this series of articles:</p> <ul> <li><a 
href="https://github.com/prometheus/prometheus/issues/3684">disk usage metrics</a>, including <a href="https://github.com/prometheus/node_exporter/pull/789">textfile examples to actually implement this</a> and <a href="https://github.com/anarcat/tsdb/commit/2ba6cb82ef0dd068cda08459b1f3f8abe4af5c38">proof-of-concept patch to ship this natively</a></li>
<li><a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=886891">working node-exporter default config in systemd (no extra backquotes)</a></li>
<li><a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=886893">document how to hook mtail properly in prometheus</a></li>
<li><a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=886894">do not use daemon when systemd is available</a></li>
<li><a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=887112">new apache-exporter upstream version available (0.5.0)</a></li>
<li><a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=835210">Grafana out of date in Debian</a></li> </ul> <p>Most of those are actually filed against the Debian project's packaging, because I had good interactions with the package maintainer there. Thanks again to tincho for all the help in setting up Prometheus and technical reviews of the article.</p> Wed, 17 Jan 2018 23:25:20 +0000
Monitoring with Prometheus 2.0 https://lwn.net/Articles/744646/ https://lwn.net/Articles/744646/ anarcat <blockquote> "But for monitoring endpoints, this is potentially hundreds of services that are available publicly without any protection." That part is a misunderstanding. For scraping, Prometheus supports all kinds of security, including regular TLS, client certificates (https://prometheus.io/docs/prometheus/latest/configuratio...) as well as HTTP basic auth (https://prometheus.io/docs/prometheus/latest/configuratio...). </blockquote> Sure: prom supports scraping HTTPS targets. But by default, the node_exporter (and in fact most exporters as well) does not export its metrics through HTTPS.
Users are told to install a TLS proxy in front to enable end-to-end security. <p> And even then: this doesn't authenticate the collecting server against the metrics target. For that you need yet another authentication layer. Furthermore, many container deployments do not use HTTPS internally: it's all plain text, and then HTTPS is added on the edges, which means a lot of this traffic goes in the clear. So I think it's a fairly accurate description. It doesn't mean it's catastrophic: many organizations have been running Munin exactly that way forever. But it's something to keep in mind when deploying Prometheus: it's not magic. <p> The security guide is great, in that regard: honest, and to the point. Thank you for that. <blockquote> Besides that, nice overview. The criticism is valid, however in my experience the benefits start to outweigh the downsides at a certain scale, e.g. at some point the flexibility and interoperability with other components becomes a major feature (e.g. "having to" use Grafana is nice because we show data from other sources than just Prometheus, etc). I am sure more "out-of-the-box" solutions will show up eventually. </blockquote> Yep. Note that in the last paragraph, I suggest sysadmins should wait before converting existing infrastructures, but I would probably use prometheus to monitor any new infrastructure I would set up in the future. My only concern is disk space and downsampling, but I will be touching on that subject more in the next article, which should come out next week. Stay tuned! :) Wed, 17 Jan 2018 20:44:22 +0000
Monitoring with Prometheus 2.0 https://lwn.net/Articles/744633/ https://lwn.net/Articles/744633/ bitfehler <div class="FormattedComment"> "But for monitoring endpoints, this is potentially hundreds of services that are available publicly without any protection."<br> <p> That part is a misunderstanding.
For scraping, Prometheus supports all kinds of security, including regular TLS, client certificates (<a href="https://prometheus.io/docs/prometheus/latest/configuration/configuration/#%3Ctls_config%3E">https://prometheus.io/docs/prometheus/latest/configuratio...</a>) as well as HTTP basic auth (<a href="https://prometheus.io/docs/prometheus/latest/configuration/configuration/#%3Cscrape_config%3E">https://prometheus.io/docs/prometheus/latest/configuratio...</a>).<br> <p> Besides that, nice overview. The criticism is valid, however in my experience the benefits start to outweigh the downsides at a certain scale, e.g. at some point the flexibility and interoperability with other components becomes a major feature (e.g. "having to" use Grafana is nice because we show data from other sources than just Prometheus, etc). I am sure more "out-of-the-box" solutions will show up eventually.<br> <p> Disclaimer: I work at SoundCloud ;)<br> </div> Wed, 17 Jan 2018 19:02:48 +0000
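To make the scraping options discussed in this thread (TLS, client certificates, HTTP basic auth) a little more concrete, here is a minimal Python sketch of what the scraper side of such a request could look like. It uses only the Python standard library and is not Prometheus code; the function names, the example URL, the credentials, and the certificate paths are all hypothetical and chosen purely for illustration.

```python
import base64
import ssl
import urllib.request


def build_tls_context(ca_file=None, cert_file=None, key_file=None):
    """Return an SSL context that verifies the target's certificate.

    If cert_file is given, the scraper also presents a client certificate,
    so the target (or a TLS proxy in front of it) can authenticate the scraper.
    """
    context = ssl.create_default_context(cafile=ca_file)
    if cert_file is not None:
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


def build_scrape_request(url, username, password):
    """Return a GET request carrying an HTTP basic auth header."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


def scrape(url, username, password, context):
    """Fetch the metrics page as text (needs a reachable HTTPS target)."""
    request = build_scrape_request(url, username, password)
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return response.read().decode("utf-8")
```

Calling `scrape("https://node.example.org:9100/metrics", "scraper", "s3cret", build_tls_context(cert_file="client.pem", key_file="client.key"))` would then fetch the metrics, provided an exporter or TLS proxy is actually listening at that (made-up) address and accepts those credentials. As anarcat points out, exporters typically serve plain HTTP by default, so a setup like this assumes a proxy has been placed in front of them.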