LWN: Comments on "Monitoring with Prometheus 2.0"

Monitoring with Prometheus 2.0

anarcat — Tue, 06 Feb 2018 16:50:08 +0000

And for what it's worth, Gnocchi 4.2 was just released with support for remote Prometheus writes, which means you could, in theory, use Prometheus only for discovery, collection and alerts, and store long-term trends in Gnocchi, which can then be used by Grafana for graphing.

Monitoring with Prometheus 2.0

faxm0dem — Fri, 02 Feb 2018 07:53:28 +0000

Long-term storage is very important to us, but RRDTool doesn't scale unless you DIY. The solution we came up with is to pre-aggregate the data into multiple resolutions using configurable consolidation functions, just like RRDTool does, but on top of modern storage. We wrote a Riemann plugin that does the aggregation in real time and then indexes the results into Elasticsearch. It also handles the aliases so that access to the data is transparent to the user (the highest resolution available gets priority) and curates old data automatically so that storage usage doesn't grow over time.
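
As a rough illustration of the idea above (RRD-style consolidation into coarser resolutions, then serving the finest resolution that still covers the requested time), here is a minimal Python sketch. It is not the Riemann plugin or its Elasticsearch layout; the function names, the tier format and the sample data are all assumptions made up for the example.

```python
from statistics import mean


def consolidate(samples, step, func=mean):
    """Group (timestamp, value) samples into `step`-second buckets and
    reduce each bucket with a consolidation function (mean, max, min...)."""
    buckets = {}
    for ts, value in samples:
        # Align each sample to the start of its bucket.
        buckets.setdefault(ts - ts % step, []).append(value)
    return [(start, func(values)) for start, values in sorted(buckets.items())]


def best_resolution(tiers, age_s):
    """Given hypothetical (step_seconds, retention_seconds) tiers, return the
    finest step whose retention still covers data that is `age_s` old."""
    for step, retention in sorted(tiers):
        if age_s <= retention:
            return step
    return None  # older than every tier: the data has been curated away


# Hypothetical raw samples: (unix_timestamp, value)
raw = [(0, 1.0), (60, 3.0), (300, 5.0), (360, 7.0)]
five_minute_avg = consolidate(raw, 300)
five_minute_max = consolidate(raw, 300, max)
```

Querying then comes down to calling something like best_resolution() for the requested time range and reading from the matching tier, which is what makes the tiers transparent to the user.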

Monitoring with Prometheus 2.0

aowi — Thu, 25 Jan 2018 09:22:11 +0000

munin-cgi-graph will graph any time-period you tell it to. Just click on the statically generated graphs to get to it. It'll let you zoom in and out as much as you'd like. The interface is crude, but perfectly workable.

We're not auto-generating the ten-year graphs, but then again, we're also not auto-generating the three-month graphs, or the 'what happened between 18:00 and 18:30 last Tuesday'-graphs either. But the data is there, and the graphs are three mouse-clicks away when needed. As a tool for exploration, for planning and for the occasional reporting it's quite serviceable.

If you do need to auto-generate the graphs for your purposes, then it'd require a bit more work, yes.

Monitoring with Prometheus 2.0

bangert — Fri, 19 Jan 2018 09:53:51 +0000

RRD only aggregates data if you tell it to - and yes, it is annoying that you have to specify that up front.

A big issue for most (all?) other TSDBs is that they use much more storage per byte of saved data than RRD.
This is not only a question of the amount of storage but also of I/O performance.

Monitoring with Prometheus 2.0

barryascott — Fri, 19 Jan 2018 08:55:35 +0000

Once you have created a couple of Grafana dashboards, it
becomes a short task to craft custom dashboards for any purpose.
The hard bit is working out which metrics we need to show to help
understand the behaviour we are curious about.

That RRD reduces the detail of the metrics it stores over time is
a problem.

Often I want to compare a metric from this week with the one a
few weeks ago. That's often not possible with RRD once the detail
has gone.

We are hoping to be able to keep full metrics for long enough with
Prometheus. The trade-off is the storage needs.

Barry
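
To put a rough number on that storage trade-off, the Prometheus storage documentation gives a back-of-the-envelope formula: disk space is roughly retention time times ingested samples per second times bytes per sample, with an average of about 1 to 2 bytes per sample. The sketch below just applies that formula; the function name, the series count and the scrape interval are assumptions for illustration, and real usage depends on churn and how well the data compresses.

```python
def estimate_disk_bytes(series, scrape_interval_s, retention_s, bytes_per_sample=2.0):
    """Rough capacity estimate: one sample per series per scrape interval,
    kept for retention_s seconds, at an assumed average size per sample."""
    total_samples = series * retention_s / scrape_interval_s
    return total_samples * bytes_per_sample


# Hypothetical setup: 1000 active series, 15 s scrape interval, one year kept.
one_year_s = 365 * 86400
estimate = estimate_disk_bytes(1000, 15, one_year_s)
print(f"{estimate / 1e9:.1f} GB")
```

Under those assumptions, a year of full-resolution data for a small setup stays in the low gigabytes, but the figure grows linearly with the number of series and with the retention period.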

Monitoring with Prometheus 2.0

ken — Thu, 18 Jan 2018 17:27:34 +0000

I gave up trying to get Munin to show more than one year.

I was adding support for reading temperatures from multiple sensors using a Tellstick Duo, and it would have been nice to see multiple years, but in the end I could not figure out how to do it.

I have not had time to research what to use instead, but Prometheus does not look like the right solution.

Monitoring with Prometheus 2.0

fwiesweg — Thu, 18 Jan 2018 17:07:34 +0000

We found this to be a valid argument, too. Our applications are running behind an nginx reverse proxy anyway, so forwarding the exporters via HTTPS (with client authentication for the scraper, too) was a simple and straightforward step.

Monitoring with Prometheus 2.0

jcpunk — Thu, 18 Jan 2018 16:42:56 +0000

Any thoughts on this vs Performance CoPilot (http://pcp.io/)?

Monitoring with Prometheus 2.0

spaetz — Thu, 18 Jan 2018 16:22:31 +0000

Thanks for showing the list of filed bugs. I find that a very nice feature.

Monitoring with Prometheus 2.0

anarcat — Thu, 18 Jan 2018 16:03:37 +0000

sure, there's a way to hack munin to keep more results. but how do you instrument graphs on top of that? it quickly becomes a mess, unfortunately.

but yeah, 10 years is a timespan i'd like to see...

Monitoring with Prometheus 2.0

nim-nim — Thu, 18 Jan 2018 10:51:11 +0000

If you're monitoring over untrusted networks, it's much better to add a trusted Apache (or whatever) layer to perform public auth and crypto in front of a loopback HTTP service such as Prometheus, rather than trust every component to talk HTTPS and get the crypto aspects right.

Sure, it takes a bit longer to set up than if it were built in, but do you trust someone to actually audit all the built-in HTTPS stacks out there? Especially given how fast HTTPS security moves nowadays?

Monitoring with Prometheus 2.0

aowi — Thu, 18 Jan 2018 10:12:28 +0000

> Therefore, retaining samples for more than a year (which is a Munin limitation I was hoping to overcome)

You can have Munin use arbitrary data retention with the "graph_data_size custom" setting, though it doesn't seem to be well documented.

https://github.com/munin-monitoring/munin/blob/ce9e01172a...

This is the default retention we use, which keeps five minute samples for two days (5m*576), 30 minute samples (5m*6) for nine days (30m*432) et cetera up to a 1d (5m*288) sampling for ten years (1d*3660):

graph_data_size custom 576, 6 432, 24 540, 288 3660
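
To double-check the arithmetic, here is a small Python sketch that expands such a line into resolution and retention per archive. It assumes only the format as described above (a first number giving the row count at the base resolution, then "multiplier rows" pairs) and a five-minute base step; the function name is made up, and this is not how Munin itself parses the setting.

```python
def parse_graph_data_size(setting, base_step_s=300):
    """Return a list of (resolution_seconds, retention_seconds) tuples
    for a 'graph_data_size custom ...' line in the format described above."""
    spec = setting.split("custom", 1)[1]
    archives = []
    for i, part in enumerate(p.strip() for p in spec.split(",")):
        fields = part.split()
        if i == 0:
            # First entry: row count only, at the base resolution.
            multiplier, rows = 1, int(fields[0])
        else:
            # Later entries: "<multiplier of base step> <rows>".
            multiplier, rows = int(fields[0]), int(fields[1])
        step = multiplier * base_step_s
        archives.append((step, step * rows))
    return archives


archives = parse_graph_data_size("graph_data_size custom 576, 6 432, 24 540, 288 3660")
for step, retention in archives:
    print(f"{step // 60} min samples kept for {retention // 86400} days")
```

With the line above this gives 5-minute samples for 2 days, 30-minute samples for 9 days, 2-hour samples for 45 days and 1-day samples for 3660 days.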

Monitoring with Prometheus 2.0

anarcat — Wed, 17 Jan 2018 23:25:20 +0000

Bugs filed or worked on while writing on this series of articles:

Most of those are actually filed against the Debian project's packaging, because I had good interactions with the package maintainer there. Thanks again to tincho for all the help in setting up Prometheus and technical reviews of the article.

Monitoring with Prometheus 2.0

anarcat — Wed, 17 Jan 2018 20:44:22 +0000

> "But for monitoring endpoints, this is potentially hundreds of services that are available publicly without any protection." That part is a misunderstanding. For scraping, Prometheus supports all kinds of security, including regular TLS, client certificates (https://prometheus.io/docs/prometheus/latest/configuratio...) as well as HTTP basic auth (https://prometheus.io/docs/prometheus/latest/configuratio...).

Sure: prom supports scraping HTTPS targets. But by default, the node_exporter (and in fact most exporters) does not export its metrics through HTTPS. Users are told to install a TLS proxy in front to enable end-to-end security.

And even then: this doesn't authenticate the collecting server against the metrics target. For that you need yet another authentication layer. Furthermore, many container deployments do not use HTTPS internally: it's all plain text, and then HTTPS is added on the edges, which means a lot of this traffic goes in the clear. So I think it's a fairly accurate description. It doesn't mean it's catastrophic: many organizations have been running Munin exactly that way forever. But it's something to keep in mind when deploying Prometheus: it's not magic.

The security guide is great, in that regard: honest, and to the point. Thank you for that.

> Besides that, nice overview. The criticism is valid, however in my experience the benefits start to outweigh the downsides at a certain scale, e.g. at some point the flexibility and interoperability with other components becomes a major feature (e.g. "having to" use Grafana is nice because we show data from other sources than just Prometheus, etc). I am sure more "out-of-the-box" solutions will show up eventually.

Yep. Note that in the last paragraph, I suggest sysadmins should wait before converting existing infrastructures, but I would probably use Prometheus to monitor any new infrastructure I set up in the future. My only concern is disk space and downsampling, but I will be touching on that subject more in the next article, which should come out next week. Stay tuned! :)

Monitoring with Prometheus 2.0

bitfehler — Wed, 17 Jan 2018 19:02:48 +0000

"But for monitoring endpoints, this is potentially hundreds of services that are available publicly without any protection."

That part is a misunderstanding. For scraping, Prometheus supports all kinds of security, including regular TLS, client certificates (https://prometheus.io/docs/prometheus/latest/configuratio...) as well as HTTP basic auth (https://prometheus.io/docs/prometheus/latest/configuratio...).

Besides that, nice overview. The criticism is valid, however in my experience the benefits start to outweigh the downsides at a certain scale, e.g. at some point the flexibility and interoperability with other components becomes a major feature (e.g. "having to" use Grafana is nice because we show data from other sources than just Prometheus, etc). I am sure more "out-of-the-box" solutions will show up eventually.

Disclaimer: I work at SoundCloud ;)