Kartar.Net

Monitoring Survey 2015 - Data

2015-08-12T00:00:00-04:00

Over the course of the series I’ve talked about monitoring effectiveness, monitoring environments, metrics, the tools people use to monitor and the demographics of the survey.

In this last post I am providing the anonymized source data that I based my analysis on. It’s in CSV form and comes directly from Survey Monkey. The only data I have removed is the IP address of the respondents to make it anonymous.

For those interested I used R and R Studio to produce the analysis and ggplot2 to produce the graphs.

P.S. I am also writing a book about monitoring.

Monitoring Survey 2015 - Data was originally published by James Turnbull at Kartar.Net on August 12, 2015.

Monitoring Survey 2015 - Effectiveness

2015-08-11T00:00:00-04:00

In the last posts I talked about monitoring environments, metrics, the tools people used in monitoring and the demographics of the survey.

In this post I am going to look at the questions around the effectiveness of monitoring, how people handle alerting and the use of configuration management software.

As I’ve mentioned in previous posts, the survey got 1,116 responses of which 884 were complete and my analysis only includes complete responses.

This post will cover the questions:

12. When do you most commonly add monitoring checks or graphs to your environment?
13. Do you ever have unanswered alerts in your monitoring environment?
14. How often does something go wrong that IS NOT detected by your monitoring?
15. Do you use a configuration management tool like Chef, Puppet, Salt or Ansible to manage your monitoring infrastructure?

When do you add monitoring checks or graphs to your environment?

Question 12 attempts to identify when in the product and infrastructure lifecycle you add monitoring checks to your environment. This is designed to tease out whether your monitoring is proactive or reactive.

The question had the following choices:

When something goes wrong and we want to monitor for that problem in future.
When we build new infrastructure or deploy new applications.

I’ve provided a graph showing the distribution of answers.

We can see that most people, 62.7% of them, add checks when infrastructure or applications are deployed, leaving about 37% adding checks reactively. That’s largely unchanged from last year’s response.

We’ve also mapped it by organization size.

We can see that very small and very large organizations are slightly more reactive.

Do you ever have unanswered alerts in your monitoring environment?

In Question 13 we’re interested in the measurement of alerting hygiene and how people respond to alerts. I was interested in seeing how many people had outstanding alerts and how many actioned them immediately.

Each respondent had the option to answer the question with:

No - we action them all immediately
Yes - we usually have a few
Yes - we usually have some
Yes - we usually have a lot

I’ve provided a graph showing the distribution of answers.

We can see that the largest group of respondents, 401 or 45%, usually have a few unanswered alerts. This is identical to last year’s results for this category. The next largest group, at 196 or 22% of respondents, actions all alerts immediately. A further 19% have some unanswered alerts and 13% have a lot of unanswered alerts.

I also broke down alert behavior by organization size.

This year the patterns in this breakdown again felt very familiar. Like last year there is a decrease in alerts being actioned immediately as the organization grows and an increase in volume of alerts that are not actioned.

I was also planning to add a question about alert fatigue in this year’s survey but was unable to frame one that provided viable data.

How often does something go wrong that IS NOT detected by your monitoring?

Question 14 asked about outages and failures in environments that are NOT detected via monitoring. The respondents had the option of answering:

Frequently
Occasionally
Never

I’ve graphed the responses here:

We can see that 81% of respondents had something occasionally go wrong that wasn’t detected by monitoring. 11% stated that failures frequently occurred that were not detected by monitoring. 8% stated that there were never undetected failures in their environments. This is very close to last year’s results.

I further analyzed the response by organization size.

Again we see some familiar patterns with more frequent unmonitored failures in larger organizations.

Do you use a configuration management tool

The last question, Question 15, asked respondents if they used Configuration Management to manage their monitoring environment.

This year 71.7% of respondents did use Configuration Management to manage their monitoring, which is in line with last year’s results.

0.3% or 3 respondents did not know what configuration management was.

I also analyzed the responses by organization size.

Again this year we see less use of configuration management in larger organizations.

P.S. I am also writing a book about monitoring.

Monitoring Survey 2015 - Effectiveness was originally published by James Turnbull at Kartar.Net on August 11, 2015.

Monitoring Survey 2015 - Metrics

2015-08-10T00:00:00-04:00

In the last posts I talked about the tools people used in monitoring, the demographics, and what environments people monitor. In this post I am going to look at the questions around collecting metrics and what those metrics are used for by respondents.

As I’ve mentioned in previous posts, the survey got 1,116 responses of which 884 were complete.

This post will cover the questions:

7. Do you collect metrics on your infrastructure and applications?
8. What tools do you use to collect metrics?
9. What tools do you use to store your metrics?
10. What tools do you use to visualize your metrics?
11. If you collect metrics, what do you use the metrics you track for?

Collecting Metrics

Question 7 asked if the respondents collected metrics. It was a Yes/No question.

We can see that the overwhelming majority, 88% in fact, of respondents collect metrics (slightly down from 90% last year). That continues to be a pretty conclusive indication that metrics matter.

I also broke the responses down by organization size. I was curious to see which sizes of organization were least likely to collect metrics.

We can see that there is a pretty even distribution of people that do not collect metrics across organization size.

Metric collection tools

I also asked respondents to tell me about the tools they used to collect metrics. There was a choice of potential tools and an Other option. The choice of tools included:

collectd
Cube
DataDog
Ganglia
Librato
Munin
New Relic
OpenTSDB
StatsD

We can see that both collectd and StatsD are heavily used with New Relic coming in third, in keeping with the data revealed in the tool analysis results.

The results of the Other question were also interesting. I’ve only included tools that occurred more than once to keep the list manageable.

Metrics collection tools - Other
In-house	77
Diamond	26
Sensu	23
Zabbix	19
ELK	17
Cacti	16
Nagios	13
Check_MK	13
Centreon	11
pnp4nagios	9
Splunk	9
SolarWinds	8
AppDynamics	7
Prometheus	6
Icinga2	6
NetCrunch	6
Shinken	5
Zenoss	5
jmxtrans	5
DropWizard	4
Observium	4
Dataloop	4
OpenNMS	4
Riemann	3
Coda’s Metrics	3
Cloudwatch	2
OMD	2
Dynatrace	2
Smokeping	2
Graphite	2
Stackdriver	2
Xymon	2
CopperEgg	2
Ganglia	2
LogicMonitor	2
SignalFX	2

The high number of respondents building their own metrics collection tools (77 reported having in-house tooling) is interesting. It potentially suggests that there is still a segment of the market that isn’t happy with the available tooling out there.

Also interesting was the support for Diamond, a Python-based metrics collection tool originally written by the Brightcove team and now maintained as a separate open source project.

Metric storage tools

We also asked respondents to name the tools they used to store metrics. The options for the question included:

DataDog
Graphite
Hosted Graphite
InfluxDB
Librato
OpenTSDB
RRDtool

There was also an Other option we’ll report below.

The clear winner here is Graphite. As one of the longer standing tools in the metrics space it’s not overly surprising it is so well represented. Also present in large numbers is RRDtool, an even older tool in the metrics space. The newer generation of tools is represented by InfluxDB.

These are the responses to the Other option. I’ve only included tools that occurred more than once to keep the list manageable.

Metrics storage tools - Other
ELK	28
In-house	27
Splunk	14
Zabbix	14
New Relic	9
MySQL	8
Prometheus	8
Cacti	8
SignalFX	7
AppDynamics	6
NetCrunch	6
Dataloop	5
SolarWinds	5
Stackdriver	4
Zenoss	4
Cassandra	4
CopperEgg	3
MSSQL	3
Ganglia	3
postgreSQL	2
Circonus	2
LogicMonitor	2
Check_MK	2
pnp4nagios	2
SPM	2
OpenNMS	2
kairosdb	2
Xymon	2
Redis	2

Interesting to note here are the people using the ELK stack and in-house tools to store their metric data. I’ve been seeing a lot of tools and services converting data and metrics into Logstash’s JSON format and using Logstash as a filtering router and Elasticsearch as storage.

Metric visualization tools

Our last tooling question focussed on metrics visualization tools.

Respondents had a choice of the following tools:

D3
Grafana
Graphene
Graphite
Highcharts
Rickshaw
Tessera

Respondents could also select an Other option and specify other tools.

Here Grafana is a clear favorite, likely given its ability to sit on top of Graphite, InfluxDB and OpenTSDB. The next largest tool was Graphite itself and then, after a long drop-off, the D3 JavaScript library.

These are the responses to the Other option. I’ve only included tools that occurred more than once to keep the list manageable.

Metrics Visualization tools - Other
In-house	54
ELK	35
pnp4nagios	27
DataDog	24
Cacti	22
Zabbix	17
Splunk	13
Munin	13
New Relic	10
Ganglia	8
Observium	7
Librato	7
NetCrunch	7
Centreon	6
AppDynamics	6
SolarWinds	6
Dataloop	5
RRDTool	5
Dashing	5
OpenNMS	5
SignalFX	4
Stackdriver	4
Promdash	4
Check_MK	4
MRTG	3
pnp	3
Nagios	3
Circonus	3
Graphite	3
Tableau	3
CopperEgg	3
Xymon	3
Metrilyx	2
Riemann	2
Zenoss	2
LogicMonitor	2
SPM	2
Nagiosgraph	2
OpenTSDB	2
StatusWolf	2
Visage	2

Again present are a lot of in-house tools and the ELK stack in the form of Kibana. Given the presence of lots of Nagios users it’s also not a surprise to see pnp4nagios represented.

The purpose of metrics collection

I also asked respondents why they collected metrics. As with last year I was curious whether respondents were collecting data for performance analysis or as a fault detection tool. There’s a strong movement in more modern monitoring methodologies to consider metrics a fault detection tool in their own right. I was interested to see if this thinking had grown from last year.

Respondents were able to select one or more choices from the list of:

Performance analysis and trending
Fault and Anomaly detection
Capacity Planning
A/B Testing
We don’t do anything with collected metrics
Other

If respondents answered “No” to Question 7, indicating that they did not collect metrics, the survey’s skip logic moved them straight past the metrics questions to the next question.

I’ve produced a summary table of respondents and their selections.

Metrics Purpose
Performance analysis and trending	63%
Fault and Anomaly detection	53%
Capacity Planning	45%
A/B Testing	11%
We don’t do anything with collected metrics	3%

We can see that 63% of respondents specified performance analysis and trending as a reason for collecting metrics. Below that, 53% of respondents specified that they used metrics for fault and anomaly detection. This is 10% lower than last year’s survey. The next largest group, 45%, used metrics for capacity planning.

A very small group, 11%, used metrics for A/B testing.

I also summarized the Other responses as a table.

Metrics Purpose - Other
Reporting	5
Dashboards	4
Alerting	3
Business KPIs	2
Slow call traces	1
Marketing	1
Retrospectives	1
Power management	1
Fault diagnosis	1
Incident response	1
Billing	1

P.S. I am also writing a book about monitoring.

Monitoring Survey 2015 - Metrics was originally published by James Turnbull at Kartar.Net on August 10, 2015.

Monitoring Survey 2015 - Environments

2015-08-07T00:00:00-04:00

In the last posts I’ve talked about the tools people used in monitoring and the demographics of the survey.

In this post I am going to look at the question around what parts of people’s environments are monitored. As I’ve mentioned in previous posts, the survey got 1,116 responses of which 884 were complete.

This post will cover the question:

6. What parts of your environment do you monitor? Please select all the apply.

What parts of your environment do you monitor?

With Question 6 I am most interested in understanding what types of infrastructure are monitored, especially in areas beyond traditional host-based monitoring. As with last year I asked about network and application monitoring. I also introduced a new category called Cloud Infrastructure in response to feedback on this question.

Overall, I divided monitoring types into:

Server Infrastructure
Cloud Infrastructure
Network Infrastructure
Application logic
Business logic

I’ve compiled the results into a summary table.

Environments Monitored
Server Infrastructure	81%
Cloud Infrastructure	49%
Network Infrastructure	57%
Application logic	59%
Business logic	29%

81% of respondents perform Server Infrastructure monitoring. 49% monitor Cloud Infrastructure. That half the respondents have cloud infrastructure to monitor is likely a consequence of selection bias in the respondent pool.

A smaller 57% of respondents monitor Network Infrastructure. This fits with the results from last year, where I had expected more network monitoring. I posited that this may be related to the siloing of network management in many organizations into a network-specific team, or may be a selection bias.

A slightly smaller group than last year perform Application and Business logic monitoring with 59% and 29% respectively.

This year I also added an “Other” category to cover other environments or elements that people might monitor.

Environments Monitored - Other
Application	12
Database	5
Plant and Physical site	3
Workstations	1
Backups	1
External services	1

I’m assuming Application here and Application logic above are related. A smaller group also considered database monitoring as a separate category.

In the next post I’ll be looking at metrics and their use in monitoring.

P.S. I am also writing a book about monitoring.

Monitoring Survey 2015 - Environments was originally published by James Turnbull at Kartar.Net on August 07, 2015.

Monitoring Survey 2015 - Tools

2015-08-06T00:00:00-04:00

In this series I am looking at the results of my recent monitoring survey. This post looks specifically at the monitoring tools being used by respondents. As I’ve mentioned in previous posts, the survey got 1,116 responses of which 884 were complete.

This post will cover the question:

5. What tools do you use for monitoring? (Choose all that apply)

Every respondent was required to answer question five. Last year I asked about primary tools and forced respondents to select a single “primary” tool. Feedback indicated that this artificially constrained respondents and many people struggled to select a single tool. This year I allowed respondents to select all tools that they used for monitoring.

Monitoring Tools

This graph shows the monitoring tools selected. The clear winner this year is again Nagios. This suggests that advances in monitoring approaches are still potentially embryonic and evolutionary rather than revolutionary.

Interestingly, however, the next two most popular choices are AWS CloudWatch and New Relic. This data suggests a few interesting potential trends:

We’re starting to see more SaaS-based monitoring.
People are potentially using New Relic and the like as an Application Performance Management or APM tool in conjunction with other tools.
With CloudWatch it is also possible that companies that use AWS hosting are using this to supplement and feed into existing monitoring.

Also interesting is that the number of Sensu users has doubled from last year. That’s fairly rapid growth but is only half the usage of Nagios.

There’s also a large number of home-grown tools: 230 people responded that they have a home-grown tool. That’s six times the number who indicated they had a home-grown tool last year. As a result I’ll be adding a question or questions to attempt to unpack that in next year’s survey.

Note: You can find last year’s results in this blog post

Other Tools

This is the breakdown of the Other category. 363 respondents specified other tools not specifically listed in Question 5. This table shows the summary listing of all other tools specified.

Other tools	Count
SolarWinds	23
Pingdom	21
Check_MK	19
ELK	19
Shinken	18
Munin	16
Splunk	13
Graphite	13
AppDynamics	12
Cacti	10
PRTG	10
HP	10
Consul	9
Monit	9
LogicMonitor	9
Monit	9
OpenNMS	9
NetCrunch	8
Prometheus	8
Dynatrace	8
Observium	7
OMD	7
Nimsoft	6
Collectd	6
Ganglia	6
Circonus	5
Dataloop	5
Op5	4
Scout	4
Sentry	4
SignalFX	4
Grafana	4
Rackspace Cloud Monitoring	4
Stackdriver	3
Hyperic	3
Statsd	3
NodePing	3
CopperEgg	3
Monitis	3
What’s Up	3
PagerDuty	3
Smokeping	3
Netcool	2
MongoDB Manager Service	2
OpenTSDB	2
naemon	2
SPM	2
Bosun	2
Flapjack	2
Kibana	2
BMC	2
CA	2
StackDriver	2
New Relic	2
Loggly	2
Tivoli	2
Azure AppInsights	2
PCP/Vector	2
Uptime robot	2
Sysdig Cloud	2
Graphite-beacon	1
Opsware	1
Alerta	1
graphite-pager	1
ScienceLogic EM7	1
Netuitive	1
ServerSpec	1
Seyren	1
Sitescope	1
NMSaaS	1
Intermapper	1
SNMP	1
Opsmatic	1
Logwatch	1
Monasco	1
Big Brother	1
Wavefront	1
Boundary	1
locust	1
Server density	1
Elasticsearch	1
torrus	1
LibreNMS	1
Metrics.net	1
Fluentd	1
ITRS Geneos	1
Argo	1
uptrends	1
Livewatch	1
vRealize Operations	1
MonYog	1
Jennifer	1
Icinga2	1
mon	1
Pulseway	1
diamond	1
Moogsoft	1
CloudMonix	1
www.cronalarm.com	1
VividCortex	1
Runscope	1
Rancid	1
Catchpoint	1
Truk	1
Kubernetes	1
Gomez/Compuware	1
Nagios	1
COTS	1
Graylog	1
Tensor	1
Elastic Watcher	1
ruxit	1
SumoLogic	1
Neustar	1
Traverse	1
Squash	1
The Dude	1
RightScale	1
Geckoboard	1
Pandora	1
VeeamOne	1
StatusWolf	1
Keymetrics.io for NodeJS	1
Logsene	1
WebNMS	1
Cloudhealth	1

A lot of the use cases here appear to be more domain-specific monitoring: network-specific tools like SolarWinds and Pingdom, or log management tools like the ELK stack and Splunk.

Newcomer Prometheus also appeared with 8 respondents stating they used it.

Finally it was also interesting to see 9 respondents report using Consul for monitoring. There was a strong negative reaction to a recent post suggesting that approach.

In the next post I’ll look at what environments people monitor.

P.S. I am also writing a book about monitoring.

Monitoring Survey 2015 - Tools was originally published by James Turnbull at Kartar.Net on August 06, 2015.

Monitoring Survey 2015 - Demographics

2015-08-05T00:00:00-04:00

In an earlier post I talked about the 2015 edition of the monitoring survey and the background to it. In this post, the first of several posts analyzing the results, I am going to look at the demographics of the responses.

The survey got 1,116 responses of which 884 were complete.

This post will cover the questions:

1. Which of the following best describes your IT job role?
2. How big is your organization?
3. Are you responsible for IT monitoring in your organization
4. If you are not responsible for monitoring, who is?

Everyone was required to answer questions 1 to 3. If they answered “No” to question 3 then they were prompted with question 4. If they answered that they didn’t do any monitoring they were presented with the end of the survey. Otherwise they moved onto the next question on the form.

Note - you can find last year’s answers to these questions in this blog post.

Job Roles

Operations, SysAdmins and SRE staff represented 40% of the respondents, compared with 49% of last year’s respondents. The next largest group was DevOps at 28% of respondents, compared with 33% last year. A slightly higher percentage, 12%, of respondents reported themselves as developers, compared with 9.2% last year. This year 15% of respondents classed themselves as management of some kind, an increase of 11% from last year.

As with last year’s results, the bias towards Operations roles is likely related to the communities where the survey was distributed. But it also may be related to Operations being the traditional owners of monitoring.

Organization Size

I also asked respondents about the size of their organization.

The results are reasonably well distributed across organizations of various sizes. The largest group, 31%, are small organizations of 1 to 50 employees. Next, at 21%, are slightly larger organizations of 50 to 250 employees. In third place, at 18%, are organizations with more than 1,000 employees. This is very similar to last year’s demographic results.

Roles by Organization Size

I also created an overlay of roles distributed by organization size.

The graph reveals results similar to last year with the same slightly higher distribution of developers responding from smaller organizations and the more visible presence of architects and security folks in larger enterprises. We also see the influx of management respondents in the two largest categories of organization.

Monitoring Responsibility

I also asked respondents if they were responsible for monitoring or if the task belonged to someone else.

81% of respondents were responsible for monitoring. A further 17% of respondents were not responsible for monitoring (slightly up from 15% last year). A small group, 1.6% of respondents, indicated that their organization did not do monitoring at all. This is slightly down from last year’s result of 2.5%.

In the case where respondents were not responsible for monitoring I asked them to indicate which groups were responsible. The respondents could specify all the groups that were involved in monitoring. I’ve rolled up the multiple responses into a summary graph.

These results again reflect the distribution of roles established by respondents who did manage monitoring. Strangely, last year’s category of Monitoring Team did not reappear this year.

I’ve also broken out those people who don’t monitor. Firstly, I’ve looked at the breakdown of roles across organizations that do not monitor at all.

I’ve also broken out the count of people by organization size who don’t monitor.

Obviously it’s a very small sample size (18 respondents) but the largest group of people who don’t monitor are in smaller organizations.

In the next post I’ll be looking at the tools identified in the survey.

P.S. I am also writing a book about monitoring.

Monitoring Survey 2015 - Demographics was originally published by James Turnbull at Kartar.Net on August 05, 2015.

Monitoring Survey 2015 - Background

2015-08-04T00:00:00-04:00

As many of you are aware I recently ran a small Monitoring survey. I ran a similar survey last year and decided to see if the results had changed. Assuming interest continues I’ll run it again next year too.

Again, the intent of the survey was to understand the state of maturity across some key areas of monitoring. I was specifically interested in what sort of monitoring people were doing, some idea of why they were doing that monitoring, and what tools they were using to do that monitoring. I am also writing a book about monitoring and wanted to get some insights that could help shape the book.

The survey greatly benefited from community feedback and was tweaked in response to that and the data I received last year.

This year the survey was 15 questions across 5 pages. The questions (which included some skip logic) are reproduced here:

1. Which of the following best describes your IT job role?
2. How big is your organization?
3. Are you responsible for IT monitoring in your organization
4. If you are not responsible for monitoring, who is?
5. What tools do you use for monitoring? (Choose all that apply)
6. What parts of your environment do you monitor? Please select all the apply.
7. Do you collect metrics on your infrastructure and applications?
8. What tools do you use to collect metrics? (Choose all that apply)
9. What tools do you use to store your metrics?
10. What tools do you use to visualize your metrics?
11. If you collect metrics, what do you use the metrics you track for? (Select all that apply)
12. When do you most commonly add monitoring checks or graphs to your environment?
13. Do you ever have unanswered alerts in your monitoring environment?
14. How often does something go wrong that IS NOT detected by your monitoring?
15. Do you use a configuration management tool like Chef, Puppet, Salt or Ansible to manage your monitoring infrastructure?

The survey was launched 6/15/2015 and ran until 7/20/2015. It was advertised on this blog, Twitter, and a number of monitoring, DevOps, SysAdmin and tools events, publications and mailing lists. As a result there’s likely some bias in the responses towards more open source, DevOps, Operations and startup-centric communities.

In total there were 1,116 responses (slightly more than last year’s 1,016), of which 884 were complete (866 last year). In my analysis I’ve considered complete and some partial responses where appropriate.

I’ll be again analyzing each section of the survey in a series of posts, starting with the demographics of the respondents. Once I’ve posted my analysis I’ll be making the source data available to anyone who wants to use it.

Monitoring Survey 2015 - Background was originally published by James Turnbull at Kartar.Net on August 04, 2015.

The Art of Monitoring sample chapter

2015-06-23T00:00:00-04:00

TL;DR - The Art of Monitoring has a sample chapter

I’m writing a new book on monitoring rather illustriously called The Art of Monitoring. I’ve just released a sample chapter from the book. The chapter focuses on installing, learning and using Riemann for monitoring.

The book is progressing well and I hope to have it out at the end of the year. If you’re interested in receiving updates and getting notified when the book is released you can sign up below.

The Art of Monitoring sample chapter was originally published by James Turnbull at Kartar.Net on June 23, 2015.

Monitoring Survey 2015

2015-06-16T00:00:00-04:00

TL;DR - Please take the 2015 Monitoring Survey

Last year I ran a monitoring survey, whose data I also reviewed as a series of posts on this blog and presented in several talks. I was interested in running the survey because I think we’re seeing the beginnings of a significant change in the maturity of the monitoring landscape.

I’ve decided to make the survey a yearly event and am timing the launch of this year’s survey to coincide with Monitorama in Portland.

The survey takes about 5 minutes to fill out and the results will again be presented on this blog and in some conference talks, and made available as Creative Commons licensed data. The survey is totally anonymous and the data won’t be used for any commercial purposes.

You can find the survey at https://www.surveymonkey.com/s/monitoringsurvey2015.

Running the survey last year resulted in numerous suggestions on methodology and approach. I want to thank everyone who responded to the survey and who provided feedback that contributed to this year’s survey especially: Paul Nasrat, Lindsay Holmwood, and John Allspaw.

Thanks in advance!

Monitoring Survey 2015 was originally published by James Turnbull at Kartar.Net on June 16, 2015.

Looking up events in the Riemann index

2015-06-15T00:00:00-04:00

Forthcoming book - The Art of Monitoring

One of the classic problems of monitoring alerts is that they are often very cryptic. Coupled with the challenge of alert fatigue¹ this makes working out what to do next when you receive an alert quite tricky. Additionally, alerts often happen when we’re not at the top of our game: a 4am Sunday morning alert is not likely to foster an exemplary response.

The quintessential example of a cryptic, unhelpful alert is the Nagios disk space alert.

PROBLEM Host: server.example.com
Service: Disk Space

State is now: WARNING for 0d 0h 2m 4s (was: WARNING) after 3/3 checks

Notification sent at: Thu Aug 7th 03:36:42 UTC 2015 (notification number
1)

Additional info:
DISK WARNING - free space: /data 678912 MB (9% inode=99%)

What does this alert mean? We can see that filesystem /data has 678912 MB of disk space left or 9%. Should we worry? How fast is it filling up? Is this likely to happen RSN or sometime in the future? What’s on that filesystem? Do I care if it fills up? I already have five questions from a single alert and I haven’t even started to diagnose WHY things might be wrong. Meh, I am going back to sleep.

Thankfully, in the middle of last year the estimable Ryan Frantz released Nagios Herald. Nagios Herald is a decorator for Nagios alerts. It allows you to add context or further information to alerts generated by Nagios.

For example, here is a decorated Nagios disk alert.

Much more useful. Nice big stack bar. Helpful graph. Output from the df command. With this information I’m feeling a lot more comfortable about fixing the issue. (You can find a bunch of other example alerts here too.)

So helpful to all using Nagios. Not so helpful to others. (Although I think there is support for user-supplied attributes in Sensu and uchiwa and probably some other tools but nothing quite so well integrated and helpful (yet).)

So in the spirit of recent Riemann posts I thought about what I could do quickly and simply to provide some context for alerts, specifically email alerts. Riemann does have one useful store of information: the index. Every event you index is stored in there until its TTL expires and the expiration reaper runs. So if you’re collecting useful events then some of those might help to color your alerts with helpful context.
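To make the TTL idea concrete, here is a rough sketch in plain Clojure of how an indexed event can be judged stale. It is not Riemann’s reaper: the expired? helper is hypothetical, it assumes an event is stale once its :time plus its :ttl lies in the past, and the fallback of 60 seconds is borrowed from the (default :ttl 60 ...) used in the configuration in this post.

```clojure
;; Hypothetical helper, not part of Riemann's API.
;; Assumption: an event is stale once (:time + :ttl) is earlier than `now`.
(def fallback-ttl 60)

(defn expired?
  "True when `event` has outlived its TTL at Unix time `now` (seconds)."
  [now event]
  (let [ttl (or (:ttl event) fallback-ttl)]
    (< (+ (:time event) ttl) now)))

;; The :time and :ttl values are taken from the collectd df event in this post.
(def df-event {:host "host.example.com"
               :service "df-root/percent_bytes-used"
               :time 1433706333
               :ttl 20.0})
```

Ten seconds after it arrives the event would still be usable as alert context; thirty seconds after, it would be a candidate for expiry.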

In my environment Riemann receives events from collectd and does most of its alerting based on the values of collectd metrics. One of collectd’s plugins, df, emits metrics that measure how full your filesystems are. It emits a metric like so:

{:host host.example.com, :service df-root/percent_bytes-used, :state nil, :description nil, :metric 90.334929260253906, :tags [collectd], :time 1433706333, :ttl 20.0, :ds_index 0, :ds_name value, :ds_type gauge, :type_instance used, :type percent_bytes, :plugin_instance root, :plugin df}

We can use this event, through the :service field, for example :service df-root/percent_bytes-used, to identify when specific filesystems have exceeded a threshold.

We can create a configuration like so to do this:

(let [index (index)]

  (streams
    (default :ttl 60
      ; Index all events immediately.
      index

      ; Email when any df percent_bytes-used metric reaches 90% or more.
      (where (and (service #"^df-(.*)/percent_bytes-used") (>= metric 90.0))
        (email "james@example.com")))))

This uses the where filter stream to select all df-generated metrics matching df-(.*)/percent_bytes-used. This should find the percent bytes used for every filesystem we’re monitoring; for example, for the / filesystem the metric would be df-root/percent_bytes-used. Our where filter also only matches when the metric, the percentage used, is greater than or equal to 90%. If an event matches, it is emailed to james@example.com using the email function.
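The matching logic can be tried outside Riemann with a small sketch in plain Clojure. The over-threshold? and alerting-services helpers and the sample events are hypothetical, made up for illustration. The sketch assumes the service pattern is applied with re-find, which may not match every detail of how where evaluates it.

```clojure
;; Same pattern as the where stream; the group captures the filesystem name.
(def df-used-pattern #"^df-(.*)/percent_bytes-used")

(defn over-threshold?
  "True when `event` is a df percent_bytes-used metric at or above `threshold`."
  [threshold event]
  (boolean
    (and (string? (:service event))
         (re-find df-used-pattern (:service event))
         (number? (:metric event))
         (>= (:metric event) threshold))))

;; Made-up events in the shape collectd's df plugin produces.
(def sample-events
  [{:host "host.example.com" :service "df-root/percent_bytes-used" :metric 90.33}
   {:host "host.example.com" :service "df-data/percent_bytes-used" :metric 42.0}
   {:host "host.example.com" :service "df-root/percent_bytes-free" :metric 95.0}])

(defn alerting-services
  "Services from `events` that would pass the 90% filter."
  [events]
  (mapv :service (filter (partial over-threshold? 90.0) events)))
```

Only the first sample event should pass: the second is under the threshold and the third is a different df metric.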

It’s inside our email alerting that we’re going to add the additional context. Inside our email variable we’re going to redefine how Riemann creates the emails it sends. We do this by adding the :body option to the mailer plugin. We’ve defined that plugin inside our email variable.

; format-body must already be defined when this form is loaded.
(def email (mailer {:from "riemann@example.com"
                    :body (fn [events] (format-body events))}))

The :body option takes a function that receives a single events argument. That argument contains one or more events in a sequence that our function, here format-body, will then parse and format.

Our new format-body function will look pretty similar to the default Riemann email formatting.

(defn format-body
  "Format the email body"
  [events]
  (clojure.string/join "\n\n\n"
        (map
          (fn [event]
            (str
              "Time: " (riemann.common/time-at (:time event)) "\n"
              "Host: " (:host event) "\n"
              "Service: " (:service event) "\n"
              "Metric: " (if (ratio? (:metric event))
                (double (:metric event))
                (:metric event)) "\n"
              "\n"
              "Additional context for host: " (:host event) "\n\n"
              (print-context (search (:host event)))
              "\n\n"))
          events))
)

We take the events argument and loop through the sequence of events inside it to produce a notification. Where the function starts to differ is when we begin to populate our additional insights. The insight is generated by looking up events in the Riemann index. To do this we use two more functions: search and print-context. The search function takes a host, here the host of the current event from the :host field, and returns all of the other events from that host from the index.

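(defn search
  "Search events in the index"
  [host]
  ; Build the query with list so the host argument is substituted;
  ; a fully quoted '(= host host) would leave both as bare symbols.
  (->> (list '= 'host host)
       (riemann.index/search (:index @riemann.config/core)))
)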

The search function uses the riemann.index/search function to query the index. It constructs a query using the host argument and then uses that query to retrieve all matching events for that host. The index it queries is taken from the currently running core. Any matching events in the index will be returned as a sequence of standard Riemann events.
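Conceptually, a host query against the index behaves like filtering a collection of event maps on the :host field. The sketch below imitates that with plain Clojure data so you can try it in a REPL; search-by-host and sample-index are hypothetical names, and this is not how Riemann implements its index.

```clojure
(defn search-by-host
  "Hypothetical stand-in for an index search: keep only the events
  whose :host equals the given host."
  [events host]
  (filter #(= host (:host %)) events))

;; A made-up, in-memory 'index' of three events.
(def sample-index
  [{:host "app2-api" :service "load/load/shortterm" :metric 0.14}
   {:host "app2-api" :service "cpu-0/cpu-wait" :metric 0.0}
   {:host "app1-web" :service "cpu-0/cpu-wait" :metric 0.2}])

(map :service (search-by-host sample-index "app2-api"))
;; => ("load/load/shortterm" "cpu-0/cpu-wait")
```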

We then pass this sequence to the print-context function as an argument. The print-context function iterates through the sequence and prints out a list of services and associated metrics.

(defn print-context
  "Print the event content"
  [events]
  (clojure.string/join "\n"
    (map
      (fn [event]
        (str
          "Service: " (:service event) " with metric: " (round (:metric event))))
    events))
)

The contextual example is a little silly because you probably don’t want all of these services and their metrics, but you could easily select something more elegant. (In the example code we’ve also included a lookup function which uses the other index parsing function: riemann.index/lookup. The lookup function uses a host/service pair to look up specific events inside the index.)

We also run our events through the round function which uses cl-format from clojure.pprint to round any numbers to 2 decimal places.

(defn round
  "Round numbers to 2 decimal places"
  [metric]
  (clojure.pprint/cl-format nil "~,2f" metric)
)
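If you’d like to try the rounding on its own, the following sketch repeats the round function in a plain REPL. The only addition is the explicit require of clojure.pprint, which is an assumption about what your session has already loaded. Note that the result is a string, not a number, which is fine for building an email body.

```clojure
(require 'clojure.pprint)

(defn round
  "Round numbers to 2 decimal places (returns a string)"
  [metric]
  (clojure.pprint/cl-format nil "~,2f" metric))

(round 90.334929260253906) ; => "90.33"
(round 3002.7)             ; => "3002.70"
(string? (round 0.14))     ; => true
```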

Phew! That’s a lot of background. So what actually happens when this alert triggers? In this case you will generate an email much like:

Time: Sun Jun 14 15:22:19 UTC 2015
Host: app2-api
Service: df-root/percent_bytes-used
Metric: 90.33

Additional context for host: app2-api

Service: cpu-0/cpu-system with metric: 0.40
Service: processes-rsyslogd/ps_disk_octets/read with metric: 0.00
Service: processes-collectd/ps_cputime/syst with metric: 3002.70
Service: cpu-0/cpu-wait with metric: 0.00
Service: interface-lo/if_errors/rx with metric: 0.00
Service: swap/swap_io-out with metric: 0.00
Service: interface-docker0/if_errors/rx with metric: 0.00
Service: elasticsearch-productiona/counter-indices.refresh.total with metric: 0.59
Service: interface-eth0/if_octets/tx with metric: 10192.04
Service: processes-collectd/ps_disk_ops/read with metric: 81.07
Service: processes-collectd/ps_data with metric: 621551616.00
Service: processes-rsyslogd/ps_pagefaults/minflt with metric: 0.00
Service: processes/ps_state-paging with metric: 0.00
Service: processes-rsyslogd/ps_count/processes with metric: 1.00
Service: interface-eth0/if_packets/rx with metric: 117.10
Service: interface-lo/if_packets/tx with metric: 0.00
Service: load/load/shortterm with metric: 0.14
. . .

You could easily modify this to only select specific, relevant, events. You could also use any of Riemann’s stream functions or Clojure’s functions to manipulate those events.
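As one possible sketch of that selection, the function below keeps only the events whose service name starts with a prefix we care about. The name relevant-context, the prefixes and the sample events are all hypothetical; in the configuration above you would apply something like this to the result of search before passing it to print-context.

```clojure
(defn relevant-context
  "Hypothetical filter: keep only events whose :service starts with
  one of the given prefixes."
  [events prefixes]
  (filter (fn [event]
            (let [service (str (:service event))]
              (some #(.startsWith service %) prefixes)))
          events))

(def context-events
  [{:service "df-root/percent_bytes-used" :metric 90.33}
   {:service "interface-lo/if_errors/rx" :metric 0.0}
   {:service "load/load/shortterm" :metric 0.14}])

(map :service (relevant-context context-events ["df-" "load/"]))
;; => ("df-root/percent_bytes-used" "load/load/shortterm")
```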

You could also extend this example beyond the index to retrieve external information. For example, you could retrieve further information from the host, construct a graph, or link to an existing Graphite graph or data source. This could even be further extended to take some action on the host itself in addition to the notification. The possibilities are broad and exciting!

P.S. You can find a fully-functioning Riemann configuration for this example here.

Becoming desensitized to alerts because you get so many.

Looking up events in the Riemann index was originally published by James Turnbull at Kartar.Net on June 15, 2015.

Connecting Riemann and Zookeeper

2015-04-21T00:00:00-04:00

One of my pet hates is having to maintain configuration inside monitoring tools. Not only large pieces like host definitions but smaller pieces like service and component definitions. Using a configuration management tool makes this much easier but it still generally requires some convergence to update your monitoring configuration when a host is added or removed or a service changes.

An example might be HAProxy. I have a HAProxy running with multiple back-end nodes. I want to know about issues if the node count drops below a threshold, potentially if it drops at all. With auto-scaling or just adding and subtracting nodes I need to keep this count up-to-date in my monitoring system to ensure I am correctly alerted when something goes wrong and to avoid false positives. I could do that with configuration management and converge the configuration when I deploy, using Puppet’s exported resources for example. But in a dynamic and fast-moving environment I’d really prefer not to wait for any convergence.

(Note: This is a somewhat artificial and very pets v. cattle example. I don’t overly care if individual nodes die because they are disposable and easily replaced. I could apply the same logic to any host or service threshold that I wanted to query.)

Instead I want my monitoring system to be able to look up my threshold in some source of truth about the state of my infrastructure. That source of truth could be something like Apache Zookeeper, Consul, or a configuration management store like PuppetDB.

In this post I’m going to combine Zookeeper and my Riemann monitoring stack. Let’s start with some code to connect to Zookeeper. It makes use of the Zookeeper-clj Clojure client.

(use '[cemerick.pomegranate :only (add-dependencies)])
(add-dependencies :coordinates '[[zookeeper-clj "0.9.1"]]
                  :repositories (merge cemerick.pomegranate.aether/maven-central
                                       {"clojars" "http://clojars.org/repo"}))

(ns zookeep
  "Zookeeper functions"
  (:use clojure.tools.logging)
  (:require [zookeeper :as zk]
            [zookeeper.data :as data]))

(def client (zk/connect "127.0.0.1:2181"))

(defn get_data
  "Gets data from Zookeeper"
  [node]
  (-> (:data (zk/data client node))
      data/to-string
      read-string)
)

The first part of our code loads the zookeeper-clj client. We then define a namespace called zookeep and require the client (as zk) and the Zookeeper client’s data namespace (as data).

We’ve defined a var called client that is a connection to a local Zookeeper server. We could easily specify a remote server instead.

We’ve created a very simple function named get_data that retrieves the contents of a specific Zookeeper node specified by the node argument.
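To see what the conversion steps in get_data do, here is a small sketch that runs them on a raw byte array without needing a Zookeeper connection. The helper name bytes->value is hypothetical, the String constructor stands in for data/to-string, and I’ve swapped read-string for clojure.edn/read-string, which only reads data and never evaluates code. That swap is my assumption about what is safer for values coming from an external store, not something the original function does.

```clojure
(require 'clojure.edn)

(defn bytes->value
  "Hypothetical helper mirroring get_data's conversion: raw bytes,
  then a string, then a Clojure value."
  [^bytes raw]
  (-> raw
      (String. "UTF-8")
      clojure.edn/read-string))

;; A node holding the text 3 comes back as the number 3, which is
;; what lets us compare it against a metric.
(bytes->value (.getBytes "3" "UTF-8")) ; => 3
```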

Let’s now create a riemann.config file to make use of our Zookeeper functions.

(include "/etc/riemann/include/zookeeper.clj")

(let [host "0.0.0.0"]
  (tcp-server {:host host})
  (udp-server {:host host})
  (ws-server  {:host host}))

(def email (mailer {:from "riemann@example.com"}))

(let [index (index)]
  (streams
    (where (and (< metric (zookeep/get_data "/app1/haproxy/nodes"))
                (service "haproxy-backend.web-backend/gauge-active_servers")
                (tagged "app1"))
      (throttle 1 120
        (email "james@example.com"))))
)

In our configuration we’ve included our Zookeeper functions using the include function and bound Riemann to all the interfaces on our host. We’ve also configured the email plug-in to allow us to send emails from Riemann.

Next we’ve defined some streams including a where filter on an event generated from collectd called haproxy-backend.web-backend/gauge-active_servers. This is the active back-end server count from the HAProxy stats output.

Our where filter matches this service, if it is tagged with app1, and if the value of the metric field is less than the value derived from the (zookeep/get_data "/app1/haproxy/nodes") function. This function, zookeep/get_data, takes the node name /app1/haproxy/nodes and looks it up in Zookeeper.

Inside Zookeeper we’ve created this node and populated it with the count of HAProxy back-end nodes running for this specific application. That population of the node or its update would normally take place during deployment.

Now when the metric arrives into Riemann, the lookup is triggered and Riemann compares the value of the metric field with the value from the Zookeeper node. If the metric value is less than the node value then Riemann sends an email out containing the specific event. Now our monitoring system doesn’t need any changes when our HAProxy configuration changes. We hence eliminate the need to wait for our deployment changes to converge in our monitoring environment, which means less risk of missing an alert or of generating a false positive alert.

Connecting Riemann and Zookeeper was originally published by James Turnbull at Kartar.Net on April 21, 2015.

Just Enough Clojure for Riemann

2015-04-12T00:00:00-04:00

TL;DR - This is not a comprehensive guide to Clojure, but it is enough to get you started with Riemann. This is also an excerpt from my forthcoming book - The Art of Monitoring. It'll also be available in the Riemann documentation at some point too.

Riemann is configured using a Clojure-based configuration file. This means your configuration file is actually processed as a Clojure program. So to process events and send alerts and metrics you'll be writing Clojure. Don't panic! You don't need to become a fully fledged Clojure developer to use Riemann. I can teach you what you need to know in order to use Riemann. Additionally, Riemann comes with a lot of helpers and shortcuts that make it easier to write Clojure to do what we need to process our events.

Let's learn a bit more about Clojure and help you get started with Riemann. Clojure is a dynamic programming language that targets the Java Virtual Machine. It's a dialect of Lisp and is largely a functional programming language.

Functional programming is a programming style that focuses on the evaluation of mathematical functions and steers away from changing state and mutable data. It's highly declarative, meaning you build programs from expressions that describe "what" a program should accomplish rather than "how" it accomplishes it.

Note Languages that describe more of the "how" are called imperative languages.

Examples of declarative programming languages include SQL, CSS, regular expressions and configuration management languages like Puppet and Chef. Let's take a simple example.

SELECT user_id FROM users WHERE user_name = 'Alice';

In this SQL query we're asking for the user_id for user_name of Alice from the users table. The statement is asking a declarative "what" question. We don't really care about the "how"; the database engine takes care of those details.

In addition to their declarative nature, functional programming languages try to eliminate all side effects from changing state. In a functional language when you call a function its output value depends only on the inputs to the function. So if you repeatedly call function f with the same value for argument x, f(x), it will produce the same result every time. This makes functional programs very easy to understand, test and predict. Functional programming languages call functions that operate like this "pure" functions.

The best way to get started with Clojure is to understand the basics of its syntax and types. Let's get a crash course now.

Warning This is going to be a very high level and not very nuanced introduction to Clojure. It's designed to give you the knowledge and recognition of various syntax and expressions to allow you to work with Riemann. It is not an article that will teach you how to develop in Clojure.

A brief introduction to Clojure

Let's step through the Clojure basic syntax and types. We'll also show you a tool called REPL that can help you test and build your Clojure snippets. REPL (short for read–eval–print loop) is an interactive programming shell that takes single expressions, evaluates them and returns the results. It's a great way to get to know Clojure.

Note If you're from the Ruby world then REPL is just like irb. In Python, it's like launching the python binary interactively.

We can install REPL via a tool called Leiningen. Leiningen is an automation tool for Clojure that helps you automate the build and management of Clojure projects.

Installing Leiningen

In order to install Leiningen we'll need to have Java installed on the host. The prerequisite Java packages on Ubuntu and Red Hat for Riemann will also be sufficient for Leiningen.

We're going to download a Leiningen binary called lein to install it. Let's download that into a bin directory under our home directory.

$ mkdir -p ~/bin
$ cd ~/bin
$ curl -o lein https://raw.githubusercontent.com/technomancy/leiningen/stable/bin/lein
$ chmod a+x lein
$ export PATH=$PATH:$HOME/bin

Here we've created a new directory called ~/bin and changed into it. We've then used the curl command to download the lein binary and the chmod command to make it executable. Lastly, we've added our ~/bin directory to our path so that we can find the lein binary.

Tip The addition of the ~/bin directory assumes you're in a Bash shell. It's also temporary to your current shell. You'd need to add the path to your .bashrc or the similar setup for your shell.

Next we need to run lein to auto-install its supporting libraries.

$ lein
. . .

This will download Leiningen's supporting Jar file.

Finally, we can run REPL using the lein repl sub-command.

$ lein repl
. . .
user=>

This will download Clojure itself (in the form of its Jar file) and launch our interactive Clojure shell.

Clojure syntax and types

Let's use this interactive shell to look at some of the syntax and functions we've just learnt about. Let's start by opening our shell.

user=>

Now let's try a simple expression.

user=> nil
nil

The nil expression is the simplest value in Clojure. It represents literally nothing.

We can also specify an integer value.

user=> 1
1

Or a string.

user=> "hello Ms Event"
"hello Ms Event"

Or Boolean values.

user=> true
true
user=> false
false

Clojure functions

Whilst interesting these values aren't very exciting on their own. To do some more interesting things we can use Clojure functions. A function is structured like this:

(function argument argument)

Tip If you're used to the Ruby or Python world a function is broadly the equivalent of a method.

Let's look at a function in action by doing something with some values: adding two integers together.

user=> (+ 1 1)
2

In this case we've used the + function and added 1 and 1 together to get 2.

But there's something about this structure that might look familiar to you if you've used other programming languages. Our function looks just like a list. This is because it is! Our expression might add two numbers together but it’s also a list of three items in a valid list data structure.

Note Technically it's an s-expression.

This is a feature of Clojure called homoiconicity, sometimes described as: "code is data, data is code". This concept is inherited from Clojure's parent language: Lisp.

Homoiconicity means that the program's structure is similar to its syntax. In this case Clojure programs are written in the form of lists. Hence you can gain insight into the program's internal workings by reading its code. This also makes metaprogramming really easy because Clojure's source code is a data structure and the language can treat it like one.

Now let's look more closely at the + function. Each function is a symbol. A symbol is a bare string of characters, like + or inc. Symbols have short names and full names. The short name is used to refer to it locally, for example +. The full name, or perhaps more accurately the fully qualified name, gives you a way to refer to the symbol unambiguously from anywhere. The fully qualified name of the + symbol is clojure.core/+, with clojure.core being the fundamental library of the Clojure language. We can refer to + in its fully qualified form here:

user=> (clojure.core/+ 1 1)
2

Symbols refer to other things; generally they point to values. Think about them as a name or identifier that points to a concept: + is the name, "adding" is the concept. When Clojure encounters a symbol it evaluates it by looking up its meaning. If it can't find a meaning it'll generate an error message, for example:

user=> (bob 1 2)
CompilerException java.lang.RuntimeException: Unable to resolve symbol: bob in this context, compiling:(NO_SOURCE_PATH:1:1)

Clojure also has a syntax for stopping that evaluation. This is called quoting and it is achieved by prefixing the expression with a quotation mark: '.

user=> '(+ 1 1)
(+ 1 1)

This returns the expression itself without evaluating it. This is important because often we want to do things, review things, or test things without evaluating.

For example, if we need to determine what type of thing something is in Clojure we can use the type function and quote the function like so:

user=> (type '+)
clojure.lang.Symbol

Here we can see that + is a Clojure language symbol.

Lists

Clojure also has a variety of data structures. Especially useful to us will be collections. Collections are groups of values, for example a list or a map.

Let's start by looking at lists. Lists are core to all Lisp-based languages (Lisp means "LISt Processing"). As we discovered above Clojure programs are essentially lists. So we're going to see a lot of them!

Lists have zero or more elements and are wrapped in parentheses.

user=> '(a b c)
(a b c)

Here we've created a list containing the elements a, b and c. We've quoted it because we don't want it evaluated. If we didn't quote it then evaluation would fail because none of the elements, a, b, etc are defined. Let's see that now.

user=> (a b c)
CompilerException java.lang.RuntimeException: Unable to resolve symbol: a in this context, compiling:(NO_SOURCE_PATH:1:1)

We can do a few neat things with lists, for example add an element using the conj function.

user=> (conj '(a b c) 'd)
(d a b c)

You can see we've added a new element, d, to the front of the list. Why the front? Because a list is really a linked list and focusses on providing immediate access to the first value in the list. Lists are most useful for small collections of elements and when you need to read elements in a linear fashion.

We can also return values from a list using a variety of functions.

user=> (first '(a b c))
a
user=> (second '(a b c))
b
user=> (nth '(a b c) 2)
c

Here we've pulled out the first element, second element, and using the nth function, the third element.

This last, nth, function shows us a multi-argument function. The first argument is the list, '(a b c), and the second argument is the index value of the element we want to return, here 2.

Tip Like most programming languages Clojure starts counting from 0.

We can also create a list with the list function.

user=> (list 1 2 3)
(1 2 3)
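Now that we've seen quoting and lists, we can sketch the "code is data" idea from earlier. This is only a small illustration: it inspects a quoted expression as an ordinary list and then uses the eval function, which we haven't otherwise covered, to run a list as code.

```clojure
;; A quoted expression is an ordinary list that we can inspect.
(first '(+ 1 1))     ; => +
(count '(+ 1 1))     ; => 3

;; Build the same expression as data, then evaluate it as code.
(eval (list '+ 1 1)) ; => 2
```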

Vectors

Another collection available to us is the vector. Vectors are like lists but they are optimized for random access to the elements by index. Vectors are created by adding zero or more elements inside square brackets.

Tip Most of the time, given the choice between a list and a vector, you should use a vector for data access. It's generally faster.

user=> '[a b c]
[a b c]

Like lists, we can again use conj to add to a vector.

user=> (conj '[a b c] 'd)
[a b c d]

You'll note the d element is added at the end because a vector isn't focussed on sequential access like a list.

There are some other useful functions we can use on lists and vectors, for example to get the last element in a list or vector.

user=> (last '[a b c d])
d

Or count the elements.

user=> (count '[a b c d])
4

Because vectors are designed to look up elements by index, we can also use them directly as functions, for example:

user=> ([1 2 3] 1)
2

Here we've retrieved the value, 2, at index 1.

We can create a vector with the vector function, or convert an existing structure, like a list, into a vector with the vec function.

user=> (vector 1 2 3)
[1 2 3]
user=> (vec (list 1 2 3))
[1 2 3]

Sets

There's a final collection related to lists and vectors called a set. Sets are unordered collections of values, prefixed with # and wrapped in curly braces, { }. They are most useful for collections of values where you want to check a value or values is present.

user=> '#{a b c}
#{a c b}

You'll notice the set was returned in a different order. This is because sets are focussed on presence lookups so order doesn't matter quite so much.

Like lists and vectors we can use the conj function to add an element to a set.

user=> (conj '#{a b c} 'd)
#{a c b d}

Sets can never contain an element more than once, so adding an element which is already present does nothing. You can remove elements with the disj function.

user=> (disj '#{a b c d} 'd)
#{a c b}

The most common operation with a set is to check for the presence of a specific value, for this we use the contains? function.

user=> (contains? '#{a b c} 'c)
true
user=> (contains? '#{a b c} 'd)
false

Like a vector, you can also use the set itself as a function. This returns the value if it is present or nil if it is not.

user=> ('#{a b c} 'c)
c
user=> ('#{a b c} 'd)
nil

You can make a set out of any other collection with the set function.

user=> (set '[a b c])
#{a c b}

Here we've made a set out of a vector.
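Because a set never holds the same value twice, converting a collection to a set is also a handy way to remove duplicates. Here's a small sketch using some made-up host names:

```clojure
;; The duplicate "app1" is dropped when we build the set.
(count (set ["app1" "app2" "app1" "app3"])) ; => 3

;; And we can still check for presence as before.
(contains? (set ["app1" "app2" "app1"]) "app2") ; => true
(contains? (set ["app1" "app2" "app1"]) "app9") ; => false
```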

Maps

The last data structure we're going to look at is the map. Maps are key/value pairs enclosed in braces. You can think about them as being equivalent to a hash.

user=> {:a 1 :b 2}
{:b 2, :a 1}

Here we've defined a map with two key/value pairs: :a 1 and :b 2.

You'll note each key is prefixed with a :. This denotes another type of Clojure syntax: the keyword. A keyword is much like a symbol but instead of referencing another value it is merely a name or label. It's highly useful in data structures like maps for lookups: you look up the keyword and get back the value.

We can use the get function to retrieve a value.

user=> (get {:a 1 :b 2} :a)
1

Here we've specified the keyword :a and asked Clojure if it is inside our map. It's returned the value in the key/value pair, 1.

If the key doesn't exist in the map then Clojure returns nil.

user=> (get {:a 1 :b 2} :c)
nil

The get function can also take a default value to return instead of nil, if the key doesn’t exist in that map.

user=> (get {:a 1 :b 2} :c :novalue)
:novalue

We can also use the map itself as a function.

user=> ({:a 1 :b 2} :a)
1

We can also use keywords as functions to look themselves up in a map.

user=> (:a {:a 1 :b 2})
1
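This keyword-as-function style is worth getting comfortable with because Riemann configurations use it constantly on events, for example (:host event). The sketch below uses simplified, made-up events written as map literals; real events carry more fields.

```clojure
;; Pull a field out of an event-like map.
(:host {:host "app2-api" :service "df-root/percent_bytes-used" :metric 90.33})
;; => "app2-api"

;; A missing field returns nil...
(:state {:host "app2-api" :metric 90.33}) ; => nil

;; ...unless we supply a default, with either get or the keyword itself.
(get {:host "app2-api" :metric 90.33} :state "unknown") ; => "unknown"
(:state {:host "app2-api" :metric 90.33} "unknown")     ; => "unknown"
```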

To add a key/value pair to a map we use the assoc function.

user=> (assoc {:a 1 :b 2} :c 3)
{:c 3, :b 2, :a 1}

If a key isn't present then assoc adds it. If the key is present then assoc replaces the value.

user=> (assoc {:a 1 :b 2} :b 3)
{:b 3, :a 1}

To remove a key we use the dissoc function.

user=> (dissoc {:a 1 :b 2} :b)
{:a 1}

Note If you've come from the Ruby or Python world the terms list, set, vector and map might be a little new. But the syntax probably looks familiar. You can think about lists, vectors and sets as being very similar to arrays and maps being hashes.

Strings

We can also work with strings. Clojure lets you turn pretty much any value into a string using the str function.

user=> (str "holiday")
"holiday"

The str function turns anything specified into a string. We can also use it to concatenate strings.

user=> (str "james needs " 2 " holidays")
"james needs 2 holidays"

Creating our own functions

Up until now we've run functions as stand-alone expressions, for example here's the inc function which increments arguments passed to it:

user=> (inc 1)
2

This isn't overly practical, except to demonstrate how a function works. If we want to do more with Clojure we need to be able to define our own functions. To do this Clojure provides a function called fn. Let us construct our first function.

user=> (fn [a] (+ a 1))

So what's going on here? We've used the fn function to create a new function. The fn function takes a vector as an argument. This vector contains any arguments being passed to our function. Then we specify the actual action our function is going to perform. In our case we're mimicking the behavior of the inc function. The function will take the value of a and add 1 to it.

If we run this code now it only returns a function object; nothing is calculated because we haven't called the function or given a a value. Let's call our function now.

user=> ((fn [a] (+ a 1)) 2)
3

Here we've evaluated our function and passed in an argument of 2. This is assigned to our a symbol and passed to the function. The function adds a, now set to 2, and 1 and returns the resulting value: 3.

There's also a shorthand for writing functions that we'll see occasionally in Riemann configurations.

user=> #(+ % 1)

This shorthand function is the equivalent of (fn [x] (+ x 1)) and we can call it to see the result.

user=> (#(+ % 1) 2)
3
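The shorthand is most often seen handed to another function. As a small sketch, here it is used with map and filter, two core functions we haven't covered: map applies a function to every element of a collection and filter keeps the elements for which the function returns true. When the shorthand needs more than one argument we use %1, %2 and so on.

```clojure
;; Apply the shorthand function to each element.
(map #(+ % 1) [1 2 3])               ; => (2 3 4)

;; Keep only the values over 90.
(filter #(> % 90) [42.0 90.33 95.1]) ; => (90.33 95.1)

;; Two arguments: %1 and %2.
(#(* %1 %2) 3 4)                     ; => 12
```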

Creating variables

But we're still a step from a named function and we're missing an important piece: how do we define our own variables to hold values? Clojure has a function called def that allows us to do this.

user=> (def smoker "joker")
#'user/smoker

The def function does two things:

It creates a new type of object called a var. Vars, like symbols, are references to other values. You can see our new var #'user/smoker returned as output of the def function.
It binds a symbol to that var, here the symbol smoker is bound to a var with a value of the string "joker".

When we evaluate a symbol pointing to a var it is replaced by the var's value. But because def also creates a symbol we can refer to our var like that too.

user=> user/smoker
"joker"
user=> smoker
"joker"

Where did this user/ come from? It's a Clojure namespace. Namespaces are a way Clojure organizes code and program structure. In this case the REPL creates a namespace called user by default. Remember we learnt earlier that a symbol has a short name, for example smoker, that can be used locally to refer to it, and a full name. That full name, here user/smoker, would be used to refer to this symbol from another namespace.

We'll talk more about namespaces and use them to organize our Riemann configuration in the HOWTO. If you'd like to read more about them then there is an excellent explanation at http://www.braveclojure.com/organization/.

We can also use the type function to see the type of value the symbol references.

user=> (type smoker)
java.lang.String

Here we can see that the value smoker resolves to is a string.
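With def available we can also see something we glossed over when we looked at maps: functions like assoc never change the value they are given, they return a new one. A short sketch, using made-up names and values:

```clojure
(def event {:host "app2-api" :metric 90.33})

;; assoc returns a new map with the extra key...
(def critical-event (assoc event :state "critical"))

critical-event ; => {:host "app2-api", :metric 90.33, :state "critical"}

;; ...and the original map is unchanged.
event          ; => {:host "app2-api", :metric 90.33}
```

This is the "steering away from mutable data" we talked about at the start in action.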

Creating named functions

Now with the combination of def and fn we can create our own named functions.

user=> (def grow (fn [number] (* number 2)))
#'user/grow

Firstly, we've defined a var (and symbol) called grow. Inside that we've defined a function. Our function takes a single argument, number, and passes that number to the * function, the mathematical multiplication operator in Clojure, and multiplies it by 2.

Let's call our function now.

user=> (grow 10)
20

Here we've called the grow function and passed it a value of 10. The grow function multiplies that value and returns the result: 20. Pretty awesome eh?

But the syntax is a little cumbersome. Thankfully Clojure offers a shortcut to creating a var and binding it to a function called defn. Let's rewrite our function using this form.

user=> (defn grow [number] (* number 2))
#'user/grow

That's a little neater and easier to read. Now how about we add a second argument? Let's make both the number to be multiplied and the multiplier arguments.

user=> (defn grow [number multiple] (* number multiple))
#'user/grow

Let's call our grow function again.

user=> (grow 10)
ArityException Wrong number of args (1) passed to: user/grow  clojure.lang.AFn.throwArity (AFn.java:429)

Oops, not enough arguments. Let's add the second argument.

user=> (grow 10 4)
40
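If we wanted both forms to work, Clojure lets a single function accept different numbers of arguments. As a sketch, here is a hypothetical variant called grow-default (named differently so it doesn't replace our grow function) that doubles the number when no multiplier is given:

```clojure
(defn grow-default
  ;; One argument: fall back to a multiplier of 2.
  ([number] (grow-default number 2))
  ;; Two arguments: multiply as before.
  ([number multiple] (* number multiple)))

(grow-default 10)   ; => 20
(grow-default 10 4) ; => 40
```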

We can also add a doc string to our function to help us articulate what it does.

(defn grow
  "Multiplies numbers - can specify the number and multiplier"
  [number multiple]
  (* number multiple)
)

We can access a function's doc string using the doc function.

user=> (doc grow)
-------------------------
user/grow
([number multiple])
  Multiplies numbers - can specify the number and multiplier
nil

The doc function tells us the full name of the function, the arguments it accepts, and returns the docstring.

That's the end of our crash course.

Learning more Clojure

I recommend trying to get an understanding of the basics of Clojure to get the most out of Riemann. If you'd like to start to learn a bit about Clojure then Kyle Kingsbury's excellent Clojure from the ground up series is a great place to start. This section is very much an abbreviated crash-course of sections of that tutorial and I can't thank Kyle enough for writing it. A reading of this tutorial will add significantly to the knowledge we've shared here. I recommend at least a solid reading of the first three posts in the series:

The Welcome post.
The post on Basic types.
The post on Functions.

Tip Another resource if you're interested in learning a bit more about the basics of Clojure is http://learn-clojure.com/.

Just Enough Clojure for Riemann was originally published by James Turnbull at Kartar.Net on April 12, 2015.

Custom emails with Riemann

2015-03-27T00:00:00-04:00

I’ve recently started alerting on expired events from Riemann via email. The default email alert looks something like this:

It contains some useful information but it is pretty basic: the subject is the name of the alerted service and the body contains a basic printout of the event’s fields.

I decided I’d like to build some alternative emails and so I went digging into the mailer plug-in code to find out how.

You would normally configure the mailer plug-in something like this:

(def email (mailer {:from "riemann@example.com"}))

This defines a new function called email that passes events to the mailer plug-in. We’ve configured a single option for the plug-in: :from which controls the source address for emails, here riemann@example.com.

If we want to update the subject or body of the email we can pass in the :subject and :body options. Each option takes a function that receives a collection of events and returns a formatted string, for example the default subject is set by a function like:

(def email (mailer {:from "riemann@example.com"
                    :subject (fn [events]
                     (clojure.string/join ", "
                       (map :service events)))}))

The :subject option takes a function with an argument of events, which is the collection of incoming events. The map function extracts the value of the :service field from each event. If there is more than one event, the join function combines the services into a comma-separated list, and the result is written as a string in the subject line of our email. Hence riemanna riemann server tcp 0.0.0.0:5555... as the subject in our example email above.
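You can try the same map and join combination in a REPL without Riemann. In this sketch the function name subject-line and the sample events are made up; only the join and map calls come from the configuration above.

```clojure
(require '[clojure.string :as string])

(defn subject-line
  "Hypothetical helper: join the :service of each event with commas."
  [events]
  (string/join ", " (map :service events)))

(subject-line [{:service "cpu-0/cpu-system"}
               {:service "load/load/shortterm"}])
;; => "cpu-0/cpu-system, load/load/shortterm"

(subject-line [{:service "df-root/percent_bytes-used"}])
;; => "df-root/percent_bytes-used"
```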

If instead I wanted to build a custom email subject, let’s say to notify me when a specific host is down, I could add the :subject option to my mailer function:

(def host_email (mailer {:from "riemann@example.com"
                        :subject (fn [events]
                          (str "Host " (:host (first events)) " is down"))}))

Here we’ve given the :subject option a function that receives our events collection. It builds the string “Host … is down”, replacing the ... with the hostname taken from the :host field of the first event in the collection.
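If Clojure is still new to you, it can help to see the same two subject functions written in Python over a list of event dictionaries. This is only an analogue for illustration: the names default_subject and host_down_subject and the sample events are made up, and Riemann itself only runs the Clojure versions.

```python
def default_subject(events):
    """Join the service field of every event with ", " (like the default subject)."""
    return ", ".join(e.get("service", "") for e in events)


def host_down_subject(events):
    """Build "Host <host> is down" from the first event in the collection."""
    return "Host " + str(events[0].get("host")) + " is down"


# Hypothetical sample events, standing in for what the mailer would receive.
events = [
    {"host": "web1.example.com", "service": "nginx health"},
    {"host": "web1.example.com", "service": "nginx active"},
]

print(default_subject(events))
print(host_down_subject(events))
```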

We can then trigger these alerts with something like:

(expired
  (by [:host]
    (host_email "james@example.com")))

Here we’re filtering on all expired events and splitting the stream by the :host field using the by function. This creates a new stream for each host. We then call the host_email function to send the email.
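The idea behind by, one independent child stream per distinct field value, can be sketched in a few lines of Python. This is a rough model rather than Riemann's implementation: split_by and make_counter are hypothetical names, and the child here just counts events so we can see that each host gets its own state.

```python
def split_by(field, make_child):
    """Route each event to a child handler created once per distinct field value."""
    children = {}

    def handler(event):
        key = event.get(field)
        if key not in children:
            children[key] = make_child()   # first event for this value: new child
        return children[key](event)

    return handler


def make_counter():
    """A child that remembers how many events it has seen."""
    seen = []

    def child(event):
        seen.append(event)
        return len(seen)

    return child


stream = split_by("host", make_counter)
print(stream({"host": "a.example.com", "state": "expired"}))  # first event for host a
print(stream({"host": "b.example.com", "state": "expired"}))  # first event for host b
print(stream({"host": "a.example.com", "state": "expired"}))  # second event for host a
```

Because each host has its own child, anything stateful placed below by (rate limits, rollups and so on) is tracked per host rather than globally.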

The resulting email would look like:

We could do similar things to modify the body of the email using the :body option.

P.S. I am slowly teaching myself Clojure. I’ve thus far found the Try Clojure site, the Learn Clojure site and Clojure from the ground up to be the most useful for this.

Custom emails with Riemann was originally published by James Turnbull at Kartar.Net on March 27, 2015.

Treat GitHub Wiki like a repository

2015-02-27T00:00:00-05:00

I recently needed to export all the articles from a GitHub wiki. I had thought I’d need to scrape it but I discovered that each GitHub wiki is in fact a Git repo.

If you need a copy of the content you can just clone it via Git.

$ git clone git@github.com:username/repo_name.wiki.git
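If you have a lot of wikis to export, a tiny helper can derive the wiki clone URL from the repository's clone URL. This is a minimal sketch: wiki_clone_url is a hypothetical name, and it only relies on the .wiki.git naming shown in the command above.

```python
def wiki_clone_url(repo_url):
    """Turn a repository clone URL into the clone URL of its wiki."""
    if repo_url.endswith(".git"):
        repo_url = repo_url[:-len(".git")]
    return repo_url + ".wiki.git"


print(wiki_clone_url("git@github.com:username/repo_name.git"))
```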

That’s neat and I hope it’s useful to someone else.

Treat GitHub Wiki like a repository was originally published by James Turnbull at Kartar.Net on February 27, 2015.

The Art of Monitoring

2015-02-02T00:00:00-05:00

TL;DR - I am writing a book about monitoring and you can sign up for updates here.

Let’s begin with an origin story. Once upon a time(-series) there was a sysadmin. She managed infrastructure that lived in a data center. Every time a new host was added to that environment she installed some software and set up some checks. Every now and again one of those servers would break and a check would trigger. An alert would be sent and she would wake up and run rm -fr /var/log/*.log to fix it.

For many years this approach worked just fine. Oh there were some dramas: sometimes things would go wrong for which there wasn’t a check, or there just wasn’t time to action some alerts, or some applications and services on top of those hosts weren’t monitored. But things were mostly fine.

Then things started to change in the IT industry. Virtualization was introduced and a lot more hosts appeared. Many of those hosts were run by people who weren’t sysadmins or were even outsourced to third-parties. Then some of the hosts in her data center were moved into the Cloud or replaced with Software-as-a-Service applications.

Most importantly, applications and services that were previously merely seen as technology now became critical to selling to customers and providing high quality customer service. Suddenly IT wasn’t a cost centre but rather something the company’s revenue relied on.

As a result aspects of monitoring began to break down. It became hard to keep track of hosts (there were a lot more of them!), applications and infrastructure became more complex, and expectations around availability and quality became more aggressive. It became harder and harder to check for all the possible things that could go wrong using the current system. More and more alerts piled up. More hosts and services meant more demand on monitoring systems, most of which were only able to vertically scale. Faults and outages became harder to find and slower to detect under these loads.

Additionally, the organization began demanding more and more data to both demonstrate the quality of the service they were delivering to customers and to justify the increasing spend on IT services. Many of these demands were made for data that existing monitoring simply wasn’t measuring or couldn’t generate. The monitoring system became a tangled mess.

This is monitoring right now for many people in the industry. But it doesn’t have to be like that. You can build a better solution that addresses the change in the way IT works and that scales for the future.

Welcome to The Art of Monitoring.

This is a hands-on book that teaches you how to build a modern, scalable monitoring environment using up-to-date tools and techniques.

We include lessons for both sysadmins and developers. We’ll show developers how they can better enable monitoring and metrics and we’ll show sysadmins how to take advantage of that data to do better fault detection and get insights into performance.

We try to address the change in IT environments with virtualization, containerization and the Cloud. We help you provide a monitoring environment that helps you and your customers manage IT better.

The book will contain:

Chapter 1: An Introduction to Monitoring
Chapter 2: Building a metrics-centric monitoring environment
Chapter 3: Metrics, metrics and measurement
Chapter 4: Building a service-centric and dynamic fault detection system
Chapter 5: Alerting
Chapter 6: Trending
Chapter 8: Visualization
Chapter 9: Anomaly Detection for fun and profit

(Likely to change…)

In the book we look at a variety of open source tools, including:

The book will be published late in 2015.

You can find more information on the book and its status here and you can sign up for updates here.

The Art of Monitoring was originally published by James Turnbull at Kartar.Net on February 02, 2015.

Riemann Sample Configurations

2015-02-01T00:00:00-05:00

One of the challenges of getting to know Riemann is that its configuration is in Clojure. Your Riemann configuration is actually a Clojure program that executes when Riemann is running. For some folks this is a very new language and sometimes a new approach.

To help with this process I’m keen on collecting a bunch of sample Riemann configurations from people who have already “been there and done this”. There are a few already online - The Guardian have theirs up for example - but I’d love to have more.

I’ve created a repository to hold them and it’d be great if folks would create a pull request and add theirs. I’d also be happy to manually add configurations via gist, pastie, email or any other way you’d like to get them to me.

Riemann Sample Configurations was originally published by James Turnbull at Kartar.Net on February 01, 2015.

Using Riemann for Metrics

2015-01-19T00:00:00-05:00

In my first post I introduced you to Riemann and my second post discussed Riemann for fault detection. In those posts we’ve discovered that Riemann aggregates events from distributed hosts and services. One of the cool outcomes of this aggregation is the ability to generate metrics from the events. We can then use a tool like Graphite to store the metric data and render graphs from it. In this post you’ll see how to:

Install Graphite.
Generate metrics.
Integrate Riemann with Graphite.

Installing Graphite

The first step we’re going to take is to install Graphite. Graphite is an engine that stores time-series data and then renders graphs from that data.

On an Ubuntu 14.04 or later host Graphite is available from APT packages. It’s made up of three components:

A web interface.
A storage engine called Carbon.
A database library called Whisper.

Graphite’s web interface also relies on a database backend. The default database is SQLite3 but you can specify PostgreSQL or MySQL/MariaDB if you wish (and I recommend one of these for a production environment - they are both far more robust than the default). We’re going to stick with the default right now as we’re just testing.

Installing Packages

Let’s install the packages we need.

$ sudo apt-get update
$ sudo apt-get -y install graphite-web graphite-carbon apache2 libapache2-mod-wsgi

We’ve first updated our APT package cache and then we’ve installed the graphite-web and graphite-carbon packages. The graphite-web package contains Graphite’s web interface and the graphite-carbon package contains the Carbon storage engine. We’ve also installed Apache to run the Graphite web interface.

You’ll be prompted during installation as to whether your graph database should be removed if you uninstall Graphite. Answer “No” to ensure your graph data is preserved.

Configuring Graphite

Next we need to configure Graphite. First we edit the /etc/graphite/local_settings.py configuration file.

$ sudo vi /etc/graphite/local_settings.py

We need to change two items in this file. The first, SECRET_KEY, is used to salt hashes for Graphite’s authentication and the second, TIME_ZONE, controls the time zone. The latter is important if you want your metrics to have the right time and date.

We want to uncomment SECRET_KEY and set it to a long random string. Let’s generate a string now.

$ cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 256 | head -1
SyN1cmnVFCOvHhKJ4Jxrfc5osJx5HNmOc60LVEFahYM0dusIYmCRndd2mFEfHi6WAf9Sv8xBksmsmdQSh6PcoBKhA0MeX6DMNszKZEyGTBpx3kU5AArbcAtoeyTHz6ROk25DSKmjw7MlbmVVuM5Nbf5ewCIl6OVN3iXDhPLX0wvkE7nKJHKDcqelIOR0EyXDoa25Z88W374TXVNSucpxlyLDXWhHP6XShXCza4EQKCu6GePvFLHl1pjpYrb4sv7J

Now let’s add this random string to SECRET_KEY and uncomment and update our TIME_ZONE setting inside /etc/graphite/local_settings.py.

SECRET_KEY='SyN1cmnVFCOvHhKJ4Jxrfc5osJx5HNmOc60LVEFahYM0dusIYmCRndd2mFEfHi6WAf9Sv8xBksmsmdQSh6PcoBKhA0MeX6DMNszKZEyGTBpx3kU5AArbcAtoeyTHz6ROk25DSKmjw7MlbmVVuM5Nbf5ewCIl6OVN3iXDhPLX0wvkE7nKJHKDcqelIOR0EyXDoa25Z88W374TXVNSucpxlyLDXWhHP6XShXCza4EQKCu6GePvFLHl1pjpYrb4sv7J'
TIME_ZONE = 'America/New_York'
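If you would rather not pipe /dev/urandom through the shell, the same kind of key can be generated with Python's standard library. This is a sketch that assumes a Python 3.6 or later interpreter is available somewhere (the secrets module does not exist in older versions); generate_secret_key is a hypothetical helper name.

```python
import secrets
import string

# Same character set as the tr command above: letters and digits only.
ALPHABET = string.ascii_letters + string.digits


def generate_secret_key(length=256):
    """Return a random alphanumeric string suitable for pasting into SECRET_KEY."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


print(generate_secret_key())
```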

Later in the same file you’ll find a dictionary of database settings.

DATABASES = {
  'default': {
    'NAME': '/var/lib/graphite/graphite.db',
    'ENGINE': 'django.db.backends.sqlite3',
    'USER': '',
    'PASSWORD': '',
    'HOST': '',
    'PORT': ''
  }
}

For the default SQLite3 database you won’t need to change anything, but this is where you’d update the settings if you wanted to use PostgreSQL or MySQL. In the default configuration you’ll find your data stored in /var/lib/graphite/graphite.db.
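For comparison, a PostgreSQL version of the same block might look something like the sketch below. Every value here is a placeholder I have made up (database name, user, password, host and port), and the ENGINE string is an assumption that depends on your Django version: older releases use django.db.backends.postgresql_psycopg2, newer ones django.db.backends.postgresql.

```python
# Hypothetical PostgreSQL settings for local_settings.py.
# All values are placeholders; adjust them to match your own database.
DATABASES = {
    'default': {
        'NAME': 'graphite',            # placeholder database name
        'ENGINE': 'django.db.backends.postgresql_psycopg2',  # assumption: older Django
        'USER': 'graphite',            # placeholder user
        'PASSWORD': 'change-me',       # placeholder password
        'HOST': '127.0.0.1',
        'PORT': '5432'
    }
}
```

You would also need the matching database driver installed and the database and user created before running syncdb.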

Prepping our database

Next we need to prep our initial database using the syncdb option of the graphite-manage command. This populates our database with the required initial tables and structure.

$ sudo graphite-manage syncdb
Creating tables ...
Creating table account_profile
Creating table account_variable
Creating table account_view
Creating table account_window
Creating table account_mygraph
Creating table dashboard_dashboard_owners
Creating table dashboard_dashboard
Creating table events_event
Creating table auth_permission
Creating table auth_group_permissions
Creating table auth_group
Creating table auth_user_groups
Creating table auth_user_user_permissions
Creating table auth_user
Creating table django_session
Creating table django_admin_log
Creating table django_content_type
Creating table tagging_tag
Creating table tagging_taggeditem

You just installed Django's auth system, which means you don't have any superusers defined.
Would you like to create one now? (yes/no): yes
Username (leave blank to use 'root'):
Email address: james@example.com
Password:
Password (again):
Superuser created successfully.
Installing custom SQL ...
Installing indexes ...
Installed 0 object(s) from 0 fixture(s)

We also define a super-user to use with our database. I specify the default root, an email address and then a secure password.

Configuring Carbon

Next I want to tweak Carbon’s density of metric retention, essentially how long metrics should be stored and how detailed those metrics should be. This is configured in the /etc/carbon/storage-schemas.conf file. Let’s look at this file now.

# Schema definitions for Whisper files. Entries are scanned in order,
# and first match wins. This file is scanned for changes every 60 seconds.
#
#  [name]
#  pattern = regex
#  retentions = timePerPoint:timeToStore, timePerPoint:timeToStore, ...

# Carbon's internal metrics. This entry should match what is specified in
# CARBON_METRIC_PREFIX and CARBON_METRIC_INTERVAL settings
[carbon]
pattern = ^carbon\.
retentions = 60:90d

[default_1min_for_1day]
pattern = .*
retentions = 60s:1d

Each schema entry matches specific metrics by name and specifies one or more retention periods. The first entry, [carbon], manages Carbon’s own metrics. A regular expression pattern is matched to find these, here any metric starting with carbon. The retentions are then set with the retentions entry. You can specify one or more retentions in the form of:

sample_time:retention_period

For the Carbon metrics a data point is created every 60 seconds and kept for 90 days: 60:90d. This means each data point represents 60 seconds and we want to keep enough data points for 90 days of data.

All other metrics use the default_1min_for_1day schema, whose pattern, .*, matches every metric. In this schema, Graphite creates data points every 60 seconds and keeps enough data to represent 1 day. That’s a pretty low resolution by most standards and Riemann processes events much more quickly. So we’re going to create a new schema and comment out the old one.

#[default_1min_for_1day]
#pattern = .*
#retentions = 60s:1d

[default]
pattern = .*
retentions = 10s:1h, 1m:7d, 15m:30d, 1h:2y

This new schema includes multiple retentions. Multiple retentions allow graceful downsampling of historical data, saving you disk and performance. Our first retention, 10s:1h creates data points every 10 seconds and keeps enough data for 1 hour and then our next retention, 1m:7d, retains 1 minute data points for 7 days and so on.
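To get a feel for what each retention costs, we can work out how many data points it stores. The following is a small sketch, not Carbon's own parser: parse_retention and to_seconds are hypothetical names, a year is taken as 365 days, and a bare number such as the 60 in 60:90d is read as seconds.

```python
# Seconds in each unit suffix (assumption: a year is 365 days).
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def to_seconds(text):
    """Convert '10s', '1h' or '2y' to seconds; a bare number is read as seconds."""
    if text.isdigit():
        return int(text)
    return int(text[:-1]) * UNIT_SECONDS[text[-1]]


def parse_retention(definition):
    """Return (seconds_per_point, number_of_points) for 'precision:duration'."""
    precision, duration = definition.strip().split(":")
    step = to_seconds(precision)
    return step, to_seconds(duration) // step


for part in "10s:1h, 1m:7d, 15m:30d, 1h:2y".split(","):
    step, points = parse_retention(part)
    print(part.strip(), "->", points, "points, one every", step, "seconds")
```

For our new schema this works out to 360, 10080, 2880 and 17520 points respectively, so the high-resolution hour is by far the cheapest archive to keep.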

To do the downsample from 10s:1h to 1m:7d Graphite gathers all of the data from the past minute (this should be six data points, one generated every 10 seconds). It then averages the data points to aggregate them and retains this new data point for 7 days. By default, each retention averages the total as it downsamples so you can determine metrics totals by reversing the average.
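Here is the arithmetic of that downsample as a short sketch. The sample values are made up and downsample_average is a hypothetical name; it simply averages each group of six 10 second points into one 1 minute point, and then shows how multiplying back by the group size recovers the total.

```python
def downsample_average(points, factor):
    """Average each consecutive group of `factor` points into a single point."""
    groups = [points[i:i + factor] for i in range(0, len(points), factor)]
    return [sum(group) / len(group) for group in groups]


# One minute of hypothetical 10 second samples.
ten_second_points = [12, 15, 11, 14, 13, 13]

one_minute_point = downsample_average(ten_second_points, 6)[0]
print(one_minute_point)       # the averaged 1 minute data point
print(one_minute_point * 6)   # reversing the average gives back the total
```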

You can also configure Graphite to use alternate methods to aggregate the data points including min, max, sum and last. This is done by configuring a /etc/carbon/storage-aggregation.conf file. There’s a sample file in /usr/share/doc/graphite-carbon/examples/storage-aggregation.conf.example. We’re not going to do that right now but there’s an annoyingly frequent log message that appears in your Carbon logs, /var/log/carbon/console.log:

/etc/carbon/storage-aggregation.conf not found, ignoring.

Creating an empty /etc/carbon/storage-aggregation.conf file stops the message so let’s do that now.

$ sudo touch /etc/carbon/storage-aggregation.conf

You can see a lot more about how Carbon is configured here.

Run Carbon at startup

Now let’s configure Carbon to run by default by editing the /etc/default/graphite-carbon file.

$ sudo vi /etc/default/graphite-carbon

Change the value of CARBON_CACHE_ENABLED=false to CARBON_CACHE_ENABLED=true.

Installing Graphite’s web interface

As our last setup step we’re going to install Graphite’s web interface. To do this we’re going to install it as Apache’s default website. First, disable the existing default site.

$ sudo a2dissite 000-default

Now copy in Graphite’s Apache configuration.

$ sudo cp /usr/share/graphite-web/apache2-graphite.conf /etc/apache2/sites-available

And enable it.

$ sudo a2ensite apache2-graphite

And we’re done.

Starting Carbon and Graphite

Finally, let’s start or reload the required services.

First Carbon.

$ sudo service carbon-cache start

And then Apache.

$ sudo service apache2 reload

You can then view the Graphite web interface in your browser.

Configuring Riemann for Graphite

Riemann uses a Clojure-based configuration file to specify how events are processed and handled. On an Ubuntu host we can find that file at /etc/riemann/riemann.config. We’re going to add a Graphite output to the configuration we used in the last posts on Riemann. Let’s look at an updated configuration now.

; -*- mode: clojure; -*-
; vim: filetype=clojure

(logging/init {:file "/var/log/riemann/riemann.log"})

; Listen on all interfaces over TCP (5555), UDP (5555), and websockets
; (5556)
(let [host "0.0.0.0"]
  (tcp-server {:host host})
  (udp-server {:host host})
  (ws-server  {:host host}))

; Expire old events from the index every 10 seconds.
(periodically-expire 10 {:keep-keys [:host :service :tags :metric]})

(def graph (graphite {:host "localhost"}))

(let [index (index)]
; Inbound events will be passed to these streams:
  (streams
    (default :ttl 60
      ; Index all events immediately.
      index

      ; graph all
      graph)))

You can see we’ve added a function called graph.

(def graph (graphite {:host "localhost"}))

This defines a connection to our local Graphite server, here localhost. You could also specify the name of a remote Graphite server and you can use either TCP or UDP to send events.

Inside your streams block we can then use the graph function to send events through to Graphite. In our current configuration we’re graphing everything. This means every event sent to Riemann will get passed to Graphite and turned into a graph.

Alternatively, if you don’t want to send everything to Graphite we can be more selective, for example we could only select metrics from specific services.

(streams
  (where (service "heartbeat")
    graph))

Here we’re only sending events from the heartbeat service through to Graphite.

Now let’s send some metrics through to Graphite.

Sending metrics to Riemann and Graphite

For our metrics we’re going to choose some Nginx metrics. We’ve got a host running Nginx and are going to use the riemann-nginx-status command provided by the riemann-tools gem to send the metrics.

$ sudo gem install riemann-tools

The riemann-nginx-status command assumes the presence of an Nginx status page at http://localhost:8080/nginx_status. You can configure a page like that in your Nginx configuration. You can also override the default location with the --uri option.

location /nginx_status {
  stub_status on;
  access_log   off;
  allow 127.0.0.1;
  deny all;
}
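A status page like this returns a few lines of plain text, and it is those numbers that end up as metrics. To make that concrete, here is a sketch that parses the stub_status output into a dictionary. The sample text uses made-up numbers, parse_stub_status is a hypothetical helper, and this is not how riemann-nginx-status itself is implemented.

```python
import re

# Hypothetical sample of what the status page returns.
SAMPLE = """Active connections: 3
server accepts handled requests
 1045 1045 2380
Reading: 0 Writing: 1 Waiting: 2
"""


def parse_stub_status(text):
    """Extract the seven counters from Nginx stub_status output."""
    active = re.search(r"Active connections:\s*(\d+)", text)
    totals = re.search(r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE)
    states = re.search(r"Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)", text)
    accepts, handled, requests = (int(n) for n in totals.groups())
    reading, writing, waiting = (int(n) for n in states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }


print(parse_stub_status(SAMPLE))
```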

Nginx’s stub status page provides connection and request metrics. You can also control which metrics get sent to Riemann and specify any required thresholds. Let’s run riemann-nginx-status now.

$ riemann-nginx-status --host riemann.example.com

We’re sending our metrics from our Nginx host to riemann.example.com and we should start to see events like these hit Riemann shortly:

{:host artemisia.example.com, :service nginx health, :state ok, :description Nginx status connection ok, :metric nil, :tags nil, :time 1421514112, :ttl 10.0}
{:host artemisia.example.com, :service nginx active, :state ok, :description nil, :metric 3, :tags nil, :time 1421514112, :ttl 10.0}

Here we have a health check and the active connections metric. These events should also now be passed through to Graphite. Let’s see the resulting graphs in the Graphite web console.

We can see several metrics in our graph but not our health event. This is because Riemann only forwards events that have metrics. As the health event has a metric value of nil it’s not forwarded along to Graphite.
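To picture what is being forwarded, here is a sketch of turning an event into a line of Carbon's plaintext protocol, which is simply "path value timestamp" followed by a newline. The function name graphite_line is hypothetical, and so is the path convention used here (host, then the service with spaces replaced by dots); Riemann's real path function may name things differently.

```python
def graphite_line(event):
    """Return a Carbon plaintext line for an event, or None if it has no metric."""
    if event.get("metric") is None:
        return None   # nothing to graph, so nothing is forwarded
    # Hypothetical path convention: host, then service with spaces as dots.
    path = event["host"] + "." + event["service"].replace(" ", ".")
    return "%s %s %d\n" % (path, event["metric"], event["time"])


health = {"host": "artemisia.example.com", "service": "nginx health",
          "state": "ok", "metric": None, "time": 1421514112}
active = {"host": "artemisia.example.com", "service": "nginx active",
          "state": "ok", "metric": 3, "time": 1421514112}

print(graphite_line(health))
print(graphite_line(active))
```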

Pretty simple eh? Instant graph gratification.

Summary

We’ve seen how to install Graphite and connect it to Riemann. We’ve also seen how easy it is to turn our metrics into useful graphs. Building on this we could easily add categorization, filtering and manipulation (you remember all those cool things Riemann can do to events and their contents). A good starting point is The Guardian’s Riemann configuration. There’s lots of useful examples and ideas here. Enjoy!

Using Riemann for Metrics was originally published by James Turnbull at Kartar.Net on January 19, 2015.

A Monitoring Maturity Model

2015-01-13T00:00:00-05:00

I’ve been thinking a lot about monitoring maturity. Based on some research I did last year and a number of conversations with people in the industry I’ve documented a simple monitoring maturity model. I present it largely because some folks might be interested rather than as any sweeping revelation.

The three level maturity model reflects the various stages of monitoring evolution I’ve seen organizations experience. The three stages are:

Manual
Reactive
Proactive

Onto the details of the stages.

Manual or None

Monitoring is largely done manually or not at all. If monitoring is performed you will commonly see checklists, or simple scripts and other non-automated processes. Much of the monitoring is cargo cult behaviour where the components that are monitored are those that have broken in the past. Faults in these components are remediated by repeatedly following rote steps that have also “worked in the past”.

The focus is entirely on minimizing downtime and managing assets. Monitoring provides little or no value in measuring quality or service and provides little or no data that helps IT justify budgets, costs or new projects.

This is typical in small organizations with limited IT staffing, where there are no dedicated IT staff or where the IT function is run or managed by non-IT staff, such as a Finance team.

Reactive

Monitoring is mostly automatic with some remnants of manual or unmonitored components. Tooling of varying sophistication has been deployed to perform the monitoring. You will commonly see tools like Nagios with stock checks of basic concerns like disk, CPU and memory. Some performance data may be collected. Most alerting will be simple and via email or messaging services. There may be one or more centralized consoles displaying monitoring status.

There is a broad focus on measuring availability and managing IT assets. There may be some movement towards using monitoring data to measure customer experience. Monitoring provides some data that measures quality or service and provides some data that helps IT justify budgets, costs or new projects. Most of this data needs to be manipulated or transformed before it can be used though. A small number of operationally-focussed dashboards exist.

This is typical in small to medium enterprises and common in divisional IT organizations inside larger enterprises. Typically here monitoring is built and deployed by an operations team. You’ll often find large backlogs of alerts and stale check configuration and architecture. Updates to monitoring systems tend to be reactive in response to incidents and outages. New monitoring checks are usually the last step in application or infrastructure deployments.

Proactive

Monitoring is considered core to managing infrastructure and the business. Monitoring is automatic and often driven by configuration management tooling. You’ll see tools like Nagios, Sensu, and Graphite with widespread use of metrics and graphing. Checks will tend to be more application-centric, with many applications being instrumented as part of development. Checks will also focus on measuring application performance and business outcomes rather than stock concerns like disk and CPU. Performance data will be collected and frequently used for analysis and fault resolution. Alerting will be annotated with context and likely include escalations and automatic responses.

There is a focus on measuring quality of service and customer experience. Monitoring provides data that measures quality or service and provides data that helps IT justify budgets, costs or new projects. Much of this data is provided directly to business units, application teams and other interested parties via dashboards and reports.

This is typical in web-centric organizations and many mature startups. Monitoring will still largely be managed by an operations team but responsibility for ensuring new applications and services are monitored may be devolved to application developers. Products will not be considered feature complete or ready for deployment without monitoring and instrumentation.

Summary

I don’t believe or claim this model is perfect (or overly scientific). It’s also largely designed so I can quantify some work I am conducting. The evolution of monitoring in organizations varies dramatically, or as William Gibson said: “The future is not evenly distributed.” The stages I’ve identified are broad. Organizations may be at varying points of a broad spectrum inside those stages.

Additionally, what makes measuring this maturity difficult is that I don’t think all organizations experience this evolution linearly or holistically. This can be the consequence of having employees with varying levels of skill and experience over different periods. Or it can be that different segments, business units or divisions of an organization have quite different levels of maturity. Or both.

A Monitoring Maturity Model was originally published by James Turnbull at Kartar.Net on January 13, 2015.

Using Riemann for Fault Detection

2015-01-05T00:00:00-05:00

In the last post I introduced you to Riemann. I mentioned streams in that post and how they are at the heart of Riemann’s power. However I only provided a vague teaser of streams and left you having to go fish for yourself.

In this post I’m going to build on our example Riemann configuration. I’ll show you how to do simple service management with streams and introduce you to Riemann’s state table: the index. We’ll see:

How the index works.
How we can alert on services and hosts using events.
How we can send those alerts via email and PagerDuty.

Configuring Streams

Streams are specified in Riemann’s Clojure-based configuration file. On our example Ubuntu host we can find that file at /etc/riemann/riemann.config. We edited that configuration in the last post to bind Riemann to all interfaces and to add some more logging. Let’s look at it again now.

(logging/init {:file "/var/log/riemann/riemann.log"})

; Listen on all interfaces over TCP (5555), UDP (5555), and websockets
; (5556)
(let [host "0.0.0.0"]
(tcp-server {:host host})
(udp-server {:host host})
(ws-server  {:host host}))

; Expire old events from the index every 5 seconds.
(periodically-expire 5)

(let [index (index)]
  ; Inbound events will be passed to these streams:
  (streams
    (default :ttl 60
      ; Index all events immediately.
      index

      ; Log expired events.
      (expired
        (fn [event] (info "expired" event))))))

In our configuration we can see a section called (streams. Inside this section is where we configure Riemann’s streams. The first entry in this section specifies a default time to live for events. More on this shortly. The second entry tells Riemann to index all events.

The Riemann Index

The index is a table of the current state of all services being tracked by Riemann. In the last post, when we introduced events, we discovered that each Riemann event is a struct that can contain any of a number of (optional) fields including: host, service, state, a time and description, a metric value or a time to live. Each event you tell Riemann to index is added and mapped by its host and service fields. The index retains the most recent event for each host and service. You can think about the index as Riemann’s worldview. The Riemann dashboard, which we also saw in the last post, uses the index as its source of truth.

Each indexed event has a Time To Live or TTL. The TTL can be set by the event’s ttl field or as a default. In our configuration we’ve set the default TTL to 60 seconds with the default function. This is the TTL applied to any event which doesn’t already have one.

After an event’s TTL expires it is dropped from the index and fed back into the stream with a state of expired. This seems pretty innocuous right? Nope! This is where the change in monitoring methodology that Riemann facilitates starts to become clear (and exciting).

Detecting down services

In the last post I talked a bit about pull/polling models versus push models for monitoring. In the monitoring “pull model” we actively poll services, for example using an active check like a Nagios plugin. If any of those services failed to respond or returned a malformed response our monitoring system would alert us to that. This active monitoring generally results in a centralized, monolithic and vertically scaled solution. That’s not an ideal architecture.

In an event-driven push model we don’t do any active monitoring. Our services generate events. Those events are pushed to Riemann. Each event has a TTL and the last event received is stored in the index. When the TTL expires Riemann will expire the event and feed it back into the stream. In that stream I can then monitor for events with a state of expired and alert on those. A much simpler, more scalable and IMHO more elegant solution.
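The mechanics are simple enough to model in a few lines. The following is a toy sketch of an index, not Riemann's implementation: the Index class and its method names are hypothetical. It keeps the latest event per host and service, and an expire sweep removes anything older than its TTL and hands back an expired event for each.

```python
class Index:
    """Toy model of an index: latest event per (host, service), with TTLs."""

    def __init__(self, default_ttl=60):
        self.default_ttl = default_ttl
        self.events = {}

    def update(self, event):
        """Store the event, replacing any earlier one for the same host and service."""
        event = dict(event)
        event.setdefault("ttl", self.default_ttl)
        self.events[(event["host"], event["service"])] = event

    def expire(self, now):
        """Drop events older than their TTL and return them with state 'expired'."""
        expired = []
        for key, event in list(self.events.items()):
            if now - event["time"] > event["ttl"]:
                del self.events[key]
                expired.append(dict(event, state="expired", time=now))
        return expired


index = Index(default_ttl=60)
index.update({"host": "varnish.example.com", "service": "varnish client_conn",
              "state": "ok", "time": 100, "ttl": 10})

print(index.expire(105))   # still within the TTL, so nothing expires
print(index.expire(111))   # TTL exceeded, so we get an expired event back
```

Nothing ever polls the service here. Silence is the signal: if events stop arriving, the sweep eventually produces an expired event that we can alert on.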

So let’s see how this might work for a service. In the last post we looked at some of the Riemann tools for service checking. Let’s use the riemann-varnish tool again for our testing.

On our Varnish host we need to install riemann-tools via RubyGems.

$ sudo gem install riemann-tools

We can then use riemann-varnish to send our events.

$ riemann-varnish --host riemann.example.com

The riemann-varnish command wraps the varnishstat command and converts Varnish statistics into Riemann events, for example the client connections accepted metric generates an event like so:

:host varnish.example.com, :service varnish client_conn, :state ok, :description Client connections accepted, :metric 13795.0, :tags nil, :time 1419404501, :ttl 10.0

We can see that the event has a host and a service, the combination of which Riemann will use to track state in the index. The event also has a state field of ok plus other useful information like the actual client connections accepted metric.

We’re going to use this data, plus the TTL, to do basic service monitoring with Riemann. Let’s update our configuration to send an email when a service changes state.

(def email (mailer {:from "riemann@example.com"}))

(let [index (index)]
; Inbound events will be passed to these streams:
(streams
  (default :ttl 60
    ; Index all events immediately.
    index

    (changed-state {:init "ok"}
      (email "james@example.com")))))

The first thing we’ve added is a function called email that configures the emailing of events. Under the covers Riemann uses Postal to send email for you. This basic configuration uses local sendmail to send emails. The From email will be riemann@example.com. You could also configure sending via SMTP. To send emails you’ll need to ensure you have local mail configured on your host. To do this I usually install the mailutils package.

$ sudo apt-get -y install mailutils

If you don’t install a suitable local mail server then you’ll receive a somewhat cryptic error in your Riemann log along the lines of:

riemann.email$mailer$make_stream threw java.lang.NullPointerException

Next we’ve used a helper shortcut called changed-state to monitor for events whose state has changed. The :init option specifies the base assumption of an event’s state, here ok. We need this because Riemann doesn’t know the previous state of events when it starts, so this tells Riemann to assume previous events were all okay. Now the changed-state shortcut will match any event whose state differs from the last state seen for that host and service, and pass it to the email function we defined earlier.
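If it helps to see the logic spelled out, here is a rough Python model of what changed-state is doing. It is a sketch under my own assumptions, not Riemann's code: changed_state is a hypothetical name, and it tracks the last state per host and service, starting from the init value.

```python
def changed_state(init, child):
    """Call child only when an event's state differs from the previous state
    seen for the same host and service (assumed to be `init` at startup)."""
    last = {}

    def handler(event):
        key = (event.get("host"), event.get("service"))
        previous = last.get(key, init)
        last[key] = event.get("state")
        if event.get("state") != previous:
            child(event)

    return handler


alerts = []
stream = changed_state("ok", alerts.append)

for state in ["ok", "ok", "expired", "expired", "ok"]:
    stream({"host": "varnish.example.com", "service": "varnish client_conn",
            "state": state})

print([event["state"] for event in alerts])   # only the transitions get through
```

Note that the recovery (expired back to ok) is passed on as well as the failure, which is why we get a second email when the service comes back.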

Let’s see this in action. First, we need to restart or HUP Riemann. Next, whilst I’ve been explaining this, the riemann-varnish tool has been sending events to Riemann. Those events are from my Varnish host, varnish.example.com, and an event is generated by each Varnish metric. Each event has a state of ok and a TTL of 10 seconds.

:host varnish.example.com, :service varnish client_conn, :state ok, :description Client connections accepted, :metric 13795.0, :tags nil, :time 1419404501, :ttl 10.0

If Varnish fails or I stop the riemann-varnish tool then the events flow will cease. When the TTL has expired, 10 seconds later, this should trigger an event with a state of expired and email notifications telling us that the Varnish services have changed state.

If we check our Riemann log file we should see the following event.

:time 1420058947163/1000, :state expired, :metric 7184.0, :tags nil, :service varnish client_conn, :host varnish.example.com

We should also see additional events for each Varnish metric that has expired. If we check our inbox we should see email notifications for each service that has stopped reporting.

If the service starts working again you’ll receive another set of notifications that things are back to normal.

Preventing spikes and flapping

Like most monitoring systems we also have to be conscious of the potential for state spikes and flapping. Riemann provides a useful stream to help us here called stable. This stream allows us to specify a time period and an event field, like the state (or usefully the metric for certain types of monitoring), and it watches for spiky or flapping behavior. Let’s add stable to our example.

(let [index (index)]
  ; Inbound events will be passed to these streams:
  (streams
    (default :ttl 60
      ; Index all events immediately.
      index

      (changed-state {:init "ok"}
        (stable 60 :state
          (email "james@example.com"))))))

Here we’ve specified the stable stream with a time period of 60 seconds, watching the state of events. This means Riemann will only pass on events where the state remains the same for at least 60 seconds, which should help us avoid alerting on flapping services. (Also potentially interesting here is the ability to rollup and throttle event streams.)
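If the buffering idea is hard to picture, here is a rough Ruby model of it. Treat it as a sketch under assumptions rather than Riemann’s implementation: the Stable class and its tick method are hypothetical, and the timer is driven by hand so the example stays self-contained.

```ruby
# Simplified model of stable: hold events back while the watched field keeps
# changing, and release them once the value has held for dt seconds.
class Stable
  def initialize(dt, field)
    @dt = dt
    @field = field
    @value = nil
    @deadline = nil
    @buffer = []
    @stable = false
  end

  # Feed an event in. Returns the events released by this call (often none).
  def call(event)
    if @deadline.nil? || event[@field] != @value
      # The value changed: drop what we held for the old value, restart the clock.
      @value = event[@field]
      @deadline = event[:time] + @dt
      @buffer = [event]
      @stable = false
      []
    elsif @stable
      [event]
    else
      @buffer << event
      []
    end
  end

  # Stand-in for a timer: releases the buffer once the deadline has passed.
  def tick(now)
    return [] if @stable || @deadline.nil? || now < @deadline

    @stable = true
    released, @buffer = @buffer, []
    released
  end
end

stable = Stable.new(60, :state)
stable.call({ state: 'expired', time: 0 })   # a brief failure...
stable.call({ state: 'ok', time: 10 })       # ...that recovers...
stable.call({ state: 'expired', time: 20 })  # ...and then fails for good
puts stable.tick(70).inspect                 # too early, nothing is released
puts stable.tick(80).map { |event| event[:time] }.inspect
```

The two early flaps never reach the child stream. Only the failure that began at time 20 and then held for 60 seconds is released.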

Sending events to PagerDuty

We aren’t limited to email for alerting either. Riemann comes with some additional options, most notably PagerDuty.

(def pd (pagerduty "pagerduty-service-key"))

(let [index (index)]
  ; Inbound events will be passed to these streams:
  (streams
    (default :ttl 60
      ; Index all events immediately.
      index

      (changed-state {:init "ok"}
        (stable 60 :state
          (where (state "ok") (:resolve pd))
          (where (state "expired") (:trigger pd)))))))

Here we’ve defined pd, which holds the PagerDuty adapter created from a service key we previously defined in PagerDuty. Its :trigger and :resolve entries are the streams we send events to. We’ve updated our state monitoring to act in two cases:

When an event has a state of expired we send an alert trigger to PagerDuty.
When an event has a state of ok we send a resolution signal to PagerDuty.

This ensures we can both trigger and resolve issues created from Riemann.
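The two where clauses amount to a small routing table. Sketched in Ruby, with a hypothetical pagerduty_action helper that is not part of any Riemann client, it looks like this:

```ruby
# Mirrors the two where clauses: expired events trigger an incident,
# ok events resolve it, and every other state is ignored.
def pagerduty_action(event)
  case event[:state]
  when 'expired' then :trigger
  when 'ok'      then :resolve
  end
end

%w[expired ok critical].each do |state|
  puts "#{state} -> #{pagerduty_action({ state: state }).inspect}"
end
```

Note that with this configuration an event in any other state, for example critical, does not page anyone. You would need another where clause if your clients send such states.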

Let’s trigger some PagerDuty alerts. First, we need to restart or HUP Riemann to update our configuration. Next, we can generate some alerts by stopping our riemann-varnish tool again. The expired events should then trigger PagerDuty alerts for each Varnish service.

Summary

Pretty cool stuff eh? Well this post just scratches the surface of things you can do with Riemann streams. There are a bunch of other ideas and examples in the Riemann HOWTO section that you can explore. Also look out for my next post on Riemann where I’ll be looking at streams again, this time with a focus on metrics and Graphite.

Using Riemann for Fault Detection was originally published by James Turnbull at Kartar.Net on January 05, 2015.

An Introduction to Riemann

2014-12-26T00:00:00-05:00

If only I had the theorems! Then I should find the proofs easily enough - Bernhard Riemann

For the last year I’ve been using nights and weekends to look at a variety of monitoring and logging tools. For reasons. I’ve spent a lot of hours playing with Nagios again (some years ago I wrote a book about it) as well as looking at tools like Sensu and Heka. One of the tools I am reviewing and am quite excited about is Riemann.

Riemann is a monitoring tool that aggregates events from hosts and applications and can feed them into a stream processing language to be manipulated, summarized or actioned. The idea behind Riemann is to make monitoring and measuring events an easy default. Riemann also provides alerting and notifications, the ability to send events onto other services and storage and a variety of other integrations. Overall, Riemann is fast and highly configurable. Most importantly, however, it uses an event-centric push model.

So why does this matter? Most monitoring systems I’ve been examining are pull or polling-based systems like Nagios where your monitoring system queries the components being monitored. A classic (perhaps even traditional) check might be an ICMP-based ping of a server. This type of polling is focused on measuring uptime and availability. There’s nothing fundamentally wrong with wanting to know that assets are available and running. Except if that’s the only question you ask. Then it reinforces the view of IT as a cost center.¹ Everything in the IT organization tends to be focused around minimizing downtime rather than maximizing value.

Push-based models in comparison are generally about measurement. You still get availability measurement but as a side effect of measuring components and services. The push model also introduces some changes in the way monitoring is architected. Monitoring is no longer a monolithic central function and we don’t need to vertically scale that monolith as hosts are added. Instead pushes are decentralized and the focus is on measuring your applications, your business and your user experience. This changes the focus inside your IT organization towards measuring value, throughput and performance. All levers that are about profit rather than cost.²

So with this in mind, let’s take a look at installing Riemann, configuring it and doing some basic service and event monitoring.

Introducing Riemann

Riemann is open source and licensed under the Eclipse Public License. It is primarily authored by Kyle Kingsbury aka Aphyr.³ Riemann is written in Clojure and runs on top of a JVM.

Installing Riemann

We’re going to install Riemann onto an Ubuntu 14.04 host. We’re going to use the Riemann project’s DEB packages. Also available are RPM packages and tarballs. I am going to do a manual install so you can see the steps involved but you could also install Riemann via Docker, Puppet, Vagrant, or Chef.

First, we’ll need Java and Ruby installed. The Java to run Riemann itself and Ruby for some supporting libraries, a client and the Riemann dashboard. For Java we’re going to use the default OpenJDK available on Ubuntu. For Ruby we’re going to install the ruby-dev package which will drag in Ruby and all the required dependencies we need. We also need the build-essential package to allow us to compile some of the Ruby dependencies.

$ sudo apt-get -y install default-jre ruby-dev build-essential

Then let’s check Java is installed correctly.

$ java -version
java version "1.7.0_65"
OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-0ubuntu0.14.04.1)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

Now let’s grab the DEB package of the current release.

$ wget https://aphyr.com/riemann/riemann_0.2.8_all.deb

And then install it via the dpkg command.

$ sudo dpkg -i riemann_0.2.8_all.deb

The Riemann DEB package installs the riemann binary and supporting files, service management and a default configuration file.

Lastly, let’s install some supporting tools, the Riemann client and dashboard.

$ sudo gem install --no-ri --no-rdoc riemann-client riemann-tools riemann-dash

Running Riemann

We can run Riemann interactively via the command line or as a daemon. If we’re running it as a daemon we can use the Ubuntu service management commands:

$ sudo service riemann start
$ sudo service riemann stop
. . .

Let’s start though with running it interactively using the riemann binary. To do this we need to specify a configuration file. Conveniently the installation process has added one at /etc/riemann/riemann.config.

$ sudo riemann /etc/riemann/riemann.config
loading bin
INFO [2014-12-21 18:13:21,841] main - riemann.bin - PID 18754
INFO [2014-12-21 18:13:22,056] clojure-agent-send-off-pool-2 - riemann.transport.websockets - Websockets server 127.0.0.1 5556 online
INFO [2014-12-21 18:13:22,091] clojure-agent-send-off-pool-4 - riemann.transport.tcp - TCP server 127.0.0.1 5555 online
INFO [2014-12-21 18:13:22,099] clojure-agent-send-off-pool-3 - riemann.transport.udp - UDP server 127.0.0.1 5555 16384 online
INFO [2014-12-21 18:13:22,102] main - riemann.core - Hyperspace core online

We can see that Riemann has been started and a couple of services have been started: a Websockets server on port 5556 and TCP and UDP servers on port 5555. By default Riemann binds to localhost only.
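If you want to confirm the TCP listener is really there before pointing clients at it, a quick check with Ruby’s standard library will do. This is a minimal sketch: port_open? is a hypothetical helper, and the host and port are simply the defaults from the output above.

```ruby
require 'socket'

# Returns true if a TCP connection to host:port succeeds within the timeout.
def port_open?(host, port, timeout = 2)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError, IOError
  false
end

if port_open?('127.0.0.1', 5555)
  puts 'Riemann TCP port is reachable'
else
  puts 'Riemann TCP port is not reachable'
end
```

This only proves that something is listening on the port, not that it is Riemann, so treat it as a first sanity check.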

The default configuration on Ubuntu logs to /var/log/riemann/riemann.log and you can also follow the daemon’s activity there.

Configuring Riemann

Riemann is configured using a Clojure configuration file. By default on Ubuntu it is available at /etc/riemann/riemann.config. Let’s take a quick look at the default file.

; -*- mode: clojure; -*-
; vim: filetype=clojure

(logging/init {:file "/var/log/riemann/riemann.log"})

; Listen on the local interface over TCP (5555), UDP (5555), and websockets
; (5556)
(let [host "127.0.0.1"]
(tcp-server {:host host})
(udp-server {:host host})
(ws-server  {:host host}))

; Expire old events from the index every 5 seconds.
(periodically-expire 5)

(let [index (index)]
; Inbound events will be passed to these streams:
(streams
  (default :ttl 60
    ; Index all events immediately.
    index

    ; Log expired events.
    (expired
      (fn [event] (info "expired" event))))))

We can see the file is broken into a few stanzas. The first stanza sets up Riemann’s logging to a file: /var/log/riemann/riemann.log. The second stanza controls Riemann’s interfaces: binding TCP, UDP and Websockets interfaces to localhost by default. Let’s make a quick change here to bind these interfaces to all available networks.

(let [host "0.0.0.0"]
(tcp-server {:host host})
(udp-server {:host host})
(ws-server  {:host host}))

We’ve updated the host value from 127.0.0.1 to 0.0.0.0. This means if one of your interfaces is on the Internet then your Riemann server is now on the Internet. If you’re worried about security you can also configure Riemann with TLS.

The remaining sections configure indexing and streams. Streams are a big part of why Riemann is very cool. Streams are functions you can pass events to for aggregation, modification, or escalation. Streams can also have child-streams that they can pass events to, allowing filtering or partitioning of the event stream. Using streams is amazingly powerful and you can find sample configurations and a wide variety of howtos on the Riemann site.
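If the idea of streams and child streams is new, this small Ruby model may help. It is an illustration only: real Riemann streams are Clojure functions, and where_state is a made-up helper that loosely imitates a filtering stream.

```ruby
# A stream is modelled as a lambda that takes an event. A filtering stream
# decides whether to hand the event on to its child streams.
def where_state(state, *children)
  lambda do |event|
    children.each { |child| child.call(event) } if event[:state] == state
  end
end

seen = []
log_all = ->(event) { seen << [:all, event[:service]] }
alert   = ->(event) { seen << [:alert, event[:service]] }

# Two top-level streams: one sees everything, one only passes expired events on.
streams = [log_all, where_state('expired', alert)]

events = [
  { service: 'varnish client_conn', state: 'ok' },
  { service: 'varnish client_req',  state: 'expired' }
]
events.each { |event| streams.each { |stream| stream.call(event) } }

seen.each { |entry| puts entry.inspect }
```

Every event reaches every top-level stream, and each stream decides independently what, if anything, to pass to its children.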

Let’s make a small change to our streams stanza to output events to STDOUT and our log file. Add the following at the bottom of the file after all of the other stanzas.

;print events to the log
(streams
  prn

  #(info %))

The prn prints all events to STDOUT and the #(info %) sends events to the log file. Now restart Riemann to enable our new configuration.

Sending data to Riemann

Riemann has a variety of ways you can send data to it including a set of tools and a variety of native client language bindings. You can find a full list of the clients here and we’ll see how to use a client below. The collection of tools is written in Ruby and available via the riemann-tools gem we installed above. Each tool ships as a separate binary and you can see a list of the available tools here. They include basic health checks, web services like Apache and Nginx, Cloud services like AWS and a variety of others. The code is clear and you could easily extend or adapt these to provide a variety of other monitoring capabilities.

The easiest of these tools to test is riemann-health. It sends CPU, Memory and load statistics to Riemann. Open up a new session and launch it now.

$ riemann-health

You can either run it locally on the same host you’re running Riemann on or you can point it at a Riemann server using the --host flag.

$ riemann-health --host myriemann.example.com

Remember that by default Riemann is only bound to localhost, but we updated our configuration to bind to all interfaces.

Now let’s look at our incoming data. Let’s start with looking at the Riemann log file.

$ tail -f /var/log/riemann/riemann.log
INFO [2014-12-23 17:23:47,050] pool-1-thread-16 - riemann.config - #riemann.codec.Event{:host riemann.example.com, :service disk /, :state ok, :description 11% used, :metric 0.11, :tags nil, :time 1419373427, :ttl 10.0}
INFO [2014-12-23 17:23:47,055] pool-1-thread-18 - riemann.config - #riemann.codec.Event{:host riemann.example.com, :service load, :state ok, :description 1-minute load average/core is 0.11, :metric 0.11, :tags nil, :time 1419373427, :ttl 10.0}
. . .

Here we can see a couple of events, one for disk space and another for load. Each Riemann event is a struct. Each event can contain any of a number of optional fields including: host, service, state, a time and description, a metric value or a TTL. They can also contain custom fields.

Let’s examine one of the disk events riemann-health has sent:

:host riemann.example.com, :service disk /, :state ok, :description 11% used, :metric 0.11, :tags nil, :time 1419373427, :ttl 10.0

We can see the event has a host, service, and state. If we peek over at the code that produced the event we can see how it is generated and sent. As event APIs go it’s very lightweight but still hugely extensible.
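When you’re poking at the log file it can be handy to turn one of these lines back into a hash of fields. Here’s a minimal sketch in plain Ruby. The parse_event helper is hypothetical and assumes the log format shown above, where each field starts with a colon-prefixed key and values are not quoted.

```ruby
# Turns a logged event such as ":host a, :service b, ..." into a hash.
# Splits only on ", " that is followed by a new ":key ", so commas inside a
# description are left alone.
def parse_event(line)
  body = line[/\{(.*)\}/, 1] || line
  body.split(/, (?=:[a-z_]+ )/).each_with_object({}) do |pair, event|
    key, value = pair.split(' ', 2)
    event[key.sub(/\A:/, '').to_sym] = (value == 'nil' ? nil : value)
  end
end

line = ':host riemann.example.com, :service disk /, :state ok, ' \
       ':description 11% used, :metric 0.11, :tags nil, :time 1419373427, :ttl 10.0'

event = parse_event(line)
puts event[:service]
puts event[:description]
puts event[:tags].inspect
```

All values come back as strings (or nil), so convert the metric, time and TTL yourself if you need numbers.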

Let’s try another tool, riemann-varnish, which reports Varnish metrics. On one of my hosts with Varnish installed I run.

$ riemann-varnish --host riemann.example.com

And on the Riemann host I see in /var/log/riemann/riemann.log.

INFO [2014-12-24 02:01:41,660] pool-1-thread-19 - riemann.config - #riemann.codec.Event{:host varnish.example.com, :service varnish client_conn, :state ok, :description Client connections accepted, :metric 13795.0, :tags nil, :time 1419404501, :ttl 10.0}
INFO [2014-12-24 02:01:41,706] pool-1-thread-21 - riemann.config - #riemann.codec.Event{:host varnish.example.com, :service varnish client_drop, :state ok, :description Connection dropped, no sess/wrk, :metric 0.0, :tags nil, :time 1419404501, :ttl 10.0}
INFO [2014-12-24 02:01:41,751] pool-1-thread-22 - riemann.config - #riemann.codec.Event{:host varnish.example.com, :service varnish client_req, :state ok, :description Client requests received, :metric 15452.0, :tags nil, :time 1419404501, :ttl 10.0}

And to drill down to a specific event.

:host varnish.example.com, :service varnish client_conn, :state ok, :description Client connections accepted, :metric 13795.0, :tags nil, :time 1419404501, :ttl 10.0

Here we can see the Varnish client connections accepted metric. If we look at the riemann-varnish code we can see a shell-out to varnishstat that captures our metrics and sends them to Riemann. Pretty easy to replicate for a variety of services.

If you think the shell-out and parse is a little clumsy then we can also write our own tool or use the Riemann client directly. Let’s embed Riemann into a Sinatra application.

require 'rubygems'
require 'sinatra'
require 'riemann/client'
require 'socket'

configure do
  set :bind, '0.0.0.0'
end

get '/' do
  send_event(metric = rand)
  '<h1>This does something awesome</h1>'
end

def send_event(metric)
  c = Riemann::Client.new host: 'localhost', port: 5555, timeout: 5
  c << {
    host: Socket.gethostname,
    service: 'something awesome',
    metric: metric,
    description: "What an awesome number: #{metric}",
    time: Time.now.to_i - 10
  }
end

Our Sinatra app is very basic. It responds on / with the HTML: <h1>This does something awesome</h1>. As part of handling that request it also sends an event to Riemann using the Riemann client we installed earlier.

To do this we’ve required the riemann/client and inside the send_event method we’ve connected to the Riemann host on localhost. This method then accepts a metric, which is a random number created by the rand method, from the get block and sends that metric with an event.

If we run this app (you might need to gem install sinatra to install Sinatra first).

$ ruby riemann_sinatra.rb

And then look at our Riemann logs we’ll see an event much like this:

:host riemann.example.com, :service something awesome, :state nil, :description What an awesome number: 0.9984397664300542, :metric 0.9984397664300542, :tags nil, :time 1419449388, :ttl nil
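Notice that this event arrives with :state nil and :ttl nil, because our Sinatra app never sets them. If you want such events to expire or to take part in state-based alerting, you could fill those fields in before sending. Here’s a minimal sketch: build_event is a hypothetical helper, and the 0.9 threshold and 60 second TTL are arbitrary values chosen for illustration.

```ruby
require 'socket'

# Hypothetical helper: builds an event hash with an explicit state and TTL.
def build_event(metric, threshold: 0.9, ttl: 60)
  {
    host: Socket.gethostname,
    service: 'something awesome',
    metric: metric,
    # Riemann states are free-form strings; "ok" and "critical" are conventions.
    state: metric > threshold ? 'critical' : 'ok',
    ttl: ttl,
    description: "What an awesome number: #{metric}",
    time: Time.now.to_i
  }
end

puts build_event(0.42)[:state]
puts build_event(0.99)[:state]
```

The resulting hash can be handed to the client with c << exactly as send_event does above.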

Displaying Riemann events

Obviously reading events from the log output isn’t overly practical or useful. To allow you to work with your events Riemann comes with a dashboard. It’s a Sinatra application and we already installed it via the riemann-dash gem.

Let’s start it now.

$ riemann-dash

You can then view it on port 4567 on the localhost. You can also change the dashboard’s configuration by creating a config.rb file in the directory from which you’ve launched the dashboard. This provides control over where and how the dashboard binds and some other configuration options.

The dashboard is a little janky in places but can produce some excellent dashboards. The dashboard is made up of view panels that are configurable. You can select or add a view using the boxes and plus symbol in the top left of the dashboard.

We just want to see the events coming into our dashboard though. So let’s edit our current view to show those events. First, Ctrl-Click (or Meta-Click on OSX) on the big Riemann title in the centre top of the dashboard to select this view. This will highlight it gray (The Escape key de-selects the view). Now type “e” to edit the view.

Change the view from Title to Grid and then put true into the query box.

This will change this view into a grid, which shows a table of events, and the true in the query box selects all events. This is the simplest query you can create but you can do much more. To get started you can find some sample queries here.

Now you should see some of the events you’re generating displayed in a per-host grid.

If you’re not taken with the Riemann dashboard there is a Grid layout alternative or for graphing you could direct all your metrics to Graphite which has a very fully-featured dashboard.

Summary

We’ve barely scratched the surface of Riemann’s capabilities with this introduction. From here we could configure a variety of streams, matching events by service or host, and convert our events into summaries, metrics and collections.⁴ We can take alerting actions (email, PagerDuty) based on everything from failed services (replace Nagios anyone?), to metric thresholds, or even Holt-Winters anomaly detection. We can also send data onto longer-term storage or into other tools like Graphite. The Riemann HOWTO has a number of examples and ideas to help you build your Riemann environment further. I really recommend taking a look at Riemann if you’re interested in where modern monitoring is headed.

It also tends to reward conservatism and fear of change. ↩
This is a highly simplistic analysis of the potential for change in IT monitoring behaviour. Your mileage may vary. ↩
Kingsbury also published an excellent series on the CAP properties of a variety of distributed systems. ↩
Of course there’s even a Puppet Riemann report processor. ↩

An Introduction to Riemann was originally published by James Turnbull at Kartar.Net on December 26, 2014.