Anyone with significant infrastructure needs network monitoring. The larger the network, the greater the reward, but even very small networks today tend to be business critical, and getting proactive issue alerts beats handling the “your network is down” phone calls in reactive mode.

While in corporate life, I had mixed experiences with commercial systems. At one time or another, I worked with many of the well-known brands, from What’s Up Gold to SolarWinds to OpenView. All could do the job, some better than others. They varied quite widely in expense, learning curve, and what exactly the “job” was.

Recently, however, I have been exploring Open Source alternatives. While Open Source has certainly become more mainstream of late, this is really nothing new. Back around 2006 or so, I charged a merged I.T. Infrastructure team with “pick one” when we found we had far too many tools due to acquisitions. I did this with some trepidation; I had not told them to consider cost, just “pick one, not three”. Much to my surprise, they picked Open Source Nagios, even though we owned a couple of the biggest names (including OpenView). With a bit of Cacti thrown into the mix, we soon had 110 physical sites, most with redundant WAN links, and close to 1000 servers, all monitored with Nagios.

So… as I write this, I can hardly claim to be covering virgin territory, but most of my involvement at that time was to sign invoices and review reports. With more time on my hands of late, and with a client hoping to capitalize on Open Source to improve their network monitoring services to their clients, I decided to do a fairly thorough survey of Open Source tools and find out whether there was a clear winner.

So… spoiler alert: I picked Zabbix, but there is not necessarily a clear winner. Each has its strengths and weaknesses. And much like trying to pick the best potato chip, every time you walk down the shopping aisle you find a new flavor, and your comparison job could start over. Indeed, spending a year in off-and-on effort means that, for the many products undergoing rapid change, you need to go back and redo a lot of your work to make sure it is current.

For this article I am not going to dive deep into details. I will not even mention some of the products I looked at, as my data may not be current even as I write this, much less as you read it. Instead, I want to give a flavor of a few well-known tools and comment on their philosophy, which may be more important than individual details. A lot of this is subjective, but a lot of that subjectivity comes from trying the products at different clients in real-world environments, from the perspective of someone who would be monitoring their network and calling their admins’ attention to problems.

What’s Up Gold (IPSWITCH)

So if you are paying attention you may wonder why this is here. I view What’s Up Gold (WUG) as a good, middle-ground, representative Network Monitoring System (NMS), and as such it is worth reviewing it in a similar fashion for comparison. Think of it as a benchmark.  Website here.

WUG tries to be an “all things to all people” tool. It covers areas ranging from device discovery and inventory, to “up/down” monitoring of both devices and services, to network performance, to application-level performance. They then add configuration management and layer 2 discovery just to keep things interesting. It is (relatively) easy to set up, following the Windows model of “just run Setup and trust me” to a large degree. At least to get the basics going, it is easy to use. It has recently gone through a major rework of its user interface, and has nice showy graphs and gauges to impress management and make your users believe you really know what is happening on your network. It is extensible; you can easily write your own checks, based either on SNMP or on a scripting language, WMI, ODBC, LDAP, etc.

WUG has a lot going for it, and is reasonably affordable as commercial products go. If you prefer commercial, it is a nice balance between cost and function.  I have used it at one company, and actually installed and configured it at two others.  It works.

To balance that, it suffers from some of the worst aspects of commercial software. The priority is always “more sales”, and in particular “more new sales”. Giving new customers what they want may or may not be aligned with actually making the product better. Sometimes the answer really needs to be “that is not aligned with our product” as opposed to “let me see if I can super-glue it onto the left side.”

WUG suffers from a lack of good integration amongst its components. It has two major alerting facilities (and at least one other minor one), so that the mechanism for finding out your SMTP server is down is completely and totally different from the mechanism for finding out it ran out of disk space. At a more fundamental level, it suffers from a lack of any real architectural discipline: the priority is always “more stuff”, not “make it fit as a well-thought-out whole”.

Welcome to commercial software. To be fair, being Open Source does not mean this problem never exists, but there the deciding factors are frequently driven by a core supporting group doing the work largely for their own benefit, or at least partially to enhance their own reputation. It is frequently more art than product, and driven by those who still get their hands dirty, rather than MBAs, JDs, and VCs. There is still space for architectural discipline in Open Source software; not so much in most commercial software, where it is displaced by the next quarterly financial report’s requirement to announce some new feature.

But… WUG is a worthy benchmark, and a functional and usable product.

Nagios, OMD and Variations

So let’s shift to some of the (mostly) Open Source tools.

Nagios is perhaps the best known, but like any Open Source tool over time, it has had many forks and look-alikes such as Icinga, Shinken, Check_MK, Thruk, and OMD. I do not mean to imply these are the same, as most of them took off from Nagios to fix issues Nagios had, but to my brief look they appear and act in a somewhat similar fashion.

Nagios is a mostly text-based, script-based collection of features for configuring and collecting data. It is a bit of an exaggeration to call it a “system”; it is more “cooperating chaos”. Its age means it has various workarounds for large volumes, for graphics, and for horizontal distribution. I think it is fair to say it grew more or less organically as opposed to being designed for today’s network. It speaks well for it that it has survived and thrived over that much time and technology change.

Some of the variations, such as OMD, Check_MK, and Multisite, are mechanisms to better package the various components and reduce the sophisticated Linux admin effort needed to keep it running. To some extent these are layers on top of and around what is under the covers, as opposed to re-engineering from the ground up.

In my opinion Nagios is showing its age, in all of its incarnations. Eventually all successful tools need to be re-invented and re-engineered from a clean sheet of paper. Equally true is that the more successful and pervasive such a tool is, the more inertia it carries and the less likely such a re-do is to be supported by its users (or, perhaps, the less likely its contributors can agree on when, why and how such a re-do occurs).

I tried several of these variations. OMD (http://omdistro.org/start) was my favorite, and Check_MK and other work by Mathias Kettner were some of the more impressive efforts at trying to improve and unify that system of tools (as of this writing, an interesting diagram showing the complexity of that world is at the bottom of his page here).

To compare: WUG is like buying a frozen dinner, as opposed to making a fancy dinner from scratch. The latter will give you a much better and more complete meal, but it sure takes a lot more work (and takes some expertise) to get there.

I will leave the universe of Nagios variations with this observation: if you have plenty of Linux and scripting expertise, the flavors you find here will do anything you want, but be prepared to have someone truly dedicated to this for quite some time. Also expect an ongoing, low-level amount of integration effort as Linux and each component go through upgrades. My advice: unless you have Nagios expertise and preference in house, keep looking.

OpenNMS

This is another older tool, and one which has a number of underlying components, but not nearly the fragmentation of the Nagios world. Its authors have exercised reasonable discipline to keep it under control and targeted; while it shows its age (in the GUI, for example), it is reasonably well organized. It also is (somewhat) a database-oriented tool (PostgreSQL); so much of Unix lives in the “everything is a text file” world, which struggles to scale and is a bit foreign to those who like databases as transaction-safe data stores.

It also follows a fairly typical model of being slightly commercialized: in this case there is a commercial flavor for those who want a more packaged solution, and a free edition which has less packaging and support. That seems a common model for many tools now.

In experimenting with OpenNMS, I found it my second favorite Open Source tool. On the good side it has very deep integration with many specific network devices. It has Layer 2 discovery built in and in use as part of its monitoring.  It is (somewhat) template and rules based, so you can scale to many similar devices more easily.

But… it is pretty arcane, and still requires editing a lot of text files to configure devices. Much like Nagios, someone who spends their life in the guts of Linux will feel fairly at home, but someone who is mostly a Cisco engineer is going to bang their head against the wall a lot.  On a related note the rules hierarchy and provisioning system is fairly inflexible in addition to arcane, making situations where similar devices have a vast array (but differing array) of services possible to handle gracefully, but not easily. It is very easy to head down the wrong path, and find yourself tangled up in conflicting setup rules.

On the good side, I like that it expects Services (e.g. SMTP, HTTP) to come and go, and it is set up to look for them, discover and provision for them, and report on them.  WUG (for example) is the opposite – it just keeps on monitoring a device for the same set of services all the time, and is not designed for dynamic discovery.

Fundamentally I passed on OpenNMS because the tool I choose is not intended for me; it is for network consultants to use day in and day out, and OpenNMS was too complicated and fragile to be managed by those whose skills lie in managing routers, not Linux. But it is certainly worth a look. Web site: http://www.opennms.org/en

Other Also-Rans

I tried a lot of other tools, such as Nedi (more of a continuous discovery tool), NTop (more about performance), Observium (too limited in the free version), PRTG (too limited, only 100 sensors).  There are many, but none made it past the first blush attempts to use them. Maybe I gave up too quickly on some.  Eventually it came down to OpenNMS and…

Zabbix

Zabbix is both a company and a product (website here), and follows a somewhat similar paradigm to OpenNMS. The product is free, but the company will provide a turn-key solution for a fee, with consulting and startup assistance. Unlike some others (but mostly like OpenNMS, with a few caveats), it is the same software; the company has confidence that its expertise is worth money and does not hamstring the software.

I have not followed its history closely, but Zabbix looks and feels like a purpose-built product, with an architecture that was less an assembly of disparate pieces and more a framework built up around a design. Certainly it uses existing tools (e.g. net-snmp), but there is a defined architecture in which they fit. Most importantly, where existing tools did not fit, someone wrote new ones; no superglue involved.

Important to me, Zabbix is almost entirely database-based. Configurations of monitored devices are all in the database, the performance data goes there, and the monitoring data goes there. The database is (mostly) normalized and well designed, with minimal cases of structured data stuffed into columns. A database programmer will be right at home, and can construct lots of useful tools (I chose PostgreSQL, which rivals commercial products in features and performance). On the downside (or maybe it is an upside), it supports multiple databases, which limits how precisely and effectively it can use the strengths of any specific database out of the box. It also contains a full-featured API which can be used for integration above the database level, including user-written scripts to do almost anything, from adding hosts to changing what is monitored.
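To give a feel for what scripting against that API looks like, here is a minimal sketch that only builds the JSON-RPC request body for adding a host linked to one template. It assumes the `host.create` method and its usual parameters; the host name, IP address, group ID, and template ID are made up for illustration, and authentication is left out because how the token is passed differs between Zabbix versions. Check the API documentation for your version before relying on any of it.

```python
import json


def build_host_create_request(hostname, ip, group_id, template_id, request_id=1):
    """Build a JSON-RPC 2.0 request body for the Zabbix `host.create` method.

    Authentication is deliberately omitted; it varies by Zabbix version.
    """
    return {
        "jsonrpc": "2.0",
        "method": "host.create",
        "params": {
            "host": hostname,
            "interfaces": [
                {
                    "type": 1,      # 1 = Zabbix agent interface
                    "main": 1,      # default interface of this type
                    "useip": 1,     # connect by IP rather than DNS name
                    "ip": ip,
                    "dns": "",
                    "port": "10050",
                }
            ],
            "groups": [{"groupid": group_id}],
            "templates": [{"templateid": template_id}],
        },
        "id": request_id,
    }


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    payload = build_host_create_request("core-switch-01", "192.0.2.10", "2", "10001")
    print(json.dumps(payload, indent=2))
```

The resulting body would be sent as an HTTP POST to the server's JSON-RPC endpoint; keeping the payload construction separate makes it easy to inspect before anything touches a live system.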

So why (some from the network world may ask) do I care that it is all in a database? Transactional control, and protection against corruption. In Zabbix, for example, I might make a change that touches hundreds of monitored devices and thousands of monitored items. Suppose it is in conflict with one near the end of this long list: Zabbix can just roll back the change and leave all hosts consistent. With text-based systems, great programmer skill is necessary to prevent half-updated changes in such situations.
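The all-or-nothing behavior is easy to demonstrate in miniature. The sketch below uses Python's built-in SQLite module rather than Zabbix or PostgreSQL, and the `items` table and its columns are invented for the example; it only illustrates the general idea of a bulk change that is rolled back as a whole when one row fails.

```python
import sqlite3


def apply_bulk_change(conn, updates):
    """Apply all (host, interval) updates atomically.

    Returns True if every update succeeded, False if the whole batch
    was rolled back because one of them violated a constraint.
    """
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for host, interval in updates:
                conn.execute(
                    "UPDATE items SET interval_s = ? WHERE host = ?",
                    (interval, host),
                )
        return True
    except sqlite3.IntegrityError:
        return False


# Hypothetical schema: one polling interval per host, which must be positive.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items ("
    " host TEXT PRIMARY KEY,"
    " interval_s INTEGER NOT NULL CHECK (interval_s > 0))"
)
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [("sw1", 60), ("sw2", 60), ("sw3", 60)],
)
conn.commit()

# The last update violates the CHECK constraint, so the first two are undone too.
ok = apply_bulk_change(conn, [("sw1", 30), ("sw2", 30), ("sw3", 0)])
rows = conn.execute("SELECT host, interval_s FROM items ORDER BY host").fetchall()
print(ok, rows)
```

After the failed batch, all three hosts still have their original interval. Getting the same guarantee with a directory of text files means writing that rollback logic yourself.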

Another strength of Zabbix is that it is strongly template based. You can, but should not, set up specific monitoring configuration on a device. Instead you should associate the device with templates, and the template holds the monitoring configuration. Then if you (for example) want to change how alerts for bad packets on an Ethernet interface work, you change one template, and the change is automatically applied to all hosts that use it. This makes it possible to set up Zabbix so that real network administrators can do device changes without any serious Linux or database expertise, as the templates provide an easy level of abstraction. A device gets, for example, an Active Directory Domain Controller template to monitor AD DC related items. You don’t necessarily have to know what should be monitored on an AD DC, and you don’t have to know how to poll the DC for that data; just “this is a DC, it gets a template for DCs”.
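The idea can be modeled in a few lines. This is a toy illustration of the template concept only, not Zabbix's data model: the `Template` and `Host` classes, the item name, and the threshold values are all invented, and the "later template wins" merge rule is a simplification chosen for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Template:
    name: str
    # Maps a monitored item name to its alert threshold.
    thresholds: dict = field(default_factory=dict)


@dataclass
class Host:
    name: str
    # Hosts carry no monitoring settings of their own, only template links.
    templates: list = field(default_factory=list)

    def effective_thresholds(self):
        """Merge thresholds from all linked templates (later ones win in this toy model)."""
        merged = {}
        for template in self.templates:
            merged.update(template.thresholds)
        return merged


ethernet = Template("Ethernet interface", {"bad_packets_per_min": 100})
switches = [Host(f"switch-{n}", [ethernet]) for n in range(1, 4)]

# One change to the template is seen by every host linked to it.
ethernet.thresholds["bad_packets_per_min"] = 25
print([h.effective_thresholds() for h in switches])
```

Because the hosts only hold references to the template, there is nothing per-device to edit when the alerting rule changes.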

The template basis of Zabbix, along with its consistent architecture, is what led me to choose it. Network Monitoring Systems should not be built just for Linux admins, but for admins generally: Windows admins, Cisco admins, phone system admins, etc. By providing a templated level of abstraction (and a reasonably friendly web-based mechanism to create and change templates), each can configure their devices without having to dive into the guts of /etc. This is not to say that Zabbix itself does not require some care and feeding from a Linux and database admin, but routine setup of new network devices, and moves and changes, do not.

Zabbix is also very much designed to be extended. Templates can be built that are general or quite device specific, and they can be shared and reused. While similar to “checks” in Nagios, they are a more isolated abstraction layer, and also more complete. They can include not only monitoring details, but also how and when to alert, graphs of monitored data, and screens that assemble multiple aspects of the device for presentation, all in an XML template that can be exported and given to another Zabbix user.

So what’s the downside…

Zabbix is limited to a certain portion of the monitoring world.  Its device discovery is primitive (basically IP address range scans), it does no layer 2 discovery at all, and its out-of-the-box templates are simplistic.  The framework is there, and it is nicely polished in terms of working out of the box, but it is not complete.  It works nicely, but you need someone to build better and more complete templates (or find them – there is a lot of contributed work available for download).  My guess is that if you paid for installation, the consultant would bring a huge bag of such templates ready to go, ones they do not give away.
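For readers who have not seen one, an address-range scan is about as simple as discovery gets. The sketch below is not how Zabbix implements it; it is a generic illustration using only the Python standard library, with a TCP connection attempt standing in for whatever probe a real tool uses. The function names, the default port, and the example range are all assumptions.

```python
import ipaddress
import socket


def candidate_addresses(cidr):
    """Return every usable host address in the given range as a string."""
    return [str(ip) for ip in ipaddress.ip_network(cidr).hosts()]


def responds_on(ip, port, timeout=0.5):
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def discover(cidr, port=22, probe=responds_on):
    """Walk the range and return the addresses that answer the probe."""
    return [ip for ip in candidate_addresses(cidr) if probe(ip, port)]


if __name__ == "__main__":
    # 192.0.2.0/30 is a documentation range; substitute a network you own.
    print(discover("192.0.2.0/30"))
```

Note what this cannot tell you: which switch port a device hangs off, or how devices are connected to each other. That is the layer 2 picture a range scan never sees.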

It is also incomplete on a few fronts. It does not try to do layer 2 discovery; if you want that, a tool like Netdisco is needed. It does not do configuration control for Cisco or other network devices (downloading configs, noticing changes); for that you might want Rancid. It is weak in log file monitoring; for that you might want ELK or Graylog.

So (you may be thinking), if I buy WUG I get all those things. Yes, you do (well, if you buy all the associated license pieces). However… WUG is mediocre in all of these areas; each of the tools mentioned above is significantly better in its respective area, just not integrated. Buying everything together on one PO is not the same as integration.

To some extent, it is a choice between partially integrated, mediocre tools and non-integrated, your-choice-of-best-of-breed tools added on to Zabbix. I’ll take my choice of better tools.

So… Zabbix. I decided on Zabbix.  Now what?

More on some of the guts of Zabbix in articles to follow.