Monday, April 14, 2008

Amazon adds Persistent Storage

http://www.amazon.com/gp/html-forms-controller/ec2-persistent-storage

UPDATE: http://www.allthingsdistributed.com/2008/04/persistent_storage_for_amazon.html

Werner Vogels has more detail at his blog than I saw in the email. This is amazing. The ability to MOVE the volume from host to host is icing on the cake. There is NO reason not to use Amazon EC2 as the starting point for hosting your entry into a new market space (assuming the SLA from Amazon meets your requirements). This is simply the cheapest way to spin up, and the concepts are ones that you will use in your OWN environment should you host it yourself at some point.

I got this in my email. This is "a good thing". The biggest problem with EC2 was the fact that it did not offer persistent storage beyond the size of your root volume. Combine that with the performance of your S3 /mnt volume and you couldn't expect to host any database-driven applications of substantial size on there. Obviously people have done it, but it required a few layers of complexity that made the solution more work than it was worth (i.e. replication with ANOTHER Amazon instance just to make sure that, should something happen and your VM went down, you didn't lose your database).

This was the BIGGEST sticking point I had with Amazon EC2. We made heavy use of the technology at RBX but the lack of a persistent volume made it simply inappropriate for any production-quality applications. To use EC2 appropriately, you had to rethink your application architecture. Before I left, we actually had the opportunity to do that for a customer application that was being written from the ground up. The plan I proposed was to host certain static content and assets on S3 and also provide an EC2 gateway application for more specialized usage. This would shift the burden of bandwidth over to Amazon and provide a more predictable rate scale for our customer's customers.

One of the biggest questions in a new internet application is hosting. Do you host it somewhere else where you might be limited to a single server and thus have performance problems? How much bandwidth do we really need? Are we going to go broke OVER-purchasing resources (server capacity, bandwidth)?

The Amazon persistent storage makes those choices a little less risky. I'm looking forward to creating some solutions on top of this.

Friday, April 11, 2008

Hostname vs. IP address for host definitions

A question just popped up on #nagios asking the following:

does nagios cache hostnames?
I'm just wondering, if I start using hostnames instead of IP addresses, and DNS goes down, will the whole system just fail?

The short answer I provided is that, yes, it will fail. This might prompt some of you to use the IP in the address line of a host definition. I'd like to advise against that though for several reasons:

1) DNS is a "good thing"

Simply put, DNS is good and proper. There's a reason it was implemented. With any network of substantial size, it simply becomes LESS efficient to refer to everything by IP than by a DNS entry. Obviously this falls apart if you try, like most people do at first, to be "witty" with DNS names. When I was at BellSouth DCS, they had 4 networked printers throughout the building. Each printer was named after a season, i.e. winter, spring, summer and fall. This worked for quite some time. However, once they added that fifth printer, everyone's memory of the printer names went to pot. They realized they had to rename all the printers, and this caused no end of confusion for end users. Printers aren't such a big deal, but when it's servers, it gets dangerous. Was xavier.mydomain.com the production server or was it qa? I thought wolverine.mydomain.com was qa so I made my changes there.

It's also important to use names that are descriptive. dbs-0X.mydomain.com (where X is a random number) is a good start but again, was dbs-01 the production database or the development database? Personally, I'm a fan of using an additional descriptor in the DNS entry:

- devdbs-01.mydomain.com
- qadbs-01.mydomain.com
- proddbs-01.mydomain.com


The company I'm with now actually uses an interesting hybrid and follows the second method I mentioned, but with good reason - servers are interchangeable. At any given point, all the applications on app-04 can be moved to app-07 to balance resources, and app-04 could be repurposed as a QA server. In this environment, servers are nothing more than part of a pool of resources. In this case, EXTENSIVE use of DNS is made, and production applications never share physical colocation with nonproduction applications. If one production application is running on a server, then every other application on that server is production.

2) FQDNs are more flexible

Simply put, it's easier to change an IP in one place than 5.

3) DNS failures can be good

Many times, because we defined address as a FQDN, I knew about failures in DNS before Nagios had even polled my DNS server.

Having said all of that, if you insist on using IP addresses or mixing them up (because of service checks against virtual hosts for instance), let me suggest the following:

Use a FQDN in host_name and the IP in address.

Using Nagios macros, you can easily modify a service or host check to use the hostname as opposed to the IP by changing the check command to use $HOSTNAME$ instead of $HOSTADDRESS$. The Nagios macro $HOSTNAME$ pulls from host_name while $HOSTADDRESS$ pulls from address.
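As a sketch, here's what that looks like in practice. These command names and the check_ping thresholds are just examples, not anything from a stock install:

```
# Hypothetical command definitions - $USER1$ is assumed to point
# at your plugin directory, thresholds are examples only.
# This one resolves via DNS because $HOSTNAME$ pulls from host_name:
define command {
    command_name    check-host-alive-by-name
    command_line    $USER1$/check_ping -H $HOSTNAME$ -w 3000.0,80% -c 5000.0,100% -p 5
}

# This one skips DNS entirely because $HOSTADDRESS$ pulls from address:
define command {
    command_name    check-host-alive-by-ip
    command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
}
```

With both the host_name and address populated, you can flip an individual check between the two behaviors just by changing which command it uses.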

Wednesday, April 9, 2008

Sold on unit testing

I'll tell anyone who asks that I'm NOT a programmer. I've always understood the value of unit tests but considered them somewhat limited in scope. I mean, it's pretty much impossible to write a unit test for every possible scenario and (IMHO) impossible to write a unit test that simulates the flow of a user from application login to data entry to logout using the entire infrastructure (i.e. client browser -> through load balancer -> through app server -> database -> back).

Because of those reasons, while I understood the need for unit tests for "stupid stuff", the fact that you had to add unit tests for each bug that popped up over the life of a product felt "odd". Of course this wouldn't stop me from bitching at a developer whose lack of a unit test caused me to restore a filesystem, because his code would ascend to the parent directory if the file or directory it was trying to delete did not exist! (True story.)

So here I am writing my Ruby Nagios library and realizing that I'm duplicating A LOT of code writing quick little scripts to test my output. I decide to write unit tests for each of my classes. I found an interesting little article on about.com about learning Ruby via unit tests. It was an interesting concept and one that made sense to me. So I start with my contact class.

Basically, each specific Nagios object type (contact, service, host) has its own class. One method they all share is called "hashify". It's different for each object type but essentially creates a nested hash like so:

{
  "nagiosadmin" => {
    "service_notification_period" => "24x7",
    "host_notification_options" => "d,u,r",
    "service_notifications_enabled" => nil,
    "host_notification_enabled" => nil,
    "pager" => nil,
    "service_notification_commands" => "notify-service-by-email",
    "host_notification_period" => "24x7",
    "alias" => "Nagios Admin",
    "host_notification_commands" => "notify-host-by-email",
    "service_notification_options" => "w,u,c,r",
    "email" => "root@localhost"
  },
  "johnv" => {
    "service_notification_period" => "24x7",
    "host_notification_options" => "d,u,r",
    "service_notifications_enabled" => nil,
    "host_notification_enabled" => nil,
    "pager" => nil,
    "service_notification_commands" => "notify-service-by-email",
    "host_notification_period" => "24x7",
    "alias" => "John E. Vincent",
    "host_notification_commands" => "notify-host-by-email",
    "service_notification_options" => "w,u,c,r",
    "email" => "root@localhost"
  }
}
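
For reference, a method producing that kind of structure might look roughly like this. This is a minimal sketch, NOT the actual library code - the class shape and the abbreviated attribute list are assumptions:

```ruby
# Minimal sketch of a contact class with a hashify method.
# NOT the actual library code; attribute list abbreviated for illustration.
class Contact
  ATTRIBUTES = %w[alias email pager host_notification_period]

  attr_reader :name

  def initialize(name, attrs = {})
    @name  = name
    @attrs = attrs
  end

  # Returns { "name" => { "attribute" => value, ... } }.
  # Note the STRING keys - that detail matters later.
  def hashify
    inner = {}
    ATTRIBUTES.each { |a| inner[a] = @attrs[a] }
    { @name => inner }
  end
end

contacts = [
  Contact.new("nagiosadmin", "alias" => "Nagios Admin", "email" => "root@localhost"),
  Contact.new("johnv", "alias" => "John E. Vincent", "email" => "root@localhost")
]

# Merge each contact's hash into one big nested hash like the output above
all = contacts.inject({}) { |h, c| h.merge(c.hashify) }
```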

So I start writing unit tests. I test that hashify returns a Hash. I test that the hash has a key called 'johnv'. Then I start testing each of the attributes for johnv. This is where it all falls apart. The email attribute always returns nil even though I can see right in the debugger that it's set to 'root@localhost'.

After about 30 minutes of dicking around I finally realize that, being a Ruby beginner who has used some of the higher-level Ruby stuff (Rails, Ruport) for quickies without actually getting into the meat of the language, I was totally confused about the usage of ":" in hashes, much less anywhere else in Ruby. Two minutes after that, I stopped throwing symbols all over the place in my classes when building a hash ;)
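
In concrete terms, the gotcha looks like this: a hash built with symbol keys silently returns nil when you look things up with strings, which is exactly what my tests were doing.

```ruby
# A hash built with symbol keys...
h = { :email => "root@localhost" }

h[:email]   # => "root@localhost"
h["email"]  # => nil -- the string "email" is NOT the same key as :email

# Building the hash with string keys makes the string lookup work
h2 = { "email" => "root@localhost" }
h2["email"] # => "root@localhost"
```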

My unit test passed! I continue to write the remaining tests and run them. I get a failure. I look in detail and realize that in my original Contact class, I had a typo in the hashify method that defined a key as 'host_notifications_period' instead of 'host_notification_period'!
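
The kind of test that catches that sort of typo is trivial to write. This is a sketch with a stand-in class, not my actual test file - the names are assumed:

```ruby
require 'test/unit'

# Stand-in for the real Contact class, just enough to test against.
class Contact
  def initialize(name)
    @name = name
  end

  def hashify
    # A typo like 'host_notifications_period' here would fail the test below
    { @name => { "host_notification_period" => "24x7" } }
  end
end

class TestContact < Test::Unit::TestCase
  def test_hashify_returns_hash
    assert_instance_of(Hash, Contact.new("johnv").hashify)
  end

  def test_hashify_key_names
    h = Contact.new("johnv").hashify
    assert(h["johnv"].has_key?("host_notification_period"))
  end
end
```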

So yeah, I'm sold on unit tests and I'm not writing another lick of code until I finish writing them for the classes I have now.

Friday, April 4, 2008

Nagios Configuration Tips Part 1 - cfg_dir

One of the key problems I see with people using Nagios is the fact that they add EVERYTHING into a single file for each type of object. This is obviously fine when you have only a few systems to monitor but starts to become unwieldy when you have 10, 20 or even 100 servers to monitor. This article shows you what I consider to be a very flexible file system layout for your Nagios configurations. The end result is a configuration structure that allows you to easily jump to the source of a configuration problem and encourages the use of object templating. This is the first in a multi-part series of Nagios Configuration Tips.

nagios.cfg
Since we are only concerned about configuration files, I'm only going to paste the relevant lines from the nagios.cfg:


cfg_file=/etc/nagios/objects/commands.cfg
cfg_file=/etc/nagios/objects/contacts.cfg
cfg_file=/etc/nagios/objects/contactgroups.cfg
cfg_file=/etc/nagios/objects/timeperiods.cfg
cfg_dir=/etc/nagios/objects/(organization)


(organization is really just an arbitrary directory used to logically group a collection of objects. Nagios processes every .cfg file under a cfg_dir, subdirectories included, so you can nest as deeply as you like.)

The explicitly defined files are where we keep the more "global" stuff: objects that are shared across all configs for all organizations/domains, plus system-level definitions.

commands.cfg
This file is a standard nagios cfg file containing a list of command definitions. In this case, I'm only keeping the stuff that applies to the local system (check_local_*) and the notify-service-by-email/notify-host-by-email defines.
contacts.cfg/contactgroups.cfg/timeperiods.cfg
contacts.cfg has a definition for a nagiosadmin account, contactgroups.cfg has a definition for a testgroup contactgroup and timeperiods.cfg has a definition for 24x7.

And that's it for the base configuration files. Notice that there really isn't much in them. As you'll see, all of the heavy lifting will be done by the stuff in the cfg_dir.

cfg_dir
So now let's look at what we have in our cfg_dir.

For this example, we're going to assume that we have two areas that we need to monitor: systems and processes. Let's also use the fictional company name of widgetcorp. Systems are exactly what they sound like. This is where we monitor things at the host level like reachability, loadavg and disk utilization. Processes would be things that we monitor at a higher level like database locks, http connections, jvm usage or even specific business processes like user logins, outstanding order shipments or even the date of the last warehouse load.

So let's create the following directory structure under /etc/nagios/objects/:


widgetcorp
/processes
/systems


Now before we write any configs, let's think about how we want to categorize these new directories. At widgetcorp, we have two classes of systems - database and application. Let's create those directories under systems:


widgetcorp
/systems
/application
/database


Being the sane company that they are, widgetcorp was smart enough to invest in a minimal level of high availability. This environment consists of 4 servers using round-robin DNS to balance between application servers and using linux-ha to provide access to the database servers. Notice that I've not yet defined WHAT application server is running or what dbms is being used. These systems are named app01,app02,dbs01,dbs02.


systems
/application
/app01
/app02
/database
/dbs01
/dbs02


As far as the process monitoring goes, we have two types of "processes" we need to concern ourselves with - application response time and database server availability.


processes
/database
/application


Sidebar: One thing you'll note is a particular attitude I have. I consider physical systems "interchangeable". I don't want to tie the fact that I run MySQL on db01.widgetcorp.com to the status of db01.widgetcorp.com as a whole. What if we're operating using MySQL Proxy or operating a Linux-HA MySQL cluster or using HACMP on AIX for DB2? The availability of a single system is really quite independent from the higher level availability of the service that grouping of systems provides. Our application servers would never be configured to talk to JUST db01 but instead would use the name of the mysql proxy server or the VIP assigned to the HA cluster - db.widgetcorp.com. Using service dependencies, you can still tie the polling process of db.widgetcorp to a specific server or uplink.
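
As a sketch of that last point (all host and service names here are the fictional widgetcorp ones, and the service descriptions are assumptions), a service dependency tying the dbrw check back to a specific server might look like:

```
# Hypothetical: suppress alerts on the VIP-level MySQL check
# when the underlying master's own MySQL check is already failing
define servicedependency {
    host_name                       dbs01
    service_description             MySQL
    dependent_host_name             dbrw.widgetcorp.com
    dependent_service_description   MySQL
    notification_failure_criteria   w,u,c
    execution_failure_criteria      w,u,c
}
```

The point is that the system-level check and the service-level check stay separate objects, with the dependency expressing the relationship rather than conflating the two.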

Back to the layout. We now get to discuss what programs are actually installed on each server, because the facts we need to monitor come from those programs.

In the case of app01 and app02, they are both running tomcat and apache with mod_jk. All traffic coming in from the internet is balanced between each apache server on port 80, talking to a localhost-listening tomcat instance on the jk connector port. These details aren't really important for the purposes of this document except to say that our customers don't go to app01.widgetcorp or app02.widgetcorp but instead to www.widgetcorp.com.

As for the database tier, the servers use Linux-HA and MySQL replication. Each server is assigned a VIP, aliased to dbrw.widgetcorp and dbro.widgetcorp respectively. The current MASTER in the replication process is assigned the VIP for dbrw and the SLAVE is assigned the VIP for dbro. When one of the systems fails, the other assumes BOTH roles, as the application performs lookups against dbro while doing actual inserts and updates against dbrw.

All of the above means our directory structure now looks like this:


widgetcorp/
systems/
application/
app01/
app02/
database/
dbs01/
dbs02/
processes/
database/
dbrw/
dbro/
application/
www/


And that's it for the first part of this post. The next post will get into the actual naming, location and content of the configuration files. Please feel free to leave comments and let me know your thoughts. Please also be aware that I'm intentionally trying to be generic in these examples. Don't get too caught up in the fictional implementation of the company. I'm aware of the limitations of both round-robin DNS as well as the MySQL implementation. I only picked these as high-level examples.

Thanks and I look forward to the comments!

An interesting log entry

A few (okay, 2) years ago I wrote a small page on my website documenting how to get container statistics out of Websphere 5.x from the command line. To this day, awstats shows that page, along with the page on monitoring DB2 UDB with Nagios, as the biggest hit pages. I like to occasionally take note of where people are coming from to look at the information.

In looking at the logs this morning, I see the standard incoming search result from google for something like "websphere + jdbc + command-line" or "websphere + jdbc + monitoring" but what caught my eye was something in the USERAGENT string:

"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.12) Gecko/20080211 Fedora/2.0.0.12-1.oc2 (CK-IBM) (CK-IBM) Firefox/2.0.0.12"

Was someone from IBM looking at *MY* notes? Why, yes, they were:

torolab.ibm.com

Thanks for visiting guys. I'm even more honored that it was someone from Toronto.

EDIT: I just noticed the CK part of the IBM thing. I finally remembered why that looked familiar. Does IBM have an internal client customization kit for Firefox? For those not in the know, here's a link to the mozilla wiki entry on the concept/program:

http://wiki.mozilla.org/CCK:Overview

Basically it was the repackager for Netscape that allowed ISP branding, locked settings and a host of other features. I wasn't aware that there was one for Firefox but I haven't really needed one since I left CLA.

Thursday, April 3, 2008

Bob Barr

So on the way home, if I want to get ANY worthwhile traffic reports, I'm forced to sit through Sean Hannity. This is usually a recipe for increased blood pressure.

However, today he had Bob Barr on. As you may or may not know, Bob Barr is strongly considering running on the Libertarian ticket. You may or may not know that *I* am a registered Libertarian.

Sean kept wanting to harp on the fact that Barr would be "splitting the vote" for the Republican party and baiting him about how he would feel about that.

I really wanted to call in but for some reason the number wasn't given out on the air so I'm just going to vent here.

For some reason, Sean Hannity assumes that anyone who is "conservative" automatically should be voting for the Republican candidate no matter how distasteful they find that candidate, or that those votes would automatically have been Republican votes had there not been a third party running.

What kind of logic is this? Oh wait, it's RIAA logic. It's BSA logic.

I'm going to provide a little insight for you. Just because I vote for Bob Barr or the Libertarian candidate does NOT mean I would have voted for the Republican candidate. I might have voted for the Green candidate. I might have voted for any other party just as equally as I would have voted for the Libertarian candidate. I am NOT a Republican. I'm not a conservative. I'm not a liberal. I'm a citizen of the United States of America who has his own conscience and heart to follow in deciding who *I* think would be the best leader for this country. I don't think that person is John McCain. I don't think that person is Hillary Clinton OR Barack Obama. In fact, I thought that person was Ron Paul.

Sure, I'll be dismissed as a fringe minority, and some sort of political funny math will somehow figure out in the end that I and the people like me are entirely at fault for causing a "liberal" to get elected.

Nagios and Ruby

So as I mentioned in my post here, I've been working on a sort of Nagios toolbox in Ruby. I never really found anything quite like it out there and honestly, most people are doing this stuff with shell scripts or Perl.

I don't honestly blame them, but I wanted a project to force me to learn Ruby better and I had a specific need. I could have had this written already in Perl or even straight Bash but, again, I wanted to learn Ruby better.

As I said in my other post, here are the design goals starting out:

  • Import existing configs (objects only. Not concerned about the operational parameters)
  • Parse existing objects (apart from reading the configurations)
  • Create objects (hosts,services,contacts, *groups, etc...)
  • Write new configs
  • Enforce/Support Templating
I'm not writing a new front end. I'm not writing a global configuration suite for Nagios. My goal is object management. Many of the defaults in nagios.cfg work OOB. However, any time I spend in #nagios, the MAJORITY of the questions are related to actually defining what to monitor and how to monitor it. The examples included are pretty solid but they're VERY verbose, to the point of being confusing. On the flip side, someone like me, who can write a .cfg in his sleep, is still using vi to manage entries.
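
To make the goals concrete, the kind of usage I'm aiming for would read something like the sketch below. This is purely hypothetical - none of these class or method names are final, and they do not reflect the actual code linked at the bottom of this post:

```ruby
# Purely hypothetical usage sketch - class and method names are NOT final.
class NagiosContact
  def initialize(name, attrs = {})
    @name  = name
    @attrs = attrs
  end

  # Render the object as nagios cfg text ("write new configs" goal)
  def to_cfg
    lines = ["define contact {", "    contact_name    #{@name}"]
    @attrs.each { |k, v| lines << "    #{k}    #{v}" }
    lines << "}"
    lines.join("\n")
  end
end

c = NagiosContact.new("johnv", "alias" => "John E. Vincent",
                               "email" => "root@localhost")
puts c.to_cfg
```

The other direction (import/parse of existing configs) would be the mirror image: read a cfg file, produce objects like the one above.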

At my new company, they are BIG on automation. I mean EVERYTHING is automated from an SA perspective. The only exception comes with NEW environments. This is still a mishmash of manual processes and checklists with a slathering of the above automation. One of the first things I was tasked with when I walked through the door was modernizing the existing Nagios configuration. This project is part of that. I need to integrate anything I do seamlessly into the existing automation for it to be accepted as process.

So consider this a "declaration of intent to proceed" or some such politispeak. Feel free to comment on my lack of Ruby skill and where you think I can change things.

Here's the code I have so far:

http://dev.lusis.org/nagios/ruby/

Grand Opening

I figured that I should be a big boy and join the modern era. Quit "doing it on my own" and just let someone else take over the heavy lifting.

This means using Blogger. I'm so trendy.

I've linked to my old stuff under Hot Pockets.