I’ve worked on cloud computing frameworks with a couple of companies meanwhile. DevOps like processes are always an issue along with these cooperations – even more when it comes to monitoring and how to innovatively approach the matter.
As an example I am ever and again emphasizing Netflix’s approach in these conversations: I very much like Netflix’s philosophy of how to deploy, operate and continuously change environment and services. Netflix’s different component teams do not have any clue on the activities of other component teams; their policy is that every team is self-responsible for changes not to break anything in the overall system. Also, no one really knows in detail which servers, instances, services are up and running to serve requests. Servers and services are constantly automatically re-instantiated, rebooted, added, removed, etc. Such is a philosophy to make DevOps real.
Clearly, when monitoring such a landscape traditional (SLA-fulfilment oriented) methods must fail. It simply isn’t sufficient for a Cloud-aware, continuous delivery oriented monitoring system to just integrate traditional on-premise monitoring solutions like e.g. Nagios with e.g. AWS’ CloudWatch. Well, we know that this works fine, but it does not yet ease the cumbersome work of NOCs or Application Operators to quickly identify
- the impact of a certain alert, hence its priority for ongoing operations and
- the root cause for a possible error
After discussing these facts the umpteenth time and (again) being confronted with the same old arguments about the importance of ubiquitous information on every single event within a system (for the sake of proving SLA compliancy), I thought to give it a try and dig deeper by myself to find out whether these arguments are valid (and I am therefore wrong) or whether there is a possibility to substantially reduce event occurrence and let IT personal only follow up the really important stuff. Efficiently.
At this stage, it is time for a little
DISCLAIMER: I am not a monitoring or APM expert; neither am I a .NET programming expert. Both skill areas are fairly familiar to me, but in this case I intentionally approached the matter from a business perspective – as least technical as possible.
In autumn last year I had the chance to get a little insight into 2 pure-SaaS monitoring products: Ruxit and newRelic. Ruxit back then was – well – a baby: Early beta, no real functionality but a well-received glimpse of what the guys are on for. newRelic was already pretty strong and I very much liked their light and quick way of getting started.
As that project back then got stuck and I ended my evaluations in the middle of getting insight, I thought, getting back to that could be a good starting point (especially as I wasn’t able to find any other monitoring product going the SaaS path that radically, i.e. not even thinking of offering an on-premise option; and as a cloud “aficionado” I was very keen on seeing a full-stack SaaS approach). So the product scope was set pretty straight.
The investigative scope, this time, should answer questions a bit more in a structured way:
- How easy is it to kick off monitoring within one system?
- How easy is it to combine multiple systems (on-premise and cloud) within one easy-to-digest overview?
- What’s alerted and why?
- What steps are needed in order to add APM to a system already monitored?
- Correlation of events and its appearance?
- The “need to know” principle: Impact versus alert appearance?
The setup I used was fairly simple (and reduced – as I didn’t want to bother our customer’s workloads in any of their datacenters): I had an old t1.micro instance still lurking around on my AWS account; this is 1 vCPU with 613MB RAM – far too small to really perform with the stuff I wanted it to do. I intentionally decided to use that one for my tests. Later, the following was added to the overall setup:
- An RDS SQL Server database (which I used for the application I wanted to add to the environment at a later stage)
- IIS 6 (as available within the Server image that my EC2 instance is using)
- .NET framework 4
- Some .NET sample application (some “Contoso” app; deployed directly from within Visual Studio – no changes to the defaults)
2 things popped into my eyes only hours (if not minutes) after commencing my activities in newRelic and Ruxit, but let’s first start with the basics.
Setting up accounts is easy and straight forward in both systems. They are both truly following the cloud affine “on-demand” characteristic. newRelic creates a free “Pro” trial account which is converted into a lifetime free account when not upgraded to “paid” after 14 days. Ruxit sets up a free account for their product but takes a totally different approach – closer resembling to consumption-based pricing: you get 1000 hours of APM and 50k user visits for free.
Both systems follow pretty much the same path after an account has been created:
- In the best case, access your account from within the system you want to monitor (or deploy the downloaded installer package – see below – to the target system manually)
- Download the appropriate monitoring agent and run the installer. Done.
Both agents started to collect data immediately and the browser-based dashboards produced the first overview of my system within some minutes.
As a second step, I also installed the agents to my local client machine as I wanted to know how the dashboards display multiple systems – and here’s a bummer with Ruxit: My antivirus scanner alerted me with an Win32.Evo-Gen suspicion:
Avast virus alert upon Ruxit agent install
It wasn’t really a problem for the agent to install and operate properly and produce data; it was just a little confusing. In essence, the reason for this is fairly obvious: The agent is using a technique which is comparable to typical virus intrusion patterns, i.e. sticking its fingers deep into the system.
The second observation was newRelics approach to implement web browser remote checks, called “Synthetics”. It was indeed astonishingly easy to add a URL to the system and let newRelic do their thing – seemingly from within the AWS datacenters around the world. And especially with this, newRelic has a very compelling way of displaying the respective information on their Synthetics dashboard. Easy to digest and pretty comprehensive.
At the time when I started off with my evaluation, Ruxit didn’t offer that. Meanwhile they added their Beta for “Web Checks” to my account. Equally easy to setup but lacking some more rich UI features wrt display of information. I am fairly sure that this’ll be added soon. Hopefully. My take is, that combining system monitoring or APM with insights displaying real user usage patterns is an essential part to efficiently correlate events.
I always spend a second thought on security questions, hence contemplated Ruxit’s way of making sure that an agent really connects to the right tenant when being installed. With newRelic you’re confronted with an extra step upon installation: They ask you to copy+paste a security key from your account page during their install procedure.
newRelic security key example
Ruxit doesn’t do that. However, they’re not really less secure; it’s just that they pre-embed this key into the installer package that is downloaded,c so they’re just a little more convenient. Following shows the msiexec command executed upon installation as well as its parameters taken form the installer log (you can easily find that information after the .exe package unpacks into the system’s temp folder):
@msiexec /i "%i_msi_dir%\%i_msi%" /L*v %install_log_file% SERVER="%i_server%" PROCESSHOOKING="%i_hooking%" TENANT="%i_tenant%" TENANT_TOKEN="%i_token%" %1 %2 %3 %4 %5 %6 %7 %8 %9 >con:
MSI (c) (5C:74) [13:35:21:458]: Command Line: SERVER=https://qvp18043.live.ruxit.com:443 PROCESSHOOKING=1 TENANT=qvp18043 TENANT_TOKEN=ABCdefGHI4JKLM5n CURRENTDIRECTORY=C:\Users\thome\Downloads CLIENTUILEVEL=0 CLIENTPROCESSID=43100
After having applied the package (both packages) onto my Windows Server on EC2 things popped up quickly within the dashboards (note, that both dashboard screenshots are from a later evaluation stage; however, the basic layout was the very same at the beginning – I didn’t change anything visually down the road).
newRelic server monitoring dashboard showing the limits of my too-small instance 🙂
The Ruxit dashboard on the same server; with a clear hint on a memory problem 🙂
What instantly stroke me here was the simplicity of Ruxit’s server monitoring information. It seemed sort-of “thin” on information (if you want a real whole lot of info right from the start, you probably prefer newRelic’s dashboard). Things, though, changed when my server went into memory saturation (which it constantly does right away when accessed via RDP). At that stage, newRelic started firing eMails alerting me of the problem. Also, the dashboard went red. Ruxit in turn did nothing really. Well, of course, it displayed the problem once I was logged into the dashboard again and had a look at my server’s monitoring data; but no alert triggered, no eMail, no red flag. Nothing.
If you’re into SLA fulfilment, then that is precisely the moment to become concerned. On second thought, however, I figured that actually no one was really bothered by the problem. There was no real user interaction going on in that server instance. I hadn’t even added an app really. Hence: why bother?
So, next step was to figure out, why newRelic went so crazy with that. It turned out that with newRelic every newly added server gets assigned to a default server policy.
newRelic’s monitoring policy configuration
I could turn off that policy easily (also editing apparently seems straight forward; I didn’t try). However, to think that with every server I’m adding I’d have to figure out first, which alerts are important as they might be impacting someone or something seemed less on a “need to know” basis than I intended to have.
After having switched off the policy, newRelic went silent.
BTW, alerting via eMail is not setup by default in Ruxit; within the tenant’s settings area, this can be added as a so called “Integration” point.
As said above, I was keen to know how both systems integrate multiple monitoring sources into their overviews. My idea was to add my AWS tenant to be monitored (this resulted from the previously mentioned customer conversations I had had earlier; that customer’s utmost concern was to add AWS to their monitoring overview – which in their case was Nagios, as said).
A nice thing with Ruxit is that they fill their dashboard with those little demo tiles, which easily lead you into their capabilities without having setup anything yet (the example below shows the database demo tile).
This is one of the demo tiles in Ruxit’s dashboard – leading to DB monitoring in this case
I found an AWS demo tile (similar to the example above), clicked and ended up with a light explanation of how to add an AWS environment to my monitoring ecosystem (https://help.ruxit.com/pages/viewpage.action?pageId=9994248). They offer key based or role based access to your AWS tenant. Basically what they need you to do is these 3 steps:
- Create either a role or a user (for use of access key based connection)
- Apply the respective AWS policy to that role/user
- Create a new cloud monitoring instance within Ruxit and connect it to that newly created AWS resource from step 1
Right after having executed the steps, the aforementioned demo tiled changed into displaying real data and my AWS resources showed up (note, that the example below already contains RDS, which I added at a later stage; the cool thing here was, that that was added fully unattended as soon as I had created it in AWS).
Ruxit AWS monitoring overview
Ruxit essentially monitors everything within AWS which you can put a CloudWatch metric on – which is a fair lot, indeed.
So, next step clearly was to seek the same capability within newRelic. As far as I could work out, newRelic’s approach here is to offer plugins – and newRelic’s plugin ecosystem is vast. That may mean, that there’s a whole lot of possibilities for integrating monitoring into the respective IT landscape (whatever it may be); however, one may consider the process to add plugin after plugin (until the whole landscape is covered) a bit cumbersome. Here’s a list of AWS plugins with newRelic:
newRelic plugins for AWS
newRelic plugins for AWS
Adding APM to my monitoring ecosystem was probably the most interesting experience in this whole test: As a preps for the intended result (i.e.: analyse data about a web application’s performance at real user interaction) I added an IIS to my server and an RDS database to my AWS account (as mentioned before).
The more interesting fact, though, was that after having finalized the IIS installation, Ruxit instantly showed the IIS services in their “Smartscape” view (more on that a little later). I didn’t have to change anything in my Ruxit environment.
newRelic’s approach is a little different here. The below screenshot shows their APM start page with .NET selected.
newRelic APM start page with .NET selected
After having confirmed each selection which popped up step by step, I was presented with a download link for another agent package which I had to apply to my server.
The interesting thing, though, was, that still nothing showed up. No services or additional information on any accessible apps. That is logical in a way, as I did not have anything published on that server yet which resembled an application, really. The only thing that was accessible from the outside was the IIS default web (just showing that IIS logo).
So, essentially the difference here is that with newRelic you get system monitoring with a system monitoring agent, and by means of an application monitoring agent you can add monitoring of precisely the type of application the agent is intended for.
I didn’t dig further yet (that may be subject for another article), but it seems that with Ruxit I can have monitoring for anything going on on a server by means of just one install package (maybe one more explanation for the aforementioned virus scan alert).
However, after having published my .NET application, everything was fine again in both systems – and the dashboards went red instantly as the server went into CPU saturation due to its weakness (as intended ;)).
Smartscape – Overview
So, final question to answer was: What do the dashboards show and how do they ease (root cause) analysis?
As soon as the app was up and running and web requests started to role in, newRelic displayed everything to know about the application’s performance. Particularly nice is the out-of-the-box combination of APM data with browser request data within the first and the second menu item (either switch between the 2 by clicking the menu or use the links within the diagrams displayed).
newRelic APM dashboard
The difficulty with newRelic was to discover the essence of the web application’s problem. Transactions and front-end code performance was displayed in every detail, but I knew (from my configuration) that the problem of slow page loads – as displayed – lied in the general weakness of my web server.
And that is basically where Ruxit’s smartscape tile in their dashboard made the essential difference. The below screenshot shows a problem within my web application as initially displayed in Ruxit’s smartscape view:
Ruxit’s smartscape view showing a problem in my application
By this view, it was obvious that the problem was either within the application itself or within the server as such. A click to the server not only reveals the path to the depending web application but also other possibly impacted services (obviously without end user impact as otherwise there would be an alert on them, too).
Ruxit smartscape with dependencies between servers, services, apps
And digging into the server’s details revealed the problem (CPU saturation, unsurprisingly).
Ruxit revealing CPU saturation as a root cause
Still, the amount of dashboard alerts where pretty few. While I had 6 eMails from newRelic telling me about the problem on that server, I had only 2 within Ruxit: 1 telling me about the web app’s weak response and another about CPU saturation.
Next step, hence, would be to scale-up the server (in my environment) or scale-out or implement an enhanced application architecture (in a realistic production scenario). But that’s another story …
Event correlation and alerting on a “need to know” basis – at least for me – remains the right way to go.
This little test was done with just one server, one database, one web application (and a few other services). While newRelics comprehensive approach to showing information is really compelling and perfectly serves the objective of complete SLA compliancy reporting, Ruxit’s “need to know” principle much more meets the needs of what I would expect form innovative cloud monitoring.
Considering Netflix’s philosophy from the beginning of this article, innovative cloud monitoring basically translates into: Every extra step is a burden. Every extra information on events without impact means extra OPS effort. And every extra-click to correlate different events to a probable common root-cause critically lengthens MTTR.
A “need to know” monitoring approach while at the same time offering full stack visibility of correlated events is – for me – one step closer to comprehensive Cloud-ready monitoring and DevOps.
And Ruxit really seems to be “spot on” in that respect!