Cobe Agent Publicly Available

Here at Cobe we are building a monitoring platform with a searchable live topology of your infrastructure. We set out to build a platform and, from the start, wanted to make it easy to feed data into Cobe. Primarily this is achieved using our entityd agent, which sends any data it discovers to Cobe via our open API. Since this is a piece of software that will be running on your servers, we thought it only appropriate to make it available as open source. So we are happy to announce that entityd is now available under the GNU Lesser General Public License Version 3.

The entityd agent is written in Python and can also be installed directly from PyPI using pip, though we recommend following the installation instructions on your personalised download page once you have signed up and created a topology instance. It is the same agent that is installed in our entityd container on the Docker Hub for monitoring Kubernetes clusters. Internally, the agent is designed around a plugin architecture, which makes it easy to extend its monitoring capabilities.

Cobe attends EuroPython 2016

From the 17th to the 24th July, Cobe will be at the EuroPython 2016 conference in Bilbao, Spain. Come and visit our stall to find out more about our SaaS monitoring product.

EuroPython 2016 branding image

EuroPython 2016

EuroPython 2016 is the largest European Python conference, with more than 1,400 attendees. It is also the second largest Python conference in the world and a meeting point for Python enthusiasts, programmers, students and companies.

Our CTO, Dave Charles, will be delivering a talk titled “Managing Kubernetes from Python using Kube”.

Our Architect, Floris Bruynooghe, will also be giving a talk titled “Build your microservices with ZeroMQ”.

So we are not only attending and sponsoring the conference, we are also giving back to the Python community. Come and speak to us at our stall to find out more about our Cobe SaaS monitoring product.

Win a Raspberry Pi 3

Image of Raspberry PI 3 box

To celebrate attending the EuroPython 2016 conference we will be giving away two Raspberry Pi 3s.

How to enter

Prize 1: Raspberry Pi 3

Question: We at Cobe are very excited by the features and capabilities offered by Kubernetes. How do you feel that Kubernetes could help you with your application? Or, if you are using it now, how has it helped you within your current stack?

To enter you need to tweet your answer before 4.30pm on Tuesday 19th July.

The winner will be selected at random from all of the Twitter entries and announced at our stall and online after the competition closes at 4.30pm on Tuesday 19th July.

Prize 2: Raspberry Pi 3

Question: We have lots of ideas for future Cobe features, but would love to know what you think. Can you tell us your idea for a new Cobe.io feature?

To enter you need to tweet your answer before 4.30pm on Thursday 21st July.

The winner will be selected by our CTO, Dave Charles, from all of the Twitter entries and announced at our stall and online after the competition closes at 4.30pm on Thursday 21st July.

Terms and Conditions

  • To enter the competition you need to be a EuroPython 2016 attendee.
  • You must be able to pick up your prize from the Cobe stall at the EuroPython Conference.
  • To enter the competition you have to send a tweet to @cobeio using the appropriate hashtag.
  • By entering the competition you agree to be contacted occasionally by Cobe.io.
  • Entrants into the competition shall be deemed to have accepted these Terms and Conditions.
  • The competition to win Prize 1: Raspberry Pi 3 closes at 4.30pm on Tuesday 19th July. Entries received after that time will not be considered.
  • The competition to win Prize 2: Raspberry Pi 3 closes at 4.30pm on Thursday 21st July. Entries received after that time will not be considered.
  • No purchase necessary.
  • Cobe accepts no responsibility for any damage, loss, liability, injury or disappointment incurred or suffered by you as a result of entering the Competition or accepting the prize. Cobe further disclaims liability for any injury or damage to your or any other person’s computer relating to or resulting from participation in or downloading any materials in connection with the competition.
  • The prize is non-exchangeable, non-transferable, and is not redeemable for cash or any other prize.
  • Cobe reserves the right to substitute the prize with another prize of similar value in the event the original prize offered is not available.
  • The winners will be notified of their prize at the Cobe stall and online using Twitter.
  • The winners will be selected by Cobe from all the entries in accordance with the Terms and Conditions at the specified dates and times.

Memory Adventures with InfluxDB

At Cobe, we’ve been using the time-series database InfluxDB. Since it is a relatively new database, we thought we’d share some of our experiences of using InfluxDB and controlling its memory usage.

“Why does Cobe need a time series database?”

Cobe shows real-time metrics for hosts, processes, applications and other entities that form a customer’s infrastructure topology. In developing Cobe, a time series database was our preferred route for storing metrics data, to make this development task as easy as possible. Time series databases do rather useful things out of the box, such as deleting entries older than a pre-defined age, rolling up values over time periods by mean or another function for reduced latency, and efficiently compressing data to minimise storage requirements.

“Why InfluxDB — Mongo not good enough for you?!”

We decided to go with InfluxDB as it has the capabilities I mentioned above, plus it was reputed to be fast and very easy to set up. Okay, it’s still in development by InfluxData (we’ll come back to that…) and it’s not yet at version 1.0 at the time of writing. However, it has strong backing, shows great promise, and we felt that we could benefit and enjoy the ride from the evolution of one of the latest time-series databases.
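
As a small illustration of that out-of-the-box age-based expiry, here is a hedged sketch using the third-party influxdb Python client; the database and policy names are illustrative rather than our production setup.

from influxdb import InfluxDBClient  # third-party 'influxdb' package

client = InfluxDBClient('localhost', 8086, database='cobe_metrics')
client.create_database('cobe_metrics')

# Keep only the most recent 14 days of data; older shards are dropped automatically.
client.create_retention_policy('keep_14d', '14d', 1,
                               database='cobe_metrics', default=True)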

“So, how’s it been, hanging out with InfluxDB?”

Well, in most respects, rather good. Easy to set up and use, it “does what it says on the tin”, as the now famous line from a well-known UK TV advert puts it. It all seemed pretty cushy, with topology metrics being provided reliably and with low latency. InfluxDB appeared to be working trouble free — all perhaps too good to be true. Which it was… until one dark and stormy night, when the RAM used by InfluxDB for our largest test topology, which we had already noted was creeping up, ate the node’s remaining memory and crashed the node (we use Kubernetes/Google Container Engine). Okay, I forget the actual weather, but that’s obviously how the future movie will have to depict it, the weather respecting the severity of the occasion. Only a few seconds of data were lost, fortunately, but this was not quite what we had in mind. It was back to the drawing board…

“Oh dear, so why was memory usage growing and what was the solution?”

The first step was to put an 800MB RAM limit on the InfluxDB container, to prevent InfluxDB from using all of the node’s memory, and to reduce InfluxDB’s cache settings in its config file so that it would write its cache to disk more frequently (there is no configuration option that caps overall memory use). However, InfluxDB would not live within this container limit: the RAM (RSS) limit was reached at a similar growth rate and the container was restarted. Interestingly, a container restart would clear the memory needed by InfluxDB, with RAM usage climbing again from near zero and giving a few days of operation before the restart cycle repeated. At this point we came across InfluxData’s Hardware Sizing Guidelines (which we’d either not previously come across, or which had been written since we started with InfluxDB), indicating that we should have 2-4GB of RAM available for our “low load” scenario. Okay, so maybe we needed more RAM than we expected, but we wanted to use significantly less than 4GB, and the RAM usage cycle seemed somewhat odd, so we persevered with testing.

Next, we wrote a Python script to test InfluxDB RAM usage on a laptop, broadly reproducing our production schema and typical data write rate. Here’s a quick summary of the test:

The schema design broadly reproduced the production schema, albeit simplified to save time whilst, hopefully, still capturing the key InfluxDB operating dynamics we wished to understand:

{
    'measurement': 'entity/Process/attr/raw/metric0',
    'tags': {
        'ueid': 'MHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHg=',
        'traits': '["metric:counter"]',
    },
    'time': 1.4653998982705984e+18,
    'fields': {
        'value:float': 85.14005871032059,
        'vtype': 'float',
    }
}
  • Four measurements were used — i.e. .../metric{0/1/2/3} — given that UEIDs (described below) typically each have around four metrics;
  • We ran with 100 fixed and 100 varying UEID values (‘unique entity identifiers’ that represent steady-state and transient entities in Cobe such as hosts, processes and applications);
  • The 100 varying UEID values were generated as a rolling first-in-first-out list of values that was continually updated with completely new UEID values at a rate that would create an increase in InfluxDB series of 9 per minute, to mimic the growth rate we saw in production;
  • A total of 615 points were written to InfluxDB per minute, cycling through our fixed+varying combined list of UEIDs, writing a point for each of the four measurements for each ueid;
  • value:float values corresponding to each of the four measurements were randomly set, 0 <= value < 100, for each minutely write cycle, so applied to all UEIDs during the minute;
  • In case you’re wondering, our schema uses field vtype to enable us to point to the field name that we use to store values, enabling us to vary the value types we store for particular measurements. This is because InfluxDB only allows a single value type for a particular measurement’s field key. So, alternatively, 'vtype':'int' would indicate that we’re storing an integer value under key value:int;
  • Our production 14-day retention policy was simulated by applying a 70 minute retention policy with a shard duration set to 5 minutes to scale equivalently to production’s 1-day shard duration.
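
For a flavour of what the script did, here is a stripped-down sketch of the write loop using the third-party influxdb Python client. The UEID generation, point counts and rotation rate are simplified stand-ins rather than the exact test code.

import base64
import os
import random
import time

from influxdb import InfluxDBClient  # third-party 'influxdb' package

def make_ueid():
    # Stand-in for a real UEID: a random base64-encoded identifier.
    return base64.b64encode(os.urandom(32)).decode('ascii')

client = InfluxDBClient('localhost', 8086, database='memtest')
client.create_database('memtest')

fixed = [make_ueid() for _ in range(100)]
varying = [make_ueid() for _ in range(100)]

while True:
    points = []
    for ueid in fixed + varying:
        for n in range(4):
            points.append({
                'measurement': 'entity/Process/attr/raw/metric%d' % n,
                'tags': {
                    'ueid': ueid,
                    'traits': '["metric:counter"]',
                },
                'fields': {
                    'value:float': random.random() * 100,   # 0 <= value < 100
                    'vtype': 'float',
                },
            })
    client.write_points(points)
    # Retire a couple of the varying UEIDs and mint new ones so the series
    # count keeps growing, roughly as described above.
    varying = varying[2:] + [make_ueid() for _ in range(2)]
    time.sleep(60)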

In practice, we found that RAM usage was higher than in production, perhaps because we did not use the same distribution of measurements across UEIDs, which would have been time consuming to simulate exactly. However, the simulation gave us what we wanted in terms of testing InfluxDB’s memory usage dynamics.

Fig 1 — Memory usage with UEID as a tag (InfluxDB 0.12.1).

Initial results proved somewhat disturbing. Fig 1, above, shows RAM usage over 42 simulated days of data entry into InfluxDB. Memory usage was over 800MB after just 4 days and increasing rapidly — and indefinitely — at an average rate of nearly 30MB/day, similar to production, despite a retention policy that should mean only a 14-day horizon of data is stored. Disk storage did indeed top out as expected, but RAM did not.

The next step was to review our schema and acknowledge that we were likely suffering from breaking InfluxDB guidelines by defining each data entry’s UEID as a tag. Tags in InfluxDB are key values that are indexed, reducing search latency. InfluxData’s low-load hardware recommendations are that the number of unique series should be fewer than 100k. Series are collections of data in InfluxDB’s data structure that share a measurement name, tag set and retention policy. In our particular case, the variation in UEID tags meant that InfluxDB reported over 500k series after just 38 days (at roughly 9 new series a minute, that’s about 490k over 38 days) – and it would potentially be far more for future, more extensive topologies — far too many! What was very odd, though, was that memory usage and the number of series reported by InfluxDB were both growing indefinitely despite that 14-day data retention policy, but hey, “let’s deal with a likely key issue and see what’s left…”.

So, we retreated to storing UEIDs as InfluxDB data fields, which aren’t indexed. This might increase search latencies, but it didn’t look like we had much choice for our use-case.

Our schema therefore changed to:

{
    'measurement': 'entity/Process/attr/raw/metric0',
    'tags': {
        'traits': '["metric:counter"]',
    },
    'time': 1.4653998982705984e+18,
    'fields': {
        'ueid': 'MHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHg=',
        'value:float': 85.14005871032059,
        'vtype': 'float',
    }
}

Fig 2, below, shows the RAM usage over time that tests now demonstrated with this revised schema.

Fig 2 — Memory usage with UEID as a field (InfluxDB 0.12.1).

Now, memory usage was much lower, at 450MB or so. For our search purposes, latency was little affected with UEID as a field, which was pleasing, although we still saw a creeping increase in RAM averaging around +1MB/day.

The above testing had been using InfluxDB 0.12.1, which we were using in production. Version 0.13 had been released just days before, so we repeated the above tests under 0.13 with the original schema, UEID stored as a tag again. This resulted in RAM usage shown in Fig 3, below.

Fig 3 — Memory usage with UEID as a tag (InfluxDB 0.13).

Ahaa! The situation was now much healthier than before, with memory growth just about topping out at 1.7GB, though still showing a long-term average increase of +1.5MB/day (taking data beyond the 50 days shown). Indeed, this was an InfluxDB performance improvement reflecting the retention-policy series clean-up issue officially declared by InfluxData to be fixed in v0.13. The number of series reported by InfluxDB topped out at around 335k, with step changes in memory as series data older than the retention period was cleared. Okay, this didn’t happen after 15 days as might have been expected, dropping a day’s worth of shard data at a time, but at least it was happening eventually, after 28 days or so. Better late than never. Perhaps this will improve with future InfluxDB releases.

Right, so how were things when UEID was again changed from being an indexed tag to a field?

The resulting chart was Fig 2’s doppelganger, so I’ll save you from staring unnecessarily at another that’s just the same. There was no tangible difference from InfluxDB 0.12.1: the series clean-up fix in 0.13 had no influence here, since there was a similarly low number of series for it to act on (only 190 series after 42 days). An apparent memory creep was still there, varying a little between tests, but it appeared to be a general trend of around +1MB/day.

Net result: we updated production to InfluxDB 0.13 and changed the schema, UEID becoming a field, to reduce RAM requirements.

“Okay, so how is production performance now?”

We are now in a situation where memory requirements are significantly lower. Any increase in latency is not noticeable in our front-end’s queries to InfluxDB. Lower memory requirements will make a useful impact on keeping per-customer costs under control, so we can offer customers more for less. We are keeping an eye on things to see if there’s any long-term issue with upwardly creeping RAM requirements in production, but so far it’s looking good.

However, going forward, as Cobe develops we will have to see whether InfluxDB continues to satisfy our use-case, given InfluxDB’s significantly increasing memory demands under greater loads and the fact that we’re not indexing by UEID; for possible future queries by UEID over longer time periods, we may see undesirable latencies (something else we are testing…).

We may well also want to use InfluxDB’s Continuous Queries to roll data up by time period, to further speed up queries over long time ranges. We would want to do this by UEID, so the number of series — and therefore memory requirements — would then grow dramatically, not to mention the increased processing power InfluxDB would likely require.
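
To make that trade-off concrete, here is a hedged sketch of the kind of continuous query we have in mind, issued through the same Python client as before. It assumes UEID has been reinstated as a tag (so that it can appear in GROUP BY), which is precisely what would multiply the series count; the database name and one-hour interval are illustrative.

from influxdb import InfluxDBClient  # third-party 'influxdb' package

client = InfluxDBClient('localhost', 8086, database='cobe_metrics')

# Roll raw points up into hourly means, one rolled-up series per UEID.
client.query(
    'CREATE CONTINUOUS QUERY "cq_metric0_1h" ON "cobe_metrics" BEGIN '
    'SELECT mean("value:float") AS "value:float" '
    'INTO "entity/Process/attr/raw/metric0_1h" '
    'FROM "entity/Process/attr/raw/metric0" '
    'GROUP BY time(1h), "ueid" END'
)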

“Cor blimey! So after all that, would you recommend InfluxDB?”

Yes, we would, although it depends on your use-case. Where you only have a constrained number of tag values that you want to index by, then in that respect you’re fine. In our case, however, we have a very large number of UEID values we’d ideally like to index and therefore make tags – and that number would keep expanding in future situations where we might not want a retention policy limiting the data stored. So InfluxDB’s hardware requirements in our scenario could become over-demanding. Indeed, if we’re forced to back off from using tags and Continuous Queries, then it could be argued that we don’t get a significant benefit from using InfluxDB over a more conventional database and might be best reviewing our database choice.

“What have been the lessons learnt?”

Well, we’re reminded of the fact that you can expect the unexpected with pre-1.0 software. Life with such a new product is interesting and fun, but yes, you can expect to lose some time sorting issues out. Secondly, when software is new and its documentation and community material less extensive, the risk of a poor fit to your use case may be higher. Still, InfluxDB gives us what we need for now, and it’s still a very young database with lots of development to come. Indeed, as I close, I see today that InfluxDB 1.0 Beta has just been released. Happy days…!

Watching Kubernetes Resources from Python

Kubernetes has a reasonably nice REST-ish HTTP API which is used pervasively by the system to do pretty much all of its work. It is very open and reasonably well documented, which makes it excellent for integrating with, so you can manage your cluster from code. However, the API has a concept which does not map directly to HTTP: WATCH. This is used to notify an API user of any changes to resources as they happen. Unfortunately, using this watch functionality turns out to be non-trivial.

The Anatomy of a WATCH Request

Using the Kubernetes API from within Python is very easy to do using the Requests library: the API is well behaved and always consumes and returns JSON messages. However, when it comes to issuing a watch request, things become more complicated. There are supposedly two ways to issue a watch request: a normal HTTP request which returns a streaming result using chunked encoding, or a websocket. Unfortunately, when testing against a Kubernetes 1.1 master it did not seem to use the websocket protocol correctly, so the streaming result is the way to go.

When using chunked-encoding streaming, the Kubernetes master will start sending a chunk by sending the chunk size. But it does not then send an entire chunk; rather, it sends only one line of text terminated by a newline. This line of text is a JSON-encoded object with the event and the changed resource inside it. So the protocol is very much line-based and the chunked encoding is just a way to stream the results as they become available. On the face of it this doesn’t seem so difficult to do with Requests:

import json
import requests

resp = requests.get('http://localhost:8001/api/v1/pods',
                    params={'watch': 'true'}, stream=True)
for line in resp.iter_lines():
    event = json.loads(line)

However, the iter_lines method does not do what you expect: it keeps an internal buffer, which means you will never see the most recent event because you are still waiting for that buffer to fill.

The issue raised about this suggests a work-around: implement your own iter_lines() function, using the raw socket from the response to read the data. Unfortunately that simple solution makes a few mistakes. Firstly, it does not process the chunked encoding correctly: the octets describing the chunk size will appear in the output. More importantly, there is another layer of buffering going on, one that you cannot work around. The additional buffering exists because Requests uses the raw socket’s makefile method to read data from it. This makes sense for Requests; the Python standard library and the OS are good at making things fast by buffering. However, it means that by the time Requests has parsed the response headers, the buffering has already consumed an unknown number of bytes from the response body, with no way of retrieving those bytes. This makes it impossible to consume the watch API using Requests.

Manually Doing HTTP

So how can you consume the watch API from Python? By making the request and processing the response yourself. This is easier than it sounds; socket programming isn’t so scary. First you need to connect a socket to the server and send the HTTP request. HTTP is very simple: you just send some headers over the socket:

import socket

request = ('GET /api/v1/pods?watch=true HTTP/1.1\r\n'
           'Host: localhost\r\n'
           '\r\n')
sock = socket.create_connection(('localhost', 8001))
sock.sendall(request.encode('ASCII'))

Note that the Host header is required for the Kubernetes master to accept the request.

Parsing the HTTP response is a little more involved. However, the http-parser library neatly implements the HTTP parsing side of things without getting involved with sockets or anything network-related. Thanks to this we can easily read and parse the response:

import io
import http_parser.parser

parser = http_parser.parser.HttpParser()
while not parser.is_headers_complete():
    chunk = sock.recv(io.DEFAULT_BUFFER_SIZE)
    if not chunk:
        raise Exception('No response!')
    nreceived = len(chunk)
    nparsed = parser.execute(chunk, nreceived)
    if nparsed != nreceived:
        raise Exception('Ok, http_parser has a real ugly error-handling API')

Now the response headers have been parsed. Maybe some body data was already received; that is fine, however, as it will just stay buffered in the parser until we retrieve it. But first let’s keep reading data until there is no more left (don’t do this in production, it’s bad for your memory):

import select

readers = [sock]
writers = out_of_band = []
timeout = 0
while True:
    rlist, _, _ = select.select(readers, writers, out_of_band, timeout)
    if not rlist:
        # No more data queued by the kernel
        break
    chunk = sock.recv(io.DEFAULT_BUFFER_SIZE)
    if not chunk:
        # remote closed the connection
        sock.close()
        break
    nreceived = len(chunk)
    nparsed = parser.execute(chunk, nreceived)
    if nparsed != nreceived:
        raise Exception('Something bad happened to the HTTP parser')

This shows how you can use select to read data only when some is available, instead of blocking until more data arrives. Of course, as soon as this has consumed all the data the Kubernetes master may have sent the next update to the PodList, but let’s read the events received so far:

data = parser.recv_body()
lines = data.split(b'\n')
pending = lines.pop(-1)
events = [json.loads(l.decode('utf-8')) for l in lines]

That’s it! If the received data ends in a newline then the split() call will return an empty bytestring (b'') as the last item. If the data did not end in a newline, an incomplete event was received and we need to save it for later, when we get the rest of the data.
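
To carry that partial trailing line across reads, something along the following lines would do; this is a minimal sketch of my own rather than code from kube itself (json is imported in the first snippet above).

pending = b''

def feed(body_data):
    """Split newly received body data into complete JSON events.

    Any trailing partial line is kept in ``pending`` until the rest of it
    arrives with the next read.
    """
    global pending
    pending += body_data
    lines = pending.split(b'\n')
    pending = lines.pop(-1)   # b'' if the data ended on a newline
    return [json.loads(line.decode('utf-8')) for line in lines if line]

events = feed(parser.recv_body())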

Conclusion

So to correctly consume the response from a Kubernetes watch API call, you need to create your own socket connection and parse the HTTP response yourself. Luckily this isn’t all that difficult, as I hope I’ve managed to show here. But you don’t need to write all of this yourself! We’ve already implemented all this and more in our kube project, which wraps the above into a nice iterator API. Kube itself still needs many more features, but the watch implementation is already very useful.

A Brief History of Monitoring (Part 2)

Part two of a series of articles that describe the history and evolution of monitoring from the perspective of Cobe CTO, Dave Charles.

Part 2: Then SNMP Happened

In my first blog post on the History of Monitoring (Part 1: The time before monitoring) I described my experiences working as a developer on the UK Air Defence System, back in the day. There wasn’t any monitoring, at least in the modern sense. Monolithic applications of the day were considered to be either running or not running. More complicated systems might have complementary programs that operators would use to interactively start, stop and status them. These tools would certainly be home-grown, by developers or operators, and usually took the form of command-line scripts. Compute was at a premium, and running applications to monitor applications would have been deemed frivolous.

As architecture progressed and became more commoditised, where there was monitoring it was for vendor-supplied devices. Early monitoring was therefore siloed: single-channel monitoring for individual pieces of equipment. There would be several, if not many, panes of glass that had to be observed. Later, vendors would provide solutions that worked across their entire offering, using their own standardised interfaces.

However, it soon became obvious that a vendor-agnostic approach would be preferable. Customers wanted choice, and would often have devices from multiple vendors. A standards-based approach was adopted, first with the little-remembered Simple Gateway Monitoring Protocol (SGMP) and, by 1988, the Simple Network Management Protocol (SNMP).

SNMP was defined in order to normalise the way in which we could gather the status of stuff, a standard protocol for collecting and organising information about managed devices on IP networks. However the standard also caters for modifying configuration information, enabling one to remotely change a device’s behaviour.

The thinking was that if everything conformed to the same simple standard then monitoring (and management) of devices would be trivial. To be fair, SNMP made a big difference, but over time it suffered from being designed by committee. Additionally, to maintain compatibility while adding capability, newer versions of the standard were introduced, and nowadays there are three versions in use: v1, v2c and v3.

SNMP v1 was indeed simple, operating over several protocols including TCP/IP. However, because SNMP was devised for fixing sick networks rather than doing clever things with healthy ones, it was more commonly used over the connectionless User Datagram Protocol (UDP), both for performance reasons and so as not to impact an already beleaguered infrastructure. The first specifications for SNMP (RFCs 1065, 1066 and 1067) appeared in August 1988 but were obsoleted in 1990, and parts again in 1991. SNMP v1 was criticised for its poor security; however, the standard was approved expeditiously in the belief that it was an interim protocol, desperately needed while moving towards large-scale deployment of the Internet and its commercialisation.

The first version of SNMP v2 wasn’t widely accepted due to a controversial security model so it was soon followed by v2c. The ‘c’ in v2c refers to the simpler, community-based scheme employed in that version of the standard.

The most recent version of SNMP, v3, made some significant changes to the protocol including the addition of cryptographic security. All versions are still used widely today, and while many organisations aspire to adopt SNMP v3, it can be hard to administer and in the case of embedded SNMP support in network devices there is still a lot of legacy kit around with v2c (and sometimes, just v1) support.

A diagram depicting how an SNMP manager interacts with managed devices

A classic SNMP set up.

The diagram depicts how SNMP is typically used. The protocol supports polling devices for information (e.g. statuses and performance metrics) and setting device configuration, with calls like get and set. SNMP managers invoke these calls on SNMP managed devices.

An SNMP managed device is one that has an SNMP agent installed. This might be embedded in the firmware of a network device, or installed as a lightweight application on a server. A managed device exposes data (accessible via the SNMP get and set calls) as objects in a hierarchical structure.

The hierarchy and other metadata (like object types and descriptions) are described by a Management Information Base (MIB). A MIB makes SNMP extensible: the SNMP standard doesn’t have to define every piece of managed information for every type of managed device, that’s the MIB’s job. And you can define as many of those as you like. Indeed, many device manufacturers provide MIBs (and the necessary MIB implementations) for their own devices. For “standard kit”, i.e. things that sit on the internet, many SNMP agents ship with a MIB-II implementation. MIB-II provides objects that describe a network-attached thing (e.g. system name and description, interface descriptions, system up-time and so on).
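
As a concrete illustration (my own, not something from the tooling discussed here), this is a minimal sketch of an SNMP v2c get of the MIB-II sysDescr object using the third-party pysnmp library; the target address and community string are placeholders.

from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

# Query sysDescr.0 from a (placeholder) device using the v2c 'public' community.
error_indication, error_status, error_index, var_binds = next(
    getCmd(SnmpEngine(),
           CommunityData('public', mpModel=1),          # mpModel=1 selects SNMP v2c
           UdpTransportTarget(('192.0.2.10', 161)),     # RFC 5737 documentation address
           ContextData(),
           ObjectType(ObjectIdentity('SNMPv2-MIB', 'sysDescr', 0))))

if error_indication:
    print(error_indication)
else:
    for name, value in var_binds:
        print('%s = %s' % (name.prettyPrint(), value.prettyPrint()))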

Polling an SNMP managed device isn’t the only approach supported by the standard. SNMP also supports asynchronous notifications called traps (by the way, the shape of a trap is also specified by a MIB). An SNMP Trap is a mechanism used to send an unsolicited message to an SNMP Manager, to notify it of a significant event like an interface on a router going down, or coming back up.

So, as the first commercial (SNMP-based, at least) monitoring solutions emerged, they would perform a combination of SNMP polling, to get information about managed devices in the infrastructure, and receiving and processing SNMP traps. These SNMP managers would collect, store and display performance-related data obtained through polling (performance monitoring), and receive, process and display events from sources sending traps (availability monitoring). Through these features, network operations staff would react to alarms raised by traps, and use the gathered performance data to understand capacity utilisation in the infrastructure and where the limits and bottlenecks were.

Essentially this approach is still used today, widely, in anger and at scale. Some of the monitoring software using the protocol is a bit smarter and more sophisticated, as you’ll learn in the next instalment, “Part 3: It’s Complicated”. However, I’ll also discuss how this seemingly neat approach to solving the monitoring problem not only got ignored a bit (as custom approaches to monitoring re-emerged), but also how it increasingly struggles to keep up with the ever-changing IT landscape.

Cobe Earns Queen’s Award For Enterprise

Cobe, part of the Abiligroup group of companies, earns the Queen’s Award for Enterprise 2016 in the prestigious Innovation category.

Cobe Receives The Queen’s Award For Enterprise 2016

Cobe, part of the Abiligroup group of companies, is delighted to be one of the select businesses to be recognised as winners of the 2016 Queen’s Award for Enterprise – the UK’s highest accolade for business success. We are particularly delighted to have been recognised in the prestigious ‘Innovation’ category.

Queens Award for Enterprise Blue Logo

The Award is in recognition of the Cobe.io SaaS monitoring solution. Cobe is a platform for monitoring IT applications and services. It allows you to search your IT infrastructure in the same way you search for pages and text on the World Wide Web.

The information provided by Cobe’s live ‘mapping’ system enables it to understand the impact of any issue affecting your business-critical services, because it considers every component in the infrastructure and the relationships between them. The key benefit is that Cobe is able to pinpoint all components affected by a particular issue, enabling you to quickly identify a solution. No other monitoring platform on the market is able to do this.

Founder and Director Andy Onacko explains what winning this award means to Abiligroup; “We are a relatively small IT company, that prides itself on employing people that have innovation in their blood. The company attracts people that like to work at the ‘bleeding edge’ of technology and we believe in giving our people the autonomy to push the boundaries. The result is innovation, like the exciting Cobe.io platform”.

Cobe Attends KubeCon EU Conference

Earlier this month some of us from Cobe attended KubeCon EU 2016 in London. This conference was the international follow-up to the inaugural Kubernetes conference which took place in San Francisco at the end of 2015. Kubernetes enthusiasts from far and wide were treated to a variety of expert technical talks, all designed to spark creativity and promote Kubernetes education.

There has been a lot of talk about Kubernetes in recent months, and I could understand if the uninitiated were left wondering what all the hype is about. To explain what Kubernetes is, the etymology of the word is a pretty good place to start, at which point you’ll discover it’s not such an alien term after all. It shares its roots with the ubiquitous prefix cyber, giving us cyberspace, cybercrime, cybersecurity and, thanks to Norbert Wiener, cybernetics.

The Kubecon logo

Kubernetes is ancient Greek for a ship’s steersman (from kubernao, which means “to steer a ship”). Cybernetes is the usual Latin transliteration, but Roman sailors adopted the colloquialism guberno in place of kubernao, which is where we get the word govern. And guberno is the origin of one of my favourite words, gubernatorial, used to describe the elections for US state governors.

So it’s something about steering or governing then? Pretty much. Kubernetes is a system for managing containerised applications across a cluster of machines, orchestrating containers to provide “planet scale” services with high availability. Open sourced by Google in September 2014, it is a system whose design is heavily influenced by Google’s Borg system.

Our interest in Kubernetes here at Cobe is very direct: it’s the platform we use to provision our monitoring SaaS. Back in 2015 we started to deploy early versions of Cobe atop a pre-version-1 Kubernetes, underpinning a Google Container Engine (GKE) barely out of alpha (fun, games and jolly japes ensued). Early on we knew that Cobe itself would need to be “Kubernetes-aware”, because we wanted to dog-food Cobe to help us understand the state of Cobe itself. As it turns out, we thought that monitoring Kubernetes, and the containers and applications it was orchestrating, was a pretty useful feature not just for us but for Cobe users too. Attending KubeCon was therefore essential for learning more about the technology, who was using it and any pain points they may have had.

People milling around the KubeCon 2016 venue

CodeNode was an excellent choice of venue

KubeCon EU was held at the excellent Skills Matter venue, CodeNode. The two-day conference kicked off with an entertaining keynote from Google’s Kelsey Hightower, Staff Developer Advocate for the Google Cloud Platform. After a little chant to warm us up (“I say Kube, you say Con!”) Kelsey delivered a great live demo of some Kubernetes 1.2 features as well as covering off some alpha and beta features upcoming in 1.3. A sage piece of advice Kelsey had for the audience, learned through bitter experience while trying to prep the demo en route to London, was “never, ever run npm install on a plane”. Having done this on numerous occasions with terrestrial bandwidth I’m able to fully sympathise with his pain.

David Aronchick, Product Manager at Google, followed the keynote with details on upcoming content for Kubernetes 1.2 (out by the time I got around to writing this) and the ambitions for 1.3. These included legacy application support through a new construct called PetSet, cluster federation (aka Ubernetes) and lots of scale work. I was glad to see this because there has been some criticism of k8s performance, especially from the competition.

I stayed in the main room to listen to Matthew Bates’ talk on The State of State. Containerised deployments of microservices bring with them many benefits however databases can be an issue, a topic touched on by Matt Ranney, Senior Staff Engineer at Uber at AYB Conference last year. Matthew presented the advantages of deploying databases with Kubernetes that included efficient resource utilisation, automatic scaling and consistency of automation and management. He then shared how one would use Persistent Volumes and Persistent Volume Claims to allocate suitable storage for database instances. In addition Matthew introduced the upcoming Kubernetes feature for nominal services (PetSets), and Vitess, a database solution for scaling MySQL.

In another talk I got to hear OpenStack’s view of how they remain relevant in a Kubernetes world: essentially, if you are a large enterprise or deploying your own platform then OpenStack shouldn’t be ignored.

Sinclair Schuller, CEO of Apprenda, gave us some insights into catering for both cloud-native and traditional applications. Interestingly, he spoke of enterprise clients that have in excess of ten thousand applications, for whom decomposing all of them into microservices is not an option. Consequently Apprenda have engineered a platform that understands the mechanics of “mixed era” applications to address this issue. It was good to hear this from another source (we had been told similar by our clients in the enterprise world). One of Cobe’s strengths is its ability to monitor “mixed era” applications and show the runtime relationships between them.

There were so many great talks crammed into a very brisk two days. If you have the time to take a look at any of them online I would recommend “Killing containers to make weather beautiful” by Jacob Tomlinson, and “ITNW? Orchestrating an enterprise” by Michael Ward. I was gutted to have missed “Kubernetes Hardware Hacks” by Ian Lewis, for which I blame the conference schedule app: it didn’t update properly on my phone and I trotted, full of expectation, into the room just as the talk finished. Hopefully I’ll get the full experience when it appears online.

Finally, if you get the opportunity to attend a KubeCon sometime in the future, you should. I found it really worthwhile and met some great people doing really clever stuff. I want to call them Kubernauts? But either that’s not a thing yet, or it was a thing and everyone decided it shouldn’t be a thing. Anyway, +1 for it being a thing.

A Brief History of Monitoring (Part 1)

Part one of a series of articles that describe the history and evolution of monitoring from the perspective of Cobe CTO, Dave Charles.

Part 1: The time before monitoring

The monitoring of IT infrastructures has always struggled to keep pace with the ever-changing nature and complexity of the infrastructures themselves. As we all know too well, IT itself never stops evolving and we all seem to be constantly wrestling with nascent technologies in order to solve our customers’ problems in ever better ways. In my career to date I’ve developed on mainframes and witnessed the move to mini-computers and desktops, followed by the move to servers in data centres, and those servers subsequently overlaid with lots of virtualisation goodness. More recently, containerisation has gathered momentum and it would be hard to believe that there is a single large enterprise that does not plan to follow in the footsteps of companies like Netflix and transmogrify their monolithic applications into microservices deployed across containerised environments.

Side-by-side with the changes in infrastructure architecture, I have seen application architecture evolve too, from monolithic applications to client-server, to three-tier, to service oriented and now to microservice architectures. So developing the systems that assure the proper operation of any computer system is, by its very nature, a game of constant catchup. And, as if keeping pace isn’t a big enough challenge, some or all of those things coexist in many a large organisation’s infrastructure today, so we almost always have to continue catering for the “legacy” stuff too.

I started out as a developer working on the UK’s Air Defence System, more specifically IUKADGE (Improved United Kingdom Air Defence Ground Environment - don’t blame me, I didn’t come up with the snappy name). IUKADGE was (is, it’s called ASACS nowadays and has changed shape a bit) a vast collection of integrated, bespoke, multi-vendor systems scattered across dozens of sites that accepted data feeds from all sorts of sources including static and mobile ground based installations, radars, ships, aircraft and satellites.

Irving Metzman as Richter next to WOPR in the 1983 film War Games

Like me back in the day, but Richter is slimmer and has more hair.

Simply put, IUKADGE was the post-war, digitised version of RAF operators pushing cardboard planes and ships around with a snooker cue on a giant map table, just like in Angels One Five. Furthermore systems like this were the inspiration for “WOPR” in the 1983 film War Games, except that as far as I know IUKADGE never gained consciousness. Or was that Skynet?

So what was the monitoring like? In my experience there was none; well, in the modern sense at least. The operators would use the OS system tools available to them to observe the systems running, but relied largely on reacting to issues reported by us developers and the actual users. The operators (in their fine white dust coats) would walk around the shiny kit in a room we called the “Bureau” and press test buttons, check lights and run commands using a “line-printer” to check that all was well. For the thirty-somethings and below amongst you, a Bureau is a sort of data centre but much smaller, and a line-printer is a computer terminal that printed its input and output on paper as opposed to a screen. Vendor-supplied test equipment was used to run diagnostics and to unearth problems with the mainframes, networks and peripheral devices.

However it’s not like the application architectures were simple. One subsystem I worked on was composed of a large set of communicating processes that used a home grown inter-process communication mechanism to exchange messages. There were plenty of moving parts that could go wrong, but resilience was built-in with fail-over mechanisms. If a main system failed catastrophically its standby would kick in. Importantly this fail-over could be manually invoked by an operator if they thought there was an issue with the main system (effectively turning it off and on again).

But before new code got into the live system there was testing, lots of manual testing. We also developed tools to start, stop and status the applications, to help us understand what was running and how data propagated through the system. This would enable us and the operators to ascertain that things were behaving as expected. Some developers (including myself for a time) were tasked with creating a simulation data generator, so that the system could be exercised with realistic data. A whole team was dedicated to using those tools to generate simulations. Furthermore, teams of analysts would compile pages upon pages of elaborate, tabulated test plans that were laboriously executed over many days, to verify and validate how the system behaved. It was labour intensive to say the least, and many forests were laid to waste at that time.

And that’s how it was, in the main. Plenty of testing, resilience built in (albeit very coarse grained) and tools to help us “inspect” the system while it was running. I never witnessed or heard of any disastrous failures. I suppose other factors helped, like a cast iron VAX VMS operating system, stringent control of the infrastructure by the people in white coats, and a relatively simple, statically typed language (Fortran 77) with no dynamic memory allocation so SEGVs weren’t even a thing.

However, hardware and software architectures, as ever, were changing. Hardware was becoming more commoditised, and the internet (well, the world wide web) was about to happen. Things were getting more distributed and complicated. Look out for the next instalment, Part 2: Then SNMP Happened, where I’ll describe how monitoring evolved around this time.

All Your Base Are Belong to Us

In November Cobe sponsored and attended All Your Base 2015 in London. This increasingly popular gathering, tagged the practical database conference, is a one-day, single-track seminar focusing on data and databases.

Cut-scene from the popular 1980s game Zero Wing

The interesting title for this annual database conference is a play on the broken English phrase “All your base are belong to us” found in the opening cut-scene of the 1989 video game Zero Wing. When I first heard of this conference a couple of years ago the nomenclature instantly appealed to my unequivocal devotion to the geekgeist. Now in its fourth year, All Your Base attracts an audience of 250-300 back-end developers, programmers and database engineers.

Previous years have seen a mix of relatable and inspirational talks on a range of database technologies. This was my first year attending AYB and I wasn’t disappointed; I found all the talks entertaining, informative and insightful. Charity Majors gave a great talk on upgrading databases and the challenges she’s faced at Facebook’s Parse. Working on one of the world’s largest, most complicated MongoDB deployments, one that supports half a million Facebook apps, Charity told us she never knows whether to say that with pride or shame! She shared, full of mirth, that if you can upgrade your database technologies well, then no-one will ever know that you’ve done a thing. This, she divulged, is why she believes most operations engineers drink heavily and swear a lot. Check out her talk, it’s really entertaining.

Another highlight was the talk “Break your database before it breaks you” by Matt Ranney, Senior Staff Engineer at Uber. At Uber they’ve scaled up fast. He made the (fair) claim that Uber has one of the fastest growing engineering teams in the history of engineering teams; when he started in 2014 there were 200 engineers. Now there are over 1,500, a little under a third of the entire Uber workforce.

When a team grows this fast the challenges are interesting, to say the least. Matt gave us a history of Uber’s databases: the platform moved from a very straightforward PHP + MySQL stack in 2009 to Mongo, Redis and PostgreSQL (depending on whether the data underpinned the dispatch or API functions) and, more recently, to a sharded, Cassandra-like, schema-less approach (actually called Schemaless). In Uber’s dispatch function, technologies like Riak are now being deployed.

“Things broke a lot; there were some dark times.” Matt went on to share the “cringe-worthy” details of a particularly bad outage in which the Uber API went down for 16 hours. The only reason this did not make the news was a fallback capability in the dispatch system. Watch the talk; it’s very entertaining and includes tales of dismissed critical alerts and C programs written at midnight to bypass data corruption issues.

A cartoon chicken with a speech bubble saying "moo"

Make failure testing possible with chickens.

It was clear that no one thing was wholly responsible for the outage at Uber. A contributory factor was lack of context. This was put down to alert titles: the PagerDuty critical alerts that informed the operators that storage capacity was compromised were dismissed because they seemed less important than other prevailing issues. That is, it was not clear that a primary database instance was affected; the operations team assumed it was a replica (of which they have very many).

It was also determined that thresholds are not as useful as derivatives, e.g. rate of change. The rate of storage consumption grew ever faster after the initial 80GB-remaining warning, and the warning of 0GB free (not enough, apparently) came fast and too late. Had the monitoring identified an anomalous rate of change in available capacity, the engineers might have identified the cause of the problem much sooner.

Finally, the special nature of certain nodes in the system was an issue. Since the outage, Uber perform failure testing, a leap forward in understanding failure modes and improving the ability to cope with them. Testing for failure is supported by an architectural approach where you treat things like cattle (or maybe even chickens) rather than precious pets. In the database world, however, this is a hard sell. Using replication partners is a common approach that provides database scalability and availability; however, it entails assigning special purpose and meaning to particular database instances, the pets. Failure testing is at odds with that. Moving towards a situation where database nodes are just a commodity, where you can simply crash-test and break anything and then recover by spinning up a new node as a replacement, is an appealing approach. However, removing specialness from a system (as Uber discovered with their Riak deployments) brings in a new set of problems, not least because such systems mask underlying issues so well! Challenges therefore remain, but the approach is probably worth it, especially if users actually prefer data that is sometimes inaccurate over data that is sometimes not there at all.

The anecdote from Uber was extremely relevant to us at Cobe because the contributing factors described are exactly what we are trying to address. Lack of context about a monitored system is fundamental in exacerbating the problem of identifying the relevance and root cause of issues. Because Cobe builds a topological model of the monitored infrastructure, it understands the relationships between anything and everything. This enables the consumer of an alarm to, for example, understand the relevance of a storage warning on a particular database node. Even more exciting is the prospect of pointing some analytics at this rich model data. At the same time we are working hard on making monitoring easier in a landscape where cattle (or chickens) prevail. Assuring the smooth operation of a containerised environment is a challenge, and our plans to provide a high level of context when monitoring this type of deployment are more and more relevant.

Cobe Attends DJUGL at Potato HQ

On October 6th a few of us from Cobe pootled along to Potato HQ to attend a Django User Group London (DJUGL) meet-up. It was an enlightening evening of talks at which I also had the opportunity to speak, describing our new monitoring SaaS to those attending. This was my first time at a DJUGL meet-up and also my first visit to Potato HQ. I have to say it looks like a fantastic place to work, and the Potato folk hosting the event were really welcoming.

Potato HQ Logo

After the evening kicked off with beer and pizza (two of my favourite things), I opened the talks with a brief discussion about Cobe: our recent soft launch, its current capabilities and the features we’re working on as we move towards our hard launch. For the uninitiated, Cobe is a platform for monitoring applications and services that provides a live, searchable view of your infrastructure components and their relationships. As we progress towards that hard launch in early 2016 we’re adding the capability to explore and display in more detail the metrics and alarms associated with those components. This, together with other key features, will set Cobe apart from most other monitoring solutions, because it is able to consider every infrastructure component and any relationships that exist between them. I also discussed a key feature that will enable Django (and other Python) developers to instrument their applications so that telemetry from within them can be easily surfaced using Cobe and its search capability.

Attendees eating pizza during DJUGL meet-up at Potato HQ

Beer and Pizza, two of my favourite things…

After my talk, Potato HQ resident, Python enthusiast and Django developer Ana Balica delivered a hugely entertaining talk on Django Signals and AppConfig through the medium of Rick and Morty. Other highlights for me included a great talk by Andrea Grandi on creating custom Django middleware, Daniel Roseman’s guidance on how to ask a question, and Alexandre González’s great introduction to deploying a simple Python application into Kubernetes. It was great to see other folk getting into, and being enthusiastic about, that platform, as it’s the one we use to deliver Cobe. The evening was rounded off nicely by Potato HQ developer Stu Cox and his thought-provoking talk titled Abandoning HTML.

All in all, DJUGL was a great evening at a great venue full of really interesting people. I had some great interactions and really hope I get the chance to attend again.