Memory Adventures with InfluxDB

At Cobe, we’ve been using the time-series database InfluxDB. Since it’s still a relatively new database, we thought we’d share some of our experiences of using it, particularly around controlling its memory usage.

“Why does Cobe need a time series database?”

Cobe shows real-time metrics for hosts, processes, applications and other entities that form a customer’s infrastructure topology. In developing Cobe, a time series database was our preferred route for storing metrics data, to make that development task as easy as possible. Time series databases do rather useful things out of the box, such as deleting entries older than a pre-defined age, rolling up values over time periods by mean or another function to reduce query latency, and efficiently compressing data to minimise storage requirements.

“Why InfluxDB — Mongo not good enough for you?!”

We decided to go with InfluxDB as it has the capabilities mentioned above, plus it was reputed to be fast and very easy to set up. Okay, it’s still in development by InfluxData (we’ll come back to that…) and, at the time of writing, it hasn’t yet reached version 1.0. However, it has strong backing, shows great promise, and we felt we could benefit from, and enjoy the ride of, the evolution of one of the latest time-series databases.

“So, how’s it been, hanging out with InfluxDB?”

Well, in most respects, rather good. Easy to set up and use, it “does what it says on the tin”, as the now famous line from a well known UK TV advert puts it. It all seemed pretty cushy: topology metrics were being provided reliably and with low latency, and InfluxDB appeared to be working trouble free. All perhaps too good to be true. Which it was… until one dark and stormy night, the RAM used by InfluxDB for our largest test topology, which we had already noted was creeping up, ate the node’s remaining memory and crashed the node (we use Kubernetes/Google Container Engine). Okay, I forget the actual weather, but that’s obviously how the future movie will have to depict it, the weather respecting the severity of the occasion. Only a few seconds of data were lost, fortunately, but this was not quite what we had in mind. It was back to the drawing board…

“Oh dear, so why was memory usage growing and what was the solution?”

The first step was to put an 800MB RAM limit on the InfluxDB container, to prevent InfluxDB from using all of the node’s memory, and to reduce the cache sizes in InfluxDB’s configuration file so that it would write its cache to disk more frequently (there is no configuration option to cap overall memory use). However, InfluxDB would not live within this container limit: RAM (RSS) usage grew at much the same rate as before until the limit was reached and the container was restarted. Interestingly, a container restart would clear the memory used by InfluxDB, with RAM usage climbing again from near zero and giving a few days’ operation before the restart cycle repeated. At this point we came across InfluxData’s Hardware Sizing Guidelines (which we had either not previously come across, or which had been written since we started with InfluxDB), indicating that we should have 2-4GB of RAM available for our “low load” scenario. Okay, so maybe we needed more RAM than we expected, but we wanted to use significantly less than 4GB, and the RAM usage cycle seemed odd to us, so we persevered with testing.

Next, we wrote a Python script to test InfluxDB RAM usage on a laptop, broadly reproducing our production schema and typical data write rate. Here’s a quick summary of the test (a simplified sketch of the write loop itself follows the list of test parameters below):

The schema design was a simplified version of the production schema, to save time whilst, hopefully, still capturing the key InfluxDB operating dynamics we wished to understand:

{
    'measurement': 'entity/Process/attr/raw/metric0',
    'tags': {
        'ueid': 'MHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHg=',
        'traits': '["metric:counter"]',
    },
    'time': 1.4653998982705984e+18,
    'fields': {
        'value:float': 85.14005871032059,
        'vtype': 'float',
    }
}
  • Four measurements were used — i.e. .../metric{0/1/2/3} — given that UEIDs (described below) typically each have around four metrics;
  • We ran with 100 fixed and 100 varying UEID values (‘unique entity identifiers’ that represent steady-state and transient entities in Cobe such as hosts, processes and applications);
  • The 100 varying UEID values were generated as a rolling first-in-first-out list of values that was continually updated with completely new UEID values at a rate that would create an increase in InfluxDB series of 9 per minute, to mimic the growth rate we saw in production;
  • A total of 615 points were written to InfluxDB per minute, cycling through our fixed+varying combined list of UEIDs, writing a point for each of the four measurements for each ueid;
  • The value:float value for each of the four measurements was set randomly, 0 <= value < 100, once per minutely write cycle, and applied to all UEIDs written during that minute;
  • In case you’re wondering, our schema uses the vtype field to indicate which field name holds the value, letting us vary the value types we store for a particular measurement. This is because InfluxDB only allows a single value type for a particular measurement’s field key. So, alternatively, 'vtype': 'int' would indicate that we’re storing an integer value under the key value:int;
  • Our production 14-day retention policy was simulated by applying a 70-minute retention policy with the shard duration set to 5 minutes, scaling production’s 14-day retention and 1-day shard duration down by the same factor.
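
For the curious, the listing below is a minimal sketch of the shape of that write loop, using the influxdb Python client. It is not our actual test script: the write and UEID-replacement rates are rounded, the database name, retention policy name and the new_ueid helper are illustrative, and the SHARD DURATION clause is assumed to be supported by the InfluxDB version in use.

import base64
import os
import random
import time

from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, database='memtest')
client.create_database('memtest')

# Scaled-down retention: 70 minutes with 5-minute shards, standing in for
# production's 14-day retention with 1-day shards.
client.query('CREATE RETENTION POLICY "short" ON "memtest" '
             'DURATION 70m REPLICATION 1 SHARD DURATION 5m DEFAULT')


def new_ueid():
    """Return a random base64 string standing in for a UEID."""
    return base64.b64encode(os.urandom(32)).decode()


fixed_ueids = [new_ueid() for _ in range(100)]
varying_ueids = [new_ueid() for _ in range(100)]

while True:
    # Retire the oldest varying UEIDs and add brand new ones, so the number
    # of series held by InfluxDB keeps growing over time.
    varying_ueids = varying_ueids[2:] + [new_ueid(), new_ueid()]
    # One random value per measurement per cycle, applied to every UEID.
    minute_values = [random.uniform(0.0, 100.0) for _ in range(4)]
    points = []
    for ueid in fixed_ueids + varying_ueids:
        for metric in range(4):
            points.append({
                'measurement': 'entity/Process/attr/raw/metric{}'.format(metric),
                'tags': {
                    'ueid': ueid,  # UEID as a tag, as in the original schema
                    'traits': '["metric:counter"]',
                },
                'fields': {
                    'value:float': minute_values[metric],
                    'vtype': 'float',
                },
            })
    client.write_points(points)
    time.sleep(60)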

In practice, we found that RAM usage was higher than in production, perhaps because we did not reproduce the production distribution of measurements across different UEIDs, which would have been time consuming to simulate exactly. However, the simulation gave us what we wanted in terms of testing InfluxDB’s memory usage dynamics.

/images/influxdb-memory-usage/fig1-memory-usage-with-ueid-as-tag-influx-0.12.1.png

Fig 1 — Memory usage with UEID as a tag (InfluxDB 0.12.1).

Initial results proved somewhat disturbing. Fig 1 above shows RAM usage over 42 simulated days of data entry into InfluxDB. Usage was over 800MB after just 4 days and increasing rapidly — and indefinitely — at an average rate of nearly 30MB/day, similar to production, despite a retention policy that should mean only a 14-day horizon of data is stored. Disk storage did indeed top out, as expected, but RAM requirements did not.

The next step was to review our schema and acknowledge that we were likely suffering from breaking InfluxDB guidelines by defining each data entry’s UEID as a tag. Tags in InfluxDB are key-value pairs that are indexed, reducing search latency. InfluxData’s low-load hardware recommendations say the number of unique series should be fewer than 100k; a series is a collection of data in InfluxDB’s data structure that shares a measurement name, tag set and retention policy. In our particular case, the variation in UEID tags meant that InfluxDB reported over 500k series after just 38 days – and it would potentially be far more for future, more extensive, topologies — far too many! What was very odd, though, was that memory usage and the number of series reported by InfluxDB were both growing indefinitely despite that 14-day data retention policy. But hey, “let’s deal with a likely key issue and see what’s left…”.
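
As an aside, one way to keep an eye on the series count during this kind of test is to count the rows returned by SHOW SERIES. A minimal sketch using the influxdb Python client (the database name is illustrative, and note that SHOW SERIES itself becomes slow once cardinality is large):

from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, database='memtest')

# SHOW SERIES returns one row per series in the database; counting the rows
# gives the total series cardinality.  Treat this as an occasional check
# rather than a frequently-sampled metric, as it slows down at scale.
result = client.query('SHOW SERIES')
print('series count:', sum(1 for _ in result.get_points()))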

So, we retreated to storing UEIDs as InfluxDB data fields, which aren’t indexed. This might increase search latencies, but it didn’t look like we had much choice for our use-case.

Our schema therefore changed to:

{
    'measurement': 'entity/Process/attr/raw/metric0',
    'tags': {
        'traits': '["metric:counter"]',
    },
    'time': 1.4653998982705984e+18,
    'fields': {
        'ueid': 'MHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHg=',
        'value:float': 85.14005871032059,
        'vtype': 'float',
    }
}
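
With the UEID now a plain field, queries select on it in the WHERE clause rather than going through the tag index, which is where the possible latency cost comes from. A rough sketch of the kind of query involved, using the influxdb Python client (the database name, measurement and UEID value are simply the illustrative ones from the schema above):

from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, database='memtest')

# Filtering on a field value means InfluxDB scans the points in the time
# range instead of using the tag index, hence the possible latency cost.
result = client.query(
    'SELECT "value:float" FROM "entity/Process/attr/raw/metric0" '
    "WHERE \"ueid\" = 'MHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHg=' "
    'AND time > now() - 1h')
for point in result.get_points():
    print(point['time'], point['value:float'])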

Fig 2, below, shows the RAM usage over time that our tests produced with this revised schema.

/images/influxdb-memory-usage/fig2-memory-usage-with-ueid-as-field-influx-0.12.1.png

Fig 2 — Memory usage with UEID as a field (InfluxDB 0.12.1).

Now, memory usage was much lower, at 450MB or so. For our search purposes, latency was little affected with UEID as a field, which was pleasing, although we still saw a creeping increase in RAM averaging around +1MB/day.

The above testing had been using InfluxDB 0.12.1, which we were using in production. Version 0.13 had been released just days before, so we repeated the above tests under 0.13 with the original schema, UEID stored as a tag again. This resulted in RAM usage shown in Fig 3, below.

/images/influxdb-memory-usage/fig3-memory-usage-with-ueid-as-tag-influx-0.13.png

Fig 3 — Memory usage with UEID as a tag (InfluxDB 0.13).

Ahaa! The situation was now much healthier than before, with memory growth just about topping out at 1.7GB, though still showing a long-term average increase of +1.5MB/day (taking data beyond the 50 days shown). This was an InfluxDB performance improvement reflecting the retention-policy series clean-up issue that InfluxData officially declared fixed in v0.13. The number of series reported by InfluxDB topped out at around 335k, with step changes in memory as series data older than the retention period was cleared. Okay, this didn’t happen after 15 days as might have been expected, dropping a day’s worth of data, one shard at a time, but at least it was happening eventually, after 28 days or so. Better late than never. Perhaps this will improve with future InfluxDB releases.

Right, so how were things when UEID was again changed from being an indexed tag to a field?

The resulting chart was Fig 2’s doppelganger, so I’ll save you from staring unnecessarily at another chart that looks just the same. There was no tangible difference from InfluxDB 0.12.1: the series clean-up fix in 0.13 had no real influence here, since there was a similarly low number of series for it to act on (only 190 series after 42 days). An apparent memory creep was still there, varying a little between tests, but showing a general trend of around +1MB/day.

Net result: we updated production to InfluxDB 0.13 and changed the schema, UEID becoming a field, to reduce RAM requirements.

“Okay, so how is production performance now?”

We are now in a situation where memory requirements are significantly lower. Any increase in latency is not noticeable in our front-end’s queries to InfluxDB. The lower memory requirements will make a useful impact on keeping costs per customer under control, so we can offer customers more for less. We are keeping an eye on things to see if there is any long-term issue with upwardly creeping RAM requirements in production, but so far it’s looking good.

However, going forward, as Cobe develops we will have to see whether InfluxDB continues to satisfy our use-case, given its significantly increasing memory demands under greater loads and the fact that we’re not indexing by UEID: possible future queries by UEID over longer time periods may suffer undesirable latencies (something else we are testing…).

We may well also want to use InfluxDB’s Continuous Queries to roll up data over time periods, to further speed up queries spanning long time ranges. We would want to do this per UEID, so the number of series — and therefore memory requirements — would then grow dramatically, not to mention the extra processing power InfluxDB would likely need.
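
For the curious, the sketch below shows roughly the shape such a roll-up continuous query would take, using the influxdb Python client. Note that InfluxQL can only GROUP BY tags, so rolling up per UEID assumes ueid is promoted back to a tag, which is precisely the change that would inflate the series count; the database, query and measurement names are illustrative only.

from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, database='cobe')

# Roll raw points up into hourly means per UEID.  GROUP BY only accepts
# tags, so this assumes ueid has been made a tag again -- the very change
# that would blow up the series count.  All names are illustrative.
client.query(
    'CREATE CONTINUOUS QUERY "metric0_hourly" ON "cobe" BEGIN '
    'SELECT mean("value:float") AS "value:float" '
    'INTO "entity/Process/attr/raw/metric0/1h" '
    'FROM "entity/Process/attr/raw/metric0" '
    'GROUP BY time(1h), "ueid" '
    'END')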

“Cor blimey! So after all that, would you recommend InfluxDB?”

Yes, we would, although it depends on your use-case. If you only have a constrained number of tag values that you want to index by, then in that respect you’re fine. In our case, however, we have a very large number of UEID values we’d ideally like to index and therefore make tags – and that number would keep expanding in future situations where we might not want a retention policy limiting the data stored. So InfluxDB’s hardware requirements in our scenario could become over-demanding. Indeed, if we’re forced to back off from using tags and Continuous Queries, then it could be argued that we don’t get a significant benefit from using InfluxDB over a more conventional database, and might be best reviewing our database choice.

“What have been the lessons learnt?”

Well, firstly, we’re reminded that you can expect the unexpected with pre-1.0 release software. Life with such a new product is interesting and fun, but yes, you can expect to lose some time sorting issues out. Secondly, when software is new and its documentation and community material are less extensive, the risk of a poorer fit to your use-case may be higher. Still, InfluxDB gives us what we need for now, and it’s still a very young database with lots of development to come. Indeed, as I close, I see today that the InfluxDB 1.0 beta has just been released. Happy days…!
