
Moving to production with InfluxDB, Grafana and AWS EC2

Current Setup

So I've got InfluxDB collecting time-series data from half a dozen IoT devices, which feed in about 100 points every 10 s. Grafana then runs queries against the database. Both were installed straight from the vendor packages.
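
For reference, the install on Ubuntu looks roughly like this (a sketch, assuming the InfluxData and Grafana apt repositories have already been added):

# install InfluxDB 1.x and Grafana from the vendor repositories
sudo apt-get update
sudo apt-get install -y influxdb grafana
# start both now and enable them at boot
sudo systemctl enable --now influxdb
sudo systemctl enable --now grafana-server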

Grafana and InfluxDB are running on one t2.micro instance with 1 GB of RAM. On occasion I get out-of-memory errors from InfluxDB, and the instance has crashed once. These problems will only get worse with more devices.

So how to move from this dev-style setup to a (small-scale) production setup with some backup and robustness?

Understand Current Setup

Grafana seems lightweight, as seen by checking top:

ec2>top -o %MEM

InfluxDB, however, uses 30% of the 1 GB, and even for my "low load" 2-4 GB is recommended.

A simple bash script can be run to log the memory hogs to memory-log.log:

# log the top three memory users every 10 s (ps prints its own header line)
while true; do
    date
    ps -eo cmd,%mem,%cpu --sort=-%mem | head -n 4 | grep -v grep
    sleep 10
done | tee memory-log.log

You can plot a histogram of the %mem like so:

pip install bashplotlib
# %mem is the 4th whitespace-separated field for lines like "influxd -config /etc/influxdb/influxdb.conf 30.1 1.2"
grep influx memory-log.log | awk '{print $4}' | hist

Also, checking the logs is a good idea:

tail -f /var/log/influxdb/influxd.log
systemctl status influxdb
tail -f /var/log/messages

Fortify InfluxDB

Config

There's a lot of scope for clients to flood influxd with massive queries or writes. In influxdb.conf there are some options to control the chaos, and they're fairly self-explanatory. Note that it's up to the clients to retry if they get a write failure.
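
The ones I would look at first are in the [coordinator] and [http] sections (a sketch for InfluxDB 1.x; the values are illustrative, not tuned):

[coordinator]
  # abort or log long-running queries
  query-timeout = "30s"
  log-queries-after = "10s"
  max-concurrent-queries = 10

[http]
  # cap the size of a single write body (bytes) and the rows returned per query
  max-body-size = 25000000
  max-row-limit = 10000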

Memory

I had out-of-memory problems with InfluxDB. I was using a 2 GB t2.small AWS instance, with 8 databases. Each database is written to over HTTP in a batch every 10 s, and each batch contains about 10 readings.
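
Each batch is just line protocol POSTed to the /write endpoint, something like this (a sketch; the database name and values are lifted from the sample data below):

curl -i -XPOST 'http://localhost:8086/write?db=iotalcibne&precision=s' --data-binary 'iotawatt,device=ilcibne,ct=Lights1,units=Watts Watts=255.01 1544861400
iotawatt,device=ilcibne,ct=Voltage,units=Volts Volts=269.43 1544861400
iotawatt,device=ilcibne,ct=Lights1PF,units=PF PF=0.748 1544861400'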

> use iotalcibne
Using database iotalcibne
> show measurements
name: measurements
name
----
hourly
iotawatt
> select * from iotawatt limit 10
time                 PF    Volts  Watts   ct                device  units
----                 --    -----  -----   --                ------  -----
2018-12-15T08:10:00Z              255.01  Lights1           ilcibne Watts
2018-12-15T08:10:00Z       269.43         Voltage           ilcibne Volts
2018-12-15T08:10:00Z 0.748                Lights1PF         ilcibne PF
2018-12-15T08:10:00Z              0.11    Power1Kitchen     ilcibne Watts
2018-12-15T08:10:00Z              0.23    Power4SouthOffice ilcibne Watts
2018-12-15T08:10:00Z              0.24    Power5OpenPlan    ilcibne Watts
2018-12-15T08:10:00Z              0.22    Power2Rack        ilcibne Watts
2018-12-15T08:10:00Z              1012.44 TotalCircuits     ilcibne Watts
2018-12-15T08:10:00Z              1011.41 TotalLights       ilcibne Watts
2018-12-15T08:10:00Z              756.4   Lights2           ilcibne Watts

Memory problems (see Monitoring, below) occur when:

  • Compaction
  • Continuous queries all run at the same time (how to offset them?)
  • Whenever it wants!

To improve this I:

  • upgraded to a t2.medium instance (4 GB), which is the minimum recommended for InfluxDB.
  • added swap via a swapfile (see the sketch after this list). I couldn't add ephemeral storage to my instance – maybe not included/possible. So now a total of 4 + 4 GB of memory, with the 4 GB swapfile on the 16 GB disk.
  • set max-concurrent-compactions = 1 in the influxdb.conf file
  • [There are more options in influxdb.conf to limit resources]
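
Adding the swapfile is standard Linux housekeeping, roughly (a sketch, assuming a 4 GB swapfile on the root volume):

sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# make the swapfile persistent across reboots
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab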

Monitor InfluxDB

Use Grafana for an overview

Let's use Grafana and InfluxDB to monitor the InfluxDB host itself (live.phisaver.com).

Set up a fresh Grafana + InfluxDB pair on a new machine and connect it to the _internal database to view performance. There are a number of ready-made monitoring dashboards available on grafana.com.
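
To confirm the _internal stats are being collected before pointing Grafana at them, a quick query from the influx CLI (a sketch; the runtime measurement and HeapInUse field come from InfluxDB 1.x self-monitoring):

influx -host live.phisaver.com -database _internal \
  -execute 'SELECT max("HeapInUse") / 1048576 AS heap_MiB FROM "runtime" WHERE time > now() - 1h GROUP BY time(10m)'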

Use AWS CloudWatch for the basics

Simple metrics such as CPU can be monitored out of the box. Disk space was more difficult (but possible with AWS tools – see the CloudWatch agent below). However, endpoint checking costs extra, and restarting dead services is not supported. So…

Use Monit for detailed monitoring

Monit is a nice system for monitoring and restarting dead services. I chose it because it's easy to use and robust, and there is just enough help on Google to do everything. See details on installation on AWS.
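
Installation is a one-liner from the distro packages (a sketch for Ubuntu/Debian; on Amazon Linux it comes via EPEL):

sudo apt-get install -y monit
sudo systemctl enable --now monit
# config lives under /etc/monit/; reload after editing
sudo monit reload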

On the running machine, I do the following:

To check the HTTP endpoint (useful, as sometimes the process is running but the HTTP endpoint is dead):

check process influxdb matching "influxd"
  start program = "/usr/bin/systemctl start influxdb"
  stop program  = "/usr/bin/systemctl stop influxdb"
  if failed
    port 8086
    protocol http
    username "bbeeson"
    password "******"
    method GET
    request "/query?q=SHOW+DATABASES"
  for 2 cycles then restart

To check the system is not running low on resources:

check system $HOST
  if cpu usage > 50% for 10 cycles then alert
  if memory usage > 75% for 5 cycles then alert
  if swap usage > 25% then alert

To check (from another machine) that the webserver is open:
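
Something like this in the second machine's monit config (a sketch; the hostname and ports are my setup's values – the web front end on 80 and InfluxDB's HTTP API on 8086):

check host phisaver with address live.phisaver.com
  # web front end
  if failed port 80 protocol http for 2 cycles then alert
  # InfluxDB HTTP API (/ping returns 204 when healthy)
  if failed port 8086 protocol http request "/ping" for 2 cycles then alert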


Use the AWS CloudWatch Agent to Monitor

As per the instructions, install the agent and run:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json -s

… to start the service. This provides disk-free, CPU and memory metrics, in addition to the 'standard' EC2 metrics.
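
The metrics it collects are driven by the amazon-cloudwatch-agent.json file referenced above; a minimal sketch that adds memory and disk usage (measurement names as per the CloudWatch agent docs):

{
  "metrics": {
    "metrics_collected": {
      "mem":  { "measurement": ["mem_used_percent"] },
      "disk": { "measurement": ["used_percent"], "resources": ["/"] }
    }
  }
}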

Design new setup for backup

So, thinking backwards, how to configure things to:

  • be simple
  • enable fast recovery in case of ec2 server crash
  • backup only the data needed

Some options are:

  1. Backup whole server to an AMI.
  2. Backup EBS root volume.
  3. Use multiple EBS volumes for the instance. This is good Linux practice. I would have a separate /var and maybe /etc. To restart after a disaster, I'd do the same as #2. Since /var contains other apps, I might need to create a separate volume just for /var/lib/influxdb???

Method 1: Backup Root Volume and Test Restore

  • Stop instance
  • Using a test instance, take a root volume snapshot

Easy enough, but it can't be automated easily and is not incremental. Still, keeping one AMI of a known-good system is a good idea to enable fast recovery.

Not a good option for continuous backup

Method 2: Backup EBS

  • Automates easily with the EBS Lifecycle Manager
  • To restore, "Create an Image" from the snapshot of the EBS volume and start an instance from it

This is easy to automate with the EBS Lifecycle Manager and is incremental. It works well.

A good simple option.

Method 3: Multiple EBS Volumes

More complex.

  • Create a separate EBS volume and mount it at, say, /data.
  • Configure InfluxDB via its .conf file, and/or use symlinks, to store /var/lib/influxdb and /etc/influxdb/influxdb.conf on the new volume (roughly as sketched below). This is easy enough following the AWS docs, although you need to check ownership and permissions carefully.
  • The same can be done for Grafana and Apache2
  • Upon failure, detach the /data volume and re-attach it to a fresh instance.
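
Moving the InfluxDB data onto the new volume looks roughly like this (a sketch; the device name /dev/xvdf depends on the instance, and the volume is assumed to be new and unformatted):

# format and mount the new EBS volume
sudo mkfs -t ext4 /dev/xvdf
sudo mkdir /data
sudo mount /dev/xvdf /data
# move the InfluxDB data directory across and symlink it back
sudo systemctl stop influxdb
sudo mv /var/lib/influxdb /data/influxdb
sudo ln -s /data/influxdb /var/lib/influxdb
sudo chown -R influxdb:influxdb /data/influxdb
sudo systemctl start influxdb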

To upgrade the instance type (aka get more memory)

Within reason, I can upsize an instance to a bigger machine after shutting down. To do this:

  • Shutdown the instance
  • Backup by creating an AMI of the instance  (AWS>EC2>Images>AMI)

  • Change the instance type to the larger type (a t2.medium in my case). Super easy. Restart and go! (The same steps from the AWS CLI are sketched below.)
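
The same steps from the AWS CLI look roughly like this (a sketch; the instance ID and target type are placeholders):

aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --instance-type Value=t2.medium
aws ec2 start-instances --instance-ids i-0123456789abcdef0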

 
