Considerations for moving to production with InfluxDB, Grafana and AWS EC2

Current Setup

This provides background – see Installing Grafana, InfluxDB, Apache2 on AWS EC2: Step by Step.

So I’ve got InfluxDB collecting time-series data from half a dozen IoT devices, which feed about 100 points every 10 s. Grafana then runs queries on the database. They are installed like this:

Grafana and InfluxDB are running on one t2.micro instance with 1 GB of RAM. On occasion I get out-of-memory errors from InfluxDB, and the instance has crashed once. These problems will only increase with more devices.

So how to move from this dev-style setup to a (small scale) production setup with some backup and robustness?

Understand Current Setup

Grafana seems lightweight, as seen by checking top:

top -o %MEM

InfluxDB, however, uses 30% of that 1 GB, and for my “low” load 2-4 GB is recommended.

Monitoring Method 1

A simple bash script can be run to monitor memory hogs:

# Log the top memory users every 10 s (redirect output to memory-log.log)
echo "cmd %mem %cpu"
while true; do
    date
    ps -eo cmd,%mem,%cpu --sort=-%mem | head -n 4 | grep -v grep
    sleep 10
done

You can plot a histogram of the %mem like so:

pip install bashplotlib
cat memory-log.log | grep influx | awk '{print $4}' | hist

Monitoring Method 2

Use dstat to monitor and log CPU/memory

sudo yum install dstat    # or apt-get

# dstat.sh: run dstat to record mem/cpu hogs every 60 s
dstat -ta --sys --mem --top-mem --top-cpu --output dstat.csv 60 >> ~/dstat.log

# Start it at boot from cron:
crontab -e
@reboot /home/ec2-user/dstat.sh

More Security

Lock down the firewall to the absolute minimum. Logs show repeated SSH brute-force attacks, so limit SSH access to specific IP addresses.
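
As a sketch, if access is controlled with an EC2 security group, the restriction can be applied with the AWS CLI (the group ID and source address below are placeholders):

# Allow SSH only from one known address (replace the ID and CIDR with your own)
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 22 \
    --cidr 203.0.113.10/32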

Fortify InfluxDB

Analyse Write Rate and Cardinality

Use this to quickly check series cardinality. My cardinality was ~1,000, which is OK (the docs quote 10,000 to 100,000 for single nodes):

show series exact cardinality

Get runtime stats (the output is hard to interpret):

show stats

Enable the “monitor” section in influxdb.conf and view the “_internal” database (don’t leave this enabled in production). View it with a Grafana dashboard.
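
For reference, the relevant settings in the [monitor] section of influxdb.conf look like this (these are the 1.x defaults):

[monitor]
  store-enabled = true          # record internal stats
  store-database = "_internal"  # database they are written to
  store-interval = "10s"        # how often stats are recorded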

From the command line:

$ influx -execute 'select derivative(pointReq, 1s) from "write" where time > now() - 5m' -database '_internal' -precision 'rfc3339'

See all the internal database measurements.


Config

There’s a lot of scope to flood influxd with massive queries. The influxdb.conf file has some options to control the chaos, and these are fairly self-explanatory. Note that it’s up to the clients to retry if they get a write failure.
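
As an illustration, these are the sort of limits available (the option names are from the 1.x config reference; the values shown are just examples, not recommendations):

[coordinator]
  query-timeout = "30s"          # kill queries that run too long
  max-concurrent-queries = 10    # 0 = unlimited
  max-select-point = 10000000    # abort SELECTs that scan too many points
  max-select-series = 100000     # abort SELECTs that touch too many series

[http]
  max-row-limit = 10000              # cap rows returned per query over HTTP
  max-enqueued-write-limit = 10000   # reject writes once the queue backs up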

Memory

I had out-of-memory problems with InfluxDB. I was using a 2 GB t2.small AWS instance with 8 databases. Each database is written to via HTTP in a batch every 10 s, with each batch containing about 10 readings.

Memory problems (see Monitoring, below) occur:

  • during compaction
  • when continuous queries all run at the same time (how to offset them?)
  • whenever it wants!

To improve things, get a bigger machine, and/or:

  • add swap via a swapfile (a sketch follows this list). I couldn’t add ephemeral storage to my instance – maybe not included/possible. So now there is a total of 4+4 GB of memory, with the 4 GB swapfile on the 16 GB disk.
  • use max-concurrent-compactions = 1 in the influxdb.conf file
  • [There are more options in the influxdb.conf to limit resources]
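
A minimal sketch of adding the swapfile (the 4 GB size and /swapfile path are simply what I used; adjust to suit):

sudo fallocate -l 4G /swapfile          # reserve 4 GB on the root disk
sudo chmod 600 /swapfile                # swap must not be world-readable
sudo mkswap /swapfile                   # format it as swap
sudo swapon /swapfile                   # enable it now
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab   # enable at boot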

Further investigation needed.

Monitor InfluxDB

Use Grafana for an overview

Let’s use Grafana and InfluxDB to monitor the InfluxDB host at live.phisaver.com.

Set up a fresh Grafana/InfluxDB pair on a new machine and connect it to the _internal database to view performance. There are a number of ready-made monitoring dashboards available on grafana.com.
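
As a sketch, the _internal database can be added as a Grafana data source through its HTTP API (the credentials and URLs below are placeholders for defaults; normally you’d just do this in the UI):

curl -s -X POST http://admin:admin@localhost:3000/api/datasources \
  -H 'Content-Type: application/json' \
  -d '{
        "name": "influxdb-internal",
        "type": "influxdb",
        "access": "proxy",
        "url": "http://live.phisaver.com:8086",
        "database": "_internal"
      }'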

Simple metrics such as CPU can be monitored this way. Disk space was more difficult (but possible with AWS tools). However, endpoint checking costs extra, and restarting services is not supported. So…

Add the AWS CloudWatch Agent for detailed monitoring and “watches”

This provides disk-free, CPU and memory metrics, in addition to the ‘standard’ metrics.

As per the instructions, install and run it.
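
On Amazon Linux 2 this boils down to roughly the following (the config path is the wizard’s default; treat this as a sketch rather than the authoritative steps):

sudo yum install amazon-cloudwatch-agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard   # writes config.json
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -a fetch-config -m ec2 \
    -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s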

CloudWatch can then be used to monitor these metrics and set alarms (get an SMS!).

Add Monit for detailed control

Monit is a nice system for monitoring and restarting dead services. I chose it because it is easy to use and robust, and there is just enough help on Google to do everything. It’s easy to install on Ubuntu with apt (see here for aws-linux-2 instructions).

It differs from AWS CloudWatch in that it runs on the machine itself and can restart individual processes; CloudWatch looks at the whole machine and cannot restart processes.
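
For example, a minimal sketch of a rule that restarts InfluxDB if the process dies or stops answering its HTTP ping endpoint (the systemd unit name is an assumption based on a standard package install):

check process influxdb matching "influxd"
  start program = "/bin/systemctl start influxdb"
  stop program  = "/bin/systemctl stop influxdb"
  if failed host 127.0.0.1 port 8086 protocol http
    request "/ping"
  then restart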

To check (from another machine) that the webserver is up:

check host live2-phisaver with address live2.phisaver.com
  if failed port 80 protocol http
    and request / with content = "Grafana"
  then alert

Design new setup for backup

So, thinking backwards, how to configure things to:

  • be simple
  • enable fast recovery in case of ec2 server crash
  • backup only the data needed

I use multiple EBS volumes for the instance. This is good Linux practice. I have a separate /var.

Manual Method: Create AMI

  • Just use the AWS console (EC2) to create an image of the instance. This will back up both volumes.

Easy enough, but it’s hard to automate and is not incremental. Still, keeping one AMI of a working system is a good idea to enable fast recovery.
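
Creating the image itself can at least be scripted with the AWS CLI (the instance ID is a placeholder):

# Image the whole instance, including all attached EBS volumes.
# --no-reboot avoids downtime at the cost of a possibly inconsistent filesystem.
aws ec2 create-image \
    --instance-id i-0123456789abcdef0 \
    --name "phisaver-backup-$(date +%Y%m%d)" \
    --no-reboot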

Not a good option for continuous backup

Method 2: AWS Backup

  • Automate easily with Lifecycle Manager
  • To restore, you can “Create an Image” from the snapshot of the EBS volume, and start an instance from that image (a CLI sketch follows this list)
  • Make sure to use “EC2 Instance” as the resource to protect (not the individual volume)
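
Roughly, the restore step looks like this with the AWS CLI (the snapshot ID, device name and architecture are assumptions for a typical Amazon Linux root volume):

# Register an AMI from the root-volume snapshot, then launch an instance from it
aws ec2 register-image \
    --name "phisaver-restore" \
    --architecture x86_64 --virtualization-type hvm \
    --root-device-name /dev/xvda \
    --block-device-mappings '[{"DeviceName":"/dev/xvda","Ebs":{"SnapshotId":"snap-0123456789abcdef0"}}]'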

This is easy to automate with Lifecycle Manager and is incremental. It works well.

A good, simple option.

Disk Configuration and Space

  • Use a / disk and a /var disk
  • Logs can get out of control fast!
    • Check the log settings in grafana.conf (log-level) and influxdb.conf (http-write-log, log-level)
    • Edit /etc/systemd/journald.conf, set the following, then restart systemd-journald
[Journal]
SystemMaxUse=100M
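
Then restart journald and, if you like, trim the existing journal straight away:

sudo systemctl restart systemd-journald
sudo journalctl --vacuum-size=100M    # shrink what’s already on disk to the new cap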

Select an instance size

Select an instance size based on the InfluxDB hardware guidelines: my load is low to moderate. Grafana is on the same machine but should have low impact.

Recommendation for low-to-moderate load:

  • CPU: 4 cores
  • RAM: 8 GB
  • IOPS: 500

At the time of writing (2019), there are “new-gen” T3 and M5 instances. T3s are burstable, which suits this workload: Grafana occasionally hammers InfluxDB, but otherwise there is a low, constant write load. Hence we select from:

  • t3.medium (2 vCPUs, 4 GB)
    • This worked but crashed (out of memory) after a few weeks. It appears steady now with the swapfile and reduced logging, etc.
  • t3.large (2 vCPUs, 8 GB RAM)
    • Works great, at twice the cost.
  • t3.xlarge: twice everything again

 
