Current Setup
This section is background – the installation itself is covered in Installing Grafana, InfluxDB, Apache2 on AWS EC2: Step by Step.
So I’ve got InfluxDB collecting time-series data from half a dozen IoT devices, which feed in about 100 points every 10s. Grafana runs queries on the database. I installed them as described in that post.
Grafana and InfluxDB are running on one t2.micro instance with 1Gb RAM. On occasion I get out-of-memory errors from InfluxDB, and the instance has crashed once. These problems will only increase with more devices.
So how do I move from this dev-style setup to a (small-scale) production setup with some backup and robustness?
Understand Current Setup
Grafana seems lightweight, as seen by checking top:
ec2> top -o %MEM
InfluxDB, however, uses 30% of the 1Gb, and 2-4Gb is recommended even for my “low load”.
Monitoring Method 1
A simple bash script can be run to monitor memory hogs:
# log the top memory hogs every 10s; redirect the output to memory-log.log
echo cmd,mem,cpu
while true
do
    date
    ps -eo cmd,%mem,%cpu --sort=-%mem | head -n 4 | grep -v grep
    sleep 10
done
You can plot a histogram of the %mem like so:
pip install bashplotlib
cat memory-log.log | grep influx | awk '{print $4}' | hist
Monitoring Method 2
Use dstat to monitor and log CPU/memory
sudo yum install dstat # or apt-get
#!/bin/bash
# dstat.sh: run dstat to record mem/cpu hogs every 60s
dstat -ta --sys --mem --top-mem --top-cpu --output dstat.csv 60 >> ~/dstat.log
crontab -e
# cron
@reboot /home/ec2-user/dstat.sh
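The script must be executable for cron to run it, and it’s worth confirming after a reboot that dstat is actually running and logging (paths as above):
chmod +x /home/ec2-user/dstat.sh
# after a reboot, confirm dstat is running and writing its log
pgrep -af dstat
tail -n 5 ~/dstat.log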
More Security
Lock down the firewall to the absolute minimum. My logs show repeated SSH brute-force attacks, so limit SSH to specific IPs.
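One way to do this is via the EC2 security group. A minimal sketch with the AWS CLI – the group ID and admin IP below are placeholders for your own:
# drop the world-open SSH rule, then allow only a single admin IP
aws ec2 revoke-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 22 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 22 --cidr 203.0.113.10/32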
Fortify InfluxDB
Analyse Write Rate and Cardinality
Use this to quickly check cardinality. Mine was ~1000, which is OK (the docs give 10,000 to 100,000 for single nodes):
SHOW SERIES EXACT CARDINALITY
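You can run it from the shell too (the database name here is just a placeholder):
influx -database 'iot' -execute 'SHOW SERIES EXACT CARDINALITY'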
Get runtime stats (the output is hard to interpret):
SHOW STATS
Enable “monitor” in influxdb.conf and view the “_internal” database (don’t leave this on in production). View it with a Grafana dashboard.
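For reference, the relevant block in the config looks roughly like this on InfluxDB 1.x (these are the defaults, as I understand them):
[monitor]
  store-enabled = true          # collect internal stats
  store-database = "_internal"  # write them to this database
  store-interval = "10s"        # at this interval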
From the command line:
$ influx -execute 'select derivative(pointReq, 1s) from "write" where time > now() - 5m' -database '_internal' -precision 'rfc3339'
You can also see all the internal database measurements.
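For example, from the shell:
influx -database '_internal' -execute 'SHOW MEASUREMENTS'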
Config
There’s a lot of scope to flood influxd with massive queries. I noticed that influxdb.conf has some options to control the chaos; they are fairly self-explanatory. It’s up to the clients to retry if they get a write failure.
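As a sketch, these are the kinds of limits I mean – the option names are from the InfluxDB 1.x config reference, and the values are illustrative, not recommendations:
[coordinator]
  max-concurrent-queries = 10   # reject queries beyond this many at once (0 = unlimited)
  query-timeout = "60s"         # kill queries that run longer than this
  log-queries-after = "10s"     # log slow queries for later analysis

[http]
  max-concurrent-write-limit = 0   # 0 = unlimited concurrent writes
  max-enqueued-write-limit = 0     # 0 = unlimited queued writes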
Memory
I had out-of-memory problems with InfluxDB. I was using a 2Gb t2.small AWS instance with 8 databases. Each database receives a batched HTTP write every 10s; each batch contains about 10 readings.
Memory problems (see Monitoring, below) occur when:
- Compaction
- Continuous queries: they all run at the same time – how to offset them?
- Whenever it wants!
To improve, get a bigger machine, and/or:
- Add swap via a swapfile (see the sketch after this list). I couldn’t add ephemeral storage to my instance – maybe not included/possible. So now there is a total of 4+4Gb of memory, with the 4Gb swapfile on the 16Gb disk.
- use max-concurrent-compactions = 1 in the influxdb.conf file
- [There are more options in the influxdb.conf to limit resources]
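A minimal sketch of adding the 4Gb swapfile (sizes and paths are from my setup; adjust to suit):
sudo fallocate -l 4G /swapfile   # or: sudo dd if=/dev/zero of=/swapfile bs=1M count=4096
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab   # persist across reboots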
Further investigation is needed here.
Monitor InfluxDB
Use Grafana for an overview
Let’s use Grafana and InfluxDB to monitor the InfluxDB host at live.phisaver.com.
Set up a fresh Grafana/InfluxDB pair on a new machine and connect it to the _internal database to view performance. There are a number of ready-made monitoring dashboards on grafana.com.
Simple metrics such as CPU can be monitored. Disk space was more difficult (but possible with AWS tools). However, endpoint checking is a paid feature, and restarting services is not supported. So…
Add the AWS CloudWatch Agent for detailed monitoring and “watches”
This provides disk-free, CPU and memory metrics, in addition to the ‘standard’ metrics.
As per the instructions, install and run it.
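On Amazon Linux 2 that is roughly the following – commands and paths are from the AWS docs as I recall them, so check the current instructions:
sudo yum install -y amazon-cloudwatch-agent
# answer the wizard's questions to generate config.json
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
# load the config and start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s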
CloudWatch can then be used to monitor and set alarms (get an SMS!).
Add Monit for detailed control
Monit is a nice system for monitoring and restarting dead services. I chose it because it is easy to use and robust, and there is just enough help on Google to do everything. It’s easy to install on Ubuntu with apt (see here for aws-linux-2 instructions).
It differs from AWS CloudWatch in that it runs on the machine and can restart processes, etc. CloudWatch looks at the whole machine and cannot restart individual processes.
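As a sketch, a Monit process check that restarts InfluxDB if it dies or stops answering – the service name, match pattern and ping endpoint assume a standard systemd install, so adjust to yours:
check process influxdb matching "influxd"
    start program = "/bin/systemctl start influxdb"
    stop program  = "/bin/systemctl stop influxdb"
    if failed port 8086 protocol http request "/ping" then restart
    if 3 restarts within 5 cycles then alert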
To check (from another machine) that the webserver is up:
check host live2-phisaver with address live2.phisaver.com
if failed port 80 protocol http
and request / with content = "Grafana"
then alert
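After adding a check, validate the control file and reload Monit:
sudo monit -t        # syntax-check the configuration
sudo monit reload
sudo monit status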
Design new setup for backup
So, thinking backwards, how to configure things to:
- be simple
- enable fast recovery in case of ec2 server crash
- backup only the data needed
I use multiple EBS volumes for the instance. This is good Linux practice. I have a separate /var.
Manual Method: Create AMI
- Just use the AWS EC2 console to create an image of the instance. This will back up both volumes.
Easy enough, but it can’t be automated easily and the backup is not incremental. Still, keeping one AMI of a working system is a good idea to enable fast recovery.
Not a good option for continuous backup
Method 2: AWS Backup
- Automates easily with Lifecycle Manager
- To restore, you can “Create an Image” from the snapshot of the EBS volume and start an instance from it
- Make sure to use “EC2 Instance” as the resource to protect (not the individual volume)
This is easy to automate with Lifecycle Manager and is incremental. It works well.
A good, simple option.
Disk Configuration and Space
- Use a / disk and a separate /var disk
- Logs can get out of control fast!
- Check the logging settings in Grafana’s grafana.ini (log level) and in influxdb.conf (HTTP request logging, log level)
- Edit /etc/systemd/journald.conf, set the following, then restart systemd-journald:
[Journal]
SystemMaxUse=100M
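To apply it and confirm:
sudo systemctl restart systemd-journald
journalctl --disk-usage   # confirm the journal stays under the limit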
Select an instance size
Select an instance size based on the InfluxDB hardware guidelines: my load counts as low-moderate. Grafana is on the same machine but should be low impact.
The recommendation for a low-moderate load is:
- CPU: 4 core
- RAM: 8Gb
- IOPS: 500
At the time of writing (2019), there are “new-gen” T3 and M5 instances. T3s are burstable, which suits InfluxDB: Grafana occasionally hammers it, but otherwise there is a low, constant write load. Hence we select:
- t3.medium (2 vCPUs, 4Gb)
- this worked but crashed (out of memory) after some weeks. It appears steady now with the swap file, reduced logging, etc.
- t3.large (2 vCPUs, 8Gb RAM)
- works great, at twice the cost
- t3.xlarge: twice everything