Monitoring
Regular Service Maintenance Checklist
Weekly, High Frequency
Short-interval reviews cover the following.
What | With | When |
Server Online State | daily log, direct connections, groklog, ping, cacti | twice weekly |
Service Online State. Verify that all specified services are online and operational. | daily log, direct connections, groklog, cacti | twice weekly |
Service Performance. Verify services are functional within guidelines for installation. Do not make performance changes with little projected improvement. Review, and where possible correct, issues blocking performance or causing performance problems (e.g. delays caused by DNS queries). | daily log, cacti | twice weekly |
Service Performance. Consider a formal review of new controls and modification changes. Where appropriate, document an RFC. | Review | half-yearly |
Backup State. Review restorability and security aspects of the backup states. | daily log | twice weekly |
Resource Utilisation - Disk. Disk usage can have a significant impact on performance and is monitored regularly. | daily log, cacti | twice weekly |
Resource Utilisation - Other. Other system resources monitored for abnormal changes include CPU use patterns and RAM utilisation. | daily log, cacti, groklog | twice weekly |
Mail Queue. Review unforeseen extended growth in the incoming and outgoing mail queues. | daily log, cacti | twice weekly |
Web Proxy. Review proxy logs for noticeable performance issues. | daily log, cacti | twice weekly |
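As an illustration of the "direct connections" check used for the online-state rows above, here is a minimal Python sketch. It assumes a service counts as online when a TCP connection to its port succeeds; the function names, the timeout, and the shape of the `services` mapping are hypothetical and not part of the existing tooling (groklog, cacti).

```python
import socket


def service_is_online(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, timed out, or name resolution failed.
        return False


def check_services(services: dict) -> dict:
    """Map each service name to its online state.

    `services` is assumed to look like {"smtp": ("mail.example.org", 25)}.
    """
    return {
        name: service_is_online(host, port)
        for name, (host, port) in services.items()
    }
```

A successful connection only shows that something is listening; whether the service is operational within guidelines still needs the log and cacti review listed in the table.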
Monthly
Analyses and reviews that need a longer data collection period (typically a month) include the following.
What | With | When |
Firewall Reports. Review a specific section of the network's firewall logs, seeking insights into security and performance. | groklog, eyeballs | Monthly |
VPN Report. Use and capacity performance report. | sawmill, webalizer(?) | bi-monthly |
Web Proxy Report. Use and capacity performance report. | | bi-monthly |
Quarterly
What | With | When |
Firewall Document. Report on the current firewall deployment and potential impact of network change proposals. | groklog, eyeballs | semi-annually |
Annually
What | With | When |
Authentication Keys. Distribute new public SSH keys for all managed servers. Ensure managed hosts get new public keys as a security measure. | daily log | Annually |
Host Build Test Cycle. Each host will be given a full pre-release security audit and review. | eyeball | on commission |
Performance Assessment. Should the client concur, a review can be made assessing the capacity of the existing host to achieve the required performance for the next 12 months. | | on commission |
A checklist is useful for ensuring a minimal level of consistency. Of course, we accept that checklists do not by themselves ensure a quality service; that still lies in the hands of the code monkey or admin monkey carrying the checklist around.
In the absence of a good automated system, the more mundane manual process still needs to be completed.
Below is a quick list of activities that should or could be reviewed, with an argument (for or against) for their value to clients and Nullcube. It is otherwise known as the feature checklist of what would be nice in a scalable monitoring system.
The suggestion to use Control, Histogram, and Pareto Charts allows us to discuss Policy Procedures. The charts give specific data points that can be connected to specific Policy or Flag items, enabling both Nullcube and clients to pre-allocate behaviour and resources.
We all benefit from a visual indicator that can become a common “language” between people of different skill levels and orientations.
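To make the control chart idea concrete, here is a minimal Python sketch. It assumes the chart has a centre line at the median of the samples and a fixed upper control limit, and that a run of consecutive rising samples counts as a drift towards the limit. The function name, the `run_length` default, and the sample figures are illustrative assumptions, not values taken from any existing Nullcube tooling.

```python
from statistics import median


def control_chart_flags(samples, upper_limit, run_length=5):
    """Summarise a series of measurements against control boundaries.

    Returns a dict with:
      centre   - the median of the samples (the chart's centre line)
      breaches - indices of samples above upper_limit
      drift    - indices at which `run_length` or more consecutive
                 samples have each been higher than the one before
    """
    breaches = [i for i, value in enumerate(samples) if value > upper_limit]
    drift = []
    run = 1
    for i in range(1, len(samples)):
        # Extend the run while values keep rising, otherwise restart it.
        run = run + 1 if samples[i] > samples[i - 1] else 1
        if run >= run_length:
            drift.append(i)
    return {"centre": median(samples), "breaches": breaches, "drift": drift}


# Hypothetical disk-use percentages from successive reviews.
disk_use = [40, 42, 41, 45, 50, 56, 63, 71, 92]
report = control_chart_flags(disk_use, upper_limit=90)
```

Here the drift flag fires several reviews before the limit is breached, which is the kind of data point that can be tied to a Policy or Flag item in advance.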
All Hosts
For all managed hosts, the following are points of interest to regularly monitor.
Item | Monitor | Purpose | Audit tools |
Uptime | Control Chart |
Set boundaries for median and maximum uptime.
The graph should be linear; the points of interest lie below a pre-determined minimum control bar and above a maximum control bar. A system that spends too much time below the minimum uptime should flag a review of the installation (e.g. with a minimum uptime of 2 days, a machine that stays below that bar for more than a week). There should also be a control for the maximum number of days live; above it, we should begin to worry about whether the system can survive a <strong>forced restart</strong>.
Clients may neglect, or may themselves be unaware of, power cycling of servers caused by issues such as short-term power failures onsite.
Benefits - see above reference to Policy and Behaviour
| groklog, daily output |
Resource Utilisation | Control Charts |
Set boundaries for median and maximum use of each resource listed below.
The maximum value is critical, but we also need to know whether a pattern of use is systematically driving utilisation towards the control boundaries.
<table>
<tr>
<th>Item
</th>
<th>Description
</th>
<th>Tool
</th>
</tr>
<tr>
<td>Disk
</td>
<td>Load, expansion
</td>
<td></td>
</tr>
<tr>
<td>RAM
</td>
<td>Load, expansion
</td>
<td>Cacti</td>
</tr>
<tr>
<td>CPU
</td>
<td>Load, expansion
</td>
<td>Cacti</td>
</tr>
</table>
| groklog |
Network Links | Control Chart |
Set boundaries for median and maximum state behaviour of the link.
Benefits - see above reference to Policy and Behaviour
| groklog, netstat |
Changes to Configuration Files | Change List |
Track changes to configuration files such as:
/etc/pf.conf | Firewall rules |
/etc/rc, /etc/rc.local, /etc/rc.conf.local, /etc/login.conf, root's cron | Startup changes |
/etc/mail/*, /etc/samba/*, /etc/squid/* | Sendmail, Samba and Squid configuration files |
/home/*/.ssh; /home/*/.bash_profile; /home/*/.profile | |
| groklog |
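The change list for configuration files in the last row above could be produced by comparing file digests between two reviews. The following Python sketch is one way to do that, under the assumption that a SHA-256 digest per file is an adequate change indicator; `snapshot` and `change_list` are hypothetical helper names, and wildcard entries such as /etc/mail/* would have to be expanded by the caller (for example with `glob.glob`) before being passed in.

```python
import hashlib
from pathlib import Path


def snapshot(paths):
    """Return {path: SHA-256 hex digest} for each readable regular file."""
    digests = {}
    for path in map(Path, paths):
        try:
            if path.is_file():
                digests[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
        except OSError:
            # Unreadable files are skipped in this sketch rather than reported.
            continue
    return digests


def change_list(old, new):
    """Compare two snapshots and list added, removed, and modified paths."""
    common = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "modified": sorted(p for p in common if old[p] != new[p]),
    }
```

A digest comparison shows that a file changed but not what changed; keeping copies of the files, or a diff, would be needed to review the change itself.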
Specific Services
For special services on hosts, below is the beginning of a list of issues to monitor.
Service | Item | Monitor | Purpose | Audit Tools |
Firewall | Traffic | Control, Histogram, and Pareto Chart |
Control Chart: visualise overall traffic patterns as well as behavioural changes for different types of traffic.
Histogram Chart:
Pareto Chart: visually highlight volume differences between types of traffic.
Benefits - see above reference to Policy and Behaviour
| groklog |
Mail Server | Mail Queue | Control, Histogram Chart |
Set boundaries for median and maximum disk use.
Benefits - see above reference to Policy and Behaviour
| groklog |
Mail Server | Traffic | Control, Histogram Chart |
There are various aspects of mail traffic that should be of value to ourselves and to clients. Charting some of these would significantly improve our ability to react.
Item | Description |
Activity | End-user activity. A histogram highlighting the extremes of user activity. Heavy users provide a pattern of behaviour. We anticipate that most of the interest will be in the extremes. Someone sending out 10GB of email should raise some sort of flag somewhere. A sudden major increase in use should also raise a flag. |
Denied/Failed (outgoing) | Denied and failed send-to accounts could imply a user error or some sort of software misconfiguration. A high volume of denials or failures may indicate a potential security issue. If the user has failed to adjust their behaviour in response to the mail server's error messages, we may need to look at other means of resolving the problem. |
Denied/Failed (incoming) | Analysis of a high volume of incoming denied/failed mail will give us a better lead on DoS and spam. |
Benefits - see above reference to Policy and Behaviour
| groklog |
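The flags described for mail activity (a very large sender, or a sudden major increase in use) can be sketched as a simple comparison of per-user volumes between two review periods. In the Python sketch below, the 10GB limit comes from the text above; the growth factor of 5, the function name, and the input format are assumptions made for illustration only.

```python
def flag_senders(previous, current, volume_limit=10 * 1024**3, growth_factor=5.0):
    """Return {user: [reasons]} for users whose sent volume looks extreme.

    previous, current: {user: bytes sent} for two consecutive review periods.
    "volume" means the user sent at least volume_limit bytes this period.
    "growth" means the user sent at least growth_factor times the previous
    period's volume; users with no previous volume are not growth-flagged.
    """
    flags = {}
    for user, sent in current.items():
        reasons = []
        if sent >= volume_limit:
            reasons.append("volume")
        before = previous.get(user, 0)
        if before > 0 and sent >= before * growth_factor:
            reasons.append("growth")
        if reasons:
            flags[user] = reasons
    return flags


# Hypothetical per-user byte counts for two periods.
GIB = 1024**3
last_period = {"alice": 100, "bob": 2 * GIB, "carol": 50}
this_period = {"alice": 1000, "bob": 11 * GIB, "carol": 60, "dave": 5}
flagged = flag_senders(last_period, this_period)
```

The per-user byte counts would have to be extracted from the mail logs first; this sketch only covers the flagging step.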