Monitoring

Regular Service Maintenance Checklist

Weekly, High Frequency

Short-interval reviews cover the following items.

<table>
  <tr>
    <th>What</th>
    <th>With</th>
    <th>When</th>
  </tr>
  <tr>
    <td>Server Online State.</td>
    <td>daily log, direct connections, groklog, ping, cacti</td>
    <td>twice weekly</td>
  </tr>
  <tr>
    <td>Service Online State.
      Verify that all specified services are online and operational.</td>
    <td>daily log, direct connections, groklog, cacti</td>
    <td>twice weekly</td>
  </tr>
  <tr>
    <td>Service Performance.
      Verify that services function within the guidelines for the installation.
      Do not make performance changes that promise little improvement.

      Review, and where possible correct, issues that block or degrade
      performance (e.g. delays caused by DNS queries).</td>
    <td>daily log, cacti</td>
    <td>twice weekly</td>
  </tr>
  <tr>
    <td>Service Performance.
      Consider a formal review of new controls and modification changes.
      Where appropriate, document an RFC.</td>
    <td>Review</td>
    <td>half-yearly</td>
  </tr>
  <tr>
    <td>Backup State.
      Review restorability and security aspects of the backup states.</td>
    <td>daily log</td>
    <td>twice weekly</td>
  </tr>
  <tr>
    <td>Resource Utilisation - Disk.
      Disk usage can have a significant impact on performance and is monitored regularly.</td>
    <td>daily log, cacti</td>
    <td>twice weekly</td>
  </tr>
  <tr>
    <td>Resource Utilisation - Other.
      Other system resources monitored for abnormal changes include CPU use patterns and RAM utilisation.</td>
    <td>daily log, cacti, groklog</td>
    <td>twice weekly</td>
  </tr>
  <tr>
    <td>Mail Queue.
      Review unforeseen, extended growth in the incoming and outgoing mail queues.</td>
    <td>daily log, cacti</td>
    <td>twice weekly</td>
  </tr>
  <tr>
    <td>Web Proxy.
      Review proxy logs for noticeable performance issues.</td>
    <td>daily log, cacti</td>
    <td>twice weekly</td>
  </tr>
</table>
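As a rough illustration of the "direct connections" check for Service Online State, the sketch below tries to open a TCP connection to each listed service. It is only a minimal sketch: the function names, the service names and the host/port pairs are hypothetical, and a successful connection shows that something is listening, not that the service is healthy.

```python
import socket


def service_is_online(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_services(services: dict[str, tuple[str, int]]) -> dict[str, bool]:
    """Map each (hypothetical) service name to its online state."""
    return {
        name: service_is_online(host, port)
        for name, (host, port) in services.items()
    }
```

A call such as `check_services({"smtp": ("mail.example.org", 25)})` returns a dictionary of service names and booleans that can be compared against the daily log.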

Monthly

Analyses and reviews that need a longer data collection period (typically a month) include the following.

<table>
  <tr>
    <th>What</th>
    <th>With</th>
    <th>When</th>
  </tr>
  <tr>
    <td>Firewall Reports.
      Review a specific section of the network's firewall logs, seeking insights into security and performance.</td>
    <td>groklog, eyeballs</td>
    <td>monthly</td>
  </tr>
  <tr>
    <td>VPN Report.
      Use and capacity performance report.</td>
    <td>sawmill, webalizer(?)</td>
    <td>bi-monthly</td>
  </tr>
  <tr>
    <td>Web Proxy Report.
      Use and capacity performance report.</td>
    <td></td>
    <td>bi-monthly</td>
  </tr>
</table>

Quarterly

<table>
  <tr>
    <th>What</th>
    <th>With</th>
    <th>When</th>
  </tr>
  <tr>
    <td>Firewall Document.
      Report on the current firewall deployment and potential impact
      of network change proposals.</td>
    <td>groklog, eyeballs</td>
    <td>semi-annually</td>
  </tr>
</table>

Annually

<table>
  <tr>
    <th>What</th>
    <th>With</th>
    <th>When</th>
  </tr>
  <tr>
    <td>Authentication Keys.
      Distribute new public SSH keys for all managed servers.
      Ensure managed hosts get new public keys as a security measure.</td>
    <td>daily log</td>
    <td>annually</td>
  </tr>
  <tr>
    <td>Host Build Test Cycle.
      Each host will be given a full pre-release security audit and review.</td>
    <td>eyeball</td>
    <td>on commission</td>
  </tr>
  <tr>
    <td>Performance Assessment.
      Should the client concur, a review can be made assessing the capacity of the existing host to achieve the required performance for the next 12 months.</td>
    <td></td>
    <td>on commission</td>
  </tr>
</table>

A checklist is useful in ensuring a minimal level of consistency. Of course, we accept that checklists do not by themselves ensure a quality service. That still rests in the hands of the code monkey, or admin monkey, carrying the checklist around.

In the absence of a good automated system, the more mundane manual process still needs to be completed.

What activities do we want to perform, and why are they beneficial for us and our clients?

Below is a quick list of activities that should/could be reviewed, with an argument (for or against) for their value to clients and Nullcube. It is otherwise known as the feature checklist of what would be nice in a scalable monitoring system.

The suggestion to use Control, Histogram, and Pareto Charts allows us to discuss Policy Procedures. The nature of the charts gives specific data points that can be connected to specific Policy or Flag items, enabling both Nullcube and clients to pre-allocate behaviour and resources.

We all benefit from a visual indicator that can become a common “language” between people of different skill levels and orientations.
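To make the link between a control chart and a Flag item more concrete, here is a minimal sketch of how samples might be checked against control bars. The `ControlBars` class, the function names and the run-length threshold are all hypothetical and only illustrate the idea of flagging a metric that stays outside a bar for too long.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ControlBars:
    """Lower and upper control bars for one monitored metric (hypothetical)."""
    lower: float
    upper: float


def classify(value: float, bars: ControlBars) -> str:
    """Return 'below', 'above' or 'within' relative to the control bars."""
    if value < bars.lower:
        return "below"
    if value > bars.upper:
        return "above"
    return "within"


def longest_run(samples: Sequence[float], bars: ControlBars, state: str) -> int:
    """Length of the longest consecutive run of samples in the given state."""
    best = current = 0
    for value in samples:
        current = current + 1 if classify(value, bars) == state else 0
        best = max(best, current)
    return best


def raise_flags(samples: Sequence[float], bars: ControlBars,
                max_run_outside: int) -> list[str]:
    """Flag any run outside a control bar longer than max_run_outside samples."""
    flags = []
    for state in ("below", "above"):
        run = longest_run(samples, bars, state)
        if run > max_run_outside:
            flags.append(f"{state} control bar for {run} consecutive samples")
    return flags
```

With one sample per day, `max_run_outside=7` would correspond to a policy of the form "outside the bar for more than a week triggers a review".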

All Hosts

For all managed hosts, the following are points of interest to regularly monitor.

<table>
  <tr>
    <th>Item</th>
    <th>Monitor</th>
    <th>Purpose</th>
    <th>Audit tools</th>
  </tr>
  <tr>
    <td>Uptime</td>
    <td>Control Chart</td>
    <td>Set boundaries for median and maximum uptime.

      The graph should be linear. The points of interest lie below a pre-determined minimum control bar and above a pre-determined maximum control bar. Where the system spends too much time below the minimum uptime (e.g. with a minimum uptime of 2 days, a machine that stays below that bar for more than a week), this should flag a review of the installation. There should also be a maximum number of days live; above this control we should begin to worry whether the system can survive a restart on the occasion of a <strong>forced restart</strong>.

      Clients may neglect, or may themselves be unaware of, power cycling of servers caused by issues such as a short-term power failure on site.

      Benefits - see above reference to Policy and Behaviour</td>
    <td>groklog, daily output</td>
  </tr>
  <tr>
    <td>Resource Utilisation</td>
    <td>Control Charts</td>
    <td>Set boundaries for median and maximum disk use.

      The maximum value is critical, but we also need to know whether a pattern
      of use is systematically driving use towards the control borders.
      <table>
        <tr>
          <th>Item</th>
          <th>Description</th>
          <th>Tool</th>
        </tr>
        <tr>
          <td>Disk</td>
          <td>Load, expansion</td>
          <td></td>
        </tr>
        <tr>
          <td>RAM</td>
          <td>Load, expansion</td>
          <td>Cacti</td>
        </tr>
        <tr>
          <td>CPU</td>
          <td>Load, expansion</td>
          <td>Cacti</td>
        </tr>
      </table>
    </td>
    <td>groklog</td>
  </tr>
  <tr>
    <td>Network Links</td>
    <td>Control Chart</td>
    <td>Set boundaries for median and maximum state behaviour of the link.

      Benefits - see above reference to Policy and Behaviour</td>
    <td>groklog, netstat</td>
  </tr>
  <tr>
    <td>Changes to Configuration Files</td>
    <td>Change List</td>
    <td>Track changes to configuration files such as:
      <table>
        <tr>
          <td>/etc/pf.conf</td>
          <td>Firewall rules</td>
        </tr>
        <tr>
          <td>/etc/rc, /etc/rc.local, /etc/rc.conf.local, /etc/login.conf, root's cron</td>
          <td>Startup changes</td>
        </tr>
        <tr>
          <td>/etc/mail/*, /etc/samba/*, /etc/squid/*</td>
          <td>Sendmail, Samba and Squid configuration files</td>
        </tr>
        <tr>
          <td>/home/*/.ssh; /home/*/.bash_profile; /home/*/.profile</td>
          <td></td>
        </tr>
      </table>
    </td>
    <td>groklog</td>
  </tr>
</table>
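The Change List for configuration files could, for instance, be built by comparing checksum snapshots taken on successive runs. The following is only a sketch of that idea, not a description of what groklog does; `snapshot` and `diff_snapshots` are hypothetical helper names, and how snapshots are stored between runs is left open.

```python
import glob
import hashlib
from pathlib import Path


def snapshot(patterns: list[str]) -> dict[str, str]:
    """Map every regular file matching the glob patterns to its SHA-256 digest."""
    digests = {}
    for pattern in patterns:
        for name in glob.glob(pattern):
            path = Path(name)
            if not path.is_file():
                continue  # directories such as ~/.ssh need their own pattern
            try:
                digests[name] = hashlib.sha256(path.read_bytes()).hexdigest()
            except OSError:
                digests[name] = "unreadable"  # e.g. permission denied
    return digests


def diff_snapshots(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Return the files added, removed and changed between two snapshots."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

Patterns such as `"/etc/pf.conf"` or `"/etc/mail/*"` from the table above can be passed directly. A directory like `/home/*/.ssh` would need a pattern that reaches the files inside it (for example `"/home/*/.ssh/*"`), since only regular files are hashed.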
Specific Services

For special services on hosts, below is the beginning of a list of issues to monitor.

<table>
  <tr>
    <th>Service</th>
    <th>Item</th>
    <th>Monitor</th>
    <th>Purpose</th>
    <th>Audit Tools</th>
  </tr>
  <tr>
    <td>Firewall</td>
    <td>Traffic</td>
    <td>Control, Histogram, and Pareto Chart</td>
    <td>Control Charts can be used for visualising overall traffic patterns as well as behavioural changes
      for different types of traffic.

      Histogram Chart:

      Pareto Chart: Visually highlight volume differences in types of traffic.

      Benefits - see above reference to Policy and Behaviour</td>
    <td>groklog</td>
  </tr>
  <tr>
    <td>Mail Server</td>
    <td>Mail Queue</td>
    <td>Control, Histogram Chart</td>
    <td>Set boundaries for median and maximum disk use.

      Benefits - see above reference to Policy and Behaviour</td>
    <td>groklog</td>
  </tr>
  <tr>
    <td>Mail Server</td>
    <td>Traffic</td>
    <td>Control, Histogram Chart</td>
    <td>There are various issues with mail traffic that should be of value to ourselves and to clients. Charting some of these would significantly improve our ability to react.
      <table>
        <tr>
          <th>Item</th>
          <th>Description</th>
        </tr>
        <tr>
          <td>Activity</td>
          <td>End-user activity. A histogram highlighting the extremes of user activity. Heavy users provide a pattern of behaviour. We anticipate that the interest will mostly be in the extremes. Someone sending out 10GB of email should raise some sort of flag somewhere. A sudden major increase in use should also raise a flag.</td>
        </tr>
        <tr>
          <td>Denied/Failed</td>
          <td>Denied and failed send-to accounts could imply a user error or some sort of software misconfiguration. A large volume of denials/failures may be a security data point.
            If the user has failed to adjust their behaviour in response to the mail server error messages, then we may
            need to look at other means of resolving the problem.</td>
        </tr>
        <tr>
          <td>Denied/Failed</td>
          <td>Analysis of high volumes of incoming denied/failed mail will give us a better lead on DoS and spam.</td>
        </tr>
      </table>
      Benefits - see above reference to Policy and Behaviour</td>
    <td>groklog</td>
  </tr>
</table>
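The two mail activity flags described above (a very large sender, and a sudden major increase in use) can be sketched as a simple comparison of per-user totals between two reporting periods. This is an illustrative sketch only: the function name, the input dictionaries and the `growth_factor` threshold are assumptions, and only the 10GB figure comes from the text.

```python
def flag_mail_users(current: dict[str, int],
                    previous: dict[str, int],
                    volume_limit: int,
                    growth_factor: float) -> dict[str, list[str]]:
    """Return users whose sent volume (in bytes) warrants a flag, with reasons.

    current / previous: bytes sent per user in this and the prior period.
    volume_limit: absolute volume that always raises a flag.
    growth_factor: hypothetical multiplier that counts as a sudden increase.
    """
    flags: dict[str, list[str]] = {}
    for user, sent in current.items():
        reasons = []
        if sent >= volume_limit:
            reasons.append("volume over limit")
        before = previous.get(user, 0)
        # Users with no earlier traffic are skipped for the growth check.
        if before > 0 and sent >= before * growth_factor:
            reasons.append("sudden increase")
        if reasons:
            flags[user] = reasons
    return flags
```

For example, `volume_limit=10 * 1024**3` matches the 10GB case mentioned in the table, while a `growth_factor` of 5 would flag any user whose volume grew fivefold from one period to the next.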