A system monitor (SM) in systems engineering is a process within a distributed system for collecting and storing state data. This is a fundamental principle supporting Application Performance Management.
The argument that system monitoring is merely a nice-to-have, and not a core requirement for operational readiness, dissipates quickly when a critical application goes down without warning.
The configuration for the system monitor takes two forms:
- configuration data for the monitor application itself, and
- configuration data for the system being monitored. See: System configuration
The monitoring application needs information such as its log file path and the number of threads to run with. Once the application is running, it needs to know what to monitor and to deduce how to monitor it. Because the configuration data describing what to monitor is also needed in other areas of the system, such as deployment, it should not be tailored specifically to the system monitor but should be a generalized system configuration model.
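The split between the two kinds of configuration can be sketched as follows. This is a minimal illustration, not a prescribed schema: the class names, field names, and the two consumer functions are all hypothetical, chosen only to show one shared model serving both the monitor and another consumer such as deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorSettings:
    """Settings for the monitor application itself (hypothetical fields)."""
    log_file_path: str
    worker_threads: int


@dataclass(frozen=True)
class SystemElement:
    """One entry in the generalized system configuration model (hypothetical)."""
    name: str
    kind: str       # e.g. "host", "device", "application"
    address: str
    protocol: str   # e.g. "snmp", "jmx", "ssh"


def plan_checks(elements):
    """Monitor's view: deduce how to reach each element from the shared model."""
    return {e.name: f"{e.protocol}://{e.address}" for e in elements}


def plan_deployment(elements):
    """A second consumer of the same model, e.g. a deployment tool."""
    return sorted(e.address for e in elements if e.kind == "application")


settings = MonitorSettings(log_file_path="/var/log/monitor.log", worker_threads=4)
elements = [
    SystemElement("router1", "device", "10.0.0.1", "snmp"),
    SystemElement("billing", "application", "10.0.0.20", "jmx"),
]
```

Because `SystemElement` carries nothing monitor-specific, both `plan_checks` and `plan_deployment` can read the same list without either one dictating its shape.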
The performance of the monitoring system has two aspects:
- Impact on system domain or impact on domain functionality: Any element of the monitoring system that prevents the main domain functionality from working is inappropriate. Ideally, monitoring accounts for a tiny fraction of each application's footprint, which calls for simplicity. The monitoring function must be highly tunable to allow for such issues as network performance, improvements to applications during the development life cycle, appropriate levels of detail, etc. The impact on the system's primary function must always be considered.
- Efficient monitoring or ability to monitor efficiently: Monitoring must be efficient, able to handle all monitoring goals in a timely manner, within the desired period. This is most related to scalability. Various monitoring modes are discussed below.
There are many issues involved with designing and implementing a system monitor. Here are a few issues to be dealt with:
- data access
System monitor basics
There are many tools for collecting system data from hosts and devices using the Simple Network Management Protocol (SNMP). Most computers and networked devices have some form of SNMP access. Interpreting the SNMP data from a host or device requires either a specialized tool (typically extra software from the vendor) or a Management Information Base (MIB), a mapping of commands and data references to the various data elements the host or device provides. The advantages of SNMP for monitoring are its low bandwidth requirements and its near-universal use across the industry.
SNMP is not suitable for collecting application data unless the application itself provides a MIB and output via SNMP.
Other protocols are suitable for monitoring applications, such as CORBA (language/OS-independent), JMX (Java-specific management and monitoring protocol), or proprietary TCP/IP or UDP protocols (language/OS independent for the most part).
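As an illustration of the proprietary UDP option, the sketch below defines a tiny message format and a sender. The JSON layout, the field names, and the function names are assumptions made for this example; no particular product uses this format.

```python
import json
import socket


def encode_state(app, state, ts):
    """Serialize one state report as a UTF-8 JSON datagram (hypothetical format)."""
    return json.dumps({"app": app, "state": state, "ts": ts}).encode("utf-8")


def decode_state(datagram):
    """Parse a datagram produced by encode_state back into its three fields."""
    msg = json.loads(datagram.decode("utf-8"))
    return msg["app"], msg["state"], msg["ts"]


def send_state(host, port, app, state, ts):
    """Send one report over UDP; delivery is not guaranteed or acknowledged."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(encode_state(app, state, ts), (host, port))
```

A text-based payload such as JSON keeps the format language- and OS-independent, at the cost of more bytes per message than a binary encoding.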
Data access refers to the interface by which the monitor data can be used by other processes. For example, if the system monitor is a CORBA server, clients can connect and call the monitor for the current state of an element, or for the historical states of an element over some time period.
The system monitor may write data directly into a database, allowing other processes to access the database outside the context of the system monitor. This is risky, however, because the table design of the database then dictates how the data can be shared. Ideally the system monitor is a wrapper around whatever persistence mechanism is used, providing a consistent and 'safe' interface through which others access the data.
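A minimal sketch of such a wrapper, assuming SQLite as the persistence mechanism purely for illustration. The class name `StateStore`, its method names, and the table layout are hypothetical; the point is that callers see only "current state" and "history" calls, never the table itself.

```python
import sqlite3
import time


class StateStore:
    """Hides the persistence mechanism behind a small, stable interface."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS samples ("
            "element TEXT NOT NULL, ts REAL NOT NULL, state TEXT NOT NULL)"
        )

    def record(self, element, state, ts=None):
        """Store one observation; the timestamp defaults to the current time."""
        ts = time.time() if ts is None else ts
        with self._db:  # commits on success, rolls back on error
            self._db.execute(
                "INSERT INTO samples VALUES (?, ?, ?)", (element, ts, state)
            )

    def current_state(self, element):
        """Return the most recent state of an element, or None if unknown."""
        row = self._db.execute(
            "SELECT state FROM samples WHERE element = ? "
            "ORDER BY ts DESC LIMIT 1",
            (element,),
        ).fetchone()
        return row[0] if row else None

    def history(self, element, start, end):
        """Return (timestamp, state) pairs with start <= ts <= end, oldest first."""
        return self._db.execute(
            "SELECT ts, state FROM samples "
            "WHERE element = ? AND ts BETWEEN ? AND ? ORDER BY ts",
            (element, start, end),
        ).fetchall()
```

With this shape, the table design can change, or SQLite can be swapped for another store, without affecting any client that uses only `current_state` and `history`.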
The data collection mode of the system monitor is critical. The modes are: monitor poll, agent push, and a hybrid scheme.
- Monitor poll
- In this mode, one or more processes in the monitoring system poll the system elements in a loop, typically on a dedicated thread. During the loop, devices are polled via SNMP calls, hosts can be accessed via Telnet/SSH to execute scripts, dump files, or execute other OS-specific commands, and applications can be polled for state data or have their state-output files dumped.
- The advantage of this mode is that there is little impact on the host/device being polled. The host's CPU is loaded only during the poll. The rest of the time the monitoring function plays no part in CPU loading.
- The main disadvantage of this mode is that the monitoring process can only do so much within one poll period. If polling takes too long, the effective poll period is lengthened.
- Agent push
- In agent-push mode, the monitored host is simply pushing data from itself to the system monitoring application. This can be done periodically, or by request from the system monitor asynchronously.
- The advantage of this mode is that the monitoring system's load can be reduced to simply accepting and storing data. It doesn't have to worry about timeouts for SSH calls, parsing OS-specific call results, etc.
- The disadvantage of this mode is that the logic for the polling cycle and its options is not centralized at the system monitor but distributed to each remote node. Thus changes to the monitoring logic must be pushed out to each node.
- Also, in agent-based monitoring, a host cannot report that it is completely "down" or powered off, nor that an intermediary system (such as a router) is preventing access to it.
- Hybrid mode
- The middle ground between 'monitor-poll' and 'agent-push' is a hybrid approach, where the system configuration determines where monitoring occurs, either in the system monitor or in an agent. Thus, when applications come up, they can determine for themselves which system elements they are responsible for polling. Ultimately, however, everything must post its monitored data to the system monitor process.
- This is especially useful when setting up a monitoring infrastructure for the first time and not all monitoring mechanisms have been implemented. The system monitor can do all the polling in whatever simple means are available. As the agents become smarter, they can take on more of the load.
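The two basic collection modes can be sketched side by side. This is an illustrative sketch only: `run_poll_cycle`, `PushReceiver`, and their parameters are hypothetical names, and the `poll` callable stands in for whatever SNMP, SSH, or application call is actually used. The poll cycle reports when it overruns its intended period, and the push receiver infers trouble from silence, since a powered-off or unreachable host cannot report its own state.

```python
import time


def run_poll_cycle(targets, poll, period_s, clock=time.monotonic):
    """Monitor-poll mode: poll every target once.

    Returns (results, overran), where overran is True if the cycle took
    longer than the intended poll period.
    """
    started = clock()
    results = {}
    for name in targets:
        try:
            results[name] = poll(name)
        except Exception as exc:  # a failed poll is itself a recorded state
            results[name] = f"error: {exc}"
    elapsed = clock() - started
    return results, elapsed > period_s


class PushReceiver:
    """Agent-push mode: accept reports and flag agents that have gone silent."""

    def __init__(self, stale_after_s):
        self.stale_after_s = stale_after_s
        self.latest = {}     # agent -> last reported state
        self.last_seen = {}  # agent -> time of last report

    def accept(self, agent, state, now):
        """Store a pushed report; the monitor does no polling work here."""
        self.latest[agent] = state
        self.last_seen[agent] = now

    def silent_agents(self, now):
        """Agents with no report for longer than stale_after_s.

        Silence alone cannot distinguish a powered-off host from a
        blocked network path.
        """
        return sorted(
            agent
            for agent, seen in self.last_seen.items()
            if now - seen > self.stale_after_s
        )
```

In a hybrid setup, the results of `run_poll_cycle` and the reports passed to `PushReceiver.accept` would both end up in the same system monitor store, with the system configuration deciding which elements use which path.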