User:Meelislubi
http://www.itlibrary.org/index.php?page=Incident_Management
IT service monitoring
IT monitoring is the checking of system events in order to notify people according to predefined logic.
IT monitoring can include keeping a record of system event statuses (problem <-> OK), but this is not the primary goal of monitoring.
event -> logic -> trigger -> notification
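The chain above can be sketched in a few lines of Python. This is only an illustration: the `Event` fields, the rule inside `logic` (match the word "error"), and the `notify` callback are assumptions made for the example, not part of any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Event:
    ci: str    # configuration item the event came from (hypothetical field)
    text: str  # raw event text


def logic(event: Event) -> bool:
    """Predefined logic: an assumed rule that treats 'error' in the text as a problem."""
    return "error" in event.text.lower()


def trigger(event: Event) -> Optional[str]:
    """Return a message only when the logic matches, otherwise None."""
    if logic(event):
        return f"{event.ci}: {event.text}"
    return None


def process(events: List[Event], notify: Callable[[str], None]) -> None:
    """event -> logic -> trigger -> notification"""
    for event in events:
        message = trigger(event)
        if message is not None:
            notify(message)


# Usage: collect notifications in a list instead of really sending them.
sent: List[str] = []
process(
    [Event("db01", "ERROR: disk full"), Event("web01", "request served")],
    sent.append,
)
print(sent)  # ['db01: ERROR: disk full']
```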
To monitor IT services you can only monitor CIs (configuration items); by doing so you also monitor the services that depend on them, as a CI can have a one-to-one relation with an IT service.
SLA OLA
Objects in question
- Event - an event in the system
- Incident - (ITILv3) An unplanned interruption to an IT Service or a reduction in the Quality of an IT Service.
- Great risk increase -
Customers usually want to be notified about these, but they do not meet the definition of an incident, as there is no impact on service or quality (yet).
- Problem - (out of monitoring scope, but referenced as the next step)
DEFINE: event DEFINE: message
- Passive monitoring - monitoring of logs
- Active monitoring - emulating a user (logging in, checking a balance, making a payment ...)
Checking methods
- Checks for error
  - Alerts on Errors (excludes unknown) (stateless monitoring)
  - (Optional) Alerts on Success after error (usually hard to accomplish, as the tool is already designed for error monitoring only) (state based monitoring)
- Checks for Success (state based monitoring)
  - Alerts on non-Success (includes unknown)
  - Alerts on Success after non-Success
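The difference between the two checking methods shows up mainly in how an unknown result is treated. A minimal sketch, assuming a three-valued check result; the names `Result`, `alert_on_error_check` and `alert_on_success_check` are hypothetical and chosen only for this example.

```python
from enum import Enum
from typing import Optional


class Result(Enum):
    SUCCESS = "success"
    ERROR = "error"
    UNKNOWN = "unknown"  # e.g. the check timed out or returned nothing usable


def alert_on_error_check(result: Result) -> bool:
    """Check for error: alert only on an explicit error, so unknown is excluded."""
    return result is Result.ERROR


def alert_on_success_check(previous: Optional[Result], current: Result) -> Optional[str]:
    """Check for success: alert on anything that is not success, and on recovery."""
    if current is not Result.SUCCESS:
        return "PROBLEM"  # includes UNKNOWN
    if previous is not None and previous is not Result.SUCCESS:
        return "RECOVERY"  # success after non-success
    return None


print(alert_on_error_check(Result.UNKNOWN))                    # False
print(alert_on_success_check(None, Result.UNKNOWN))            # PROBLEM
print(alert_on_success_check(Result.ERROR, Result.SUCCESS))    # RECOVERY
```

With the error check, an unknown result passes silently; with the success check it raises an alert, which is why the second method covers more failure modes.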
Alerting
Monitoring View
Stateless
Messages simply arrive; it is not possible to tell when errors have been fixed.
Sometimes it is presumed that if a message does not repeat, the event / incident has ended (error not found).
But this may not always be so, as the check is performed for errors, not for success.
State based
Messages arrive when errors occur (non-Success).
Messages are closed automatically when the error situation is over (Success reached).
A state-based event can also have an unknown state.
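A state-based view can be sketched as a small table of open messages that is updated on every check result. The class name `StateBasedView`, the state strings and the check name `web01/http` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

OK, PROBLEM, UNKNOWN = "OK", "PROBLEM", "UNKNOWN"


class StateBasedView:
    def __init__(self) -> None:
        self.open_messages: Dict[str, str] = {}  # check name -> current non-OK state
        self.log: List[Tuple[str, str]] = []     # (action, check name)

    def update(self, check: str, state: str) -> None:
        if state == OK:
            # Success reached: close the message automatically if one is open.
            if check in self.open_messages:
                del self.open_messages[check]
                self.log.append(("closed", check))
        else:
            # Non-success (PROBLEM or UNKNOWN): open a message once, then track its state.
            if check not in self.open_messages:
                self.log.append(("opened", check))
            self.open_messages[check] = state


view = StateBasedView()
for state in (OK, PROBLEM, PROBLEM, UNKNOWN, OK):
    view.update("web01/http", state)

print(view.log)            # [('opened', 'web01/http'), ('closed', 'web01/http')]
print(view.open_messages)  # {}
```

Repeated non-success results do not open new messages here; the single open message just follows the latest state until success closes it.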
Mixed
It is unknown which messages will close automatically and which will not (unless the two kinds are kept separate!).
Actions
- Automatic - the action is run automatically. Example: error detected -> SMS
- Manual - the action is run manually (needs human involvement). Example: call the admin
- Semi-Automatic - a human verifies the error situation -> the machine runs an automatic action
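The three action types differ only in where the human sits in the flow. A rough sketch, in which `send_sms` is a stand-in that records the action instead of sending anything, and the `confirmed_by_human` callback is a hypothetical way of modelling the human verification step:

```python
from typing import Callable, List

performed: List[str] = []  # record of what was done, for the example only


def send_sms(message: str) -> None:
    """Stand-in for a real SMS gateway call (assumption: none is used here)."""
    performed.append(f"SMS: {message}")


def automatic(message: str) -> None:
    """Automatic: the machine acts as soon as the error is detected."""
    send_sms(message)


def manual(message: str) -> None:
    """Manual: nothing runs by itself; a task for a human is recorded."""
    performed.append(f"CALL ADMIN: {message}")


def semi_automatic(message: str, confirmed_by_human: Callable[[str], bool]) -> bool:
    """Semi-automatic: a human verifies first, then the automatic action runs."""
    if confirmed_by_human(message):
        send_sms(message)
        return True
    return False


automatic("db01 disk full")
manual("web01 slow")
semi_automatic("app01 down", confirmed_by_human=lambda m: True)
semi_automatic("app02 down", confirmed_by_human=lambda m: False)
print(performed)
# ['SMS: db01 disk full', 'CALL ADMIN: web01 slow', 'SMS: app01 down']
```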
Methodology
Failure monitoring
Success monitoring (monitoring the monitoring)
NoData triggers
Impact Analysis
A message essentially states the impact on CIs (No Impact, Working Slow, Partly Working, Not Working).
The admin needs the message in order to start fixing: it identifies the object (CI) and the error (message text).
- No Impact - no impact on the service (yet)
- Working Slow - self-explanatory
- Partly Working - non-critical service functionality is affected
- Not Working - critical service functionality is affected
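The four impact levels and the two things an admin needs from a message (the CI and the error text) map naturally onto a small data structure. In this sketch the numeric ordering of the levels and the helper `worst_impact` are assumptions; the source only lists the levels.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable


class Impact(IntEnum):
    # Assumed ordering: higher value means more severe.
    NO_IMPACT = 0
    WORKING_SLOW = 1
    PARTLY_WORKING = 2   # non-critical functionality affected
    NOT_WORKING = 3      # critical functionality affected


@dataclass
class Message:
    ci: str         # the object to fix
    text: str       # the error description
    impact: Impact  # impact on the CI


def worst_impact(messages: Iterable[Message]) -> Impact:
    """Return the most severe impact among the messages, or NO_IMPACT if there are none."""
    return max((m.impact for m in messages), default=Impact.NO_IMPACT)


messages = [
    Message("db01", "replication lag is growing", Impact.WORKING_SLOW),
    Message("web01", "report export fails", Impact.PARTLY_WORKING),
]
print(worst_impact(messages).name)  # PARTLY_WORKING
print(worst_impact([]).name)        # NO_IMPACT
```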