User:Meelislubi

http://www.itlibrary.org/index.php?page=Incident_Management

IT service monitoring

IT monitoring is the checking of system events in order to inform people, based on predefined logic.
IT monitoring can also include keeping a record of system event statuses (problem <-> OK), but this is not the primary goal of monitoring.

event -> logic -> trigger -> notification
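
The chain above can be sketched in a few lines. This is only an illustration under assumptions: the Event and Trigger types, the "ERROR" text rule and the notify callback are hypothetical and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Event:
    ci: str    # configuration item that produced the event (hypothetical field)
    text: str  # raw event text


@dataclass
class Trigger:
    ci: str
    reason: str


def logic(event: Event) -> Optional[Trigger]:
    """Predefined logic: decide whether an event should fire a trigger."""
    # Assumed rule for illustration only: any text containing "ERROR" fires.
    if "ERROR" in event.text:
        return Trigger(ci=event.ci, reason=event.text)
    return None


def process(events: List[Event], notify: Callable[[str], None]) -> int:
    """Run every event through the logic and notify once per fired trigger."""
    fired = 0
    for event in events:
        trigger = logic(event)
        if trigger is not None:
            notify(f"{trigger.ci}: {trigger.reason}")
            fired += 1
    return fired
```

Passing the notification channel in as a callback keeps the logic independent of how people are actually informed (SMS, e-mail, phone call).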

To monitor IT services it is enough to monitor only the CIs (configuration items): by doing so you also monitor the services that depend on them, as a CI can have a one-to-one relation with an IT service.
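
A minimal sketch of deriving the affected services from failed CIs. The CI and service names and the SERVICES_BY_CI table are made up for illustration; in practice this relation would come from a CMDB.

```python
from typing import Dict, List, Set

# Hypothetical relation table: which IT services depend on which CI.
SERVICES_BY_CI: Dict[str, List[str]] = {
    "db01": ["internet-bank"],
    "mail01": ["email"],
}


def affected_services(failed_cis: List[str]) -> Set[str]:
    """Return the services that depend on any of the failed CIs."""
    services: Set[str] = set()
    for ci in failed_cis:
        # A CI without a known relation affects no service in this sketch.
        services.update(SERVICES_BY_CI.get(ci, []))
    return services
```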

SLA OLA

Objects in question

Event - Event in system
Incident - (ITILv3) An unplanned interruption to an IT Service or a reduction in the Quality of an IT Service.
- Greatly increased risk -

Customers usually want to be notified about such risk-increasing events, but these do not meet the definition of an incident, as there is no impact on the service or its quality (yet).

Problem - (out of monitoring scope, but referenced as the next step)

DEFINE: event 
DEFINE: message

Passive monitoring - monitoring of logs
Active monitoring - emulating a user (logging in, checking the balance, making a payment ...)

Checking methods

Checks for Error (stateless monitoring)
- Alerts on Errors (excludes unknown)
- (Optional) Alerts on Success after Error. (Usually hard to accomplish, as the tool is already designed for Error monitoring only) (state based monitoring)
Checks for Success (state based monitoring)
- Alerts on non-Success (includes unknown)
- Alerts on Success after non-Success
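
The difference between the two checking methods is easiest to see in how they treat an unknown result. A minimal sketch; the Result names are assumptions chosen for illustration.

```python
from enum import Enum


class Result(Enum):
    SUCCESS = "success"
    ERROR = "error"
    UNKNOWN = "unknown"  # e.g. the check timed out or returned no data


def alert_on_error(result: Result) -> bool:
    """Checks for Error: alert only on an explicit error (unknown is excluded)."""
    return result is Result.ERROR


def alert_on_non_success(result: Result) -> bool:
    """Checks for Success: alert on anything but success (unknown is included)."""
    return result is not Result.SUCCESS
```

With an unknown result the first function stays silent and the second one alerts, which is why checking for Success also catches a check that has stopped working.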

Alerting

Monitoring View

Stateless

Messages just keep coming; it is not possible to tell when errors have been fixed.
Sometimes it is presumed that if a message does not repeat, the event / incident has ended (Error not found).
But this is not always so, as the check is performed for the error, not for success.

State based

Messages come when errors occur. (Non-Success)
Messages are closed automatically when the error situation is over. (Success reached)
A state based event can also have an unknown state.
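
A state based view can be sketched as a small tracker that opens a message on the first non-Success and closes it when Success is reached. The class name, the state names and the returned text are hypothetical.

```python
from typing import Dict, Optional


class StateTracker:
    """Keeps one open message per CI and reports only state changes."""

    def __init__(self) -> None:
        # CI -> state ("error" or "unknown") that opened the message
        self.open_messages: Dict[str, str] = {}

    def update(self, ci: str, state: str) -> Optional[str]:
        """state is one of "success", "error", "unknown" (assumed names)."""
        if state == "success":
            if ci in self.open_messages:
                del self.open_messages[ci]
                return f"close {ci}"  # error situation is over
            return None  # nothing was open, nothing to do
        if ci not in self.open_messages:
            self.open_messages[ci] = state
            return f"open {ci} ({state})"  # first non-Success opens a message
        return None  # message already open, do not repeat it
```

Repeated errors do not produce repeated messages here; only the transition into and out of the error situation is reported.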

Mixed

It is unknown which messages will close automatically and which will not (if the two kinds are not kept separate!)

Actions

Automatic - Action is run automatically. Example: error detected -> SMS
Manual - Action is run manually (needs human involvement). Example: call the admin
Semi-Automatic - Verified error situation (human) -> run automatic action (machine)
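
The three kinds of action can be sketched as one dispatch function. The Mode enum, the handle function and the returned status texts are assumptions made for illustration only.

```python
from enum import Enum
from typing import Callable


class Mode(Enum):
    AUTOMATIC = "automatic"
    MANUAL = "manual"
    SEMI_AUTOMATIC = "semi-automatic"


def handle(mode: Mode, action: Callable[[], str], human_verified: bool = False) -> str:
    """Run the action according to its mode and return a status text."""
    if mode is Mode.AUTOMATIC:
        return action()  # machine runs it right away
    if mode is Mode.SEMI_AUTOMATIC:
        # A human verifies the error situation first, then the machine runs the action.
        return action() if human_verified else "waiting for human verification"
    return "waiting for a human to run the action"  # manual: nothing is run here
```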

Methodology

Failure monitoring

Success monitoring (Monitoring monitoring)

NoData triggers

Impact Analysis

A message basically describes the impact on CIs (No Impact, Working Slow, Partly Working, Not Working).
A message is needed for the admin to start fixing: it identifies the object (CI) and the error (message text).

No Impact - no impact on the service (yet)
Working Slow - self-explanatory
Partly Working - non-critical functionality of the service is affected
Not Working - critical functionality of the service is affected
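
The four impact levels can be modelled as an ordered enumeration, so that the worst impact among several messages is easy to find. The numeric ordering is an assumption that follows the order of the list above; the names Impact and worst_impact are hypothetical.

```python
from enum import IntEnum
from typing import Iterable


class Impact(IntEnum):
    # Assumed severity order, lowest to highest.
    NO_IMPACT = 0
    WORKING_SLOW = 1
    PARTLY_WORKING = 2
    NOT_WORKING = 3


def worst_impact(impacts: Iterable[Impact]) -> Impact:
    """Return the most severe impact, or NO_IMPACT when there are no messages."""
    return max(impacts, default=Impact.NO_IMPACT)
```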