
User:Meelislubi

From Wikipedia, the free encyclopedia
http://www.itlibrary.org/index.php?page=Incident_Management

IT service monitoring

IT monitoring is the checking of system events in order to notify people according to predefined logic.
IT monitoring can also keep a record of system event states (problem <-> OK), but this is not the primary goal of monitoring.

event -> logic -> trigger -> notification
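The chain above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not a description of any particular monitoring tool. The Event and Trigger classes, the process function and the db01 example are all hypothetical. Here the "logic" is a predicate attached to each trigger, and "notification" is any callable, such as an SMS or e-mail sender.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Event:
    source: str  # hypothetical: name of the CI that emitted the event
    status: str  # "problem" or "ok"
    text: str


@dataclass
class Trigger:
    name: str
    condition: Callable[[Event], bool]  # the predefined logic


def process(event: Event, triggers: List[Trigger], notify: Callable[[str], None]) -> List[str]:
    """Pass one event through event -> logic -> trigger -> notification.

    Returns the names of the triggers that fired.
    """
    fired = []
    for trigger in triggers:
        if trigger.condition(event):                               # logic
            fired.append(trigger.name)                             # trigger
            notify(f"[{trigger.name}] {event.source}: {event.text}")  # notification
    return fired


# Example: a single trigger that fires on problem events from db01.
triggers = [Trigger("db-problem", lambda e: e.source == "db01" and e.status == "problem")]
sent: List[str] = []
process(Event("db01", "problem", "connection refused"), triggers, sent.append)
# sent now holds one notification: "[db-problem] db01: connection refused"
```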


IT services cannot be monitored directly: you can only monitor CIs (configuration items). Monitoring a CI also covers the IT services that depend on it, because a CI can have a one-to-one relation with an IT service.
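One way to see this indirect service monitoring is to keep a dependency map from services to CIs and look up which services are affected when a CI fails. The map below is hypothetical, and so are the service and CI names. A CI may also be shared by several services, as db01 is here.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: which IT services rely on which CIs.
SERVICE_DEPENDENCIES: Dict[str, Set[str]] = {
    "online-banking": {"web01", "db01"},
    "payments": {"db01", "payment-gw"},
    "intranet": {"web02"},
}


def affected_services(failed_ci: str, deps: Dict[str, Set[str]]) -> List[str]:
    """Return, sorted by name, the services that depend on the failed CI."""
    return sorted(service for service, cis in deps.items() if failed_ci in cis)
```

A problem reported for db01 therefore also marks online-banking and payments as potentially affected.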

SLA (Service Level Agreement), OLA (Operational Level Agreement)

Objects in question

  • Event - an occurrence (change of state) in a system
  • Incident - (ITILv3) An unplanned interruption to an IT Service or a reduction in the Quality of an IT Service.
    • Greatly increasing risk - customers usually want to be notified of these situations, but they do not meet the definition of an incident, as there is no impact on the service or its quality (yet).

  • Problem - out of monitoring scope, but referenced as the next step
DEFINE: event 
DEFINE: message


  • Passive monitoring - monitoring of logs
  • Active monitoring - emulating a user (logging in, checking the balance, making a payment, ...)
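The difference between the two approaches can be sketched briefly. Passive monitoring only reads what the system already writes, such as log lines. Active monitoring runs a scripted user journey and reports the first step that fails. The ERROR/FATAL log convention and the step names are assumptions made for illustration. In a real setup, each step would call the actual application.

```python
import re
from typing import Callable, Iterable, List, Optional, Tuple

# Assumed log convention: error lines contain the word ERROR or FATAL.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL)\b")


def passive_check(log_lines: Iterable[str]) -> List[str]:
    """Passive monitoring: scan existing log lines for error markers."""
    return [line for line in log_lines if ERROR_PATTERN.search(line)]


def active_check(steps: List[Tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Active monitoring: replay a scripted user journey step by step.

    Returns the name of the first failing step, or None if all steps succeed.
    A step that raises an exception counts as failed.
    """
    for name, step in steps:
        try:
            ok = step()
        except Exception:
            ok = False
        if not ok:
            return name
    return None
```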

Checking methods

  • Checks for error
    • Alerts on Errors (excludes unknown) (stateless monitoring)
    • (Optional) Alerts on Success after error. (Usually hard to accomplish, as the tool is designed only for error monitoring) (state based monitoring)
  • Checks for Success (state based monitoring)
    • Alerts on non-Success (includes unknown)
    • Alerts on Success after non-Success
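The two checking methods can be compared on the same series of check results. In this sketch, a result is "error", "success", or None when no data arrived (unknown). Checking for errors alerts only on explicit errors and never notices the unknown result. Checking for success opens one alert on any non-success, including unknown, and sends a recovery alert once success is seen again. The function names and the result encoding are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Result = Optional[str]  # "error", "success", or None when no data arrived (unknown)


def check_for_error(results: List[Result]) -> List[Tuple[int, str]]:
    """Stateless: alert on every error result; unknown (None) raises nothing."""
    return [(i, "problem") for i, r in enumerate(results) if r == "error"]


def check_for_success(results: List[Result]) -> List[Tuple[int, str]]:
    """State based: alert once on non-success (unknown included), then on recovery."""
    alerts = []
    in_problem = False
    for i, r in enumerate(results):
        if r != "success" and not in_problem:
            alerts.append((i, "problem"))
            in_problem = True
        elif r == "success" and in_problem:
            alerts.append((i, "recovered"))
            in_problem = False
    return alerts
```

For the series ["success", None, "success"], check_for_error raises nothing, while check_for_success reports a problem at index 1 and a recovery at index 2.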

Alerting

Monitoring View

Stateless

Messages simply keep arriving, and it is not possible to tell when errors have been fixed.
Sometimes it is presumed that if a message does not repeat, the event or incident has ended (error not found).
However, this is not necessarily true, because the check looks for errors, not for success.
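The "no repeat means resolved" assumption can be written as a simple quiet-period heuristic. This sketch only illustrates the guess described above. The function name and the quiet_period parameter are hypothetical. Silence makes the function return True even when the check itself has stopped running, which is exactly why stateless monitoring cannot confirm that an error is fixed.

```python
from typing import List


def presumed_resolved(error_times: List[float], now: float, quiet_period: float) -> bool:
    """Stateless heuristic: treat the event as over if no error message has
    repeated within `quiet_period` seconds.

    This is only a guess. Since success is never reported, silence may also
    mean the check stopped running.
    """
    if not error_times:
        return True
    return now - max(error_times) > quiet_period
```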

State based

Messages are raised when errors occur (non-success).
Messages are closed automatically when the error situation is over (success reached).
A state-based event can also have an unknown state.
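A state-based monitoring view can be sketched as a table that holds at most one open message per CI. A non-success state (problem or unknown) opens a message or updates the existing one, and success closes it automatically. The class and field names are hypothetical. Keeping a single message per CI, rather than one per check, is a design choice of this sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class State(Enum):
    OK = "ok"
    PROBLEM = "problem"
    UNKNOWN = "unknown"


@dataclass
class Message:
    ci: str
    state: State
    is_open: bool = True


class StateBasedView:
    """Keeps at most one open message per CI and closes it on success."""

    def __init__(self) -> None:
        self.open_messages: Dict[str, Message] = {}
        self.closed: List[Message] = []

    def update(self, ci: str, state: State) -> None:
        msg = self.open_messages.get(ci)
        if state is State.OK:
            if msg is not None:  # success reached: close the message automatically
                msg.is_open = False
                self.closed.append(self.open_messages.pop(ci))
        elif msg is None:        # non-success (problem or unknown): open a message
            self.open_messages[ci] = Message(ci, state)
        else:                    # still non-success: keep one message, record the latest state
            msg.state = state
```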

Mixed

Unless stateless and state-based messages are kept separate, it is unknown which messages will close automatically and which will not.
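One way to avoid this ambiguity is to tag every message with the kind of check that produced it, so the view can separate the two groups. The Message fields and the split_view function are hypothetical. This is a minimal illustration of the separation, not the only way to achieve it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Message:
    ci: str
    text: str
    stateful: bool  # True: closed automatically on success; False: must be closed by hand


def split_view(messages: List[Message]) -> Tuple[List[Message], List[Message]]:
    """Separate auto-closing (state-based) messages from those needing manual closing."""
    auto = [m for m in messages if m.stateful]
    manual = [m for m in messages if not m.stateful]
    return auto, manual
```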

Actions

  • Automatic - the action is run automatically. Example: an SMS is sent when an error is detected
  • Manual - the action is run manually (requires human involvement). Example: calling the administrator
  • Semi-Automatic - a human verifies the error situation, then an automatic action is run by the machine
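The three action types differ only in who decides that the action runs, which a small dispatcher can show. In this sketch, run_action stands for the automatic action (for example, sending the SMS) and ask_human for the human verification step. Both are hypothetical callables, as is the log list used to record what happened.

```python
from enum import Enum
from typing import Callable, List


class Mode(Enum):
    AUTOMATIC = "automatic"
    MANUAL = "manual"
    SEMI_AUTOMATIC = "semi-automatic"


def handle_error(mode: Mode, run_action: Callable[[], None],
                 ask_human: Callable[[], bool], log: List[str]) -> None:
    """Dispatch an error according to the configured action mode."""
    if mode is Mode.AUTOMATIC:
        run_action()  # machine acts immediately, e.g. sends the SMS
        log.append("action run automatically")
    elif mode is Mode.MANUAL:
        log.append("waiting for a human to act")  # e.g. someone calls the admin
    elif ask_human():  # semi-automatic: a human verifies the error first
        run_action()
        log.append("action run after human verification")
    else:
        log.append("human rejected the error, no action run")
```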

Methodology

Failure monitoring

Success monitoring (Monitoring monitoring)

NoData triggers

Impact Analysis

A message essentially describes the impact on CIs (No Impact, Working Slow, Partly Working, Not Working).
The message gives the administrator what is needed to start fixing: it identifies the object (CI) and the error (message text).

  • No Impact - no impact on the service (yet)
  • Working Slow - self-explanatory
  • Partly Working - non-critical service functionality affected
  • Not Working - critical service functionality affected
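A message can be modelled as the CI, the error text and one of the four impact levels. This sketch assumes that the levels are ordered as listed above, from least to most severe. It also assumes, purely for illustration, that a service's overall impact is the most severe impact among the messages for its CIs. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable


class Impact(IntEnum):
    # Assumed ordering: least to most severe, as listed in the text.
    NO_IMPACT = 0
    WORKING_SLOW = 1
    PARTLY_WORKING = 2
    NOT_WORKING = 3


@dataclass
class Message:
    ci: str  # identifies the object to fix
    text: str  # identifies the error
    impact: Impact


def worst_impact(messages: Iterable[Message]) -> Impact:
    """Assumption: overall impact is the most severe impact among the messages."""
    return max((m.impact for m in messages), default=Impact.NO_IMPACT)
```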


Reporting
