Change detection and notification
Change detection and notification (CDN) is the automatic detection of changes made to World Wide Web pages and the notification of interested users by email or other means. Whereas search engines are designed to find web pages, CDN systems are designed to monitor changes to web pages. Before such systems existed, users had to check for web page changes manually, either by revisiting web sites or by periodically repeating a search. Efficient and effective change detection and notification is hampered by the fact that most servers do not accurately report content changes through the Last-Modified or ETag HTTP headers.
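Because these headers cannot be relied upon, a monitor typically has to compare the page content itself. The following Python fragment is a minimal sketch of that idea, not the method of any particular service: the `Snapshot` type and the function names are hypothetical, and the header names are assumed to be already normalised to the casing shown.

```python
import hashlib
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Snapshot:
    """What a monitor remembers about one fetch of a page (hypothetical type)."""
    etag: Optional[str]
    last_modified: Optional[str]
    sha256: str


def take_snapshot(headers: Mapping[str, str], body: bytes) -> Snapshot:
    # Assumption: header names use exactly this casing.
    return Snapshot(
        etag=headers.get("ETag"),
        last_modified=headers.get("Last-Modified"),
        sha256=hashlib.sha256(body).hexdigest(),
    )


def headers_claim_change(old: Snapshot, new: Snapshot) -> Optional[bool]:
    """What the validators alone suggest; None if neither can be compared."""
    if old.etag is not None and new.etag is not None:
        return old.etag != new.etag
    if old.last_modified is not None and new.last_modified is not None:
        return old.last_modified != new.last_modified
    return None


def content_changed(old: Snapshot, new: Snapshot) -> bool:
    """Compare the body hashes, ignoring whatever the headers say."""
    return old.sha256 != new.sha256
```

A server that issues a fresh ETag on every response while serving identical content would make `headers_claim_change` report a change where `content_changed` reports none, which is the kind of disagreement the paragraph above describes.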
In 1996, NetMind developed the first change detection and notification tool, known as Mind-it, which ran for six years. This spawned new services such as ChangeDetection.com (1999), ChangeDetect (2002) and Google Alerts (2004). Historically, change polling has been done either by a server which sent email notifications or a desktop program which audibly alerted the user to a change. More recent services such as OnWebChange.com (2009) also offer notifications directly to mobile devices and webhooks (HTTP callbacks) for application integration.
The prevalence of cloud computing and smartphones is changing the CDN market, in particular how polling is done and how notifications are sent. A mobile CDN client with a cloud back end is not constrained by the device's limited bandwidth, storage or processing power, and notifications are delivered wherever the device is. One such service is dasPing (2011).
Change detection and notification services can be categorized by the software architecture they use. Three principal approaches can be distinguished:
- A local client application with a graphical user interface polls and tracks changes.
- A server polls, tracks changes and sends email notifications; its user interface is provided through a web browser.
- A mobile device connects to a cloud server and can be notified in real time by the server when a change is detected.
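The polling step shared by these architectures can be sketched as follows. This is an illustrative outline under stated assumptions, not a description of any listed service: `poll_once`, `fetch`, `seen` and `notify` are hypothetical names, the fetch and notification mechanisms are passed in as callables, and a change is taken to mean any difference in the SHA-256 hash of the page body.

```python
import hashlib
from typing import Callable, Dict, List


def poll_once(
    urls: List[str],
    fetch: Callable[[str], bytes],
    seen: Dict[str, str],
    notify: Callable[[str], None],
) -> None:
    """Fetch each URL once and notify about pages whose content hash changed."""
    for url in urls:
        digest = hashlib.sha256(fetch(url)).hexdigest()
        previous = seen.get(url)
        seen[url] = digest
        # The first visit only records a baseline; there is nothing to compare.
        if previous is not None and previous != digest:
            notify(url)
```

In a local client, `notify` might raise an audible alert; on a server it might send an email or push a message to a mobile device. A scheduler would call `poll_once` at whatever interval the service chooses.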
Some web pages change on every visit because the presented page includes adverts or feeds. This can trigger false positives in change detection, since users are often interested only in changes to the main content. Several approaches exist to mitigate this issue:
- Create a metric of difference between two versions of a page (calculated, for example, from the change in total size, changes in the HTML source, or changes in the DOM tree) and ignore changes below some threshold. The threshold may be set by the user, or estimated automatically by comparing some early versions of the page.
- Content extraction. For popular sites, or sites running popular software, content may be actively separated from chaff by selecting a sub-tree of the DOM, for example using XPath. Another typical method is the use of regular expressions to extract only the text the user is interested in.
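Both mitigations can be illustrated with Python's standard library. The sketch below is one possible realisation under stated assumptions: the difference metric is taken from `difflib.SequenceMatcher` on the raw HTML text (one choice among the metrics mentioned above), the threshold is supplied by the caller, and extraction uses a regular expression with a single capture group. The function names are hypothetical. XPath selection would additionally require an HTML parser and is not shown.

```python
import difflib
import re
from typing import Optional


def difference(old: str, new: str) -> float:
    """Return 0.0 for identical texts, approaching 1.0 as they diverge."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return 1.0 - matcher.ratio()


def significant_change(old: str, new: str, threshold: float) -> bool:
    """Ignore differences at or below the caller-chosen threshold."""
    return difference(old, new) > threshold


def extract(html: str, pattern: str) -> Optional[str]:
    """Return the first capture group of `pattern`, or None if it does not match."""
    match = re.search(pattern, html, flags=re.DOTALL)
    return match.group(1) if match else None


# Two versions of a page that differ only in an advert block.
old_page = '<div id="ad">ad 123</div><div id="main">Price: 10 EUR</div>'
new_page = '<div id="ad">ad 456</div><div id="main">Price: 10 EUR</div>'
main_pattern = r'<div id="main">(.*?)</div>'

# Comparing only the extracted content hides the advert change.
main_unchanged = extract(old_page, main_pattern) == extract(new_page, main_pattern)
```

Regular expressions are brittle on nested or reformatted markup, so a pattern like the one above is suitable only for pages whose structure is known and stable.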