Federated search

From Wikipedia, the free encyclopedia
Jump to: navigation, search

Federated search is an information retrieval technology that allows the simultaneous search of multiple searchable resources. A user makes a single query request which is distributed to the search engines, databases or other query engines participating in the federation. The federated search then aggregates the results that are received from the search engines for presentation to the user.

This is often a technique to integrate disparate information resources on the web. It can also be a technique to integrate multiple data sources within a large organization or "enterprise."

Purpose[edit]

Federated search came about to meet the need of searching multiple disparate content sources with one query. This allows a user to search multiple databases at once in real time, arrange the results from the various databases into a useful form and then present the results to the user.

As such, it is an information aggregation, or integration approach - it provides single point access to many information resources, and typically returns the data in a standard or partially homogenized form. Other approaches include constructing an Enterprise Data Warehouse, Data Lake, or Data Hub. Federated Search queries many times in many ways (each source is queried separately) where other approaches import and transform data many times, typically in overnight batch processes. Federated search provides a real-time view of all sources (to the extent they are all online and available).

Process[edit]

As described by Peter Jacso (2004[1]), federated searching consists of (1) transforming a query and broadcasting it to a group of disparate databases or other web resources, with the appropriate syntax, (2) merging the results collected from the databases, (3) presenting them in a succinct and unified format with minimal duplication, and (4) providing a means, performed either automatically or by the portal user, to sort the merged result set.

Federated search portals, either commercial or open access, generally search public access bibliographic databases, public access Web-based library catalogues (OPACs), Web-based search engines like Google and/or open-access, government-operated or corporate data collections. These individual information sources send back to the portal's interface a list of results from the search query. The user can review this hit list. Some portals will merely screen scrape the actual database results and not directly allow a user to enter the information source's application. More sophisticated ones will de-dupe the results list by merging and removing duplicates. There are additional features available in many portals, but the basic idea is the same: to improve the accuracy and relevance of individual searches as well as reduce the amount of time required to search for resources.

This process allows federated search some key advantages when compared with existing crawler-based search engines. Federated search need not place any requirements or burdens on owners of the individual information sources, other than handling increased traffic. Federated searches are inherently as current as the individual information sources, as they are searched in real time.

Implementation[edit]

federated search engine
Federating across three search engines

One application of federated searching is the metasearch engine; however, this is not a complete solution as many documents are not currently indexed. These documents are on what is known as the deep Web, or invisible Web. Many more information sources are not yet stored in electronic form. Google Scholar is one example of many projects trying to address this.

When the search vocabulary or data model of the search system is different from the data model of one or more of the foreign target systems the query must be translated into each of the foreign target systems. This can be done using simple data-element translation or may require semantic translation.

A challenge faced in the implementation of federated search engines is scalability, in other words, the performance of the site as the number of information sources comprising the federated search engine increase. One federated search engine that has begun to address this issue is WorldWideScience, hosted by the U.S. Department of Energy's Office of Scientific and Technical Information. WorldWideScience [2] is composed of more than 40 information sources, several of which are federated search portals themselves. One such portal is Science.gov [3] which itself federates more than 30 information sources representing most of the R&D output of the U.S. Federal government. Science.gov returns its highest ranked results to WorldWideScience, which then merges and ranks these results with the search returned by the other information sources that comprise WorldWideScience.[3] This approach of cascaded federated search enables large number of information sources to be searched via a single query.

Another application Sesam running in both Norway and Sweden has been built on top of an open sourced platform specialised for federated search solutions. Sesat,[4] an acronym for Sesam Search Application Toolkit, is a platform that provides much of the framework and functionality required for handling parallel and pipelined searches and displaying them elegantly in a user interface, allowing engineers to focus on the index/database configuration tuning.

Challenges[edit]

When federated search is performed against secure data sources, the users' credentials must be passed on to each underlying search engine, so that appropriate security is maintained. If the user has different login credentials for different systems, there must be a means to map their login ID to each search engine's security domain.[5]

Another challenge is mapping results list navigators into a common form. Suppose 3 real-estate sites are searched, each provides a list of hyperlinked city names to click on, to see matches only in each city. Ideally these facets would be combined into one set, but that presents additional technical challenges.[6] The system also needs to understand "next page" links if it's going to allow the user to page through the combined results.

Some of this challenge of mapping to a common form can be solved if the federated resources support linked open data[ via RDF. Ontologies (rules) can be added to map results to common forms using that technology.

Another challenge is sorting and scoring results. Each web resource has its own notion of relevance score, and may support some sorted results orders. Relevance varies greatly among "federates" in the search, so knowing how to interleave results to show the most relevant is difficult or impossible.

Another challenge is robust query. Federated search may have to restrict itself to the minimal set of query capabilities that are common to all federates. E.g. if Google supports negation and quoted phrases, but science.gov does not, it will be impossible for the federated search to support negated, quoted phrases.

Another challenge is availability and timeout. As the number of federates (federated sources) grows, the likelihood of one or more slow or offline federates becomes high. The federated search must decide when to consider a federate offline, or wait for a slow response. Response times will be dictated by the slowest federate of the bunch.

Another challenge is development and testing within an enterprise (vs. on the public internet). Development groups should typically not hit live, production systems as they do regular work, much less intensive load testing. Also, some resources are secure, and should not be arbitrarily queried and exposed in development due to privacy and security concerns. Therefore, the development, testing and performance test environments must include installation and configuration for many sub-systems to allow safe, secure testing.

Another challenge within an enterprise is HA/DR (high-availability and disaster recovery). For the overall federated system to be HA/DR, every sub-system must be HA/DR.

Similarly, performance modeling and capacity planning for the federated system requires modeling, planning and sometimes expansion of all federates.

For the reasons above, within an enterprise, a Data Hub or Data Lake may be preferable, or a hybrid approach. Data Hubs and Lakes simplify development and access, but may incur some time lag before data is available (without special synchronizing logic). On the web, federation is more typical.

Further reading[edit]

See also[edit]

References[edit]

  1. ^ Thoughts About Federated Searching. Jacsó, Péter, Information Today, Oct 2004, Vol. 21, Issue 9
  2. ^ WorldWideScience
  3. ^ a b Science.gov
  4. ^ Sesat
  5. ^ Mapping Security Requirements to Enterprise Search
  6. ^ 20+ Differences Between Internet vs. Enterprise Search - part 1