YaCy

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by 46.21.99.29 (talk) at 11:23, 3 April 2014 (for some reasons the domain yacy.net is not reachable from the US while it is reachable from europe. The hoster states that it is working, while it is effective not. Please use yacy.de meanwhile.). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

YaCy
Developer(s): YaCy Community
Stable release: 1.68 / 9 February 2014
Repository:
Operating system: Cross-platform
Type: Overlay network, Search engine
License: GPLv2+
Website: www.yacy.de/en, www.yacy.net/en

YaCy (pronounced "ya see") is a free distributed search engine built on the principles of peer-to-peer (P2P) networks.[1][2] Its core is a computer program written in Java that, as of September 2006, ran on several hundred computers, the so-called YaCy-peers. Each YaCy-peer independently crawls the Internet, analyzes and indexes the web pages it finds, and stores the indexing results in a common database (the so-called index), which is shared with other YaCy-peers using the principles of P2P networks.

Unlike semi-distributed search engines, the YaCy network has a fully decentralised architecture: all YaCy-peers are equal and no central server exists. A peer can be run either in a crawling mode or as a local proxy server that indexes the web pages visited by the person running YaCy on his or her computer. Several mechanisms are provided to protect the user's privacy.

The search functions are accessed through a locally running web server, which provides a search box for entering search terms and returns results in a format similar to that of other popular search engines.

Architecture

The YaCy search engine is based on four elements:[3]

Crawler
A search robot which traverses from web page to web page and analyzes their content.
Indexer
Creates a Reverse Word Index (RWI), in which each word has its own list of relevant URLs and ranking information. Words are stored in the form of word hashes.
Search and Administration interface
A web interface served by a local HTTP servlet running in a servlet engine.
Data Storage
Stores the Reverse Word Index database using a distributed hash table (DHT).
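
The indexer and data storage elements can be illustrated with a minimal sketch. It is not YaCy's implementation: the hash function (a truncated SHA-1), the occurrence count used as ranking information, and the modulo-based assignment of word hashes to peers are all assumptions made for illustration, and the names `word_hash`, `build_rwi` and `responsible_peer` are hypothetical.

```python
import hashlib
from collections import defaultdict


def word_hash(word: str) -> str:
    """Hypothetical word hash: first 12 hex digits of SHA-1 of the lowercased word.

    The hash format YaCy really uses is not described here; this is a stand-in.
    """
    return hashlib.sha1(word.lower().encode("utf-8")).hexdigest()[:12]


def build_rwi(pages: dict[str, str]) -> dict[str, dict[str, int]]:
    """Build a reverse word index: word hash -> {url: occurrence count}.

    The occurrence count stands in for the "ranking information" (assumption).
    """
    rwi = defaultdict(lambda: defaultdict(int))
    for url, text in pages.items():
        for word in text.split():
            rwi[word_hash(word)][url] += 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {h: dict(urls) for h, urls in rwi.items()}


def responsible_peer(h: str, peers: list[str]) -> str:
    """Choose which peer stores a word hash, in the spirit of a DHT.

    Simplification (assumption): hash value modulo the number of peers.
    """
    return peers[int(h, 16) % len(peers)]


pages = {
    "http://example.org/a": "peer to peer search",
    "http://example.org/b": "search engine search",
}
peers = ["peer-0", "peer-1", "peer-2"]
rwi = build_rwi(pages)

for h, urls in rwi.items():
    print(h, "->", responsible_peer(h, peers), urls)
```

Looking up a search term then amounts to hashing the term, finding the peer responsible for that hash, and asking it for the stored URL list.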

Advantages

PDF slides from ApacheCon 2012: A Web Search Appliance with Solr and YaCy
  • As there is no central server, the results cannot be censored, and reliability is (at least in theory) higher, because there is no single point of failure and the search index is stored redundantly.[4]
  • Because the engine is not owned by a company, there is no centralized advertising.
  • Because of its design, YaCy can be used to index intranets or darknets, including Tor, I2P or Freenet, where Internet search engines do not or cannot operate.
  • It is possible to achieve a high degree of privacy.
  • On every search, YaCy fetches the pages listed in the search results and verifies that they contain the keywords requested by the user. Among other things, this ensures that pages that no longer contain the requested keywords are not displayed.
  • The YaCy protocol uses HTTP requests, which keeps peer communication transparent and discoverable and aids diagnosis and investigation. Performance can be brought close to that of binary protocols (see the Disadvantages section) by using compression such as gzip.
  • Built-in support for serving search results via OpenSearch
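
The result verification described in this list can be sketched as follows. This is a simplified illustration, not YaCy's code: the function name `verify_results` is hypothetical, the keyword check is a plain case-insensitive substring match, and the page download is passed in as a `fetch` function so that the example runs without network access.

```python
from typing import Callable, Iterable


def verify_results(
    urls: Iterable[str],
    keywords: Iterable[str],
    fetch: Callable[[str], str],
) -> list[str]:
    """Keep only the URLs whose current content contains every keyword."""
    wanted = [k.lower() for k in keywords]
    verified = []
    for url in urls:
        try:
            text = fetch(url).lower()
        except OSError:
            continue  # pages that cannot be fetched are dropped
        # Substring match: a simplification of real keyword matching.
        if all(k in text for k in wanted):
            verified.append(url)
    return verified


# Stand-in for a real HTTP download (assumption for this sketch).
snapshot = {
    "http://example.org/current": "YaCy is a distributed search engine",
    "http://example.org/changed": "This page now covers something else",
}


def fake_fetch(url: str) -> str:
    if url not in snapshot:
        raise OSError("unreachable: " + url)
    return snapshot[url]


candidates = [
    "http://example.org/current",
    "http://example.org/changed",
    "http://example.org/gone",
]
print(verify_results(candidates, ["yacy", "search"], fake_fetch))
```

Only the first URL survives: the second no longer contains the keywords and the third cannot be fetched.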

Disadvantages

  • From a development and maintenance point of view, the greatest disadvantage of YaCy is that it inherits all of Java's disadvantages.
  • There is no NAT traversal functionality built in.
  • As there is no central server and the YaCy network is open to anyone, malicious peers are (in theory) able to insert inaccurate or commercially biased search results. Result verification limits this: each page in the result set is downloaded to check that the searched words actually appear at the result URL, so in theory no result displayed to the user can be 'wrong'. However, YaCy identifies itself with a user agent string, so a web server could serve different content to a YaCy crawler than to a normal visitor; this is true of nearly any search engine.
  • Result verification is done client-side on every search, which increases network traffic on the computer running YaCy and makes YaCy slower to display the search results than search engines such as Google. This behavior can be disabled, but that would make the search susceptible to spam.
  • The YaCy protocol uses HTTP requests, which can be slower than binary protocols.
  • IPv6 support is missing.[5]
  • The ranking of sites is done on the YaCy client side (users are encouraged to run their own YaCy server, as a local server is necessary to gain many of the benefits of YaCy). The ranking algorithms, although easily customized, do not have their workload distributed and are limited to the YaCy word index and whatever analysis can be done on the object being ranked. More complex ranking algorithms, such as those used by Google (which derive rank from a variety of contextual factors gathered during content crawling), are therefore not yet feasible in YaCy, which limits most users' ability to retrieve more relevant results. However, crowdsourced ranking can be applied to YaCy results using software such as Seeks.
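
The constraint on client-side ranking can be made concrete with a small sketch. It assumes, purely for illustration, that the only ranking information available for each word is a per-URL occurrence count, and it scores a URL by summing those counts over the query terms. The function name `rank` and the data layout are hypothetical and do not describe YaCy's actual ranking.

```python
def rank(postings: dict[str, dict[str, int]], query: list[str]) -> list[str]:
    """Order URLs by the summed occurrence counts of the query terms.

    Only word-index data is used; no link analysis or other contextual signals.
    """
    scores: dict[str, int] = {}
    for term in query:
        for url, count in postings.get(term, {}).items():
            scores[url] = scores.get(url, 0) + count
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical word index: term -> {url: occurrence count}.
postings = {
    "yacy": {"http://example.org/a": 3, "http://example.org/b": 1},
    "search": {"http://example.org/b": 1, "http://example.org/c": 5},
}

print(rank(postings, ["yacy", "search"]))
```

Because the score depends only on what the word index records, a page that merely repeats a term often can outrank a more relevant one, which is the kind of limit the last item above describes.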

See also

  • Dooble, an open source web browser with an integrated YaCy Search Engine Tool Widget
  • Sciencenet, a search engine for scientific knowledge, based on YaCy
  • Arado.sf.net, an open source search engine and URL database

References

  1. "YaCy takes on Google with open source search engine". The Register. 2011-11-29. Retrieved 2012-04-16.
  2. "YaCy: It's About Freedom, Not Beating Google". PC World. 2011-12-03. Retrieved 2012-04-16.
  3. "YaCy Technology Architecture". YaCy.net. Retrieved 2012-02-14.
  4. "Search Engine Technology". Retrieved 2014-01-28.
  5. http://bugs.yacy.net/view.php?id=145