YaCy
Original author(s) | Michael Christen |
---|---|
Developer(s) | YaCy Community |
Stable release | 1.90 / 4 July 2016 |
Operating system | Cross-platform |
Type | Overlay network, Search engine |
License | GPLv2+ |
Website | www.yacy.net/en |
YaCy (pronounced "ya see") is a free distributed search engine built on the principles of peer-to-peer (P2P) networks.[1][2] Its core is a computer program written in Java which, as of September 2006, ran on several hundred computers, the so-called YaCy-peers. Each YaCy-peer independently crawls the Internet, analyzes and indexes the web pages it finds, and stores the indexing results in a common database (the so-called index), which is shared with other YaCy-peers using P2P principles. Anyone can use it to build a search portal for their intranet or to help search the public Internet.
Compared to semi-distributed search engines, the YaCy network has a decentralised architecture. All YaCy-peers are equal and no central server exists. YaCy can be run either in a crawling mode or as a local proxy server that indexes the web pages visited by the person running YaCy on their computer. Several mechanisms are provided to protect the user's privacy. The search functions are accessed through a locally running web server, which provides a search box for entering search terms and returns results in a format similar to that of other popular search engines.
In October 2015, 11 years after the project launched, its core developers declared that a large logic structure of YaCy, which underlies all distributed ranking algorithms, had always been faulty, effectively impairing any ranking capabilities.[3]
YaCy is available on Windows, Mac and GNU/Linux.
System components
The YaCy search engine is based on four elements:[4]
- Crawler
- A search robot which traverses from web page to web page and analyzes their content.
- Indexer
- Creates a Reverse Word Index (RWI), in which each word has a list of relevant URLs and ranking information. Words are stored in the form of word hashes.
- Search and Administration interface
- A web interface provided by a local HTTP servlet with a servlet engine.
- Data Storage
- Used to store the Reverse Word Index Database utilizing a Distributed Hash Table.
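The indexer and data storage elements can be illustrated with a minimal sketch. It is not YaCy's implementation: the hash function (a truncated SHA-256), the `ReverseWordIndex` class and the `responsible_peer` placement rule are assumptions made for illustration only, and the tokenizer simply splits on whitespace.

```python
import hashlib
from collections import defaultdict


def word_hash(word: str) -> str:
    """Hypothetical word hash: first 12 hex digits of SHA-256 of the lowercased word."""
    return hashlib.sha256(word.lower().encode("utf-8")).hexdigest()[:12]


class ReverseWordIndex:
    """Maps word hashes to the URLs containing the word, with an occurrence count."""

    def __init__(self):
        # word hash -> {url: number of occurrences of the word on that page}
        self.postings = defaultdict(dict)

    def add_document(self, url: str, text: str) -> None:
        for word in text.split():
            entry = self.postings[word_hash(word)]
            entry[url] = entry.get(url, 0) + 1

    def lookup(self, word: str) -> list:
        """Return the URLs for a word, most occurrences first."""
        entry = self.postings.get(word_hash(word), {})
        return sorted(entry, key=lambda u: (-entry[u], u))


def responsible_peer(h: str, peers: list) -> str:
    """Toy hash-table placement: choose a peer from the numeric value of the word hash."""
    return peers[int(h, 16) % len(peers)]


index = ReverseWordIndex()
index.add_document("http://example.org/a", "free search free software")
index.add_document("http://example.org/b", "free search engine")
print(index.lookup("free"))  # ['http://example.org/a', 'http://example.org/b']
print(responsible_peer(word_hash("free"), ["peer-1", "peer-2", "peer-3"]))
```

Only hashes are used as keys, so the plain word never appears in the stored index. A real distributed hash table would also handle peers joining and leaving, which the modulo rule above does not.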
Philosophy
The information society of the 21st century is based on free access to all public information. There is a strong focus on transparency, accountability and accessibility of information. YaCy aims to enable this free access to information effectively and realistically. While the major search engines run by global corporations are closed systems whose search technology is neither transparent nor comprehensible, YaCy provides an open-source and free search solution. Everyone can see how information is obtained for the search engine and displayed to the user.
Free Search: The Missing Link
There is a lot of free content on the Internet, such as Wikipedia, free music, and data under Creative Commons and other free-use licenses. This free content should not be discoverable only through proprietary search engines in an increasingly monopolistic Internet infrastructure, because the monopoly holders then decide what information is visible. The YaCy project holds that free information is truly free only if it can be accessed using free software. YaCy fills in the missing link between free information and the user: free search.[5]
A Decentralised Search Engine
The Internet was built on the original philosophy of an all-to-all infrastructure. Lately, however, one-way transmitter-receiver connections have come to dominate the World Wide Web. Ideally, each consumer of content on the Web should have the same opportunity to produce content as to consume it. YaCy's goal is to help producers and users of information on the Web operate independently of centralised search technology by making all content open to all people.
Benefits of the YaCy Philosophy
Civil Rights and Privacy
- A central evaluation and monitoring of search queries is impossible.
- Data trails cannot be evaluated. Beyond data protection and privacy, this is also an economic factor, as it guards against industrial espionage.
Ecological
- The operation of data centers with enormous power consumption (and sometimes their own power plants) for central web search could be eliminated. Distributed search requires only the computers of the searchers.
Sociological
- All searchers have the same rights, for example when adding new content.
- The content of the search engine will be determined by the users, not by commercial aspects of the Web portal operator.
- Individualization of Relevance: everyone can assess the quality and importance of web pages by their own rules and adjust to their personal relevance as a ranking method (both popular and scientific).
Advantages
- As there is no central server, the results cannot be censored easily, and the reliability is (at least theoretically) higher, because there's no single point of failure and the search index is stored redundantly.[6]
- Because the engine is not owned by a company, there is no centralized advertising.
- Because of its design, YaCy can be used to index intranets or darknets where Internet search engines do not or cannot operate, including Tor, I2P or Freenet.
- It is possible to achieve a high degree of privacy.
- On every search YaCy fetches the pages provided in search results and verifies that they still contain the keywords requested by the user. This ensures that the pages that no longer contain the requested keywords are not displayed to the user, among other things.
- The YaCy protocol uses HTTP requests, which preserves transparency and discoverability while aiding diagnosis and investigation. Performance can be brought closer to that of custom binary protocols running directly over TCP or UDP (see Disadvantages section) with the use of compression, such as gzip.
- Built-in support for serving search results via OpenSearch
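The effect of compression on a text-based payload, as mentioned in the list above, can be shown with a small sketch. The payload is hypothetical: it is not the YaCy wire format, only a repetitive JSON list of the kind of index entries peers might exchange.

```python
import gzip
import json

# Hypothetical payload: 200 index entries serialized as JSON text.
entries = [
    {"url": f"http://example.org/page{i}", "word": "search", "count": i % 5}
    for i in range(200)
]
raw = json.dumps(entries).encode("utf-8")

# gzip shrinks repetitive text considerably; decompression restores it exactly.
packed = gzip.compress(raw)
restored = gzip.decompress(packed)

print(f"raw: {len(raw)} bytes, gzip: {len(packed)} bytes")
```

Compression reduces the bytes sent over the network, but the cost of parsing text on each end remains, so it narrows the gap to a binary protocol rather than closing it.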
Disadvantages
- There is no NAT traversal functionality built in.
- As there is no central server and the YaCy network is open to anyone, malicious peers are (theoretically) able to insert inaccurate or commercially biased search results. In theory, no search result displayed to the user can be 'wrong', since all results are, if so configured, verified by downloading each page in the result set and checking that the searched words actually exist on the page at the result URL. However, YaCy identifies itself with a user agent string, so a web server could serve different content to a YaCy crawler than to a normal visitor. This is true for nearly any search engine.
- Result verification is done client-side on every search, which increases network traffic on the computer running YaCy and makes YaCy slower to display the search results than search engines such as Google. This behavior can be disabled, but that would make the search susceptible to spam.
- The YaCy protocol uses HTTP requests, which can be slower than binary protocols.
- Missing IPv6 support.[7]
- The ranking of sites is done on the YaCy client side (users are encouraged to run their own YaCy server, as using a local server is necessary to gain many of the benefits of YaCy). The ranking algorithms, although easily customized, do not have their workload distributed and are limited to the YaCy word index and whatever analysis can be done on the object being ranked. Therefore, more complex ranking algorithms such as those used by Google (which rank using a variety of contextual factors gathered during content crawling) are not yet feasible in YaCy, which limits most users' ability to retrieve more relevant results. However, it is possible to apply crowdsourced ranking to YaCy results using software such as Seeks.
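The client-side result verification described in the list above can be sketched as follows. This is an illustration under stated assumptions, not YaCy's code: `verify_results` is a hypothetical name, matching is a plain substring test, and the `fetch` argument stands in for a real HTTP download so that the example runs offline.

```python
from typing import Callable, Iterable, List


def verify_results(
    urls: Iterable[str],
    keywords: Iterable[str],
    fetch: Callable[[str], str],
) -> List[str]:
    """Keep only URLs whose current page text still contains every keyword."""
    wanted = [k.lower() for k in keywords]
    verified = []
    for url in urls:
        try:
            text = fetch(url).lower()
        except Exception:
            # A page that cannot be fetched cannot be verified, so it is dropped.
            continue
        if all(k in text for k in wanted):
            verified.append(url)
    return verified


# Stand-in for the live web: URL -> current page text.
pages = {
    "http://example.org/a": "YaCy is a peer-to-peer search engine",
    "http://example.org/b": "This page was rewritten and is now about cooking",
}
print(verify_results(pages, ["search", "engine"], pages.__getitem__))
# ['http://example.org/a']
```

Every result costs one extra download per search, which is the source of the additional traffic and latency noted above.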
YaCy as a Search Appliance: Topic-Oriented Search and Search Engine for Projects
- You can search for projects (a combination of wikis, forums and websites)
- It is a topic-oriented search engine (combine a search for several web pages from different domains into a single search portal)
- YaCy helps to preserve your anonymity when you search.
- If you run a YaCy peer, you have your own search engine. You can use it either to provide search functionality for your own search portal, or you can join a community of search engine peers to share your web index with the web index of other YaCy peer owners. If you search with YaCy your search requests are anonymous.
Privacy & Security
- Your private search requests are never stored, monitored or evaluated for commercial purposes.
- If you are searching for terms related to product development and innovation, you can potentially give away information about your company activities. To maintain your business secrets, you need your own search engine (which can easily be created with YaCy).
Search Engine Technology
- YaCy is a complete search appliance with user interface, index, administration and monitoring.
- YaCy harvests web pages with a web crawler. Documents are then parsed and indexed, and the search index is stored locally. If your peer is part of a peer network, your local search index is also merged into the shared index for that network.
- When a search is started, the local index contributes results together with the global search index from peers in the YaCy search network.
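How local and peer results might be combined can be sketched as below. The function name `merge_results`, the score values and the "keep the best score per URL" rule are assumptions for illustration; the source does not describe how YaCy actually merges results.

```python
from typing import Dict, Iterable, List


def merge_results(local: Dict[str, float], remote: Iterable[Dict[str, float]]) -> List[str]:
    """Combine local and peer results, keeping the best score seen for each URL."""
    best = dict(local)
    for peer_results in remote:
        for url, score in peer_results.items():
            if score > best.get(url, float("-inf")):
                best[url] = score
    # Highest score first; ties are broken by URL to keep the order stable.
    return sorted(best, key=lambda u: (-best[u], u))


# Hypothetical relevance scores from the local index and from two peers.
local_hits = {"http://example.org/a": 0.9, "http://example.org/b": 0.4}
peer_hits = [
    {"http://example.org/b": 0.7, "http://example.org/c": 0.5},
    {"http://example.org/d": 0.1},
]
print(merge_results(local_hits, peer_hits))
# ['http://example.org/a', 'http://example.org/b', 'http://example.org/c', 'http://example.org/d']
```

Deduplicating by URL matters because the same page can be indexed by several peers at once.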
Components of YaCy
YaCy consists of a variety of components that serve the networking, administration and maintenance of the index, with blacklists, moderation functions and community communication. The components are:
1. Statistics
2. XML API
3. Crawler with Balancer
4. Web Server
5. Indexing
6. Peer-to-Peer
7. Monitoring
8. Filter & Blacklist
9. Search interface
10. Bookmarks
See also
- Dooble, an open source web browser with an integrated YaCy search engine tool widget
- Sciencenet, a search engine for scientific knowledge, based on YaCy
- Arado.sf.net, an open source search engine and URL database
References
1. "YaCy takes on Google with open source search engine". The Register. 2011-11-29. Retrieved 2012-04-16.
2. "YaCy: It's About Freedom, Not Beating Google". PC World. 2011-12-03. Retrieved 2012-04-16.
3. "YaCy-Bugtracker". Retrieved 2016-03-08.
4. "YaCy Technology Architecture". YaCy.net. Retrieved 2012-02-14.
5. "YaCy - The Peer to Peer Search Engine: Philosophy". yacy.net. Retrieved 2016-01-04.
6. "Search Engine Technology". Retrieved 2014-01-28.
7. "YaCy crawler cannot parse URI's with IPv6 address in it inside square brackets". YaCy-Bugtracker. MantisBT Team. Retrieved 2014-04-07.