List of Web archiving initiatives

Map of Web archiving initiatives worldwide in June 2014.

This page contains a list of Web archiving initiatives worldwide. For easier reading, the information is divided into three tables: web archiving initiatives, archived data, and access methods.

This Wikipedia page was originally generated from the results obtained for the research paper A survey on web archiving initiatives[1] published by the Portuguese Web Archive[2] team.

Web archiving initiatives

Name Country Creation Year Technologies Number of Employees (Full-time) Number of Employees (Part-time) Comments
Australia's Web Archive[3] Australia 1996 PANDORA Digital Archiving System (PANDAS), NLA Trove, HTTrack. 10 >5 The PANDORA Archive, which takes a selective approach, is a collaborative program of 11 agencies that provide an estimated average monthly staffing equivalent to around 10 FTE. IT support provided by the National Library of Australia: 0.25 person-month. Whole .au domain harvests have been conducted annually since 2005 in collaboration with the Internet Archive using Heritrix and the Wayback Machine.
Australian Government Web Archive[4] Australia 2013 PANDORA Digital Archiving System (PANDAS), NLA Trove, HTTrack. 5 The Australian Government Web Archive (AGWA) is a web archiving initiative of the National Library of Australia which complements the Library's long-established PANDORA Archive. The AGWA is a collection of Commonwealth Government websites with the earliest content collected in June 2011. Selected older government web content along with some state and local government material may be found in the PANDORA Archive.[5]
Our digital island, a Tasmanian Web Archive[6] Australia 1996 Web Curator, Heritrix and Wayback Machine 1
PageFreezer[7] Canada, US, Netherlands, Belgium 2005 PageFreezer's Deep Web Crawler, Lucene, Solr Enterprise Class On Demand service to archive and replay websites, blogs, Ajax, Flash, video, audio & social media for litigation protection, eDiscovery and regulatory compliance with FDA, FINRA, FSA, SEC, SOX, Federal Rules of Evidence and records management laws.
OoCities - GeoCities Archive / GeoCities Mirror[8] Germany
Web@rchive Austria[9] Austria 2008 Archive-access tools and NetarchiveSuite.dk 2
DILIMAG (Digital Literature Magazines)[10] Austria 2007 WebCurator 2 One technician, one for collecting and metadata.
Bibliothèque et Archives nationales du Québec (BAnQ)[11] Canada 2012 Heritrix, Wayback Machine. 3 1 librarian, 2 developers
Government of Canada Web Archive (GCWA)[12] Canada 2005 Heritrix, Wayback Machine and Nutchwax. 2
Web Information Collection and Preservation - WICP (Chinese Web Archive)[13] China 2003 Heritrix, Wayback Machine and Nutchwax.
Croatian Web Archive (Hrvatski arhiv weba - HAW)[14] Croatia 2004 Crawl: DAMP software, Heritrix; Access: Wayback Machine, Lucene 2 2 The Croatian Web Archive (HAW) is a collection of content harvested from the Internet. In 2004 the Archive started as a concept of selective capturing of web resources. Whole .hr domain harvests have been conducted annually since 2011, as well as thematic/event harvesting for events of national interest. The content of the Archive is publicly available via the HAW website. Staff: 2 librarians full time and 1 librarian part time (NUL); 2 IT professionals part time (Zagreb University Computing Centre, Srce).
WebArchiv (National Library of the Czech Republic)[15] Czech Republic 2000 Nutch, NutchWAX and WERA tools. 5 3.5 FTE library staff + approx. 1.5 FTE technical staff
Netarkivet.dk[16] Denmark 2005 NetarchiveSuite.dk and Heritrix and Wayback. 22 22 people involved (developers, librarians, operations staff, project managers). Altogether 5 FTE.
Estonian Web Archive[17] Estonia 2010 NetarchiveSuite, Heritrix, Wayback Machine. 2 1 Since 2006 the Legal Deposit Law has allowed the National Library of Estonia to collect Estonian websites as legal deposit copies and make these available to the public. The archive has been open to the public since November 2013.
Finnish Web Archive[18] Finland 2008 Heritrix, Solr, Wayback Machine. 2 >2 Maintained by the National Library of Finland. Annually, all *.fi domains are harvested, as well as web servers located in Finland. Outside these harvests, the library manually selects relevant websites.
BnF - BnF Web Legal Deposit[19] France 2006 Heritrix, Wayback Machine and NutchWAX. NetarchiveSuite. 9
Ina (Institut National de l'Audiovisuel)[20] France 2009 Crawl : PhagoSite, Croket, Fantomas, Heritrix / Access : Dowser 8 Staff of 80 documentalists taking part in nominating sites and QA
E-diaspora (Télécom ParisTech, FMSH)[21] France 2010 Crawl : PhagoSite 1 30 researchers taking part in nominating sites
Internet Memory Foundation France, Netherlands 2004 IM large scale crawler, Heritrix, IM Access software. Storage of Web Content: Hbase Crawls monitoring, developers & infrastructure, manager & administration.
Internet Memory Research (ATN service)[22] France 2011 IM large scale crawler, Heritrix, IM Access software. Storage of Web Content: Hbase Crawls monitoring (QA, crawl engineering, project management), developers & infrastructure, manager & administration
Bibliotheksservice-Zentrum Baden-Württemberg[23] Germany 2003 7.5
Web archive of the German Bundestag[24] Germany 2005
Iceland[25] Iceland 2004 Heritrix, Wayback Machine
Israel Web Archive[26] Israel 2011 Heritrix, Web Curator Tool, Wayback Machine, Rosetta 1 >3 National Library of Israel collecting '.IL' domains, 1 Project Manager part time, 1 Technical Leader full time, 1 librarian part time, 1 IT Infrastructure part time
Japan Web Archiving Project[27] Japan 2002 Heritrix, Solr. Previously: Wget, Accela BizSearch 9 1 Launched in April 2004 as a pilot project, WARP (Web Archiving Project) has been in full-scale operation since July 2007.[28]
National Library of Korea - OASIS (Online Archiving & Searching Internet Sources)[29] Korea 2001 Own system based on Oracle DBMS and specialized search engine (IRS) that performs data management and search function. 3 11
Koninklijke Bibliotheek[30] Netherlands 2006 Heritrix, KB e-Depot system 1 ~7
National Library of Latvia[31] Latvia 2005 Heritrix 1 Currently only storing for preservation, access to public in development (ETA June 2012). The Latvian term for web harvesting is "rasmošana".
New Zealand Web Archive[32] New Zealand 1999 Wayback Machine 3 >10 3-4 people at the National Library (various hours) and 2 people at the Internet Archive during the time of domain harvests.
Selective web archiving = 3 full time staff.
Technical services = 1 staff member responds to technical problems when they arise.
National Digital library = 2-3 staff members ad hoc.
NDHA (National Digital Heritage Archive) = various staff members respond to web archiving issues as they arise.
The National Library of Norway[33] Norway
Portuguese Web Archive[2] Portugal 2007 PWA-technologies, Heritrix, Wayback Machine, NutchWAX 3
Web archive of Cacak[34] Serbia 2009 HTTrack 1
Web Archive Singapore[35] Singapore Wayback Machine, Heritrix, NutchWAX, WERA
Slovenian Web Archive[36] Slovenia 2007 Heritrix, Wayback Machine 1
Archivo de la Web Española[37] Spain 2009 Heritrix, Wayback Machine, NutchWAX 2+supervisor 1 Can pool additional resources occasionally from IT and financial departments.
PADICAT: The Web Archive of Catalonia[38] Spain 2005 Heritrix, Wayback Machine, WERA, Nutchwax, Web Curator and CAT. 4 PADICAT is the open access Web Archive of Catalonia, created by the Biblioteca de Catalunya: the public institution responsible for collecting, preserving and distributing the bibliographic heritage of Catalonia, in Spain.
Basque Digital Heritage Archive[39] Spain 2008 Heritrix, Wayback Machine, Nutchwax and Web Curator. 1
Sweden (Kulturarw3)[40] Sweden 1996 Heritrix. Own system for storage, maintenance and access 1.25 Pause in operation November 2009 - May 2011.
Aleph Archives[41] Switzerland/USA 2010 Distributed crawler, ArchiView access plugin, High performance search engine, Near real time indexing, Web Monitoring tools 7 Enterprise-grade Web archiving platform for online heritage (content, brands) preservation and eDiscovery, aimed at corporations, institutions, and legal and government organizations seeking to preserve their web content regardless of type (websites, wikis, social media, forums...).
Web Archiving Bucket[42] Switzerland/USA/Canada 2012 WARC Software Development Kit, Cobalt, Holon web server The “Web Archiving Bucket” is an initiative launched by Aleph Archives to preserve data and provide libraries and organizations with free-to-use web archiving tools and components. It provides a set of tools to help archivists and professionals in their daily work.
Web Archive Switzerland[43] Switzerland 2008 Heritrix, Wayback Machine 5 1 crawl engineer, 3 persons for quality assurance (sharing less than 1 full time), 1 coordinator. The curators, who do the selection, are partner libraries all over Switzerland.
NTU Web Archiving System, NTUWAS[44] Taiwan 2007 Lucene 3
Web Archive Taiwan[45] Taiwan 2007
The UK Web Archive[46] UK 2004 Heritrix, Web Curator Tool, Wayback Machine, Solr for searching.
Hanzo Archives[47] UK/USA 2006 Hanzo Crawler, Search, and Access Tools. Commercial web archiving and social media archiving products and services for capturing dynamic websites and social enterprise use. E-discovery, information management, and cultural heritage preservation for regulated corporations and government organizations whose compliance or legal obligations extend to their websites, intranet, and social media content. Also includes multiple 'dark' archives across Europe and USA.
UK Government Web Archive[48] UK 2003 ATN Service 4 2 Technical side of our web archiving operation is contracted out to the Internet Memory Foundation so the figures account for QA, curatorial and management staff only
Internet Archive (provides Archive-it service)[49] USA 1996 Heritrix, Wayback Machine, NutchWAX and other tools developed by the Internet Archive 12
Reed Tech Archives[50] USA 2010 TrueArchive™ Technology Reed Tech Archives provides support for Information Governance, Litigation Protection, Compliance, e-Discovery and Social Media Management. The solution offers both automated and manual capture. For automated website and social media capture, the application captures sites on a recurring frequency and interval. The entire site is completely rebuilt inside the archive to provide the exact user experience afforded on the live web. A user can navigate the site from a set of URLs or from within the visible archived site. Generally this approach supports compliance and risk mitigation as well as the legal function. On-demand manual capture lets clients capture a fully functioning page or series of pages from a website or social media property as needed through the Reed Tech Web Preserver plug-in. This approach tends to be used to support the legal, marketing and competitive intelligence functions.
Stanford University Libraries[51] USA 2007 Heritrix, HTTrack, Wayback Machine, CDL Web Archiving Service, Internet Archive Archive-It 2 5 Stanford University Libraries has been engaged in web archiving projects since 2007 and started establishing a web archiving program in 2013.
Columbia University Libraries Web Resources Collection Program[52] USA 2009 Archive-it service 2 >1 Part-time consultation/supervision from other librarians adding up to about 1 FTE.
Cornell University Library USA 2011 Archive-it service .5 >1
North Carolina State Government Web Site Archives[53] USA 2005 Archive-it service 3
Latin American Web Archiving Project[54] USA 2005 Archive-it service
Web Archiving Project for the Pacific Islands[55] USA Archive-it service 4
Library of Congress Web Archives[56] USA 2000 Heritrix, Wayback Machine, and the DigiBoard, an in-house curatorial/permissions tool 5 80 The part time workers spend a few hours per month (on average) selecting content for the collections.
Harvard University Library: the Web Archive Collection Service (WAX)[57] USA 2006 Heritrix, Wayback Machine, NutchWAX and WAXi, an in-house curatorial interface. >6 3 part time on IT support. External curators within 3 units, whose sizes are unknown.
Web Archiving Service from California Digital Library (WAS service)[58] USA 2005 Heritrix, Wayback Machine, NutchWAX 4 >1 The number of hours that curators devote to the service is very variable.
Bentley Historical Library (University of Michigan) Web Archives[59] USA 2000 HTTrack, Teleport Pro, WAS service (2010-) 2
University of Texas at San Antonio Web Archives[60] USA 2009 Archive-It 3 The number of hours varies depending on how the crawls are scheduled.
qumram[61] Switzerland 2010 qumram Web Archiving / Web Information Governance Software Suite Commercial web archiving / web information governance software suite. Provides both remote harvesting and transactional web archiving. Allows integration with any web application (WCMS, Portal, Sharepoint, eShop, custom applications) as well as any repository (database, file system, electronic archive or records management system, cloud-based solution). Allows capturing and reproduction of public information as well as specific user interactions.
SAPERION[62] Germany 2011 SAPERION ECM Web Content Archive Commercial enterprise content management suite specializing in regulatory compliance. The product provides both harvesting and transactional web archiving based on the integration of qumram's[61] Chronos Web Archiving Software Suite. Web content is just another channel through which content reaches SAPERION. Others include scanners, fax, e-mail, mobile devices, office suites or any other content-creating system such as ERP systems.
Bibliotheca Alexandrina's Internet Archive Egypt 2002 Heritrix, Wayback Machine 3 Current crawling interests: Egypt beyond January 25, Arab League ccTLDs
AUEB Web Archive[63] Greece 2010 Heritrix, Wayback Machine and NutchWax. 1 1 This project is part of the function of the University Library.[64]
World Bank Web Archives[65] USA 2007 HTTrack crawler, Oracle RDBMS, Google Search Appliance 0 3
OpenGovData Russia Archives[66] Russia 2010 HTTrack crawler, custom tools developed for social media archiving. Experimenting: Heritrix, Wayback Machine About 37 government websites collected (March 2012) using HTTrack and provided as archives for downloading.
Archive Team Worldwide 2009 wget, ad hoc scripts 0 0 Volunteers group. They partially archived GeoCities, Yahoo! Videos, Google Video and others.
WikiTeam Worldwide 2011 ad hoc scripts 0 0 Volunteers group. Over 4,500 wikis preserved.
University of North Texas CyberCemetery[67] USA 1995 Heritrix, Wayback Machine; formerly HTTrack 2 The CyberCemetery is an archive of government websites that have ceased operation (usually websites of defunct government agencies and commissions that have issued a final report). This collection features a variety of topics indicative of the broad nature of government information. In particular, this collection features websites that cover topics supporting the university’s curriculum and particular program strengths.
Archive today[68] Worldwide 2012 Apache Accumulo, HDFS, ad hoc scripts 1 1 Saves external links from community web-sites (wikis, forums, blogs, ...). Can save snapshots of Web 2.0 pages.
Tamiment Library and Robert F. Wagner Labor Archives at New York University[69] USA 2007 WAS Service 1 1 Archives websites related to New York City and National Labor and Left Movements. Projects include: Alternative Mass Media / News; Anarchism; Animal Rights; Arts and Cultural Left; Civil Rights and Civil Liberties; Communism, Socialism, Trotskyism; Economic and Social Justice (Including Occupy Wall Street); Education and Student Movements; Electoral Politics and Parties / Political Action (U.S. Left); Environmentalism / Green Movement; Feminism and Women's Movements; Guantanamo Bay Detention Camp & War Crimes (U.S.); Housing; Internet/Cyberspace Democracy; Jewish American Progressive & Left Activity; Labor Unions and Organizations (U.S.); Left Academia and Theory, Intellectuals and Other Notables; LGBT Rights; Other Left Activism; Peace Movements; Prisoners Rights and Political Prisoners; Progressive Policy/ Educational Organizations.
Preservica[70] Worldwide 2012 Heritrix, Tessella's SDB, Wayback Cloud-based heterogeneous archiving service that allows ingest from multiple sources (including web archiving ingest via Heritrix). Ability to migrate content within WARC files and render in Wayback. Ingest runs as workflow so very little effort needed to run it. Developed, supported and run by Tessella's Archiving Division.
Central State Electronic Archives of Ukraine Ukraine 2007 HTTrack, Wget 2 The Archives is interested in preserving websites and creating thematic collections of them. Its holdings currently include website collections on the presidential elections in Ukraine from 2010 until today, the Chornobyl disaster, local elections, Euro 2012 in Ukraine, UNESCO World Heritage sites in Ukraine, and the 200th anniversary of the birth of Taras Shevchenko.[71][72]

Archived data

Name Archived Contents (millions) Disk Space Occupied (TB) Archive Format TLD/Broad Crawls Selective Crawls (Yes/No) Comments
Australia's Web Archive[3] 3100 104.5 ARC/WARC .AU Y .AU crawls (2005-2009): 3 billion files (100 TB). Selective crawls (1996-today): 100 million files (4.5 TB). There are 3 copies of each content.
Our digital island, a Tasmanian Web Archive[6] 0.336 HTTrack Y Preserves online contents related to Tasmania. ODI has operated since its inception under the assumption that web sites fall within the definition of ‘Book’ in the Tasmanian Library Act 1984.[73] Thus, no permission to capture from publishers is required.
Web@rchive Austria[9] 1500 22 ARC .AT Y A copy of the data will be stored in a high security data storage unit.
DILIMAG (Digital Literature Magazines)[10] 0.03 0.996 ARC Project from 2007-03-01 until 2010-12-23. The DILIMAG project collected, described and archived digital German literary magazines.
Bibliothèque et Archives nationales du Québec (BAnQ)[11] 30 6 ARC/WARC Y Harvesting began in 2009. Selective crawls of Quebec websites.
Government of Canada Web Archive (GCWA)[12] 170 7 Y Selective crawls of the web domain of the Federal Government of Canada (.GC.CA)
Web Information Collection and Preservation - WICP (Chinese Web Archive)[13] .GOV.CN Y Harvests web pages about events that have great influence on society, the economy and so on, and sites in the 'gov.cn' domain.
Croatian Web Archive (Hrvatski arhiv weba - HAW)[14] 231 13 Mirror, WARC .HR Y Since 2004 selective harvesting over 5000 web resources. Since 2011 annual harvesting of national .hr domain as well as thematic harvesting. All archived content is publicly available via HAW website.
WebArchiv (National Library of the Czech Republic)[15] 526 24 .CZ Y Harvesting began in 2001.
Netarkivet.dk[16] 16000 502 ARC/WARC .DK Y Uses Heritrix and NetarchiveSuite.dk, which was developed by two Danish libraries.
Estonian Web Archive[17] 31 1.6 ARC/WARC .EE Y ARC files in the archive are uncompressed. The archive consists of selective crawls since 2010. The first broad crawl will be conducted in 2014.
Finnish Web Archive[18] 494 23 .FI, .AX Y Also crawls contents hosted on machines physically located in Finland, independently from their domain.
BnF - BnF Web Legal Deposit[19] 18800 370 ARC/WARC .FR + all sites hosted in France Y BnF is making full copies[19] of all sites in the .FR TLD, as well as all sites hosted in France, ignoring both the Robots exclusion standard and the licenses of the documents.
Ina (Institut National de l'Audiovisuel)[20] 30000 270 (see comments) DAFF Y DAFF handles full content deduplication, so the size on disk takes into account compression and deduplication ; the equivalent disk storage in compressed ARC format would be approximately 2 PB
E-diaspora (Télécom ParisTech, FMSH)[21] 1030 13 (see comments) DAFF Y DAFF handles full content deduplication, so the size on disk takes into account compression and deduplication ; the equivalent disk storage in compressed ARC format would be approximately 51 TB
Internet Memory Foundation 180 WARC Can be done by partners Y Formerly European Archive.[74] Collaborates with Internet Memory Research, which provides the ArchiveTheNet Service (ATN Service). Selective crawls (140 TB), Domain crawls (40 TB), expect to grow to 1PB in 2012. New datacenter and a new crawler in 2012.
Bibliotheksservice-Zentrum Baden-Württemberg[23] 1 HTTrack Y Bibliotheksservice-Zentrum Baden-Württemberg operates the following web archives:
1- Baden-Württembergisches Online-Archiv (BOA)
2- Saardok
3- Literatur im Netz des Deutschen Literaturarchivs Marbach.[75]
Web archive of the German Bundestag[24] Y German Federal Parliament. Selective. Snapshots of www.bundestag.de and other web presences of the German Bundestag are made at regular intervals or on the occasion of certain events. These are available in the web archive.
Iceland[25]
Israel Web Archive ARC/WARC .IL Y .IL crawls (2006-2011): pilot crawls (500 GB). Selective crawls (1996, 2011)
Japan Web Archiving Project[27] 319.8 38.2 WARC - Y 15 TB of selective crawls based on permission (2002–2010). Started the web archiving of official institution sites based on the legislation from April 2010.
National Library of Korea - OASIS (Online Archiving & Searching Internet Sources)[29] 24 Y Requires consent before archiving. Targets 56,401 Websites. Web archiving is managed under Digital resource management systems. In 2011 the web archiving system will be rebuilt.
Koninklijke Bibliotheek[30] 5 ARC Y
New Zealand Web Archive[32] 346 13 .NZ Y .NZ crawls: 105 million URLs (4.1 TB) in 2008, 170 million URLs (6.1 TB) in 2010. Selective crawls of 7 599 websites in the National Digital Heritage Archive (2.8 TB), 71 million contents estimated. Legal deposit covers born digital material (including websites).
The National Library of Norway[33]
Portuguese Web Archive[2] 1 731 52 ARC .PT, .CV, .AO, .MZ Y TLD crawls and integration of external collections since 2007, daily crawls of a selection of online publications since 2010.
Web archive of Cacak[34] 0.255 0.013 HTTrack Y Selective crawls of 130 sites related to the city of Cacak. Collaboration with the WebArchiv team from the National Library of the Czech Republic.
Web Archive Singapore[35] .SG Y Selective crawls of 1000 Singapore-related sites, with the written consent of the owners. Whole .SG domain archiving.
Slovenian Web Archive[36] 1.5 WARC Selective crawls
Archivo de la Web Española[37] 2.421 95 ARC .ES Y Collaboration with Internet Archive. Domain crawl of .ES, harvested annually. Not launched publicly yet.
PADICAT: The Web Archive of Catalonia[38] 349 13 ARC/WARC .CAT Y In accordance with the general trend, the archive model is a hybrid system consisting of: mass compilation of open-access digital resources published on the Internet (.cat); systematic archiving of the web site output of Catalan organizations; and fostering of lines of research through themed integration of the digital resources pertaining to specific events in Catalan public life (elections, museums, etc.).
Basque Digital Heritage Archive[39] 21 0.8 ARC Y
Sweden (Kulturarw3)[40] 1710 223 Multipart MIME .se, Swedish .nu and geolocation for other TLDs Y Bulk crawls approximately twice a year.
Selective crawls of about 140 newspapers every day.
Aleph Archives[41] 23 WARC, WARC2, ARC and HTTrack to WARC migration tools Y Enterprise-grade Web archiving platform for online heritage (content, brands) preservation and eDiscovery, aimed at corporations, institutions, and legal and government organizations seeking to preserve their web content regardless of type (websites, wikis, social media, forums...).
Web Archive Switzerland[43] 0.5 ARC Y Mainly selected .ch crawls
NTU Web Archiving System, NTUWAS[44] 200 14 Y
Web Archive Taiwan[45]
The UK Web Archive[46] 20.6 WARC Y Selective crawls with previous permission. Now also conducting wholesale UK domain-scale crawls under Non-Print Legal Deposit legislation, enacted April 2013. This content will only be available on premises controlled by one of the six legal deposit libraries. The UKWA is a spin-off from the UK Web Archiving Consortium that ended in 2007.
Hanzo Archives[47] 7 WARC Y Commercial web archiving services and appliances, for government and corporations whose compliance or legal obligations / needs extend to their websites, intranet, and social media. Many 'dark' archives across Europe and USA.
UK Government Web Archive[48] 1,000+ 80+ ARC Between 2003 - 2005 the Internet Archive undertook the technical side of web archiving on behalf of The UK Government Web Archive. Since 2005 the technical side of the web archiving service has been contracted out to the Internet Memory Foundation. The UK Government Web Archive was part of the UK Web Archiving Consortium from 2004 - 2009.
Internet Archive (provides Archive-it service)[49] 150000 5500 World-wide Y Provides the Archive-it service and leads the Archive-access project (Internet Archive ARC access tools). Collection is mirrored at Bibliotheca Alexandrina in Egypt.
Reed Tech Archives[50]
Columbia University Libraries Web Resources Collection Program[52] 93.9 6.6 ARC/WARC Y Selective crawls with permission or notification. Thematic collections in: Human rights; Historic preservation and urban planning; New York City religions. Also capture Columbia University web domain.
North Carolina State Government Web Site Archives[53] 51.5 3.8 WARC Y
Latin American Web Archiving Project[54] Y
Web Archiving Project for the Pacific Islands[55] 5.5 ARC/WARC Y Includes sites of 18 countries.
Library of Congress Web Archives[56] 7741 420 ARC/WARC Y Formerly MINERVA. Selective crawls with notification and permission; primarily event and thematic collections.
Harvard University Library: the Web Archive Collection Service (WAX)[57] 19 0.661 ARC Y Selective crawls with no previous authorization.
Web Archiving Service from California Digital Library (WAS service)[58] 216 25.2 ARC/WARC Can be done by partners Y Provides the Web Archiving Service (WAS) to partners worldwide. Developed at the California Digital Library.
Bentley Historical Library (University of Michigan) Web Archives[59] 34.5 2.6 ARC/WARC Y WAS service since 2010.
University of Texas at San Antonio Web Archives[60] 26 1.135 ARC/WARC Y University administration, faculty and student sites; as well as selective captures on San Antonio and South Texas subject areas, including San Antonio organizations; San Antonio Online Journals and Blogs; Tejano and Conjunto music; Gay, Lesbian, Bisexual, Transgender and Queer Related Web sites in Texas, San Antonio and the Rio Grande Valley; Immigration/Borderlands; Mexican Cooking Blogs; San Antonio Restaurants; Renewable Energy in Texas; Rio Grande Valley Organizations; and Rio Grande Watershed and Texas Water Issues.
AUEB Web Archive[63] 3 WARC aueb.gr N The amount of data crawled from the domain aueb.gr ranges between 10 GB and 14.9 GB. The data is stored on disk compressed and requires between 8.8 GB and 9.7 GB, resulting in space savings between 12% and 35%. For a new crawl, only the Web pages that have changed since the previous crawl need to be stored on disk. Consequently, a 13.1 GB crawl of the domain aueb.gr required only 1.6 GB on disk, resulting in space savings of 88%.
World Bank Web Archives[65] 143 GB HTTrack no, so far Y 450 sites with historical or research value have been harvested since 2007, each archived before being taken offline or before a major upgrade.
University of North Texas CyberCemetery[67] 0.887 WARC .gov Y
Bibliotheca Alexandrina's Internet Archive 80000 1000 ARC/WARC Egyptian news and politics Y

Access methods

Name URL history (Yes/No) Meta-data (catalog/advanced) search (Yes/No) Full-text search (Yes/No) Memento Compliance (No/Native/Proxy) Comments
Australia's Web Archive[3] N Y Y No Selected sites are publicly available through a directory structure. Domain harvests are not. The PANDORA Archive is indexed and searchable through the NLA's single search service Trove.[76]
The Australian Domain Harvests are full-text indexed but are not currently publicly available.
Our digital island, a Tasmanian Web Archive[6] Y Y N No Presents thumbnails generated through Html To Image supplemented in HTTrack. Information is organized in directory: A-Z Subject listing, A-Z Title listing.
Web@rchive Austria[9] Y N N No Only accessible on special terminals at the Austrian National Library. Presents thumbnail previews of archived pages and supports keyword search within URL.
DILIMAG (Digital Literature Magazines)[10] Y Y N No Metadata are publicly available; access to the archived versions is free or restricted depending on the rights holders' agreement. Full-text search was not implemented due to lack of resources.
Bibliothèque et Archives nationales du Québec (BAnQ)[11] Y N N No Provides access according to partner policy.[77]
Government of Canada Web Archive (GCWA)[12] Y Y Y Proxy Technical details available.[78]
Web Information Collection and Preservation - WICP (Chinese Web Archive)[13] Y No Archive content is only available on the intranet of the National Library of China. Some collections are publicly available, with meta-data search and browsable by collection.
Croatian Web Archive (Hrvatski arhiv weba - HAW)[14] Y Y Y Proxy Full open access.
WebArchiv (National Library of the Czech Republic)[15] Y Y Proxy Due to copyright restrictions, only a limited number of archived websites for which agreements were signed with the publishers is available online. For other resources you can find out whether a given website was archived and the number of harvested versions. Unlimited access to all resources in WebArchiv is available from public terminals in the National Library.
Netarkivet.dk[16] Y N N No Online access granted only to researchers using a proxy solution that accesses an archive through the Wayback Machine. It has established a framework for running batch jobs with the possibility of data mining.
Estonian Web Archive[17] Y Y N No In November 2013 there were over 1000 records of websites in the subject catalog and two special collections (Estonians Outside Estonia and Elections 2013). Owner of the site has the right to restrict access to the website on the public archive but it remains accessible for the researchers in-house.
Finnish Web Archive[18] Y N 30% of material. No URL search but onsite access to contents. Full-text search is available to 30% of material.
BnF - BnF Web Legal Deposit[19] Y N 15% of the collection No Accessible to authorized users of the BnF, through the reading rooms of the Research Library located in Paris and Avignon. Wayback Machine interface was translated to French. Full Text search only for a relatively small portion of the collection (15% of 200 TB) indexed by Internet Archive. No current full text search implemented in workflow. Builds special collection galleries based on a selection from the archive on a given topic.
Ina (Institut National de l'Audiovisuel)[20] Y Y Y No Full text indexing is based on Lucene. To accommodate results from frequent crawls (several crawls per hour for some pages) clustering is operated to handle similar versions of pages
E-diaspora (Télécom ParisTech, FMSH)[21] Y N N No 1381 sites are currently crawled to build an archive on migrants' usage of the web. Social studies researchers have launched a long-running project based on this archive. Ina handles crawls and storage.
Internet Memory Foundation Y Y Y No Provides access and search services according to partner policy.
Bibliotheksservice-Zentrum Baden-Württemberg[23] Y Y Y No Search available (in development).[79]
Web archive of the German Bundestag[24] Y N N No The web archive itself consists of snapshots of www.bundestag.de and other websites. Navigation is possible by clicking on the years.[80]
Iceland[25] Native
Israel Web Archive N Y N No Still in the development and pilot phases.
Japan Web Archiving Project[27] Y Y Y No Public access to sites after permission of the site owners. Open access to important publications such as white papers.
National Library of Korea - OASIS (Online Archiving & Searching Internet Resource)[29] Y Y Y No 100% of the archive is indexed. Enables search by topic classification (e.g. Religion, Science, Arts). Search available.[81]
Koninklijke Bibliotheek[30] No The web archive will become available online during the first half of the year 2010.
New Zealand Web Archive[32] Y Y N No Domain harvests are available to selected staff only, using Wayback, and are limited to URL searches. For selective harvests, each website is described in the catalogue (providing subject, author, title and URL searches) and can be viewed by the public via the Internet by clicking on the link to the archived copy. The websites themselves, however, are not indexed.
The National Library of Norway[33] N Y No Sites are integrated in the Catalog. Left bar enables facet navigation with drill-down.[82]
Portuguese Web Archive[2] Y Y Y No 64% of the archive is indexed and an experimental full-text service is freely available. Archived data can be mined through a Hadoop platform.
Web archive of Cacak[34] N N N No Plans to develop a search engine in the future. One drawback of HTTrack is that it renames files during archiving, so the original structure of the website is lost, as well as the file names.
Web Archive Singapore[35] No
Slovenian Web Archive[36] Y N N No The archive is not public yet. Plans to implement full-text search.
Archivo de la Web Española[37] Y (Future) Y (Future) Y (Future) No Plan to grant access through computers available at a given hall.
PADICAT: The Web Archive of Catalonia[38] Y Y Y No Full open access.
Basque Digital Heritage Archive[39] Y Y Y No
Sweden (Kulturarw3)[40] Y N N No Public access through dedicated machines in the library building.
Aleph Archives[41] Y Y Y No The full-text search engine supports automatic metadata extraction and native deduplication of results. Also included: antivirus checker (~250 million pages/day), archive statistics, text summarizer, archive exports (PDF, PNG, TIFF), etc.
Web Archive Switzerland[43] Y Y Y No Web Archive Switzerland is the collection of the Swiss National Library containing websites with a bearing on Switzerland. It has been integrated into e-Helvetica,[83] the access system of the Swiss National Library, which gives access to the entire digital collection and enables full-text search of the web archive. The archived versions of websites can only be viewed in the reading rooms of the Swiss National Library and of the partner libraries that help build the collection of Swiss websites, but their metadata can be viewed from anywhere.
NTU Web Archiving System, NTUWAS[44] Y Y Y No Presents page thumbnails, archived pages mapped to geographical locations.
Web Archive Taiwan[45] Y Y Y No
PageFreezer[7] Y Y Y No Enterprise Class On Demand service to archive and replay websites, blogs, Ajax, Flash, video, audio & social media for litigation protection, eDiscovery and regulatory compliance with FDA, FINRA, FSA, SEC, SOX, Federal Rules of Evidence and records management laws. Used by government agencies and public listed corporations in Pharmaceutical, Food, Finance, Healthcare and Retail industry.
The UK Web Archive[46] Y Y N Native
Hanzo Archives[47] Y Y Y No Commercial web archiving services and appliances. Access includes full-text search, annotations, redaction, URL/History, archive policy and temporal browsing, and configurable metadata schema for advanced e-discovery applications. Used in government and corporations whose compliance or legal obligations / needs extend to their websites, intranet, and social media. Many 'dark' archives across Europe and USA.
UK Government Web Archive[48] Y Y Y Native Full text search is operational on the UK Government Web Archive.[84] Users can browse the collection using a full A-Z list of all sites[85] and a set of categories.[86]
Internet Archive (provides Archive-it service)[49] Y Y Y Native URL history is available for all archived data. Metadata and full-text search only for selected crawls. Until 2002 it had a mining platform for research composed of Alexa Shell Perl Tools, av_tools and the p2 platform for parallel processing.[87] It was replaced by a simpler, direct access method that enables automatic access to files but provides no platform for processing.[88]
Reed Archives[50] No
Columbia University Libraries Web Resources Collection Program[52] Y Y Y No Accessible through Archive-it service.[89] Enhanced access to the Human Rights collection is available at the Human Rights Web Archive.[90]
North Carolina State Government Web Site Archives[53] Y Y Y No Accessible through Archive-it service.[89]
Latin American Web Archiving Project[54] Y Y Y No Content can be accessed via full-text search, or by browsing by country or by specialized sample collection.
Web Archiving Project for the Pacific Islands[55] Y Y Y No Supported by Archive-it service.
Library of Congress Web Archives[56] Y Y N Proxy Access provided via LCWA. Records in MODS (Metadata Object Description Schema) format.
Harvard University Library: the Web Archive Collection Service (WAX)[57] Y Y Y No
Web Archiving Service from California Digital Library (WAS service)[58] Y Y Y No Access for private study, scholarship and research. Most archives built with WAS have not yet been published because it is up to the partners to decide whether they want to provide access. There are 16 partners using the service and they have created over 80 web archives, of which only 30 are publicly accessible. NutchWAX performance did not permit full archive search. The upcoming transition to Solr will permit both full-archive and collection-specific full-text search.
Bentley Historical Library (University of Michigan) Web Archives[59] Y Y Y No Powered by the WAS from the California Digital Library.[91] Access is public but usage is restricted for private study, scholarship and research.
University of Texas at San Antonio Web Archives[60] Y Y Y Native Accessible through Archive-it service[92] and the Texas Archival Repositories Online database[93]
AUEB Web Archive[63] Y Y Y No
World Bank Web Archives[65] Y Y Y No URL history provided via open access to collection via standard web browser. Full text search is only available within each individual site. Search on metadata is available via advanced search within Web Archives collection.
University of North Texas CyberCemetery[67] N Y Y No
Alabama State Government and Politics Web Site and Social Media Archives[94] USA 2005 Archive-it service No 1
Tamiment Library and Robert F. Wagner Labor Archives at New York University[95] Y Y Y No Access is provided through the WAS service[96] as well as through finding aids that are searchable through NYU's finding aids portal.[97]

References

  1. ^ Daniel Gomes; João Miranda; Miguel Costa (25–29 September 2011). "A survey on web archiving initiatives". International Conference on Theory and Practice of Digital Libraries 2011. Springer. Retrieved 23 October 2012.
  2. ^ a b c d Foundation for National Scientific Computing (FCCN) (23 October 2012). "Portuguese Web Archive: search the past". Foundation for National Scientific Computing (FCCN). Retrieved 23 October 2012. 
  3. ^ a b c "Pandora - Australia's Web Archive". nla.gov.au. Retrieved 2013-11-17. 
  4. ^ "AGWA - Australian Government Web Archive". nla.gov.au. Retrieved 2014-10-08. 
  5. ^ "PANDORA Archive". nla.gov.au. Retrieved 2014-10-08. 
  6. ^ a b c "Our digital island, a Tasmanian Web Archive". tas.gov.au. Retrieved 2014-05-29. 
  7. ^ a b "PageFreezer". pagefreezer.com. 2011-01-20. Retrieved 2013-11-17. 
  8. ^ "Oocities.org Cached GeoCities pages"[dead link]
  9. ^ a b c "Web@rchive Austria". Onb.ac.at. Retrieved 2013-11-17. 
  10. ^ a b c "DILIMAG (Digital Literature Magazines)". dilimag.literature.at. Retrieved 2013-11-17. 
  11. ^ a b c "Bibliothèque et Archives nationales du Québec (BAnQ)". banq.qc.ca. Retrieved 2013-11-17. 
  12. ^ a b c "Government of Canada Web Archive (GCWA)". Collectionscanada.gc.ca. Retrieved 2013-11-17. 
  13. ^ a b c "Web Information Collection and Preservation - WICP (Chinese Web Archive)"
  14. ^ a b c "Croatian Web Archive (Hrvatski arhiv weba - HAW)". Haw.nsk.hr. 2004-10-01. Retrieved 2013-11-17. 
  15. ^ a b c "WebArchiv (National Library of the Czech Republic)". en.webarchiv.cz. Retrieved 2013-11-17. 
  16. ^ a b c "Netarkivet.dk". Netarkivet.dk. 2013-10-17. Retrieved 2013-11-17. 
  17. ^ a b c "Estonian Web Archive". National Library of Estonia. 2014-01-09. Retrieved 2014-01-09. 
  18. ^ a b c "Finnish Web Archive". kansalliskirjasto.fi. Retrieved 2013-11-17. 
  19. ^ a b c "Ina (Institut National de l'Audiovisuel)" (in French). Ina.fr. Retrieved 2013-11-17. 
  20. ^ a b c "E-diasporas (Télécom ParisTech, FMSH)". ediasporas.ticmigrations.fr. Retrieved 2013-11-17. 
  21. ^ "Internet Memory Research (ATN service)". archivethe.net. Retrieved 2013-11-17. 
  22. ^ a b c "Bibliotheksservice-Zentrum Baden-Württemberg". Bsz-bw.de. Retrieved 2013-11-17. 
  23. ^ a b c "Web archive of the German Bundestag". Webarchiv.bundestag.de. Retrieved 2013-11-17. 
  24. ^ a b c "Iceland - VEFSAFN". Vefsafn.is. Retrieved 2013-11-17. 
  25. ^ "The National Library of Israel". nli.org.il. Retrieved 2013-08-19. 
  26. ^ a b c "Japan Web Archiving Project". da.ndl.go.jp. Retrieved 2013-11-17. 
  27. ^ "Web Archiving Project - CDNLAO 2010 meeting" (PDF). NDL.go.jp. Retrieved 2013-11-17. 
  28. ^ a b c "National Library of Korea - OASIS (Online Archiving & Searching Internet Resource)". Oasis.go.kr. 2013-08-01. Retrieved 2013-11-17. 
  29. ^ a b c Koninklijke Bibliotheek, National Library of the Netherlands
  30. ^ "National Library of Latvia"
  31. ^ a b c "New Zealand Web Archive". Natlib.govt.nz. Retrieved 2013-11-17. 
  32. ^ a b c "The National Library of Norway" (in Norwegian). NB.no. Retrieved 2013-11-17. 
  33. ^ a b c Web archive of Cacak. digital.cacak.dis.rs
  34. ^ a b c "Web Archive Singapore". Was.nl.sg. Retrieved 2013-11-17. 
  35. ^ a b c "Slovenian Web Archive". zal-lj.si. 2013-11-05. Retrieved 2013-11-17. 
  36. ^ a b c "Archivo de la Web Española"
  37. ^ a b c National Library of Catalonia (16 November 2012). "PADICAT: The Web Archive of Catalonia". National Library of Catalonia. Retrieved 16 November 2012. 
  38. ^ a b c Kai Oswald Seidler. "Basque Digital Heritage Archive (ONDARENET)". euskadi.net. Retrieved 2013-11-17. 
  39. ^ a b c Krister Persson (2008-04-20). "National Library of Sweden - Sweden (Kulturarw3)". Kb.se. Retrieved 2013-11-17. 
  40. ^ a b c AAW Designs. "Aleph Archives". aleph-archives.com. Retrieved 2013-11-17. 
  41. ^ "Web Archiving Bucket". webarchivingbucket.com. Retrieved 2013-11-17. 
  42. ^ a b c "Web Archive Switzerland". E-helvetica.nb.admin.ch. Retrieved 2013-11-17. 
  43. ^ a b c "NTU Web Archiving System, NTUWAS". ntu.edu.tw. Retrieved 2013-11-17. 
  44. ^ a b c "Web Archive Taiwan". ncl.edu.tw. Retrieved 2013-11-17. 
  45. ^ a b c "UK Web Archive". Webarchive.org.uk. 2005-07-07. Retrieved 2013-11-17. 
  46. ^ a b c "Hanzo Archives". hanzoarchives.com. Retrieved 2013-11-17. 
  47. ^ a b c "UK Government Web Archive". Nationalarchives.gov.uk. Retrieved 2013-11-17. 
  48. ^ a b c "Internet Archive (provides Archive-it service)". archive.org. 2001-03-10. Retrieved 2013-11-17. 
  49. ^ a b c "Reed Archives". ReedArchives.com. Retrieved 2013-11-17. 
  50. ^ "Web Archiving | Stanford University Libraries". Retrieved 2014-03-26. 
  51. ^ a b c "Columbia University Libraries Web Resources Collection Program". columbia.edu. Retrieved 2013-11-17. 
  52. ^ a b c "North Carolina State Government Web Site Archives". ncdcr.gov. Retrieved 2013-11-17. 
  53. ^ a b c "Latin American Web Archiving Project". utexas.edu. Retrieved 2013-11-17. 
  54. ^ a b c "Web Archiving Project for the Pacific Islands". hawaii.edu. Retrieved 2013-11-17. 
  55. ^ a b c "Library of Congress Web Archives". Loc.gov. Retrieved 2013-11-17. 
  56. ^ a b c "Harvard University Library: the Web Archive Collection Service (WAX)". harvard.edu. Retrieved 2013-11-17. 
  57. ^ a b c "Web Archiving Service from California Digital Library (WAS service)". cdlib.org. 2013-10-16. Retrieved 2013-11-17. 
  58. ^ a b c "Bentley Historical Library (University of Michigan) Web Archives". umich.edu. Retrieved 2013-11-17. 
  59. ^ a b c "University of Texas at San Antonio Web Archives". Archive-it.org. Retrieved 2013-11-17. 
  60. ^ a b "Qumram". Qumram.ch. 2011-06-30. Retrieved 2013-11-17. 
  61. ^ SAPERION AG, Berlin. "Saperion ECM Web Content Archive". saperion.com. Retrieved 2013-11-17. 
  62. ^ a b c "AUEB Web Archive". aueb.gr. 2011-10-21. Retrieved 2013-11-17. 
  63. ^ "Archiving the Web sites of Athens University of Economics and Business". aueb.gr. Retrieved 2013-11-17. 
  64. ^ a b c "World Bank Web Archives". worldbank.org. 2012-12-20. Retrieved 2013-11-17. 
  65. ^ "OpenGovData Russia project Archives"
  66. ^ a b c Government Documents Department, University of North Texas Libraries, State of Texas (2009-02-02). "University of North Texas CyberCemetery". unt.edu. Retrieved 2013-11-17. 
  67. ^ "[ウェブサービスレビュー]ZIPや画像のダウンロードにも対応した魚拓サービス「Archive today」 - CNET Japan". CNET Japan. Retrieved 2014-09-02. 
  68. ^ "NYU Libraries | Tamiment Library & Robert F. Wagner Labor Archives". Nyu.edu. Retrieved 2013-08-19. 
  69. ^ "How Preservica Works - Preservica". preservica.com. May 12, 2014. Archived from the original on May 12, 2014. Retrieved May 12, 2014. 
  70. ^ Central State Electronic Archives of Ukraine (CSEA Ukraine)
  71. ^ Information Booklet CSEA Ukraine
  72. ^ "LINC Tasmania Online - Home page". Statelibrary.tas.gov.au. 2012-06-26. Retrieved 2012-07-17. 
  73. ^ "European Archive". Europarchive.org. Retrieved 2013-11-17. 
  74. ^ "Literatur im Netz des Deutschen Literaturarchivs Marbach". boa-bw.de. Retrieved 2013-11-17. 
  75. ^ "Trove (Pandora Archive search)". nla.gov.au. Retrieved 2013-11-17. 
  76. ^ "Bibliothèque et Archives nationales du Québec (BAnQ)". banq.qc.ca. 
  77. ^ "Government of Canada Web Archive (GCWA)". gc.ca. 2007-11-15. Retrieved 2013-11-17. 
  78. ^ "BOA - Baden-Württembergisches Online-Archiv". boa-bw.de. Retrieved 2012-07-17. 
  79. ^ "Web archive of the German Bundestag". bundestag.de. Retrieved 2013-11-17. 
  80. ^ "National Library of Korea - OASIS". go.kr. 2013-08-01. Retrieved 2013-11-17. 
  81. ^ "National Library of Norway Search". nb.no.
  82. ^ "Web Archive Switzerland - e-Helvetica". nb.admin.ch. Retrieved 2013-11-17. 
  83. ^ "UK Government Web Archive - Collections, Europarchive". europarchive.org. Retrieved 2013-11-17. 
  84. ^ "UK National Archives Webarchive". gov.uk. Retrieved 2013-11-17. 
  85. ^ "UK National Archives". gov.uk. Retrieved 2013-11-17. 
  86. ^ "Researcher - Documentation". archive.org.
  87. ^ "Using Archive.org". archive.org.
  88. ^ a b "Archive-it: Columbia University Libraries". archive-it.org.
  89. ^ "Human Rights Web Archive at Columbia University". columbia.edu.
  90. ^ "California Digital Library Alternative Mass Media". cdlib.org.
  91. ^ "Archive-it Partners". archive-it.org
  92. ^ "Texas Archival Repositories Online". utexas.edu.
  93. ^ "Alabama State Government and Politics Web Site and Social Media Archives"
  94. ^ "Tamiment Library Web Archiving Project"[dead link]
  95. ^ "Institution: New York University Libraries / Tamiment Library (Labor & the Left)". cdlib.org. Retrieved 2013-08-19. 
  96. ^ "Search Finding Aids Hosted at New York University". nyu.edu. Retrieved 2013-08-19.