User:Mdupont/Open content
Implementation
Source code :
https://github.com/h4ck3rm1k3/open-everything-library/tree/extractor
and download helpers:
https://github.com/h4ck3rm1k3/open-everything-library/tree/helpers
Hosting
Algorithm
Start with Category:Open content and follow all subcategories and pages. Extract all external links. Fetch all external pages.
Look at the external websites and determine which are open content.
Look at software projects, extract information, and cross-reference it with metadata from the sources listed below.
Store the data in JSON format in a MongoDB database; there are currently over 200 GB of data.
Merge the various data sources based on external URLs, names, and source control repositories (see the sketch after this list).
The goal is to push the merged data into buckets on archive.org so that they can be downloaded as zipped data files/parts if needed.
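A minimal sketch of that merge step, assuming a local MongoDB instance; the database, collection, and field names here are illustrative, not the actual schema:

# Sketch: fold records from different sources into one MongoDB document per project.
# Database, collection, and field names are illustrative assumptions.
from pymongo import MongoClient

client = MongoClient()                  # local mongod on the default port
projects = client["oel"]["projects"]    # hypothetical database and collection

def merge_record(record, source):
    # Choose a merge key in order of preference: repo URL, homepage, then name.
    key = record.get("repo_url") or record.get("homepage") or record.get("name")
    if not key:
        return
    projects.update_one(
        {"merge_key": key},
        {"$set": {source: record}, "$addToSet": {"sources": source}},
        upsert=True,
    )

merge_record({"name": "chronic", "repo_url": "https://github.com/browning/chronic"}, "github")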
github
GitHub metadata is not free per se; it is limited by the terms of service. There is an API you can use to get the data: https://developer.github.com/v3/repos/#list-all-public-repositories
There is a dump of projects from Archive Team, but it is outdated: https://archive.org/details/archiveteam-github-repository-index-201212
Pulling via the authenticated API: https://api.github.com/repositories?since=%d&access_token=%s
Status: importing JSON, still downloading, at item ID 46314762.
Code for downloading: https://github.com/h4ck3rm1k3/open-everything-library/blob/helpers/sources/github/get.py takes the last ID downloaded as a parameter and requires an authentication token. Code for importing: https://github.com/h4ck3rm1k3/open-everything-library/blob/extractor/import_gh2.py
As of Jan 25 all projects have been downloaded; the last ID was 50342814.
The repository listing does not contain the homepage, and the API listing is very verbose; another call to fetch the project details is needed to get the homepage.
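A minimal sketch of that two-step process, assuming the requests library and a personal access token; the helper scripts linked above are the actual implementation:

# Sketch: page through /repositories with ?since=<last id>, then fetch each
# repository's detail URL to obtain the "homepage" field.
import requests

TOKEN = "XXXXX"                               # personal access token
HEADERS = {"Authorization": "token " + TOKEN}

def list_since(last_id):
    r = requests.get("https://api.github.com/repositories",
                     params={"since": last_id}, headers=HEADERS)
    r.raise_for_status()
    return r.json()                           # list of repository summaries

def homepage(repo):
    # The summary object lacks "homepage"; a second call to repo["url"] returns it.
    r = requests.get(repo["url"], headers=HEADERS)
    r.raise_for_status()
    return r.json().get("homepage")

last_id = 46314762
for repo in list_since(last_id):
    print(repo["id"], repo["full_name"], homepage(repo))
    last_id = repo["id"]                      # pass as "since" for the next page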
Here is an example of what we can collect from the API and put into a wiki for hosting/editing:
http://freedom-and-openness.wikia.com/wiki/GitHub_Projects/browning/chronic
sf.net
First we need a list of projects. Project export: http://sourceforge.net/blog/project-data-export/
Nonfree data from 2014
http://srda.cse.nd.edu/mediawiki/index.php/Main_Page
First we get the list of projects with this URL pattern, starting with https://sourceforge.net/directory/os:linux/?page=1 and incrementing the page parameter:
http://sourceforge.net/directory/os%3Alinux/?page=${page}
We have fetched 1982 pages. The site says there are 16771 pages, but after page 1982 the web server stops responding. TODO: access via more categories.
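A minimal sketch of that page loop, assuming the requests library; it simply saves the HTML for later scraping and stops when the server stops answering:

# Sketch: walk the os:linux directory listing page by page and save the HTML.
import requests

page = 1
while True:
    url = "http://sourceforge.net/directory/os%3Alinux/?page={0}".format(page)
    r = requests.get(url)
    if r.status_code != 200 or not r.text.strip():
        break                                  # the server stops after ~1982 pages
    with open("sf_page_{0}.html".format(page), "w") as f:
        f.write(r.text)
    page += 1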
The DOAP files are then extracted with this script:
https://github.com/h4ck3rm1k3/open-everything-library/blob/helpers/sources/sf.net/doap.sh
The import of the DOAP files is then done by import_sf_doap.py: https://github.com/h4ck3rm1k3/open-everything-library/blob/extractor/import_sf_doap.py
cats
The pulling of the categories was done by a simple recursive scan: https://github.com/h4ck3rm1k3/open-everything-library/blob/helpers/sources/sf.net/cats/doit.sh
But this got very slow, so I wrote a scraper and started importing the results into MongoDB: https://github.com/h4ck3rm1k3/open-everything-library/blob/extractor/process_sf_cats.py This imports all the pages downloaded so far, including the pages, the subcategories, the facets, and the number of pages. After that import runs, it will scan the category pages for pages that have not been imported yet and fetch those dynamically.
debian
software packages
https://wiki.debian.org/qa.debian.org/pts/RdfInterface
Full dump: packages.qa.debian.org:/srv/packages.qa.debian.org/www/web/full-dump.tar.bz2
rdf.debian.net is the newest. See also https://wiki.debian.org/UltimateDebianDatabase which has a great deal of information.
udd
https://udd.debian.org/ provides an SQL database.
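A minimal sketch of querying it, assuming the public read-only mirror; the host, credentials, and table/column names are assumptions, so check the UDD documentation before relying on them:

# Sketch: query the Ultimate Debian Database over PostgreSQL.
# Host, credentials, and the sources table/columns are assumptions.
import psycopg2

conn = psycopg2.connect(host="udd-mirror.debian.net", dbname="udd",
                        user="udd-mirror", password="udd-mirror")
cur = conn.cursor()
cur.execute("SELECT source, homepage FROM sources LIMIT 10;")
for source, homepage in cur.fetchall():
    print(source, homepage)
conn.close()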
wnpp
The intent-to-package bug reports contain information about packages not yet in Debian. https://www.debian.org/devel/wnpp/
openhub
api
https://github.com/blackducksoftware/ohloh_api/blob/master/reference/project.md
Get a key from https://www.openhub.net/accounts/<username>/api_keys
API_KEY=XXXXX
For each page: curl --output projects.xml --verbose "https://www.openhub.net/projects.xml?api_key=$API_KEY&page=${page}"
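A minimal sketch of the same paging loop in Python, assuming the requests library; the API key is a placeholder and the loop stops at the first page without project entries:

# Sketch: page through projects.xml and print project names.
import requests
import xml.etree.ElementTree as ET

API_KEY = "XXXXX"
page = 1
while True:
    r = requests.get("https://www.openhub.net/projects.xml",
                     params={"api_key": API_KEY, "page": page})
    root = ET.fromstring(r.content)
    names = [n.text for n in root.iter("name")]
    if not names:
        break                   # no more projects on this page
    for name in names:
        print(name)
    page += 1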
bitbucket
Start from https://bitbucket.org/api/2.0/repositories/ and extract the next page URL with: URL=`jq .next $OUT -r`
Status: downloaded 71763 pages of 10 projects each via JSON; importing all pages.
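A minimal sketch of following the "next" links in Python instead of jq, assuming the 2.0 API's "values"/"next" fields as seen in the responses:

# Sketch: walk the Bitbucket 2.0 repositories endpoint by following "next".
import json
import requests

url = "https://bitbucket.org/api/2.0/repositories/"
page = 0
while url:
    data = requests.get(url).json()
    with open("bitbucket_{0}.json".format(page), "w") as f:
        json.dump(data, f)
    for repo in data.get("values", []):
        print(repo.get("full_name"))
    url = data.get("next")      # absent/None on the last page
    page += 1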
gitlab
eclipse foundation
http://projects.eclipse.org/search/projects?page=1
php
https://packagist.org/
fossil
http://fossil.include-once.org/
freshcode
perl cpan
The 02packages.details.txt.gz file is cached locally when you run cpan, at ~/.cpan/sources/modules/02packages.details.txt.gz. Source: http://www.cpan.org/modules/02packages.details.txt.gz
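A minimal sketch of parsing that file, assuming the usual layout of a short header, a blank line, then one "module version path" line per entry:

# Sketch: read 02packages.details.txt.gz and list module, version, and tarball path.
import gzip
import os

path = os.path.expanduser("~/.cpan/sources/modules/02packages.details.txt.gz")
with gzip.open(path, "rt") as f:
    # Skip the header block, which ends at the first blank line.
    for line in f:
        if not line.strip():
            break
    for line in f:
        module, version, dist = line.split()[:3]
        print(module, version, dist)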
npm
The npm utility caches package information.
Run:
npm search
(see http://www.sitepoint.com/beginners-guide-node-package-manager/)
This will populate:
~/.npm/registry.npmjs.org/-/all/.cache.json
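A minimal sketch of reading that cache, assuming it is one large JSON object keyed by package name (keys starting with "_" are registry metadata):

# Sketch: read the npm registry cache and print package names with repository URLs.
import json
import os

path = os.path.expanduser("~/.npm/registry.npmjs.org/-/all/.cache.json")
with open(path) as f:
    packages = json.load(f)

for name, meta in packages.items():
    if name.startswith("_"):            # e.g. the "_updated" timestamp
        continue
    repo = meta.get("repository")
    if isinstance(repo, dict):          # repository can be a dict or a string
        repo = repo.get("url", "")
    print(name, repo or "")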
ruby gems
python packages
API: https://www.python.org/dev/peps/pep-0503/
Get an index of all packages: wget -m -r -l1 https://pypi.python.org/simple/ That will get you a full list of packages and versions. Get the main page for each package: https://pypi.python.org/pypi/${PKG}
wget -m --no-parent -r -l1 https://pypi.python.org/pypi/ Each Python package has DOAP information that can be fetched via the main index; the URL looks like this: 'https://pypi.python.org/pypi?:action=doap&name=${PACKAGENAME}'
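A minimal sketch of reading the simple index and building the per-package DOAP URLs, assuming the index is plain HTML with one anchor per package:

# Sketch: list package names from the simple index and print each DOAP URL.
import re
import requests

index = requests.get("https://pypi.python.org/simple/").text
names = re.findall(r'<a href="[^"]*">([^<]+)</a>', index)

for name in names:
    print("https://pypi.python.org/pypi?:action=doap&name={0}".format(name))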
prismbreak
Projects are listed here: https://github.com/nylira/prism-break/tree/master/source/db https://github.com/nylira/prism-break https://prism-break.org/en/
fsf software directory
http://directory.fsf.org/wiki/Main_Page
The download is here: http://static.fsf.org/nosvn/directory/directory.xml
See: http://lists.gnu.org/archive/html/directory-discuss/2013-09/msg00001.html
rapper -o turtle file:directory.xml > directory.ttl
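A minimal sketch of loading the converted Turtle file with rdflib and listing entry titles; the dcterms:title predicate is an assumption about the export's vocabulary, so inspect directory.ttl first:

# Sketch: load directory.ttl and print each subject with its title.
# The dcterms:title predicate is an assumption about the vocabulary used.
import rdflib

g = rdflib.Graph()
g.parse("directory.ttl", format="turtle")

TITLE = rdflib.URIRef("http://purl.org/dc/terms/title")
for subject, title in g.subject_objects(TITLE):
    print(subject, title)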
Wikidata
https://www.wikidata.org/wiki/Wikidata:Database_download https://dumps.wikimedia.org/wikidatawiki/entities/20160111/
Apache
http://svn.apache.org/viewvc/
svn co https://svn.apache.org/repos/asf/comdev/projects.apache.org
List of DOAP files: http://svn.apache.org/viewvc/comdev/projects.apache.org/data/projects.xml?view=markup
golang
https://golang.org/pkg/ http://go-search.org/search?q=&p=1
rlang
Java
Maven
http://repo1.maven.org/maven2/ http://repo.maven.apache.org/maven2/ http://mvnrepository.com/open-source?p=2
Emacs
https://github.com/emacsmirror/emacswiki.org
git clone git://github.com/emacsmirror/emacswiki.org.git emacswiki
cd emacswiki
git checkout master
wikiapiary
Libre Planet
https://libreplanet.org/wiki/Main_Page
A dump of the wiki can be found here: https://archive.org/details/LibreplanetDotOrgWikiDump20160123
Without System D
http://without-systemd.org/wiki/index.php/Init
Fdroid
Data:
git clone https://gitlab.com/fdroid/fdroiddata.git
see https://gitlab.com/fdroid/fdroiddata
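A minimal sketch of scanning the checked-out metadata, assuming the plain-text "Field:Value" layout the .txt metadata files used at the time; the field names are assumptions, so check a few files first:

# Sketch: walk fdroiddata/metadata/*.txt and print the license and source repo.
# The "Field:Value" layout and the field names are assumptions.
import glob
import os

for path in glob.glob("fdroiddata/metadata/*.txt"):
    fields = {}
    with open(path) as f:
        for line in f:
            if ":" in line and not line.startswith(" "):
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
    app_id = os.path.basename(path)[:-4]
    print(app_id, fields.get("License", ""), fields.get("Source Code", ""))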
Server:
git clone https://gitlab.com/fdroid/fdroidserver.git
see https://gitlab.com/fdroid/fdroidserver
Other projects
to review
cii-census
https://github.com/linuxfoundation/cii-census
open-frameworks-analyses
https://github.com/wikiteams/open-frameworks-analyses
This project has already downloaded Open Hub projects.