Wikipedia:WikiProject Molecular Biology/Molecular and Cell Biology/Proposals/Archive 3

Wikipedia, Pfam/Interpro and SMART

Most of you know about Pfam/Interpro, which provides brief but very systematic annotations (short summaries) for different protein families, and also about SMART, which does the same for different protein domains. These summaries are at the level of "stubs" or better. I understand that Pfam/Interpro and SMART operate under the same "open access" policy as Wikipedia, which means that everyone can copy and modify the content. It would be possible to identify a set of the most important protein families and domains that are missing in Wikipedia but present in Interpro and SMART, and copy their summaries as the initial Wikipedia "stubs" with a reference and link to the corresponding Interpro or SMART entries. We could also ask people from Interpro and SMART what they think about such an idea, and they might even be willing to help. Biophys 17:04, 16 November 2006 (UTC)

See list of SMART domains: [1]. Few of them can be found in Wikipedia. I think the summaries can be downloaded to Wikipedia automatically, but it is important to have consent from the SMART authors. Of course, the idea is to improve these short summaries in the future. Biophys 17:43, 16 November 2006 (UTC)
SMART uses annotation from InterPro which contains copyrighted information such as PROSITE annotation. I have e-mailed Pfam and asked about the copyright status of their database. TimVickers 19:42, 16 November 2006 (UTC)
Their reply was as follows:

Hi Tim,

Pfam is distributed under the terms of the GNU GPL license. According to that license any derivatives should also be distributed under GNU GPL. However, we tend to take a pragmatic view for small parts of the data to make Pfam maximally useful. Do you have an example of the kind of info you would take from Pfam?

Pfam is really a database of protein family annotations rather than for individual proteins. We would certainly be interested in providing links etc and whatever information we can.

Yours sincerely Alex Bateman

Good. If I understand correctly, Wikipedia operates under a GNU license. What I mean is this. For example, Wikipedia has no article about C2 domains. I would go to the SMART C2 domain annotation: [2], copy the annotation, maybe modify this annotation (but maybe not), make internal Wikipedia references within the annotation, and provide this link to SMART [3]. That would be a stub about C2 domains. Someone could improve it in the future. Would that be fine? I can do this for a couple of domains as an experiment, and then ask Alex Bateman if he likes it. Of course, it would be much better if people from the SMART/PFAM team generated such Wikipedia stubs automatically (but one has to make sure that the corresponding article is not already in Wikipedia). Then, someone could look through these stubs and wikify them. Biophys 22:51, 11 December 2006 (UTC)
You can't do that with SMART, because as I said earlier, this contains copyrighted information from Prosite. However, you can do this with Pfam. TimVickers 23:21, 11 December 2006 (UTC)
Then I will use Pfam if needed. Actually, the annotation in SMART consists of two parts. One part is abstract from INTERPRO, and it is exactly the same as in Pfam. Another part is a kind of header ("Description"), which is not taken from PROSITE but can be found only in SMART. Biophys 00:56, 12 December 2006 (UTC)
I have created several new articles using this method. Pfam helps a lot, but some editing is usually required. Unfortunately, some Pfam entries are poorly annotated. Biophys 04:05, 12 December 2006 (UTC)
Pfam got back in touch this morning. TimVickers 16:58, 2 February 2007 (UTC)
Hi Tim,
I have spoken to several members of the Pfam consortium and there is unanimous support for you doing this. Please let me know if we can help with this.
Are you also interested in RNA families? I am also in charge of the Rfam database. One of our goals for the coming year is to make the annotation for Rfam into a community resource using a wiki. However if this were part of Wikipedia then so much the better. Do you think that is feasible?
Yours sincerely
Alex Bateman
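
The stub workflow Biophys describes above (copy a Pfam summary, credit the source, and skip any family that already has an article) could be sketched roughly as follows. This is only an illustration under stated assumptions: `build_stub`, its parameters, the wording of the opening sentence, and the placeholder accession `PF00000` are all hypothetical, and the list of existing titles is assumed to come from somewhere else. Nothing here reflects a tool that was actually used.

```python
def build_stub(family_name, summary, pfam_accession, existing_titles):
    """Return draft wikitext for a protein family, or None if an article exists.

    All names here are illustrative; the summary is assumed to be text that
    Pfam has agreed may be reused.
    """
    # Compare titles case-insensitively so "c2 domain" matches "C2 domain".
    existing = {title.strip().lower() for title in existing_titles}
    if family_name.strip().lower() in existing:
        return None  # never overwrite or duplicate an existing article

    lines = [
        f"'''{family_name}''' is a protein family or domain annotated in Pfam ({pfam_accession}).",
        "",
        summary.strip(),
        "",
        "== External links ==",
        f"* Pfam entry {pfam_accession} (source of the initial summary)",
    ]
    return "\n".join(lines)
```

A human editor would still need to review and wikify each draft, since, as noted above, some Pfam entries are poorly annotated.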

Scientific citations

Would your WikiProject like to endorse Wikipedia:Scientific citation guidelines? If so, please let those editors at that guideline know. --ScienceApologist 19:07, 1 December 2006 (UTC)

I agree. Consistent referencing is important for all articles (although I can only be bothered to do Harvard and will leave it to some wikignome to convert Harvard to footnote references (which I do prefer over Harvard; just not enough to go through the bother)). I predict support from the rest of MCB and am now on my way to let those editors know that we endorse the guidelines (if enough people disagree (I think unlikely) we can always revoke our endorsement). --Username132 (talk) 22:20, 15 December 2006 (UTC)
Doesn't seem to be much activity on this issue; if there's an official stage of endorsement or whathaveyou, we can probably move on to that. Opabinia regalis 04:13, 20 December 2006 (UTC)
I'm not sure what would constitute an "official stage". These guidelines are already operational. The page currently starts off with:
This page is a guideline for Mathematics, Physics, and Chemistry.
It expresses the consensus of editors in those projects about specific details of inline citation. Editors in other scientific projects should follow the practice followed by those projects.
WikiProject Chemistry was just added today, following a "vote" of endorsement at Wikipedia talk:WikiProject Chemistry#Wikipedia:Scientific citation guidelines. The question here is: can Molecular and Cellular Biology be added to the projects explicitly listed on the guideline page? At the moment there is no indication of consensus, but only the absence of manifest opposition.  --LambiamTalk 08:37, 20 December 2006 (UTC)

Vote on proposal CLOSED I set up a vote on this. TimVickers 18:39, 20 December 2006 (UTC) Vote page

Proposal from Novartis/GNF

I got an interesting e-mail this morning.

Hi Tim,
Since it looks like you have some official (or at least very active) role in the MCB project at Wikipedia, I thought I'd try emailing you first. I'm wondering if there is a potential synergy between our two projects...
I lead the "Symatlas" project at GNF (http://symatlas.gnf.org/SymAtlas/). The goal of this application is two-fold. First, we want this to serve as a "gene portal" (with a mammalian bias) which collates all the relevant information in the public domain for all genes. Second, we use this application to release our data into the public domain. Right now, we primarily have gene expression data, centered around our "GeneAtlas" data set which measures expression across an anatomically diverse set of tissues. In the future, we will also post our data for large-scale siRNA screening.
Right now, we're in the process of rebuilding SymAtlas to improve the user interface, responsiveness, features -- pretty much everything. One of the things on our list of new features is a wiki. We were originally thinking of maintaining (and possibly coding) our own wiki for tighter integration with SymAtlas, but actually the MCB effort may be a good partner. You guys have seeded quite a bit of content and probably have a pretty broad audience. We have a bunch of custom data that we could contribute, and we also have a decent sized audience (3000 visitors and 50,000 pageviews per week).
Anyway, let me know if you think this might be mutually beneficial.
Cheers,
-andrew
Andrew Su, Ph.D.
Genomics Institute of the Novartis Research Foundation

I replied.

Hi Andrew
Thank you, this sounds like a good opportunity. We are always happy to co-ordinate with people who want to add content to Wikipedia. Obviously, anything that is added must be licensed under the GFDL and be verifiable (published elsewhere). The advantage to using Wikipedia for distributing data is that it has very high visibility and can be integrated with an unlimited number of other resources. The disadvantage is that, due to open editing, the data can be altered and is thus less reliable than information maintained on a third-party website.
With these advantages and disadvantages in mind, what information and what form of presentation were you considering? One possibility that comes to mind is the Protein Infobox
http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Molecular_and_Cellular_Biology/Style_guidelines#Infoboxes
It might be possible to import data from your project to this standard format and produce a basic summary for each gene/protein in your database.
I will post this on the MCB website and canvass for other ideas. Please feel free to join and participate directly!
Thank you again for your interest.
Dr Tim Vickers

So does anybody else have other suggestions as to how we could coordinate? This could be a very valuable collaboration. TimVickers 16:56, 5 January 2007 (UTC)

Hi Tim!
It's a very tempting prospect, and I'd support it all; way to go, big N! :) I'm worried only that the uploading of new (albeit verifiable) data to Wikipedia would violate WP:NOR. Could we get a special dispensation for uploading data that anyone could verify, e.g., the results of publicly available web-servers for a given protein sequence and similar stuff? Alternatively, they could "publish" at their site and then copy the data over to us (that's probably what you were thinking?) but the unnecessary duplication pains me somewhat; Occam's razor and all that.
Perhaps they'd be willing to open up their wiki to members of the MCB WikiProject as a GFDL resource, so that we and others could add content with a good conscience? I'd devote hours to describing my favourite proteins, as would others here, I believe; at least I recall several people on our membership list who wanted to make a page for every known protein... Willow 18:05, 5 January 2007 (UTC)


Thanks Tim for putting this in the appropriate place for discussion. I thought I'd supplement my rather general email above with a few specific proposals/ideas.
First, two ideas on how we might be able to contribute content to Wikipedia that could be accomplished pretty easily (if they were deemed to be desirable).
1) For all the gene entities that are currently in Wikipedia, we could add the gene expression profile from our SymAtlas database to the Wikipedia gene page. Yes, these data have been published previously. I would need to check with our legal department to confirm that we can release it under GFDL, but I'm confident that could happen. In terms of whether the data is appropriate/interesting, we feel that information on where a gene is expressed in the human/mouse anatomy is a basic piece of gene annotation.
2) As mentioned above, one of the goals of SymAtlas is to be a gene portal, and as such we have a database of nonredundant mammalian genes, plus all the links to native data sources (Entrez Gene, Ensembl, Affymetrix chips, GO, etc.) In total, we have ~190K species-specific entries between mouse, human, and rat, which means just over 60k species-independent genes. (I know, counts are high, but so far we've been conservative with collapsing gene entities.) Anyway, we could do some sanity filtering, and then create stubs for a large number (thousands or tens of thousands) of genes with expression patterns and appropriate database links.
And now, two ideas on possible ways SymAtlas could utilize Wikipedia:
3) We'd like to embed a wiki section in a SymAtlas "gene report" (together with other gene annotation information). If the wiki were Wikipedia, then the link to edit a page would clearly just redirect to Wikipedia itself. But for display in the gene report, we'd need to figure out a way to capture the content without a lot of the surrounding elements. For example, we'd want to take off the left navigation bar and much of the top header. Is this a faux pas?
4) We'd like to create simple ways for people to add content in a structured way. For example, I'd like it if we had a simple text-box where people could enter a Pubmed ID or a URL, and hitting "submit" would trigger a bot to add that link to an appropriate area in the wiki. We get a lot of users at SymAtlas who don't have a lot of computer sophistication, so if we want them to contribute their biological info, we need a pretty darn low barrier of activation.
And finally, two ideas/thoughts on SymAtlas-specific needs that come to mind during this initial brainstorming process:
5) Although Wikipedia wants to catalog previously published findings, I think there's probably a use to have a wiki that allows contribution of less-substantiated results. (Perhaps this has already been discussed in this forum?) I think it would be cool if someone could, for example, take a list of 100 genes that were differentially expressed in their gene expression experiment, search for them in SymAtlas, create a tag that describes the preliminary finding, and post it to each of those 100 wiki pages (in a specific section, of course). This might be a decent way to foster new collaborations. However, given the less rigorous nature of these data, maybe that's an argument to set up a parallel wiki effort, and SymAtlas would combine content from them both.
6) And finally, (to answer the question of what does GNF/Novartis have to gain from all this) I'd really like an internal-only section for proprietary content. This would house data like "four small molecules targeting this gene all failed due to liver toxicity" with links to the appropriate internal reports. Perhaps this is an argument for a third parallel wiki (or building a custom wiki solution with a security model built-in).
Okay, I think that's it for now. Sorry for the long-winded reply, but we've been thinking about the possibilities for a while and are quite excited! AndrewGNF 18:27, 5 January 2007 (UTC)
A few quick notes. First, does the fact that these data have all been published satisfy the WP:NOR? Second, we don't yet have a wiki, but when we do have it set up, it will be open to the entire community. Whether we use wikipedia, the MediaWiki software, or build our own is still up for discussion. But we hope to have something in place in six months. Third, in case you want to see what a SymAtlas gene report currently looks like, this is the one for ITK (though as mentioned above, the user interface is currently undergoing a thorough refactoring, also targeted to be done in ~six months...) AndrewGNF 19:02, 5 January 2007 (UTC)
We have been discussing using Pfam summaries as the basis of short articles. If we could merge your expression data and links to other databases with the relevant Pfam description, this would be a good solid base for a gene page. TimVickers 19:46, 5 January 2007 (UTC)
That could be a great collaboration! But I just received a message from User:Where who said that Wikipedia is under the GFDL, and not the GPL, so he is not sure if we can use Pfam summaries. So, can we actually use them? Biophys 20:08, 5 January 2007 (UTC)
We do have links to protein domains, but actually through Interpro. Interpro is to protein domains what SymAtlas is to genes. Given many different data providers with different IDs for a single concept (genes/families), SymAtlas and Interpro seek to create a nonredundant index. Check out, for example, the SymAtlas page for CDK2. Half way down the annotation column on the right, you'll see a section for "Protein Family" and a link to "Protein Kinase". That InterPro ID links to Pfam, as well as the corresponding model for "Protein kinase" in Prodom and Prosite. But bottom line, our expression data is on a per-gene basis, so it wouldn't make sense to link on that level (since protein families contain multiple genes). These expression data (and all the other links to public databases we've assembled) I think would be most appropriately presented on a "gene" page (e.g., P53) AndrewGNF 20:20, 5 January 2007 (UTC)
This may be slightly off topic, but have you guys considered adding OMIM content? e.g., P53. In terms of having highly curated and referenced annotation, I can't think of a better source. Their terms of use also appear to be very permissive. AndrewGNF 21:28, 5 January 2007 (UTC)
Sadly we can't use that. All content in Wikipedia must be licensed by the GFDL licence, which allows unrestricted copy, modification and use. TimVickers 21:45, 5 January 2007 (UTC)

This collaboration looks great but I do wonder how this will work with respect to copyright issues. I am unclear on what you mean, Andrew, by an internal-only section. If this is what you need then you will certainly have to have your own wiki. I wonder if your best bet is to use the wikipedia as a testing ground for what you want to do on your own wiki. This allows all scientists (actually anyone) to have some input with regard to the format and content. In other words develop it here and then transfer the results to your own database. In that way you will be able to tap into opinions and expertise here without running afoul of the open access. It seems you gain more than us but if you do a good job we will have access to a lot of great data and that is all that matters. David D. (Talk) 22:28, 5 January 2007 (UTC)

Obviously there are some issues to work out here, but this sounds like a great idea overall. To go through the list above..
  1. Is an excellent idea. We'll need to have a centralized list of all our gene articles (I'm sure there's some stubs in the wrong category/not in any category), but that can probably be arranged.
  2. Sounds like the MCB answer to Rambot's articles :) Andrew, have you given any specific thought to how this might be automated? (Maybe Rambot itself could be modified and repurposed.)
  3. You can reuse, redistribute, and modify Wikipedia content as much as you like as long as you stay within the terms of the GFDL. It sounds like you might want a Wikipedia mirror; note that you probably can't "remote-load" from Wikipedia's servers on-the-fly. There are instructions and information for obtaining regular static database dumps here and here if you want to have a look and see if this might suit your purposes. You can still link to Wikipedia for editing, but your local copy won't update until you load the next data dump containing any edits made by your users.
  4. I think this idea needs some fleshing out. What is the user intending to do with this PMID or URL? It's unlikely that people here will approve of users elsewhere being able to trigger a bot that systematically adds links to a large number of articles - that would be interpreted as (and has the potential for abuse as) spam, even if the links are well-meant. Adding links to appropriately titled external links sections of individual articles, however, is very easy, and possibly someone could write very simplified editing instructions just for this if you anticipate that it will be a common activity. (Bear in mind that, even if not bot-assisted, large numbers of users arriving from the same site to add related external links will set off some people's spam alerts anyway, so this may require some coordination.)
  5. I agree that this would be useful, but it doesn't sound like the sort of content that Wikipedia hosts, and doesn't sound like the sort of thing you'd want on a publicly editable wiki. (Vandalism to a list of genes could make a very big mess.)
  6. Internal data absolutely should be on your own internal wiki, not here. That's definitely not the sort of thing that Wikipedia hosts, and the nonprofit Wikimedia Foundation probably couldn't legally host that kind of information on its servers. Not to mention the fact that, from your perspective, it would be incredibly insecure to host internal data and documentation on external servers, maintained at least partially by volunteers, that also contain data routinely modified by the general public. Opabinia regalis 03:32, 6 January 2007 (UTC)

A few clarifications based on questions/issues above. First, by "internal-only section", I was meaning a place for GNF/Novartis scientists to contribute to a wiki that would not be visible to the public domain. Clearly this would be independent of Wikipedia, maybe even a separate MediaWiki instance that we would host locally (and within our firewall). I'd be interested to hear if anyone has done any work creating a single user interface that combined content from two (or more) wikis.

I hadn't given any specific thought on how to automate the creation of stubs, but I assume we could create some sort of bot to do it. If anyone has any specific suggestions or knowledge (beyond the Rambot links above which we will check out), please point us in the right direction...

Based on the feedback above, the current tentative plan (for item #3 above) would be for SymAtlas to link up three separate wikis. One would be Wikipedia, which hosts well-substantiated findings. The second would be a wiki for more speculative content, and this would be where we may try to put the simple URL / PMID / keyword tagging feature described in #4 and #5 above. We would host this wiki but it would be publicly accessible. The third wiki would be for "internal GNF/Novartis content" (described in #6 above), and we would host this internally within our firewall. SymAtlas would then aggregate all content from these three wiki sources (taking into account the remote-loading policies) and display it integrated within our gene reports. Any comments on this plan are welcome...

Finally, it sounds like there is reasonable consensus and support for using our content to seed Wikipedia stubs. This would contain links to NCBI, Ensembl, Interpro, etc. and also a chart showing the gene expression pattern across anatomic regions (#1 and #2 above). I'll propose that we start with creating just 5-10 gene stubs to get feedback. Anyone have any comment on plan? Also, I don't want to commit to any timeline yet until our SymAtlas development plan is clearer. AndrewGNF 02:30, 11 January 2007 (UTC)

This sounds like a good plan. Thanks Andrew. TimVickers 03:35, 11 January 2007 (UTC)
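
The three-wiki plan outlined above (Wikipedia for well-substantiated findings, a public wiki for speculative content, and a firewalled internal wiki) might be sketched along these lines. Every name below is an assumption made for illustration: `WikiSource`, `gene_report_sections`, and the in-memory `pages` dictionary stand in for whatever fetching and storage SymAtlas would really use, and the Wikipedia content is assumed to come from a local copy rather than being loaded on the fly.

```python
from dataclasses import dataclass


@dataclass
class WikiSource:
    name: str            # e.g. "wikipedia", "speculative", "internal"
    internal_only: bool  # True for content that must stay inside the firewall
    pages: dict          # gene symbol -> section text (stand-in for a real fetch)


def gene_report_sections(gene, sources, viewer_is_internal):
    """Collect the wiki sections to display for one gene, in source order."""
    sections = []
    for source in sources:
        if source.internal_only and not viewer_is_internal:
            continue  # never expose firewall-only content to public viewers
        text = source.pages.get(gene)
        if text:
            sections.append((source.name, text))
    return sections
```

Keeping the visibility check in the aggregator, rather than in each wiki, is one possible design; it only works if the internal wiki itself is also unreachable from outside, as described above.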

I suggest consultation, because such a link could be misinterpreted. I think a good way to proceed if you want to use an internal wiki is to go boldly ahead and do so--MediaWiki is available for anyone, and many organizations have done so. One way to link WP to it would be a federated search engine, run by you, that merely searches WP (as mentioned). Alternatively, you could maintain a WP mirror, many organizations do, and use it however you please. You could certainly make it available to the public, and it would be very good to do so, but I suspect using the WP name for it would not be liked. You could also use it in any combination you like privately--there is no restriction on commercial use of WP. You could use any template or infobox in article space. WP will gladly accept PD content from anywhere, especially from such a reliable source as you, and I think would make arrangements to load the material as it has many other PD sources. We can use outside software that is explicitly GFDL or PD.

Material that has been published in a peer-reviewed article is always acceptable. Information that is not, but has been taken from an authoritative web source that is known to screen material and maintains integrity can also be used--we use PubChem without concern for where exactly they obtained the data.

OK, you've got three similar views. I agree with the suggestion that the best step would be for your people to contribute to WP according to the usual WP standards. Perhaps this is already being done? DGG 04:25, 11 January 2007 (UTC)

PD = Public Domain? When you say "explicitly GFDL or PD", is there a similarly specific definition of PD? AndrewGNF 21:37, 11 January 2007 (UTC)

Migration To White Background Images

When I encounter an image on a black background, I find I have to turn up the brightness/contrast of my monitor, which is unnecessary when the background is white. Do you find it acceptable that the MCB project should support the swapping of protein representations with black backgrounds to ones with white backgrounds? I'm not saying that we should take on the task of changing all pictures with black backgrounds, but I think it is acceptable to do so (cf. with it being unacceptable in most cases, to swap a white-background image for a black-background version).

I've been told that most visualization programs default to black backgrounds because it's easier to make colors appear to blend with black than white when not using anti-aliasing. Observation bears this out, though I'm not sure why. Raytraced shadowing and depth cuing also look more realistic to me on black, although I agree that white looks nicer in articles (though I've never had any contrast issues) and that white should be preferred in most cases. I just don't always remember to switch ;)
More generally: the recommendations section of the pymol tutorial is currently way down the bottom; should it be further up and/or on a separate page? Opabinia regalis 01:08, 31 January 2007 (UTC)
I think it's best at the bottom, since people aren't ready for recommendations when they're still learning to use the program. I have my monitor brightness turned all the way down and contrast pretty low most of the time. I find it more comfortable (I think the monitor more closely resembles paper this way). --Seans Potato Business 21:43, 31 January 2007 (UTC)

Advice and guideline subpages

Why do we have both [the help page] and Wikipedia:WikiProject Molecular and Cellular Biology/Advice? The title sounds redundant and the current content doesn't really read like a place to get advice. The contents of the advice and external links subpages seem like they ought to be merged to something called "resources"; am I missing the point of these? Opabinia regalis 03:17, 1 February 2007 (UTC)

I think you're right. We need to consider the best way to deal with our advice pages:

Wikipedia:WikiProject Molecular and Cellular Biology/Advice
Wikipedia:WikiProject Molecular and Cellular Biology/Style guidelines
Wikipedia:WikiProject Molecular and Cellular Biology/External links
Wikipedia:WikiProject Molecular and Cellular Biology/References
Wikipedia:WikiProject Molecular and Cellular Biology/Pymol tutorial
Wikipedia:WikiProject Molecular and Cellular Biology/Diagram guide
They should be integrated in whatever manner is deemed suitable rather than allowed to develop independently of each other. --Seans Potato Business 17:18, 1 February 2007 (UTC)

Proposal: Delete Page: Articles Needing Attention

I propose that the needs of the articles needing attention page are met by the article worklist and that the page should be removed. Unless I'm missing something of course... --Seans Potato Business 01:41, 14 February 2007 (UTC)

I think the idea there was to trigger immediate work on particularly abominable articles, though plainly that hasn't panned out yet. Possibly because it's hard to keep track of category contents? Opabinia regalis 02:04, 14 February 2007 (UTC)
But since the work list combines a rating of how important an article is with how complete it is, do we really need to keep this page? If we do, shouldn't we have it update automatically using the info from the worklist (i.e. high-importance yet low-state-of-completeness articles)? --Seans Potato Business 06:07, 14 February 2007 (UTC)
I would suggest to keep the page. There may be articles that need attention for other reasons, e.g. NPOV on controversial topics, incorrect statements, articles tagged by other editors/users as too technical or needing expert attention etc. - tameeria 16:00, 19 February 2007 (UTC)
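
The automatic update suggested above (listing high-importance yet low-completeness articles from the worklist) could look something like this minimal sketch. It assumes the worklist can be read as a mapping from article title to an (importance, quality) pair and that it uses the rating labels listed in the code; the function name, the default thresholds, and that data shape are assumptions for illustration only.

```python
# Rating scales, ordered from lowest to highest (assumed labels).
IMPORTANCE = ["Low", "Mid", "High", "Top"]
QUALITY = ["Stub", "Start", "B", "GA", "A", "FA"]


def needs_attention(worklist, min_importance="High", max_quality="Start"):
    """Return titles rated at least min_importance and at most max_quality."""
    importance_floor = IMPORTANCE.index(min_importance)
    quality_ceiling = QUALITY.index(max_quality)
    return sorted(
        title
        for title, (importance, quality) in worklist.items()
        if IMPORTANCE.index(importance) >= importance_floor
        and QUALITY.index(quality) <= quality_ceiling
    )
```

A filter like this would not capture the other reasons mentioned above (NPOV disputes, incorrect statements, articles tagged as too technical), so it could supplement a manually maintained page but probably not replace it.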

MCB Template Text - Comments Too Small

When someone makes comments that are presented on an MCB-supported article talkpage template, they are far too small. I have to struggle to read it. I'm talking about, for example, where I say: Needs more yeast-based coverage. Some sections over-represent bacterial-based methods, relative to yeast. on [Talk:Two-hybrid_screening] - could someone increase the size to normal? I had a go but couldn't produce an effect. Thanks. --Seans Potato Business 05:49, 14 February 2007 (UTC)

Does it look better now? I increased the font size to 90%. If it doesn't look any different, try refreshing your cache?
On a related note, do we want to keep the recent change that adds the collaboration of the month to the template? Announcing the collaboration article on 4000 pages that may have only a tangential relation to the collaboration topic seems a little spammy to me, but maybe it'd draw in more contributors. Opabinia regalis 17:20, 19 February 2007 (UTC)


Userboxes

I propose that a userbox for WikiProject Molecular and Cellular Biology should be created. This may spread publicity about this project, bringing more people to work on related articles. It may also look nice on a userpage. Sodaplayer talk contributions 01:05, 23 February 2007 (UTC)

{{user Mol Cell Bio}} is probably what you're looking for? Opabinia regalis 01:10, 23 February 2007 (UTC)

new orthologs template

It's taken a while to get around to it, but I've put together a first-draft proposal of a protein info box that we (GNF) could populate in an automated fashion for ~10K genes. (Recall the previous discussion here.) I put this draft on my user page out of ignorance of a better place to put it (moved a working example to ITK (gene) AndrewGNF 19:27, 13 March 2007 (UTC)). Also, I created/modified two templates (Template:GNF_Ortholog_box and Template:GNF_Protein_box) to create this example; the "GNF" prefix was to make sure I wouldn't muck up anything existing.

In the example, I tried to integrate the ortholog box into the main protein box here, but it obviously didn't work. Anyone have thoughts on how to accomplish this? Possibly a related question, how do I find out how the "drugInfoBox" works?

Finally, any other comments/questions/suggestions would be welcome. If people generally like this, then the next step on our end would be to write a bot to create 5-10 of these stubs for further comment. (And in case it's not clear, we're definitely Wikipedia newbies, so any and all suggestions are welcome...)

AndrewGNF 18:06, 7 March 2007 (UTC)

Excellent. I'm afraid I'm not a technical person, but I will try to help in any administrator or proofreading way I can. TimVickers 19:25, 7 March 2007 (UTC)

Having a look at Wikipedia:Bot policy might help, we could also request for somebody to write it at Wikipedia:Bot requests. TimVickers 01:42, 9 March 2007 (UTC)

Thanks, we've definitely had a look at the Bot policy. Looks like we have the expertise here to handle it. Bandwidth is a little more uncertain, but I'm trying to get a couple interns to consider this project. If not, then perhaps I will post it over there as a bot request. (Or, if there is anyone here who's interested in collaborating on writing the bot, let me know...)
BTW, I moved the example template to ITK (gene)... AndrewGNF 22:29, 9 March 2007 (UTC)
Andrew, did you get the templates to tile as you intended? Now that she's back, you might want to ask Willow if it's still not working; she's done some nice and fancy template work. Opabinia regalis 00:04, 14 March 2007 (UTC)
Nope, still haven't gotten it resolved. Thanks for the tip -- I will see if Willow can work her magic... AndrewGNF 00:53, 15 March 2007 (UTC)

FYI, I figured out the whole nested table issue and updated my example gene (ITK). Since my last post here, I've also added a bunch of other information that we have in our database. It's not pretty, but I think it has a lot of useful information. I think we're close to finding a student to take on the project of writing the bot (in collaboration with the bioinformatics program at SDSU). As always, I'd love to get any feedback... AndrewGNF 01:12, 30 March 2007 (UTC)

Oh, and in full disclosure, I also just approached the CZ folks with the same idea. Personally I'm pretty agnostic with respect to where we do this, and it's certainly not mutually exclusive either. Anyway... the CZ forum post AndrewGNF 01:17, 30 March 2007 (UTC)

Matching page titles with HUGO names

A colleague and I were looking at the entry for his favorite protein, initially called Zif268 and later renamed as Egr1. Egr1 is now the official name at HUGO (HGNC:3238), and HUGO is the "official" source for gene names. Right now, Egr1 redirects to Zif268, but I wonder if we should reverse this so that Zif268 redirects to Egr1. We'd also need to update ZENK and Early Growth Response Protein 1, other alternate symbols/names which redirect to Zif268, to avoid double redirects. Two questions -- is this the correct thing to do, and is it an important thing to do? Presumably there are many cases where the main page is not found under the HUGO title, and perhaps this is another candidate for a bot to fix (and certainly our stub creation bot should be aware of the best practice in this regard). AndrewGNF 02:26, 10 March 2007 (UTC)

I think the rule is that the gene/protein should be found under its most common name. Obviously, for most genes this is a moot point, as they have no real common name. Otherwise, if a gene has a protein product that has a famous name (such as trypsin or PRSS1) then the info should really be integrated into any existing content at the trypsin page. In practice, 99% of human genes will have no Wikipedia entry, so I think adding the genes under the HUGO names and not trying to fix redirects automatically would be safest. As you say, changing a page to a redirect is best done manually, as there are dependencies in other pages that may need to be altered. TimVickers 05:05, 10 March 2007 (UTC)

ProteinBoxBot specs

Now that our test gene, ITK, is nearing completion (thanks in no small part to recent efforts by David D.), I'm preparing to request approval for development and a trial run of the ProteinBoxBot. I've moved the proposed specs over to the ProteinBoxBot's user page. Comments and feedback are welcome. I'm tentatively going to put in the request for approval at the end of the week... Cheers, AndrewGNF 22:10, 3 April 2007 (UTC)

Hi Andrew,
ITK looks beautiful! Would you be so kind as to post the source code that you're using for the ProteinBoxBot? It might be helpful to me and probably others.
I'd also recommend making a special category for the ProteinBoxBot stubs, like Category:ProteinBoxBot stubs as a sub-category of Category:Protein stubs. There are stubs and there are stubs, and it'd be helpful to have a separate list of the stubs produced by the ProteinBoxBot. I'll do the same for Daisy's taxonomic stubs, once I finish proofreading the taxonomic files.
Well done and good luck; the fun is about to start! :) Willow 22:24, 3 April 2007 (UTC)
Hi Willow... I'd be happy to share the source code with whoever is interested. (Although Rambot posts in his FAQ that he won't give out source code, to discourage "script kiddies", so perhaps our code shouldn't be posted freely.) Also, to be clear, we've yet to write a single line of code, and we'll probably extensively use one of the available libraries. A ProteinBoxBot stub category should be simple enough to add. Thanks for the encouragement! AndrewGNF 01:21, 4 April 2007 (UTC)

Proposal from Rfam database

Hi everybody, I got an interesting e-mail from the Rfam database curator today.

Dear Tim

I work on the Rfam database run by Alex Bateman at the Sanger Institute in the UK. You contacted Alex recently regarding a proposal for Pfam annotations. One of our goals for Rfam this year is to make our family annotations a community resource. We would prefer if this annotation was implemented using Wikipedia and after some browsing we think the Molecular and Cellular Biology Project format would really suit our requirements. In addition we think MCB project would benefit from a daughter ncRNA project.

Would the MCB be interested in this contribution from Rfam?

I am not sure what is required for us to implement this within Wikipedia but our intention for Rfam is to download the relevant Wikipedia entry for each family and display the Wikipedia information (clearly identified as Wikipedia) and to provide links back to Wikipedia to encourage the experts in our user community to contribute to these annotations. Currently there are only really generic Wikipedia entries for most of our families such as 'ribozyme' but we would like to extend this to create entries/stubs specifically for all our families.

We could see the role of Rfam as coordinating the effort and championing the use of Wikipedia. We are planning to attend the annual RNA meeting at the end of May to drum up support for this effort.

Could you let us know if MCB would be interested in this contribution from us? And if so could we discuss how we need to go about implementing this in Wikipedia?

I hope to hear from you soon

Regards

Jennifer Daub

I responded by saying that we were interested and putting her in touch with Andrew Su from GNF, who has already done some of the groundwork for this. TimVickers 16:31, 4 May 2007 (UTC)
It's so flattering that they want to collaborate with us! :)
I tried my hand at crafting an initial draft of Bicoid 3'-UTR regulatory element. Is this what Jennifer had in mind? Unfortunately, I have to travel soon to see my sister graduate from college, so I won't be able to reply for a while! Willow 17:26, 4 May 2007 (UTC)
PS. Everyone should know that X-ray crystallography is the Science Collaboration of the Month for May. I added a bit yesterday, but it could still use lots of work! Thanks, everyone! :)
From the perspective of trying to get a similar daughter "gene wiki" project off the ground, we'd be happy to provide feedback or collaborate on a sibling ncRNA project.  ;) Jennifer, this talk page is probably the best forum to talk over the ideas -- great way to get feedback from the MCB folks, and if you haven't edited in wikitext before, it's a gentle introduction (that's how I got started).
From the gene wiki side, we're pretty close to getting started on the 10-gene test (to expand on the initial ITK (gene) example). The bot is approved for trial, we're almost done extracting the data from our database, and we have a student who is going to take on the coding of the bot. More to come soon! Jennifer, if you have your data in a well-structured format, should be no problem to adjust our bot to populate rfam stubs too. Or, if you want to have a go at it yourself, the Wikipedia:Bot_policy is a good place to start. Cheers, AndrewGNF 20:51, 4 May 2007 (UTC)
Hi Andrew, your bot got approved, that is good news. Look forward to seeing what your intern can do. Willow, did daisy get approved? David D. (Talk) 21:13, 4 May 2007 (UTC)
I'm still proofreading the reference files (I'm on 4/19), so it may be a while yet before Daisy would be ready to apply. She should make the files only once, so I'd like to confirm that the PMID links are all correct, and that every comma, etc. is in the right place.
Could the Right People please sign off on Bicoid 3'-UTR regulatory element, or make recommendations for the Rfam pages? If I don't hear anything soon, I'll just make the full set of 574 families; the code to do so will likely be ready tomorrow sometime. Thanks, Willow 18:17, 8 May 2007 (UTC)
Superb work, thank you. On the bicoid page, that RF00551.jpg image doesn't appear to be uploaded, is that correct? TimVickers 19:12, 8 May 2007 (UTC)
Thanks, Tim! :) I just discovered that the images are themselves already in the public domain; I had been worried that the Rfam people might be mad if we uploaded their images without their permissions. Give me an hour or two and I'll upload them all to the Commons.
Consensus secondary structure for the U12 RNA
P.S. (to everyone here) It's awfully lonely over at X-ray crystallography. There's tons left to be done, and much that doesn't require any expertise, such as wiki-linking, making images, correcting and clarifying wording, etc. Serious bonus points for adding references! :D I'm just dashing stuff off, but it's embarrassingly colloquial and hardly encyclopedic in quality and would really benefit from all your inputs. Thank you, thank you, thank you! :) Willow 19:56, 8 May 2007 (UTC)
You're doing wonders, Tim! :) I've started uploading the Rfam images, which should be done in maybe half an hour or so. The problem is that I don't know how to specify that an image is in the public domain once the image has been uploaded. Does anyone know how to do that? We'll want to add categories, etc. for each page as well, but I think I can handle that tomorrow. Any other thoughts or ideas would be most welcome! :) See you around, I have to dash off to work soon! :) Willow 22:27, 8 May 2007 (UTC)
I figured out how to add the license, summary and category information to the Rfam images, which I did for the first 10 images. Please check out the image at the right; does its page seem OK? Is there any other information that we should add? Thanks for your time! :) Willow 05:26, 9 May 2007 (UTC)

Hi All at MCB. You have been busy! The images look great. Sorry for my slow reply; I had not anticipated things would kick off quite so quickly.

We are all really pleased to hear your positive response to receiving contributions from Rfam. We have been discussing how we want to implement this given your previous comments about using the ProteinBoxBot and the example page made for Bicoid 3'-UTR by Willow. I think I am slightly unclear exactly how some of this is implemented so please correct me as I go. Some of our comments:

(1) The amount and type of data we have for each of our families will vary greatly. For some of them there is very little literature, while for others there is a large body of published structural, phylogenetic or expression data. As a result, we do not currently envisage using a template such as the ITK-gene example template. Initially we thought we would create stubs for a small sample set (10-20) of well known/loved ncRNA families and encourage some of our research community to provide annotation. Depending on the response and the type of data that was provided, we thought we would then reconsider a more structured template for our stubs. The use of the ProteinBoxBot would be hugely appreciated when it comes to this.

(2) The Bicoid 3'-UTR example created by Willow is exactly the type of entry we initially imagined we would create. There is other data in our database (GO annotations, database cross references) that can be slurped in later but initially I think this is how we wanted to go. Alex has already added some more annotation for hammerhead (http://en.wikipedia.org/wiki/Hammerhead_ribozyme) and begun contacting researchers for contributions. We are excited to see how the community will respond.

(3) You have uploaded images for all our families already, yes? Was this in preparation for generating stubs for all our families? You seem to have this in hand already? Is the plan to create stubs from our existing annotations?

Again we wanted to say we think this is a really exciting project and we are pleased you are keen for Rfam contributions. Please bear in mind I am new to working in the Wikipedia community so I may need to be pointed in the right direction on how this gets co-ordinated. I have further questions about adding ncRNA categories to the MCB pages but I will wait to hear back from you first. Thanks Jennifer Rfm 09:50, 9 May 2007 (UTC)

Hi Jennifer,
I did upload the Rfam images to the Wikipedia Commons, albeit without their "public domain" license, which was a Wiki-peccadillo that I'm trying to fix. I'm also writing a computer program that translates the Rfam flatfile from its Stockholm format into Wikipedia pages. Once generated, I'll upload the pages and perhaps tweak a few before saving them. The bicoid example was one such page, but now I'm slightly more ambitious. ;) Unfortunately, I had a lot of errands to run today, so the program isn't finished; please be patient! :) Willow 21:50, 9 May 2007 (UTC)
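For anyone curious what such a translation might look like, here is a minimal sketch in Python; it is not Willow's actual program, which has not been posted. It assumes only that each family record carries per-family annotation lines of the form `#=GF <tag> <value>` (for example AC, ID, DE and CC), as Stockholm files do. The function names, the sample record and the stub wording are all made up for illustration.

```python
from collections import defaultdict


def parse_stockholm_gf(text):
    """Collect the '#=GF <tag> <value>' annotation lines of one Stockholm record."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("#=GF "):
            parts = line.split(None, 2)  # ['#=GF', tag, value]
            if len(parts) == 3:
                fields[parts[1]].append(parts[2].strip())
    # Repeated tags (e.g. several CC lines) are joined into one string.
    return {tag: " ".join(values) for tag, values in fields.items()}


def make_stub(fields):
    """Render a very small wikitext stub; the layout is illustrative only."""
    title = fields.get("DE", fields.get("ID", "Unknown family"))
    accession = fields.get("AC", "")
    comment = fields.get("CC", "")
    lines = [f"'''{title}''' is an RNA family listed in [[Rfam]] under accession {accession}."]
    if comment:
        lines += ["", comment]
    lines += ["", "{{molecular-cell-biology-stub}}"]
    return "\n".join(lines)


# Hypothetical record, not a real Rfam family.
SAMPLE = """# STOCKHOLM 1.0
#=GF AC   RF99999
#=GF ID   Example_RNA
#=GF DE   Example RNA element
#=GF CC   First line of a made-up comment
#=GF CC   continued on a second line.
//
"""

if __name__ == "__main__":
    print(make_stub(parse_stockholm_gf(SAMPLE)))
```

A real converter would also need to handle references, database cross-links and the alignment itself, none of which this sketch attempts.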
Hi Willow, please don't think my last comments were meant impatiently in any way. We are surprised and pleased that you are helping us set this in motion so quickly. I had fully expected I would have to deal with getting the images and pages uploaded myself. Given I am only just learning how this works it would take me MUCH longer. Your help is really appreciated. Could you let me know what it is you are planning to do so we can contribute and not duplicate your efforts? Also, should I move this conversation to my/your talk page and off the proposals? Jennifer 10:41, 10 May 2007 (UTC)
The figure looks good. I have added a bit more text in the figure caption a la the Rfam website, rather than just the undescriptive RF00007. Alexbateman 12:39, 10 May 2007 (UTC)
Dear Jennifer,
Please don't worry at all; I never thought that you were being impatient. I was just being a little impatient with myself, that's all. :( I'm trying to get a lot done before tomorrow (when I leave for a friend's graduation) and my energy is at low ebb, so I'm not as cheery as usual.
I'll be happy to do (or not do) whatever you all at Rfam would like. Any advice or directions would be most welcome, especially since I really am a clueless Chloe about RNA — although I hope to learn more through our collaboration! :) My initial thought was to generate the 574 stubs semi-automatically, and then we all — including your larger RNA community — would gradually refine them by hand. Is that still an OK plan?
Why don't we move to your talk page, where anyone interested can join in the conversation? Talk to you soon, Willow 13:11, 10 May 2007 (UTC)

Recruitment drive and expert review

We need more editors. Could everybody please invite at least one person who they think would be a good contributor to come and join the project? You can also approach this by asking experts to look over a page in their field of interest. So far I've invited a crystallographer from Australia to look over the X-ray crystallography page and several enzymologists to review the enzymes pages. Let's use our contacts! TimVickers 02:38, 9 May 2007 (UTC)

I've always thought that the emeritus faculty might be interested. A good example is Kimball, who has been moving his book into an online format. It will be interesting to see how Fersht responds to the enzyme article. He must be close to retirement or even retired? David D. (Talk) 16:33, 9 May 2007 (UTC)

He's still listed as current staff on the Departmental webpage, maybe they are refusing to let him go! TimVickers 17:37, 9 May 2007 (UTC)

I just looked at his Wikipedia article and he is younger than I expected, although nearing retirement age. But, as we know, many scientists carry on like the energizer bunnies if health permits. David D. (Talk) 17:52, 9 May 2007 (UTC)
Does anyone have a good template letter to invite external experts to contribute to an article? I guess it needs to set the right tone, but also possibly introduce Wikipedia in a nutshell. I've sent invites to a couple of people so far; through the ncRNA project we might well ask a lot more. Alexbateman 15:38, 18 May 2007 (UTC)

Daisy's Decasmon ;)

Our musical friend Daisy

Hi all,

The beautiful spring weather inspired our musical friend Daisy to create Wikipedia stubs for the 574 families of Rfam today, ten of which I uploaded for her:

Would you be so kind as to look them over and make suggestions? No detail or suggestion is too small or too great, and all will be received gratefully. For example, should we use {{molecular-cell-biology-stub}} or {{molecular-biology-stub}}? A more important point is to come up with a set of keywords to be wiki-linked in the articles, e.g., RNA, gene regulation, intron, etc. Jennifer and the others at Rfam will surely have many ideas, but we should help them by being proactive collaborators, don't you agree? Thank you one and all for your ideas, Willow 02:29, 16 May 2007 (UTC)

Speaking as a quasi-lay person (it's been a while since I did any biochemistry), these initial articles seem remarkably complicated, with little to no context to the subject at large. Will these pages cover only non-coding RNA families or will they also cover other structured RNA elements listed in the Rfam database? A quick glance reveals terms such as enzymatically active ribonucleoprotein, microRNA, small nucleolar RNA, Y RNA and amino acid operons - how do these differ from each other, can they be grouped? An introductory sentence along the lines of "Page Name is one of the 574 known families of non-coding RNA..." would help, where "574 known families" is wikilinked to a complete list of known families and "non-coding RNA" is also wikilinked. The next sentence should explain what type of family it belongs to (if relevant), with the remaining article going into more detail, hopefully in a similar manner to the other Rfam pages.
I think also templates carefully designed will help link the Rfam articles together and put them in context with one another. Much like the "Nucleic acids" template does, rather than just having one giant list. CheekyMonkey 12:35, 16 May 2007 (UTC)

Those are excellent ideas, CheekyMonkey! :) If we could come up with "boilerplate" prose for various types of RNA families, that might help us explain them better to our readers, setting them in context and significance. Unfortunately, I'm not really the one to come up with those SNPets of prose, although I'll be happy to teach them to Daisy once we've agreed on them.

I will try to give some thought to producing navigational templates, although we might need more than one level (or maybe a show/hide thing) to cover all the relevant families. Willow 18:29, 16 May 2007 (UTC)

Hi Willow. This is really great and the general response from our community about improving the ncRNA annotations in Wikipedia has so far been resoundingly positive. Thank you again for your efforts and here's hoping we will get lots of keen annotators.
RE: comments from CheekyMonkey and providing context and organisation. Yes, this is still something we need to address and comments are really appreciated. To explain more: these 10 test stubs are purely meant as starter pages (for families for which there is nothing else relevant in Wikipedia) from which we really hope to direct users of our database to edit and provide wider context. For other families, e.g. hammerhead_ribozyme, where a relevant entry already exists, we have put some effort into improving this page and then linking this to our families.
We do hope to have a comprehensive representation for all of our families in Wikipedia. One of our aims is to help co-ordinate some sort of organisation and categorisation for existing RNA entries and the new pages we want to introduce. There have been some new entries to Category:RNA but we feel this needs work. This brings me to Willow's comments on Stubs and Categories. Currently the MCB project pages are solely focused on proteins and we would really like to increase the profile of ncRNA (a growing research field) on these pages. How does the rest of the MCB feel about this? We would at the very least like to introduce the Cat:RNA onto the MCB home page, but also more text or a section relating to RNA in order to help direct users to it.
As for a list of keywords to mark up our stubs, we can definitely put some effort into generating this. Jennifer 13:44, 16 May 2007 (UTC)
I realise it's early days and I'll be watching this project develop with interest. Good luck :o) CheekyMonkey 14:00, 16 May 2007 (UTC)
btw your comments are really appreciated Jennifer 16:18, 16 May 2007 (UTC)
Great job Willow, the pages look fantastic! in response to your queries: either stub template seems fine, I see that most are tagged with the molecular-cell-biology-stub template, I would just keep them consistent. The wikilinks will prove to be very important here: 2° structure, Seed alignment, and Avg identity in the boxes should probably be linked, eukaryotes and Stem-loop (i.e. hairpin), were a few others, but obviously some terms will need to be wikilinked on a case by case basis.
A few other suggestions: I agree that CheekyMonkey's context and organizational suggestions need to be addressed. Also, is there a way to have the bot italicize and link the scientific names of creatures (Bacillus subtilis)? I would guess the bot probably can not do this, but it will need to be done. The references in their current format might be difficult to work with should the articles be expanded, is there any way to make them in the WP:FOOT style? All in all great job, and I look forward to seeing the project bloom!
--DO11.10 16:28, 16 May 2007 (UTC)

Daisy is clever enough to recognize taxonomic names, so it might be feasible for her to italicize and wiki-link them automatically. Daisy was lazy about the references, but she could probably do that as well; would you mind having both a "References" section for the inline citations and a "Bibliography" section where all the pertinent references are listed? Willow 18:29, 16 May 2007 (UTC)
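As a rough illustration of the italicize-and-link step discussed above (not a description of how Daisy actually works), one could match "Genus species" pairs against a list of known genus names. The function name and the genus list are assumptions made for this sketch.

```python
import re


def link_binomials(text, known_genera):
    """Italicize and wiki-link 'Genus species' pairs whose genus is in known_genera."""
    if not known_genera:
        return text
    genus_alt = "|".join(re.escape(g) for g in sorted(known_genera))
    # Skip names that already sit at the start of a [[...]] link.
    pattern = re.compile(r"(?<!\[)\b(" + genus_alt + r") ([a-z]{2,})\b(?!\]\])")
    return pattern.sub(lambda m: "''[[" + m.group(0) + "]]''", text)


if __name__ == "__main__":
    print(link_binomials("It was first found in Bacillus subtilis.", {"Bacillus"}))
```

One known weakness of this approach: any lowercase word following a genus name is treated as a species epithet, so a phrase like "Bacillus and" would be linked by mistake. A real implementation would probably check the full binomial against a taxonomy list.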

That's okay, but if they are "references" shouldn't they ultimately be used in the article? Also where would you draw the line at "pertinent"? It just seems that a bibliography section might become just a long list of marginally applicable/useful references, how do you sort out the really good ones?--DO11.10 19:07, 16 May 2007 (UTC)
Bot-driven reference integration could possibly be done post-insertion as long as the refs are consistent - mixed/inconsistent formatting has been the roadblock with converting most existing articles automagically. As for keyword-based linking, the bulk of it seems like it could be possible by linking the first occurrence in the article, though the usual English language variations issue will cause some problems. I've made a pass through the first two articles and come up with the following keyword-link list:
  • [[Ribonuclease|RNase]]
  • [[Enzyme|enzymatically]]
  • [[ribonucleoprotein]]
  • [[Eukaryote|eukaryotes]]
  • [[mitochondria]]
  • [[DNA replication]]
  • [[nucleus]]
  • [[Ribosomal RNA|rRNA]]
  • [[Transcription (genetics)|transcribed]]
  • [[Evolution|evolutionarily]]
  • [[RNase P]]
  • [[Escherichia coli|E. coli]]
  • [[non-coding RNA]]
  • [[Sequencing|sequenced]]
  • [[Nucleotide|nucleotides]]
  • [[Stem-loop|hairpin]]
  • [[RNA polymerase]]
  • [[Enzyme#Cofactors|holoenzyme]]
  • [[Sigma factor|sigma70]]
  • [[specificity factor]]
  • [[Bacterial growth|stationary phase]]
  • [[genome]]
  • [[Gram-positive]]
  • [[Bacillus subtilis]]
  • [[proteobacteria]]
I guess the question is what's worth having on a "master keyword list" for a bot versus hand-editing. -- MarcoTolo 17:29, 16 May 2007 (UTC)
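To make the "link the first occurrence" idea concrete, here is a small Python sketch. It assumes the master keyword list is held as a mapping from the word as it appears in the text to the target article title, mirroring the piped links in the list above; the function name is hypothetical and this is not the code of any existing bot.

```python
import re


def link_first_occurrences(text, keyword_links):
    """Wiki-link the first unlinked occurrence of each keyword.

    keyword_links maps the displayed word to the target article title.
    """
    # Longest keywords first, so a phrase such as 'RNA polymerase' is linked
    # before any shorter keyword it contains.
    for word in sorted(keyword_links, key=len, reverse=True):
        target = keyword_links[word]
        link = f"[[{word}]]" if target == word else f"[[{target}|{word}]]"
        # Split on existing links so that nothing inside one is touched.
        parts = re.split(r"(\[\[.*?\]\])", text)
        pattern = re.compile(r"\b" + re.escape(word) + r"\b")
        for i, part in enumerate(parts):
            if part.startswith("[["):
                continue
            new_part, count = pattern.subn(lambda m: link, part, count=1)
            if count:
                parts[i] = new_part
                break
        text = "".join(parts)
    return text


if __name__ == "__main__":
    sample = "The hairpin binds the genome; the hairpin is conserved in the genome."
    print(link_first_occurrences(sample, {"hairpin": "Stem-loop", "genome": "genome"}))
```

Matching here is exact and case-sensitive, so the English-language variations mentioned above (plurals, capitalisation at the start of a sentence) would each need their own entry or some hand-editing.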

Thank you, MarcoTolo! That's exactly the kind of list I was hoping for. I'll try to produce a few myself, following your example. Willow 18:29, 16 May 2007 (UTC)

Would it be useful to create a working page of these, i.e. a master keyword list for Daisy to work with? -- MarcoTolo 18:38, 16 May 2007 (UTC)

That's a great idea for coordinating all our efforts. Here's a start. :) Willow 18:45, 16 May 2007 (UTC)

Okay, I've added more keyword links to the list - that should be a rough cut for all ten of the samples you posted. -- MarcoTolo 19:32, 16 May 2007 (UTC)

Thanks, MarcoTolo, they look great! If everyone agrees, I'll make those terms automatic wikilinks in Daisy's files. Daisy will try hard not to overlink, despite her enthusiasm. ;) Willow 20:09, 16 May 2007 (UTC)

PS. I uploaded a not-too-redundant set of Rfam lines that we can all mine for good wiki-links. There seem to be too many for a master list; perhaps we should just identify the most common ones and fix the others by hand? But the more terms that people define now, the less hand-editing that we'll have later! Willow 20:09, 16 May 2007 (UTC)

A quick look at the types of RNA (I only looked at the place-your-letter-hereRNAs) at the not-too-redundant set of Rfam lines shows that Wikipedia is currently even lacking in this relatively simple department. I've added an alphabetical list at User:WillowW/Daisy Rfam wikilinks as a start but even this is lacking - does "anti sense RNA" have an acronym for instance? What is tmRNA? Hope this helps. CheekyMonkey 12:10, 17 May 2007 (UTC)

MCB daughter Project for ncRNA

Hi All, given our recent discussions about trying to provide some organisation and context for the ncRNA pages we want to introduce, we wondered if the MCB project would be open to having a daughter project for ncRNA (as there is for cell signalling etc). We think it might be a good structure for us to follow and perhaps a good way to ensure this effort doesn't come across as quite so 'Rfam' specific, as that really isn't our aim. For us the whole point of these recent efforts is to get the range/depth of RNA annotations expanded and encourage our RNA community. A daughter project front end would be a really useful entry site for other users and perhaps more accessible? Jennifer 14:01, 18 May 2007 (UTC)

Either way will probably work. There is not that much traffic here and you may catch more eyes. On the other hand I can see the attraction of having one page to focus the discussion. No one will stop you starting a new wikiproject, I would say go ahead. I'm not sure we could stop you even if we hated the idea ;) Be bold, as they say here, and we'll follow along. David D. (Talk)
OK we've moved in this direction and started a daughter project page. We started using the metabolic pathway sister project page as a template. Anyway for those that want to have a look at our humble beginnings please see Wikipedia:WikiProject_RNA. Thanks to everyone who has already made edits to our pages. So far we have 57 pages touched (about 10% of the total) by 23 different users. There is still a lot to be done!!! Alexbateman 16:24, 14 June 2007 (UTC)

RNA family stubs

Hey all,

I uploaded almost all of the RNA family stubs; I ran out of time to do the last 12 out of the 574. Unfortunately, and I'm really sorry about this, I had trouble doing the automatic wiki-linking and taxonomy recognition; it was more complicated than I expected and I was running out of time since I'm leaving soon for yet another graduation, and I have lots to get ready for. So we'll have to do it by hand. :( I'll add the remaining 12 families and also more re-direct pages once I get a chance to catch my breath. Thanks, all! :) Willow 10:46, 26 May 2007 (UTC)

PS. In addition to wiki-linking, you might want to replace the references with proper inline ones, and add any additional information you have about each family. :) Good luck!

Wow, thanks for all your hard work on this. Now that it's largely in Wikipedia it starts to make a lot of sense. We now have a pretty comprehensive list of RNA control elements and genes, and having the categories attached is really nice. Now all we have to do is persuade the RNA community out there to help us bring these into a better shape. We already have a few edits made :) Alexbateman 13:21, 29 May 2007 (UTC)

Thank you as well, Alex, for your faith in us and for the improvements to Wikipedia's coverage of RNA, both past and future. It's a fascinating area, as I'm sure you know, and many of us are eager to learn more. :) Please encourage your colleagues to contribute and assure them that they'll receive a warm welcome among us. :)

I've uploaded the last few families, as well as a few images that I had missed somehow. Right now, I'm in the middle of uploading the licensing information for the Rfam images, so they should be safe from the copyright policers on the Commons. Just in time, too, since I'm leaving early tomorrow morning for a few weeks to see another sister graduate. To my MCB friends, could you all please take on, say, 20 Rfam families apiece and spruce them up? We want to be looking our best when the new contributors arrive. ;) I'll tackle a bunch of families when I return as well. Willow 16:21, 29 May 2007 (UTC)

Trial run for the ProteinBoxBot complete

FYI, the previously discussed ProteinBoxBot completed its first trial run last night. Eight pages were created, logged here: User:ProteinBoxBot/PBB_Log_Wiki_Live_Run#Condensed_Log_-_Date:_00:39.2C_13_August_2007_.28UTC.29.

Note that only the eight pages under the "Created Protein Pages" were created. The entries under "Skipped Proteins" and "Redirected Proteins" hit existing articles in the main WP namespace and hence were flagged for manual inspection to merge the new protein box. Will do a couple of those this afternoon. More information on the flow and usage of the bot can be found at User:ProteinBoxBot.
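For readers who want a picture of the three log categories, the decision can be sketched as below. This is only an illustration of the behaviour described in the paragraph above, not the ProteinBoxBot's actual code; the function name and the `existing_pages` mapping are hypothetical.

```python
def classify_targets(gene_titles, existing_pages):
    """Sort candidate titles into the three log categories.

    existing_pages maps a title to None for a normal article, or to the
    redirect target when the title is a redirect (an assumed representation).
    """
    outcome = {"created": [], "skipped": [], "redirected": []}
    for title in gene_titles:
        if title not in existing_pages:
            outcome["created"].append(title)      # no page yet: safe to create
        elif existing_pages[title] is None:
            outcome["skipped"].append(title)      # article exists: manual merge
        else:
            outcome["redirected"].append(title)   # redirect exists: manual check
    return outcome


if __name__ == "__main__":
    pages = {"Apolipoprotein E": None, "APOE": "Apolipoprotein E"}
    print(classify_targets(["ITK (gene)", "Apolipoprotein E", "APOE"], pages))
```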

Also note that one page for PPARG was speedily-deleted. Discussion ongoing over on the bot approval page.

Comments and suggestions are most welcome! Also, special acknowledgement of User:JonSDSUGrad, who actually wrote the bot. He'll of course also be actively involved in digesting any feedback and incorporating it into PBB for future runs. AndrewGNF 20:02, 13 August 2007 (UTC)

I've also added two semi-automated edits for Apolipoprotein_E and Amyloid_precursor_protein. These were existing pages that the ProteinBoxBot identified and flagged for manual integration with the new ProteinBox. Comments on these changes are also welcome (in particular, the best way to dealing with the "Summary" section that the ProteinBoxBot adds). Cheers, AndrewGNF 21:54, 13 August 2007 (UTC)
Also, Tim reminded us that adding references would be very beneficial. Easily done, the only problem is selecting which references. For example, take PPARG. From the Bibliography section of the Entrez Gene page, you can click a link to Pubmed which retrieves 431 linked publications. Seems like too many to add to the WP page. If anyone has a suggestion on how to pick which we should add and how many we should add, we're very open to suggestions. Cheers, AndrewGNF 22:00, 13 August 2007 (UTC)

NOTE: For simplicity, let's move all subsequent discussions over to the bot approval page (Yes, I realize it's just been me discussing with myself over here...) Cheers, AndrewGNF 23:37, 13 August 2007 (UTC)

Daisy woke up after a long nap :)

Hi, you might recall my friend Daisy who wanted to make referenced stub articles for every taxon in the NCBI last March? Well, she woke up again and has improved herself slightly. If you have a free moment, please review the nine pages at Category:Archaea taxonomic classes. Daisy would appreciate any advice on her output before she starts making lots of articles. Her plan is to work "top-down", i.e., do all the known phyla, then all the known classes, etc. to keep the scope of her work modest.

She still has a few bugs in her throat, such as the page numbers at Bergey's Manual. Any other alerts to problems or suggestions would be most welcome. At User:David D.'s suggestion, her pages also have two templates, one for database links and the other for references from PubMed, etc. Any improvement in those templates would be most welcome as well! :) Willow 21:15, 14 August 2007 (UTC)

Daisy is to be commended - overall the results are impressive. A few suggestions:
1. Under Further reading, put the Scientific databases sub-section below the journals and books - with the idea that database results may be more-data-than-is-useful-to-readers in many cases.
2. When formatting references, I'd like to see us stick to the PubMed-like style, specifically authors listed as Lastname Initial (e.g. Wilson EO, Chandra SK, Fuji R).
3. If DOIs are available, include them in the ref.
4. Rather than using id = ISBN xxx, the recommendation is now to use isbn = xxx (note lower case); same is true for PMIDs (pmid = xxx).
5. Halobacteria appears to have a parse error when dealing with the [No abstract available] term (Cavalier-Smith T (1986) ref).
6. In Methanobacteria, the second Bergey's ref has an odd way of dealing with an unknown ISBN - perhaps just leave it undisplayed?
-- MarcoTolo 21:42, 14 August 2007 (UTC)
7. (continued from above) The Tree of Life links seem to be erroring out (or is it just me...). -- MarcoTolo 23:41, 14 August 2007 (UTC)

Thanks, Marco! I think I may have fixed all of those errors. For #2, I guess you were objecting to the "and" that preceded the final author's name? I just added four more pages at Category:Archaea taxonomic phyla; any more comments would be welcome. I noticed that the NCBI cites a lot of literature that may not be so germane, but I suppose that we can always delete them afterwards. Willow 17:57, 15 August 2007 (UTC)

These look very good indeed. Great job! Tim Vickers 18:03, 15 August 2007 (UTC)
The "and" is part of it - the rest is a function of using the last, first, and coauthor tags rather than a blanket author. <Beginning personal bias> For academic publications - especially scientific papers - the notion of "author" versus "coauthor" can get messy. Rather than sort it out, I find it better to use a blanket "author" tag, leaving the niceties of author position up to the individual disciplines. For example, rather than
{{ cite journal | last = Palys | first = T | coauthors = Nakamura LK, Cohan FM |no-tracking=true | date = 1997 | title = Discovery and classification of ecological diversity in the bacterial world: the role of DNA sequence data | journal = Int. J. Syst. Bacteriol. | volume = 47 | pages = 1145–1156 | pmid = 9336922}}
which displays as
Palys, T (1997). "Discovery and classification of ecological diversity in the bacterial world: the role of DNA sequence data". Int. J. Syst. Bacteriol. 47: 1145–1156. PMID 9336922.
I prefer something like this
{{cite journal |vauthors=Palys T, Nakamura LK, Cohan FM |title=Discovery and classification of ecological diversity in the bacterial world: the role of DNA sequence data |journal=Int. J. Syst. Bacteriol. |volume=47 |issue=4 |pages=1145–56 |year=1997 |pmid=9336922}}
which displays this way
Palys T, Nakamura LK, Cohan FM (1997). "Discovery and classification of ecological diversity in the bacterial world: the role of DNA sequence data". Int. J. Syst. Bacteriol. 47 (4): 1145–56. PMID 9336922.
In addition to avoiding extraneous "." after initials, this avoids the odd ";" versus "," usage that the cite template uses to separate authors and coauthors. </Beginning personal bias> I realize this is becoming rambling and probably much more than you wanted to hear - thanks for humoring me (regardless of whether you choose to use any of this). In any case, Daisy is doing great work - keep it up. -- MarcoTolo 18:20, 15 August 2007 (UTC)
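For illustration only, here is a minimal Python sketch of how a bot might emit the second (vauthors) form shown above. The function name cite_journal and its arguments are hypothetical and are not Daisy's actual code; only the template parameters (vauthors, title, journal, volume, issue, pages, year, pmid) come from the example in this thread.

```python
def cite_journal(authors, title, journal, volume, pages, year, pmid=None, issue=None):
    """Build a {{cite journal}} call using a single vauthors list.

    authors: list of (lastname, initials) tuples, e.g. ("Palys", "T").
    This is a hypothetical helper, not part of any existing bot.
    """
    # "Lastname Initials" joined by commas, with no periods after initials
    vauthors = ", ".join(
        f"{last} {initials.replace('.', '')}" for last, initials in authors
    )
    fields = [
        ("vauthors", vauthors),
        ("title", title),
        ("journal", journal),
        ("volume", volume),
    ]
    if issue is not None:
        fields.append(("issue", issue))
    # Page ranges use an en dash rather than a hyphen
    fields.append(("pages", str(pages).replace("-", "\u2013")))
    fields.append(("year", year))
    if pmid is not None:
        fields.append(("pmid", pmid))
    return "{{cite journal |" + " |".join(f"{k}={v}" for k, v in fields) + "}}"


example = cite_journal(
    authors=[("Palys", "T"), ("Nakamura", "L.K."), ("Cohan", "FM")],
    title=("Discovery and classification of ecological diversity in the "
           "bacterial world: the role of DNA sequence data"),
    journal="Int. J. Syst. Bacteriol.",
    volume=47,
    issue=4,
    pages="1145-56",
    year=1997,
    pmid=9336922,
)
print(example)
```

Keeping all authors in one list sidesteps the author/coauthor split entirely, which is the point made above; whether a given bot stores names as tuples like this is an assumption.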

That sounds fine to me and Daisy both. She's a very fair-minded passerine and I've never liked that peculiar semicolon and the reversal of names. Daisy had a little trouble learning to replace the - with – but otherwise she's ready to sing her next audition. Does anyone have any preferences? Perhaps the phyla of the Bacteria? the classes of the Fungi? something coloratura from Mozart? I have to dash off to work, but I'll upload her songs tomorrow. :) Willow 21:51, 15 August 2007 (UTC)

OK, please start at the genera Pyrococcus, Sulfolobus and Vulcanisaeta and work your way up the taxonomic ladder, via the taxobox. Do all the references, etc. seem good to you? We probably won't go any lower than the genera, at least for now. Thanks, all! :) Willow 21:35, 17 August 2007 (UTC)
PS. What does the "sp." stand for in species? Do we need to list them, e.g., in Thermococcus? Willow 22:40, 17 August 2007 (UTC)
Wow, a lot of redlinks here. Rhodobacteraceae Tim Vickers 00:51, 20 August 2007 (UTC)
There's fewer now. ;) How do they look? I was worried about the proper categories for them; what do you think? Willow 19:54, 20 August 2007 (UTC)
Looks good! Tim Vickers 20:02, 20 August 2007 (UTC)
I finished all the Rhodobacterales families and genera and all the Archaea up to the families. Would someone please look them over and see if Daisy is making any consistent mistakes? For example, in the taxoboxes on the genus pages, should she abbreviate the genus in the species names, e.g., R. sphaeroides in Rhodobacter? Any guidance on whether the pages are good and whether this approach is worth pursuing further would be most welcome. She'll probably add the Archaea genera next, but we'll wait to hear back from you all re:the species names. Thanks! :) Willow 23:32, 21 August 2007 (UTC)

Wow. Amazing work, thank you. One worry: does the output list ALL the spp in a genus? That might get pretty unwieldy in bacteria. Tim Vickers 23:40, 21 August 2007 (UTC)

Ummm, Daisy lists all the species that the NCBI knows about. I'm sorry, but neither of us knows what a "spp" is; we're also confused about the meaning of all those sp. BLAH species, as in Thermococcus; do we need to include them? What do they mean? Are they just codes for laboratory species that haven't been named properly yet? or maybe not identified as one of the named species yet? Confused, Daisy and Willow 23:47, 21 August 2007 (UTC)
Couple of quick points: A) I think abbreviating the genus name in the taxobox is a Good Idea - saves a ton of space, if nothing else; B) the Rhodobacter article has an id = ISBN attribute in one of the cite templates - it should be isbn =. Will comb through the rest of Rhodobacterales when I get a chance. -- MarcoTolo 00:13, 22 August 2007 (UTC)
Hi all, Daisy may have finished all of the genera of the Archaea. Please check out some of the 98 genera at Category:Archaea genera and please, please let us know if we should be doing something different! Daisy did abbreviate the genus name to its first letter in the species list in the genus taxoboxes, and also kept all the "sp. XYZ-12" species as well. Chirpy and cheery, Daisy and Willow 19:44, 29 August 2007 (UTC)
Fantastic work with bacterial systematics, including correct categories! Biophys (talk) 06:14, 19 November 2007 (UTC)
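The genus-abbreviation convention discussed in this thread (e.g., R. sphaeroides in Rhodobacter) can be sketched in a few lines of Python. This is a hedged illustration, not Daisy's code: the function name is hypothetical, and treating the unnamed "sp. XYZ-12" entries the same way as named species is an assumption, since the thread only says they were kept.

```python
def abbreviate_species(genus, species_names):
    """Shorten 'Genus epithet' to 'G. epithet' for a taxobox species list."""
    prefix = genus + " "
    short = []
    for name in species_names:
        if name.startswith(prefix):
            # Assumption: unnamed "sp. XYZ-12" strains are abbreviated
            # exactly like named species
            short.append(genus[0] + ". " + name[len(prefix):])
        else:
            # Leave anything that does not start with the genus untouched
            short.append(name)
    return short


# "XYZ-12" is the placeholder strain label used in the discussion above
names = ["Rhodobacter sphaeroides", "Rhodobacter sp. XYZ-12"]
print(abbreviate_species("Rhodobacter", names))
```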

Affiliated organisations

Hi there. I'm reading more and more about other Wiki-based biology/molecular biology/biochemistry organisations on the web. As a way to raise all of our profiles and share expertise, are there any objections to me creating a link to a new "Related projects and resources" page from our homepage that can trade links with these organisations and provide an overview of who is doing what? Tim Vickers 18:26, 15 August 2007 (UTC)

Go for it. I'd like to see what other WikiFolks are up to - and the MCB project page could use some new content.... -- MarcoTolo 18:37, 15 August 2007 (UTC)

Page created at Wikipedia:WikiProject Molecular and Cellular Biology/Related projects and resources. Feel free to add links! Tim Vickers 19:45, 16 August 2007 (UTC)

Cheers Tim. I was working on such a page on my own project here http://biodatabase.org/index.php/Related_projects_of_interest - I put that link here because that page of my project is directly trying to assess the same aim as this proposal (just in case you thought I was spamming a bit too hard!) I already found a mountain of related work, and I haven't even been looking very hard! I think this is a great proposal - A well organized well maintained and (near) comprehensive archive of 'Affiliated organisations' as you put it (or simply 'similar work') would be a really great resource. Now that I think about it, it is essential for any scientific project! Keep up the good work! --Dan|(talk) 11:37, 17 August 2007 (UTC)

Introduction to...

In response to a comment on the FAC review of the Oxidative phosphorylation article, I am wondering if it might be a good idea for us to produce an introductory article on biochemistry - similar to Introduction to genetics. We could link to it from the top of our main articles and have it give a broad overview of the subject. I've started a list of what would be included in my sandbox. (Link). Suggestions welcome! Tim Vickers 19:47, 26 August 2007 (UTC)

Assessment of Daisy's articles?

I was thinking of adding assessments to the Talk pages of Daisy's bacterial and archaeal taxonomic articles, but I was uncertain which Wikiproject(s) and which ratings would be best. Do you all think they'd be appropriate for the Microbiology WikiProject or maybe the Tree of Life one — or maybe this one? I was thinking of rating them as "stubs" and of "mid" importance; but maybe they should be of "low" importance, since only an expert would want to learn more about Thermococcus? Suggestions would be most welcome! :) Willow 22:17, 31 August 2007 (UTC)

Does anyone have a suggestion for me? My hands are flying blind, like little chiropters. ;) Willow 23:48, 6 September 2007 (UTC)

Hi all, I'm the editor of Reportergene. The site, which currently takes the form of a blog, aims to be a repository of up-to-date information about reporter genes, gathered from research highlights. To do this, I review up to 50 journals in addition to searching literature databases for appropriate keywords. I'm a molecular pharmacologist doing research in academia, and I need to do this work anyway to stay current (I'm not going to blindly promote my site). To the best of my knowledge, no open journals are devoted mainly to reporter genes, and new developments in reporter gene technology usually come from historically separate fields (chemistry, medicine, physics, optics... besides, obviously, molecular biology). I strongly believe that my work could be very useful to people reading the Wikipedia "reporter gene" page, so last month I tried to add a link to Reportergene as an external link, but the Evil Spartan rejected my addition and suggested that I discuss it with you. Here I am. I think the site I'm developing complies with several Wikipedia guidelines. Please check [4].

faithfully yours, 96well

Partnership with ACS

Hi there, we have a proposal from ACS Chemical Biology to form a partnership with our Wikiproject. This would involve at the beginning the MCB logo on this webpage and them featuring Wikipedia articles as links from their website (probably as "See also" links from relevant papers). Does anybody have any objection to this happening, or do you think this is OK? If this sounds good, does anybody have any other ideas as to how we can cooperate in the future? Tim Vickers (talk) 17:57, 5 January 2008 (UTC)

Do people think we should do this? Responses? Tim Vickers (talk) 17:49, 14 January 2008 (UTC)
Hmm, I have never heard of such a cooperation here on Wikipedia. Does this comply with the Wikipedia rules (are there any)? What would be the benefits for both sides? Is this going to be a simple 'link exchange' or what other forms of cooperation are planned? --Splette :) How's my driving? 17:55, 14 January 2008 (UTC)
To begin with it would be a simple link exchange, I would envisage links from sidebars in the journal articles to the relevant Wiki articles and links from our homepage and Wikipedia:WikiProject Molecular and Cellular Biology/Related projects and resources. Further interactions between our community and their readers depends on what people suggest. Since no money is being exchanged, only recruitment of expertise and highlighting resources, I don't see any conflict with Wikipedia's principles. Tim Vickers (talk) 18:09, 14 January 2008 (UTC)

As far as I can see they don't need our approval to link to Wikipedia articles. It rather seems they are asking us to place advertisement links to their page here. Sorry, I don't want to spoil this idea, but to me it somehow sounds like linkspam. Maybe I am paranoid. Maybe I would be more excited about the idea, if I could see an actual benefit of such a cooperation for both sides. Please feel free to ignore my opinion. After all, I haven't contributed much here in a while... --Splette :) How's my driving? 18:32, 14 January 2008 (UTC)

Personally, I have some doubts/questions about the details of how this would work in practice, but mostly those are on the journal side. (For example, will they link to a specific version in history, or the latest-and-greatest version which may "drift" in time?) But as far as I see, there's no downside to giving it a shot. Linkspam for me is not an issue if it's just a single link on Wikipedia:WikiProject Molecular and Cellular Biology/Related projects and resources. Of course, if they wanted to expand to more systematic linking from WP articles to their journal articles, perhaps that would need more scrutiny. AndrewGNF (talk) 18:43, 14 January 2008 (UTC)

Your point is a good one Splette, we need to avoid any appearance that we are "endorsing" a commercial site. However, the advantage to Wikipedia is that it will provide an opportunity to attract readers of the journal into editing articles that they have direct expertise in, as well as hopefully attracting more members to MCB. The advantage to ACS is that they are making part of their site a Wiki link so raising their profile on Wikipedia might allow them to recruit some editors with experience of working in a Wiki environment. In that sense ACS is a related project since they are beginning to use Wiki tools to involve their readers in a novel way. Tim Vickers (talk) 18:54, 14 January 2008 (UTC)

In principle this sounds fine. (Especially since ACS has sometimes not been famous for their devotion to open-sourcing information). Has there been an on-wiki discussion of this idea? Since we occasionally have spam concerns, it is good if we have someone from ACS we can raise the issue with, and if that person has joined in an on-wiki conversation it gives us somewhere to go to pursue that. Also a talk thread is a place where people who are worried about the relationship could go to be sure it was legit. So I'm suggesting that links to any relevant talk threads be posted. If the person Tim spoke to is a WP editor, can his name be mentioned here? Thanks, EdJohnston (talk) 19:15, 14 January 2008 (UTC)
This idea came from an e-mail to me from Anirban Mahapatra, who is one of the editors of the journal. I e-mailed him a link to this discussion last week and he said he'd comment here. Tim Vickers (talk) 19:25, 14 January 2008 (UTC)
I still don't quite see what ACS gets from all of this, but I'm okay trying it and seeing where it takes us. Maybe I lack imagination, but I can't imagine it doing more harm than good. And linkspam can always be removed if it comes to that. Forluvoft (talk) 19:08, 14 January 2008 (UTC)
Wow, 11 posts in 2 hours. This page hasn't seen that much traffic in a while. I guess we could use more editors, so let's give it a try. --Splette :) How's my driving? 19:14, 14 January 2008 (UTC)

On a more long-term note, as more organisations start to use reader-derived and Wiki content in their websites, we are probably going to see a lot more ideas about integrating their communities with Wikipedia, and applying the resources we have created in novel ways. The opportunities for exchanging expertise and mutual support - particularly in attracting and retaining experts - are very attractive. But we and the organisations involved need clear demarcations between our respective content - as much for their legal protection (copyright/vandalism) as ours. I think this is going to be an area where our project is going to become more important to the wider scientific community. Tim Vickers (talk) 19:18, 14 January 2008 (UTC)

Hello everyone. I am the Assistant Managing Editor for Peer Review at ACS Chemical Biology. I am really glad to see so much discussion on this topic. I'll post with details of my ideas this weekend. I'll keep it brief right now since I am at a GRC (note:idea for a wikipedia article!) on Biomolecular Interactions and Methods. I have been with the ACS for about three months but have been editing Wikipedia for much longer. I am embarrassed to say "contributor" since my recent contributions haven't actually been that many. I agree completely with Tim Vickers. As a Wikipedian, I'd love to see more scientists contributing to our pages. As a professional editor, I'd like to see more activity on the ACS Chemical Biology community pages (Ask the Expert!, wiki, and Events pages). I perceive no conflict-of-interest and see any sort of collaboration as an on-going ad hoc process subject to termination at any time from either end. I think I set the tone in a recent editorial that explained what wikis actually are to a lot of scientists who (I hope) are inquisitive about this. As a collaboration goes forward, I see scope for additional editorials in print on the people behind the pages and what we want to do. I'll post more in the upcoming days. Thanks again! Antorjal (talk) 23:13, 15 January 2008 (UTC)

I feel we should clarify this.
The American Chemical Society is a major actor against open content. They are notorious among chemists and librarians for their continuing opposition to the recently enacted and rather minimal open access deposit requirement for US government sponsored research. They may be a non-profit organization, but in this respect they do not seem to act like one. They are one of the organizations that have for years prevented those of our editors not in major institutions from accessing content that could be used to improve WP articles. They are the organization that tried to destroy PubChem, and came very near succeeding. That they have engaged in these actions in the past I would consider cause for some concern; that they continue even now to oppose even the one-year embargoed free access I consider seriously troublesome. The ACS is entitled to have what principles it chooses, but they directly contradict ours.
Because of this, I would recommend to the WMF that they avoid any formal affiliation--and I remind Antorjal that we here are not authorized to do anything in the name of WP in any formal manner.
But even if they were more benign in this respect, they are still a publisher of journals in competition with other publishers. We can therefore not appropriately give them any preference. If they want their content to be highlighted in WP, there is only one fair and justifiable way: release it under GFDL--this is the general suggestion at WP:BFAQ and applies to everyone. Of course we will continue to link to individual articles as people editing on the article think appropriate. Obviously the place for Dr. Antorjal to suggest such links is on the relevant article talk pages.
On the other hand, they, just as anyone else, commercial or otherwise, are welcome to use our content, if they license it appropriately. And of course they, like all the world, can link to it, and I join the others here in encouraging them to do so. And certainly anyone here is totally free to edit their wiki, or any other, commercial or otherwise. I don't think doing so would be a conflict of interest. I note that their wiki is copyrighted, and copying Wikipedia content from it would violate our licensing requirements. If one copies WP content onto it, one should indicate that it is from WP, that it is GFDL not ACS copyright, and give at least a link to the license.
I have every confidence that it is not the intention of Dr. Antorjal or anyone at ACS to do anything against our usual practices, and I do not mean to suggest that he is asking this of us. I will say to their credit generally that they seem never to have engaged in spamming us--unlike certain other commercial and non-commercial publishers who have needed to be reminded not to do so. DGG (talk) 18:05, 31 January 2008 (UTC)
DGG is right here that some distance needs to be maintained, though we do need to find ways to work with organisations like this without keeping them at too much of a distance. The best approach seems to be at the editor level, for people from here to go and edit on other wikis (contributing their expertise), and to encourage people there to come back here and edit here. Also, articles and editorials about Wikipedia and wikis are great and should be encouraged as much as possible. In terms of exposure and traffic, the best way for any website to get traffic from Wikipedia is to provide content and webpages that get used in the references in articles, but to leave it to Wikipedia editors to add such references themselves as and where needed. I remember a similar debate about Scientific American a while back. Carcharoth (talk) 03:04, 20 February 2008 (UTC)

Carcharoth alerted me to this discussion, since I was involved in talks with CAS on a collaboration (which is mentioned right above). We had an IRC discussion on Feb 5th where we agreed to try this approach, and I said I would try talking to them (though the full story is more complex!). CAS will be checking all of our CAS numbers (~7000) for us free of charge, which will save us hundreds of hours of validation work. DGG is quite right about the (quite recent!) history, but at the same time some at ACS are very interested in what we do and are very supportive. I have spoken with people at ACS Chemical Biology, and this journal is considered by ACS people to be a "trailblazer" and as such it tends to be much more progressive in its outlook. From what I know of the people involved, I'd say the offer has been made in good faith. There are obvious concerns - WP can't be seen to be promoting one journal over another, and there is a natural suspicion based on events like the PubChem spat, but the ACS people will understand these concerns. I think if the project explains the WP rules clearly at the start, and what the project is wanting, then perhaps progress can be made, and there is little to lose in the attempt. At best, it may bring some new experts into this project. Cheers, Walkerma (talk) 03:51, 13 March 2008 (UTC)

leaving regenerating sections while revising cell articles

For people studying in the fields of bioregeneration and stem cell projects, it would be very useful if whoever is researching an individual cell could take notes on regenerating chemicals, the pros and cons of the cell's reactions with different chemicals and environments, and anything at all that could aid research into extending cell life. Thanks, Roy Stanley (talk) 17:46, 14 January 2008 (UTC)

High School Student Use of Molecular and Cellular Biology

As a sophomore in high school I am taking Biology 1, and may continue on to Bio 2/AP Bio, and would A) like to thank you guys for doing what you do to help make sure that the information is complete and correct, and also for using diagrams to help visual learners like myself understand the topics. The biology pages on Wikipedia are not only helpful with homework, but can also be helpful in learning general ideas, and more specific ideas when your teacher is not clear about what it is we are supposed to be "learning". On that same note, the biology pages of Wikipedia also help greatly with specific facts and minute details of any topic. Because of you guys working so hard, I have found Biology to be one of the most complete and comprehensive subjects on Wikipedia.

B) I would like to propose (I have no idea if you folks are already doing this) that the articles covering ideas found in basic biology textbooks and Bio 1, 2, 3 classes be organized around general topics for each page, with specifics below each topic, and then another section below that for obscure details and other ideas that a more advanced biology researcher might use when using Wikipedia. Even though our teachers may not, we bio students really appreciate having such a valuable resource as the Biology section, and if the pages with basic bio ideas (Cell Function, Cell Replication, DNA Function and the like) were organized a little more comprehensively, then I think, no, I guarantee, that the pages would get much more use and would be more appreciated. Thank you for reading my proposal, and thank you for working hard to do what you guys do. —Preceding unsigned comment added by 64.9.41.41 (talk • contribs)

I'm not sure that I follow your suggestion. Are you suggesting that the page be split into basic, intermediate and advanced sections? If so, then I don't think this can happen. All articles on Wikipedia, regardless of which subject they fall under, are expected to conform to certain criteria regarding their layout. Suppose the current situation gives subheadings A, B and C in which all information is mixed; if we separate out the basic, intermediate and advanced level topics we get a somewhat less coherent bA, bB, bC, iA, iB, iC, aA, aB, aC. Indeed, if that were going to happen, it would perhaps be better to have three completely different versions of the same article, each aimed at a different audience. This may happen at some point in the future, but I wouldn't expect it in any less than five years' time (and probably much, much longer). ----Seans Potato Business 20:42, 8 February 2008 (UTC)
Dig around Category:Introductions and the associated links and articles, and you will find plenty of discussion on whether we should have articles at different reader levels. I tend to agree that "not yet" is a good idea, though people willing to put in the work to set up the different articles, and do it well, tend to do OK. Carcharoth (talk) 02:54, 20 February 2008 (UTC)

mass creation of PDB images

Is anyone interested in helping to create images from PDB data en masse? It looks like the consensus is that the PDB image thumbnails that we download directly from the PDB (e.g., [5]) should not be used at WP. Is anyone interested in creating thumbnails for the ProteinBoxBot to use? (Or I even seem to remember someone suggesting that a public-domain thumbnail be made for all proteins in the PDB and deposited over at the commons.) If anyone is interested in scripting that up, please speak up... AndrewGNF (talk) 01:11, 20 February 2008 (UTC)

Dammit. That's bad news. Tim Vickers (talk) 01:17, 20 February 2008 (UTC)
How many are there? How resource intensive is it to generate the images ourselves? If it is a lot of images, or the process is resource intensive, you might want to get a second (or third or fourth) opinion before going ahead with this. I'm reasonably sure of what I've said, but it might be possible to argue that using a software program to generate an image from public domain data isn't sufficiently creative to make the resulting image copyrightable or restricted to the extent of being able to request non-commercial use. But that would be effectively stating that the PDB are claiming non-commercial restrictions they shouldn't be, and while that isn't impossible (people incorrectly claim copyright on public domain pictures all the time, for example), we would need to be a lot more sure of ourselves before saying that. They could argue that they don't want other people to profit from the time and expense of the people and resources used to create those images, and that is what will happen if we say they are free. On the other hand, that picture scanning landmark ruling I can never remember the name of said that time and effort involved in creating an exact copy doesn't impart any copyright. But then protein images are not technically an exact copy of the data, though in some senses they are. But probably not in the copyright sense. Hang on, I'll ask someone who might know. Carcharoth (talk) 02:23, 20 February 2008 (UTC)
As of today, there are 49048 total structures in PDB. ProteinBoxBot I think has uploaded about 3000 of the thumbnails. Image generation from the primary data I don't think is too terribly difficult, but then again it's been a while since I did it myself and I never tried to do it at a very high quality level. Looking forward to hearing what your source has to say. As a desperation effort, I also emailed the principal investigator that oversees the PDB, hoping that they might consider rewording their license. Unqualified public domain release of those images seems to me to be in the spirit of the overall effort... AndrewGNF (talk) 02:35, 20 February 2008 (UTC)
For how many of the 49048 structures has PDB made images? Maybe we could trade images with them? :-) And does ProteinBoxBot effectively take the protein data and generate the article (well, infobox) as well as using the image? That sort of thing is fine - it is the images that seem to be the sticking point here. I agree, if the rest of the PDB project is in the spirit of making protein data available to the public and scientific community, then releasing the pictures freely would be in the same spirit. Let's hope so. Carcharoth (talk) 02:50, 20 February 2008 (UTC)
I think PDB generates thumbnails for all their protein structures as part of their standard pipeline. Not sure if it's a completely automated or semi-automated (with some human input) process though. ProteinBoxBot (also sometimes referred to as PBB, just to make things confusing) actually only takes the thumbnail image from PDB. The rest of the data comes from textual sources in public databases. The only other image data are created by me, and those image usage tags were recently addressed as well with the help of a couple of good samaritans. Yup, keeping my fingers crossed on PDB changing their license, since that by far is the easiest solution... AndrewGNF (talk) 03:21, 20 February 2008 (UTC)
I've been doing some searching and the EBI maintains the macromolecular structure database, and has generated their own images, which they state "The public databases of the EBI are freely available by any individual and for any purpose." I'll e-mail Jawahar Swaminathan, the guy who deals with this to check tomorrow, but this might be another option for us. Tim Vickers (talk) 03:15, 20 February 2008 (UTC)
Excellent idea Tim... If nothing else, it'll be something to hold over the PDB folks. "Well, the EBI folks allow us to use their images..."  ;) AndrewGNF (talk) 03:24, 20 February 2008 (UTC)
User:Jawahar.swaminathan might be quite sympathetic to the idea of Wikipedia using his images. Tim Vickers (talk) 03:49, 20 February 2008 (UTC)
Yeay! :) Well, it looks as though the problem is solved, but if it's needed, I'd be willing to write a program to make PyMol images of any desired subset of the PDB. But if it's to be automated, we would need to decide on a common look? For example, do we show ribbons or all-atoms? Do we color the ribbons by residue, by chain (if more than one), by secondary structure? Do we show metals, cofactors or other ligands? I'm sure that we could come up with reasonable rules of thumb. For example, perhaps we should align the proteins so that their longest axis is vertical and their second-longest axis is horizontal? Anyway, just some random thoughts, Willow (talk) 06:44, 20 February 2008 (UTC)
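The alignment rule of thumb suggested above (longest axis vertical, second-longest horizontal) could be sketched as a principal-axis calculation on the atomic coordinates. This is a hedged illustration in plain Python, not a PyMol script: the function names are hypothetical, the toy coordinates are made up, and power iteration with deflation is just one simple way to find the axes. A rendering script would still need to rotate the view so that these axes line up with the screen.

```python
import math


def _covariance(coords):
    """3x3 covariance matrix of a list of (x, y, z) points."""
    n = len(coords)
    center = [sum(c[i] for c in coords) / n for i in range(3)]
    cov = [[0.0] * 3 for _ in range(3)]
    for c in coords:
        d = [c[i] - center[i] for i in range(3)]
        for i in range(3):
            for j in range(3):
                cov[i][j] += d[i] * d[j] / n
    return cov


def _dominant_axis(m, iterations=500):
    """Power iteration: unit eigenvector for the largest eigenvalue of m."""
    v = [1.0, 0.9, 0.8]  # arbitrary start vector
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
        eigenvalue = norm
    return v, eigenvalue


def principal_axes(coords):
    """Return (longest_axis, second_longest_axis) as unit vectors."""
    cov = _covariance(coords)
    longest, lam = _dominant_axis(cov)
    # Deflation: subtract the longest axis's contribution, then repeat
    deflated = [
        [cov[i][j] - lam * longest[i] * longest[j] for j in range(3)]
        for i in range(3)
    ]
    second, _ = _dominant_axis(deflated)
    return longest, second


# Made-up toy "molecule": long along y, shorter along x, thin along z
toy = ([(0.0, float(y), 0.0) for y in range(-10, 11)]
       + [(float(x), 0.0, 0.0) for x in range(-3, 4)]
       + [(0.0, 0.0, -1.0), (0.0, 0.0, 1.0)])
longest, second = principal_axes(toy)
print("longest axis:", [round(x, 3) for x in longest])
print("second axis: ", [round(x, 3) for x in second])
```

For the toy points the longest axis comes out along y and the second along x, so a viewer would put y vertical and x horizontal. Real structures with nearly equal axes (roughly spherical proteins) have no well-defined answer, which is one reason the rotating-animation idea is attractive.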
PS. Ummm, if I'm not mistaken, there's a lot of redundancy in the PDB; you know, close homologs, point mutants, apo- and holo-enzymes, and all that? Perhaps we should focus on a well-defined non-redundant subset, say, the domains of Pfam or somesuch? Willow (talk) 06:44, 20 February 2008 (UTC)
PPS. Even better, we could make little animations of the PDB structures revolving about their longest axis; then we wouldn't need to worry about finding the best perspective by computer! :) Willow (talk) 06:47, 20 February 2008 (UTC)
Clearly, we've found someone with far more experience with creating images from PDB data than I. (Not that it was particularly hard to exceed my experience level...) Let's see what the EBI folks say since that is the most expedient solution for PBB. But the animation idea? Wow, that would really make those gene pages pop! I'd support doing that regardless... We could help resolve the redundancy by selecting (albeit randomly) one structure per human gene. That would help for examples like Cdk2. Great idea! AndrewGNF (talk) 16:59, 20 February 2008 (UTC)
Great ideas all around, but I think we should concentrate at first on a simple solution for replacing the PDB image thumbnails, since it would be nearly impossible to automate picking the best structure, deciding which cofactors and/or ligands to include, and, for multidomain proteins where a structure for the entire protein is lacking, figuring out the best way to combine several structures to produce a plausible picture of what the entire protein would look like. The best picture will ultimately have to be produced by hand on a case by case basis. I like the movie idea though. Boghog2 (talk) 20:43, 20 February 2008 (UTC)
Well, not more experience, I fear; but fools may rush in where scientists fear to tread. ;) Happily, I had a good teacher. :)
So, ummm, it kind of works. Meticulous as she is, Daisy is rather slow, but for a passerine, she's a pretty good artist. ;) Check out her Opus 1. :)
Unfortunately, her nice picture has a rather spare home to go to; perhaps we could spruce it up? Suggestions for Daisy's look, format, etc. would be very welcome; I'm just faking it as I go along. Willow (talk) 20:55, 20 February 2008 (UTC)
I think "automagically" is the word you're looking for. It's astounding, Willow! Tim Vickers (talk) 21:02, 20 February 2008 (UTC)
It is hypnotizingly beautiful, no? I can understand why people might spend their lives working on such things. But the magic is properly placed with PyMol, which did all the heavy lifting; I hardly had to do anything. One concern I had, too, was that the file is rather large — almost 8 MB — but it doesn't look good (to my eyes) when I shrink it. Such large files mean slower uploading times for the enzyme pages, but maybe we could afford one splashy image per protein page? I'm finding it hard to bring myself to spoil the beauty. I'll bow to whatever other people suggest, though. :) Willow (talk) 21:08, 20 February 2008 (UTC)
Looks great! Regarding formatting, can we see how it would look if all alpha-helices were one color and all beta-sheets were another color? Just a minor thought, don't know at all whether it'd look good. If you'd like to create a model for a human protein (and perhaps are attached to the lactoylglutathione lyase family), how about 1bh5? That might go nicely in our GLO1 page... Well done! AndrewGNF (talk) 21:25, 20 February 2008 (UTC)
Oh, okay! I have to dash off to work soon, but I might get it done in time. I'm downloading the biounits from the PDB right now so that I can help out with whatever you all ask of me. :) Ta ta, Willow (talk) 23:47, 20 February 2008 (UTC)
I'm afraid that I won't make it tonight. :( It's taking Daisy about 1/2 an hour per protein animation and I need to dash off before then; sorry! But you'll have a nice animation waiting tomorrow morning. Willow (talk) 23:59, 20 February 2008 (UTC)
The animation is great, but really, the ideal is a click and drag display, where you can rotate it to the bit you want to see, using a mouse, and then zoom in and out. If you can't control the rotation, or zoom in on bits of it, then it isn't really as useful as it could be. In some ways, a static display is better than a non-controllable animation. You can take the time to look at the picture, rather than having constant motion. Does that make sense? Carcharoth (talk) 11:25, 21 February 2008 (UTC)
The human version of GLO1; should Daisy have chopped off that annoying N-terminus?
Hi, here's Daisy's next animation; it shows the glutathione part nicely, but the twin oxygens of the substrate proper are missing. The catalytic metal here is zinc instead of nickel, but she used purple for both.
I'm inclined to agree with Carcharoth that the user being able to rotate the structure themselves would be nice; the whole eye-hand coordination thing is important in truly seeing, I think. But I'm not sure how to do that and it seems as though it would be server-intensive? On the other hand, I would imagine that it's no more painful for the server than a 5MB animation. Could someone look into how we might go about doing that? Maybe the reader would have to have some kind of plug-in in their browser? Willow (talk) 15:10, 21 February 2008 (UTC)
Very cool... It appears that there are at least two plugins that allow for manipulations of structures in a browser. Not familiar enough with how they work though to say how they could interface with WP. I agree that this type of full-controls interface is the best visualization, but I do think the rotation animations are better than static images. It in part reduces the worry that we haven't chosen the right orientation, and it also provides three-dimensional perspective that you can't get from the 2D view. But in truth, I think for the vast majority of people, the structural image is pretty much just eye candy, and the animated view just seems like it will catch more eyes and get people thinking about protein structure and proteins in general. My two cents... (Tim, no reply yet from the EBI folks?) AndrewGNF (talk) 18:01, 21 February 2008 (UTC)
Very cool indeed! But a word of caution concerning users with low bandwidth and/or processor speeds. A static image as the default with a link to a movie would provide both fast download plus snazzy graphics on demand. I am currently viewing this page on a Macintosh PowerBook G4 and things are kind of sluggish. Just my 2 cents worth. Boghog2 (talk) 23:58, 21 February 2008 (UTC)
No reply from Dr Swaminathan yet, I've e-mailed the MSD helpdesk as well. Tim Vickers (talk) 00:59, 22 February 2008 (UTC)
Well, if the goal is eye-candy and luring our readers down the fateful honey trail towards enlightenment, we could make a "teaser" animation at much smaller scale, but link to a fuller animation like the one above for Leishmania major GLO1. Conversely, I saw a trick yesterday on the web for animating a picture only when the mouse is placed over it, but I don't know if it could work here. The "click-to-animate" would work as well. Willow (talk) 01:14, 22 February 2008 (UTC)
The human version of GLO1 (1BH5), drawn using http://bioserv.rpbs.jussieu.fr/cgi-bin/PPG. 757kb
Alas, I think I have to agree with Boghog. In my enthusiasm over some very visually-compelling images, I just assumed that those large file-size issues could somehow be solved. I tried making an even smaller version (300px width, close to the standard 250 px PBB size, and "rocking" instead of full rotation) using this tool I found online. The image at right turned out to be 757kb. Getting close to reasonable maybe... But if we really would be stuck with animations that are that big, perhaps our mass effort should be on simply replacing the PDB images with free versions. (Linking a static image or low-res animation to the high-res animation sounds good too. I think the question is whether we can get a teaser animation small enough...) AndrewGNF (talk) 01:22, 22 February 2008 (UTC)
I'll experiment a little. As an aside, we can get the animation smaller if we focus on just the biological complex, rather than the asymmetric unit of the crystal structure, which might have extra subunits, as we see at the right. :) I really have to dash off, though, my friends are waiting for me! :) Willow (talk) 01:33, 22 February 2008 (UTC)
and just for comparison's sake, Image:1axc_tricolor.png is 527 KB and approximately 1100px square. I down-sampled it to 300px square and that image is 85 KB. For whatever that's worth... AndrewGNF (talk) 02:43, 22 February 2008 (UTC)
Smaller, faster version of Opus 1. It's about 1 MB, roughly 8 times smaller than the original.

arbitrary break

Willow, I think that smaller version looks great. How do people feel about 1MB? This tool gives some download time estimates. That size doesn't bother me, but don't know where the majority of users fit. As an aside, I heard back from Phil Bourne, the co-director of the PDB, who said that they are looking into licensing those images under CC 3.0. Will post here as I hear more. (But as another aside, I'm noticing that the static EBI images tend to be much higher quality than the ones from PDB/RCSB. Compare Image:PBB_Protein_ITK_image.jpg to [6].) AndrewGNF (talk) 17:37, 22 February 2008 (UTC)
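For a rough sense of the numbers being weighed here, the following is a minimal sketch of the download-time arithmetic. The file sizes are the ones quoted in this thread; the connection speeds are illustrative assumptions, and the estimate ignores protocol overhead and latency, so real times would be somewhat longer.

```python
def download_seconds(size_bytes, kilobits_per_second):
    """Idealised transfer time: bits to send divided by line rate."""
    return size_bytes * 8 / (kilobits_per_second * 1000)

# File sizes mentioned in the discussion (approximate, decimal units).
SIZES = {
    "original animation (~8 MB)": 8_000_000,
    "smaller animation (~1 MB)": 1_000_000,
    "rocking animation (757 kB)": 757_000,
    "static 300px image (85 kB)": 85_000,
}

# Assumed connection speeds in kbit/s; these are examples, not survey data.
SPEEDS = {"56k dial-up": 56, "1 Mbit/s broadband": 1000}

for label, size in SIZES.items():
    for link, kbps in SPEEDS.items():
        print(f"{label} over {link}: {download_seconds(size, kbps):.1f} s")
```

Under these assumptions the 1 MB version takes about 8 seconds on a 1 Mbit/s line but well over two minutes on dial-up, which is the trade-off the thread is debating.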

I think 1 MByte is a good compromise, and downloading times should be reasonable for most folks. The movies are visually stunning. My only remaining reservation is that the animation may be distracting. Ideally there should be a way of toggling the animation on and off. Cheers. Boghog2 (talk) 18:49, 22 February 2008 (UTC)
Thank you all! :) I've downloaded most of the biounits, so I'll be ready pretty soon to start producing the animations, once we figure out how to make them less annoying; maybe a test-run next week on some PBB pages? Of course, there's no rush, except that I'm traveling soon to help my sister, who's getting married this summer. :) Speaking for myself, I'm not married to the idea of animations, so if we prefer the idea of static images from EBI or the PDB, I'll be just as happy as a clam. (:)
Oh, by the way, here's that description of the java-script solution, but I don't really see how to do that here in Wikipedia. Can anyone figure it out? Willow (talk) 00:34, 23 February 2008 (UTC)
I don't know if one can embed JavaScript inside wiki markup language. An alternative might be to convert the animated gif into ogg movie format using ffmpeg2theora. (I first needed to save the gif in mpeg4 format using Quicktime Pro). Below is an example. In Safari I had to install the QuickTime Ogg plugin for best results (the movie playback using the default java plugin doesn't look very good in Safari). The advantage of the ogg movie format is that with the right plugin, one has a lot more control over the animation (start, stop, manual scroll through the frames, etc.). Boghog2 (talk) 09:41, 23 February 2008 (UTC)
glyoxalase I
Identifiers
Symbol: GLO1
NCBI gene: 2739
HGNC: 4323
RefSeq: NM_006708
Other data
EC number: 4.4.1.5
This seems like a very good solution to me! We have the static image, but anyone interested may examine the structure in more detail by scrolling through the movie. One problem for me, though, was that the playback cut off the bottom of the animation; for example, I couldn't see the bottom turn of the yellow helix. Is that just my browser's fault? It used the Java player without offering me a choice. How can we ensure that semi-literate people like me can actually see the animations in their biochemical glory? It'd be really nice, if the static image were as small as it is now, but the movie player sprang up in a separate window showing the animation in full-size. :) Just a few stray thoughts, Willow (talk) 12:39, 23 February 2008 (UTC)
Concerning the bottom of the animation being cut off, I had precisely the same problem when using the default Cortado (Java) player that was installed both with Firefox and Safari on Mac OS X. I then installed the ogg QuickTime plugin which solved the problem. There is also a Windows version available from the same download page. After starting the movie, you should see a "more" link just below the image. Click on the link and this should provide you with options as to which plugin to use. It is unfortunate that the default ogg plugins don't seem to work very well. This gives me second thoughts about recommending this solution for the ≈ 3000 PDB graphic images that need to be replaced. Boghog2 (talk) 13:25, 23 February 2008 (UTC)
More extensive instructions for installing the necessary software to view ogg movies can be found here. Boghog2 (talk) 13:30, 23 February 2008 (UTC)

I have to admit I'm not a huge fan of the ogg version. The viewing issue with the default ogg viewer as Willow and Boghog mention above, the apparent page reload when you click the play button, and the relatively low quality of the static image all go to reduce my enthusiasm. Plus, as I was flipping through the "more" options a few times, my browser (firefox on XP) froze up and needed to be restarted. With all the options on the table now, I think my preference is leaning back toward high-quality static images. If Willow or someone wanted to create movies then I think it would be cool to display them on the static image pages (after you click on the static image), but of course it gets much less play there so may not be worth the effort. My two cents... AndrewGNF (talk) 15:22, 23 February 2008 (UTC)

Ogg movies are nice in theory, but there are too many problems in practice. I second Andrew's motion to start with high-quality static images with optional links to high resolution movies. Boghog2 (talk) 20:55, 23 February 2008 (UTC)
I had the same trouble with the Quicktime plugin, unfortunately; I had to restart my computer to continue using Firefox. :( I guess I agree with the "static-image-linked-to-animated-GIF" solution. My one reservation is "how do we find the right perspective to see everything?" but I guess anyone who wants to see everything will click on the animation.
Why don't we wait for a day or two to give the EBI time to respond? It would be easier on Daisy, and I'm guessing it might foster nicer relations with the EBI scientists to use their work and to allow them to feel involved. On the other hand, if you'd like to e-mail me a list of, say, 100 PDB codes, Daisy could do a trial run, to warm up her vocal cords. :) Willow (talk) 12:58, 24 February 2008 (UTC)
PS. Maybe we could download the smaller 1MB animation, but wrap it within a "show/hide" template? That way, depending on the default, our readers could make the animation (dis)appear with a single click without ever leaving the article. Willow (talk) 13:05, 24 February 2008 (UTC)
This is why it's so much fun collaborating here. The whole is definitely greater than the sum of the parts... ;) I think the show/hide idea is fantastic -- best of all worlds. It would take some work to add that to the PBB template, and then some more work still to add the animations to all the existing pages (and make sure that PBB doesn't overwrite those changes), but I think the result would be pretty cool. I'll work on getting the PDB entries together, and possibly taking a stab at the template modifications (though more experienced wikicoders are certainly welcome there). And agreed, using the PDB or EBI static images should be our first preference... AndrewGNF (talk) 16:06, 24 February 2008 (UTC)
Okay, just to throw another wrench in it, check out http://www.proteopedia.org. The data takes a while to load, but that JMOL plugin has a lot of advantages too... That site is linked from http://pdbwiki.org, which was brought to my attention by Phil Bourne of the PDB. AndrewGNF (talk) 04:03, 25 February 2008 (UTC)
It's nice and the Jmol thing worked on the first try! :) I didn't like their ribbons as much, though. I tried some of the other viewers mentioned on the PDB wiki, and this Astex one seemed the best, although a little complicated. More generally, we should think about linking to the PDBwiki; they seem to be doing a good job! :) Umm, unless there's a better one? Willow (talk) 11:29, 25 February 2008 (UTC)
Good news, Dr Swaminathan got back to me.

Dear Tim, Apologies for the delay in addressing your question. I've been away on a holiday. The images are free to use and are not copyrighted by the EBI. Ideally, you may want to link to the pictures directly using the URL for the images. That way, when and if the images are re-made, you will have access to the latest image. Jawahar Swaminathan

Looks like we're fine with EBI. Tim Vickers (talk) 15:58, 25 February 2008 (UTC)
Hmmm, I wonder then if EBI would be okay with mass uploading of all pre-generated PDB images to wikicommons? That would make the utilization of PDB images really easy, if we could just assume that an image existed at a well-structured filename... Even better if they'd incorporate that into their standard analysis pipeline. Perhaps I'll upload one image to wikicommons to get feedback on the correct license tagging. AndrewGNF (talk) 18:04, 26 February 2008 (UTC)
Okay, I'm no licensing genius here, but I think we'd hit the same problems as with the PDB itself. Looks like wikicommons also likes explicit statements relating to free licenses, and not just "terms of use" statements that seem compatible with free use. Also, the EBI terms of use seem to contradict themselves at certain points. For example: "1. The public databases of the EBI are freely available by any individual and for any purpose." "10. The EBI itself places no restrictions on the use or redistribution of the data available via its services..." and "12. The EBI requires attribution for any of its services or data that it is subsequently used in another product or service.". Ugh. I wonder if, like the PDB, we could convince them to release everything under an explicit open source license. AndrewGNF (talk) 18:20, 26 February 2008 (UTC)
Freely usable for any purpose, but with attribution. The problem we had with the RCSB was their restrictions on commercial use, which the EBI does not impose. Tim Vickers (talk) 18:26, 26 February 2008 (UTC)
Folks over at the commons are fine with bulk loading snapshots of all ~50K structures in PDB (from EBI), but suggest that we categorize them. [7]. Thoughts? AndrewGNF (talk) 17:15, 28 February 2008 (UTC)
It seems that we have two basic choices: structure or function. Since these are graphics of structures, the structural classification would seem to make the most sense. The PDB already classifies each and every structure according to CATH and SCOP, so why not go with one of these? Boghog2 (talk) 18:37, 28 February 2008 (UTC)
Great idea, I've added SCOP categories to Image:PDB_1bh5.jpg. Any other ideas? And perhaps more importantly, anyone want to volunteer to write the download/upload program? (Jon has been laying low while he's writing up his thesis...) AndrewGNF (talk) 21:56, 28 February 2008 (UTC)
As an aside, just found this page: Wikipedia:Using_Jmol_to_display_molecular_models AndrewGNF (talk) 06:34, 4 March 2008 (UTC)

FTO gene

I was hoping to work on the FTO gene stub page, updating it and expanding on what is already there. If there are any objections to this, could you please notify me so that I do not waste time working on it only to have it erased? Hrpatel08

You should definitely feel free to work on the FTO gene page, or any other page for that matter. If anyone has any input on your edits, hopefully they will work with you to make things better. Highly unlikely that anyone will erase your edits out-of-hand. People here (especially at WP:MCB) tend to be a pretty friendly crowd. Welcome, and look forward to your contributions. Cheers, AndrewGNF (talk) 05:56, 27 February 2008 (UTC)

Proposal: task forces

We're an extremely large project, with an article set currently hovering around 12,000 articles, not including the several thousand proteins stubs that haven't been tagged yet. Perhaps we should consider additional levels of organization, such as one or more task forces? – ClockworkSoul 06:11, 24 March 2008 (UTC)

A focussed effort with a defined goal (and rewards for major contributors) has worked well at Good Articles to deal with their backlog in the past. I'd be wary of adding extra layers of bureaucracy though, unless our goal was clear. Tim Vickers (talk) 19:35, 25 March 2008 (UTC)
That's a good place for us to start. I briefly skimmed WP:GOOD, but I don't have the motivation to do a deep search. What kinds of rewards did they use? My motivation for suggesting this, though, is that the proliferation of PBB-added proteins – while indeed a very good thing – is flooding the "importance=low" categorization with articles that aren't likely to be the targets of a great deal of attention any time soon. The effect is an "importance inflation", such that articles that would have once been ranked as "low importance" no longer belong among their new "very low" importance peers. Rather than try to cram everything else into the remaining importance values (top/high/mid), I propose that we take one of three routes: first, we could remove the articles from the MCB project, which, while it has the virtue of being easy to enact, feels to me like throwing out the baby with the bathwater. Second, we could start using the "NA importance" value, but that ranking is inconsistently implemented across projects and I would prefer to use a more standard solution. Third, we could assign all of the PBB articles to a separate set of categories of their own, as if they were the domain of a task force. This last solution would require the addition of a flag to the project banner template (i.e., {{MCB|pdb=yes}} ), and would be fairly trivial to put into place. Thoughts? – ClockworkSoul 00:43, 26 March 2008 (UTC)
Having in some way contributed to this "problem", I feel compelled to chime in... ;) Personally I think that MCB articles can still be categorized into the existing top/high/med/low hierarchy. True, PBB has exploded the low category, but are they really much lower than the ones that were previously labeled low? Since I've been fascinated by power laws recently, I sort of think that the MCB categorizations should follow a similar distribution. If we have 9541 articles currently in those four categories (11027 total - 1486 "None") and we shoot for a half-log10 unit between each category, then that means the target distribution compared to actual would look something like this:
Category   Target   Current   Difference
Low        6590     7933      -1343
Med        2084     865       1219
High       659      574       85
Top        208      169       39
If we go by this, then I think the plan would involve recategorizing a bunch of those Low articles into Medium (and a few to High and Top). If we wanted a systematic way to do this, I'd suggest ranking by the number of linked Pubmed citations as indexed by Entrez Gene (a metric that a coworker likes to call the "sexy factor"). Of course, that only applies to gene articles, and maybe it's the non-gene articles that should be first to be promoted. Okay, maybe I'm overthinking this one, but just a thought... AndrewGNF (talk) 03:15, 26 March 2008 (UTC)
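The target figures above can be reproduced with a few lines of arithmetic. This is only a sketch of the calculation as described (a half-log10 step, i.e. a factor of about 3.16, between adjacent categories, with the 9541 rated articles as the total); the function and variable names are made up for illustration.

```python
def target_distribution(total, categories, log10_step=0.5):
    """Split `total` so each category is 10**log10_step times the previous one.

    `categories` is ordered from the smallest (most important) to the largest.
    """
    ratio = 10 ** log10_step
    weights = [ratio ** i for i in range(len(categories))]
    scale = total / sum(weights)
    return {name: round(scale * w) for name, w in zip(categories, weights)}

# Current counts as quoted in the discussion (sum is 9541).
current = {"Top": 169, "High": 574, "Med": 865, "Low": 7933}
targets = target_distribution(sum(current.values()), ["Top", "High", "Med", "Low"])

for name in ["Low", "Med", "High", "Top"]:
    # Difference is target minus current, matching the sign used in the table.
    print(name, targets[name], current[name], targets[name] - current[name])
```

Changing `log10_step` shows how sensitive the targets are to the assumed spacing: a full log10 unit, for example, would push nearly everything into Low.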
I'm all in favour of getting all of our articles tagged by importance and class, but if you'll take my advice, I don't think we should spend too much time on making a "sub-basement" for Importance. I kind of feel that content is more important than categorization, and that we should each just find an article we're enthusiastic about, and bring it to GA or FA. If we wanted to make a task force for collaboration, maybe we could focus on one subject area, such as glycolysis or metalloenzymes or structural biology or microscopy or whatever, and make it a Featured Topic, with, say, roughly 10-15 FA's and maybe a featured list or two? We could each pick our favourite article and bring it to FA; of course, Tim would have to do four FA's for every one of ours, to make it an even playing field. ;)
What was decided about the PDB images? Should I upload all the images of the EBI? That seems like a lot, and many of them will be close homologs or just the same protein with small changes in conditions/ligands/whatever, won't they? Maybe a subset would do? Should I categorize them by SCOP family or something? I'm kind of booked right now, but I should be able to do whatever you all decide on in April sometime. :) Willow (talk) 04:18, 26 March 2008 (UTC)

PDB images, part deux

Thanks for reminding me about the PDB images, Willow... It took a while to connect, but Jawahar Swaminathan rubber stamped how I tagged Image:PDB_1bh5_EBI.jpg. Looks like we're in the clear for uploading all images en masse. It's probably somewhere above, but the template URL is http://www.ebi.ac.uk/msd-srv/msdlite/images/XXXX600.jpg. Do you have an easy way to get the SCOP classifications? (Looks like you can get it here: [8].) Once you write some upload code, it would be great if you could share it with me. That way we can set up some automated system to watch EBI's ftp site which has date-stamped updates. AndrewGNF (talk) 04:49, 26 March 2008 (UTC)

Ummm, OK, I guess now I have to start thinking. ;) My computer skills are really basic, so you have to promise not to laugh at my solutions. Honestly, I don't know how to query web-databases from a command line, and I don't know how to set up an automated system for checking updates. I'll be happy to share my code, but given all your experience with the PBB, I'm sure it's "coals to Newcastle".
Off the top of my head, I would separate the crafting of the files to be uploaded from their uploading to Wikipedia, which I'm sure you know how to do already. To avoid lots of calls to SCOP, the crafting iterations could be organized with the SCOP category being the outer loop, and the PDB files as the inner loop; we'd merely need to find/make a listing of the PDB codes within each SCOP category. For each PDB in the SCOP category, we'd fetch its image from the EBI, craft the information file with the right category, and then upload them in batches. But I need to think through how to do it; if you had any suggestions, that'd be great! :) Willow (talk) 05:31, 26 March 2008 (UTC)
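The loop structure proposed above might be sketched roughly as follows. This only plans the work (category as the outer loop, PDB codes as the inner loop, grouped into batches) and builds image URLs from the template quoted earlier in the thread; it does no fetching or uploading. The category labels, the batch size, and the choice to lower-case the PDB code in the URL are all assumptions for illustration.

```python
# URL pattern quoted earlier in this thread; "XXXX" stands for the PDB code.
EBI_TEMPLATE = "http://www.ebi.ac.uk/msd-srv/msdlite/images/XXXX600.jpg"

def image_url(pdb_code):
    # Assumption: the server expects lower-case PDB codes.
    return EBI_TEMPLATE.replace("XXXX", pdb_code.lower())

def plan_batches(codes_by_category, batch_size=100):
    """Yield lists of (category, pdb_code, url) tuples, at most batch_size each."""
    batch = []
    for category in sorted(codes_by_category):            # outer loop: category
        for code in sorted(codes_by_category[category]):  # inner loop: PDB codes
            batch.append((category, code.lower(), image_url(code)))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # final, partially filled batch
        yield batch

# Hypothetical listing; the category names are placeholders, not SCOP labels.
listing = {"example-category-b": ["1AXC"], "example-category-a": ["2axt", "1bh5"]}
for number, batch in enumerate(plan_batches(listing, batch_size=2), start=1):
    for category, code, url in batch:
        print(number, category, code, url)
```

Keeping the planning step separate from any network code means the category listing can be checked by eye before a single image is fetched.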

Oh, sorry, please don't confuse my overenthusiasm with hoisting everything on your plate Willow... Please work on it as your interest and time permit. And, it should be noted, that Jon has done 100% of the PBB coding. I haven't even looked at a single line of code there... (I can't decide whether I should say that with pride or embarrassment...)  ;) Oh, and one last nugget that stuck with me from lurking on the bioperl group a long time ago -- "Working code wins!" (regardless of how ugly we might think it is).

Having said all that, the plan you propose above sounds great. You might also consider (as either an alternative or some sort of hybrid) that SCOP releases all their classifications in parseable flat files. ([9]) Don't worry now about automation. In principle I can see how it would be done using cron, but the details we can figure out later... AndrewGNF (talk) 16:26, 26 March 2008 (UTC)

I'm very late to this discussion, but I'm looking at [10] and I can't find any commercial sale restrictions at all. Have they changed that page recently? Also, a Google site search for "not for sale" (a phrase from a quote given at the copyright questions thread) gives no hits.[11] --Itub (talk) 17:39, 26 March 2008 (UTC)

Yes, after we pointed out to RCSB that their terms were somewhat ambiguous (and more restrictive than intended), I believe they reworked their license statement. So while mass uploading of a new set of images isn't pressing given the change on their end, EBI's canned images just look better I think. AndrewGNF (talk) 18:33, 26 March 2008 (UTC)

Have you looked at MMDB using NCBI's Cn3D viewer? All of PDB, but cleaned up some, and public domain (USGovt work) 3D viewer. 69.115.94.176 (talk) 05:28, 27 March 2008 (UTC)


I was aware of it, but don't have any expertise using it. Can you clarify what "cleaned up some" means? Also, in the context of displaying ribbon diagrams in WP, I'm not sure how MMDB will help. We're trying to steer away from solutions that require external viewers, and I think the current emphasis is on high-quality 2D images (of the type that EBI provides). Or am I missing something here? Cheers, AndrewGNF (talk) 15:03, 27 March 2008 (UTC)

You might find it worth downloading and playing with Cn3D to view MMDB entries. The point of considering the idea of providing MMDB links that would launch true, interactive 3D images in Cn3D would be that it might provide a richer and more informative experience than a simple 2D image. As for "cleaned up", the MMDB curation effort corrects some of the errors sometimes found in PDB deposits (bad alpha-carbon connectivity, uranium atoms at coordinates 0,0,0, and other artefacts occasionally left behind by crystallographers). Maybe PDB has gotten better about catching these, but it used to be a source of frustration to molecular modellers and thus NCBI developed MMDB, which is also well-integrated into all other NCBI ENTREZ resources. 69.115.94.176 (talk) 19:29, 29 March 2008 (UTC)
I'm happy to upload both the EBI and PDB images, so that people can choose what they'd like. The SCOP flat file was all that I needed; thank you, Andrew! :) The rest should be easy.
Honestly, we probably shouldn't argue about which PDB images are better, EBI or PDB; both of them have some failings, don't you agree? The biological unit isn't always shown, the ribbon isn't colored instructively, the orientation sometimes hides important parts, and all those water molecules obscure the ribbon. But my impression is that people don't want to have them made afresh?
I'm concerned about uploading 50 thousand x 2 = 100,000 images, though, if we're not going to use them. It's a feather in our cap, but likewise ornamental until we put them to use in an article. Don't you think it'd be more sensible if we uploaded only a subset, say, towards some directed purpose? For example, maybe we could begin by uploading images for enzymes that already have an article? Just a thought, Willow (talk) 15:55, 27 March 2008 (UTC)
Yes, you're probably right, PDB/RCSB versus EBI isn't the primary issue here. Any automated process needs to make some general rules, and those rules will be more or less useful depending on the user/application. It's just been a while since I've looked at protein structures with an eye toward real science, so I was mostly favoring EBI based on image quality. But neither here nor there...
When I floated the idea over at wikicommons, nobody balked at uploading 50k images. It was surprising to me, but it sounds like people see the value in having the entire collection uploaded. Perhaps we should start with a reasonable hunk of images, say 100 images, and then do one last check over at the Wikicommons village pump to confirm that people are okay with it. At that time, we can also see how people feel about RCSB and EBI... AndrewGNF (talk) 16:40, 27 March 2008 (UTC)
So, do we have permission from Wikimedia to download 50k images? If we do, let's follow the more systematic approach proposed by Andrew and make scripts or a bot to download them all and systematically update. It is hardly practical to select a smaller subset. Which image to use can only be decided on a case-by-case basis. Also, let's use the better images from the MSD. Biophys (talk) 16:10, 6 April 2008 (UTC)
Please remember that one PDB file (e.g. 2axt) can be found in numerous SCOP entries, because a PDB file may include several subunits/polypeptide chains, and each subunit may include several domains, which are classification units of SCOP. However, this is not a serious problem. One can introduce a system of protein structure Categories, so that each "Superfamily" of SCOP is a Category. Then, each downloaded PDB image could be automatically assigned to one or several Categories (SCOP superfamilies) using plain classification files of SCOP. That would be terrific! The required SCOP-PDB mapping files are downloadable from SCOP (here they are) and can be parsed and utilized even using Fortran. One could also check BioPerl libraries for something useful.Biophys (talk) 16:10, 6 April 2008 (UTC)
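The one-structure-to-many-categories mapping described above could be sketched like this. The parser assumes a tab-separated layout in which the second field is the PDB code and the fourth is a dotted class.fold.superfamily.family string, with `#` marking comment lines; that layout is an assumption and should be checked against the actual SCOP classification file. The sample records are made up and do not correspond to real SCOP entries.

```python
from collections import defaultdict

def superfamilies_by_pdb(lines):
    """Map each PDB code to the set of superfamily labels its domains fall in."""
    result = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # not a classification record in the assumed layout
        pdb_code, sccs = fields[1].lower(), fields[3]
        # Superfamily = first three dotted components (class.fold.superfamily).
        result[pdb_code].add(".".join(sccs.split(".")[:3]))
    return result

# Made-up records for illustration only.
SAMPLE = [
    "# invented example records",
    "d9zz1a_\t9zz1\tA:\tx.1.1.1\t1\t-",
    "d9zz1b1\t9zz1\tB:1-90\ty.2.3.1\t2\t-",
    "d9zz1b2\t9zz1\tB:91-200\ty.2.3.4\t3\t-",
    "d9zz2a_\t9zz2\tA:\tx.1.1.2\t4\t-",
]

categories = superfamilies_by_pdb(SAMPLE)
for code in sorted(categories):
    print(code, sorted(categories[code]))
```

Using a set per PDB code means a structure with several domains in the same superfamily is assigned that category only once.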
Sorry, I did not realize that SCOP classification has been already included in the PDB/MSD. So, one should take the classification (categories) from the MSD rather than from SCOP classification files.Biophys (talk) 16:35, 6 April 2008 (UTC)
Unfortunately, SCOP classification is unreliable. I checked a random example (2axt above) and found that SCOP includes only one subunit from this multi-subunit complex. The corresponding PDB entry has no link to SCOP 2axt, but it has links to Pfam families. So, maybe we should use Pfam families as categories, rather than SCOP...Biophys (talk) 17:00, 6 April 2008 (UTC)

Biochemistry daughter project or task force?

It appears that there is not a wikipedia project related to biochemistry, as opposed to molecular biology? Would it be appropriate to create a daughter project or task force, similar to what the Wikipedia project medicine does? --Biophysik (talk) 23:08, 9 November 2008 (UTC)