Wikipedia:WikiProject Chemicals/Chembox validation: Difference between revisions
{{Chemical data validation}}

The [[Wikipedia:WikiProject Chemicals|WikiProject Chemicals]], in cooperation with the [[Wikipedia:WikiProject Pharmacology|WikiProject Pharmacology]], is validating the content in the infoboxes {{tl|chembox}} and {{tl|drugbox}}. Values in the infobox are compared with values that are reported in li

At the moment, we are verifying the [[CAS Registry number]] ('CASNo' in the {{tl|chembox}}, 'CAS_number' in the {{tl|drugbox}}), [[ChemSpider]]ID (ChemSpiderID), [[Unique Ingredient Identifier]] (UNII), [[InChI]], [[KEGG]], and [[ChEMBL]] by comparison with the data on http://commonchemistry.org (the CAS website), http://www.chemspider.com and http://fdasis.nlm.nih.gov/srs/srs.jsp (for the UNII), as well as with lists that were supplied to us (CAS number, ChemSpiderID, InChI, UNII, ChEMBL and ChEBI) or downloaded from these websites (KEGG, DrugBank). In the meantime, we are trying to add, update and/or check a number of other identifiers (InChI, InChIKey) by comparison with the [[ChemSpider]] website http://www.chemspider.com.

[[User:CheMoBot|CheMoBot]] is a bot that follows changes to these articles and is set up to update the infoboxes. When it detects changes to values, it changes parameters in the infobox accordingly. The template uses these parameters to show the status of the fields in the box.

Boxes whose verified values match the values in the verified revision are tagged with a {{tick|12}} at the bottom; boxes where some of these values have changed are tagged with a {{cross|12}}. Moreover, the individual identifiers are tagged with {{tick|12}} or {{cross|12}} as well. If the boxes contain changes to these verified fields, they are also categorised in [[:Category:Chemboxes which contain changes to verified fields]]. Boxes that contain changes to other important fields are categorised in [[:Category:Chemboxes which contain changes to watched fields]]. For an example, see [http://en.wikipedia.org/w/index.php?title=Sodium_laureth_sulfate&diff=prev&oldid=365742782 this vandalism], quickly [http://en.wikipedia.org/w/index.php?title=Sodium_laureth_sulfate&action=historysubmit&diff=365744219&oldid=365742868 flagged by CheMoBot].

If you encounter a page with a {{tl|chembox}} or {{tl|drugbox}} which shows a {{cross|12}}, then please check whether the current value is wrong (in which case it can simply be changed back to the value in the verified revision; the bot will do the rest) or whether there is a mistake in the verified revision (if so, the index may need an update; if you need help with that, please ask the appropriate wikiproject).
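The tick/cross logic described above can be sketched roughly as follows. This is a minimal illustration, not CheMoBot's actual code: the function names, the list of parameter names beyond 'CASNo', and the sample values are all assumptions made for the example.

```python
# Hypothetical sketch; none of these names come from CheMoBot itself.
# Parameter names other than 'CASNo' are assumed for illustration.
VERIFIED_FIELDS = ("CASNo", "ChemSpiderID", "UNII", "InChI", "KEGG", "ChEMBL")


def field_status(current, verified):
    """Map each verified field to 'correct' or 'changed'."""
    status = {}
    for name in VERIFIED_FIELDS:
        if name not in verified:
            continue  # no verified value to compare against
        same = current.get(name, "").strip() == verified[name].strip()
        status[name] = "correct" if same else "changed"
    return status


def box_mark(status):
    """Return 'tick' if no verified field changed, otherwise 'cross'."""
    return "cross" if "changed" in status.values() else "tick"


# Made-up values: the UNII string is a placeholder, not a real identifier.
verified = {"CASNo": "60-33-3", "UNII": "PLACEHOLDER"}
edited = {"CASNo": "60-33-4", "UNII": "PLACEHOLDER"}
status = field_status(edited, verified)
```

Here `status` marks 'CASNo' as changed, so `box_mark(status)` yields a cross for the whole box, while an unchanged box yields a tick.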
== Verification - tagging references ==

CheMoBot adds a template to a _Ref parameter (e.g. for CASNo, CASNo_Ref will be filled with {{tlx|cascite|correct|XXX}}) when the bot finds the field correct. The first parameter of the template is either 'correct' or 'changed', and the box will show a tick or a cross accordingly on CASNo. The second parameter is a field which contains a reference for ''where'' the parameter was verified. As we are at the moment verifying all fields against the CAS commonchemistry.org site, the bot replaces XXX with 'CAS' (i.e., {{tlx|cascite|correct|CAS}}). When using another place to verify the CASNo, please adapt this parameter accordingly. At the moment, we have 'NIST' and 'ESIS'. The bot will try to retain this field throughout. When there are significantly more verifications against places other than commonchemistry.org, I will instruct the bot to fill the field by default with {{tlx|cascite|correct|??}} or something similar.
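As a rough sketch of the tagging step, the snippet below rewrites a _Ref value while retaining whatever source is already recorded. The function name and the regular expression are assumptions for illustration; only the template name, the 'correct'/'changed' states and the 'CAS' default come from the description above.

```python
import re

# Matches an existing {{cascite|state|source}} value (assumed layout).
_CASCITE = re.compile(r"\{\{cascite\|(correct|changed)\|([^}|]*)\}\}")


def update_ref(old_value, is_correct, default_source="CAS"):
    """Return the new _Ref value, keeping the recorded source if present."""
    match = _CASCITE.search(old_value or "")
    source = match.group(2) if match and match.group(2) else default_source
    state = "correct" if is_correct else "changed"
    return "{{cascite|" + state + "|" + source + "}}"
```

For example, `update_ref("{{cascite|correct|NIST}}", False)` keeps 'NIST' as the source and only flips the state to 'changed'.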
==Method of work==

Our approach is to start by checking that the CAS registry number and the structure match the name. This will be used as a foundation upon which we can build a broader validation effort. Once we have the structure verified, we have the formula, and hence the molar mass, and we can also generate other machine representations such as SMILES, InChI and InChIKey.
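Before comparing a CAS number with a structure, a cheap sanity check is to test its built-in check digit. The sketch below is not part of the project's described workflow; it only catches mistyped numbers and says nothing about whether the number belongs to the compound. The function name is made up for the example.

```python
import re


def cas_check_digit_ok(cas):
    """Return True if the string is a well-formed CAS number with a valid check digit."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))
```

For 60-33-3 the weighted sum is 3·1 + 3·2 + 0·3 + 6·4 = 33, and 33 mod 10 = 3 matches the final digit.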
===First 1000===

After our [[Wikipedia:WikiProject Chemistry/IRC discussions|IRC meeting on January 13, 2009]], we used an Excel file to validate the first 1000 entries from the CAS XML file. This is available to project members [http://pluto.potsdam.edu/wikichem/index.php/File:CAS_Wikipedia_Combination_16Feb2009.xls here].
==The work==

We are now beginning to work through the list of "problem articles" found by User:Beetstra, and listed at '''[[User:Beetstra/CASFoundCorrect]]'''. A description of the process will be added soon.
===Notes===

*There are different CAS numbers for each form of a substance. For example, something simple like [[alanine]] will have one CAS# for the D form, another for L, another for "unspecified" and a fourth one for racemic. There would be another four CAS#s for the hydrochloride, four for the (1:1) sulfate, four for the (2:1) sulfate, etc. It is very important that we match the CAS# of the correct form to our Chemboxes!

*Be aware that CAS uses an unusual system for representing some formulae, which may seem "wrong" to us. For example, salts such as sodium nitrate are described as HNO<sub>3</sub>·Na, and organic salts follow a similar system. Do not use such formulae on WP; they are not "wrong", however, since they are merely a representation, not a formal structure. This also results in an incorrect MolarMass in the FW section of the SDF file for salts.

*For complex chiral structures, such as [[Bleomycin]], which may be drawn very differently in WP than in Common Chemistry, I found it best to assign R/S for each center and compare that way. (And yes, Farseer drew bleomycin perfectly!)

*The CAS No. in a Chembox will receive a green tick (check mark) once {{tl|cascite}} is added. This does not happen yet in the Drugbox (there is no change at present), but we hope to enable a similar system there too, if [[WP:PHARM]] is in agreement.
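To illustrate the molar mass point in the notes above, the sketch below sums atomic masses for the CAS-style representation of sodium nitrate and for the conventional formula. It is only a toy parser (no parentheses, four elements, rounded atomic masses), and it assumes that the discrepancy in the SDF file arises from taking the CAS-style formula literally.

```python
import re

# Rounded standard atomic masses, just enough for this example.
ATOMIC_MASS = {"H": 1.008, "N": 14.007, "O": 15.999, "Na": 22.990}


def molar_mass(formula):
    """Sum atomic masses for a simple formula; '·' separates components."""
    total = 0.0
    for part in formula.split("·"):
        for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", part):
            total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total


cas_style = molar_mass("HNO3·Na")   # CAS-style representation
conventional = molar_mass("NaNO3")  # formula used on WP
difference = cas_style - conventional
```

The two values differ by the mass of one hydrogen atom (about 1.008 g/mol), because the CAS-style representation counts the acid proton as well as the sodium.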
===Fields to check/upload===

;Chemboxes

Check structure, CAS no., Formula, MolarMass.

Notes:

1: The bot divides the fields into two sets, watched and unwatched. All changes are reported, but the watched fields are the ones we really want to take care of: they contain hard, verifiable data which is very unlikely to change (such as the boiling point of [[water (molecule)|water]], the CAS number of [[benzene]], or the number of carbons in [[glucose]]). N.B. the list of 'watched' fields may need to be updated.

2: The bot regards an empty field as 'unknown'. It will report changes to this field, but will assign a lower 'warning level' to it.

3: Things between <!-- and --> are 'comments'; they can be saved and appear in the editbox, but do not produce visible output.
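The three notes above could be combined along these lines. This is a hedged sketch only: the numeric levels, the set of watched fields, the decision to ignore comments when comparing, and the reading of "empty" as "the previous value was empty" are all assumptions, not documented bot behaviour.

```python
import re

# Assumed watched set; the real list is maintained elsewhere and may differ.
WATCHED = {"CASNo", "Formula", "MolarMass"}


def strip_comments(value):
    """Remove <!-- ... --> comments, which produce no visible output."""
    return re.sub(r"<!--.*?-->", "", value, flags=re.DOTALL).strip()


def warning_level(field, old, new):
    """Return a hypothetical warning level for a change (0 = no change)."""
    old, new = strip_comments(old), strip_comments(new)
    if old == new:
        return 0
    if old == "":
        return 1  # previously 'unknown': reported, but at a lower level
    return 3 if field in WATCHED else 2
```

With these made-up levels, a change to a filled-in watched field ranks highest, and filling in a previously empty field ranks lowest among reported changes.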
* When a 'better' version of a page comes up, change the number on the page. If there are two revids for the same page, the bot uses the one closest to the bottom of the index page (the page is parsed top to bottom, and later values replace earlier ones when duplicates occur).
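The "last entry wins" rule can be sketched as below. The index format shown here (one `Title|revid` pair per line) is purely hypothetical, as are the function name and the revision numbers; only the top-to-bottom replacement behaviour comes from the note above.

```python
def parse_index(text):
    """Map page titles to verified revids; later lines override earlier ones."""
    verified = {}
    for line in text.splitlines():
        title, sep, revid = line.strip().rpartition("|")
        if not sep or not revid.strip().isdigit():
            continue  # skip blank or malformed lines
        verified[title.strip()] = int(revid)
    return verified


# Made-up revids for illustration only.
index_text = """Benzene|100
Water|200
Benzene|300"""
index = parse_index(index_text)
```

Because the dictionary is filled in reading order, the second 'Benzene' line simply overwrites the first, which mirrors the behaviour described for duplicate revids.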
==The workers==

Please sign up to work on some of the articles listed at '''[[User:Beetstra/CASFoundCorrect]]'''. More information later.

*1-1000 [[User:Walkerma|Walkerma]] ([[User talk:Walkerma|talk]]) 22:48, 3 November 2009 (UTC)

*1001-2000 [[User:Ambix|Ambix]] ([[User talk:Ambix|talk]]) 17:57, 17 November 2009 (UTC)

*2001-3000

*3001-4000

*4001-5000

*5001-end
Revision as of 15:28, 9 September 2012
==The software==
==Problems found when validating the Excel file==
Please note any "to be checked" entries here.
===1-100===
===101-200===
===201-300===
- Kanamycin One chiral center seems to not match CAS. Are there multiple forms of this? Structure says Kanamycin A.
  - Yes, there are multiple forms (A, B, C, D, X) and several derivatives, but the difference is in the side chains. Fvasconcellos (t·c) 11:48, 10 February 2009 (UTC)
- Tocopherol One chiral center seems not to match, multiple forms? a-tocopherol, CAS just says tocopherol.
  - There are multiple isomers. File:RRR alpha-tocopherol.png shows the most common isomer. Tim Vickers (talk) 04:28, 10 September 2009 (UTC)
- Acetylcholine Parent ion, infobox not chembox.
- Linoleic_acid WP says cis, cis, CAS says trans trans 'linoelaidic acid', the whole world says linoleic acid is 60-33-3 including the spreadsheet and sigma.
  - 60-33-3 appears to refer to all-cis. Fvasconcellos (t·c) 11:52, 10 February 2009 (UTC)
    - This is very strange, it is trans,trans in the union file and cis,cis in the wikichem file (I have been using the union file to verify CAS numbers). I need to look into this. Ambix (talk) 12:47, 12 February 2009 (UTC)
- Glucose_1-phosphate One chiral center is not specified (should be up to match CAS). (probably a result of copying glucose skeleton, in which this atom is not chiral?).
  - See anomer. It is likely that both forms (alpha and beta-glucopyranoside) are described by this CAS number. --Tweenk (talk) 21:41, 15 November 2009 (UTC)
- Streptomycin 57-92-1 Seems to be mirror image of WP structure.
- Tubocurarine 57-94-3 and 57-95-4 Structure is messed up in union file. I can't make sense of it.
===301-400===
- Cephalosporin No chembox and other issues.
  - This is a class article, I don't think there should be a chembox. Fvasconcellos (t·c) 19:54, 24 January 2009 (UTC)
  - The CASRN refers to Cephalosporin C, for which we don't seem to have an article. Physchim62 (talk) 20:19, 25 January 2009 (UTC)
- Lactose CAS is for open chain aldehyde form, is this significant?
  - I don't think so, but we're checking. Physchim62 (talk) 22:35, 23 January 2009 (UTC)
- Methionine Chiral issues.
  - as per talk page Physchim62 (talk) 22:35, 23 January 2009 (UTC)
===401-500===
- Cholecalciferol: The structure diagram has one carbon atom with two wedge bonds attached, making verification difficult (the stereochemistry should be R here, and I think it is)
- Vitamin B12: The structure diagram does not adequately specify the stereochemistry of the Corrin ring
- Ellman's reagent: no chembox, and needs text cleanup
  - I added a chembox. -- Ed (Edgar181) 19:12, 11 February 2009 (UTC)
- Sanger's reagent: no chembox
- Asparagine: The structure diagram has one carbon atom with two wedge bonds attached, making verification difficult (the stereochemistry should be S here, and is)
- Histidine: Structure needs to show stereochemistry
- Medroxyprogesterone acetate: Redirects to Medroxyprogesterone
- Veratridine: still to be verified, the structure displays badly in ChemFileBrowser
- Sodium lactate: old-style chembox; note that CASRN is for unspecified stereochemistry
- Valine: The structure diagram has one carbon atom with two wedge bonds attached, making verification difficult (the stereochemistry should be S here, and is)
- Threonine: The structure diagram does not specify the stereochemistry at the two chiral centres (should be 2S,3R)
- Endrin: The structure diagram appears to show the endo-isomer whereas the CASRN is for the exo-isomer (or vice versa, I never was very good at this particular bit of nomenclature! in any case, it's not the same compound!) We should recheck with Dieldrin (CASRN [60-57-1]) as well. Neither compound has the stereochemistry correctly specified.
  - I've rechecked Dieldrin, adding the implicit hydrogens to the WP structure and drawing in chemsketch. I also copied the CAS structure exactly and had the program assign stereo labels. They match, which leads me to think my initial verify is OK. It should perhaps be noted that while the carbon skeletons look to be the same projection, WP is from above and CAS (turns out to be) from below. If you are still unhappy, could you describe your assignment in more detail? I'll try the chemsketch method with Endrin and hopefully we can compare notes. Ambix (talk) 23:27, 6 February 2009 (UTC)
  - I have checked Endrin with the same process and it does not match. There is an older version of this image, Endrin.png, and this does match. Given the difficulties of transposing a 3D structure to a more conventional form, it would probably be better to have a more conventional structure as well for compounds like this, but I would suggest we avoid removing 3D structures provided it is possible to validate them. I will investigate further.
  - I suggest that for our validated structure on such compounds, we should explicitly show the stereochemistry of each chiral centre, which is not the case at present on Endrin and Dieldrin (even if a knowledgeable chemist can figure out what it must be from the diagram). That doesn't necessarily mean changing the structures in the chemboxes (our images for inorganics don't always give a clear idea of the structure), but we should insist on the chembox information being correct and not misleading, and that the full details be available in the article (maybe in a separate image). Physchim62 (talk) 23:23, 9 February 2009 (UTC)
- Dichlorodiphenyldichloroethylene: short-form chembox
- Trypan blue: Structure diagram shows free acid whereas CASRN is for tetrasodium salt
- Isoleucine: The structure diagram does not specify the stereochemistry at the two chiral centres (should be 2S,3S)
- Ethambutol: The structure diagram does not specify the stereochemistry at the two chiral centres (should be 2S,2'S)
  - Done. Fvasconcellos (t·c) 23:42, 9 February 2009 (UTC)
- Arginine: Structure needs to show stereochemistry
- Ethylene: old-style chembox
- Missing articles: 3,5-Dimethylpyrazole, O-Methylhydroxylamine, Tetraethylammonium iodide, Ethyl 3-bromopyruvate, 1-Methyl-3-nitro-1-nitrosoguanidine, Mercaptosuccinic acid, p-Toluenesulfonamide, 4-Chlorobenzoic acid, N,N'-Diphenyl-1,4-phenylenediamine
- Ions: Acetate, Bicarbonate
===501-600===
- Trimethylaluminium is dimer, CAS is monomer. Is this significant, will CAS have a dimer listed?
- Camphor Both the WP page and the CAS are for unspecified stereoisomers; however, if we follow the naturally occurring rule, should the WP page be changed for the natural isomer and the unspecified CAS be relegated to an 'other'?
===601-700===
- Aprobarbital CAS 77-02-1 is unspecified.
- 2,3-Dimethylbutane redirects to Dimethylbutane; only 2,2 has an article. Generic name is now DAB, both 2,2 and 2,3 have articles.
===701-800===
===801-900===
===901-1000===
===Inorganics===
The 677 "inorganics" (neutral compounds without C–C or C–H bonds) have now all been checked. 496 entries gave a perfect match, 74 entries had some sort of problem in the article (often minor and already fixed) and 100 entries had no appropriate corresponding article on Wikipedia. A full report will be available in due course.
===Elements and ions===
These will require special treatment: please contact Physchim62 for more details.