Talk:Information extraction

WikiProject Linguistics / Applied Linguistics (Rated Start-class)
This article is within the scope of WikiProject Linguistics, a collaborative effort to improve the coverage of linguistics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as Start-Class on the project's quality scale.
This article has not yet received a rating on the project's importance scale.
This article is supported by Applied Linguistics Task Force.

Information Extraction

Data extraction is the proper industry word, as it goes well with data mining, data farms, etc. Information Extraction is a bit vague. Removing the merge suggestion. —Preceding unsigned comment added by Khazakistyle (talk • contribs) 18:12, 12 August 2009 (UTC)

I wouldn't agree. Information extraction is a well-established term in both research and industry. Data (what data?) extraction could also refer to multimedia data extraction. Information extraction is bound to textual data extraction. George1975 (talk) 17:48, 2 December 2009 (UTC)
I would rather say it is bound to output structured information, while data extraction is bound to output data. What you extract from is not really the relevant focus. In DBpedia, we always say that we extract information (in this case RDF) from a semi-structured source, i.e. Wiki Markup. What I'm not sure about is whether data is a subset of information or the other way round (it depends on how you define it: data as low-level information, or information as a special case of data). I would prefer to merge both under information extraction. SebastianHellmann (talk) 13:10, 10 January 2010 (UTC)
I would keep the distinction between data extraction and information extraction, sticking to the idea that the output of information extraction is structured data and not "raw data". However, I would not say that IE is limited to text. IE naturally has its roots in text processing and natural language processing, but approaches that automatically tag pictures or annotate multimedia content could be seen as IE. This being said, I will try to restructure the article (which is somehow funny... well, for me at least). G.Dupont (talk) 12:46, 13 January 2011 (UTC)
Data extraction is very specific, and while it overlaps with information extraction, they are not the same thing. Consider for example the focused remit of the DEOS workshop series - dedicated to getting specific kinds of data from existing embeddings, which while often linguistic, can also have semantics based on website markup or printed document layout. Determining numbers that have been visually presented in tables is not an information extraction task. Similarly, temporal relation extraction and most document anaphora resolution is IE, but not data extraction. Leondz (talk) 19:38, 4 December 2013 (UTC)

Trifeed? Advert?

Should the link really be there? 05:41, 11 May 2007 (UTC)

Really a type of information retrieval?

I don't think that information extraction is a type of information retrieval, is it? Aren't they two separate concepts? In text mining, one would retrieve unstructured text using search filters or querying databases (this is IR), and then the information would be structured (this bit being the IE)... 10:53, 4 January 2007 (UTC)

Let's say you have a collection of documents or web pages and you want to extract information from all of them. Then you would have IE without IR. Normally it is even the other way round, as in DBpedia: first IE, then IR (see [1]), as you can query the extracted information. SebastianHellmann (talk) 13:15, 10 January 2010 (UTC)
Another one: Google certainly does information retrieval, but I'm not sure if they do information extraction. Calculating tf-idf, PageRank and LSA or other statistical measures cannot be considered information extraction. SebastianHellmann (talk) 13:33, 10 January 2010 (UTC)

Semistructured information can't be the source?

For instance, the web is semistructured. The web can be a source for IE, can't it? [2] [3] Vsatayamas 09:07, 7 January 2007 (UTC)

It can and it is. Again, the input is not the discriminating factor IMHO; the output (structured information) is. G.Dupont (talk) 12:48, 13 January 2011 (UTC)

Information extraction and the Web

As an answer to the previous question, semistructured text can indeed be the source. IE should be connected to "Web", "wrappers" and eventually to "machine learning" and "wrapper induction" concepts. I made an initial contribution. Please discuss/contribute. —Preceding unsigned comment added by George1975 (talk • contribs) 14:20, 12 February 2009 (UTC)

Vs. Concept mining

How are these different? Should these topics be merged? ---- CharlesGillingham (talk) 22:40, 10 December 2007 (UTC)

Cleaning up a bit

I recently tried to clean this page up a bit as part of a broader effort to improve the quality of Wikipedia entries related to information retrieval (see my contributions for more examples of my individual efforts towards this end). Specifically, I hope we can collectively make an effort to keep this accurate and spam-free. Please feel free to contact me directly at dtunkelang at gmail dot com if you'd like to be part of this effort, which I've been rallying via my blog, The Noisy Channel.

Dtunkelang (talk) 17:05, 26 October 2008 (UTC)


Can whoever keeps linking to ECHELON here stop? This is an entry about information extraction. While ECHELON may be applying information extraction, so are thousands of other projects. This is off topic, and seems motivated by activism, however well intended. Dtunkelang (talk) 02:54, 13 January 2009 (UTC)

Free or Open Source Information Extraction Software

I believe this section should remain in the article and not be removed. There are free tools or services for IE besides GATE, such as Mallet, OpenCalais or CRF++, that should be mentioned here. This is not far off topic. Of course, commercial tools or services should be removed. —Preceding unsigned comment added by George1975 (talk • contribs) 14:59, 4 October 2009 (UTC)

None are notable. --Ronz (talk) 18:02, 4 October 2009 (UTC)
May I suggest a brief mention of what kinds of approaches are commonly used in that case? Then on the pages of the approaches it would certainly be a good idea to link in tools such as Mallet. Do you agree that this may be a good idea? --Dront (talk) —Preceding undated comment added 18:31, 4 October 2009 (UTC).
OpenCalais is certainly notable--I just attended a conference (the Transparent Text symposium) that wasn't even about information extraction and almost all of the presenters cited OpenCalais as the tool they were using. I undid Ronz's edit. If you want to delete the other tools (Mallet, CRF), I won't fight over it. Dtunkelang (talk) 22:03, 8 October 2009 (UTC)
I've reverted it. It's not notable in that it has no article of its own, nor has an independent, reliable source been provided that demonstrates notability. Alternatively, WP:WTAF, providing such sources in the new article. Further, the edit includes linkspam (see WP:EL). --Ronz (talk) 22:21, 8 October 2009 (UTC)
Just to clarify, it doesn't make a difference if the tools/services are commercial or free. --Ronz (talk) 22:23, 8 October 2009 (UTC)
There is a Wikipedia entry for ClearForest, which post-acquisition by Reuters became Open Calais. The entry could use some maintenance, to be sure. I don't have a horse in this race--if anything, some might make the case that Open Calais is competitive to my employer. But I am expert enough on this topic to assure you that Open Calais is the most notable information extraction tool. I'll leave the others out, but will revert that one back in, at least on the grounds that it has a Wikipedia entry. Dtunkelang (talk) 18:54, 12 October 2009 (UTC)
Regarding Dront's comment beginning, "May I suggest a brief mention...," I don't understand what he's referring to, so cannot answer any of his questions. --Ronz (talk) 22:21, 8 October 2009 (UTC)
I tried to add some (e.g. CRF++) under External Links, but they have been removed as spam links. An external link to an open source project that is widely used in information extraction, and that is hosted on SourceForge, is neither a spam link nor an advertisement. CRF++ and Mallet are very notable tools in IE, even if they do not -yet- have a wiki entry of their own. Please, would you be so kind as to consider more carefully in the future before marking an external link as spam? George1975 (talk) 13:04, 2 December 2009 (UTC)

DBpedia is a good example. I will not add it myself because of WP:COI. SebastianHellmann (talk) 13:19, 10 January 2010 (UTC)

Machine Learning and Information Extraction

A section regarding the use of machine learning for information extraction should be added soon. I'll try to make some contributions. It could also be a separate Wiki article. George1975 (talk) 10:29, 2 December 2009 (UTC)

Removal of external links

This is really disappointing. I have been trying to contribute to this article for a long time now and I believe I am an expert in this field, but most of my contributions have been discarded by a single editor. I do not know this editor's expertise, but his arguments (off-topic, promotional, spam, etc.) are far from real. The PASCAL challenge is an important point of reference in recent publications on information extraction. It is neither off-topic nor promotional, and its source is reliable. CRF++ is a well-known tool (not yet "notable", which is why I tried to add it as an external link) used in many information extraction projects. Where did you see the off-topic, promotional, spam, etc.? I am not sure whether it is in the best interest of Wikipedia to discourage authors from contributing. The current article for IE is really not well written and needs significant improvements. George1975 (talk) 17:35, 2 December 2009 (UTC)

WP:ELN is an appropriate location for getting others' perspective on the appropriateness of external links. --Ronz (talk) 17:50, 2 December 2009 (UTC)

Whether a software tool, project or whatever else is important for a topic is not only a matter of whether its link meets some subjective requirements. I have been trying -as an "insider" to the topic- to update the article with information that I really find useful, without intending to promote anybody or to attract spam. I accepted your argument about notability. However, your arguments about off-topic, spam, etc. in the external links I added can be considered subjective. I am not willing to fight a lot more on this. If Wikipedia does not need my contribution, then I probably won't come back. George1975 (talk) 18:06, 2 December 2009 (UTC)

I've gone ahead and removed it, per WP:ELNO #1, 4, & 13. Given that the article has no references at all, I prefer to keep the external links section short and specific to the topic. In order to improve this article, we editors need to expand and verify the actual article, not add more external links. --Ronz (talk) 19:34, 2 December 2009 (UTC)

The wiki article about Conditional Random Fields contains a list of external links, including CRF++ and Mallet (before the latter becomes a wiki entry). So it seems that the same external links are appropriate for one article (conditional random fields), while inappropriate for another (information extraction). This definitely creates an inconsistency. George1975 (talk) 13:22, 20 January 2010 (UTC)

I would prefer to see a fledgling article improved rather than pruned. On the subject of CRF++, Google Scholar reports close to 50 citations in the academic literature, of which the main bundle contains 35 papers. There seems to be a fair literature to cite here, so we should be able to keep both sides happy with a little more work. — MaxEnt 10:53, 27 March 2010 (UTC)

Another clean-up

I took the liberty of making a fairly aggressive clean-up. Boggles my mind that this has been proposed for merging with data extraction, so I removed that. I doubt the article on fractionating petroleum includes discarding the vast quantity of water and sand pumped up with the crude oil in most oil fields. Data extraction is about discarding chaff. IE is about cracking for value.

I was specifically trying to make the lead more accessible to a non-specialist. I have a fair amount of background with NLP, almost none with IE, so I had to content myself with gluing existing material together in a different order.

Since I don't know the IE literature as such, I wasn't able to supply the references which this article badly needs, other than where I drew attention to IE as somewhat of a stop-gap measure in the era of the distinctly non-semantic web.

There's much left to be done. — MaxEnt 09:49, 27 March 2010 (UTC)

Broken link

The document Peggy M. Andersen et al. "Automatic Extraction of Facts from Press Releases to Generate News Stories" is not found at It may be found at or ttp:// Ronbarak (talk) 08:40, 25 October 2011 (UTC)

This is not an English sentence

A word is missing in the following sentence:

In the previous example, it will be to say that the sentence

Ronbarak (talk) 12:24, 23 January 2012 (UTC)

Good point. I'll try to fix it... Jojalozzo 16:01, 23 January 2012 (UTC)