Data

From Wikipedia, the free encyclopedia
For data in computer science, see Data (computing). For other uses, see Data (disambiguation).

Data (/ˈdeɪtə/ DAY-tə or /ˈdætə/ DA-tə) is a set of values of qualitative or quantitative variables; restated, data are individual pieces of information. Data in computing (or data processing) are represented in a structure that is often tabular (rows and columns), a tree (a set of nodes with parent-child relationships), or a graph (a set of connected nodes). Data is typically the result of measurements and can be visualized using graphs or images.
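The three structures named above can be sketched in Python as follows. This is only an illustration: the column names, node labels and helper functions are assumptions made for the example, and the heights are approximate values used as sample data.

```python
# Tabular: each dict is a row, each key a column (names are illustrative).
table = [
    {"peak": "Everest", "height_m": 8849},
    {"peak": "K2", "height_m": 8611},
]

# Tree: every node holds a value and a list of child nodes.
tree = {"value": "Asia", "children": [
    {"value": "Himalaya", "children": [
        {"value": "Everest", "children": []},
    ]},
    {"value": "Karakoram", "children": [
        {"value": "K2", "children": []},
    ]},
]}

# Graph: each node maps to the set of nodes it is connected to.
graph = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}}


def column(rows, name):
    """Return every value stored in one column of the table."""
    return [row[name] for row in rows]


def leaf_values(node):
    """Collect the values of all nodes that have no children."""
    if not node["children"]:
        return [node["value"]]
    found = []
    for child in node["children"]:
        found.extend(leaf_values(child))
    return found


def neighbours(adjacency, node):
    """Return the nodes directly connected to the given node, sorted."""
    return sorted(adjacency[node])
```

The same measurements could be stored in any of the three shapes; which one is chosen depends on how the data will be addressed and traversed.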

Data as an abstract concept can be viewed as the lowest level of abstraction, from which information and then knowledge are derived.

Raw data, i.e., unprocessed data, refers to a collection of numbers or characters and is a relative term; data processing commonly occurs by stages, and the "processed data" from one stage may be considered the "raw data" of the next. Field data refers to raw data that is collected in an uncontrolled in situ environment. Experimental data refers to data that is generated within the context of a scientific investigation by observation and recording.

The word "data" used to be considered as the plural of "datum", but now is generally used in the singular, as a mass noun.

Meaning of data, information and knowledge

Data, information and knowledge are closely related terms, but each has its own role in relation to the others. Data are collected and analyzed to create information suitable for making decisions,[1] while knowledge is derived from extensive amounts of experience dealing with information on a subject. For example, the height of Mt. Everest is generally considered to be data. This data may be included in a book along with other data on Mt. Everest to describe the mountain in a manner useful for those who wish to make a decision about the best method to climb it. Using an understanding based on experience climbing mountains to advise persons on the way to reach Mt. Everest's peak may be seen as "knowledge".

That is to say, data is the least abstract, information the next least, and knowledge the most.[2] Data becomes information by interpretation; e.g., the height of Mt. Everest is generally considered as "data", a book on Mt. Everest's geological characteristics may be considered as "information", and a report containing practical information on the best way to reach Mt. Everest's peak may be considered as "knowledge".

'Information' bears a diversity of meanings that ranges from everyday to technical. Generally speaking, the concept of information is closely related to notions of constraint, communication, control, data, form, instruction, knowledge, meaning, mental stimulus, pattern, perception, and representation.

Beynon-Davies uses the concept of a sign to distinguish between data and information; data is a series of symbols, while information occurs when the symbols are used to refer to something.[3][4]

It is people and computers who collect data and impose patterns on it. These patterns are seen as information which can be used to enhance knowledge. These patterns can be interpreted as truth, and are authorized as aesthetic and ethical criteria. Events that leave behind perceivable physical or virtual remains can be traced back through data. Marks are no longer considered data once the link between the mark and observation is broken.[5]

Mechanical computing devices are classified according to the means by which they represent data. An analog computer represents a datum as a voltage, distance, position, or other physical quantity. A digital computer represents a piece of data as a sequence of symbols drawn from a fixed alphabet. The most common digital computers use a binary alphabet, that is, an alphabet of two characters, typically denoted "0" and "1". More familiar representations, such as numbers or letters, are then constructed from the binary alphabet.
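As a minimal sketch of how familiar representations are built from a binary alphabet, the Python below renders numbers and letters as strings of "0" and "1" and reads them back. The function names are hypothetical, and the choice of UTF-8 and an 8-symbol width are assumptions made for the example rather than anything the text prescribes.

```python
def to_bits(value: int, width: int = 8) -> str:
    """Render a non-negative integer as a fixed-width string of 0s and 1s."""
    if value < 0 or value >= 2 ** width:
        raise ValueError(f"{value} does not fit in {width} binary symbols")
    return format(value, f"0{width}b")


def text_to_bits(text: str) -> list[str]:
    """Encode text as UTF-8 and show each byte as eight binary symbols."""
    return [to_bits(byte) for byte in text.encode("utf-8")]


def bits_to_text(bits: list[str]) -> str:
    """Rebuild the original text from the binary symbol groups."""
    return bytes(int(group, 2) for group in bits).decode("utf-8")
```

For instance, `to_bits(5)` gives `"00000101"` and `text_to_bits("A")` gives `["01000001"]`; passing that list to `bits_to_text` returns `"A"` again.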

Some special forms of data are distinguished. A computer program is a collection of data, which can be interpreted as instructions. Most computer languages make a distinction between programs and the other data on which programs operate, but in some languages, notably Lisp and similar languages, programs are essentially indistinguishable from other data. It is also useful to distinguish metadata, that is, a description of other data. A similar yet earlier term for metadata is "ancillary data." The prototypical example of metadata is the library catalog, which is a description of the contents of books.
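The idea that a program is data that can be interpreted as instructions can be illustrated with a toy example. The sketch below is not Lisp; it assumes a made-up prefix notation in which a nested Python list such as `["+", 1, ["*", 2, 3]]` stands for a small arithmetic program, and the names `evaluate` and `count_operations` are hypothetical.

```python
import operator

# Supported operators for this toy notation (an assumption of the sketch).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(expr):
    """Interpret a nested list as a prefix-notation arithmetic program."""
    if isinstance(expr, (int, float)):
        return expr                      # a plain number evaluates to itself
    op, *args = expr
    values = [evaluate(arg) for arg in args]
    result = values[0]                   # assumes at least one argument
    for value in values[1:]:
        result = OPS[op](result, value)
    return result


def count_operations(expr):
    """Treat the same list purely as data and count its operator nodes."""
    if not isinstance(expr, list):
        return 0
    return 1 + sum(count_operations(arg) for arg in expr[1:])


program = ["+", 1, ["*", 2, 3]]
```

Here `evaluate(program)` runs the list as instructions and returns `7`, while `count_operations(program)` inspects the very same object as ordinary data and returns `2`.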

Data keys and values, structures and persistence

Keys in data provide the context for values. Regardless of the structure of data, there is always a key component present. Data keys in data and data-structures are essential for giving meaning to data values. Without a key that is directly or indirectly associated with a value, or collection of values in a structure, the values become meaningless and cease to be data. That is to say, there has to be at least a key component linked to a value component in order for it to be considered data. Data can be represented in computers in multiple ways, as per the following examples:

  • Computer main memory or RAM is arranged as an array of locations beginning at 0, and each location can store one unit of data, typically an 8-bit byte (the word size, usually 8, 16, 32 or 64 bits, depends on the CPU architecture). Therefore any value stored in a byte in RAM has a matching location expressed as an offset from the first memory location in the memory array, i.e., 0+n, where n is the offset into the array of memory locations.
  • Data keys need not be a direct hardware address in memory. Indirect, abstract and logical key codes can be stored in association with values to form a data structure. Data structures have predetermined offsets from the start of the structure, in which data values are stored. Therefore the data key consists of the key to the structure, plus the offset into the structure. When such a structure is repeated, storing variations of the data values and the data keys within the same repeating structure, the result can be considered to resemble a table, in which each element of the repeating structure is considered to be a column and each repetition of the structure is considered as a row of the table. In such an organization of data, the data key is usually a value in one of the columns (or a composite of the values in several of them).
  • The tabular view of repeating data structures is only one of many possibilities. Repeating data structures can be organised hierarchically, such that nodes are linked to each other in a cascade of parent-child relationships. Values and potentially more complex data-structures are linked to the nodes. Thus the nodal hierarchy provides the key for addressing the data structures associated with the nodes. This representation can be thought of as an inverted tree. Modern computer operating system file-systems are a common example.
  • Data has some inherent features when it is sorted on a key. All the values for subsets of the key appear together. When passing sequentially through the sorted data, the point at which the key, or a subset of the key, changes is referred to in data processing circles as a break, or a control break. It particularly facilitates aggregation of data values on subsets of a key.
  • Until the advent of non-volatile computer memories like USB sticks, persistent data storage was traditionally achieved by writing the data to external block devices like magnetic tape and disk drives. These devices typically seek to a location on the magnetic media and then read or write blocks of data of a predetermined size. In this case, the seek location on the media is the data key and the blocks are the data values. Early data file-systems, or disk operating systems, used to reserve contiguous blocks on the disk drive for data files. In those systems, the files could be filled up, running out of data space before all the data had been written to them. Thus much unused data space was reserved unproductively to avoid that situation. This was known as raw disk. Later file-systems introduced partitions. They reserved blocks of disk data space for partitions and used the allocated blocks more economically, by dynamically assigning blocks of a partition to a file as needed. To achieve this, the file-system had to keep track of which blocks were used or unused by data files in a catalog or file allocation table. Though this made better use of the disk data space, it resulted in fragmentation of files across the disk, and a concomitant performance overhead due to latency. Modern file systems reorganize fragmented files dynamically to optimize file access times. Further developments in file systems resulted in virtualization of disk drives, i.e., a logical drive can be defined as partitions from a number of physical drives.
  • Indexes are a way to copy out keys and location addresses from data structures in files, tables and data sets, then organize them using inverted tree structures to reduce the time taken to retrieve a subset of the original data. The most popular of these are the B-tree and the dynamic hash key indexing method. Indexing is yet another overhead for filing and retrieving data. There are other ways of organizing indexes, e.g., sorting the keys and using a binary search on them.
  • The advent of databases introduced a further layer of abstraction for persistent data storage. Databases use metadata, and a structured query language protocol between client and server systems, communicating over a network, using a two-phase commit logging system to ensure transactional completeness, when persisting data.
  • Modern scalable / high performance data persistence technologies rely on massively parallel distributed data processing across many commodity computers on a high bandwidth network. An example of one is Apache Hadoop.
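Two of the ideas in the list above, the control break on sorted data and an index built from sorted keys with a binary search, can be sketched in Python. This is a minimal illustration under stated assumptions: the sample rows, the composite key of region and city, and the function names are all hypothetical, and a sorted list stands in for the B-tree or hash index that real systems would more commonly use.

```python
import bisect
from operator import itemgetter

# Hypothetical records: each repetition of the structure is a row, and
# the first two columns (region, city) together form a composite key.
rows = [
    ("north", "Oslo", 12),
    ("south", "Rome", 7),
    ("north", "Oslo", 3),
    ("north", "Bergen", 5),
    ("south", "Naples", 4),
]


def control_break_totals(records):
    """Sum the value column per region, closing a total at each control break."""
    ordered = sorted(records, key=itemgetter(0, 1))
    totals = []
    current_region, running = None, 0
    for region, _city, amount in ordered:
        if current_region is not None and region != current_region:
            totals.append((current_region, running))  # the key changed: a break
            running = 0
        current_region = region
        running += amount
    if current_region is not None:
        totals.append((current_region, running))      # flush the final group
    return totals


def build_index(records):
    """Copy out (key, position) pairs and keep them sorted by key."""
    return sorted((record[:2], position) for position, record in enumerate(records))


def lookup(index, records, key):
    """Binary-search the sorted index and return every record with that key."""
    i = bisect.bisect_left(index, (key, -1))  # -1 sorts before any real position
    found = []
    while i < len(index) and index[i][0] == key:
        found.append(records[index[i][1]])
        i += 1
    return found
```

With the sample rows, `control_break_totals(rows)` yields `[("north", 20), ("south", 11)]`, and `lookup(build_index(rows), rows, ("north", "Oslo"))` returns the two Oslo records without scanning the whole list. The index is a second copy of the keys that has to be maintained whenever the rows change, which is the overhead the text mentions.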

Data in other fields

Though data is increasingly used in other fields, it has been suggested that the highly interpretive nature of those fields might be at odds with the ethos of data as "given". Peter Checkland introduced the term capta (from the Latin capere, “to take”) to distinguish between an immense number of possible data and a sub-set of them, to which attention is oriented.[6] Johanna Drucker has argued that since the humanities affirm knowledge production as "situated, partial, and constitutive," using data may introduce assumptions that are counterproductive, for example that phenomena are discrete or are observer-independent.[7] The term capta, which emphasizes the act of observation as constitutive, is offered as an alternative to data for visual representations in the humanities.

See also

References

This article is based on material taken from the Free On-line Dictionary of Computing prior to 1 November 2008 and incorporated under the "relicensing" terms of the GFDL, version 1.3 or later.

  1. "Joint Publication 2-0, Joint Intelligence". Defense Technical Information Center (DTIC). Department of Defense. 22 June 2007. pp. GL–11. Retrieved February 22, 2013.
  2. Akash Mitra (2011). "Classifying data for successful modeling".
  3. P. Beynon-Davies (2002). Information Systems: An introduction to informatics in organisations. Basingstoke, UK: Palgrave Macmillan. ISBN 0-333-96390-3.
  4. P. Beynon-Davies (2009). Business information systems. Basingstoke, UK: Palgrave. ISBN 978-0-230-20368-6.
  5. Sharon Daniel. The Database: An Aesthetics of Dignity.
  6. P. Checkland and S. Holwell (1998). Information, Systems, and Information Systems: Making Sense of the Field. Chichester, West Sussex: John Wiley & Sons. pp. 86–89. ISBN 0-471-95820-4.
  7. Johanna Drucker (2011). "Humanities Approaches to Graphical Display".

External links