Data munging, or data wrangling, is loosely defined as the process of manually converting or mapping data from one "raw" form into another format that allows for more convenient consumption of the data with the help of semi-automated tools. The converted data may then be used for further munging, data visualization, data aggregation, training a statistical model, and many other purposes. As a process, data munging typically follows a set of general steps: extracting the data in raw form from the data source, "munging" the raw data using algorithms (e.g. sorting) or parsing it into predefined data structures, and finally depositing the resulting content into a data sink for storage and future use. Given the rapid growth of the Internet, such techniques are expected to become increasingly important for organizing the growing amounts of data available.
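The three general steps described above (extract, munge, deposit) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the `Reading` record, the `station` and `depth_cm` field names, the sample `RAW` text, and the choice of JSON as the sink format are all hypothetical, and real munging work usually involves far messier input.

```python
import csv
import io
import json
from dataclasses import dataclass, asdict


@dataclass
class Reading:
    """Predefined structure each raw row is parsed into (hypothetical)."""
    station: str
    depth_cm: float


# Hypothetical raw input: stray whitespace, mixed casing, one missing value.
RAW = """station,depth_cm
 alpha , 12.5
BETA,7
gamma,
Alpha,3.25
"""


def extract(raw_text):
    """Step 1: pull rows out of the raw source as dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def munge(rows):
    """Step 2: parse rows into Reading records, skip unusable ones, sort."""
    readings = []
    for row in rows:
        station = (row.get("station") or "").strip().lower()
        depth = (row.get("depth_cm") or "").strip()
        if not station or not depth:
            continue  # drop rows with a missing field
        readings.append(Reading(station, float(depth)))
    return sorted(readings, key=lambda r: (r.station, r.depth_cm))


def deposit(readings, sink):
    """Step 3: write the cleaned records to a data sink as JSON."""
    json.dump([asdict(r) for r in readings], sink)


# An in-memory buffer stands in for a file or database sink.
sink = io.StringIO()
cleaned = munge(extract(RAW))
deposit(cleaned, sink)
```

In this sketch the sink is an in-memory buffer; passing an open file object to `deposit` instead would store the result on disk. Dropping incomplete rows is only one possible policy, and a real pipeline might flag or repair them instead.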
A data wrangler is the person performing the wrangling. In the scientific research context, the term often refers to a person responsible for gathering and organizing disparate data sets collected by many different investigators, often as part of a field campaign. In this sense, the term could be credited to Donald Cline during the NASA/NOAA Cold Land Processes Experiment. The role covers duties typically handled by a storage administrator working with large amounts of data. It arises in areas such as major research projects and the making of films with a large amount of complex computer-generated imagery. In research, it involves both data transfer from a research instrument to a storage grid or storage facility and data manipulation for re-analysis via high-performance computing instruments or access via cyberinfrastructure-based digital libraries.
The non-technical term "wrangler" is often said to derive from work done by the United States Library of Congress's National Digital Information Infrastructure and Preservation Program (NDIIPP) and its program partner, the MetaArchive Partnership based at the Emory University Libraries. The term "mung" has roots in munging as described in the Jargon File. The term "data wrangler" has also been suggested as the best analogy to "coder": what a coder is to code, a data wrangler is to data.
On a film or television production that uses digital cameras that are not tape-based, a data wrangler is employed to manage the transfer of data from a camera to a computer, a hard drive, or both.
- Data cleansing, correcting errors in a corpus of data.
- Data editing, correcting errors in a corpus of data.
- Data scraping, extracting parts of a corpus of data with automated tools.
- Data curation, a more general and abstract activity
- Data pre-processing, a step of cleaning data in data mining for analysis purposes
- Data fusion and data integration
- Semantic mapping (data integration)
- Simultaneous editing, efficient repeated editing of text in a multiple selection through direct manipulation.
- Extract, transform, load
- What Is Data Munging?
- The Guardian: Internet data heads for 500bn gigabytes
- Parsons, MA, MJ Brodzik, and NJ Rutter. 2004. Data management for the cold land processes experiment: improving hydrological science. HYDROL PROCESS. 18:3637-653. http://onlinelibrary.wiley.com/doi/10.1002/hyp.5801/abstract
- Jargon File entry for Mung
- Open Knowledge Foundation Blog Post