|Original author(s)||William Gould|
|Stable release||17.0 / April 20, 2021|
|Operating system||Windows, macOS, Linux|
Stata is a general-purpose statistical software package developed by StataCorp for data manipulation, visualization, statistics, and automated reporting. It is used by researchers in many fields, including economics, sociology, political science, biomedicine, and epidemiology. StataCorp personnel pronounce Stata //.
Stata was initially developed by Computing Resource Center in California and the first version was released in 1985. In 1993, the company moved to College Station, TX and was renamed Stata Corporation, now known as StataCorp. A major release in 2003 included a new graphics system and dialog boxes for all commands. Since then, a new version has been released once every two years. The current version is Stata 17, released in April 2021.
Technical overview and terminology
Stata has always emphasized a command-line interface, which facilitates replicable analyses. Starting with version 8.0, however, Stata has included a graphical user interface based on the Qt framework, which uses menus and dialog boxes to give access to nearly all built-in commands. Each dialog box generates the equivalent command, which is always displayed, easing the transition to the command-line interface and the more flexible scripting language. The dataset can be viewed or edited in spreadsheet format. From version 11 on, other commands can be executed while the data browser or editor is open.
Data structure and storage
Until the release of version 16, Stata could only open a single dataset at any one time. Stata holds datasets in (random-access or virtual) memory, which limits its use with extremely large datasets. This is mitigated to some extent by efficient internal storage, as there are integer storage types which occupy only one or two bytes rather than four, and single-precision (4 bytes) rather than double-precision (8 bytes) is the default for floating-point numbers.
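The effect of the storage types described above can be seen in a small example. This is a minimal sketch: the variable names (flag, year, id, ratio, precise) are made up for illustration, and it assumes the default setting in which generate creates float variables.

```stata
clear
set obs 5

generate byte   flag    = 1       // 1-byte integer
generate int    year    = 2021    // 2-byte integer
generate long   id      = _n      // 4-byte integer
generate        ratio   = 1/3     // no type given: float (4 bytes) by default
generate double precise = 1/3     // double precision (8 bytes), requested explicitly

describe                          // the "storage type" column shows each variable's type

compress                          // demote variables to the smallest type that loses no information
describe                          // id now fits in a byte; year still needs an int
```

compress never changes the stored values, so it is generally safe to run before saving a large dataset.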
The dataset is always rectangular in format, that is, all variables hold the same number of observations (in more mathematical terms, all vectors have the same length, although some entries may be missing values).
Data format compatibility
Stata's proprietary file formats have changed over time, although not every Stata release includes a new dataset format. Every version of Stata can read all older dataset formats. It writes the current format with the save command and can write the most recent previous format with the saveold command. Thus, the current Stata release can always open datasets that were created with older versions, but older versions cannot read newer format datasets.
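Writing an older format might look like the following. This is a sketch rather than a definitive recipe: the file name auto_legacy.dta is hypothetical, and the version() option and the values it accepts depend on the Stata release in use (the example assumes a release whose saveold accepts version(13)).

```stata
sysuse auto, clear

// Write the dataset in the format used by Stata 13 (assumed to be supported
// by the running release; see "help saveold" for the accepted values)
saveold "auto_legacy.dta", version(13) replace

// A newer Stata reads the older format back without any special option
use "auto_legacy.dta", clear
describe, short
```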
Stata can read and write SAS XPORT format datasets natively. Older releases used the fdause and fdasave commands for this; more recent releases provide import sasxport5 and export sasxport5.
Stata allows user-written commands, distributed as so-called ado-files, to be downloaded directly from the internet; once installed, they are indistinguishable to the user from the built-in commands. In this respect, Stata combines the extensibility more often associated with open-source packages with features usually associated with commercial packages, such as software verification, technical support, and professional documentation. Some user-written commands have later been adopted by StataCorp to become part of a subsequent official release after appropriate checking, certification, and documentation.
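As an illustration of installing a user-written command, the sketch below uses the ssc command with the estout package from the SSC archive. The choice of package is only an example, and the commands assume a working internet connection.

```stata
ssc describe estout          // show the package description before installing
ssc install estout, replace  // download the ado-files and help files

which esttab                 // confirm where the newly installed command lives
sysuse auto, clear
regress mpg weight foreign
esttab                       // the community-contributed command is called like a built-in one
```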
The development of Stata began in 1984, initially by William (Bill) Gould and later by Sean Becketti. The software was originally intended to compete with statistical programs for personal computers such as SYSTAT and MicroTSP. Stata was written, then as now, in the C programming language, initially for PCs running the DOS operating system. The first version was released in 1985 with 44 commands.
There have been 17 major releases of Stata between 1985 and 2021, and additional code and documentation updates between major releases. In its early years, extra sets of Stata programs were sometimes sold as "kits" or distributed as Support Disks. With the release of Stata 6 in 1999, updates began to be delivered to users via the web.
Hundreds of commands have been added to Stata in its 36-year history. Certain developments have proved to be particularly important and continue to shape the user experience today, including extensibility, platform independence, and the active user community.
- Extensibility
- The program command was implemented in Stata 1.2, giving users the ability to add their own commands. Ado-files followed in Stata 2.1, allowing a user-written program to be automatically loaded into memory. Many user-written ado-files are submitted to the Statistical Software Components Archive (SSC) maintained by Christopher (Kit) Baum and hosted by Boston College. StataCorp added an ssc command to allow community-contributed programs to be installed directly within Stata.
- Platform independence
- The initial release of Stata was for the DOS operating system. Since then, versions of Stata have been released for systems running Unix variants (including Linux), Windows, and Macintosh. Stata files, including do-files and saved datasets, are platform-independent.
- User community
- A number of important developments were initiated by Stata's active user community. The Stata Technical Bulletin was introduced in 1991 and issued six times a year, helping to share community-contributed commands. It was relaunched in 2001 as the peer-reviewed Stata Journal, a quarterly publication containing descriptions of community-contributed commands and tips for the effective use of Stata. The Statalist listserver began in 1994 and transitioned to a web forum format in 2014. Stata Users Group meetings began in 1995. The aforementioned SSC Archive was launched in 1997.
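The extensibility described above rests on the program command. The following is a minimal sketch of a user-defined command; the name meanrange and its behavior are invented for illustration and are not part of Stata.

```stata
capture program drop meanrange        // allow the definition to be re-run
program define meanrange, rclass
    version 16
    syntax varname(numeric) [if] [in]
    marksample touse                  // marks the observations selected by if/in

    quietly summarize `varlist' if `touse'
    return scalar mean  = r(mean)
    return scalar range = r(max) - r(min)

    display as text "Mean of `varlist':  " as result return(mean)
    display as text "Range of `varlist': " as result return(range)
end

sysuse auto, clear
meanrange mpg                         // called like any built-in command
```

Saving the definition in a file named meanrange.ado in a directory on the ado-path would let Stata load it automatically the first time the command is used.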
There are four builds of Stata:
- Stata/MP: The fastest edition of Stata that can analyze the largest datasets, for quad-core, dual-core, and multicore/multiprocessor computers
- Stata/SE: Standard edition, for larger datasets
- Stata/BE: Basic edition, for mid-sized datasets (previously called Stata/IC)
- Numerics by Stata: Stata for embedded and web applications
Stata/MP can store 10 to 20 billion observations and up to 120,000 variables. Stata/SE and Stata/BE can each store up to 2.14 billion observations and handle 32,767 variables and 2,048 variables respectively. The maximum number of independent variables in a model is 65,532 variables in Stata/MP, 10,998 variables in Stata/SE, and 798 variables in Stata/BE.
The pricing and licensing of Stata depend on its intended use: business, government/nonprofit, education, or student. Single-user licenses are either renewable annually or perpetual. Other license types include a single license for use by concurrent users, a site license, a volume single-user license with bulk pricing, and a student lab license.
User Group meetings are held annually in the United States (the Stata Conference), the UK, Germany, and Italy, and less frequently in several other countries. Only the annual Stata Conference held in the United States is hosted by StataCorp LP. Local Stata distributors host User Group meetings in their own countries, but Stata developers frequently travel to and present at these meetings. Established under the Societies Act on 10 May 2008, the Singapore Stata Users Group (StataUGS) is the world's first government-approved users group (Registration No: 2048/2008; Unique Entity No: T08SS0091A). Its slogan is "Shaping Data Meaningfully". As a non-profit organisation, StataUGS does not organise regular meetings but provides programming and statistical advice to users in Singapore through informal means. The active members of StataUGS are mostly engaged in biomedical research.
The following commands cover simple data management.
    sysuse auto                  // Open the included auto dataset
    browse                       // Browse the dataset (opens the Data Editor window)
    describe                     // Describe the dataset and associated variables
    summarize                    // Summary information about numerical variables
    codebook make foreign        // Summary information about the make (string) and foreign (numeric) variables
    browse if missing(rep78)     // Browse only observations with missing data for variable rep78
    list make if missing(rep78)  // List makes of the cars with missing data for variable rep78
The next set of commands moves on to descriptive statistics.
    summarize price, detail          // Detailed summary statistics for variable price
    tabulate foreign                 // One-way frequency table for variable foreign
    tabulate rep78 foreign, row      // Two-way frequency table for variables rep78 and foreign
    summarize mpg if foreign == 1    // Summary information about mpg if the car is foreign (the "==" sign tests for equality)
    by foreign, sort: summarize mpg  // As above, but using the "by" prefix
    tabulate foreign, summarize(mpg) // As above, but using the tabulate command
A simple hypothesis test:
    ttest mpg, by(foreign)  // T-test for difference in means for domestic vs. foreign cars
Graphing data:
    twoway (scatter mpg weight)                     // Scatter plot showing relationship between mpg and weight
    twoway (scatter mpg weight), by(foreign, total) // Three graphs for domestic, foreign, and all cars
Linear regression:
    generate wtsq = weight^2                                                 // Create a new variable for weight squared
    regress mpg weight wtsq foreign, vce(robust)                             // Linear regression of mpg on weight, wtsq, and foreign
    predict mpghat                                                           // Create a new variable containing the predicted values of mpg
    twoway (scatter mpg weight) (line mpghat weight, sort), by(foreign)      // Graph data and fitted line