Git

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by 99.98.34.87 at 18:37, 29 November 2011. It may differ significantly from the current revision.

Git
Original author(s): Linus Torvalds
Developer(s): Junio Hamano, Linus Torvalds, and many others
Initial release: April 7, 2005
Written in: C, Bourne Shell, Perl[1]
Operating system: POSIX, Windows
Type: Revision control
License: GNU General Public License v2
Website: git-scm.com
gitweb, a web interface for git

Git (/ɡɪt/) is a distributed revision control system with an emphasis on speed.[2] Git was initially designed and developed by Linus Torvalds for Linux kernel development. Every Git working directory is a full-fledged repository with complete history and full revision tracking capabilities, not dependent on network access or a central server. Git's current software maintenance is overseen by Junio Hamano. Git is free software distributed under the terms of the GNU General Public License version 2.

Name

Linus Torvalds has quipped about the name "git", which is British English slang for a stupid or unpleasant person:[3] "I'm an egotistical bastard, and I name all my projects after myself. First Linux, now git."[4][5] (Note that Torvalds did not in fact name Linux.[6])

Early history

Git development began after many Linux kernel developers chose to give up access to the proprietary BitKeeper system.[7] (The copyright holder of BitKeeper, Larry McVoy, withdrew free use of the product after he claimed that Andrew Tridgell had reverse-engineered the BitKeeper protocols.)

Torvalds wanted a distributed system that he could use like BitKeeper, but none of the available free systems met his needs, particularly for performance. In an email written on April 7, 2005, while working on the first prototype, he explained:[8]

However, the SCMs I've looked at make this hard. One of the things (the main thing, in fact) I've been working at is to make that process really efficient. If it takes half a minute to apply a patch and remember the changeset boundary etc. (and quite frankly, that's fast for most SCMs around for a project the size of Linux), then a series of 250 emails (which is not unheard of at all when I sync with Andrew, for example) takes two hours. If one of the patches in the middle doesn't apply, things are bad bad bad.

Now, BK wasn't a speed demon either (actually, compared to everything else, BK is a speed deamon [sic], often by one or two orders of magnitude), and took about 10–15 seconds per email when I merged with Andrew. HOWEVER, with BK that wasn't as big of an issue, since the BK<->BK merges were so easy, so I never had the slow email merges with any of the other main developers. So a patch-application-based SCM "merger" actually would need to be faster than BK is. Which is really really really hard.

So I'm writing some scripts to try to track things a whole lot faster. Initial indications are that I should be able to do it almost as quickly as I can just apply the patch, but quite frankly, I'm at most half done, and if I hit a snag maybe that's not true at all. Anyway, the reason I can do it quickly is that my scripts will not be an SCM, they'll be a very specific "log Linus' state" kind of thing. That will make the linear patch merge a lot more time-efficient, and thus possible.

(If a patch apply takes three seconds, even a big series of patches is not a problem: if I get notified within a minute or two that it failed half-way, that's fine, I can then just fix it up manually. That's why latency is critical—if I'd have to do things effectively "offline", I'd by definition not be able to fix it up when problems happen).

Torvalds had several design criteria:

  1. Take CVS as an example of what not to do; if in doubt, make the exact opposite decision. To quote Torvalds, speaking somewhat tongue-in-cheek:

    For the first 10 years of kernel maintenance, we literally used tarballs and patches, which is a much superior source control management system than CVS is, but I did end up using CVS for 7 years at a commercial company [Transmeta[9]] and I hate it with a passion. When I say I hate CVS with a passion, I have to also say that if there are any SVN (Subversion) users in the audience, you might want to leave. Because my hatred of CVS has meant that I see Subversion as being the most pointless project ever started. The slogan of Subversion for a while was "CVS done right", or something like that, and if you start with that kind of slogan, there's nowhere you can go. There is no way to do CVS right.[10]

  2. Support a distributed, BitKeeper-like workflow:

    BitKeeper was not only the first source control system that I ever felt was worth using at all, it was also the source control system that taught me why there's a point to them, and how you actually can do things. So Git in many ways, even though from a technical angle it is very very different from BitKeeper (which was another design goal, because I wanted to make it clear that it wasn't a BitKeeper clone), a lot of the flows we use with Git come directly from the flows we learned from BitKeeper.[10]

  3. Very strong safeguards against corruption, either accidental or malicious[10][11]
  4. Very high performance

The first three criteria eliminated every pre-existing version control system except for Monotone, and the fourth excluded everything.[10] So, immediately after the 2.6.12-rc2 Linux kernel development release,[10] he set out to write his own.[10]

The development of Git began on April 3, 2005.[12] The project was announced on April 6,[13] and became self-hosting as of April 7.[12] The first merge of multiple branches was done on April 18.[14] Torvalds achieved his performance goals; on April 29, the nascent Git was benchmarked recording patches to the Linux kernel tree at the rate of 6.7 per second.[15] On June 16, the kernel 2.6.12 release was managed by Git.[16]

While strongly influenced by BitKeeper, Torvalds deliberately attempted to avoid conventional approaches, leading to a unique design.[17] He developed the system until it was usable by technical users, then turned over maintenance on July 26, 2005 to Junio Hamano, a major contributor to the project.[18] Hamano was responsible for the 1.0 release on December 21, 2005,[19] and remains the project's maintainer.

Design

Git's design was inspired by BitKeeper and Monotone.[20][21] Git was originally designed as a low-level version control system engine on top of which others could write front ends, such as Cogito or StGIT.[21] However, the core Git project has since become a complete revision control system that is usable directly.[22]

Characteristics

Git's design is a synthesis of Torvalds's experience with Linux in maintaining a large distributed development project, along with his intimate knowledge of file system performance gained from the same project and the urgent need to produce a working system in short order. These influences led to the following implementation choices:

Strong support for non-linear development
Git supports rapid branching and merging, and includes specific tools for visualizing and navigating a non-linear development history. A core assumption in Git is that a change will be merged more often than it is written, as it is passed around various reviewers. Branches in Git are very lightweight: a branch is only a reference to a single commit. By following that commit's parent commits, the full branch structure can be reconstructed.
Distributed development
Like Darcs, BitKeeper, Mercurial, SVK, Bazaar and Monotone, Git gives each developer a local copy of the entire development history, and changes are copied from one such repository to another. These changes are imported as additional development branches, and can be merged in the same way as a locally developed branch.
Compatibility with existing systems/protocols
Repositories can be published via HTTP, FTP, rsync, or a Git protocol over either a plain socket or SSH. Git also has a CVS server emulation, which enables the use of existing CVS clients and IDE plugins to access Git repositories. Subversion and SVK repositories can be used directly with git-svn.
Efficient handling of large projects
Torvalds has described Git as being very fast and scalable,[23] and performance tests done by Mozilla showed it to be an order of magnitude faster than some other revision control systems. Fetching revision history from a locally stored repository can be one hundred times faster than fetching it from a remote server.[24][25] In particular, Git does not get slower as the project history grows larger.[26]
Cryptographic authentication of history
The Git history is stored in such a way that the name of a particular revision (a "commit" in Git terms) depends upon the complete development history leading up to that commit. Once it is published, it is not possible to change the old versions without it being noticed. The structure is similar to a hash tree, but with additional data at the nodes as well as the leaves.[27] (Mercurial and Monotone also have this property.)
Toolkit-based design
Git was designed as a set of programs written in C, and a number of shell scripts that provide wrappers around those programs.[28] Although most of those scripts have since been rewritten in C for speed and portability, the design remains, and it is easy to chain the components together.[29]
Pluggable merge strategies
As part of its toolkit design, Git has a well-defined model of an incomplete merge, and it has multiple algorithms for completing it. If none of them succeeds, Git tells the user that it is unable to complete the merge automatically and that manual editing is required.
Garbage accumulates unless collected
Aborting operations or backing out changes will leave useless dangling objects in the database. These are generally a small fraction of the continuously growing history of wanted objects. Git will automatically perform garbage collection when enough loose objects have been created in the repository. Garbage collection can be called explicitly using git gc --prune.[30]
Periodic explicit object packing
Git stores each newly created object as a separate file. Although each object is individually compressed, this takes a great deal of space and is inefficient. This is solved by the use of "packs" that store a large number of objects in a single file (or network byte stream), delta-compressed among themselves. Packs are compressed using the heuristic that files with the same name are probably similar, but they do not depend on it for correctness. Newly created objects (newly added history) are still stored singly, and periodic repacking is required to maintain space efficiency. The process of packing the repository can be very computationally expensive. By allowing objects to exist in the repository in a loose but quickly generated format, Git allows the expensive pack operation to be deferred until later, when time does not matter (e.g. the end of the work day). Git does periodic repacking automatically, but manual repacking is also possible with the git gc command.
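
The property described under "Cryptographic authentication of history" can be illustrated with a minimal Python sketch. It assumes a deliberately simplified commit payload (a tree name, parent names and a message); real Git commits also carry author, committer and timestamp fields and a type header, so the function names and payload layout here are hypothetical. The point is only that each commit name is a hash over its parents' names, so altering an old snapshot changes the name of every later commit.

```python
import hashlib
from typing import List, Tuple


def commit_id(tree_id: str, parent_ids: List[str], message: str) -> str:
    """Name a commit by hashing its snapshot name, its parents' names and its message.

    Simplified, hypothetical payload; not Git's exact commit format.
    """
    lines = ["tree " + tree_id]
    lines += ["parent " + p for p in parent_ids]
    lines += ["", message]
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()


def build_chain(snapshots: List[Tuple[str, str]]) -> List[str]:
    """Return commit names for a linear history of (tree_id, message) pairs."""
    ids: List[str] = []
    for tree_id, message in snapshots:
        parents = ids[-1:]  # empty list for the root commit
        ids.append(commit_id(tree_id, parents, message))
    return ids


history = [
    ("tree-a", "initial import"),
    ("tree-b", "fix bug"),
    ("tree-c", "add feature"),
]
original = build_chain(history)

# Tamper with the oldest snapshot only; later snapshots are untouched.
tampered = build_chain([("tree-x", "initial import")] + history[1:])

# True for every position: the change propagates to all descendant names.
changed = [a != b for a, b in zip(original, tampered)]
```

Because the newest commit name summarizes everything before it, comparing a single name is enough to notice that older history was rewritten.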

Another property of Git is that it snapshots directory trees of files. The earliest systems for tracking versions of source code, SCCS and RCS, worked on individual files and emphasized the space savings to be gained from storing the (mostly similar) versions as interleaved deltas (SCCS) or delta encodings (RCS). Later revision control systems maintained this notion of a file having an identity across multiple revisions of a project. However, Torvalds rejected this concept.[31] Consequently, Git does not explicitly record file revision relationships at any level below the source code tree.

The absence of explicit revision relationships has some significant consequences:

  • It is slightly more expensive to examine the change history of a single file than the whole project.[32] To obtain a history of changes affecting a given file, Git must walk the global history and then determine whether each change modified that file. This method of examining history does, however, let Git produce with equal efficiency a single history showing the changes to an arbitrary set of files. For example, a subdirectory of the source tree plus an associated global header file is a very common case.
  • Renames are handled implicitly rather than explicitly. A common complaint with CVS is that it uses the name of a file to identify its revision history, so moving or renaming a file is not possible without either interrupting its history, or renaming the history and thereby making the history inaccurate. Most post-CVS revision control systems solve this by giving a file a unique long-lived name (a sort of inode number) that survives renaming. Git does not record such an identifier, and this is claimed as an advantage.[33][34] Source code files are sometimes split or merged as well as simply renamed,[35] and recording this as a simple rename would freeze an inaccurate description of what happened in the (immutable) history. Git addresses the issue by detecting renames while browsing the history of snapshots rather than recording them when making the snapshot.[36] (Briefly, given a file in revision N, a file of the same name in revision N−1 is its default ancestor. However, when there is no like-named file in revision N−1, Git searches for a file that existed only in revision N−1 and is very similar to the new file.) This approach does, however, require more CPU-intensive work every time history is reviewed, and it comes with a number of options for adjusting the heuristics.
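
The rename heuristic summarized in parentheses above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Git's actual algorithm: the function name find_ancestor, the use of difflib as the similarity measure, and the 0.5 threshold are all choices made for this example.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional

SIMILARITY_THRESHOLD = 0.5  # assumed cut-off for this sketch


def find_ancestor(
    path: str,
    new_content: str,
    prev: Dict[str, str],
    curr: Dict[str, str],
) -> Optional[str]:
    """Guess which file in the previous revision a file descends from.

    prev and curr map paths to file contents for revisions N-1 and N.
    """
    # Default ancestor: a file of the same name in revision N-1.
    if path in prev:
        return path
    # Otherwise consider only files that existed in N-1 but not in N,
    # and keep the most similar one if it is similar enough.
    best_path, best_score = None, SIMILARITY_THRESHOLD
    for old_path, old_content in prev.items():
        if old_path in curr:
            continue
        score = SequenceMatcher(None, old_content, new_content).ratio()
        if score >= best_score:
            best_path, best_score = old_path, score
    return best_path


add_fn = "int add(int a, int b) { return a + b; }\n"
sub_fn = "int sub(int a, int b) { return a - b; }\n"
main_fn = "int main(void) { return 0; }\n"

prev = {"util.c": add_fn, "main.c": main_fn}
curr = {"math.c": add_fn + sub_fn, "main.c": main_fn, "NOTES": "TODO: WRITE DOCS\n"}

renamed_from = find_ancestor("math.c", curr["math.c"], prev, curr)
unchanged_name = find_ancestor("main.c", curr["main.c"], prev, curr)
brand_new = find_ancestor("NOTES", curr["NOTES"], prev, curr)
```

Here math.c is traced back to util.c even though it was edited during the move, while NOTES has no sufficiently similar predecessor and is treated as a new file. The similarity comparison is the CPU-intensive part that has to be repeated whenever history is reviewed.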

Git implements several merging strategies; a non-default strategy can be selected at merge time:[37]

  • resolve: the traditional three-way merge algorithm.
  • recursive: This is the default when pulling or merging one branch, and is a variant of the three-way merge algorithm. "When there are more than one common ancestors that can be used for three-way merge, it creates a merged tree of the common ancestors and uses that as the reference tree for the three-way merge. This has been reported to result in fewer merge conflicts without causing mis-merges by tests done on actual merge commits taken from Linux 2.6 kernel development history. Additionally this can detect and handle merges involving renames."[38]
  • octopus: This is the default when merging more than two heads.
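
The three-way merge underlying the resolve and recursive strategies can be sketched as follows. This is a simplified path-level illustration: each tree is modelled as a hypothetical mapping from path to a content identifier, whereas Git also merges the contents of individual files, and the recursive strategy first builds a merged tree when there are several common ancestors.

```python
from typing import Dict, Set, Tuple

Tree = Dict[str, str]  # path -> content identifier (hypothetical model)


def three_way_merge(base: Tree, ours: Tree, theirs: Tree) -> Tuple[Tree, Set[str]]:
    """Merge two trees relative to their common ancestor.

    Returns the merged tree and the set of conflicting paths.
    """
    merged: Tree = {}
    conflicts: Set[str] = set()
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o  # both sides agree (including both deleting the path)
        elif o == b:
            result = t  # only their side changed
        elif t == b:
            result = o  # only our side changed
        else:
            conflicts.add(path)  # both sides changed it differently
            continue
        if result is not None:  # None means the path was deleted
            merged[path] = result
    return merged, conflicts


base = {"a": "1", "b": "1", "c": "1"}
ours = {"a": "2", "b": "1", "c": "2"}
theirs = {"a": "1", "b": "3", "c": "4", "d": "1"}

merged, conflicts = three_way_merge(base, ours, theirs)
```

In this example a and b are each changed on one side only and merge cleanly, d is added on one side, and c is changed differently on both sides, so it is reported for manual resolution.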

Implementation

Git's primitives are not inherently an SCM system. Torvalds explains,[39]

In many ways you can just see git as a filesystem — it's content-addressable, and it has a notion of versioning, but I really really designed it coming at the problem from the viewpoint of a filesystem person (hey, kernels is what I do), and I actually have absolutely zero interest in creating a traditional SCM system.

From this initial design approach, Git has developed the full set of features expected of a traditional SCM,[22] with features mostly being created as needed, then refined and extended over time.

Some data flows and storage levels in the Git revision control system.

Git has two data structures: a mutable index that caches information about the working directory and the next revision to be committed; and an immutable, append-only object database.

The object database contains four types of objects:

  • A blob object is the content of a file. Blob objects have no filename, timestamps, or other metadata.
  • A tree object is the equivalent of a directory. It contains a list of filenames, each with some type bits and the name of the blob or tree object that holds the contents of that file, symbolic link, or directory. This object describes a snapshot of the source tree.
  • A commit object links tree objects together into a history. It contains the name of a tree object (of the top-level source directory), a timestamp, a log message, and the names of zero or more parent commit objects.
  • A tag object is a container that holds a reference to another object and can carry additional metadata about it. Most commonly, it is used to store a digital signature of a commit object corresponding to a particular release of the data being tracked by Git.
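
The four object types can be modelled with a small Python sketch. The class layout, the field names and the short placeholder names such as "c2" are assumptions made for illustration; they are not Git's on-disk format. The sketch shows how a commit leads to a snapshot through its tree, and how history is recovered by following parent links from the commit a branch points to.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union


@dataclass(frozen=True)
class Blob:
    data: bytes  # file content only: no filename, no timestamp


@dataclass(frozen=True)
class Tree:
    entries: Tuple[Tuple[str, str], ...]  # (filename, name of a blob or tree)


@dataclass(frozen=True)
class Commit:
    tree: str  # name of the top-level tree
    parents: Tuple[str, ...]  # zero or more parent commits
    timestamp: int
    message: str


@dataclass(frozen=True)
class Tag:
    target: str  # name of the tagged object
    note: str  # extra metadata, e.g. a signature


GitObject = Union[Blob, Tree, Commit, Tag]


def list_files(store: Dict[str, GitObject], tree_name: str, prefix: str = "") -> List[str]:
    """Flatten the snapshot described by a tree into a list of paths."""
    paths: List[str] = []
    for filename, child_name in store[tree_name].entries:
        if isinstance(store[child_name], Tree):
            paths.extend(list_files(store, child_name, prefix + filename + "/"))
        else:
            paths.append(prefix + filename)
    return paths


def first_parent_log(store: Dict[str, GitObject], commit_name: str) -> List[str]:
    """Collect log messages by following first parents back to the root."""
    messages: List[str] = []
    name: Optional[str] = commit_name
    while name is not None:
        commit = store[name]
        messages.append(commit.message)
        name = commit.parents[0] if commit.parents else None
    return messages


store: Dict[str, GitObject] = {
    "b1": Blob(b"hello\n"),
    "b2": Blob(b"int main(void) { return 0; }\n"),
    "t-src": Tree((("main.c", "b2"),)),
    "t1": Tree((("README", "b1"),)),
    "t2": Tree((("README", "b1"), ("src", "t-src"))),
    "c1": Commit("t1", (), 1, "initial import"),
    "c2": Commit("t2", ("c1",), 2, "add main.c"),
    "v1.0": Tag("c2", "first release"),
}
branches = {"master": "c2"}  # a branch is a reference to a single commit

head = store[branches["master"]]
files = list_files(store, head.tree)
log = first_parent_log(store, branches["master"])
```

Note that the README blob is shared by both snapshots: an unchanged file costs nothing extra because both trees refer to the same object name.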

The index serves as a connection point between the object database and the working tree.

Each object is identified by a SHA-1 hash of its contents. Git computes the hash, and uses this value for the object's name. The object is put into a directory matching the first two characters of its hash. The rest of the hash is used as the file name for that object.
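
The naming scheme can be sketched in Python as follows. One detail beyond the summary above: in practice Git hashes a short header (the object type and the content length, terminated by a NUL byte) followed by the content, which is what the sketch does. The function names are hypothetical, and the "objects/" prefix assumes the usual layout of the repository's object directory.

```python
import hashlib


def object_name(obj_type: str, content: bytes) -> str:
    """Compute the SHA-1 name of an object from its type and content."""
    header = "{} {}".format(obj_type, len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(name: str) -> str:
    """Directory from the first two hex characters, file name from the rest."""
    return "objects/{}/{}".format(name[:2], name[2:])


empty_blob = object_name("blob", b"")
empty_blob_path = loose_object_path(empty_blob)
```

For the empty blob this yields the name e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, stored at objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391. Identical content always maps to the same name, which is why the object database is described as content-addressable.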

Git stores each revision of a file as a unique blob object. The relationships between the blobs can be found through examining the tree and commit objects. Newly added objects are stored in their entirety using zlib compression. This can consume a large amount of disk space quickly, so objects can be combined into packs, which use delta compression to save space, storing blobs as their changes relative to other blobs.
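
The idea behind delta compression can be illustrated with a short Python sketch that stores a second revision as a list of copy and insert instructions relative to the first. The instruction list and the use of difflib to find matching regions are assumptions made for this example; Git's pack files use their own compact binary encoding and their own heuristics for choosing which objects to delta against.

```python
from difflib import SequenceMatcher
from typing import List, Tuple, Union

# ("copy", start, length) reuses bytes of the base; ("insert", data) adds new bytes.
Op = Union[Tuple[str, int, int], Tuple[str, bytes]]


def make_delta(base: bytes, target: bytes) -> List[Op]:
    """Describe target as copies from base plus literal inserts."""
    ops: List[Op] = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))
        # "delete" needs no instruction: those base bytes are simply not copied
    return ops


def apply_delta(base: bytes, ops: List[Op]) -> bytes:
    """Rebuild the target from the base and the instruction list."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += base[start:start + length]
        else:
            out += op[1]
    return bytes(out)


v1 = b"".join(b"line %d\n" % i for i in range(60))
v2 = v1.replace(b"line 30\n", b"line thirty\n")

delta = make_delta(v1, v2)
rebuilt = apply_delta(v1, delta)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
```

Only the changed text has to be stored literally; the rest of the second revision is expressed as references into the first, which is where the space saving over two full copies comes from.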

Git servers typically listen on TCP port 9418.[40]

Portability

Git is primarily developed on Linux, but can be used on other Unix-like operating systems including BSD, Solaris and Darwin. Git is extremely fast on POSIX-based systems such as Linux.[41]

Git also runs on Microsoft Windows. There are four variants:

  • A native Microsoft Windows port, called msysgit (using MSYS from MinGW). While somewhat slower than the Linux version,[42] it is acceptably fast,[43] but it is not recommended for production use because the current release is a preview version. In particular, some commands are not yet available from the GUIs and must be invoked from the command line.
  • An implementation in the form of a Microsoft Windows shell extension also exists, called TortoiseGit.
  • Git also runs on top of Cygwin (a POSIX emulation layer),[44] although it is noticeably slower, especially for commands written as shell scripts.[45] This is primarily due to the high cost of the fork emulation performed by Cygwin. As of 2007, the rewriting of many Git commands (originally implemented as shell scripts) in C has resulted in significant speed improvements on Windows.[46]
  • A Java implementation, JGit. It is included in EGit (Eclipse tooling built on JGit), Gerrit Code Review (a web-based review system), and the NBGit module for NetBeans.

Source code hosting

The following websites provide free source code hosting for Git repositories:[47]

References

  1. ^ "git/git.git/tree". git.kernel.org. Retrieved 2009-06-15.
  2. ^ Linus Torvalds (2005-04-07). "Re: Kernel SCM saga." linux-kernel (Mailing list). "So I'm writing some scripts to try to track things a whole lot faster."
  3. ^ "After controversy, Torvalds begins work on git". InfoWorld. 2005-04-19. ISSN 0199-6649. Retrieved 2008-02-20.
  4. ^ "GitFaq: Why the 'git' name?". Git.or.cz. Retrieved 2009-06-16.
  5. ^ "After controversy, Torvalds begins work on 'git'". PC World. 2005-04-20. Torvalds seemed aware that his decision to drop BitKeeper would also be controversial. When asked why he called the new software, 'git,' British slang meaning 'a rotten person,' he said. 'I'm an egotistical bastard, so I name all my projects after myself. First Linux, now git'.
  6. ^ Torvalds, Linus and David Diamond, Just for Fun: The Story of an Accidental Revolutionary, 2001, ISBN 0-06-662072-4
  7. ^ "Feature: No More Free BitKeeper". KernelTrap.org.
  8. ^ Linus Torvalds (2005-04-07). "Re: Kernel SCM saga." linux-kernel (Mailing list).
  9. ^ Linus Torvalds (2005-10-31). "Re: git versus CVS (versus bk)". git (Mailing list).
  10. ^ a b c d e f Linus Torvalds (2007-05-03). Google tech talk: Linus Torvalds on git. Event occurs at 02:30. Retrieved 2007-05-16.
  11. ^ Linus Torvalds (2007-06-10). "Re: fatal: serious inflate inconsistency". git (Mailing list). A brief description of Git's data integrity design goals.
  12. ^ a b Linus Torvalds (2007-02-27). "Re: Trivia: When did git self-host?". git (Mailing list).
  13. ^ Linus Torvalds (2005-04-06). "Kernel SCM saga." linux-kernel (Mailing list).
  14. ^ Linus Torvalds (2005-04-17). "First ever real kernel git merge!". git (Mailing list).
  15. ^ Matt Mackall (2005-04-29). "Mercurial 0.4b vs git patchbomb benchmark". git (Mailing list).
  16. ^ Linus Torvalds (2005-06-17). "Linux 2.6.12". git-commits-head (Mailing list).
  17. ^ Linus Torvalds (2006-10-20). "Re: VCS comparison table". git (Mailing list). A discussion of Git vs. BitKeeper.
  18. ^ Linus Torvalds (2005-07-27). "Meet the new maintainer..." git (Mailing list).
  19. ^ Junio C Hamano (2005-12-21). "ANNOUNCE: GIT 1.0.0". git (Mailing list).
  20. ^ Linus Torvalds (2006-05-05). "Re: [ANNOUNCE] Git wiki". linux-kernel (Mailing list). "Some historical background" on git's predecessors.
  21. ^ a b Linus Torvalds (2005-04-08). "Re: Kernel SCM saga". linux-kernel (Mailing list). Retrieved 2008-02-20.
  22. ^ a b Linus Torvalds (2006-03-23). "Re: Errors GITtifying GCC and Binutils". git (Mailing list).
  23. ^ Linus Torvalds (2006-10-19). "Re: VCS comparison table". git (Mailing list).
  24. ^ Stenback, Johnny (2006-11-30). "bzr/hg/git performance". Jst's Blog. Retrieved 2008-02-20. Benchmarks "git diff" against "bzr diff", finding the former 100x faster in some cases.
  25. ^ Roland Dreier (2006-11-13). "Oh what a relief it is". Observes that "git log" is 100x faster than "svn log" because the latter has to contact a remote server.
  26. ^ Fendy, Robert (2009-01-21). DVCS Round-Up: One System to Rule Them All?—Part 2. Linux Foundation. Retrieved 2009-06-25. One aspect that really sets Git apart is its speed. ...dependence on repository size is very, very weak. For all facts and purposes, Git shows nearly a flat-line behavior when it comes to the dependence of its performance on the number of files and/or revisions in the repository, a feat no other VCS in this review can duplicate (although Mercurial does come quite close).
  27. ^ "Trust". Git Concepts. Git User's Manual. 2006-10-18.
  28. ^ Linus Torvalds. "Re: VCS comparison table". git (Mailing list). Retrieved 2009-04-10. Describes Git's script-oriented design.
  29. ^ iabervon (2005-12-22). "Git rocks!". Praises Git's scriptability.
  30. ^ "Git User's Manual". 2007-08-05.
  31. ^ Linus Torvalds (2005-04-10). "Re: more git updates." linux-kernel (Mailing list).
  32. ^ Bruno Haible (2007-02-11). "how to speed up "git log"?". git (Mailing list).
  33. ^ Linus Torvalds (2006-03-01). "Re: impure renames / history tracking". git (Mailing list).
  34. ^ Junio C Hamano (2006-03-24). "Re: Errors GITtifying GCC and Binutils". git (Mailing list).
  35. ^ Junio C Hamano (2006-03-23). "Re: Errors GITtifying GCC and Binutils". git (Mailing list).
  36. ^ Linus Torvalds (2006-11-28). "Re: git and bzr". git (Mailing list). On using git-blame to show code moved between source files.
  37. ^ Linus Torvalds (2007-07-18). "git-merge(1)".
  38. ^ Linus Torvalds (2007-07-18). "CrissCrossMerge".
  39. ^ Linus Torvalds (2005-04-10). "Re: more git updates..." linux-kernel (Mailing list).
  40. ^ "Exporting a git repository via the git protocol". Kernel.org. Retrieved 2009-11-17.
  41. ^ Stenback, Johnny (2006-11-30). "bzr/hg/git performance". Jst's Blog. Retrieved 2008-02-20.
  42. ^ Johannes Schindelin (2007-10-14). "Re: Switching from CVS to GIT". git (Mailing list). A subjective comparison of Git under Windows and Linux on the same system.
  43. ^ Martin Langhoff (2007-10-15). "Re: Switching from CVS to GIT". git (Mailing list). Experience running msysgit on Windows.
  44. ^ Shawn Pearce (2006-10-24). "Re: VCS comparison table". git (Mailing list).
  45. ^ Johannes Schindelin (2007-01-01). "Re: [PATCH] Speedup recursive by flushing index only once for all". git (Mailing list).
  46. ^ Shawn O. Pearce (2007-09-18). "[PATCH 0/5] More builtin-fetch fixes". git (Mailing list).
  47. ^ http://git.wiki.kernel.org/index.php/GitHosting
