= Usability testing =

Usability testing is a technique used in user-centered interaction design to evaluate a product by testing it on users. This can be seen as an irreplaceable usability practice, since it gives direct input on how real users use the system. It is concerned mainly with how intuitive the product's design is, and it is carried out with users who have no prior exposure to the product. Such testing is paramount to the success of an end product, as a fully functioning application that confuses its users will not last for long. This is in contrast with usability inspection methods, where experts use different methods to evaluate a user interface without involving users.

Usability testing focuses on measuring a human-made product's capacity to meet its intended purposes. Examples of products that commonly benefit from usability testing are food, consumer products, websites or web applications, computer interfaces, documents, and devices. Usability testing measures the usability, or ease of use, of a specific object or set of objects, whereas general human–computer interaction studies attempt to formulate universal principles.

==What it is not==
Simply gathering opinions on an object or a document is market research or qualitative research rather than usability testing. Usability testing usually involves systematic observation under controlled conditions to determine how well people can use the product. However, qualitative research and usability testing are often used in combination, to better understand users' motivations and perceptions in addition to their actions.

Rather than showing users a rough draft and asking, "Do you understand this?", usability testing involves watching people trying to use something for its intended purpose. For example, when testing instructions for assembling a toy, the test subjects should be given the instructions and a box of parts and asked to assemble the toy, rather than just to comment on the parts and materials. Instruction phrasing, illustration quality, and the toy's design all affect the assembly process.

==History==
Usability testing didn't start with websites and apps. It first emerged through the study of how people used machines in the 1940s, such as airplane controls during World War II. Later, in the 1980s, when personal computers became the norm, a new field known as Human-Computer Interaction (HCI) established usability testing as a standard practice in technology design.

In the 1990s, people began setting up special labs where they could observe individuals using computers and identify areas where they were experiencing difficulties. As the internet and smartphones gained popularity in the 2000s and 2010s, usability testing expanded to websites and apps. It began to occur online, allowing companies to test with people from anywhere.

Usability testing has become a standard part of product design, particularly in fast-moving technology environments. New tools, remote testing, and emerging technologies, such as AI and virtual reality, make the process faster and more sophisticated. Still, the objective remains the same: to make technology easier for real people.

== Usability Labs ==

Usability labs, in contrast to field testing or remote testing, are spaces designed specifically for conducting usability testing. They provide conditions conducive to testing through access to resources, testing equipment, and a dedicated space that can be arranged and upgraded according to the specifications of the usability test. A typical usability lab includes a desk, chair, and computer, along with any additional elements specific to the test.

==Methods==
Setting up a usability test involves carefully creating a scenario, or a realistic situation, wherein the person performs a list of tasks using the product being tested while observers watch and take notes (dynamic verification). Usability testing follows a structured process that allows researchers to observe how users interact with a product while performing tasks. In addition to direct observation, researchers may use several other test instruments, such as scripted instructions, paper prototypes, and pre- and post-test questionnaires, to gather feedback on the product being tested (static verification). For example, to test the attachment function of an e-mail program, a scenario would describe a situation where a person needs to send an e-mail attachment and ask them to undertake this task. The aim is to observe how people function in a realistic manner, so that developers can identify the problem areas and fix them. Techniques popularly used to gather data during a usability test include the think-aloud protocol, co-discovery learning, and eye tracking.

=== Guerrilla Usability Testing ===
Guerrilla usability testing, also known as hallway testing or pop-up research, is a quick and cheap method of usability testing that consists of short informal interviews in public spaces that are frequented by people most likely to use your product or service.

This unorthodox method is primarily used in the early stages of a design process to receive direct and immediate feedback from a wide cross-section of the general public, significantly cutting the cost and testing time required in traditional testing. Guerrilla testing can help designers to identify core usability problems with the product and target "specific user groups that may be difficult to reach - for example, care home residents, homeless people or A level students."

This type of testing is an example of convenience sampling, and thus the results are potentially biased. Limitations of this method can include incomplete data, a lack of willing participants, and the need to pair it with other methods of usability testing to produce more detailed results.

===Remote usability testing===

In a scenario where usability evaluators, developers and prospective users are located in different countries and time zones, conducting a traditional lab usability evaluation creates challenges from both the cost and logistical perspectives. These concerns led to research on remote usability evaluation, with the user and the evaluators separated over space and time. Remote testing, which facilitates evaluations being done in the context of the user's other tasks and technology, can be either synchronous (moderated) or asynchronous (unmoderated). The former involves real-time one-on-one communication between the evaluator and the user, while the latter involves the evaluator and user working separately. The increasing need for remote testing stems from its capacity to improve accessibility to essential services and communication for individuals with limited mobility due to factors such as susceptibility to illness, disability, or limited transportation resources. Numerous tools are available to address the needs of both these approaches.

Synchronous (moderated) usability testing methodologies involve video conferencing or employ remote application sharing tools such as WebEx, a commonly used technology for conducting synchronous remote usability tests. This form of remote testing allows for real-time communication between moderators and participants, which is valuable to older adults or individuals who are homebound due to health, mobility, or environmental conditions. Unlike traditional usability testing, remote testing is able to reach participants who face these constraints. As dependency on remote services such as telemedicine, online shopping, and remote banking continues to grow, moderated remote usability testing plays a crucial role in ensuring these technologies meet the needs of high-risk populations while remaining cost-efficient.

However, synchronous remote testing may lack the immediacy and sense of "presence" desired to support a collaborative testing process. Moreover, managing interpersonal dynamics across cultural and linguistic barriers may require approaches sensitive to the cultures involved. Other disadvantages include having reduced control over the testing environment and the distractions and interruptions experienced by the participants in their native environment. One of the newer methods developed for conducting a synchronous remote usability test is by using virtual worlds.

Asynchronous (unmoderated) methodologies include automatic collection of users' click streams, user logs of critical incidents that occur while interacting with the application, and subjective feedback on the interface by users. Similar to an in-lab study, an asynchronous remote usability test is task-based, and the platform allows researchers to capture data automatically through auto-logging, which collects pages visited, time spent on each page, and interface actions. For many large companies, this allows researchers to better understand visitors' intents when visiting a website or mobile site. The tests are carried out in the user's own environment (rather than labs), helping further simulate real-life scenario testing. By eliminating the need to conduct individual sessions, asynchronous remote testing can include a larger number of participants, making it more flexible and cost-effective than traditional lab-based studies. Conducting usability testing asynchronously has also become prevalent and allows testers to provide feedback in their free time and from the comfort of their own home.
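
As a rough illustration of the auto-logging described above, the sketch below derives the pages visited and the time spent on each page from a single participant's click stream. The log format (a list of `(timestamp_seconds, page)` tuples) and the function name `time_per_page` are assumptions made for this example; they do not reflect the output of any particular testing platform.

```python
from collections import defaultdict

def time_per_page(events, session_end):
    """Sum the seconds spent on each page in one participant's session.

    events: list of (timestamp_seconds, page) tuples, one per page view,
            sorted by time (hypothetical log format).
    session_end: timestamp at which the session was closed.
    """
    totals = defaultdict(float)
    # Pair each page view with the timestamp of the next one; the last
    # view runs until the end of the session.
    next_times = [t for t, _ in events[1:]] + [session_end]
    for (start, page), end in zip(events, next_times):
        totals[page] += end - start
    return dict(totals)

# Made-up session: the participant returns to the search page once.
log = [(0, "/home"), (12, "/search"), (40, "/product"), (95, "/search")]
durations = time_per_page(log, session_end=110)
print(durations)
```

Repeat visits to the same page are added together, so a page that participants keep returning to shows up with a large total even if each visit is short.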

===Expert review===

Expert review is another general method of usability testing. As the name suggests, this method relies on bringing in experts with experience in the field (possibly from companies that specialize in usability testing) to evaluate the usability of a product.

A heuristic evaluation or usability audit is an evaluation of an interface by one or more human factors experts. Evaluators measure the usability, efficiency, and effectiveness of the interface based on usability principles, such as the 10 usability heuristics originally defined by Jakob Nielsen in 1994.

Nielsen's usability heuristics, which have continued to evolve in response to user research and new devices, include:
- Visibility of system status
- Match between system and the real world
- User control and freedom
- Consistency and standards
- Error prevention
- Recognition rather than recall
- Flexibility and efficiency of use
- Aesthetic and minimalist design
- Help users recognize, diagnose, and recover from errors
- Help and documentation

===Automated expert review===

Similar to expert reviews, automated expert reviews provide usability testing, but through programs that are given rules for good design and heuristics. Though automated reviews might not provide as much detail and insight as reviews from people, they can be finished more quickly and consistently. The idea of creating surrogate users for usability testing is an ambitious direction for the artificial intelligence community.

===A/B testing===

In web development and marketing, A/B testing or split testing is an experimental approach to web design (especially user experience design), which aims to identify changes to web pages that increase or maximize an outcome of interest (e.g., click-through rate for a banner advertisement). As the name implies, two versions (A and B) are compared, which are identical except for one variation that might impact a user's behavior. Version A might be the one currently used, while version B is modified in some respect. For instance, on an e-commerce website the purchase funnel is typically a good candidate for A/B testing, as even marginal improvements in drop-off rates can represent a significant gain in sales. Significant improvements can be seen through testing elements like copy text, layouts, images and colors.
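
One common way to judge whether the difference between versions A and B is larger than chance alone would explain is a two-proportion z-test. The sketch below is an illustration only: the visitor and conversion counts are made up, the function name `two_proportion_z` is hypothetical, and the test is a standard statistical technique rather than one prescribed by the description above.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return (z, two_sided_p) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up counts: 10,000 visitors per version.
z, p_value = two_proportion_z(conv_a=200, n_a=10_000, conv_b=250, n_b=10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests that the observed difference is unlikely to be noise, but the threshold for acting on it is a judgment call that depends on the cost of a wrong decision.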

Areas typically improved through A/B testing include algorithms, visuals, and workflow processes.

Multivariate testing or bucket testing is similar to A/B testing but tests more than two versions at the same time.

==Number of participants==

In the early 1990s, Jakob Nielsen, at that time a researcher at Sun Microsystems, popularized the concept of using numerous small usability tests—typically with only five participants each—at various stages of the development process. His argument is that, once it is found that two or three people are totally confused by the home page, little is gained by watching more people suffer through the same flawed design. "Elaborate usability tests are a waste of resources. The best results come from testing no more than five users and running as many small tests as you can afford."

The claim that "five users is enough" was later described by a mathematical model, which gives the proportion of uncovered problems U as

$U = 1-(1-p)^n$

where p is the probability of one subject identifying a specific problem and n is the number of subjects (or test sessions). Plotted against n, the model gives a curve that rises asymptotically towards the number of real existing problems (see figure below).
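
To make the formula concrete, the sketch below evaluates U for a few values of n. The detection probability p = 0.31 is an assumed value chosen for illustration, and the function name `proportion_found` is hypothetical. With this value, five participants give U of about 0.84.

```python
def proportion_found(p, n):
    """Proportion of problems uncovered by n participants: U = 1 - (1 - p)^n."""
    return 1 - (1 - p) ** n

# p = 0.31 is an assumed per-participant detection probability,
# chosen for illustration only.
for n in (1, 3, 5, 10, 15):
    print(n, round(proportion_found(0.31, n), 3))
```

Each additional participant adds less than the one before, which is the diminishing return that the argument for small tests relies on.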

In later research Nielsen's claim has been questioned using both empirical evidence and more advanced mathematical models. Two key challenges to this assertion are:
1. Since usability is related to the specific set of users, such a small sample is unlikely to be representative of the total population, so the data is more likely to reflect the sample group than the population it is meant to represent.
2. Not every usability problem is equally easy to detect. Intractable problems slow down the overall process, so under these circumstances progress is much shallower than predicted by the Nielsen/Landauer formula.

Nielsen does not advocate stopping after a single test with five users; his point is that testing with five users, fixing the problems they uncover, and then testing the revised site with five different users is a better use of limited resources than running a single usability test with 10 users. In practice, the tests are run once or twice per week during the entire development cycle, using three to five test subjects per round, and with the results delivered within 24 hours to the designers. The number of users actually tested over the course of the project can thus easily reach 50 to 100 people. Research shows that user testing conducted by organisations most commonly involves the recruitment of 5-10 participants.

In the early stage, when users are most likely to immediately encounter problems that stop them in their tracks, almost anyone of normal intelligence can be used as a test subject. In stage two, testers will recruit test subjects across a broad spectrum of abilities. For example, in one study, experienced users showed no problem using any design, from the first to the last, while naive users and self-identified power users both failed repeatedly. Later on, as the design smooths out, users should be recruited from the target population.

When the method is applied to a sufficient number of people over the course of a project, the objections raised above are addressed: the sample size ceases to be small, and usability problems that arise with only occasional users are found. The value of the method lies in the fact that specific design problems, once encountered, are never seen again because they are immediately eliminated, while the parts that appear successful are tested over and over. While it is true that the initial problems in the design may be tested by only five users, when the method is properly applied, the parts of the design that worked in that initial test will go on to be tested by 50 to 100 people.

== Limitations of participant sampling ==

=== Statistical critiques of small sample testing ===
The widely cited claim that five participants are sufficient to identify 85% of usability problems has been subject to significant academic scrutiny. The underlying mathematical model assumes a constant probability of problem discovery across all users and all problems. Research demonstrated that this formula "makes unwarranted assumptions about individual differences in problem discovery" and found that while the model may hold for simple problem counts, analyses incorporating problem frequency and severity indicated that sample sizes may need to be doubled to avoid misleading results.

Empirical research has revealed substantial variance in the five-user model's effectiveness. Randomised sampling experiments have found that while groups of five users discovered an average of 85% of problems as predicted, the range varied dramatically: some groups of five found as few as 55% of problems, whereas no group of twenty found fewer than 95%. This variance suggests that small sample sizes introduce considerable uncertainty into usability findings.
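
The spread between groups can be illustrated with a small Monte Carlo simulation. This is a synthetic sketch, not a reproduction of the experiments described above: the number of problems (40), the range of detection probabilities (0.05 to 0.6), the trial count, and the function names are all assumptions made for the example.

```python
import random

def simulate_group(detect_probs, n_users, rng):
    """Fraction of problems found by at least one of n_users participants."""
    found = 0
    for p in detect_probs:
        # A problem counts as found if any participant detects it.
        if any(rng.random() < p for _ in range(n_users)):
            found += 1
    return found / len(detect_probs)

def summarize(detect_probs, n_users, trials, rng):
    """Return (min, mean, max) of the fraction found over many random groups."""
    results = [simulate_group(detect_probs, n_users, rng) for _ in range(trials)]
    return min(results), sum(results) / trials, max(results)

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical product with 40 problems whose detection probabilities
# range from hard to find (0.05) to fairly obvious (0.6).
probs = [rng.uniform(0.05, 0.6) for _ in range(40)]

low5, mean5, high5 = summarize(probs, n_users=5, trials=2000, rng=rng)
low20, mean20, high20 = summarize(probs, n_users=20, trials=2000, rng=rng)
print(f"5 users:  min={low5:.2f} mean={mean5:.2f} max={high5:.2f}")
print(f"20 users: min={low20:.2f} mean={mean20:.2f} max={high20:.2f}")
```

Comparing the minimum and maximum for the two group sizes shows how much an unlucky draw of five participants can differ from the average, which the mean alone hides.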

The probability of detecting any given usability problem is not uniform but varies based on problem severity, user characteristics, product complexity, and test structure. Research indicates that subtle problems, which may have the most serious implications for user safety or task completion, have lower detection probabilities and require larger sample sizes to identify reliably.

Heterogeneity models that account for varying problem detection rates have been proposed, on the argument that the original Nielsen-Landauer formula oversimplifies the discovery process. These models suggest that intractable problems slow down the overall discovery process, resulting in shallower progress than the original formula predicts.
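
The effect of heterogeneity can be seen by comparing the single-probability formula with an average taken over problem-specific probabilities. The sketch below is a simplified illustration, not one of the published heterogeneity models: the list of detection probabilities is made up and the function names are hypothetical.

```python
def homogeneous(p_mean, n):
    """Original formula: every problem shares one detection probability."""
    return 1 - (1 - p_mean) ** n

def heterogeneous(probs, n):
    """Expected proportion found when each problem has its own probability."""
    return sum(1 - (1 - p) ** n for p in probs) / len(probs)

# Hypothetical mix: a few obvious problems and several subtle ones.
probs = [0.7, 0.6, 0.5, 0.1, 0.1, 0.05, 0.05, 0.05]
p_mean = sum(probs) / len(probs)

for n in (5, 10, 20):
    print(n, round(homogeneous(p_mean, n), 3), round(heterogeneous(probs, n), 3))
```

Because the discovery curve is concave in the detection probability, averaging over unequal probabilities always gives a lower value than plugging the mean probability into the single-probability formula, so the simple formula overstates how quickly subtle problems are found.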

=== Underrepresentation of users with disabilities ===
A significant limitation of usability testing practice is the systematic underrepresentation of users with disabilities, despite this population comprising a substantial proportion of potential users. According to the World Health Organization, approximately 1.3 billion people (16% of the global population) experience significant disability, with prevalence increasing due to population ageing and the rise of chronic diseases. In countries with life expectancies over 70 years, individuals spend on average 11.5% of their lifespan living with disabilities.

People with disabilities stand to gain the most from new technologies, but they are too often left behind by these developments, largely because they are not included in consultations and the early stages of development.

Research found that technologies are not commonly tested with participants from diverse backgrounds, with key barriers including organisational pressures, stakeholder culture, and difficulties in participant recruitment.

It has been emphasised that designs must be tested with users with disabilities to determine whether they are both accessible and usable, as compliance with accessibility guidelines alone does not guarantee usability. Industry practitioners have argued that users with disabilities often identify a larger range of issues than non-disabled users, including problems that affect all users, suggesting that inclusive testing may be more efficient than testing with non-disabled users alone.

The concept of "extreme users" in design research posits that testing with users who face additional challenges, including those with disabilities, can reveal insights that improve usability for all users, a phenomenon sometimes termed the "curb-cut effect" after the observation that kerb ramps designed for wheelchair users also benefit people with prams, luggage, or mobility difficulties. Some modern user research practices focus on "edge" or "extreme" users rather than mainstream users, on the reasoning that a design that works for edge users is likely to be seamless and intuitive for all customers.

=== WEIRD sampling bias ===
Usability research shares broader sampling limitations identified in psychological research, particularly the overreliance on participants from Western, Educated, Industrialised, Rich, and Democratic (WEIRD) societies. Studies have found that while WEIRD populations constitute approximately 12% of the global population, they represented 96% of participants in published behavioural science research.

This sampling bias raises questions about the generalisability of usability findings, as research has demonstrated that cognitive and perceptual processes can vary significantly across cultures. For example, visual perception studies have shown that certain optical illusions that reliably affect people from industrialised countries do not have the same effect on people from non-industrialised societies.

A systematic review of developmental psychology research found that of 1,582 articles analysed, only 112 featured participants from Central and South America, Africa, Asia, and the Middle East combined, compared to 912 featuring United States participants alone. Similar patterns are likely present in usability research, though specific analyses of the field's sampling demographics remain limited.

The persistence of sampling bias despite widespread awareness has been attributed to convenience sampling practices, where researchers recruit participants who are readily accessible rather than representative of the user population. Standard experimental designs may also inadvertently exclude certain populations; for instance, lengthy testing sessions may be aversive to neurodivergent participants, leading to self-selection effects that reduce sample representativeness.

=== Convenience sampling and recruitment bias ===
Convenience sampling—recruiting participants who are readily available rather than systematically selected—is prevalent in usability testing practice but introduces significant bias. Unlike random sampling, which produces statistically balanced selections, convenience sampling results in samples that may not represent the broader user population.

==Examples==

=== Example A: Medical Device Testing (2015) ===
The 2015 edition of Usability Testing of Medical Devices defines usability testing as: "a means to determine whether a given medical device will meet its intended users' needs and preferences," and as a way "to judge if a medical device is more or less vulnerable to dangerous use errors."

The testing process typically proceeds in this fashion:

1. Planning. Usability specialists create test plans to assess how effectively a device's design meets user needs. Classically, such tests often occur in dedicated facilities, but the setting can vary to non-clinical areas depending on the device and study scope.
2. Participant selection. Representative users (medical professionals and patients) are carefully selected to ensure they reflect the individuals who will be using the device. This sometimes includes expanding the sample to include participants with physical or cognitive limitations to evaluate accessibility and usability.
3. Task performance and observation. Participants are acclimated to the test environment, briefed on its purposes and rules, and interviewed before performing tasks with the device. "While test participants perform their tasks, test personnel…observe intensively to determine how the medical device facilitates or hinders tasks."
4. Analysis and iteration. Data is reviewed to assess consistency across sessions and uncover potential usability problems that don't suit the target audience's needs or could lead to unsafe or inefficient use.

Through this process, researchers gain critical insight into human interaction with medical devices, enabling designers to correct usability issues prior to the products' release.

=== Example B: Apple Computer (1982) ===
A 1982 Apple Computer manual for developers advised on usability testing:

1. "Select the target audience. Begin your human interface design by identifying your target audience. Are you writing for businesspeople or children?"
2. Determine how much target users know about Apple computers, and the subject matter of the software.
3. Steps 1 and 2 permit designing the user interface to suit the target audience's needs. Tax-preparation software written for accountants might assume that its users know nothing about computers but are experts on the tax code, while such software written for consumers might assume that its users know nothing about taxes but are familiar with the basics of Apple computers.

Apple advised developers, "You should begin testing as soon as possible, using drafted friends, relatives, and new employees":

Designers must watch people use the program in person, because

==Education==
Usability testing has been a formal subject of academic instruction in different disciplines. Usability testing is important to composition studies and online writing instruction (OWI). Scholar Collin Bjork argues that usability testing is "necessary but insufficient for developing effective OWI, unless it is also coupled with the theories of digital rhetoric."

== Survey research ==
Survey products include paper and digital surveys, forms, and instruments that can be completed or used by the survey respondent alone or with a data collector. Usability testing is most often done in web surveys and focuses on how people interact with the survey, such as navigating it, entering responses, and finding help information. Usability testing complements traditional survey pretesting methods such as cognitive pretesting (how people understand the products), pilot testing (how the survey procedures will work), and expert review by a subject matter expert in survey methodology.

In translated survey products, usability testing has shown that "cultural fitness" must be considered at the sentence and word levels and in the designs for data entry and navigation, and that presenting translations and visual cues for common functionalities (tabs, hyperlinks, drop-down menus, and URLs) helps to improve the user experience.

==See also==

- Commercial eye tracking
- Component-based usability testing
- Crowdsourced testing
- Diary studies
- Don't Make Me Think
- Educational technology
- Heuristic evaluation
- ISO 9241
- RITE Method
- Software performance testing
- Software testing
- System usability scale (SUS)
- Test method
- Tree testing
- Universal usability
- Usability goals
- Usability of web authentication systems
