# Talk:Flow cytometry bioinformatics

WikiProject Computational Biology (Rated B-class, Low-importance)
This article is within the scope of WikiProject Computational Biology, a collaborative effort to improve the coverage of Computational Biology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
B  This article has been rated as B-Class on the quality scale.
Low  This article has been rated as Low-importance on the importance scale.
WikiProject Molecular and Cell Biology (Rated B-class, Low-importance)
B  This article has been rated as B-Class on the project's quality scale.
Low  This article has been rated as Low-importance on the project's importance scale.

## Reviews

This article was created on the PLoS Computational Biology Wiki as a review paper to be co-published on Wikipedia. As such it went through a formal academic peer review process, which was open. The comments from the reviewers, and their responses, are noted below. These reviews, and the revision history of the article prior to being transferred to Wikipedia, are archived at that page.

Reviewer 1: Holden Maecker

I find this to be a good and wide-ranging summary of topics associated with flow cytometry analysis and bioinformatics. It spans the territory from basic flow cytometry concepts and gating, to newer bioinformatics approaches like SPADE and PCA, and routines for data processing such as those in Bioconductor. Few people's expertise spans all of these areas, but this page provides a good synthesis for folks who work in one or more of these areas, and want to learn more.

I would suggest expanding the section on Gating, to make some basic but missing or merely implied points, e.g.:

- Gating is hierarchical, usually focusing in on specific subsets by sequential selection of populations, usually in two dimensions at a time (e.g., Lymphocytes->T cells->CD4+ T cells->naive CD4+ T cells).
- This approach suffers from the inability to visualize all other relevant dimensions when gating on only two dimensions at a time; it may even make it difficult to distinguish closely spaced populations that could be better separated in >2-dimensional space. And it suffers from "tunnel vision", in that an overview of the entire dataset is virtually impossible.
- Boolean gates can be created (to some extent, automatically in software such as FlowJo) that divide a population of cells into all logical combinations of markers. This is a complementary approach to automated gating algorithms that find "where the cell clusters are"; in a Boolean approach, one asks "what are all the possible cell phenotypes" and then monitors those compartments to see which ones are populated and to what extent. It is, however, a deterministic approach, assuming that cells are either positive or negative for a given marker, and the user decides the positive/negative boundary. The number of compartments can also become staggering with increasing dimensions. Clustering algorithms are, by contrast, unsupervised, in that they do not require any user input about what is positive or negative; they simply find regions of cell density, inflection points, etc.

-Holden Maecker
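
To make the Boolean gating point concrete, here is a minimal sketch of splitting events into every positive/negative combination of markers. The marker names, the simulated intensities and the thresholds are all hypothetical and only stand in for real data and user-chosen boundaries; this is not how FlowJo or any other package implements it.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 1000 events x 3 markers (names and values are illustrative).
markers = ["CD3", "CD4", "CD8"]
events = rng.normal(loc=1.0, scale=1.0, size=(1000, len(markers)))

# User-chosen positive/negative boundary per marker (the deterministic step).
thresholds = np.array([1.0, 1.0, 1.0])

# One boolean column per marker: True means "positive" for that marker.
positive = events > thresholds

# Count events in every logical combination of markers (2**n compartments).
counts = {}
for combo in itertools.product([False, True], repeat=len(markers)):
    in_compartment = np.all(positive == np.array(combo), axis=1)
    label = "".join(m + ("+" if p else "-") for m, p in zip(markers, combo))
    counts[label] = int(in_compartment.sum())

for label, n in counts.items():
    print(label, n)
```

Every event lands in exactly one compartment, and the number of compartments doubles with each added marker, which is the "staggering" growth mentioned above.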

These are some excellent suggestions. As there is some overlap between these comments on gating and the comments of reviewer 3, we have addressed both reviewers' comments in our response there. -Kierano (talk) 11:48, 27 June 2013 (PDT)

Reviewer 2: Nolwenn Le Meur

This topic page gives a good review of the field of flow cytometry bioinformatics. It covers the fundamentals of data handling and analysis for flow cytometry. It also highlights new approaches and ongoing developments, notably for cell population identification where room for improvements remains.

My main comment is on the lead paragraph. The sentence “Flow cytometry bioinformatics is the application of bioinformatics, computational statistics and machine learning to analyze flow cytometry data” is confusing. As mentioned in the Wikipedia page for Bioinformatics, this interdisciplinary field uses many areas of computer science, mathematics and engineering, and therefore already includes data analysis, notably with machine learning techniques. I would rather say: “Flow cytometry bioinformatics is the application of bioinformatics to flow cytometry, which involves storing, retrieving, organizing and analyzing flow cytometry data using extensive computational resources and tools.” Maybe it could be added that flow cytometry bioinformatics requires and contributes to the development of computational statistics and machine learning methods. In addition, the introduction could be developed with examples of application fields. Indeed, flow cytometry is used in a wide range of domains, from medicine and environment for human health to the analysis of the microbiome in seawater (e.g. Wang, Y et al. (2010). Past, present and future applications of flow cytometry in aquatic microbiology. Trends in Biotechnology, 28(8), 416–424. doi:10.1016/j.tibtech.2010.04.006.)

A minor comment is on the description of the different steps in computational flow cytometry analysis. This description is well done although the concept of workflow could be emphasized. Some software allows storing analysis workflows, which are notably useful for qualitative and reproducible research. For instance, for gating which is a hierarchical process, it is especially required to keep track of the process used for population selection. It is also essential when flow cytometry is used as a diagnostic tool to automate population selection. Finally, workflows saved in standard file format such as XML can be played by different software, which can be useful in terms of reproducible research.

Nolwenn Le Meur

The comments on the lead section were extremely helpful, and have been taken into account in the expansion of that section.
We have added a listing of some of the applications of flow cytometry to the introduction section.
We have added a paragraph to the section overviewing the steps in flow cytometry analysis to emphasise the importance of workflows and their interchange for reproducibility. -Kierano (talk) 11:48, 27 June 2013 (PDT)

Reviewer 3: Jorge Pardo

This page provides an informative overview of the type of multidimensional data generated by flow cytometry and the role of bioinformatics in analyzing increasingly complex data sets.

The introductory section on the basics of fluorescence-based flow cytometry is missing a description of spectral cross-over compensation. This would seem an oversight, as compensation and compensation matrices are mentioned in other sections of the page.

Manual gating in the analysis of flow data should be described earlier in the page, certainly before describing Gating-ML, and with a bit more detail. The authors describe the process as "error prone" and "non-reproducible". Given the same data set, two investigators may use different hierarchical manual gating strategies to define a cell population, but this does not imply intrinsic non-reproducibility in the process. Indeed, clinical flow cytometry laboratories are certified based on their ability to reproduce results while testing a defined sample, and this testing involves manual gating. As for "error prone", the inference is that there is a correct way to gate flow data, and that when this process is done manually, it is likely to be done incorrectly. This statement is then ignored in the discussion of combinatorial gating approaches, like flowType/RchyOptimyx, that use manual gating. On the other hand, the discussion of automated gating using clustering algorithms fails to mention that repeated analysis of data sets with a large number of clusters may report different cluster partitions (http://www.biomedcentral.com/1471-2105/14/S1/S8). I would invite the authors to present a balanced characterization of manual gating that recognizes its limitations in the analysis of increasingly complex flow data; it is a time-consuming hierarchical approach that is limited to two-dimensional analysis at each step.

Re. manual gating:
We have re-organized the content to discuss manual gating earlier as requested. We have also clarified that manual analysis can indeed be reproducible, especially in controlled clinical settings, and have better described the cases in which it can cause inaccuracies. We have also clarified that despite the recent advances in computational analysis, manual gating is still the main solution for identification of specific rare cell populations (e.g., for gating rare populations for the combinatorial gating algorithms). Finally, we have explained that the computational gating algorithms we have discussed here can automatically select the number of cell populations using different methods and that this choice can affect the sensitivity and specificity of the results. -Nnimaa (talk) 15:21, 28 June 2013 (PDT)

Lastly, I'd emphasize the need for informative representation of cell populations identified through automated gating of complex multidimensional flow data. It is not informative to show all cell populations defined through multidimensional analysis on two dimensional dot plots. The SPADE software does a great job as it organizes defined cell populations in hierarchies of related phenotypes and it also allows for the comparison of individual markers across all the cell populations. This facilitates the identification of cell lineages, identification of rare cell types and comparison of different samples. -Jorge Pardo

Re. visualization:
First, we would like to clarify that the SPADE algorithm is not always suitable for identification of lineages (as spanning trees do not necessarily represent lineages) or rare cell populations (due to the downsampling). Several approaches are being considered for addressing these limitations. This being said, we agree with the reviewer that SPADE is a fantastic algorithm for visualization of an entire sample to identify major cell populations and have discussed it in the "gating guided by dimension reduction" section. -Nnimaa (talk) 15:16, 28 June 2013 (PDT)
Re. compensation:
Initially we had thought to exclude compensation, as the methods for performing it, while computational and automated, are standard and have not advanced significantly since the development of multicolour flow. However, on re-reading, it did indeed feel missing, and we have consequently included a section discussing the computational aspects of compensation. -Kierano (talk) 11:48, 27 June 2013 (PDT)

### Response to reviewers

We have added our responses to each reviewer below their review. All of the comments were extremely helpful and we feel have strengthened the paper; we thank all of the reviewers for their input.

Full details of the changes can be seen at this diff. -Kierano (talk) 11:53, 27 June 2013 (PDT)

## Identifying cell populations

I have some comments on the section Flow cytometry bioinformatics#Identifying cell populations. I could try to resolve some of these issues myself by editing the article, but I thought it might be best to bring it up here first so that people can see where I am coming from.

1. I think it would help if we described the data we are working with. I take it that we have lots of variables that could be used to categorise the cells but no standard of what those categories are. The aim is to create categories, based on the measured data, right? This section sounds a bit like it is describing decision tree learning, but that is different: there we have a training set with the categories already known, and the aim is to work out how best to infer the category from the explanatory variables, so that in future we can calculate the probability of an item falling into a category based on measurements of the explanatory variables. If the aim of the identification process was made clearer this would help.
2. (Related to the above point.) Where has fluorescence intensity come from? It's mentioned in the second sentence of Flow cytometry bioinformatics#Gating, as if it is really important, but not earlier in the article. Are all the variables different types of fluorescence intensity?
3. I take it that the different subsections describe different methods for achieving the same aim. It would be good to make that clearer. It could look like you do one subsection and then the next.
4. It took me a while to get my head around the sentence "The data generated by flow-cytometers can be plotted in one or two dimensions to produce a histogram or scatter plot." - even though there was an example of the scatter plots just there. I was thinking "which two dimensions"? But now I realise that that point is that it could be any two. The data has multiple dimensions of explanatory variables and the point is that you take pairs of these variables. With a plot like the one shown, you can show every pair you want to, all next to each other in a logical way. Not sure what the best way to make this clearer is. Perhaps resolving points 1 and 2 will help.
5. The article says "As the number of markers measured by flow cytometry increases, the number of scatter plots that need to be investigated increase exponentially." If my understanding in point 4 is correct then it actually increases quadratically. Perhaps best to write this as "The number of scatter plots that need to be investigated increases with the square of the number of markers measured by flow cytometry."
6. The text implies that probability binning is univariate and the multivariate equivalent is called "frequency difference gating". Is this correct? The image caption uses the term "probability binning", even though the image is multivariate. Might be best to just stick with one term? We could tweak the text, including removing the phrase "on a univariate basis".
7. With probability binning, what is the chi-squared test done on? I take it you are testing for a difference in the variables on which you aren't splitting the data. If true, the fact that you are using the other variables deserves a mention.

Yaris678 (talk) 10:25, 24 December 2013 (UTC)
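
For point 5, the two growth rates can be compared directly. This is a minimal sketch, assuming one scatter plot per unordered pair of markers and one Boolean phenotype per positive/negative combination of markers; the function names are hypothetical.

```python
from math import comb

def n_scatter_plots(n_markers: int) -> int:
    # One plot per unordered pair of markers: n * (n - 1) / 2, i.e. quadratic growth.
    return comb(n_markers, 2)

def n_boolean_phenotypes(n_markers: int) -> int:
    # Each marker is either positive or negative: 2**n, i.e. exponential growth.
    return 2 ** n_markers

for n in (4, 8, 16):
    print(n, n_scatter_plots(n), n_boolean_phenotypes(n))
# prints:
# 4 6 16
# 8 28 256
# 16 120 65536
```

So the number of pairwise plots grows with the square of the number of markers, while the number of possible positive/negative phenotypes is what grows exponentially.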

Thanks for taking the time to comment -- it's good to get some outside perspective. And grr. We really need a better way of handling long comment blocks. I tried putting responses inline, but it broke the enumeration. Anyway, here goes:
1. This is covered in the first paragraph of the "flow cytometry data" section, although I can definitely see how language better linking that paragraph to the actual data would be better. In one earlier version, I had the first paragraph of the "Flow Cytometry Standard" section there instead. I'll see what I can do to make that section clearer. As for the aim of the section, it could probably be better described as feature extraction, but I'm not sure whether anyone in a primary source actually uses that term. I've added a sentence or two to the section introduction to hopefully clarify this, although I may need to mull over the wording somewhat, since it's not exactly typical feature extraction (often the number of cell types identified could exceed the number of cells, if you count subtypes).
2. See above.
3. This is true of "identifying cell populations", but not of "data pre-processing", so I can see how confusion could arise. I've added text to the beginning of each section to clarify this.
4. Hopefully, yes.
5. Quite right. I think this was Nima having fallen into the habit of thinking of the number of possible cell types if each marker is divided into positive or negative, which does increase exponentially with the number of markers. Please feel free to amend the text.
6. Yes, flowFP (what is pictured) is probably more accurately described as frequency difference gating than probability binning. I've amended the figure legend.
7. Hmmm ... re-reading Mario's paper, they actually used a custom "normalised" chi-square statistic (which they call ${\displaystyle \chi '^{2}}$) based on differences in the locations of the quantiles between a control and a test sample. They used a Monte-Carlo simulation to empirically determine the value of ${\displaystyle \chi '^{2}}$ for statistical significance. I'm not sure if there's a better way of phrasing this than the way it is currently phrased without it becoming too verbose for a Wikipedia article.
-Kieran (talk) 19:41, 6 January 2014 (UTC)
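
For readers following point 7, a rough univariate sketch of the idea may help. It bins on the control sample's quantiles, compares bin proportions between control and test, and estimates a significance cut-off by simulation. The exact form of the statistic is not given in this discussion, so the formula below (squared difference of bin proportions over their sum) is an assumption, as are the function names, the bin count and the resampling scheme; it should not be read as the published method.

```python
import numpy as np

def normalised_chi2(control, test, n_bins=10):
    # Bin edges are control quantiles, so each bin holds roughly equal control events.
    edges = np.quantile(control, np.linspace(0, 1, n_bins + 1))[1:-1]
    # Proportion of each sample falling in each bin.
    c = np.bincount(np.searchsorted(edges, control), minlength=n_bins) / len(control)
    s = np.bincount(np.searchsorted(edges, test), minlength=n_bins) / len(test)
    denom = c + s
    mask = denom > 0  # skip bins that are empty in both samples
    # Assumed normalised form: sum over bins of (c - s)^2 / (c + s).
    return float(np.sum((c[mask] - s[mask]) ** 2 / denom[mask]))

def monte_carlo_threshold(control, n_test, n_bins=10, n_sim=200, alpha=0.05, seed=0):
    # Empirical null: resample "test" data from the control itself, so any
    # difference in the statistic is due to sampling variation alone.
    rng = np.random.default_rng(seed)
    null = [
        normalised_chi2(control, rng.choice(control, size=n_test, replace=True), n_bins)
        for _ in range(n_sim)
    ]
    return float(np.quantile(null, 1 - alpha))

# Simulated example: the test sample is shifted relative to the control.
rng = np.random.default_rng(1)
control = rng.normal(0.0, 1.0, 5000)
shifted = rng.normal(0.5, 1.0, 5000)

stat = normalised_chi2(control, shifted)
threshold = monte_carlo_threshold(control, len(shifted))
print(stat, threshold, stat > threshold)
```

Note that the test is carried out on the same variable used for binning: it asks whether the test sample fills the control's equal-probability bins unevenly.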
Hi Kieran,
I prefer a response in this format, so I'm happy.  :-)
1. The stuff I want to know isn't covered in the first paragraph of the "flow cytometry data" section. That paragraph is about how the data is collected, but it doesn't actually say "This gives data on multiple types of fluorescence intensity for each cell in the sample" or whatever the case may be. However, I notice that you have added something along these lines to the "Identifying cell populations" section.
2. So is the data multiple types of fluorescence intensity?
3. Much clearer now. Thanks.
4. OK. I'll wait until points 1 and 2 are resolved.
To be more (but probably not even fully) precise, the data is a matrix of ${\displaystyle M}$ measurements of light intensity, by ${\displaystyle N}$ events detected as pulses by the flow cytometer. Most events in ${\displaystyle N}$ represent cells passing by the laser, although some may be doublets (two cells passing together) or subcellular debris.
Usually the first two measurements in ${\displaystyle M}$ are unfiltered directly scattered light ("forward scatter") and unfiltered right angle scattered light ("side scatter"). The remainder are usually fluorescence intensity, which is to say that they are the amount of light from the cell/particle that was detected after it passed through an optical bandpass filter calibrated to the peak region of the emission spectrum of the fluorophore being detected. As such it is a proxy measure of the amount of that fluorophore that is on the cell. In the most common case of fluorophores being bound to detector molecules such as antibodies, this is by further proxy a measure of how much detector is present, and by even further proxy a measure of how much of the biomarker the detector molecule is bound to (in the antibody case, usually a protein on the cell's surface). You could, I suppose, talk about "marker quantity", but the many layers of indirection make that somewhat incorrect. -Kieran (talk) 00:53, 28 January 2014 (UTC)
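
As a concrete picture of that matrix, here is a minimal sketch with simulated values. The channel names and the log-normal intensities are hypothetical, and storing events as rows (N by M) is just a layout choice for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical channel list: two scatter measurements, then fluorescence channels.
channels = ["FSC", "SSC", "FITC", "PE", "APC"]
n_events = 10_000  # N: pulses detected by the cytometer (cells, doublets, debris)

# Simulated light intensities, one row per event and one column per measurement.
data = rng.lognormal(mean=5.0, sigma=1.0, size=(n_events, len(channels)))

scatter = data[:, :2]       # forward and side scatter
fluorescence = data[:, 2:]  # one column per fluorophore/filter combination

# Any two columns can be paired to give the x and y values of a scatter plot.
x = data[:, channels.index("FITC")]
y = data[:, channels.index("PE")]

print(data.shape, scatter.shape, fluorescence.shape)
```

This also answers the "which two dimensions" question in point 4: a plot is just a choice of two columns from the matrix.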