
Wikipedia:Reference desk/Archives/Mathematics/2020 February 10

From Wikipedia, the free encyclopedia
Mathematics desk


February 10

Pet lifespans

I have some veterinary records for the lifetimes of pets. These plot to give me a nice curve, but it is not that useful for forecasting how long a pet might live, because the curve is very spread out. So I divide the records into several categories (dog, cat, rabbit, etc.) and generate the curve for each. I then measure the mean and standard deviation to get a better method for forecasting the likely life span of a pet.

Q1. How do I measure the improvement in the accuracy of my forecasting?
Q2. Is there a method to reduce the number of classes with minimal loss of accuracy? For example, we might hope that we can amalgamate "black rabbits" and "white rabbits" without any loss of accuracy, and maybe even some gain as we have more data. (Factor analysis perhaps?)

All the best: Rich Farmbrough (the apparently calm and reasonable) 16:07, 10 February 2020 (UTC).

To start, using mean and spread to estimate remaining life time is only useful when working with an a priori known family of distributions, e.g. the family of normal (Gaussian) distributions. But this family does not give a good fit with typical life-span distributions. The log-normal distributions are better – at least individuals cannot have a negative life span – but their density functions have tails that are too fat. So it is better to work with the experimentally observed distributions directly. In the following, "distribution function" always refers to the cumulative distribution function.
Let F be the life-span distribution of a population. F(0) = 0, F(t) → 1 as t → ∞; in general, F(t) is the fraction of individuals that have a life-span of t or less in duration. It can be used to estimate the remaining life time of an individual when it attains some age t0 (assuming no dramatic changes in life-spans are foreseen). It is convenient to work with the complement function defined by G(t) = 1 − F(t). (This is known as the survival function.) The expected remaining life time of an individual at age t0 then equals
(1 / G(t0)) · ∫ G(t) dt, with the integral taken from t = t0 to ∞.
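For an observed sample in which every recorded pet has already died (an assumption; censored records would need a different estimator), the ratio of the integral of G beyond t0 to G(t0) reduces to the average of (t − t0) over the individuals that outlived t0. A minimal sketch of that computation follows; the function names and the sample lifespans are hypothetical, chosen only for illustration.

```python
from bisect import bisect_right


def survival(lifespans, t):
    """Empirical survival function G(t): fraction of the sample living longer than t."""
    data = sorted(lifespans)
    return (len(data) - bisect_right(data, t)) / len(data)


def expected_remaining_life(lifespans, t0):
    """Empirical version of (1 / G(t0)) * integral of G from t0 to infinity.

    With no censoring this equals the mean of (t - t0) over all t > t0.
    """
    survivors = [t for t in lifespans if t > t0]
    if not survivors:
        raise ValueError("no individual in the sample outlived t0")
    return sum(t - t0 for t in survivors) / len(survivors)


# Hypothetical lifespans in years, not real veterinary data.
cat_lifespans = [2.0, 9.5, 12.0, 13.5, 14.0, 15.0, 16.5, 18.0]

print(survival(cat_lifespans, 10.0))
print(expected_remaining_life(cat_lifespans, 10.0))
```

Note that the estimate becomes unreliable at high ages t0, where only a handful of individuals remain in the sample.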
Re Q1, lacking a ground truth it is hard to tell whether basing the computations on the distribution of some well-chosen subset is actually better than using a larger set of data. If the subset is still fairly large but its density function is noticeably less smeared out, it probably is an improvement. You can use the two-sample Kolmogorov–Smirnov test to test if the distribution of some subset is significantly different from that of a larger set. If not, then any seeming improvements in accuracy may be a mirage.
Re Q2, using common sense and real-world knowledge may work better here than any sophisticated analysis technique (unless the dataset is both huge and rich). As in Q1, use the K–S test to see whether a considered split-off produces a significant difference, and split only when it does.  --Lambiam 19:21, 10 February 2020 (UTC)
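The two-sample K–S comparison suggested for Q1 and Q2 can be sketched in plain Python as below. The statistic D is the largest vertical gap between the two empirical distribution functions. The p-value here uses the asymptotic Kolmogorov series, which is only an approximation and is poor for small samples; SciPy's `scipy.stats.ks_2samp` provides the same test if that library is available. The function names and the rabbit data are hypothetical.

```python
import math


def ks_statistic(a, b):
    """Largest absolute difference between the two empirical distribution functions."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both distribution functions past every observation equal to x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic p-value for the two-sample statistic d with sample sizes n and m."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        return 1.0  # the series converges slowly here and the true value is ~1
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))


# Hypothetical lifespans in years, not real veterinary data.
black_rabbits = [6.5, 7.0, 8.0, 8.5, 9.0, 9.5, 10.0, 11.0]
white_rabbits = [6.0, 7.5, 8.0, 8.5, 9.0, 10.0, 10.5, 11.5]

d = ks_statistic(black_rabbits, white_rabbits)
p = ks_pvalue(d, len(black_rabbits), len(white_rabbits))
print(d, p)
```

A large p-value gives no evidence that the two groups differ, which would support amalgamating them; a small one suggests keeping the split.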

For human life expectancies, actuaries seem to use the Gompertz–Makeham law of mortality which is a blended distribution. You can do the same thing for animals but will want to estimate the parameters separately for each species, unless the species are very similar. 2601:648:8202:96B0:0:0:0:7AC0 (talk) 03:56, 11 February 2020 (UTC)
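For reference, the Gompertz–Makeham law models the death rate at age x as the sum of an age-independent term and an exponentially growing term, lam + alpha·exp(beta·x). The corresponding survival function follows by integrating that rate. A minimal sketch, with hypothetical parameter values that are not fitted to any species:

```python
import math


def gm_hazard(x, alpha, beta, lam):
    """Gompertz-Makeham death rate at age x: constant part plus exponential part."""
    return lam + alpha * math.exp(beta * x)


def gm_survival(x, alpha, beta, lam):
    """Survival function G(x) = exp(-integral of the hazard from 0 to x)."""
    return math.exp(-lam * x - (alpha / beta) * (math.exp(beta * x) - 1.0))


# Hypothetical parameters (per year), for illustration only.
alpha, beta, lam = 0.002, 0.35, 0.01

for age in (0, 5, 10, 15):
    print(age, gm_hazard(age, alpha, beta, lam), gm_survival(age, alpha, beta, lam))
```

In practice the three parameters would have to be estimated per species from the records, for example by maximum likelihood, which this sketch does not attempt.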

At very old ages that law does not adequately describe the mortality pattern. For an exposition of the method used for the annual U.S. Life Tables, see this publication (pdf) of the National Center for Health Statistics. They use vital statistics and census data to calculate death rates. Previously only giving estimates for ages under 85, they now also use Medicare data for ages 85 years and over. While technically complicated to mitigate the effects of anomalies in reporting, their method does not involve parameterized distributions.  --Lambiam 09:35, 11 February 2020 (UTC)
The same holds for the method used in the UK. No parametric distributions were harmed in doing these calculations.  --Lambiam 21:24, 11 February 2020 (UTC)
Thanks both. Food for thought. All the best: Rich Farmbrough (the apparently calm and reasonable) 10:24, 12 February 2020 (UTC).