Logistic regression

This is a follow-up to my previous article on r-squareds when modeling with logistic regression.  Depending on the context and your purpose, a small r-squared might be acceptable.

R-squared is certainly a worthwhile tool in linear regression because it has a direct interpretation: it is the proportion of the variance in the y-values explained by the model.

That is not the case with logistic regression.  As a result, mathematicians have “created” pseudo r-squareds.  Let’s look into the components of the r-squared formula and try to discern exactly what it conveys.

Let us look at two pictures with their respective r-squareds first.

The first picture had an r-squared of 0.18.  What is the y-value?  It is the blue dot.  What did we expect the blue dot to be?  The value on the line.  It is all relative, but the dots and the line do not look very ‘close’, relatively speaking.

[Figure: scatter plot with a small r-squared (0.18)]

The r-squared in the second picture is much higher, in fact very close to 1.  Let’s keep that picture in mind as we look at the formula.

Here is one common way of writing the formula.
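With $\hat{y}_i$ denoting the expected value on the line and $\bar{y}$ the mean of the y’s, the version described below is:

$$ R^2 \;=\; 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} $$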

Notice that, breaking it down in words, it is one minus the ratio of the sum of squared differences between the y’s and the expected y’s (y-hat) to the sum of squared differences between the y’s and the mean of the y’s.  So, for instance, in the second graph, notice what makes the r-squared value close to 1.  Since the points are so close to the regression line, the actual y-values are close to the expected y’s, which makes the numerator of that ratio relatively close to zero.  Certainly, the ratio also depends on the denominator, but if you run the numbers (the mean is close to 18), you will see that the denominator dwarfs the numerator, making the ratio close to 0, and subtracting something close to 0 from 1 results in a value close to 1.

So, let’s also think about why this formula makes sense.  We are trying to get at how well we explain the y-values, correct?  Well, with the mean around 18, it makes sense to compare, so to speak, the actual y’s to the predicted y’s in assessing the fit.  What if we defaulted to predicting each y-value to be the mean?  That would not do much, would it?  In fact, it does nothing.  Anybody could do that.

For example, what if you try to predict the value of each of 100 houses in a neighborhood?  You know that the mean value is $200,000, so you predict that each house is worth $200,000.  What would your r-squared be?  Since your predicted y-values are the SAME as the mean y-value, the numerator and denominator will be the same, so the ratio will be 1.  Thus, r-squared will be 0 (and it should be).
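Putting both of those points into a quick sketch (the numbers are made up, not the data from the pictures above):

```python
# Toy numbers: a close-fitting line versus simply predicting the mean everywhere.
y     = [14, 16, 17, 19, 21, 23]                 # actual y-values, mean near 18
y_hat = [14.2, 15.8, 17.3, 18.9, 21.1, 22.7]     # predictions from a close-fitting line
y_bar = sum(y) / len(y)

ss_res = sum((a - p) ** 2 for a, p in zip(y, y_hat))   # numerator: (y - y_hat)^2
ss_tot = sum((a - y_bar) ** 2 for a in y)              # denominator: (y - y_bar)^2
print(1 - ss_res / ss_tot)        # close to 1, because the points hug the line

# The "predict the mean for every house" case: y_hat equals y_bar everywhere,
# so the numerator equals the denominator and r-squared is exactly 0.
ss_mean = sum((a - y_bar) ** 2 for a in y)
print(1 - ss_mean / ss_tot)       # 0.0
```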

So, you see, this formula conveys exactly what it was intended to convey: the closer the values are to the regression line, the better the fit, and the closer the r-squared will be to 1.

Let us shift over to logistic regression.  Here is the problem.  Because of the nature of the y-values (either 0 or 1), it is difficult (if not impossible) to attach the same kind of meaning to an r-squared value (or something akin to it).  This is not to say these values carry no meaning at all.

Let’s assume that we sample 10 fourth graders and want to predict whether they will become lawyers based on certain characteristics (how much they study, their interests, etc.).  Let’s also assume that 1% of students overall will become lawyers.  So, you devise a model that tries to predict whether or not a student becomes a lawyer.  Even the most likely of kids at that point isn’t going to have more than perhaps a 5% chance, because many things can happen in life between age 10 and the mid-20s, when one might pass the bar exam.  Life gets in the way, doesn’t it?  So, if a person does ultimately become a lawyer, even the best of models is likely going to miss that (because it will assign a low probability to everyone).  Since you know that 99% do not become lawyers, and since you also know that it is very hard to predict who will, your logistic regression model is likely going to project at most a 5 to 10% chance for even the most likely future lawyers.

Let’s go back to the formula.

What happens in the formula is that the numerator will be highly dependent on whether or not one of those 10 actually becomes a lawyer.  If somebody does, then the difference between the predicted value (close to 0) and the actual value (which is 1) is relatively large, keeping in mind that most of the differences in the denominator are very close to 0.

Here is the crux of this.  We could play around with numbers all day long, but the nature of the predictions (are they close to 0, as with becoming a lawyer, or closer to 50-50, as with, say, graduating from college?) will go a long way in determining the r-squareds.  It will also matter, at least in the case of becoming a lawyer, whether or not somebody in the sample actually becomes one (if just one does, that can bring the r-squared much closer to 0).  Dare I say it, luck matters here (some may take issue with that statement, but we can debate luck at another time 😊).

But oftentimes there is a way around this problem.  Again, it might depend on your purpose, but generally we use these models to help us in business, correct?  More often than not, we want to make predictions with them.  This gets to the importance of a holdout sample.

Assume that you are manufacturing something (a widget), and it must meet a “pass” threshold before it can be sent to a customer.  Assume that 90% generally pass.  You run a logistic regression hoping to find ranges of the independent variables that will increase that 90% pass rate.  You end up with a model in which two independent variables come in significant, but your r-squared is not very high.  Should you trust your model?

One thing you can do is to use a holdout sample and create deciles based on predicted success.  Then check to see how well those deciles correlate with the actual success.
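Here is a rough sketch of how that check might look in code.  The column names and the simulated holdout below are made up purely for illustration; the idea is just to bin the holdout by predicted probability and compare each bin’s average prediction to its actual pass rate.

```python
import numpy as np
import pandas as pd

# Simulated stand-in for a real holdout sample of 1000 widgets: a predicted
# pass probability from the model and the actual outcome (1 = pass, 0 = fail).
rng = np.random.default_rng(0)
predicted = rng.uniform(0.78, 0.96, size=1000)
holdout = pd.DataFrame({"predicted_prob": predicted,
                        "actual_pass": rng.binomial(1, predicted)})

# Bin the holdout into 10 equal-sized groups by predicted probability, then
# compare each decile's average predicted success to its actual pass rate.
holdout["decile"] = pd.qcut(holdout["predicted_prob"], q=10,
                            labels=list(range(1, 11)))
table = (holdout.groupby("decile", observed=True)
                .agg(predicted_success=("predicted_prob", "mean"),
                     actual_success=("actual_pass", "mean")))
print(table.round(2))
```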

Assume your holdout has 1000 observations.  You create a table that shows the decile, the predicted success, and the actual success.  Perhaps it looks something like this.

decile    predicted success    actual success
1         80%                  82%
2         83%                  86%
3         85%                  85%
4         87%                  89%
5         89%                  90%
6         90%                  91%
7         90%                  92%
8         92%                  91%
9         93%                  93%
10        95%                  94%

What can we make of this?  Well, the predicted success is reasonably close to the actual success.  We can work backwards, so to speak, to see what ranges of values went into the higher predicted successes.  This is not a guarantee, but there is a good chance you can tweak those two independent variables to be in the range of the higher deciles and perhaps move your success rate to somewhere between 93 and 95%.  Even if you move it only to 92%, that is a relatively big move: you have moved it 20% of the way to perfection.  So, we might be able to work around small r-squareds with logistic regression. Every predictive analyst has his or her own style, and it could be argued that modeling is as much an art as it is a science. Perhaps you have a different way of doing this, but I believe this technique can work well in certain contexts with logistic regression.


R-Squareds for logistic regression

Here is a good article about r-squareds (or best proxies for them) for logistic regression.

FAQ: What are pseudo R-squareds?

This is a subject that seems to be barely discussed and poorly understood.   (Note that I will refer to proxies as r-squareds for ease of discussion).

The article gives some excellent insight into the nature of these proxies, which do not work as well as the r-squared does for linear regression.  There is one part I slightly disagree with, and that is in regard to the following two sentences.

“In other words, a pseudo R-squared statistic without context has little meaning. A pseudo R-squared only has meaning when compared to another pseudo R-squared of the same type, on the same data, predicting the same outcome.”

Technically, these statements may be correct, but ‘meaning’ is a relative term, and the fact is that r-squareds do have meaning in that the closer they are to 1, the better the model fits.  I think it is a little bit like temperature: while 80 degrees might be hot to some and not so much to others, we do know that 80 degrees is “pretty hot”.  A pseudo r-squared of, say, 0.8 is, more often than not, a pretty reasonable fit, depending upon context.  I am not sure my temperature analogy is a perfect one, but you get the idea.

I would like to elaborate just a bit on one characteristic that may give these metrics a propensity for higher or lower values. I think it somewhat depends upon the likelihood of success (or failure). I ran some models with the idea of gaining insight into whether the likelihood of success matters with respect to the tendency toward low (or high) r-squareds.  There are a lot of numbers, and this is difficult to convey in a few paragraphs, so pay careful attention!

Specifically, I looked at four models with the same two independent variables, both coming in significant in all four models.  I took a variable with a mean of 80 that ranged anywhere from 0 to 1000 (skewed right). I defined a “success” four different ways: namely, whether that value was at least 50, at least 100, at least 200, or at least 400.  Of course, it is going to be at least 50 more often than at least 100, etc., so there will be far fewer successes when the criterion is at least 400. I found the Somers’ D values (one of the many pseudo r-squareds) to go from 0.148 to 0.181 to 0.202 to 0.24 for the four models where success was defined as being at least 50, 100, 200, and 400, respectively.
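For what it’s worth, here is a sketch of that kind of comparison using simulated data in place of my original data set; the skewed variable, the two predictors, and the relationship between them are all made up purely for illustration.  For a binary outcome, Somers’ D is equivalent to 2 × AUC − 1, which makes it easy to compute:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 20_000

# Two made-up independent variables and a right-skewed value with a mean near 80.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
value = rng.gamma(shape=1.5, scale=50, size=n) * np.exp(0.3 * x1 + 0.2 * x2)
X = np.column_stack([x1, x2])

# Define "success" four different ways and compare Somers' D (= 2*AUC - 1) for each model.
for cutoff in (50, 100, 200, 400):
    y = (value >= cutoff).astype(int)
    p_hat = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
    somers_d = 2 * roc_auc_score(y, p_hat) - 1
    print(f"success = value >= {cutoff}: success rate {y.mean():.3f}, Somers' D {somers_d:.3f}")
```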

Based on experience, this makes sense, as it seems to me that the more polarized the data set is (meaning extreme percentages of successes or failures), the more it will lend itself to higher pseudo r-squared values.  So the case where a success required the value to be at least 400 (mostly ‘failures’ here) had the highest r-squared.

Keeping in mind that many of these formulas are similar, let us look at Efron’s formula.  Its numerator is the sum of the squared differences between the actuals and the predicted probabilities, while its denominator is the sum of the squared differences between the actuals and the average of the actuals.  Now let’s assume you have extreme data (i.e., almost always a success, say 99% of the time).  I believe the nature of the regression process is going to result in a very high modeled probability of success almost every time.
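For reference, in symbols, with $\hat{\pi}_i$ the predicted probability for observation $i$ and $\bar{y}$ the average of the actual 0/1 outcomes:

$$ R^2_{\text{Efron}} \;=\; 1 - \frac{\sum_i (y_i - \hat{\pi}_i)^2}{\sum_i (y_i - \bar{y})^2} $$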

For instance, what if you know that 99% of students at a particular high school will go to college?  Perhaps one of the factors is that it is an affluent area.  For ease of discussion, assume there were exactly 100 students at the high school in 2020, and 99 of them went to college.  One of the significant variables is GPA, and you find out that a person had a GPA in the bottom 10% of his class. Well, there is still a really good chance that this person goes to college, and the model, understanding that, will give that person a high probability of going to college.  So 99 of the terms in the numerator will be close to zero.  Only one term (namely, the person who did not go to college) will contribute much to the numerator (the actual will be zero, with likely a high predicted probability).
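Here is that paragraph as a small calculation; the predicted probabilities of 0.99 and 0.90 are assumptions I am using purely for illustration.

```python
# 100 students, 99 of whom went to college; the model gives everyone a high
# predicted probability, including the low-GPA student who did not go.
actual    = [1] * 99 + [0]
predicted = [0.99] * 99 + [0.90]     # assumed predictions, for illustration only

y_bar = sum(actual) / len(actual)                                  # 0.99
numerator   = sum((a - p) ** 2 for a, p in zip(actual, predicted))
denominator = sum((a - y_bar) ** 2 for a in actual)
print(round(numerator, 4))                    # ~0.82, almost all from the one non-college student
print(round(1 - numerator / denominator, 3))  # Efron's pseudo r-squared for this toy scenario
```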

Let’s compare these two scenarios: the extreme case above and a more balanced one (say, where successes and failures are closer to a 50-50 split).  The first has virtually every observation as a success and a predicted probability of success near 1.  That keeps Efron’s numerator relatively low.  The second scenario, I suspect, is going to have many predictions in the ballpark of, say, 65%, or perhaps on the other side of 50, say 35%.  That is going to make the numerator much higher.  That said, because of the average of the successes in the two situations, the denominator will also be lower in the extreme case (the first scenario).  So, on the surface, we have a ‘stalemate’.  For instance, start with the fraction 1/2 and compare it to a fraction with a larger numerator and denominator, such as 2/4.  Of course, it is the same.

But I think what generally happens is that, in the extreme case, the numerator is sufficiently small to ‘offset’ the fact that the denominator is relatively high in the other setting.  It is like comparing 3/4 to 12/15: in one case you get 0.75 and in the other you get 0.8, and when subtracting from 1, you get 0.25 (the extreme case) compared to 0.2 (the non-extreme case).

So, I think what happens the great majority of the time is that “all things being equal”, the more extreme situations (actuals being almost always a success or failure) lend themselves to slightly higher r-squareds.

Hope this makes sense. 

Changing the way we think about p-values

Let’s break this down.  The article points out the potential for p-values to mislead and gives an example with a high false discovery rate of 86%.  Without looking at the numbers, it seems counterintuitive that the false discovery rate could be 86% when you have relatively high specificity (95%) and sensitivity (80%).  But let’s look further into why this is the case.  It is really the disparity in size between the 9900 who do not have the condition and the 100 who do.  Stemming from that, we get the 495 false positives and the 80 true positives.  In other words, because 9900 is so much larger than 100 (99 times bigger), that group generates far more false positives, which dwarf the true positives and result in the very high false discovery rate.

Let’s change these numbers.  What if 50% of people (5000) had the condition and 50% did not?  Assume the same 95% specificity and 80% sensitivity.  Now we would have 4000 (80% of 5000) correctly diagnosed with the condition, and 250 (5% of 5000) whom we believe have the condition but do not.  So our false discovery rate is now 250/4250 = 6%. What a difference!
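Here is that arithmetic spelled out as a small helper, using the same numbers as above:

```python
def false_discovery_rate(n_with, n_without, sensitivity, specificity):
    true_positives  = sensitivity * n_with            # people with the condition who test positive
    false_positives = (1 - specificity) * n_without   # people without it who test positive anyway
    return false_positives / (true_positives + false_positives)

# 1% prevalence: 100 with the condition, 9900 without.
print(false_discovery_rate(100, 9900, sensitivity=0.80, specificity=0.95))   # ~0.86

# 50% prevalence: 5000 with, 5000 without.
print(false_discovery_rate(5000, 5000, sensitivity=0.80, specificity=0.95))  # ~0.06
```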

How might that play out with the hypothesis tests we conduct?  I think we can make an analogy to the screening example.  Let’s assume that Ford has a car that has been around for many years and is claimed to get at least 30 miles per gallon (assume all driving conditions are the same).  A test is done every day for the many years of its existence.  Assume that Ford is very good at what they do, and in reality they almost never produce a car that gets less than 30 mpg.  What is likely to happen? There will be few times they really have a car that gets less than 30 mpg, yet we are going to observe less than 30 mpg in a test quite often just by random chance, aren’t we?  In those cases, most of our rejections will be false positives, giving a high false discovery rate.  Notice again what drove the high false discovery rate in our screening example: we had so few (80) true positives, at least relative to the false positives.  The same type of thing is going on in the scenario I just described.

Conversely, if Ford were poor at what they do and built cars with volatile mpg values, then when we do find less than 30 mpg, it would be more likely that the car really is getting less than 30 mpg, and thus the false discovery rate would be lower.

Here is the crux of the whole discussion.  When we find small p-values in any context, much like the Ford example, we don’t know enough about the context (hence the need for conducting a hypothesis test to begin with), and thus we do not know what the real false discovery rate is.  But as the article suggests, we should acknowledge that the chance of a ‘significant’ result being a false positive is higher (perhaps much higher) than the p-value gives it credit for.  Perhaps it is a good idea to set a tougher standard (make the critical region harder to hit) for rejecting the null hypothesis in many instances.

There are quite a few other really good articles at royalsocietypublishing.  Check them out.  And let me know your thoughts here.  Eventually, I think we will need to change what we are teaching in statistics books regarding hypothesis testing 😊

Sampling and the Presidential Election, part 3 of 3.

In the first part of a three-part series, we talked about some basics with sampling, including some definitions. In the second part, we talked about various types of sampling.

In this final installment, we will talk about the Presidential election polls, and important dynamics to consider when taking samples and making inferences. 

Before going further, I want to point out something regarding discretion.  Part of a model might include judgement calls.  When we take a sample and make a conjecture about the population (called making an inference), very often a researcher must make judgement calls that might be based on anecdotal information or even “gut feel”.  Thus, as you will see, I will give my beliefs on some nuances related to the 2020 Presidential polls.  Those beliefs may be different from yours, and unlike many things in mathematics, it can be hard to refute one theory or another.  This can be an art as well as a science.

I briefly mentioned bias in the last article, and it is important to elaborate on it.  We mentioned last time that bias can occur when the sample is not a good representation of the population.  The most famous case of bias happened in 1948, in the Presidential race between Harry Truman and Thomas Dewey.  In fact, the Chicago Daily Tribune ran the headline “Dewey Defeats Truman”.  Why would this be the case when Dewey clearly did not end up winning?  They listened to the polls, and the polls they adhered to were biased.  How were they biased?  Well, back then, for the most part, only affluent people had telephones, and the survey was done by telephone.  Thus, the people giving responses were not a representative subset of the population and were more likely to favor Dewey.

Bias can be a hard thing to anticipate.  Obviously, if it had been anticipated in 1948, it would not have been a problem; they would have allowed for it in some way.  Also, even though we said the sample was not a representative subset of the population, bias really only occurs when the misrepresentation affects what is being measured.

For example, if we wanted to know the percent of people who liked chocolate ice cream better than vanilla ice cream, and we surveyed 80 males and 20 females, we might not have bias.  I stress ‘might not’ because males and females might have the same tendencies towards preference of ice cream flavors. 

But what if we asked that same 100 people whether they prefer a love story or an action film?  If you are trying to generalize to the whole population, you are almost certainly going to get a distorted representation when sampling 80 males and 20 females.

So, anyway, this poll was heavily slanted towards voters liking Dewey, and it distorted the general sentiment.

To summarize, we want to avoid bias in our polling, and bias is any factor (in this case, wealth) that may misrepresent the population (recall, the population is what is of interest to the researcher).

Let us fast-forward from 1948 to 2020.  Dynamics have changed.  There is much more polarization between the parties, and a media that many consider much more biased.  I am going to try to stay away from politics here, but since I am hopefully speaking to a common-sense audience, it is obvious what is going on with the media and their bias towards Democrats.  I believe this is relevant to polling today for reasons I will explain, and that is why I mention it.

The three main concepts I want to discuss are bias, honesty, and volatility, and in some ways I believe each of these has some connection to the media.

Now, one might ask, what is the population in the context of Presidential polls?  There is no rule as to how we define a population; it really depends on what is of interest, and this is a subtle yet important distinction.  Is it registered voters?  Is it likely voters?  You will see some of each.  That said, it is the consensus (which I agree with) that likely voters are better to poll because, of course, they are the ones that are… well… likely to vote.  So, right off the bat, we might have bias if we were to sample registered voters.

Let us assume a reputable polling organization takes a poll of 1000 voters.  Let us also assume there is a decent mixture of Democrats, Republicans, and Independents.  Why might there be bias?

Well, perhaps the most obvious thing to look at is how the sample is split.  Assume that the electorate is made up of 40% Republicans, 40% Democrats, and 20% Independents.  Then, without doing any further research, we would want our sample split in that fashion (this is called stratified sampling).

Now, I do not want to bog you down with granularity, but we could get very specific if we wanted to.  For example, we could account for the propensity of one party or the other to be more likely to switch sides.  But if there is such a tendency for either party, it is probably negligible.

All this said, and without having followed the polls ultra-closely, anecdotally it seems to me that Democrats are slightly over-represented in many polls.  At least I have seen that in some cases.  Slight bias there.

I mentioned honesty.  The fact is that today, with the ‘deplorables’ label and the general mindset that Fox News is not mainstream (why is it that doctors’ offices and the like almost never show Fox News?), it is believed that some Trump voters do not want to divulge whom they support, in part, I think, because they do not want to be the ‘bad guy’, so to speak.

And finally, volatility.  This is something that is not really written about in textbooks, but I believe it is relevant in today’s political landscape, and I say this largely because of the media.

I believe the media today is the most biased in U.S. history.  But what makes this relevant is that they are influencing some “soft” voters, and those soft voters, I believe, are more likely to say they will vote Democrat and then vote Republican than vice versa.

Then there is the matter of electoral votes and not just the popular vote.

As of this writing and depending on the poll(s) you might follow, Joe Biden is up about seven points overall, and up in most battleground states by three to seven points.

I believe that with bias, honesty, and volatility all potentially leaning back in Trump’s favor, this vote will be awfully close again. 

Part 2, Types of Sampling

This is the second of a three-part series on sampling. The third part will come more quickly than the second part did. 🙂

There are four types of sampling: Simple Random Sample, Stratified, Cluster, and Systematic.  I will give a brief definition as well as an example of each.

A simple random sample is the most basic.  It is where every member of the population (every person of interest) has an equal chance of being selected for the survey.  We will talk more about it later.

Stratified sampling is one in which the population is divided into groups, and the sample is drawn in proportion to the relative sizes of those groups.  For instance, if the sample is to come from individuals in either of two cities, and one city has 1000 people and the other has 2000 people, then the sample would consist of 1/3 of its subjects (people) from the first city (since it has 1/3 of the total) and 2/3 from the second city.
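A quick sketch of that proportional allocation (the resident lists and overall sample size below are made up):

```python
import random

# Proportional allocation for the two-city example: an overall sample of 300.
city_a = [f"A-{i}" for i in range(1000)]   # 1000 people -> should supply 1/3 of the sample
city_b = [f"B-{i}" for i in range(2000)]   # 2000 people -> should supply 2/3 of the sample
n = 300
total = len(city_a) + len(city_b)

sample = (random.sample(city_a, n * len(city_a) // total) +
          random.sample(city_b, n * len(city_b) // total))
print(len(sample))   # 300: 100 from the first city and 200 from the second
```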

Cluster sampling is similar to stratified sampling in that the population is divided into groups (called clusters in this case), but for cluster sampling, one (or more) of the clusters is chosen to represent the other clusters.

So, for example, assume that voters across the country are to be surveyed, and also that we would like to sample in proportion to the states’ populations.  Instead of going to every state, we may just sample from a handful of states if it is believed that one state is representative of others.  Perhaps we sample only from Oregon to represent the three states on the west coast.

Finally (these are in no particular order), there is systematic sampling.  Systematic sampling occurs when every kth subject is selected (where k is some natural number, such as 4).  So, for example, assume that one wants to sample hospital patients and is interested in patients over some 24-hour period, perhaps a Saturday.  Assume also that about 400 patients are expected in a day, and that we would like to sample about 100 people, so we sample every fourth person on the register sheet.  The advantage of this is that by sampling people throughout the day, we are more apt to avoid peculiarities related to the time of day.
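And a quick sketch of that hospital example (the register itself is made up):

```python
# Roughly 400 patients expected on the register; we want about 100 of them,
# so we take every 4th name.
register = [f"patient-{i}" for i in range(1, 401)]

k = 4
start = 2                     # the starting point could also be chosen at random from 0..k-1
sample = register[start::k]   # every 4th patient, spread across the whole day
print(len(sample))            # 100
```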

For example, if they sample the first 100 people that go to the hospital some Saturday morning, it might be that they are getting a different type of patient.  Perhaps the people going early in the day are more apt to be giving blood.  Thus, if the sample is to try to ascertain the reasons people go in, they are likely to get a distorted picture.

Back to the simple random sample.  When taking a simple random sample, it is usually impossible (for all intents and purposes) to give everyone an equal chance of being selected.  For instance, in the example of polling for the Presidential election, I am not quite sure of the pollsters’ exact methods, but one thing I am sure of is that not every registered voter has an equal chance of being selected.  If they do this by telephone, not everybody has a telephone (though in this day and age just about everybody does), some people may not pick up their phone, and, more to the point, some, if asked, do not want to divulge whom they are leaning toward.

Also, just about any sample that is not a Simple Random Sample is going to include a Simple random sample (loosely speaking, as we discussed in the paragraph above) within it.  Take the example given above with stratified sampling.  Perhaps it is not expedient to give everyone an equal chance of being selected.  You might not have their names, their phone numbers, etc.  But you try to make it as random as is reasonably possible.

At the crux of all sampling is bias, and specifically the ability to avoid it.  Bias is where the sampling is a distorted representation of the population.

In part 3, we will discuss the most famous case of bias, as well as the polling in the Presidential election.

Many aspects to discuss with that.  Everybody seems to have an opinion on it. 

Trying to make sense of all of this

You sure do hear a lot of numbers flying around with respect to COVID-19.  It is often hard to believe (or understand) what you hear from the media, as well as from the experts. 

What makes this complex is that there are many factors involved, and it is hard to tease out the importance of each as well as the relationships between them (i.e., is one factor independent of another?).

What makes the understanding of all of this so important is that decisions, and many of them life altering, will be made from these numbers.

You hear all kinds of probabilities.  From advocates of opening everything up, you might hear “A person has a 99.3% chance of surviving COVID-19,” etc.  Is that accurate?  Is that even good?  How do you put all of this in context?

Currently, many states are trying to decide what restrictions to put on people and businesses.  To do this, we need to understand these numbers, and probability is involved in much of this as it usually is.

As I have mentioned before, and as my website name implies, probability is at the crux of so much of decision making.

So, where to start?  In most places, most businesses are fully able to operate except where there may be too much of a crowd; places like concerts, games, restaurants. 

Let’s start with an assessment of restaurants.  Should restaurants, by and large, be open?

I will leave it to you to answer for yourself, based on my interpretation of the numbers (which will give a plausible range).

Perhaps the best way to go about this is to compare it to the flu.   The CDC estimates that about 50,000 people die each year from the flu. 

I think there should be no question that some kind of intervention had to happen.  The fact is that there have been viruses that have killed tens of millions in the past.  But with the hindsight of at least observing what has happened the last few months, is it time to allow restaurants to open and let them define their capacity, with social distancing required?  Note: I think it is too hard to require face masks to be worn (obviously, when one is eating, it is not practical).  We could require them to be worn when entering and leaving the building.

To date, about 100,000 Americans have died from COVID-19.  This is with intervention.  But many died before precautions were taken.  In other words, in the last several months, you have had a ‘mixed bag’ of people dying: of all ages, of all health types, and both before and after intervention (though we are pretty certain that intervention has helped to some extent).

Again, to focus on restaurants for now, what if we allowed people to go with social distancing but without face masks?

The great majority of the 100K who have passed away due to COVID-19 died after intervention was put in place.  But also to be considered is what part of the 100K were people who either already had other conditions, were sickly to begin with, or were in a nursing home where the chances of getting it were much greater, etc., compared to people in reasonable health who are generally going about their lives.  This is the most important piece of the puzzle in my opinion, and we are essentially missing this information; although it is possible to piece it together from various sources, it will still be inexact.

Let’s approach it this way.  Define group A as people who are either in nursing homes or will not venture out (whether due to age, sickness, or preference).

Define group B as all other people.  How many are in each group?  Just thinking of people I know and see, and having ballpark ideas of the numbers for some demographics, let’s put it at

100,000,000 people in group A and the rest (about 230,000,000) in group B.

How many of the people in group B were among the 100,000 deaths thus far?  The reason this number is important is that, by and large, only group B will be the ones venturing out to restaurants.

What if it is 50,000 people? 

50,000 people in two months  is 600,000 over a whole year, and further since there are 200,000,000 people in that group, if we extrapolate out to 330,000,000  (the population of the U.S.) to index it to the flu numbers I gave earlier, then we have upwards of 1 million people, and a twenty fold difference between that and the flu.

But it sure seems silly to assume that it is 50,000.  It is probably somewhere between 5,000 and 20,000, I surmise, and I would guess closer to 5,000.

What if it is 20,000?  Then that translates to 400,000 people per year.

What if it is 5000? That translates to 100,000 per year, and just twice the rate of the flu.

Now, this assumes the precautions of social distancing, but not using face masks.

If you were told you could go to your favorite restaurant without a face mask (of course, the social distancing would take care of itself; the restaurant would not put you too close to others), and your chance of catching and dying from this virus were eight times your chance of catching and dying from the flu, many people would still go.

If it were twice the rate, a great majority of people would go. 

In conclusion, these numbers would put the risk at somewhere between twice and eight times the rate of the flu.

There was a Gilligan’s Island episode where Wrong Way Feldman, an aviator who had a horrible sense of direction, tried to describe where the island was, and by what he told the authorities, it was somewhere between the Bay of Naples and the Arctic Ocean!  Now, my range leaves a lot to be desired, and I am certain I am not allowing for all the factors (they are too numerous to mention here), but I think I have narrowed things down a little better than Wrong Way.

These numbers are mostly meant to be a partial, “off the cuff” assessment, a plausible ‘drill down’ into what is going on.

One thing I am pretty sure of: I am glad I do not have to make the decisions related to lockdowns.

Sampling

This is the first of a three-part installment on sampling and inferences made from sampling.

As the Presidential race kicks into full gear (we hope it does anyway) we will see more and more polls come out as we get closer to election time.

There is much debate about polls, even among experts.

But to the non-expert, allow me to explain a few things. 

First, a few definitions.  A population is every item (or in many cases, person) of interest to a researcher.  For example, if a person is interested in the average age of all people who go into a store one day (and is only interested in that particular day), then the population would be all people that walk into the store that day.

If, for example, 100 people walked through that door on that day, then the 100 people would represent the population.  However, the population could also be defined as the 100 ages of the 100 people.  In other words, either the ages or the people themselves can be considered the population.

The population can be defined in any way the investigator wants.  It will normally be dependent upon his interest. 

The sample is a subset of the population.  In the example above, if you determined the ages of, say, the first 10 people in the store, then the sample would be those 10 people (or the 10 ages).

Perhaps the biggest factor that investigators want to focus on is something called bias.  Perhaps I should say the biggest pitfall one wants to avoid is that of bias, as bias can completely distort your findings.  Bias occurs when a sample does not represent the population.

Perhaps the most famous case of bias in sampling came in 1948.

We will talk more about bias (along with types of sampling) in the second part.

The third part will focus more on this year’s Presidential election (and a look back at the 2016 election).

The importance of definitions

Back in the 1980s, Bill James, the pioneer of sabermetrics (the study of statistics in baseball), wrote an article that distinguished between a player’s value at the peak of his career and his value over his career as a whole.  Of course, most people at the time (or even now) would not have thought much about this, and perhaps would have told James, “You are making this complicated; why can’t we just have a discussion?”

Why?  Because clarification and definition matter in just about any discussion.  If you are a baseball fan, you have no doubt been embroiled in debates about the greatest players of all time.  One such discussion might be “Who is the greatest left-handed pitcher in history?”  Let’s pare this down to simply a debate between Warren Spahn and Sandy Koufax.  Most people would probably say Sandy Koufax, if only because his legend was bigger.  But if you look at their career statistics, you will see that Spahn’s numbers tower over Koufax’s (and not just in longevity).  In other words, if you look at career value, it would be extremely hard to say that Koufax was better.  The point here is that unless one defines what is meant by “greatest”, it is really hard to answer the question.

This is true for just about any discussion.  Look at the situation with COVID-19.  The media and supposed experts are in the news every day (heck, every hour), and they have done a poor job, if they have tried at all, of defining what they mean by a “COVID-related death.”

If somebody is 91, has all kinds of illnesses, and essentially dies of old age while happening to have high cholesterol, is it fair to say they “died of high cholesterol”?

Even if they are assessing these deaths in the same manner as flu-related deaths, a flu-related death has never been clearly defined, and further, if we are given a definition, is it a fair one?  (By the way, allow me to make an important distinction.  When I say clearly defined, I mean talked about or written about enough to make the general public aware.  If it is written on page 24 of some dusty document, that doesn’t count as being ‘clearly defined’ in this context.)

There are just short of 80,000 deaths in the U.S. as of this writing.  How were these counted? What if somebody with an assortment of maladies, in their late eighties, tests positive for COVID-19 and passes away?  Should that count as a COVID-19 death?

I believe there are two reasonable ways to go about this.  Assume a person has three ailments and passes away.  The first way is to pick the condition we believe had the strongest impact on the death and count that as THE condition they died from.  So, in the case of a person having conditions A, B, and C, if they would have lived just two months without A (while still having B and C), but two years under each of the other two scenarios (without B, or without C), then it seems reasonable to say they died due to condition A and A only.

The second way is to assess how long they would have lived without a particular condition.  If it is believed they would have lived only a few more days without COVID-19, it seems absurd to say they died from COVID-19.  If they would have lived another year, it seems reasonable to say that they died from COVID-19.  So, what is a reasonable cutoff (i.e., threshold)?  Whatever you deem it to be, define it.  Note that by this second approach, you can die from more than one illness.

This is not important just out of curiosity.  It is important because it allows lawmakers to make reasonable decisions with regard to social distancing, etc., and further allows the citizens of this country to be informed enough to respond to those decisions appropriately.  For instance, if the flu kills 30,000 people a year by however you define it, and COVID ends up killing 40,000 this year (by the same definition), then these extreme measures are nonsensical.  But if, by that same definition, 500,000 die (or would have died without intervention), then it would appear that most of these measures have been reasonable.

Notice we go back to the word ‘reasonable’.  That word in many ways defines our decision making, doesn’t it?  It is at the crux of our court system, i.e., guilty beyond a reasonable doubt.

We all need to use reason.  Have the lawmakers used reason with these restrictions?  There are at least two problems in answering that.  Record keeping and assessment is one of them; that is a tough nut to crack, and even our best methods are going to leave debate along with uncertainty.  But the other problem is something they could have and should have easily resolved: how precisely do they define a COVID-related death?  Nobody seems to know, because nobody seems to be defining it.

It’s all about probability (well, mostly)

There was a book written several years ago by astrophysicist Mario Livio called Is God a Mathematician?  Of course, an astrophysicist (or any scientist) is going to be almost certain to think that God is a mathematician as opposed to say an English major or somebody who studies rocks. 

It seems kind of natural to think of the world in terms of mathematics.  If we explain some of the mysteries of the universe by the equation E = mc², few will take a second look.  But if we try to explain some of the mysteries of the world by Shakespeare’s “to be or not to be”, there will be quite a few heads turned.

As a mathematician myself, I am biased to a mathematics-based model of the world.  That said, one branch of mathematics that has not gotten as much hype as perhaps it should have is that of probability.  Despite my liking of mathematics, I have always had a particular affinity for probability.  Probability is narrow where mathematics is wide.

How does this relate to our current crisis?  Well, people are divided as to how quickly we should open up certain aspects of America.  Should restaurants be open now?  What about in states that have not had as big an issue as others?  Should hairdressers be allowed to conduct business?

Although it seems that the belief system is mostly partisan (Republicans want to end this lockdown, and Democrats want it extended), it does not completely go across party lines.

Rules/laws/ordinances are for the most part based in probability.  The U.S. is the “land of the free” but what exactly does that mean?  There are restrictions on what we can say.  We cannot go into a movie theatre and yell ‘fire’ (unless there really is one).

There are speed limit laws.  By imposing a law, are our freedoms being interfered with?  Some would say ‘yes’. 

Regarding speed limits (and other restrictions) society generally has a balance between risk and reward.  For instance, if the speed limits were 10 miles per hour (even on highways), there is a pretty good chance that many people would be rioting in the streets over their freedoms being infringed upon. 

If you are really not concerned about dying, you might be willing to go 150 miles per hour on a highway.  Or if given the chance, and told you had a 50% chance of dying if you took a spaceship to Mars, you might take it.  Others might think it is worth the risk.  Of course, many would not.

We all have different thresholds regarding risk.  If you could take a rocket to the moon and had a one in a million chance of dying, would you take it?  I probably would.  If the chance was one in a thousand, I probably would not.  Thus, my threshold is somewhere between 1 in a million and 1 in a thousand (of not making it alive) for going to the moon.

With regard to COVID-19, every governor has his or her own discretion about what risks should be taken.  There are some who say that, much like when we go skiing, we take risks: not only the risk that we will get hurt on our own, but also that somebody else might hurt us.  But we accept that risk.  The argument goes that we should likewise be able to take the risk of going to a restaurant, and if somebody does not want to go out, that is up to them.

But this is not an apples-to-apples comparison with skiing, because when you are skiing, you take risks with yourself and with others who are willing to take the same risk.  If you go to a restaurant, you take risks along with the others there, but the problem is that you might bring something home with you.

Now, I am not saying one way or another how I feel, though I do believe that in certain cases restrictions are too strong.  I am making the general point that people in authority, for the most part anyway, are making decisions not with the intent to infringe on anybody’s rights, but with probabilities in mind.  Whether some also have ulterior motives to purposely infringe on somebody’s rights, I don’t know their hearts.

But my point is that everyone has a different threshold in the many facets of life, and this guides their decision making.  Just like in the rocket to the moon example.

Another way of counting

https://www.news18.com/news/world/25000-missing-deaths-tracking-the-true-toll-of-the-coronavirus-crisis-in-11-countries-2587973.html

My previous post was in regard to another way of counting. I figured it was a matter of time before somebody posted an article about it. The article above uses a method I was espousing. Keep in mind, I was not suggesting that the current method overstates or understates the number of deaths due to COVID-19. I was simply advocating for a “backup” system: a method that might ‘tease out’ the number of deaths from COVID-19. This article implies that (for 11 countries, anyway) the number of COVID-19 deaths is highly understated.

Let’s keep a few things in mind. First, we really do not know that this information is accurate. Like it or not, many people (or entities) have an agenda, and we can’t necessarily take an article (even if it is a reputable newspaper) as gospel.

Secondly, this method has its drawbacks, and it is largely dependent on the volatility of the death counts in that area.

Let me give an illustration. Remember Whoville, the town created by Dr. Seuss? Well, let’s assume that 10,000 people live there, they live fairly healthy lives, they do not do drugs, they do not take crazy risks on the highway, they don’t get in fights in bars, etc. A great majority live into their 80s or 90s. Let’s assume that typically 100 die per year. (For the sake of illustration, let’s assume the birth rate is about the same as the death rate, so that the number of people living there at any time is about 10,000.) Perhaps over a five-year stretch, the number of deaths is always between 90 and 110. We would call this a non-volatile situation, and one that is generally easy to model. If 200 died one year, we would know that something is up, and if it happened in the midst of COVID-19, we could be pretty sure that roughly 100 died from COVID-19.
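That back-of-the-envelope logic, in a few lines (with made-up annual counts):

```python
# Whoville sketch: compare one year's deaths to the baseline from prior years.
prior_years = [96, 104, 99, 108, 93]     # typical years, always between 90 and 110
this_year = 200

baseline = sum(prior_years) / len(prior_years)   # about 100
excess = this_year - baseline
print(f"baseline: {baseline:.0f}, excess deaths: {excess:.0f}")   # roughly 100 excess deaths
```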

Getting back to the 11 countries, the accuracy depends a lot on the volatility. For example, did any of them go through an economic crisis, whereby there may be a number of reasons for a higher death rate: not taking care of one’s health, suicide, murder, taking unnecessary risks, etc.? To truly ‘tease out’ the number of deaths that would otherwise be expected requires a model, and as we have seen, models can be drastically off (at least it looks that way). That said, I think this article makes a good point. And I will reiterate what I said last time: why don’t the media and the experts give both methods when reporting the death toll, and let us decide which one might be more accurate? Studies have shown, by the way (maybe I will make it a post someday), that very often the average of two or more models is best!