The previous post discussed the motivation for a predictability score in Gene Heritage reports. This post dives into the definition of this predictability score.
Many statistics can evaluate how well genetic data predict a trait, but the usual statistics have two notable limitations: many only work with two possible outcomes, and others only work with traits measurable on a linear scale. Predictability scores have neither restriction on traits being binary or linear. For instance, hair color does not fit well onto a linear scale, but it has a natural categorization: black, brown, red, and blonde.
A common metric for traits on a linear scale is 'fraction of variance explained'. Predictability scores provide the same kind of intuition: a predictability score corresponds to a 'fraction of information explained'. But predictability scores are not limited to traits on a linear scale.
Two papers, [1] and [2], describe a metric almost equivalent to the predictability score, calling it 'Relative Information Gain (RIG)' and 'Gain Ratio', respectively. The predictability score corresponds to relative information gain when there is more genetic information than trait information; otherwise, the metrics differ. Another difference is that predictability scores can be computed per genetic profile, not only as a single score for the whole population.
Amounts of Information
The predictability score uses information theory to measure amounts of information. An amount of information is measured as a number of yes/no questions (bits). To see how, imagine a game of '10 questions' where you must predict a trait of each of ten randomly selected people. The trait is whether a person is in the taller or shorter half of the population. To perfectly predict the trait of all ten people, with no hints or clues, you need to ask 10 questions. The necessity of asking 10 questions measures the amount of trait information for ten people.
Now imagine you get genetic information for the ten random people. You find out whether each person has one vs two X chromosomes. Or in other words, whether each person is a genetic male or a genetic female. Knowing this genetic information, there is now a clever way to ask FEWER than 10 questions, on average. The next blog post will explain how this is possible. Males tend to have the "tall" trait and females tend to have the "short" trait. Because of this tendency, only about 8 questions are needed, on average.
The genetic information rendered 2 questions unnecessary, on average. This means genetic information has explained (or gained) 2 questions worth of trait information. For this reason, the predictability score is 2 (out of 10) in this example.
In contrast, if you got the exact height in centimeters of each person, you could skip asking 10 questions entirely. The predictability score with height information would be 10 (out of 10). At the other extreme, imagine you got useless information. You would be spared no questions. The useless information gains zero trait information. The predictability score with useless information would be 0.
Mathematical Preliminaries
The rest of this post will jump into the mathematical definition of the predictability score. Like all statistics, the predictability score is relative to some reference population. A detailed example population is given below. This example will also show a predictability score that differs from 'relative information gain' [1]. This difference is due to having more trait information (2 bits) than genetic information (1 bit). The example population will be an imaginary class of 200 students. Their heights are a convenient simplification of the distribution of heights in the USA.
| 100 Females | Height | 100 Males |
| --- | --- | --- |
| 4 of 100 | Above 177cm | 46 of 100 |
| 16 of 100 | 177cm to 170cm | 34 of 100 |
| 34 of 100 | 170cm to 163cm | 16 of 100 |
| 46 of 100 | Below 163cm | 4 of 100 |
Two random variables \(G\) and \(T\) will be used in the mathematical definition of predictability score. They represent the genetic profile and trait of a person, respectively. In the example, these random variables correspond to:
- \( G \): assignment of each student to a genetic profile of either male or female
- \( T \): assignment of each student to a trait of one of four student height quartiles
For mathematical clarity, the kernel of any random variable \( Y \) is defined as \[ \ker{Y} := \{ Y^{-1}(\{y\}) : y \mbox{ is in the image of } Y \} \] The kernel of a random variable is a partition of the population such that each part of the partition contains people with the same assignment of the random variable. In the example, the kernels are:
- \( \ker{G} \): set of two groups of students: genetic males and genetic females
- \( \ker{T} \): set of four groups of students, of equal size, one for each student height quartile
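As a sketch, the kernel construction can be written in a few lines of Python. The population here is a hypothetical list of tuples built to match the example class above; the tuple layout and quartile encoding are illustrative assumptions, not part of the definition:

```python
from collections import defaultdict

def kernel(population, rv):
    """Partition the population into parts sharing the same value of rv."""
    parts = defaultdict(set)
    for person in population:
        parts[rv(person)].add(person)
    return list(parts.values())

# Hypothetical toy population of 200 students: (id, sex, height_quartile),
# with quartile 0 the tallest. Counts follow the example class table.
quartiles = ([0]*4 + [1]*16 + [2]*34 + [3]*46 +   # 100 females
             [3]*4 + [2]*16 + [1]*34 + [0]*46)    # 100 males
population = [(i, "F" if i < 100 else "M", q) for i, q in enumerate(quartiles)]

G = lambda person: person[1]   # genetic profile: male or female
T = lambda person: person[2]   # trait: height quartile

print(len(kernel(population, G)))  # 2 parts: genetic males and females
print(len(kernel(population, T)))  # 4 parts: one per height quartile
```

Each part of `kernel(population, T)` contains 50 students, matching the equal-sized quartiles in the example.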
Mathematical Definition
For a genetic group \( g \in \ker{G} \), the predictability score for trait \( T \) is \[ \DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\P}{Pr} \DeclareMathOperator{\I}{I} \DeclareMathOperator{\Score}{PS} \DeclareMathOperator{\Eta}{Η} \DeclareMathOperator{\Pig}{PIG} \Score(T \mid g) := 10 \, { \E[\I_T \mid g] - \Eta(T \mid g) \over \Pig(T \mid g) } \] using the following components defined later in this post:
- \( \E[\I_T \mid g] \): average amount of trait information, ignoring genetic information
- \( \Eta(T \mid g) \): amount of trait information not explained by genetic information
- \( \Pig(T \mid g) \): potential information gain
Average Trait Information
Within each genetic group \( g \in \ker{G} \), the measurement of the average amount of trait information is \[ \E[\I_T \mid g] = \sum_{t \in \ker{T}} \Pr(t \mid g) \log_2{1 \over \Pr(t)} \] where the information content of \( T \) is the random variable \[ \I_T(\omega) := \log_2{1 \over \Pr(T^{-1}(\{T(\omega)\}))} \] Although this measurement is an average within genetic groups, it is not using genetic information. It is the amount of trait information, ignoring genetic information, averaged within each genetic group. In the example, this amount of trait information is 2 bits (per person) for both genetic groups. \[ \E[\I_T \mid g] = 0.04 \log_2{1 \over 1/4} + 0.16 \log_2{1 \over 1/4} + 0.34 \log_2{1 \over 1/4} + 0.46 \log_2{1 \over 1/4} = 2 \]
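This calculation can be checked numerically. A minimal Python sketch, using the quartile probabilities of the female group from the example class:

```python
from math import log2

# Marginal quartile probabilities Pr(t): each quartile holds 50 of 200 students.
pr_t = {q: 0.25 for q in range(4)}
# Conditional quartile probabilities Pr(t|g) within the female group.
pr_t_given_female = {0: 0.04, 1: 0.16, 2: 0.34, 3: 0.46}

# E[I_T | g]: average information content of the trait within the group.
avg_info = sum(p * log2(1 / pr_t[t]) for t, p in pr_t_given_female.items())
print(round(avg_info, 6))  # 2.0 bits of trait information per person
```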
Information Gain
Knowing to which genetic group a person is assigned is genetic information. After finding out genetic information, there is a remaining amount of unexplained trait information. This is the conditional entropy \[ \Eta(T \mid g) := \sum_{t \in \ker{T}} \P(t \mid g) \log_2{1 \over \P(t \mid g)} \] The difference \( \E[\I_T \mid g] - \Eta(T \mid g) \) is the information gain of trait information from knowing genetic information. In the example, the conditional entropy is \[ \Eta(T \mid g) = 0.04 \log_2{1 \over 0.04} + 0.16 \log_2{1 \over 0.16} + 0.34 \log_2{1 \over 0.34} + 0.46 \log_2{1 \over 0.46} \approx 1.65 \] which results in an information gain of about \( 2 - 1.65 = 0.35 \) bits.
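The conditional entropy and information gain for the example can likewise be verified with a short Python sketch (probabilities taken from the example class, for either genetic group):

```python
from math import log2

# Conditional quartile probabilities Pr(t|g) within one genetic group.
pr_t_given_g = [0.04, 0.16, 0.34, 0.46]

# H(T|g): trait information left unexplained after seeing the genetic group.
cond_entropy = sum(p * log2(1 / p) for p in pr_t_given_g)
avg_trait_info = 2.0  # E[I_T | g], computed in the previous section
info_gain = avg_trait_info - cond_entropy

print(round(cond_entropy, 2))  # 1.65 bits left unexplained
print(round(info_gain, 2))     # 0.35 bits gained
```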
Potential Information Gain
In the simple male/female example, there are only two genetic groups, which provide only 1 bit of genetic information. But different genetic groupings can lead to more than 1 bit of genetic information, and with more genetic information comes the potential for more information gain.
'Potential information gain' is defined as the smaller of two upper bounds on information gain. One upper bound is the average amount of trait information: intuitively, the gain in trait information cannot exceed the amount of trait information. The other upper bound is the amount of genetic information: intuitively, the information gain due to genetic information cannot exceed the amount of genetic information.
\[ \Pig(T \mid g) := \min\left( \E[\I_T \mid g], \log_2{1 \over \P(g)} \right) \] In the example, the potential information gain is \[ \Pig(T \mid g) = \min\left( 2, \log_2{1 \over 1/2} \right) = 1 \] The predictability score for both genetic groups is thus \[ \Score(T \mid g) \approx 10 \, {2 - 1.65 \over 1} \approx 3.5 \]
Overall Score
The predictability score and potential information gain were defined for each genetic group. The potential information gain can be averaged across all genetic groups for an overall potential information gain: \[ \Pig(T,G) := \sum_{g \in \ker{G}} { \P(g) \Pig(T \mid g) } \] The overall score is a weighted average across all genetic groups, weighted by aggregate potential information gain: \[ \Score(T,G) := \sum_{g \in \ker{G}} {\Pig(T \mid g) \P(g) \over \Pig(T,G)} \Score(T \mid g) \] This score is connected to mutual information, \( \I(T;G) \), as follows: \[ \Score(T,G) = 10 \, { \I(T;G) \over \Pig(T,G) } \]
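Putting the pieces together, the whole calculation for the example class can be sketched in Python, including a cross-check of the mutual-information identity (the dictionary layout is an illustrative choice):

```python
from math import log2

# Example class: Pr(g) and Pr(t|g) for the two genetic groups.
groups = {
    "female": {"pr": 0.5, "pr_t": [0.04, 0.16, 0.34, 0.46]},
    "male":   {"pr": 0.5, "pr_t": [0.46, 0.34, 0.16, 0.04]},
}
pr_t = [0.25, 0.25, 0.25, 0.25]  # marginal quartile probabilities Pr(t)

def entropy(ps):
    return sum(p * log2(1 / p) for p in ps if p > 0)

# Per-group predictability scores PS(T|g) and potential gains PIG(T|g).
scores, pigs = {}, {}
for name, grp in groups.items():
    avg_info = sum(p * log2(1 / q) for p, q in zip(grp["pr_t"], pr_t))
    gain = avg_info - entropy(grp["pr_t"])
    pigs[name] = min(avg_info, log2(1 / grp["pr"]))
    scores[name] = 10 * gain / pigs[name]

# Overall potential information gain and overall score.
pig_overall = sum(grp["pr"] * pigs[name] for name, grp in groups.items())
score = sum(grp["pr"] * pigs[name] / pig_overall * scores[name]
            for name, grp in groups.items())

# Cross-check against the mutual-information form: I(T;G) = H(T) - H(T|G).
mutual_info = entropy(pr_t) - sum(grp["pr"] * entropy(grp["pr_t"])
                                  for grp in groups.values())
print(round(score, 2), round(10 * mutual_info / pig_overall, 2))  # 3.47 3.47
```

Both routes give the same overall score of about 3.5, matching the per-group calculation above.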
Real Examples
Below are results using height and lifespan statistics of males and females in the United States. One of the nice features of the predictability score is that as the resolution of trait information increases, the predictability score approaches a limit. Below are approximate predictability score limits:
- 9.99: female/male sexual anatomy predictability based on one vs two X chromosomes
- 4.5: height predictability based on one vs two X chromosomes
- 0.2: lifespan predictability based on one vs two X chromosomes
References
1. Yee J, Kwon MS, Park T, et al. A modified entropy-based approach for identifying gene-gene interactions in case-control study. PLoS One 2013; 8(7):e69321.
2. Dong C, Chu X, Wang Y, et al. Exploration of gene-gene interaction effects using entropy-based methods. Eur J Hum Genet 2008; 16:229–35.