Game of Thrones: Who is most likely to die?

Otmane El aloi
14 min read · Aug 2, 2021
Of course, it goes without saying that this analysis contains spoilers ;)

Game of Thrones: Quick shout-out ❤️

GOT is truly one of the best TV shows I’ve ever seen. The quality of the effects, the fictional world, the development of the characters… these things make it outstanding. But what I admired most is the diversity of the stories. It felt as if I were watching multiple series at the same time: each character has their own story, their own world, and their own vision of the game.

Over the course of eight seasons, GOT has featured thousands of deaths, from battlefield extras to major characters. One thing that motivated me to do this study is the unpredictable way characters die. More than that, some characters simply can’t die, much like the king in a chess match: you can only endanger him, because if he dies the game loses its meaning and just ends. Speaking of the end, most GOT fans didn’t find it satisfying. The death of Daenerys Targaryen was unpredictable, and Jon Snow, who is actually Aegon Targaryen and had the best claim to the Iron Throne, didn’t get it. Instead, he was sent back to the Night’s Watch as punishment for killing Dany.

Jon kills Dany 😢

Observational statistical study:

The goal of this study is to estimate the survival time of Game of Thrones (GOT) characters, including the median survival time, to assess the association of covariates with death events and their impact, and finally to predict survival status.

Methodology:

The univariate descriptive analysis will use a non-parametric procedure, the Kaplan-Meier (KM) method, to estimate the overall survival function. The Cox proportional hazards model will be used for the multivariate analysis to determine the possible associations of the predictor variables with survival. Finally, a survival decision tree analysis will complete the description of the predictor variables.

Data description:

The data I am working with summarizes information from all 8 seasons for each of 359 characters. Other data sets contain more characters (up to 2000) because they also gather data from the books. Unfortunately, there are some differences between the novel series and the TV show. Below are the ones that concern our study event: death. (Many other differences can be found in the following post; those below were taken from the same website.)

GOT books

Differences between books and TV show:

  • Mance Rayder is still alive in the books, kind of, despite being burned alive on the show (and shot with an arrow in the heart by Jon (*-^)).
The great Mance :)
  • Jorah Mormont never gets greyscale in the books like he does in the show during seasons five and six. (Having a disease is another covariate that could increase the hazard of death. Unfortunately, the data doesn’t record this information.)
Sam performing the impossible (possible) operation on Mormont.
  • Sansa Stark is nowhere near Winterfell in the books, has never met Jon either, and was never married to Ramsay. (I think being in the company of Ramsay increases your risk of death. This won’t be taken into consideration in the survival analysis part. Instead, belonging to a particular community (allegiance) will be examined.)
  • In season six, it appears that the Tyrell line is wiped out as Margaery, Loras, and Mace die in the Sept of Baelor. But in the books, there are two other Tyrell sons who aren’t dead.
  • Sam and Gilly go to Oldtown in the books as well as in the show, but they go with a living Maester Aemon. Well, temporarily living, since he dies on the way. (This does change the date of death for Maester Aemon though!)
Maester Aemon (Targaryen as well)

So for this study I’ll stick to the 359-character data set!

Selected features for the study:

  • name
  • sex:
Variable description provided with the data

The number of male characters on the show is twice the number of female characters. Surprising, right? (I didn’t notice that when I was watching it.)

  • religion:
Source: https://mbird.com/conferences/what-do-we-say-to-the-god-of-death-when-ministry-forces-us-to-answer-today-a-conference-preview/

Exploiting this variable, I’ll try to answer our main question:

Who is most likely to die, taking into consideration their religious belief only?

Variable description provided with the data

As we can see, most of the characters have no religious affiliation, or maybe the TV show just didn’t give us any clue about it. So, to keep it simple, I’ll divide the sample into religious and non-religious characters, which gives us a more balanced dataset.

  • allegiances:
Source: https://twitter.com/rissputra/status/1140806177007132672/photo/1

GOT has shown how important it is to have allies to rely on: you can’t make it to the Iron Throne on your own.

Exploiting this variable, I’ll try to answer our main question:

Who is most likely to die, taking into consideration their allegiance only?

Source : https://technabob.com/blog/2015/04/15/game-of-thrones-banners/
Variable description provided with the data
  • social_status:
Variable description provided with the data

In GOT, social status is everything. Being highborn commits you to a certain lifestyle, and you find yourself forced to follow the customs, manners, and traditions of the society you were born into. Not respecting the rules dictated by your social status could put you in real danger and even kill you. If you are born in the wrong place, your fate is sealed, and death is just a matter of time.

Exploiting this variable, I’ll try to answer our main question:

Who is most likely to die, taking into consideration their social status only?

58.7% of lowborn characters have died in the show versus 59.82% of highborn characters. The question I’ll answer is not which category has the largest ratio of deaths, but which category has the largest hazard of death at a specific time (episode).

  • dth_flag, dth_episode.

The final data looks like this:

For characters who didn’t die by the end of the series, the death episode is equal to NaN. For the analysis, I’ll assume that our study runs from episode 1 to episode 73, so surviving characters are treated as right-censored data. Thus, NaN values will be replaced by 73, which is the last episode.

And as I previously explained, religion will be transformed into a binary variable, because we are only interested in religious versus non-religious characters. Taking each specific religion into consideration would only add complexity without adding anything to the analysis.
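The two preparation steps described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the records are made up, and treating a missing or empty religion value as "non-religious" is a guess about how the data encodes it. Only the column names dth_flag, dth_episode and religion come from the article.

```python
import math

LAST_EPISODE = 73  # the study window ends with the last episode

def prepare(record):
    """Return a copy of one character record, ready for survival analysis."""
    row = dict(record)
    episode = row.get("dth_episode")
    # Survivors have no death episode: treat them as right-censored at episode 73.
    if episode is None or (isinstance(episode, float) and math.isnan(episode)):
        row["dth_episode"] = LAST_EPISODE
    # Collapse the religion categories into religious (1) vs. non-religious (0).
    # Assumption: None or an empty string means "no known religion".
    row["religion"] = 1 if row.get("religion") else 0
    return row

# Made-up records, for illustration only.
characters = [
    {"name": "Character A", "dth_flag": 1, "dth_episode": 9.0, "religion": "Old Gods"},
    {"name": "Character B", "dth_flag": 0, "dth_episode": math.nan, "religion": None},
]
prepared = [prepare(c) for c in characters]
```

The dth_flag column is left untouched, so it still distinguishes a real death at episode 73 from a censored survivor.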

Univariate analysis:

Survival analysis is generally defined as a set of methods for analysing data where the outcome variable is the time until the occurrence of an event of interest, in our case the event is the death of a character.

In this section, the KM and NA (Nelson-Aalen) estimators will be used to obtain univariate descriptive statistics for the survival data, including the survival function, the cumulative hazard function, and the median survival time. They’ll also be used to compare the survival experience by allegiance, social status, sex and religion (one feature at a time).

The survival function is simply S(t) = P(T > t):

If T is the time to death, S(t) is the probability that the death event occurs after time t.
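To make the estimator concrete, here is a minimal pure-Python sketch of the Kaplan-Meier method together with the median survival time. It is not the code used for the analysis, and the durations and events below are made-up toy values.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct death time, in increasing order."""
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # dead or censored, everyone leaves the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk characters who survive time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve

def median_survival(curve):
    """First time at which the estimated survival drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # the curve never reaches 0.5

# Toy data: episode of death or censoring, and 1 if the death was observed.
durations = [2, 3, 3, 5, 8, 8]
events = [1, 1, 0, 1, 0, 0]
curve = kaplan_meier(durations, events)
median = median_survival(curve)
```

Censored characters never lower the curve themselves; they only shrink the risk set used at later death times.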

1- How does the gender affect the probability of death?

To statistically quantify the difference between the two survival curves, I performed a log-rank test.

  • H0: The null hypothesis claims that the two survival functions are equal.

The log-rank test results are as follows:

Given the p-value, we can reject the null hypothesis, which means there is a significant difference between the survival function of women and that of men.
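For readers curious about what the test computes, below is a minimal sketch of a two-group log-rank test in plain Python. It is a textbook version written for illustration, not the library routine behind the results above, and the two groups are invented.

```python
import math

def logrank_test(durations_a, events_a, durations_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    event_times = sorted(
        {t for t, e in zip(durations_a, events_a) if e}
        | {t for t, e in zip(durations_b, events_b) if e}
    )
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for d in durations_a if d >= t)  # group A still at risk
        n_b = sum(1 for d in durations_b if d >= t)  # group B still at risk
        d_a = sum(1 for d, e in zip(durations_a, events_a) if e and d == t)
        d_b = sum(1 for d, e in zip(durations_b, events_b) if e and d == t)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n  # deaths expected in A if H0 holds
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    statistic = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Toy groups: A dies early, B dies late (all deaths observed).
early = [1, 2, 3, 4, 5]
late = [20, 21, 22, 23, 24]
statistic, p_value = logrank_test(early, [1] * 5, late, [1] * 5)
```

A small p-value means that one group saw more deaths than expected under equal survival functions at the times deaths occurred.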

Above is the estimate of the cumulative hazard function for both women and men. As we can see, men are more likely to die than women, and the danger increases toward the end of the series.
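The cumulative hazard curves come from the Nelson-Aalen estimator, which adds up the fraction of at-risk characters dying at each death time. Here is a minimal sketch on made-up data, not the code used to produce the plots.

```python
from collections import Counter

def nelson_aalen(durations, events):
    """Return [(t, H(t))]: the cumulative hazard at each distinct death time."""
    deaths = Counter(t for t, e in zip(durations, events) if e)
    leaving = Counter(durations)  # deaths and censorings both leave the risk set
    at_risk = len(durations)
    cumulative_hazard = 0.0
    steps = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Increment: deaths at t divided by the number still at risk at t.
            cumulative_hazard += d / at_risk
            steps.append((t, cumulative_hazard))
        at_risk -= leaving[t]
    return steps

# Toy data: episode of death or censoring, and 1 if the death was observed.
toy_durations = [2, 3, 3, 5, 8, 8]
toy_events = [1, 1, 0, 1, 0, 0]
hazard_curve = nelson_aalen(toy_durations, toy_events)
```

A curve that bends upward near the end, as described for men, means that the per-episode risk is growing in the late episodes.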

Conclusion: Women in GOT have a larger survival probability than men.

2- How does the religion affect the probability of death?

The log-rank test up to episode 50 gives a p-value of 0.26, so we can’t reject the null hypothesis: there is no significant difference between the survival functions of religious and non-religious characters up to episode 50. But by the end of the series, being non-religious gives you a better chance of living, from this univariate point of view.

The cumulative hazard of death increases drastically for religious characters by the end of the series.

Conclusion: Characters with no religious belief in GOT have a larger survival probability than characters with religious belief, especially during the last 10 episodes.

3- How does the allegiance affect the probability of death?

Principal allegiances:

  • Stark
  • Targaryen
  • Night’s Watch
  • Lannister
  • Greyjoy
  • Bolton
  • Frey

3–1 Do all principal allegiances share the same survival curve?

To answer this question, I performed a multivariate log-rank test. The result is as follows:

Given the calculated p-value, we can reject the null hypothesis and conclude that the principal allegiances do not all share the same survival function (at least one differs from the others). The plot below shows the KM estimates of the survival functions for the first 4 allegiances:

The survival functions of the Starks, the Targaryens, and the Lannisters may look similar in the plot, but that visual impression is not supported statistically. In fact, performing a multivariate log-rank test on them, we get:

The p-value is less than 0.05, so we reject the null hypothesis: the test is statistically significant, and the survival functions are different.

Visually, belonging to the Night’s Watch increases the danger of death across all eight seasons. Being a Lannister offers you a better chance of living at the beginning of the series than in the final episodes, and the exact opposite holds for the Starks and the Targaryens.

3–2 Is there any difference in the probability of survival between characters who belong to an allegiance and others who don’t?

Belonging to an allegiance puts Game of Thrones characters in more danger. In fact, 50% of the characters that belong to an allegiance have already died by the 60th episode.

4- How does the social status affect the probability of death?

Highborn characters have a larger probability of survival compared to lowborn characters.

Multivariate analysis:

Since we want to understand the impact of all the previous variables on the survival time, a risk regression model is more appropriate. The most commonly used risk regression model is Cox’s proportional hazards model.

The following section is quite technical. If you are only interested in the analysis part, feel free to jump to the next section. :)

1- Verifying the proportional hazard assumption:

The proportional hazards assumption is that all individuals have proportional hazard functions and that the proportionality factor doesn’t vary over time. The only time-varying term is the baseline hazard.
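Written out, the assumption is usually stated as follows. The notation is the standard textbook one and is an assumption here rather than something taken from the article: h_0 is the baseline hazard, \beta the coefficient vector, and x a character's covariates.

```latex
% Cox proportional hazards model: only the baseline hazard h_0 depends on t.
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^{\top} x\right)

% The hazard ratio of two characters a and b is therefore constant over time.
\frac{h(t \mid x_a)}{h(t \mid x_b)} = \exp\!\left(\beta^{\top}(x_a - x_b)\right)
```

If the ratio on the second line changes from episode to episode for some covariate, that covariate violates the assumption.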

To check the proportionality assumption, I performed the following test.

Variable 'social_status' failed the non-proportional test: p-value is 0.0076.
Variable 'allegiance_last' failed the non-proportional test: p-value is 0.0399.

A visual test was also performed:

Schoenfeld residuals:

The plots above show a non-random pattern against time. The violation of the PH assumption is clearer for the social_status covariate than for allegiance_last.

To take the time dependence into consideration, I’ll add time interaction terms to the model: allegiance_last*dth_episode and social_status*dth_episode.

Results before and after model correction:

Results before model correction
Results after model correction

The interaction term that includes the allegiance_last covariate doesn’t have a big effect on the hazard of death. In fact, it only multiplies the hazard of death by a factor of 1.03, which is roughly a 3% increase. Also, the p-value for this coefficient is 0.05, so the test is not statistically significant at the 5% level and we can’t reject the null hypothesis that the coefficient of the interaction term (time*allegiance) is zero.

For the interaction term that includes social_status, on the other hand, the p-value is less than 0.05, so its coefficient is significantly different from 0.

That’s why, for the rest of the analysis, I’ve only added an interaction term for the social_status covariate.

2- Multivariate Cox analysis :

The p-value for the likelihood-ratio test is significant, indicating that the model as a whole is significant. So the null hypothesis that all coefficients are zero (a model with no covariates) is soundly rejected.

In the multivariate Cox analysis, the covariates allegiance_last, sex, social_status, religion and the interaction term remain significant (p-value < 0.05). However, the effect of social_status now depends on time (dth_episode). In fact, being highborn, which is equivalent to social_status = 1, multiplies the hazard of death by a factor of 3.98, a significant contribution that increases as we progress through the series.

Conclusion: Being highborn puts you in real danger, especially by the end of the series. But the KM curves suggested a larger survival probability for highborn characters. To explain this contradiction, I’ll investigate the relationship between the covariates.

Relationship between covariates:

Chi2 test p-values.

The p-values of the Chi2 tests between sex and religion, and between sex and social status, are larger than 0.05, so we can’t reject the null hypothesis of independence: we find no evidence that a character’s religious belief or social status is related to their sex. In contrast, social status is related to religion and allegiance. Taking only the univariate analysis into consideration, being highborn increases the chances of living while belonging to an allegiance decreases them. But the test above suggests a dependence between these two variables, so any conclusion about an increase or decrease in the chances of death should be drawn from a multivariate point of view.
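As an illustration of the test behind these p-values, here is a minimal sketch of Pearson's chi-square test of independence for a 2x2 contingency table. The counts are invented, and no continuity correction is applied (some libraries apply one by default for 2x2 tables), so treat it as a sketch and not a reproduction of the reported values.

```python
import math

def chi2_independence_2x2(table):
    """Pearson chi-square test for a 2x2 table; returns (statistic, p-value)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # A 2x2 table has 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Invented counts: rows = lowborn/highborn, columns = no allegiance/allegiance.
dependent_table = [[30, 10], [10, 30]]
chi2_stat, chi2_p = chi2_independence_2x2(dependent_table)
```

A p-value above 0.05 only means that independence can't be rejected; it does not prove the two variables are unrelated.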

The allegiance_last hazard ratio indicates a strong relationship between belonging to an allegiance and an increased chance of death. For example, holding the other covariates constant, belonging to an allegiance (allegiance_last = 1) multiplies the hazard of death at each episode by a factor of 2.03.

The sex hazard ratio, in turn, indicates a strong relationship between the sex variable and a decrease in the risk of death: being a woman is associated with a 50% reduction in the hazard of death.
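A short arithmetic sketch may help when reading hazard ratios. The value 2.03 is the one reported above, 0.50 is the ratio implied by the 50% reduction, and the helper names are made up for this example.

```python
import math

def percent_change_in_hazard(hazard_ratio):
    """Percent change in the hazard implied by a hazard ratio."""
    return (hazard_ratio - 1.0) * 100.0

def coefficient(hazard_ratio):
    """Cox coefficient (log hazard ratio) behind a hazard ratio."""
    return math.log(hazard_ratio)

allegiance_hr = 2.03  # reported for allegiance_last = 1
sex_hr = 0.50         # implied by the 50% reduction for women

# In a Cox model hazard ratios multiply. With all other covariates equal,
# a woman with an allegiance compared with a man without one gives:
combined_hr = allegiance_hr * sex_hr
```

Under these numbers the two effects almost cancel out, since the combined ratio is close to 1.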

3- Survival trees:

As we’ve seen earlier, Cox regression models are good for analyzing survival data, but their assumptions are easily violated. In this section I’ll use decision-tree models, since they provide decision rules that are easy to display and interpret and that help identify important covariates. Generally, in tree-based models, the data is recursively partitioned based on a splitting criterion, and characters that are similar with respect to the event of interest are placed in the same node. The main difference between a traditional decision tree and a survival tree is the splitting criterion. Here I’ll use the same criterion as in the univariate analysis, the famous log-rank test.

To reduce variance, a forest of survival decision trees was trained on all the characters’ data.
There are no prespecified assumptions regarding the variables, and randomization is introduced into the model by bootstrap sampling of characters from the original data. For the allegiance variable, I couldn’t enforce that each character belongs to a single allegiance. In fact, including all the allegiances leads to illogical branches in the tree, where a character would belong to two or three allegiances at the same time. (If you know how to handle this, let me know in the comments. I’ll be thankful 😊). So, to overcome this practical limitation, I’ve only considered whether or not a character belongs to an allegiance.

The target variable for survival trees is a structured array with two fields: the death event indicator as the first field and the time of its occurrence as the second field.

For hyperparameter optimization, I conducted a grid search with cross-validation. The final model has a concordance index of 0.59 (0.50 corresponds to a random model).
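To give a feel for what a concordance index of 0.59 measures, here is a simplified pure-Python sketch of Harrell's C-index. It ignores ties in time and is not the library implementation behind the reported score. The toy values are invented.

```python
def concordance_index(durations, events, risk_scores):
    """Share of comparable pairs in which the earlier death has the higher risk."""
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        if not events[i]:
            continue  # a pair must be anchored on an observed death
        for j in range(n):
            # j is comparable with i if j was still alive when i died.
            if durations[j] > durations[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    return concordant / comparable

# Toy data: the last character is censored at episode 4.
toy_times = [1, 2, 3, 4]
toy_deaths = [1, 1, 1, 0]
perfect = concordance_index(toy_times, toy_deaths, [4, 3, 2, 1])
reversed_order = concordance_index(toy_times, toy_deaths, [1, 2, 3, 4])
uninformative = concordance_index(toy_times, toy_deaths, [1, 1, 1, 1])
```

A score of 0.59 therefore means the model orders comparable pairs correctly a bit more often than a coin flip would.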

The most important variables were identified by permutation importance, which is defined as the decrease in a model score when a single feature’s values are randomly shuffled.

The result shows that sex is by far the most important feature. If its relationship to survival time is removed (by random shuffling), the concordance index on the test data drops on average by 0.0421 points.
Allegiance comes in second place, social status comes next, and religion comes last as the least important variable.
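The shuffling idea can be sketched generically in plain Python. Everything below is a toy under stated assumptions: the rows, the "signal" and "noise" columns and the accuracy-based score are invented for illustration, whereas the analysis above scored a survival forest with the concordance index.

```python
import random

def permutation_importance(score_fn, rows, column, n_repeats=20, seed=0):
    """Average drop in score_fn when the values of one column are shuffled."""
    rng = random.Random(seed)
    baseline = score_fn(rows)
    drops = []
    for _ in range(n_repeats):
        values = [row[column] for row in rows]
        rng.shuffle(values)  # break the link between this column and the outcome
        shuffled = [dict(row, **{column: v}) for row, v in zip(rows, values)]
        drops.append(baseline - score_fn(shuffled))
    return sum(drops) / n_repeats

def toy_score(rows):
    """Accuracy of a fixed toy rule that predicts death from 'signal' alone."""
    return sum(row["signal"] == row["died"] for row in rows) / len(rows)

# Invented rows: 'signal' determines the outcome, 'noise' is never used.
toy_rows = [
    {"signal": s, "noise": k % 2, "died": s}
    for k, s in enumerate([1, 1, 1, 1, 0, 0, 0, 0])
]
signal_importance = permutation_importance(toy_score, toy_rows, "signal")
noise_importance = permutation_importance(toy_score, toy_rows, "noise")
```

Shuffling a column the model relies on lowers the score, while shuffling a column it ignores leaves the score unchanged. This is why the measure reflects usefulness to the model and not a causal effect.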

N.B.: the importance above has a different meaning from the importance identified in the Cox regression model. The latter tells directly how each variable contributes to the hazard of death, while the permutation importance calculated with the survival tree model tells how much each variable matters for the model’s performance.

Below are the predicted risk scores for some selected characters:

The risk score is the total number of events, which can be estimated by the sum of the estimated cumulative hazard function in the terminal node.

Here n(h) denotes the number of distinct event times of the samples belonging to the same terminal node as the studied character (x).
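Based on the description above, the risk score can be written as follows. This is a reconstruction with assumed notation: \hat{H}_h is the cumulative hazard estimated in terminal node h, and t_1 < ... < t_{n(h)} are the distinct event times of the samples in that node.

```latex
% Risk score of character x, which falls into terminal node h
r(x) = \sum_{j=1}^{n(h)} \hat{H}_h(t_j)
```

For a forest, the same sum would be taken over the cumulative hazard averaged across trees, which is an assumption about how the ensemble combines them.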

Cersei, Dany and Arya have the lowest risk scores among those characters, which is quite understandable: Dany and Cersei were powerful until the end of the TV show, and Arya never died.

Khal Drogo has the largest risk score, which is also understandable since he died early in the TV show.

The End:

All men must die: “كل نفس ذائقة الموت” (“Every soul shall taste death”)

References :

data source : https://figshare.com/articles/dataset/Game_of_Thrones_mortality_and_survival_dataset/8259680/1


Written by Otmane El aloi

Hi! I am an engineering student, doing applied mathematics for data science as a major. I like learning new things.
