The Missingno Experiment and Multiple Form Pokemon

Wild missingno appeared!

[Image: battle entry animation of ‘Wild MISSINGNO appeared!’ from Pokemon Red/Blue]

Missingno is the patron Pokemon of data science. You’re just casually surfing up and down your data, doing some sweet coding, when suddenly a bunch of missing and corrupted data gets in your way, and you find a bunch of random items in your bag for no reason. OK, well maybe I just have a messy bag.

The valuable part of this metaphor is the part where you battle Missingno, and win. I’ve been doing this with my Pokedex project recently, to try and iron out what data I can rely on from my data source, and what’s a bit patchy.

library(pokedex)
library(tidyverse)
library(naniar)
library(skimr)

Go, Skimr!

Skimr gives us a text-based summary view. As well as the basics on data set size, it shows some summary statistics, but most valuably it tells us how many values are missing, and in which columns.

pokemon %>% 
  skimr::skim()
Data summary

Name                      Piped data
Number of rows            807
Number of columns         24
Column type frequency:
  character               8
  list                    1
  numeric                 15
Group variables           None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
identifier             0           1.00    3   20      0       807           0
type_1                 0           1.00    3    8      0        18           0
type_2               402           0.50    3    8      0        18           0
name                   0           1.00    3   12      0       807           0
genus                  0           1.00   11   21      0       589           0
color                 20           0.98    3    6      0        10           0
shape                 20           0.98    4    9      0        14           0
habitat              422           0.48    3   13      0         9           0

Variable type: list

skim_variable  n_missing  complete_rate  n_unique  min_length  max_length
flavour_text           0              1       807           1           1

Variable type: numeric

skim_variable            n_missing  complete_rate    mean      sd    p0    p25  p50    p75   p100  hist
id                               0           1.00  404.00  233.11   1.0  202.5  404  605.5  807.0  ▇▇▇▇▇
species_id                       0           1.00  404.00  233.11   1.0  202.5  404  605.5  807.0  ▇▇▇▇▇
height                           0           1.00    1.16    1.08   0.1    0.6    1    1.5   14.5  ▇▁▁▁▁
weight                           0           1.00   61.77  111.52   0.1    9.0   27   63.0  999.9  ▇▁▁▁▁
base_experience                  0           1.00  144.85   74.95  36.0   66.0  151  179.5  608.0  ▇▇▁▁▁
is_default                       0           1.00    1.00    0.00   1.0    1.0    1    1.0    1.0  ▁▁▇▁▁
hp                               0           1.00   68.75   26.03   1.0   50.0   65   80.0  255.0  ▃▇▁▁▁
attack                           0           1.00   76.09   29.54   5.0   55.0   75   95.0  181.0  ▂▇▆▂▁
defense                          0           1.00   71.73   29.73   5.0   50.0   67   89.0  230.0  ▅▇▂▁▁
special_attack                   0           1.00   69.49   29.44  10.0   45.0   65   90.0  173.0  ▃▇▅▂▁
special_defense                  0           1.00   70.01   27.29  20.0   50.0   65   85.0  230.0  ▇▇▂▁▁
speed                            0           1.00   65.83   27.74   5.0   45.0   65   85.0  160.0  ▃▇▆▂▁
generation_id                   20           0.98    3.67    1.94   1.0    2.0    4    5.0    7.0  ▇▅▃▅▅
evolves_from_species_id        426           0.47  364.35  232.43   1.0  156.0  345  570.0  803.0  ▇▆▅▆▅
evolution_chain_id              20           0.98  195.96  124.57   1.0   84.0  187  303.0  427.0  ▇▆▅▆▅

I was expecting some missing data in type_2 and evolves_from_species_id, but I wasn’t expecting only half of habitat to be there. Either I broke something in my data pipeline, or the data wasn’t there to begin with. color, shape, generation_id and evolution_chain_id are all missing 20 entries each, which is a bit of a coincidence. I wonder if they are all missing from the same Pokemon?
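Before reaching for a plot, that question can be checked directly by comparing the missingness pattern of the columns row by row. The sketch below uses only base R and a small made-up data frame (toy and its values are invented for illustration, not taken from the package); with the real data you would use pokemon in its place.

```r
# Toy stand-in for the pokemon table; the values are invented for illustration.
toy <- data.frame(
  name          = c("a", "b", "c", "d"),
  color         = c("red", NA, "blue", NA),
  shape         = c("ball", NA, "fish", NA),
  generation_id = c(1, NA, 2, NA),
  habitat       = c(NA, "cave", NA, NA),
  stringsAsFactors = FALSE
)

cols <- c("color", "shape", "generation_id")

# Logical matrix: TRUE wherever a value is missing.
na_matrix <- is.na(toy[, cols])

# How many of the chosen columns are missing in each row.
n_missing_per_row <- rowSums(na_matrix)

# If the columns are always missing together, every row is missing
# either none of them or all of them.
all_or_nothing <- all(n_missing_per_row %in% c(0, length(cols)))

# Names of the rows where every chosen column is missing.
rows_all_missing <- toy$name[n_missing_per_row == length(cols)]

print(table(n_missing_per_row))
print(all_or_nothing)
print(rows_all_missing)
```

If all_or_nothing comes back TRUE on the real table, the columns really are missing from the same rows, and rows_all_missing tells you which ones.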

Visdat I choose you!

visdat is a package that helps you visualise missing data and data types.

visdat::vis_dat(pokemon)

This clearly shows us the data type of each column, and where values are missing in context. It looks like habitat might just not be available after a certain point. It also looks like color, shape, generation_id and evolution_chain_id might all be missing from the same individual Pokemon.

Go, Naniar!

Naniar helps us check, through plots, where relationships between missing values and other variables might occur. Let’s first check whether there is a relationship between generation_id and evolution_chain_id.

pokemon %>%
  ggplot(aes(generation_id, evolution_chain_id)) +
  geom_miss_point()

This plot might need a little explanation. The blue Not Missing values are drawn as a normal geom_point(). The pink Missing values are deliberately placed below the minimum of the axis they are missing on, and then ‘jittered’ to avoid over-plotting. The little cluster in a line at the far bottom left shows that wherever evolution_chain_id is missing, generation_id is also missing. Let’s have a look at the evolves_from_species_id variable just to help us understand.
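The idea behind that plot can be sketched by hand. This is not naniar’s implementation, just a minimal base R illustration of the principle: shift_missing(), prop_below and jitter_amount are names made up for this example, and the 10% offset is an assumption chosen to make the effect visible.

```r
# Hypothetical helper, not part of naniar: replace NAs with values placed
# below the observed minimum so they can be drawn on an ordinary scatter plot.
shift_missing <- function(x, prop_below = 0.1, jitter_amount = 0.05) {
  rng <- range(x, na.rm = TRUE)
  span <- diff(rng)
  is_missing <- is.na(x)
  # Anchor point: a fraction of the observed range below the minimum.
  base <- rng[1] - prop_below * span
  # Small random offsets so the shifted points do not sit on top of each other.
  noise <- runif(sum(is_missing), -jitter_amount, jitter_amount) * span
  x[is_missing] <- base + noise
  x
}

set.seed(1)
generation_id <- c(1, 2, NA, 4, NA, 7)
shifted <- shift_missing(generation_id)
print(shifted)
```

The observed values are left untouched, and the two missing ones end up just under the smallest observed value, which is why they show up as a band below or to the left of the real data.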

pokemon %>%
  ggplot(aes(evolves_from_species_id, generation_id)) +
  geom_miss_point()

This shows that in every game generation (Red/Blue, X/Y, etc.) there are Pokemon that have an evolves_from_species_id, i.e. they have a precursor Pokemon, and there are also Pokemon that don’t have a precursor. Just what we see in the games. It also shows that some Pokemon have neither a generation_id nor an evolves_from_species_id.

Who is that Pokemon?

Now that we know the characteristics of the missing data we are interested in, we can pull those rows out easily, especially with the newly released across() function.

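# Columns that should all be missing together
missing_cols <- c("color", "shape", "generation_id", "evolves_from_species_id")

# Keep rows where every one of those columns is NA.
# all_of() makes it explicit that missing_cols is a character vector of names.
pokemon %>% 
  filter(across(all_of(missing_cols), ~is.na(.x))) %>% 
  select(name, identifier, all_of(missing_cols)) -> missing_pokes

missing_pokes %>% knitr::kable()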
name        identifier            color  shape  generation_id  evolves_from_species_id
Deoxys      deoxys-normal         NA     NA     NA             NA
Wormadam    wormadam-plant        NA     NA     NA             NA
Giratina    giratina-altered      NA     NA     NA             NA
Shaymin     shaymin-land          NA     NA     NA             NA
Basculin    basculin-red-striped  NA     NA     NA             NA
Darmanitan  darmanitan-standard   NA     NA     NA             NA
Tornadus    tornadus-incarnate    NA     NA     NA             NA
Thundurus   thundurus-incarnate   NA     NA     NA             NA
Landorus    landorus-incarnate    NA     NA     NA             NA
Keldeo      keldeo-ordinary       NA     NA     NA             NA
Meloetta    meloetta-aria         NA     NA     NA             NA
Meowstic    meowstic-male         NA     NA     NA             NA
Aegislash   aegislash-shield      NA     NA     NA             NA
Pumpkaboo   pumpkaboo-average     NA     NA     NA             NA
Gourgeist   gourgeist-average     NA     NA     NA             NA
Oricorio    oricorio-baile        NA     NA     NA             NA
Lycanroc    lycanroc-midday       NA     NA     NA             NA
Wishiwashi  wishiwashi-solo       NA     NA     NA             NA
Minior      minior-red-meteor     NA     NA     NA             NA
Mimikyu     mimikyu-disguised     NA     NA     NA             NA
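A note for anyone running this on a more recent dplyr: using across() inside filter() was later deprecated in favour of if_all() and if_any(). The same "every column is NA" filter can be sketched as below, assuming dplyr 1.0.4 or newer. The tibble toy is a made-up stand-in for the real table.

```r
library(dplyr)

# Made-up stand-in for the pokemon table.
toy <- tibble(
  name          = c("Bulbasaur", "Deoxys", "Pikachu"),
  color         = c("green", NA, "yellow"),
  shape         = c("quadruped", NA, "quadruped"),
  generation_id = c(1, NA, 1)
)

missing_cols <- c("color", "shape", "generation_id")

# Keep rows where every listed column is NA.
# all_of() tells tidyselect that missing_cols is a character vector of names.
all_missing <- toy %>%
  filter(if_all(all_of(missing_cols), is.na)) %>%
  select(name, all_of(missing_cols))

print(all_missing)
```

Swapping if_all() for if_any() would instead keep rows where at least one of the columns is missing, which is handy for a looser first pass.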

So it looks like, in the current version of the package, these Pokemon all have ‘complex’ identifiers. This is because they all have multiple forms. Some vary by colour, like Basculin, which can be Red-Striped or Blue-Striped; others transform through an ability, like Aegislash; and others depend on which game they were caught in, like Deoxys.

missing_pokes %>% 
  pull(name) %>% 
  stringr::str_to_lower(.) -> missing_pokes_name_list

pokedex$pokemon_species %>%
  filter(
    stringr::str_to_lower(identifier) %in% missing_pokes_name_list
    ) %>% 
  select(identifier, generation_id, evolves_from_species_id, shape_id, color_id) %>% 
  knitr::kable()
identifier  generation_id  evolves_from_species_id  shape_id  color_id
deoxys                  3                       NA        12         8
wormadam                4                      412         5         5
giratina                4                       NA        10         1
shaymin                 4                       NA         8         5
basculin                5                       NA         3         5
darmanitan              5                      554         8         8
tornadus                5                       NA         4         5
thundurus               5                       NA         4         2
landorus                5                       NA         4         3
keldeo                  5                       NA         8        10
meloetta                5                       NA        12         9
meowstic                6                      677         6         2
aegislash               6                      680         5         3
pumpkaboo               6                       NA         1         3
gourgeist               6                      710         5         3
oricorio                7                       NA         9         8
lycanroc                7                      744         8         3
wishiwashi              7                       NA         3         2
minior                  7                       NA         1         3
mimikyu                 7                       NA         2        10

If we go back to the raw source data, we can see that the data is actually there in most cases. It just didn’t join properly: in the source data these Pokemon are identified by the simple name in lower case, while this version of the package joins on id AND the column that holds the complex name. And because shape and color are linked through this data, they are missed as well!
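The failure mode is easy to reproduce in miniature. The sketch below is not the package’s actual join code; forms and species are two made-up tibbles that only mimic the shape of the problem, with a complex identifier on one side and the simple name on the other.

```r
library(dplyr)

# Made-up miniature of the two tables; the real ones have many more columns.
forms <- tibble(
  id         = c(25, 386),
  identifier = c("pikachu", "deoxys-normal")
)
species <- tibble(
  id            = c(25, 386),
  identifier    = c("pikachu", "deoxys"),
  generation_id = c(1, 3)
)

# Joining on both keys: "deoxys-normal" never equals "deoxys", so left_join()
# keeps the row but fills its generation_id with NA.
joined_both <- left_join(forms, species, by = c("id", "identifier"))

# Joining on id alone matches every row. The species identifier is dropped
# first so the result keeps a single identifier column.
joined_id <- left_join(forms, select(species, -identifier), by = "id")

print(joined_both)
print(joined_id)
```

Pokemon with a single form join fine either way, which is why only the 20 multi-form Pokemon dropped out.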

You defeated wild missingno!

This is all based on my Pokedex R data package, which I’m just about to fix :)

daveparr/pokedex
