Pokemon EDA

The anime Pokémon hardly needs an introduction, but for the sake of this article I will provide one. Pokémon is a successful, long-running anime with a huge fan following. It is an adventure story that mainly focuses on 10-year-old Ash Ketchum, a young Pokémon Trainer from Pallet Town. Together with Pikachu, Ash sets out on a daring journey in which he battles many Pokémon trainers. On their journey, they meet new friends and new rivals. Traveling from one gym to another and competing in many championships, Ash dreams of becoming the world’s greatest Pokémon Master.

Pokémon is not just a popular anime but also a popular video game. Not to forget Pokémon Go, which created a huge hype in mobile gaming. With so much detail available about each Pokémon, performing EDA becomes even more interesting.

What Is EDA?

EDA stands for Exploratory Data Analysis. When provided with a dataset, you take data-driven steps to analyze it. In Python, you can perform EDA with the Pandas, Matplotlib, and Seaborn modules. EDA is like telling the story of the given dataset.

The better you understand the data, the better you can analyze and visualize it. This article gives you hands-on experience with Data Analysis and Data Visualisation. How cool is that?

One thing to keep in mind: you need basic Python knowledge, including how to import a module. If you are new, try our Anime Vyuh Python tutorials first. Once done, you can work through this Pokémon EDA to get started with Pandas, Matplotlib, and Seaborn.

Learn Pandas And Matplotlib Using Pokemon

Major Building Blocks Of EDA

Before jumping into the code, download the Dataset from Kaggle: The World Of Pokemon.

Analysis Of Pokemon Dataset

First, import the modules and read the CSV dataset into a DataFrame.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import warnings
import squarify
warnings.filterwarnings('ignore')

dataset = pd.read_csv("pokemons dataset.csv")  # read_csv already returns a DataFrame
df = pd.DataFrame(dataset)  # df is the name used in most of the examples below

Filtering, Cleaning, Selecting In Pandas

Filtering selects the part of the dataset that matches a condition and leaves out the rest.

Here I will share three different methods you can use to filter the dataset.

First, let us look at Single Filtering.

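# Method 1: Using comparison operators
attack = df[df['Attack'] > 150]

# Method 2: Using an anonymous function (map and lambda)
# map returns a boolean Series; indexing df with it keeps the matching rows
attack_2 = df[df['Attack'].map(lambda atk: atk > 150)]

# Method 3: Using pandas.Series
# build a list of True/False flags, one per row
attack_pokemons = []
for atck in df['Attack']:
    if atck > 150:
        attack_pokemons.append(True)
    else:
        attack_pokemons.append(False)

# turn the flags into a boolean Series aligned with df, then filter with it
attack_3 = df[pd.Series(attack_pokemons, index=df.index)]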
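
To check that the three methods agree, here is a minimal sketch on a tiny hand-made DataFrame. The names and Attack values are made up for illustration and are not taken from the Kaggle dataset. Methods 2 and 3 each produce a boolean mask, and indexing the DataFrame with that mask yields the same rows as Method 1.

```python
import pandas as pd

# Toy data (hypothetical values, not from the Kaggle file)
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Attack": [120, 160, 155, 90],
})

# Method 1: comparison operator
by_operator = toy[toy["Attack"] > 150]

# Method 2: map + lambda builds a boolean mask, which then indexes the DataFrame
by_map = toy[toy["Attack"].map(lambda atk: atk > 150)]

# Method 3: a boolean Series built from an explicit list of flags
flags = [atk > 150 for atk in toy["Attack"]]
by_series = toy[pd.Series(flags, index=toy.index)]

print(by_operator["Name"].tolist())
```

In practice Method 1 is the shortest and usually the fastest of the three, since the comparison is vectorized.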

Next up, Multi Filtering

attack_defense = df[(df['Attack']>180) & (df['Defense']>160)]

water_poison = df[(df["Primary Type"]=="WATER") & (df['Secondary type']=="POISON")]
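
The sketch below shows how conditions combine, using a small made-up DataFrame (the rows are hypothetical, only the column names follow the article). `&` keeps rows where both conditions hold, `|` keeps rows where at least one holds, and each condition needs its own parentheses.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Primary Type": ["WATER", "WATER", "FIRE", "BUG"],
    "Secondary type": ["POISON", "GROUND", "FLYING", "POISON"],
    "Attack": [95, 185, 190, 60],
    "Defense": [170, 165, 100, 50],
})

# AND: Attack above 180 and Defense above 160
strong = toy[(toy["Attack"] > 180) & (toy["Defense"] > 160)]

# OR: Primary Type is WATER or Secondary type is POISON
water_or_poison = toy[(toy["Primary Type"] == "WATER") | (toy["Secondary type"] == "POISON")]

print(strong["Name"].tolist())
print(water_or_poison["Name"].tolist())
```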

Cleaning is the process of removing unwanted data from the dataset. The main kind of unwanted data is empty (missing) values. Most large datasets will have a few (or more) empty values.

You can check for empty values using the isnull() method.

df.isnull() #where df is DataFrame

Pandas also enables us to see the total number of empty values in each column.

df.isnull().sum()

The above command returns the number of empty values in each column. Chaining a second sum(), as in df.isnull().sum().sum(), returns the total number of empty values in the entire dataset.
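
As a small illustration of both counts, the sketch below uses a made-up DataFrame with three empty values. The data is hypothetical; only isnull() and sum() are the calls discussed above.

```python
import numpy as np
import pandas as pd

# Toy data with three empty values (hypothetical)
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Secondary type": ["POISON", np.nan, np.nan],
    "HP": [45.0, np.nan, 80.0],
})

per_column = toy.isnull().sum()    # empty values in each column
total = toy.isnull().sum().sum()   # empty values in the whole DataFrame

print(per_column)
print(total)
```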

How To Deal With Empty Values In Pandas?

Using the pandas drop method, we can remove unwanted rows or columns.

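# to drop columns (column_name, column1, column2 are placeholders for column labels)
df.drop(column_name, axis=1, inplace=True)
df.drop([column1, column2], axis=1, inplace=True)

# axis=0 rows, axis=1 columns

# to delete empty values
df.dropna(how="all", inplace=True)  # drop rows in which every value is empty
df.dropna(how="any", inplace=True)  # drop rows that contain at least one empty value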
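
The difference between how="any" and how="all" is easiest to see on a small example. The sketch below uses a made-up DataFrame and leaves out inplace=True, so each call returns a new DataFrame and the original stays untouched.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one complete row, one partly empty row, one fully empty row
toy = pd.DataFrame({
    "Name": ["A", "B", np.nan],
    "HP": [45.0, np.nan, np.nan],
    "Notes": [np.nan, np.nan, np.nan],
})

without_notes = toy.drop("Notes", axis=1)    # remove one column
rows_any = without_notes.dropna(how="any")   # keep only rows with no empty value
rows_all = without_notes.dropna(how="all")   # drop only rows that are entirely empty

print(len(rows_any), len(rows_all))
```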

Cleaning doesn’t just mean removing data; it also means filling in empty values. Using the pandas fillna method, we can fill the empty values.

Consider a first dataset of exam scores, where a value is empty when the student was absent for the exam. You can then fill the empty values with 0.

In a second dataset of sales, a price for a specific month may have been left empty. In such cases, you can fill the gap with the mean of that column (here the column is Company Name).

Now let’s look at the fillna pandas method.

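df.fillna(0, inplace=True)  # fill every empty value with 0
# fill one column's empty values with that column's mean (column_name is a placeholder)
df[column_name] = df[column_name].fillna(df[column_name].mean())

inplace=True modifies the given DataFrame directly instead of returning a modified copy.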
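
The two fill strategies described above can be sketched on two tiny made-up DataFrames. The column names Student, Score, Month, and Price are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical exam scores: an empty Score means the student was absent
marks = pd.DataFrame({"Student": ["A", "B", "C"], "Score": [80.0, np.nan, 60.0]})

# Hypothetical monthly prices: one month was left empty
sales = pd.DataFrame({"Month": ["Jan", "Feb", "Mar"], "Price": [100.0, np.nan, 200.0]})

# Absent students get 0
marks["Score"] = marks["Score"].fillna(0)

# The missing price gets the mean of the known prices (mean() skips empty values)
sales["Price"] = sales["Price"].fillna(sales["Price"].mean())

print(marks["Score"].tolist())
print(sales["Price"].tolist())
```

Filling per column, as here, avoids writing one column's mean into unrelated columns.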

The Ultimate Use Of Pandas groupby method

Selecting is the process of picking the required data from the dataset. We can even select the values of one column with reference to another column. This can be done using the groupby method.

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

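# minimum Total among Pokémon whose Primary Type is BUG
print("Bug Minimum(Primary):", df.groupby("Primary Type")["Total"].min()["BUG"])

# minimum Total among Pokémon whose Secondary type is FIRE
print("Fire Min(Sec):", df.groupby("Secondary type")["Total"].min()["FIRE"])
# mean Total among Pokémon whose Primary Type is WATER
print("Water Mean(Pri):", df.groupby("Primary Type")["Total"].mean()["WATER"])
# number of Pokémon whose Secondary type is ELECTRIC
print("Electric Count(Sec):", df.groupby("Secondary type")["Total"].count()["ELECTRIC"])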
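
The split, apply, combine steps can be traced on a small made-up DataFrame. The Total values below are hypothetical; groupby splits the rows by Primary Type, the function (min, mean, count) is applied to each group, and the results are combined into one Series indexed by type.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Primary Type": ["BUG", "BUG", "WATER", "WATER", "WATER"],
    "Total": [195, 395, 314, 405, 530],
})

grouped = toy.groupby("Primary Type")["Total"]

bug_min = grouped.min()["BUG"]          # smallest Total among BUG rows
water_mean = grouped.mean()["WATER"]    # average Total among WATER rows
water_count = grouped.count()["WATER"]  # number of WATER rows

print(bug_min, water_mean, water_count)
```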

Using Pandas we can even plot the graphs, and this groupby method is a big help.

df.groupby('Primary Type')['Attack'].agg(['count','min','max','mean']).plot(
    kind="bar",
    stacked=True,
    figsize=(15,10),
    title="Primary Type Pokemon Attack Analysis",
)

In the above code sample, groupby computes the count, min, max, and mean of Attack for each Primary Type. We use agg (short for aggregate) to apply several operations to the groups at once. We then draw a bar plot by passing "bar" to the kind parameter. Setting stacked=True stacks the four values for each type on top of one another in a single bar, each in a different color.
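
To see what agg hands to the plot, the sketch below builds the same kind of summary table on a made-up DataFrame (the Attack values are hypothetical). Each row of the result is one Primary Type and each column is one aggregate.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Primary Type": ["FIRE", "FIRE", "WATER", "WATER", "WATER"],
    "Attack": [52, 84, 48, 63, 83],
})

# One row per type, one column per aggregate
summary = toy.groupby("Primary Type")["Attack"].agg(["count", "min", "max", "mean"])
print(summary)
```

Calling summary.plot(kind="bar", stacked=True) on this table would then draw one bar per type with the four aggregates stacked as colored segments.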

Visualisation

Visualization is the graphical representation of a dataset. Python offers modules such as Seaborn, Matplotlib, Plotly, and Squarify for Data Visualization. With them, we can plot many kinds of graphs, including: Bar Plot, Histogram, Scatter Plot, Heatmap, Line Plot, Pie Plot, Box Plot, and many more.

Here I will be sharing a few examples of plotting the dataset.

Bar Plot

A Bar Plot is mainly used to plot categorical values. The categories in this context can be the Primary Type of the Pokémon, i.e., Water, Fire, Electric, Grass, or Flying.

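# Assumed setup (not shown in the original): water Pokémon with Total above 500
water_pokemons = df[df["Primary Type"]=="WATER"]
strong_water = water_pokemons[water_pokemons["Total"]>500]
water_pokemons_names = strong_water["Name"]
water_strength = strong_water["Total"]

plt.figure(figsize=(20,10)) #20:width, 10:height
plt.title("Water Pokemon Overall Strength Above 500",fontsize=24)
plt.bar(water_pokemons_names,water_strength)
plt.xticks(rotation=90) #rotate the x labels so long names do not overlap
plt.xlabel("Name",fontsize=18)
plt.ylabel("Total",fontsize=18)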
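
Here is a self-contained sketch of the same idea on made-up data. The rows are hypothetical, and the filter (WATER with Total above 500) is an assumption based on the plot title above. The Agg backend is selected only so the script also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Primary Type": ["WATER", "WATER", "FIRE", "WATER"],
    "Total": [530, 314, 534, 540],
})

# Keep water Pokémon whose Total is above 500
strong_water = toy[(toy["Primary Type"] == "WATER") & (toy["Total"] > 500)]

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.bar(strong_water["Name"].tolist(), strong_water["Total"].tolist())
ax.set_title("Water Pokemon Overall Strength Above 500")
ax.set_xlabel("Name")
ax.set_ylabel("Total")
ax.tick_params(axis="x", rotation=90)  # rotate the x labels
```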

Histogram

A Histogram plots the distribution of numerical values. Here we can plot speed, attack, or defense using a Histogram, and get a visual sense of where the mean and median of the data lie.

# water_pokemons holds the WATER rows, e.g. df[df["Primary Type"]=="WATER"]
plt.hist(water_pokemons['HP'],bins=15)
plt.xlabel("HP")
plt.ylabel("Count")

When bins is an integer, it defines the number of equal-width bins in the range.
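
What "equal-width bins" means can be checked with numpy.histogram, which Matplotlib's hist uses to bin the data. The HP values below are made up for illustration.

```python
import numpy as np

# Hypothetical HP values
hp = np.array([20, 35, 44, 50, 55, 60, 65, 80, 95, 130])

# bins=5 splits the range from min(hp) to max(hp) into 5 equal-width bins
counts, edges = np.histogram(hp, bins=5)
widths = np.diff(edges)

print(counts)
print(edges)
```

With a range of 20 to 130 and 5 bins, every bin is 22 wide, and the counts add up to the number of values.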

Heatmap

A Heatmap is a convenient way to check the correlation between columns. Correlation measures the strength of the linear relationship between two labels, and it ranges from -1 to 1. As a rule of thumb, a value near 1 (say, above 0.5) indicates a Positive Correlation, and a value near -1 (say, below -0.5) indicates a Negative Correlation. In both cases we can conclude that there is some relationship between the labels. Values near zero indicate little or no linear relationship between the labels.

_,(ax, cbar_ax) = plt.subplots(2, gridspec_kw= {"height_ratios": (1, .04), "hspace":.3}, figsize=(18,9))
sb.heatmap((dataset.loc[:,['Attack','Defense','HP','Sp.Attack','Sp.Defense','Speed']]).corr(),
            annot= True,
            fmt = "3.3f",
            vmin = -1,
            vmax = 1,
            ax=ax,
            cbar_ax=cbar_ax,
            cbar_kws={"orientation": "horizontal"},
            cmap = "plasma")
ax.set_title('Correlation Between Powers Of Pokemon', size = 24);
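
The numbers shown in the heatmap come from DataFrame.corr(). The sketch below computes them for a made-up DataFrame built so that one pair of columns rises together and another pair moves in opposite directions.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Attack":  [10, 20, 30, 40, 50],
    "Defense": [12, 24, 36, 48, 60],  # rises in step with Attack
    "Speed":   [50, 40, 30, 20, 10],  # falls as Attack rises
})

corr = toy.corr()  # pairwise Pearson correlation between the columns
print(corr)
```

Here Attack and Defense have a correlation of 1, Attack and Speed have -1, and every column has a correlation of 1 with itself, which is why the diagonal of a correlation heatmap is always at the top of the scale.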

A heatmap is also a handy way to visualize the empty values.

sb.heatmap(dataset.isnull())  # seaborn was imported as sb

Pie Plot

A Pie Plot shows how a whole is divided into parts. It is a circle in which each slice represents the numerical proportion of one data value. In most cases, we prefer the Bar Plot over the Pie Plot.

plt.figure(figsize=(8,5))

# water Pokémon with Speed above 100; each one becomes a slice labeled with its Name
fast_water = water_pokemons[water_pokemons['Speed']>100]
plt.pie(fast_water["Speed"],labels=fast_water["Name"],autopct='%1.3f%%')
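
The percentage that autopct prints on each slice is that value's share of the total. A minimal sketch with made-up Speed values shows the arithmetic:

```python
import pandas as pd

# Hypothetical Speed values for three Pokémon
speed = pd.Series([110, 120, 170], index=["A", "B", "C"])

# Each slice covers value / total of the circle
share = speed / speed.sum() * 100
print(share)
```

Note that these percentages describe each Pokémon's share of the summed Speed, not a share of any population, which is one reason a bar plot is often easier to interpret for this kind of data.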

Scatter Plot

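Scatter plots are used to observe the relationship between two variables. For example, we can compare the speeds of Water and Fire Pokémon.

water_speed = df[df["Primary Type"]=="WATER"]["Speed"]
fire_speed = df[df["Primary Type"]=="FIRE"]["Speed"]
# scatter needs x and y of equal length, so trim both to the shorter group;
# points are paired by row order only
n = min(len(water_speed), len(fire_speed))
plt.scatter(water_speed.iloc[:n], fire_speed.iloc[:n])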
plt.xlabel("Water Pokemons")
plt.ylabel("Fire Pokemons")
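
A scatter plot can also pair two stats of the same Pokémon, so that each point stands for one row. The sketch below is one such variation, not the article's original plot; the Attack and Defense values are made up, and the Agg backend is chosen only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (hypothetical values): one row per Pokémon
toy = pd.DataFrame({
    "Attack": [48, 63, 83, 52],
    "Defense": [65, 80, 100, 43],
})

fig, ax = plt.subplots()
points = ax.scatter(toy["Attack"], toy["Defense"])  # one point per row
ax.set_xlabel("Attack")
ax.set_ylabel("Defense")
```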

You can check out the documentation of each library for more plots and examples; I will conclude this article here. If you have Jupyter Notebook, check out the source code. If not, check out my Kaggle Notebook.

Conclusion

The EDA is done for three types (Water, Fire, Electric) of Pokémon. Get your hands dirty by completing the EDA for two more types (Flying, Grass). Get the source code and add your own code to the practice session. Ganbatte Senpai 👍

And yes, if I missed anything, do open a Pull Request and let me know.