top of page
learn_data_science.jpg

Data Scientist Program

 

Free Online Data Science Training for Complete Beginners.
 


No prior coding knowledge required!

Writer's pictureHassan Abdihakin

Exploratory Data Analysis: European Soccer Dataset

Table of Contents: 1. Introduction 2. Data Wrangling 3. Exploratory Data Analysis 4. Conclusions Introduction:

Data Description The dataset used in this analysis comes from Kaggle and it contains data for soccer matches, players, and teams from several European countries in the years 2008 to 2016. The country table contains 11 rows 2 and columns and has a relationship with the league which has 11 rows and 3 columns and the team with 299 rows and 5 columns. team_attributes has 1458 and 25 columns that have relationships with teams. Match has 25979 rows and 115 columns that relate to Teams Table. the player table has 11060 with 7 columns and has a relationship with player_Attributes with 183978 rows and 42 columns. for further details

here is the link to the dataset: https://www.kaggle.com/datasets/hugomathien/soccer

Question(s) for Analysis Research Question 1: How many times a Gaol was netted for all the leagues from 2008 to 2016? Research Question 2: Chances Created By Crossing for all leagues from 2008 to 2015? Research Question 3: Which league had the most possession of the ball throughout the period? Research Question 4: Number of times a team dominated with ball position over 70 in EPL from 2008-2015 Research Question 5: Top 10 players with highest chances of scoring penalty in EPL? Research Question 6: who were the fastest players from 2008 to 2016? Research Question 7: who were the best left-footed shooters in the year (2015-2016) in EPL? Data Wrangling phase


# importing modules
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# importing datasets
df_country=pd.read_csv('Data/Country.csv')
df_league=pd.read_csv('Data/League.csv')
df_match=pd.read_csv('Data/Match.csv')
df_player=pd.read_csv('Data/Player.csv')
df_player_att=pd.read_csv('Data/Player_Attributes.csv')
df_team=pd.read_csv('Data/Team.csv')
df_team_att=pd.read_csv('Data/Team_Attributes.csv')


Data cleaning: The below info method shows some data types are inaccurate and there is missing data in it. the total entries should be 142271 but other columns such as goals and possessions which are the basis of the analysis have missing data. let's clean them before we analyse but a function would be great to avoid repeatative work



the below function produced a data that is clean and now we can do the Exploration process

Exploration Phase

Research Question 1: How many times a Gaol was netted for all the leagues from 2008 to 2016?


total_goals = df_team_tweak.groupby('league')[['home_team_goal','away_team_goal']].sum().sum(axis=1).sort_values\
(ascending=False)


plt.figure(figsize=(18,10))
sns.barplot(x=total_goals.index,y=total_goals)
plt.title('Goals Scored for all leagues from 2008 to 2016',fontsize=28)
plt.xticks(rotation=90)
plt.xlabel('Leagues')
plt.ylabel('Goals Scored (2008-2015)');


Observation:: EPL has the highest goals scored from the years 2008 to 2015.



Research Question 2: Chances Created By Crossing for all leagues from 2008 to 2015?


chance_Creation_by_Crossing= df_team_tweak.groupby('league')['chanceCreationCrossing'].sum().sort_values(ascending=False)

chance_Creation_by_Crossing=chance_Creation_by_Crossing.head(5)
#Plotting the above results

plt.figure(figsize=(20,10))
sns.barplot(x=chance_Creation_by_Crossing.index,y=chance_Creation_by_Crossing)
plt.title('Chances Created By Crossing for all leagues from 2008 to 2015',fontsize=20)
plt.xticks(rotation=90)
plt.xlabel('Leagues')
plt.ylabel('Chances Created By Crossing (2008-2015)');


Observation:: France Ligue 1 has the least chances created by crossing than the Big Five leagues.


Research Question 4: Number of times a team dominated with ball possition over 70 in EPL from 2008-2015

top_poss_teams = df_team_tweak.groupby('league').get_group('England Premier League').query("ball_possession >=70")['team_long_name'].value_counts(ascending=False)


plt.figure(figsize=(15,5))
sns.barplot(x=top_poss_teams.index,y=top_poss_teams)

plt.title('Dominated with Ball possition >= 70 from 2008 to 2016')
plt.xticks(rotation=90)
plt.xlabel('Leagues')
plt.ylabel('# of times Ball possition is over 70');

Observation::Manchester United and manchester city are the only two teams that had 70+ ball positions more than 120 times from 2008 to 2016.

Data Wrangling Phase for Players Data sets

Below is a reminder of the objective of the player's Analysis . Q5: Top 10 players with highest chances of scoring penalty in EPL? Q6: who were the fastest players during 2008 to 2016? Q7: who were the best left-footed shooters in the year (2015-2016) in EPL? Merging Process


# below we have joined the two data set and assigned to the df_players
df_players=df_player.merge(df_player_att,how='left',on='player_api_id')

Data Cleaning Process

The below method shows most data has missing data. if the missing values are dropped then a significant amount of data is lost. therefore, the sklean simpleImputer provides a way to fill the categorical data with their frequently value and for the numeric values with their mean.

df_players.isnull().sum()
id_x                       0 player_api_id              0 player_name                0 player_fifa_api_id_x       0 birthday                   0 height                     0 weight                     0 id_y                       0 player_fifa_api_id_y       0 date                       0 overall_rating           836 potential                836 preferred_foot           836 attacking_work_rate     3230 defensive_work_rate      836 crossing                 836 finishing                836 heading_accuracy         836 short_passing            836 volleys                 2713 dribbling                836 curve                   2713 free_kick_accuracy       836 long_passing             836 ball_control             836
...
gk_handling              836 gk_kicking               836 gk_positioning           836 gk_reflexes              836 dtype: int64

# before imputing let's split our dataframe into numeric and categorical

# numeric columns
df_player_numeric=df_players.select_dtypes(include='number')
numeric_cols=df_player_numeric.columns

# categorical colums
df_player_categorical=df_players.select_dtypes(exclude='number')
cat_cols=df_player_categorical.columns

# before imputing let's split our dataframe into numeric and categorical

# numeric columns
df_player_numeric=df_players.select_dtypes(include='number')
numeric_cols=df_player_numeric.columns


# categorical colums
df_player_categorical=df_players.select_dtypes(exclude='number')
cat_cols=df_player_categorical.columns

# import modules

from sklearn.impute import SimpleImputer

simple_imp_num=SimpleImputer(strategy = 'constant',fill_value=-999)

# filling all nulls with -999
imputed_num=simple_imp_num.fit_transform(df_players[numeric_cols])


# let's fill the missing values of numeric columns with thier mean
simple_imp_mean=SimpleImputer(strategy = 'mean')
imputed_num=simple_imp_mean.fit_transform(df_players[numeric_cols])
df_players[numeric_cols]=imputed_num

let's fill the missing values of categorical columns


# let's fill the missing values of categorical columns with their most frequenty values
simple_imp_freq=SimpleImputer(strategy = 'most_frequent')
imputed_cat=simple_imp_freq.fit_transform(df_players[cat_cols])
df_players[cat_cols]=imputed_cat


df_players.info()
Data columns (total 48 columns):  #   Column                Non-Null Count   Dtype   ---  ------                --------------   -----    0   id_x                  183978 non-null  float64  1   player_api_id         183978 non-null  float64  2   player_name           183978 non-null  object   3   player_fifa_api_id_x  183978 non-null  float64  4   birthday              183978 non-null  object   5   height                183978 non-null  float64  6   weight                183978 non-null  float64  7   id_y                  183978 non-null  float64  8   player_fifa_api_id_y  183978 non-null  float64  9   date                  183978 non-null  object   10  overall_rating        183978 non-null  float64  11  potential             183978 non-null  float64  12  preferred_foot        183978 non-null  object   13  attacking_work_rate   183978 non-null  object   14  defensive_work_rate   183978 non-null  object   15  crossing              183978 non-null  float64  16  finishing             183978 non-null  float64  17  heading_accuracy      183978 non-null  float64  18  short_passing         183978 non-null  float64  19  volleys               183978 non-null  float64
...
 46  gk_positioning        183978 non-null  float64  47  gk_reflexes           183978 non-null  float64 dtypes: float64(42), object(6) memory usage: 68.8+ MB

As per the above method, the data has zero missing but there is inaccuracy in the data types. some columns have unmeaningful names and Duplicates. Below is a function that takes care of any data inconsistencies that the above method had shown and beyond.



def tweak_players(df_players):
    cols=df_players.columns
    list(cols)
    return(
        df_players
        [cols]
        .assign(
            birthday=pd.to_datetime(df_players.birthday), # convert DOB datatype to datetime
            date=pd.to_datetime(df_players.date) # convert Date datatype to datetime
        )   
    .drop(columns=['id_x','id_y','player_fifa_api_id_x', 'player_fifa_api_id_y']) # droping these columns
    .rename(columns={'player_api_id':'id'}) # renaming a column
    .dropna() # drop nulls
    .drop_duplicates() # drop dulicates

)
df_players_tweak=tweak_players(df_players) # call the function and save the data to the variable
df_players_tweak # testing what we got




#checking for nulls ,duplicates and other in accuracies
df_players_tweak.info()
 Data columns (total 44 columns):  #   Column               Non-Null Count   Dtype          ---  ------               --------------   -----           0   id                   183977 non-null  float64         1   player_name          183977 non-null  object          2   birthday             183977 non-null  datetime64[ns]  3   height               183977 non-null  float64         4   weight               183977 non-null  float64         5   date                 183977 non-null  datetime64[ns]  6   overall_rating       183977 non-null  float64         7   potential            183977 non-null  float64         8   preferred_foot       183977 non-null  object          9   attacking_work_rate  183977 non-null  object          10  defensive_work_rate  183977 non-null  object          11  crossing             183977 non-null  float64         12  finishing            183977 non-null  float64         13  heading_accuracy     183977 non-null  float64         14  short_passing        183977 non-null  float64         15  volleys              183977 non-null  float64         16  dribbling            183977 non-null  float64         17  curve                183977 non-null  float64         18  free_kick_accuracy   183977 non-null  float64         19  long_passing         183977 non-null  float64       
...
 42  gk_positioning       183977 non-null  float64         43  gk_reflexes          183977 non-null  float64        dtypes: datetime64[ns](2), float64(38), object(4) memory usage: 63.2+ MB

Exploratory Data Analysis

Research Question 5: Top 10 players with highest chances of scoring penalty in EPL?


top_penalty_takers=df_players_tweak[['player_name','penalties']].query("penalties>=94")['player_name'].value_counts()

plt.figure(figsize=(16,8))
sns.barplot(x=top_penalty_takers.index,y=top_penalty_takers)

plt.title('Players with highest chances of scoring penalty in EPL from 2008 to 2016')
plt.xticks(rotation=90)
plt.xlabel('Player Name')
plt.ylabel('# of times penalties were  >= 94 (2008-2015)');


Observation::Rickie is the only player in the EPL that had the highest chances of scoring a penalty from 2008 to 2016

Research Question 6: who were the fastest players during 2008 to 2016?


fastest_players=df_players[['player_name','sprint_speed']].query("sprint_speed>=96")['player_name'].value_counts()

plt.figure(figsize=(14,10))
sns.barplot(x=fastest_players.index,y=fastest_players)

plt.title('Speed Consistency  >= 96 from 2008 to 2016')
plt.xticks(rotation=90)
plt.xlabel('Player Name')
plt.ylabel('Speed Consistency >= 96 (2008-2015)');

Observation::theo,Mathis and Pierre are the fastest players in the EPL during 2008 to 2016.


Research Question 7 : who were the best left-footed shooters in the year (2015-2016)?



best_left_shooter=df_players[['player_name','preferred_foot','shot_power','date']].query("shot_power>=90 & preferred_foot=='left'& date>'2015' ")['player_name'].value_counts()

plt.figure(figsize=(20,8))
sns.barplot(x=best_left_shooter.index,y=best_left_shooter)
sns.set_theme(style="whitegrid")
plt.title('Best left-footed shooters in (2015-2016)',fontsize=18)
plt.xticks(rotation=90)
plt.xlabel('Player Name',fontsize=14)
plt.ylabel('# of Times Shot Power was  >= 90 (2015-2016)',fontsize=16);

Observation:: Robin Van Persie was among the best left-footed players and had 90+ shot Power during the 2015/2016 season.


Conclusions:

Teams: After cleaning the data, overall ball possession performance was one of the priority questions and I found out that EPL came first and Liga BBVA second. However, this could be result of data gathering mainly focused on EPL than other leagues. I also asked Examined further questions for most chances created through crossing across all leagues and EPL come first and that could be geniune since most goals were scored in the EPL. Now, with the Focus on EPL teams, i wanted to figure out,how many times a certain team had ball possion over 70 from 2008 to 2016. Manchester united and Manchester city were the only teams to have done that more than 110 times. where leicester city, Westham United and several other times have dominated 6 times with possession greater than 70. However, this could be a result of the opposition team having red card or top players' injuries, or mental problems. due to that further investigation is needed on why certain teams only had less than 30% ball possition.


Players: after cleaning data for exploration, one of the objectives was to know, those players who had consistencily maintained speed greater that 96 over the period(2008-2016). theo Walcott and Mathis Bolly were the players to maintain that record for over 20 times, Where Pierre-Emerick Aubameyang,Jonathan Biabiany and David Odonkor consistencily had more than 3 times.however having speed greater that 96 does constitute your overall rating. i was also concerned the best left footed players with over 90 shot power. Bradley Johnson,Hulk and Lukas Podolski had achived this over 7 times. while the Greate Robin Van Persie had only once attained shot power over 90. i further analyzed the data to find out the best penalty takers with over 94% chance of scoring.Rickie Lambert and Mario Balotelli were the best penalty takers and had higher chances of netting a goal from the spot. Fabio Cannavaro,Paul Scholes and Xavi Hernandez had achieved only one time over the period of (2008-2016). I also investigated if there is correlation between player's ball control and position and from the visualization, it seems there is a postive relationship. also markig and standing tackle seems to have a greater positive relationship and this could be true after all you don't need to slide if you are already marking the player. the data had limitation since you cannot map a player with the number of goals he scored or the number of headers netted. the data concentarted the traits of the players and the team. and for that reason, we may not know if there is relationship between higher penalty accuracy and scoring.

0 comments

Recent Posts

See All

Comments


bottom of page