top of page
learn_data_science.jpg

Data Scientist Program

 

Free Online Data Science Training for Complete Beginners.
 


No prior coding knowledge required!

Data Cleaning In Python

Writer's picture: Sakayo Toadoum SariSakayo Toadoum Sari


Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct (Click here)

Data scientists spend a large amount of their time cleaning datasets and getting them down to a form with which they can work. A lot of data scientists argue that the initial steps of obtaining and cleaning data constitute 80% of the job.

Common Issues With Data are:

  • Reading the file

  • Inconsistent Column Names

  • Missing Data

  • Different Data Types

  • Duplicate rows

  • etc

Loading or Reading the File

  • Encoding Error

  • Inconsistent rows

# importing packages
import pandas as pd
import numpy as np
# Issue 1
df = pd.read_csv("unclean_data.csv")
---------------------------------------------------------------------------

UnicodeDecodeError                        Traceback (most recent call last)

<ipython-input-3-0b7adce25293> in <module>
      1 # Issue 1
----> 2 df = pd.read_csv("unclean_data.csv")


~\anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper


~\anaconda3\lib\site-packages\pandas\io\parsers\readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    584     kwds.update(kwds_defaults)
    585 
--> 586     return _read(filepath_or_buffer, kwds)
    587 
    588 


~\anaconda3\lib\site-packages\pandas\io\parsers\readers.py in _read(filepath_or_buffer, kwds)
    480 
    481     # Create the parser.
--> 482     parser = TextFileReader(filepath_or_buffer, **kwds)
    483 
    484     if chunksize or iterator:


~\anaconda3\lib\site-packages\pandas\io\parsers\readers.py in __init__(self, f, engine, **kwds)
    809             self.options["has_index_names"] = kwds["has_index_names"]
    810 
--> 811         self._engine = self._make_engine(self.engine)
    812 
    813     def close(self):


~\anaconda3\lib\site-packages\pandas\io\parsers\readers.py in _make_engine(self, engine)
   1038             )
   1039         # error: Too many arguments for "ParserBase"
-> 1040         return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
   1041 
   1042     def _failover_to_python(self):


~\anaconda3\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py in __init__(self, src, **kwds)
     67         kwds["dtype"] = ensure_dtype_objs(kwds.get("dtype", None))
     68         try:
---> 69             self._reader = parsers.TextReader(self.handles.handle, **kwds)
     70         except Exception:
     71             self.handles.close()


~\anaconda3\lib\site-packages\pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()


~\anaconda3\lib\site-packages\pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._get_header()


~\anaconda3\lib\site-packages\pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()


~\anaconda3\lib\site-packages\pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error()


UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 275: invalid start byte

We are getting this error because Python tries to convert a byte-array (bytes which it assumes to be a utf-8-encoded string) to a Unicode string (str). This process of course is decoding according to utf-8 rules. When it tries this, it encounters a byte sequence that is not allowed in utf-8-encoded strings (namely this 0xff at position 0).

As a solution we can use the following code it will strip out (ignore) the characters and return the string without them. We only use this if our need is to strip them not convert them.

# Solution 1
# Use Text Editor and Save it as Utf-8,ISO-8859-1,latin1
df = pd.read_csv("unclean_data1.csv",encoding='utf8')
#print out the head of the dataframe
df.head()

Inconsistent Column Names

  • Change Cases

  • Rename them

df.columns
Index(['movie_title', 'num_critic_for_reviews', 'duration',
       'DIRECTOR_facebook_likes', 'actor_3_facebook_likes',
       'ACTOR_1_facebook_likes', 'gross', 'num_voted_users',
       'Cast_Total_facebook_likes', 'facenumber_in_poster',
       'num_user_for_reviews', 'budget', 'title_year',
       'ACTOR_2_facebook_likes', 'imdb_score', 'title_year.1'],
      dtype='object')

According to these columns’ names, it appears that they are inconsistent (a mix of upper and lower). When can convert all the columns names to upper:

df.columns=df.columns.str.upper()
df.columns
Index(['MOVIE_TITLE', 'NUM_CRITIC_FOR_REVIEWS', 'DURATION',
       'DIRECTOR_FACEBOOK_LIKES', 'ACTOR_3_FACEBOOK_LIKES',
       'ACTOR_1_FACEBOOK_LIKES', 'GROSS', 'NUM_VOTED_USERS',
       'CAST_TOTAL_FACEBOOK_LIKES', 'FACENUMBER_IN_POSTER',
       'NUM_USER_FOR_REVIEWS', 'BUDGET', 'TITLE_YEAR',
       'ACTOR_2_FACEBOOK_LIKES', 'IMDB_SCORE', 'TITLE_YEAR.1'],
      dtype='object')

Renaming Columns

We can also decide to rename a column using the following line of code:

df.rename(columns = {'DURATION':'TIME'})

Missing Data

In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.

Missing data can occur because of nonresponse: no information is provided for one or more items or a whole unit (“subject”). Some items are more likely to generate a non response than others: for example items about private subjects such as income (Wikipedia).

There are many techniques that can be used to deal with missing data/values.

  • Add a default value for missing data or use mean to fill it

  • Delete the row/column with missing data

  • Interpolate the rows

  • Replace

To check for missing data

False means no missing data

  • df.isnull().sum() int

  • df.isnull().any() bool

df.isnull().head()
df.isnull().any()
MOVIE_TITLE                  False
NUM_CRITIC_FOR_REVIEWS       False
DURATION                      True
DIRECTOR_FACEBOOK_LIKES       True
ACTOR_3_FACEBOOK_LIKES       False
ACTOR_1_FACEBOOK_LIKES       False
GROSS                        False
NUM_VOTED_USERS               True
CAST_TOTAL_FACEBOOK_LIKES     True
FACENUMBER_IN_POSTER          True
NUM_USER_FOR_REVIEWS         False
BUDGET                       False
TITLE_YEAR                   False
ACTOR_2_FACEBOOK_LIKES        True
IMDB_SCORE                   False
TITLE_YEAR.1                  True
dtype: bool

This output shows all columns that contain missing values (True)

# Columns with NAN using Integer
df.isnull().sum()
MOVIE_TITLE                  0
NUM_CRITIC_FOR_REVIEWS       0
DURATION                     3
DIRECTOR_FACEBOOK_LIKES      2
ACTOR_3_FACEBOOK_LIKES       0
ACTOR_1_FACEBOOK_LIKES       0
GROSS                        0
NUM_VOTED_USERS              1
CAST_TOTAL_FACEBOOK_LIKES    2
FACENUMBER_IN_POSTER         5
NUM_USER_FOR_REVIEWS         0
BUDGET                       0
TITLE_YEAR                   0
ACTOR_2_FACEBOOK_LIKES       1
IMDB_SCORE                   0
TITLE_YEAR.1                 7
dtype: int64
# Total Number of Missing NA
df.isnull().sum().sum()
21

Adding A Default Value or Filling the Missing Data

We can fill in missing data using a default value.

df.head()

We can fill all the NAN with zero (0).

df_with_0 = df.fillna(0)
df_with_0.head()

We can also fill it using the mean().

# Fill it with the mean
df['DURATION'].mean()
150.72727272727272
df_with_mean = df.DURATION.fillna(df['DURATION'].mean())
df_with_mean
0     178.000000
1     150.727273
2     148.000000
3     150.727273
4     132.000000
5     156.000000
6     150.727273
7     141.000000
8     141.000000
9     153.000000
10    183.000000
11    169.000000
12    106.000000
13    151.000000
Name: DURATION, dtype: float64

Droping NA

## Droping NA
df.head()
df.shape
(14, 16)
df_drop = df.dropna()
df_drop.shape
(4, 16)

Our shape have changed to 4 by 16 instead of 14 by 16 because we have dropped missing values.

df_drop.head()
0 comments

Recent Posts

See All

Comments


bottom of page