How to Import and Clean Data with Python

By Fatma Ali

Importing Data

Loading and Saving CSVs:

When your data is in a CSV file, you can load it into a pandas DataFrame with pd.read_csv():

import pandas as pd

df = pd.read_csv('IMDB_Movies.csv')
df
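The heading also mentions saving. As a minimal sketch, the round trip below writes a DataFrame out with .to_csv() and reads it back with pd.read_csv(). The small `movies` table and its values are hypothetical stand-ins for IMDB_Movies.csv, and an in-memory buffer is used instead of a real file so the example runs anywhere.

```python
import io

import pandas as pd

# Hypothetical stand-in for the contents of IMDB_Movies.csv.
movies = pd.DataFrame({
    "movie_title": ["Avatar", "Spectre"],
    "director_name": ["James Cameron", "Sam Mendes"],
})

# Save as CSV. index=False keeps the row index out of the file.
# Pass a path such as 'movies_clean.csv' instead of a buffer to write to disk.
buffer = io.StringIO()
movies.to_csv(buffer, index=False)

# Load the CSV text back into a new DataFrame.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded)
```

Without index=False, the saved file gains an extra unnamed column that reappears when the file is loaded again.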

Cleaning Data

Diagnose the Data:

We often describe data that is easy to analyze and visualize as “tidy data”. What does it mean to have tidy data? For data to be tidy, it must have:

  • Each variable as a separate column

  • Each row as a separate observation
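To make the two rules concrete, here is a small sketch of reshaping an untidy table with .melt(). The table is invented for illustration: it stores one variable (the year) spread across column names, so each row holds two observations instead of one.

```python
import pandas as pd

# Hypothetical untidy table: the year is hidden inside the column names.
wide = pd.DataFrame({
    "movie_title": ["Film A", "Film B"],
    "gross_2015": [10, 20],
    "gross_2016": [30, 40],
})

# Move the year columns into rows: one row per (movie, year) observation.
tidy = wide.melt(id_vars="movie_title", var_name="year", value_name="gross")

# Turn labels like 'gross_2015' into the integer 2015.
tidy["year"] = tidy["year"].str.replace("gross_", "").astype(int)
print(tidy)
```

After reshaping, movie_title, year and gross are each a column, and every row is a single observation.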

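df.info() prints a concise summary of the DataFrame: the number of rows, plus each column's name, count of non-null values and dtype. For summary statistics such as the mean and standard deviation, use df.describe().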

df.info()
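As a rough illustration of what each call reports, the sketch below runs .info() and .describe() on a tiny made-up DataFrame. The column names and values are assumptions, not the real IMDB data.

```python
import io

import pandas as pd

# Hypothetical sample with one missing director.
df = pd.DataFrame({
    "director_name": ["James Cameron", "Sam Mendes", None],
    "imdb_score": [7.9, 6.8, 7.1],
})

# info() prints to the screen by default; buf= captures the text instead.
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
print(summary)   # column names, non-null counts, dtypes

# describe() computes count, mean, std, min, quartiles and max
# for the numeric columns.
stats = df.describe()
print(stats)
```

Comparing the non-null count of a column with the total number of rows is a quick way to spot missing values.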

Dealing with Duplicates:

Often we see duplicated rows of data in the DataFrames we are working with. This can happen because of errors in data collection or in saving and loading the data. To check for duplicates, we can use the pandas method .duplicated(), which returns a Boolean Series that is True for every row that repeats an earlier row.

df.duplicated()
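A short sketch of how the result can be used, on a made-up three-row table: summing the Boolean Series counts the duplicates, and keep=False flags every copy rather than only the later ones.

```python
import pandas as pd

# Hypothetical sample: the third row repeats the first.
df = pd.DataFrame({
    "movie_title": ["Avatar", "Spectre", "Avatar"],
    "director_name": ["James Cameron", "Sam Mendes", "James Cameron"],
})

# True marks a row that repeats an earlier row.
dup_mask = df.duplicated()
print(dup_mask)

# True counts as 1, so the sum is the number of duplicate rows.
n_duplicates = int(dup_mask.sum())

# keep=False marks every member of a duplicated group, including the first.
all_copies = df[df.duplicated(keep=False)]
print(all_copies)
```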

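We can use the pandas .drop_duplicates() method to remove every row that is an exact duplicate of an earlier row. The method returns a new DataFrame rather than modifying df in place, so assign the result back. Passing subset=['director_name'] would instead treat any two rows with the same director as duplicates and keep only the first movie per director, which is rarely what we want here.

df = df.drop_duplicates()
df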
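The difference between dropping exact duplicates and dropping by a subset of columns is easy to miss, so here is a small sketch on invented data. It also shows that .drop_duplicates() leaves the original DataFrame untouched unless the result is assigned.

```python
import pandas as pd

# Hypothetical sample: rows 0 and 1 are identical; row 2 shares only the director.
df = pd.DataFrame({
    "movie_title": ["Avatar", "Avatar", "Titanic"],
    "director_name": ["James Cameron", "James Cameron", "James Cameron"],
})

# Compare all columns: only the exact repeat (row 1) is dropped.
exact = df.drop_duplicates()

# Compare director_name only: every later movie by the same director is dropped.
by_director = df.drop_duplicates(subset=["director_name"])

print(len(df), len(exact), len(by_director))
```

Here `exact` keeps both Avatar and Titanic, while `by_director` keeps a single row, and `df` itself still has all three rows.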

Missing Values:

We often have data with missing elements, as a result of a problem with the data collection process or errors in the way the data was stored. Missing elements normally show up as NaN (Not a Number) values. To count the missing values in each column, chain .isnull() with .sum():

df.isnull().sum()
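As a sketch of what this returns, the example below counts missing values per column on a small invented table, and uses .mean() on the same Boolean mask to get the share of missing values. The `gross` column and its values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in both columns.
df = pd.DataFrame({
    "director_name": ["James Cameron", None, "Sam Mendes"],
    "gross": [760.5, np.nan, np.nan],
})

# isnull() gives True for each missing cell; sum() counts them per column.
missing_counts = df.isnull().sum()
print(missing_counts)

# The mean of a Boolean column is the fraction of True values.
missing_share = df.isnull().mean()
print(missing_share)
```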

If we wanted to remove every row with a NaN value in the director_name column only, we could specify a subset:

df = df.dropna(subset=['director_name'])
df
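To see what the subset argument changes, this sketch compares .dropna() with and without it on a made-up table. Column names other than director_name are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: row 1 lacks a director, row 2 lacks a gross value.
df = pd.DataFrame({
    "director_name": ["James Cameron", None, "Sam Mendes"],
    "gross": [760.5, 200.0, np.nan],
})

# No subset: drop any row that has a NaN in any column.
no_gaps = df.dropna()

# subset: only a missing director_name causes a row to be dropped.
with_director = df.dropna(subset=["director_name"])

print(len(no_gaps), len(with_director))
```

With subset, the row that is missing only `gross` survives, so less data is discarded.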

Looking at Types:

Each column of a DataFrame holds items of a single data type, or dtype. The main dtypes that pandas uses are float, int, bool, datetime, timedelta, category and object (typically text or mixed values). Often, we want to convert between types so that we can do better analysis.

To see the types of each column of a DataFrame, we can use:

print(df.dtypes)
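Converting between types can be sketched as follows, assuming a hypothetical table in which years were loaded as text. pd.to_numeric() and .astype() are two common ways to change a column's dtype; the column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical sample: title_year arrived as text instead of numbers.
df = pd.DataFrame({
    "title_year": ["2009", "2015"],
    "color": ["Color", "Color"],
})

# Parse the text into numbers so arithmetic and sorting behave as expected.
df["title_year"] = pd.to_numeric(df["title_year"])

# A column with few distinct values can be stored as a category.
df["color"] = df["color"].astype("category")

print(df.dtypes)
```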

You can check the full code here



