Abu Bin Fahd

Machine Learning: Feature Creation & Extraction in Python




What is feature engineering?

Feature engineering is the process of transforming raw data into features that a predictive model can use, whether that model comes from machine learning (including deep learning) or from statistical modeling.

What are the different types of data?

  • Continuous

  • Categorical

  • Ordinal

  • Boolean

  • Datetime
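To make the list above concrete, here is a minimal sketch of how each type might be represented in pandas. The toy DataFrame and its column names are made up for illustration and are not part of the survey dataset used in the rest of this post.

```python
import pandas as pd

# Hypothetical toy data: column names are illustrative assumptions
toy_df = pd.DataFrame({
    # Continuous: real-valued measurements
    "salary": [52000.0, 61500.5, 48000.0],
    # Categorical: unordered labels
    "country": ["France", "India", "France"],
    # Ordinal: labels with a meaningful order
    "satisfaction": pd.Categorical(
        ["low", "high", "medium"],
        categories=["low", "medium", "high"],
        ordered=True,
    ),
    # Boolean: True/False flags
    "is_student": [True, False, False],
    # Datetime: timestamps
    "signup_date": pd.to_datetime(["2018-01-15", "2018-03-02", "2018-07-21"]),
})

# Print the data type pandas assigned to each column
print(toy_df.dtypes)
```

Note that pandas stores plain categorical text as a generic string/object column unless you convert it, so an ordinal column usually needs an explicit ordered Categorical like the one shown here.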

Get to know your data


# Import pandas
import pandas as pd
# Import Combined_DS_v10.csv into so_survey_df
so_survey_df = pd.read_csv("/content/Combined_DS_v10.csv")
# Print the first five rows of the DataFrame
print(so_survey_df.head())
# Print the data type of each column
print(so_survey_df.dtypes)

Selecting specific data types

Often a dataset will contain columns with several different data types (like the one you are working with). The majority of machine learning models require you to have a consistent data type across features. Similarly, most feature engineering techniques are applicable to only one type of data at a time. For these reasons among others, you will often want to be able to access just the columns of certain types when working with a DataFrame.


# Create a subset of only the numeric columns
so_numeric_df = so_survey_df.select_dtypes(include=['int', 'float'])
# Print the column names contained in so_numeric_df
print(so_numeric_df.columns)
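The same method can also select everything that is not numeric. The sketch below uses a small made-up DataFrame (the column names are assumptions, not columns from the survey file) and the generic "number" selector, which covers all integer and float widths.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy_df = pd.DataFrame({
    "age": [25, 32, 41],
    "salary": [52000.0, 61500.5, 48000.0],
    "country": ["France", "India", "France"],
})

# Keep only numeric columns (all int and float widths)
numeric_df = toy_df.select_dtypes(include="number")
# Keep everything except numeric columns
non_numeric_df = toy_df.select_dtypes(exclude="number")

print(list(numeric_df.columns))
print(list(non_numeric_df.columns))
```

Splitting a DataFrame this way lets you apply numeric techniques and categorical techniques to the right columns separately.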

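Dealing with categorical variables

One-hot encoding and dummy variables

To use categorical variables in a machine learning model, you first need to represent them in a quantitative way. The two most common approaches are to one-hot encode the variables or to use dummy variables. One-hot encoding creates one binary column per category, while dummy encoding drops the first category, which is then implied when all the other columns are 0.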


# Convert the Country column to a one-hot encoded DataFrame
one_hot_encoded = pd.get_dummies(so_survey_df, columns=['Country'],
                                 prefix='OH')
# Print the column names
print(one_hot_encoded.columns)
# Create dummy variables for the Country column
dummy = pd.get_dummies(so_survey_df, columns=['Country'],
                       drop_first=True, prefix='DM')
# Print the column names
print(dummy.columns)
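The difference between the two encodings is easier to see on a tiny example. This is a sketch on a made-up three-country column, not the survey data.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy_df = pd.DataFrame({"Country": ["France", "India", "USA", "India"]})

# One-hot encoding: one column per category
one_hot = pd.get_dummies(toy_df, columns=["Country"], prefix="OH")
# Dummy encoding: the first category (alphabetically) is dropped
dummy = pd.get_dummies(toy_df, columns=["Country"], prefix="DM",
                       drop_first=True)

print(list(one_hot.columns))
print(list(dummy.columns))
```

With dummy encoding, a row where every DM column is 0 stands for the dropped category (France here). Dropping one column avoids perfectly collinear features, which matters mostly for linear models.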


Dealing with uncommon categories


# Create a series out of the Country column (copy to avoid modifying a view of the DataFrame)
countries = so_survey_df['Country'].copy()
# Get the counts of each category
country_counts = countries.value_counts()
# Create a mask for only categories that occur fewer than 10 times
mask = countries.isin(country_counts[country_counts < 10].index)
# Label the uncommon categories as Other
countries[mask] = 'Other'
# Print the updated category counts
print(countries.value_counts())
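If you need this in more than one place, the steps above can be wrapped in a small helper. The function name group_rare_categories and its parameters are hypothetical, introduced only for this sketch, and the data is made up.

```python
import pandas as pd

def group_rare_categories(series, min_count, other_label="Other"):
    """Return a copy of series with categories seen fewer than
    min_count times replaced by other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; replace the rest with other_label
    return series.where(~series.isin(rare), other_label)

# Hypothetical toy data for illustration only
countries = pd.Series(["India"] * 4 + ["France"] * 3 + ["Chile", "Peru"])
grouped = group_rare_categories(countries, min_count=3)

print(grouped.value_counts())
```

Because where returns a new Series, the original column is left untouched, which makes it easy to compare the counts before and after grouping.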

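Numeric: Binarizing columns

While numeric values can often be used without any feature engineering, there will be cases when some form of manipulation is useful. Here, for example, you only care whether a respondent is paid at all, not how much, so the salary column is reduced to a 0/1 flag.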


# Replace missing ConvertedSalary values with 0
so_survey_df['ConvertedSalary'] = so_survey_df['ConvertedSalary'].fillna(0)
# Create the Paid_Job column filled with zeros
so_survey_df['Paid_Job'] = 0
# Set Paid_Job to 1 where ConvertedSalary is > 0
so_survey_df.loc[so_survey_df['ConvertedSalary'] > 0, 'Paid_Job'] = 1
# Print the first five rows of the columns
print(so_survey_df[['Paid_Job', 'ConvertedSalary']].head())
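As an alternative sketch, the same flag can be built in one vectorized step by casting a Boolean comparison to integers. The toy values below are made up. Note that a comparison with a missing value evaluates to False, so missing salaries become 0 in the flag without having to overwrite them in the salary column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
toy_df = pd.DataFrame({"ConvertedSalary": [0.0, 45000.0, np.nan, 120000.0]})

# True where the salary is positive, then cast True/False to 1/0
toy_df["Paid_Job"] = (toy_df["ConvertedSalary"] > 0).astype(int)

print(toy_df)
```

Whether missing salaries should really count as unpaid is a modeling decision; this sketch simply mirrors the treatment used in the snippet above.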

Binning values

For many continuous variables, you care less about the exact value than about the bucket it falls into. Binning converts a numeric column into such buckets, either with equal-width intervals or with boundaries you choose yourself.



# Bin the continuous variable ConvertedSalary into 5 equal-width bins
so_survey_df['equal_binned'] = pd.cut(so_survey_df['ConvertedSalary'], 5)
# Print the first 5 rows of the equal_binned column
print(so_survey_df[['equal_binned', 'ConvertedSalary']].head())
# Next, bin ConvertedSalary using the boundaries in the list bins and label the bins using labels
# Import numpy
import numpy as np
# Specify the boundaries of the bins
bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]
# Bin labels
labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']
# Bin the continuous variable ConvertedSalary using these boundaries
so_survey_df['boundary_binned'] = pd.cut(so_survey_df['ConvertedSalary'],
                                         bins=bins, labels=labels)
# Print the first 5 rows of the boundary_binned column
print(so_survey_df[['boundary_binned', 'ConvertedSalary']].head())
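To check how values land in each bin, it helps to run pd.cut on a handful of known numbers. The salaries below are made up for illustration. By default pd.cut treats each bin as closed on the right, so a value equal to a boundary falls into the lower bin.

```python
import numpy as np
import pandas as pd

# Hypothetical toy salaries for illustration only
salaries = pd.Series([5000, 10000, 30000, 75000, 120000, 200000])

# Custom boundaries with labels; bins are (lower, upper] by default
bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]
labels = ["Very low", "Low", "Medium", "High", "Very high"]
boundary_binned = pd.cut(salaries, bins=bins, labels=labels)

# Three equal-width bins; labels=False returns the integer bin index
equal_width = pd.cut(salaries, 3, labels=False)

print(pd.DataFrame({
    "salary": salaries,
    "boundary_binned": boundary_binned,
    "equal_width_bin": equal_width,
}))
```

Notice that 10000 is labeled 'Very low' rather than 'Low' because of the right-closed intervals. Passing right=False to pd.cut flips this if you prefer left-closed bins.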


