Time Series Analysis of NAICS: Monthly employment data series from 1997 to 2018



Introduction


The North American Industry Classification System (NAICS) is a classification of business establishments by type of economic activity (the process of production). It is used by governments and businesses in Canada, Mexico, and the United States of America.


NAICS works as a hierarchical structure for defining industries at different levels of aggregation. For example, a 2-digit NAICS industry (e.g., 23 - Construction) is composed of several 3-digit NAICS industries (236 - Construction of buildings, 237 - Heavy and civil engineering construction, and a few others).

Similarly, a 3-digit NAICS industry (e.g., 236 - Construction of buildings) is composed of 4-digit NAICS industries (2361 - Residential building construction and 2362 - Non-residential building construction).



For this post, we will work only with the first four digits.
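
As a small, illustrative sketch (not part of the original post) of this prefix-based hierarchy, using the construction codes above, a 4-digit code always begins with the digits of its parent 3-digit and 2-digit industries:


# Illustration only: NAICS codes are hierarchical by prefix
four_digit_codes = ["2361", "2362"]      # Residential / Non-residential building construction
for child in four_digit_codes:
    parent_3_digit = child[:3]           # "236" - Construction of buildings
    parent_2_digit = child[:2]           # "23"  - Construction
    print(child, "->", parent_3_digit, "->", parent_2_digit)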


Data


The data consists of 15 CSV files whose names begin with RTRA. These files contain employment data by industry at different levels of aggregation: 2-digit, 3-digit, and 4-digit NAICS. The columns are as follows:

(i) SYEAR: Survey Year

(ii) SMTH: Survey Month

(iii) NAICS: Industry name and associated NAICS code in brackets

(iv) _EMPLOYMENT_: Employment
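
As a quick sanity check (assuming the CSV files sit in a ./data folder, as in the code later in this post), we can inspect the columns of one RTRA file:


import glob
import pandas as pd

# Peek at one RTRA file to confirm the expected columns (the path is an assumption)
sample = pd.read_csv(glob.glob('./data/*.csv')[0])
print(sample.columns.tolist())   # expected: SYEAR, SMTH, NAICS, _EMPLOYMENT_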


Our goals in this post


  1. Fill the empty column with the data from our prepared file.

  2. Answer some questions: how has employment in Construction evolved, and how does it compare to total employment across all industries?


Data Import and cleaning


First, let's import the libraries we will use.


import re
import glob
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Then, let's create a function to retrieve all data for a selected digit level. Before that, we need to get the list of all CSV files in the data folder.


# get all csv files in data folder
files = glob.glob('./data/*.csv')

def create_df(digit:int) -> pd.core.frame.DataFrame :
    """
    Create a sorted dataframe for selected digit using files already in data folder that match with '_{digit}NAICS'
    Parameters
    ----------    
    digit : int
        Number of digit of NAICS file we want to retrieve
      
    files : list
        Global parameter, list of csv files in data folder.
        
    Returns    
    -------    
    pd.core.frame.DataFrame
        sorted dataframe of NAICS with n digits.      
    """
    global files
    matched = [pd.read_csv(file) for file in files if re.search("_{}NAICS".format(digit), file)]
    return pd.concat(matched).sort_values(by=['SYEAR', 'SMTH'])

Now that we have created our function, we can get the data for the 2-digit, 3-digit, and 4-digit NAICS levels.


df_2_digit = create_df(2)
df_3_digit = create_df(3)
df_4_digit = create_df(4)

Let's preview the first dataset.

df_2_digit.head(25)

For time-series purposes, we need to create a new DATE column by concatenating the SYEAR and SMTH columns.

The type of this column will be datetime.

After creating this column, we will remove the SYEAR and SMTH columns from each dataset.

As before, let's write a function to do this.


def create_date(df: pd.core.frame.DataFrame) -> pd.core.frame.DataFrame:
    """
    Create the date column in the selected dataset then remove unused columns 
    Parameters
    ----------    
    df : pd.core.frame.DataFrame
        the dataframe we want to change
      
    Returns    
    -------    
    pd.core.frame.DataFrame
        the input dataframe with ['DATE'] and without ['SYEAR', 'SMTH'] columns
    """
    df['DATE'] = pd.to_datetime(df['SYEAR'].astype('str') + df['SMTH'].astype('str'), format='%Y%m')
    df.drop(columns=['SYEAR', 'SMTH'], inplace=True)
    return df

Let's add a DATE column for each data frame.


df_2_digit = create_date(df_2_digit)
df_3_digit = create_date(df_3_digit)
df_4_digit = create_date(df_4_digit)

Now let's see the result for the first dataset again.


df_2_digit.head()

To match against the LMO_Detailed_Industries_by_NAICS file, we have to split the NAICS code from the industry name in our data frames.


def create_code(df: pd.core.frame.DataFrame) -> pd.core.frame.DataFrame:
    """
    Create the code column in the selected dataset 
    Parameters
    ----------    
    df : pd.core.frame.DataFrame
        the dataframe we want to change
      
    Returns    
    -------    
    pd.core.frame.DataFrame
        the input dataframe with ['CODE']     
    """
    df['CODE'] = df['NAICS'].astype('str').str.extract(r'(\d+)')
    df['NAICS'] = df['NAICS'].astype('str').str.split('[').str.get(0)
    return df

df_2_digit = create_code(df_2_digit)
df_3_digit = create_code(df_3_digit)
df_4_digit = create_code(df_4_digit)

df_2_digit.head()

From now on, we will work with the 4-digit dataset.

We will try to get, for each row, the matching code from LMO_Detailed_Industries_by_NAICS.


First, let's load the LMO_Detailed_Industries_by_NAICS file.


LMO = pd.read_excel('./data/LMO_Detailed_Industries_by_NAICS.xlsx')
LMO.head(6)

Now we will create a function to get the code when there is a match.


def match_industry(df: pd.core.frame.DataFrame) -> list:
    """
    Get the matched industry code for each row
    Parameters
    ----------    
    df : pd.core.frame.DataFrame
        the dataframe we want to match
        
    LMO : global value, the LMO_Detailed_Industries_by_NAICS dataframe
      
    Returns    
    -------    
    list
        the list of matched LMO industry codes (np.nan where there is no match)
    """
    global LMO
    # Clean the LMO NAICS column and extract the leading digits of each code
    codes = (LMO['NAICS'].astype('str')
             .str.replace('.', '', regex=False)
             .str.replace('&', ',', regex=False)
             .str.strip()
             .str.extract(r'(\d+)', expand=True)
             .values.ravel())
    match = []
    for i, row in df.iterrows():
        selected = None
        for code in codes:
            # A row matches when its CODE starts with the digits of the LMO code
            if code not in [None, np.nan] and row['CODE'][:len(str(code))] == str(code):
                selected = code
        match.append(selected if selected is not None else np.nan)
    return match

df_4_digit['CODE'] = match_industry(df_4_digit)
print(df_4_digit['CODE'].head(20))

Now we will drop all rows where there is no match.


df_4_digit.drop(df_4_digit[df_4_digit['CODE'].isna()].index, inplace=True)
df_4_digit.drop(columns='NAICS', inplace=True)
df_4_digit.head() 

Finally, we map each matched code to the corresponding industry name from the LMO file.


codes = (LMO['NAICS'].astype('str')
         .str.replace('.', '', regex=False)
         .str.replace('&', ',', regex=False)
         .str.strip()
         .str.extract(r'(\d+)', expand=True)
         .values)
lmo_dict = {row['LMO_Detailed_Industry']: codes[i] for i, row in LMO.iterrows()}

# Assign each row the name of the LMO industry whose code it matched
industries = []
for i, row in df_4_digit.iterrows():
    for name, industry_codes in lmo_dict.items():
        if row['CODE'] in industry_codes:
            industries.append(name)
            break  # keep a single industry name per row

df_4_digit['NAICS'] = industries

Let's see an overview of the resulting dataset.


df_4_digit.groupby(['NAICS', 'DATE']).agg({'_EMPLOYMENT_': 'sum'})

Employment by industry


Let's look at total employment across all industries.


total_counts = df_4_digit.groupby('NAICS')['_EMPLOYMENT_'].sum().sort_values(ascending=False)
total_df = pd.DataFrame({'Industry':total_counts.index, 'Employments':total_counts.values})
plt.figure(figsize=(13,13))
sns.barplot(x='Employments', y='Industry', data = total_df)

We notice that the Construction industry is the largest employer.
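
To see exactly which industries lead, we can list the top five from the series computed above; these should correspond to the industries compared at the end of this post.


# Top five industries by total employment (total_counts is already sorted descending)
print(total_counts.head(5))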


The year with the most employment


employment_by_year = df_4_digit.groupby(df_4_digit.DATE.dt.year)['_EMPLOYMENT_'].sum().sort_values(ascending=False)
total_df = pd.DataFrame({'Year':employment_by_year.index, 'Employments':employment_by_year.values})
plt.figure(figsize=(13,7))
sns.barplot(x='Year', y='Employments', data = total_df)

The year with the highest total employment is 2005.
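
We can confirm this directly from the series computed above:


# Year with the highest total employment
print(employment_by_year.idxmax())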


Time Series


data = pd.DataFrame(df_4_digit.reset_index().groupby(['NAICS', 'DATE'], as_index=False)['_EMPLOYMENT_'].sum())
data = data.pivot(index='DATE', columns='NAICS', values='_EMPLOYMENT_')
data['total'] = data.sum(axis=1)
data = data.iloc[:-3]            # keep all but the last three rows
data.index.freq = 'MS'           # monthly (month start) frequency
data.fillna(0, inplace=True)
data.plot(subplots=True, figsize=(10, 120))

The full figure is quite long, so we only display the first six industries for presentation. Let's focus on the total and on Construction.
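
A more readable option, as a minimal sketch, is to plot only the two series we focus on here:


# Plot only the total and the Construction series for readability
data[['total', 'Construction']].plot(figsize=(13, 7))
plt.show()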


Overall employment analysis


data['total'].plot(figsize=(13, 7))

We can distinguish several growth peaks at intervals of roughly five years: employment falls after each peak and then rises again to reach the next one. Broadly speaking, total employment appears to follow a roughly five-year cycle.
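
To make this multi-year pattern easier to see, one option (an illustrative addition, not part of the original analysis) is to smooth the monthly total with a 12-month rolling mean:


# 12-month rolling mean of total employment to smooth out monthly noise
data['total'].rolling(window=12).mean().plot(figsize=(13, 7))
plt.show()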


Distribution of employment in construction


sns.boxplot(y=data['Construction'])

We find that the construction industry employs on average about 50,000 people per month. Roughly three quarters of the monthly values range from 0 to 120,000 jobs, with a peak of around 250,000 jobs.
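
The figures quoted above can be checked against a numerical summary of the monthly values:


# Summary statistics (mean, quartiles, min/max) of monthly Construction employment
print(data['Construction'].describe())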


Comparing the top 5 industries


data['Construction'].cumsum().plot(figsize=(13, 7),legend=True)
data['Food services and drinking places'].cumsum().plot(figsize=(13, 7), legend=True)
data['Repair, personal and non-profit services'].cumsum().plot(figsize=(13, 7), legend=True)
data['Elementary and secondary schools'].cumsum().plot(figsize=(13, 7),legend=True)
data['Hospitals'].cumsum().plot(figsize=(13, 7), legend=True)

We find that, in terms of cumulative employment, the construction industry has not always been the largest. From 1997 until almost 2013 it was overtaken by Food services and drinking places; after that date it rose strongly. These two industries clearly outperform all the others.


Conclusion


We can conclude that the construction industry provides the most jobs overall, even though it has not always been the leading industry.


Reference


The source for the data is Real-Time Remote Access (RTRA) data from the Labour Force Survey (LFS) by Statistics Canada.


This article was originally written by Kossonou Kouamé Maïzan Alain Serge, as part of the Data Insight data scientist program.

The full code is here on my GitHub.
