top of page
learn_data_science.jpg

Data Scientist Program

 

Free Online Data Science Training for Complete Beginners.
 


No prior coding knowledge required!

Writer's pictureMariam Ahmed

An important station on my way of Data Analysis (NAICS - Project)

There was an Arab investor who decided to invest his money in North American countries, specifically in three countries: Canada, Mexico, and the United States But he didn't know in which industry he would invest all that money He decided to use a data analyst to help him make his decision The analyst realized that these countries use a system called the NAICS Classification Which depends on its definition of a given code for each industry in a hierarchical system This analyst started collecting data on this system from 1997 to 2018 Then the analysis began Which discovered that one of the most important and largest industries that are held in those countries It is the construction industry, followed by food services This is because the number of employees and its employees was increasing over time according to his analysis of that time series The data analyst came back and told him what he had concluded: If he wants to invest in a large industry that employs many people and can open many branches, then he should turn to the construction industry. But if he wants a medium-sized industry with a kind of entertainment, he should open a chain of restaurants or cafes The investor thought and made his decision... And you, what would your decision be if you were in his place?


Let's see our operations...


NAICS 2017 – Statistics Canada: Description of the North American Industry Classification System (NAICS). All you would need to understand for this task is, how the NAICS works as a hierarchical structure for defining industries at different levels of aggregation. For example 3-digit NAICS industries (236 - Construction of buildings, 237 - Heavy and civil engineering construction, and a few more 3-digit NAICS industries).

Summary of the data set files we will be using:


  1. 15 RTRA (Real-Time Remote Access) CSV files containing employment data by industry at different levels of aggregation, 2-digit NAICS, 3-digit NAICS, and 4-digit NAICS. We will search through rows with 2, 3, or 4 digits NAICS and append employment data each month of each year from 1997 - 2018 Meaning of columns in the files:

  • SYEAR: Survey Year

  • SMTH: Survey Month

  • NAICS: Industry name and associated NAICS code in the bracket

  • _EMPLOYMENT_: Employment

LMO Detailed Industries by NAICS: An excel file for mapping the RTRA data to the desired data row. Columns in the file:

  • Column 1 lists all 59 industries that are used frequently

  • Column 2 lists the industry's NAICS definitions

As part of our data wrangling, we would create a dataset of monthly employment series from 1997 to 2018 for the industries. One of the guiding principles for our data wrangling is to try to create each series from the highest possible level of aggregation in the raw data files, thus, if an LMO Detailed Industry is defined with a 2-digit NAICS only, we would not use a lower level of aggregation (i.e. 3-digit or 4-digit level NAICS files in the RTRA), similarly, if an LMO Detailed Industry is defined with a 3-digit NAICS only, we would not use the 4-digit NAICS files for that industry.


# Importing the needed libaries

import numpy as npimport pandas as pdimport matplotlib.pyplot as pltimport seaborn as snsimport re
# Uploading filesfrom google.colab 

import filesuploaded = files.upload()

for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length}     bytes'.format(name=fn, length=len(uploaded[fn])))  
    

# Reading  LMO_Detailed_Industries_by_NAICS  & 

data_outputlmo_detailes = pd.read_excel('LMO_Detailed_Industries_by_NAICS.xlsx')
data_output=pd.read_excel('Data_Output_Template.xlsx')
# Clean LMO_Detailed_Industries_by_NAICSCleaned

NAISC=lmo_detailes['NAICS'].astype(str).str.replace(' &',
',').str.split(',')
lmo_detailes['NAICS']=CleanedNAISClmo_detailes.head()

lmo.info()
data_output.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 59 entries, 0 to 58
Data columns (total 2 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   LMO_Detailed_Industry  59 non-null     object
 1   NAICS                  59 non-null     object
dtypes: object(2)
memory usage: 1.4+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15576 entries, 0 to 15575
Data columns (total 4 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   SYEAR                  15576 non-null  int64  
 1   SMTH                   15576 non-null  int64  
 2   LMO_Detailed_Industry  15576 non-null  object 
 3   Employment             15576 non-null  float64
dtypes: float64(1), int64(2), object(1)
memory usage: 486.9+ KB

#clean the RTRA files NAICS column then Create NAIC drop the old one 

def Cleaning(df):
                df['NAICS'] = df.NAICS\
                        .map(lambda x:x.split('[' [1].strip(']').replace('-', ','))
                df = df.drop(columns = 'NAICS', axis = 1)[df.SYEAR < 2019]
                return df
#clean the RTRA files NAICS column then Create NAIC drop the old one 

def Cleaning(df):
                        df['NAICS'] = df.NAICS\
                        .map(lambda x:x.split('[')[1].strip(']').replace('-', ','))
                        df = df.drop(columns = 'NAICS', axis = 1)[df.SYEAR < 2019]
                        return df
# loading 2 digit code dataset

df2_naics = pd.read_csv('RTRA_Employ_2NAICS_00_05.csv')
employ_2_naics_files = ['RTRA_Employ_2NAICS_06_10.csv',  'RTRA_Employ_2NAICS_11_15.csv', 'RTRA_Employ_2NAICS_16_20.csv',  'RTRA_Employ_2NAICS_97_99.csv']

for i in employ_2_naics_files:
    df = pd.read_csv(i)
    df2_naics = df2_naics.append(df,ignore_index=True)
    
    
df2_naics.head()

# loading 3 digit code dataset

df3_naics = pd.read_csv('RTRA_Employ_3NAICS_00_05.csv')
employ_3_naics_files = ['RTRA_Employ_3NAICS_06_10.csv',  'RTRA_Employ_3NAICS_11_15.csv', 'RTRA_Employ_3NAICS_16_20.csv',  'RTRA_Employ_3NAICS_97_99.csv']

for i in employ_3_naics_files:
    df = pd.read_csv(i)
    df3_naics = df3_naics.append(df, ignore_index=True)

df3_naics.head()
# loading 4 digit code dataset

df4_naics = pd.read_csv('RTRA_Employ_4NAICS_00_05.csv')
employ_4_naics_files = ['RTRA_Employ_4NAICS_06_10.csv', 'RTRA_Employ_4NAICS_11_15.csv', 'RTRA_Employ_4NAICS_16_20.csv', 'RTRA_Employ_4NAICS_97_99.csv']

for i in employ_4_naics_files:
    df = pd.read_csv(i)
    df4_naics = df4_naics.append(df, ignore_index=True)
    
df4_naics.head()
df2_naics.info()
df3_naics.info()
df4_naics.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5472 entries, 0 to 5471
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   SYEAR         5472 non-null   int64 
 1   SMTH          5472 non-null   int64 
 2   NAICS         5472 non-null   object
 3   _EMPLOYMENT_  5472 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 171.1+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28140 entries, 0 to 28139
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   SYEAR         28140 non-null  int64 
 1   SMTH          28140 non-null  int64 
 2   NAICS         28140 non-null  object
 3   _EMPLOYMENT_  28140 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 879.5+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85572 entries, 0 to 85571
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   SYEAR         85572 non-null  int64
 1   SMTH          85572 non-null  int64
 2   NAICS         85572 non-null  int64
 3   _EMPLOYMENT_  85572 non-null  int64
dtypes: int64(4)
memory usage: 2.6 MB
# merge the output template & RTRA final 'data_output' ready for analysis

data_output_merged = pd.merge
(lmo, data_output,on='LMO_Detailed_Industry', how='left')

data_output_merged.NAICS =
data_output_merged.NAICS.astype('str')

data_output_merged.head(25)
# merging all datasets [2, 3, 4 digits]

rtra_total = pd.concat([df2_naics, df3_naics, df4_naics], ignore_index=True)rtra_total





Now we are ready for Analysis...


Q1. What's the most Industry corresponding to Employment?


- We get that the Construction had the high number

data_output_grouped_lmo[:16].plot(kind='barh')
plt.xlabel('Employment')
plt.ylabel('Industries')

Q2. What happened to the employment in Construction evolved over time compared to employment in other industries?


construction=data_output_merged.loc[data_output_merged['LMO_Detailed_Industry']=='Construction']


construction_by_year=construction.groupby(['SYEAR'])['Employment'].sum()
construction_by_year
SYEAR
1997    1489750.0
1998    1424750.0
1999    1363500.0
2000    1345750.0
2001    1347250.0
2002    1401500.0
2003    1412000.0
2004    1705000.0
2005    1989000.0
2006    2097000.0
2007    2301000.0
2008    2616750.0
2009    2444750.0
2010    2380750.0
2011    2365750.0
2012    2383250.0
2013    2453000.0
2014    2406500.0
2015    2417750.0
2016    2536000.0
2017    2743250.0
2018    2860750.0
Name: Employment, dtype: float64
from IPython.core.pylabtools import 

figsizeplt.plot(construction_by_year.index,construction_by_year.values)
plt.xticks(range(1997,2019))
plt.title("Employment in Construction evolved overtime")
plt.xlabel("Month and Year")
plt.ylabel("Employment")plt.show(figsize(10,10))


Q3. How has employment in Construction evolved over time compared to employment in other industries?



total_employment=construction.groupby("SMTH")["Employment"].sum()
total_employment =
total_employment.reset_index()plt.figure(figsize=(20,10))
sns.lineplot(x="SMTH", y="Employment", data=total_employment, label="Total Employment")
sns.lineplot(x="SMTH", y="Employment", data=construction, label="Construction Employment")
plt.show()

Q4. How the employment changed every year?


index = data_output_merged.groupby('SYEAR')['Employment'].idxmax()
print(data_output_merged.loc[index].head())
                                  LMO_Detailed_Industry  \
5550  Other retail trade (excluding cars and persona...   
5567  Other retail trade (excluding cars and persona...   
5570  Other retail trade (excluding cars and persona...   
5586  Other retail trade (excluding cars and persona...   
5602  Other retail trade (excluding cars and persona...   

                                                  NAICS  SYEAR  SMTH  \
5550  ['442,443,444,445,446,447,448,451,452,453', '4...   1997     7   
5567  ['442,443,444,445,446,447,448,451,452,453', '4...   1998    12   
5570  ['442,443,444,445,446,447,448,451,452,453', '4...   1999     3   
5586  ['442,443,444,445,446,447,448,451,452,453', '4...   2000     7   
5602  ['442,443,444,445,446,447,448,451,452,453', '4...   2001    11   

      Employment  
5550    211000.0  
5567    219750.0  
5570    218250.0  
5586    219500.0  
5602    219250.0  
sns.lineplot(data=index.pct_change())

Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f95f9d10dd0>

Q5. What's the average of employment for each industry over time?

fig, ax = plt.subplots(1, 1, figsize=(15,15))
ax.plot(pd.crosstab(data_output_merged
['LMO_Detailed_Industry'], [data_output_merged['SYEAR']], values = data_output_merged['Employment'], aggfunc='mean'))
plt.xticks(rotation='vertical')
plt.show()

That's it, I hope this article was worth reading and helped you acquire new knowledge no matter how small.


Feel free to check up on the notebook. You can find the results of code samples in this post.




0 comments

Recent Posts

See All

Comments


bottom of page