top of page
learn_data_science.jpg

Data Scientist Program

 

Free Online Data Science Training for Complete Beginners.
 


No prior coding knowledge required!

Writer's pictureRubash Mali

Understanding and Analysis Of NAICS Dataset

Introduction


The North American Industry Classification System (NAICS) is an industry classification system developed by the statistical agencies of Canada, Mexico and the United States. NAICS is designed to provide common definitions of the industrial structure of the three countries and a common statistical framework to facilitate the analysis of the three economies.


NAICS works as a hierarchical structure for defining industries at different levels of aggregation. For example a 2-digit NAICS industry (e.g., 23 - Construction) is composed of some 3-digit NAICS industries (236 - Construction of buildings, 237 - Heavy and civil engineering construction, and a few more 3-digit NAICS industries). Similarly, a 3-digit NAICS industry (e.g., 236 - Construction of buildings), is composed of 4-digit NAICS industries (2361 - Residential building construction and 2362 -Non-residential building construction).


15 CSV files beginning with RTRA. These files contain employment data by industry at different levels of aggregation; 2-digit NAICS, 3-digit NAICS, and 4-digit NAICS. Columns mean as follows:

(i) SYEAR: Survey Year

(ii) SMTH: Survey Month

(iii) NAICS: Industry name and associated NAICS code in the bracket

(iv) _EMPLOYMENT_: Employment


We explore each hierarchy level of NAICS individually and then as a whole. Firstly we import required package's as follow:


import pandas as pd
import os
import glob
import matplotlib.pyplot as plt

Analysis of 2-digit NAICS

We first merge all 2-digit NAICS RTRA file using the globe library by matching the string 2NAICS to all RTRA files to get there paths and names as follow:


m=os.getcwd()
joined_files = os.path.join(m, "*2NAICS*.csv")
joined_list = glob.glob(joined_files)
df_2= pd.concat(map(pd.read_csv, joined_list), ignore_index=True)
df_2.shape
#Output
(5472,4)

We inspect some of the rows using the head function.

df.head()

Listing out the unique 2-digit NAICS industries as follow:


df_2.NAICS.unique()
#Output
array(['Accommodation and food services [72]','Administrative and support, waste management and remediation services [56]','Agriculture, forestry, fishing and hunting [11]',Arts, entertainment and recreation [71]', 'Construction [23]','Educational services [61]', 'Finance and insurance [52]','Health care and social assistance [62]','Information and cultural industries [51]','Management of companies and enterprises [55]','Manufacturing [31-33]','Mining, quarrying, and oil and gas extraction [21]','Other services (except public administration)[81]','Professional, scientific and technical services [54]','Public administration [91]','Real estate, rental and leasing [53]', 'Retail trade  [44-45]', 'Transportation and warehousing [48-49]', 'Utilities [22]','Wholesale trade  [41]'], dtype=object)

Total number of 2-digit NAICS is 20


len(df_2.NAICS.unique())
#Output
20

We group on the basis of NAICS and find the top 5 largest 2 digit sectors providing employment as follow :


df2_naic=df_2.groupby('NAICS')['_EMPLOYMENT_'].sum()
df2_naic.nlargest(5)
#Output
NAICS
Retail trade  [44-45]                     71565000
Health care and social assistance [62]    66872000
Manufacturing [31-33]                     49336250
Construction [23]                         47616500
Accommodation and food services [72]      46411000
Name: _EMPLOYMENT_, dtype: int64

We plot the employee count of each NAICS sector as follow:


df2_naic.plot(kind='bar',figsize=(10,6),title='Employee Count at diffrent  2 digit NAICS')
plt.ylabel("Employment")



We can observe the highest employee count for 2-digit NAICS is of Retail trade having code 44-45 and employee count of 71565000.


So we further investigate Retail trade and observe employee count per year as follow:


df_2_retail=df_2[df_2.NAICS=='Retail trade  [44-45]']
df_2_retail.groupby('SYEAR')['_EMPLOYMENT_'].sum()
#Output
SYEAR
1997    2742250
1998    2766750
1999    2719500
2000    2836500
2001    2832500
2002    2956250
2003    2991750
2004    2970250
2005    3015000
2006    3155250
2007    3288000
2008    3200750
2009    3282000
2010    3449750
2011    3317750
2012    3185000
2013    3342000
2014    3330250
2015    3216000
2016    3362250
2017    3484750
2018    3434750
2019    2685750
Name: _EMPLOYMENT_, dtype: int64

We create the line plot of the same a follow

df_2_retail.groupby('SYEAR'['_EMPLOYMENT_'].sum().plot(kind='line',figsize=(7,6))
plt.xlabel('year')
plt.ylabel('numbe of employees')

We can find that there is a sharp rise from late 90s to 2010 in employee for Retail sector. From 2010-2015 fluctuation can be seen. More interestingly there is a steep decline in the last year (2019). So we observe the last year of Retail trade in depth a follow


df_2_retail[df_2_retail.SYEAR==2019]













We can see that the employee count is missing (zero) for months October to December. So we inspect the entire 2 digit NACIS sectors and sum up there Employe count for year 2019 and months October to December.


df_2[(df_2.SYEAR==2019) & (df_2.SMTH>9)]._EMPLOYMENT_.sum()
#Output
0

Since the sum is zero we can conclude that employee count data is only provided till 2019 September for all 2 digit NACIS.

Another sector of interest is Construction so we further analyze on that sector.


df_2_construction=df_2[df_2.NAICS=='Construction [23]']
df_2_construction.groupby('SYEAR'['_EMPLOYMENT_'].sum().plot(kind='line',figsize=(7,7))
plt.xlabel('year')
plt.ylabel('numbe of employees')













Line plot of Construction Sector employee count per year is similar in pattern to that of the general NAICS plot shown above.


df_2_nonconstruction=df_2[df_2.NAICS!='Construction [23]']
c=df_2_construction.groupby('SYEAR')['_EMPLOYMENT_'].sum()
d=df_2_nonconstruction.groupby('SYEAR')['_EMPLOYMENT_'].sum()
df = pd.DataFrame({"Construction":c,"Others":d})
4ax = df.plot.bar(color=["SkyBlue","IndianRed"],  title="Construction Vs Other Sectors Employee Count Year graph",figsize=(15,7),width=0.8,stacked=True)
ax.set_xlabel("Year")
ax.set_ylabel("Employee")
plt.show()


The stack bar chart was plotted to showcase yearly Employee count of Construction Sector as compared to other NAICS sectors.


c.mean()/d.mean()*100
#Output
8.784898531299422

On an average 8.75% of total employees from 2 digit NAICS are from Construction Sector.


We also check the employee count each month from 1997 to 2019 for all 2 digit NAICS.

df2_smth=df_2.groupby('SMTH')['_EMPLOYMENT_'].sum()
df2_smth.plot(kind='bar',figsize=(8,6))
plt.xlabel('Month')
plt.ylabel('No of employees')


Not much information can be grasped from the plot.


We also check the employee count each year from 1997 to 2019 for all 2 digit NAICS.

df2_year=df_2.groupby('SYEAR')['_EMPLOYMENT_'].sum()
df2_year.plot(kind='bar',figsize=(10,6))
plt.xlabel('Year')
plt.ylabel('No of employees')

We observed gradual steady rise in employee and the sharp decline in year 2019 is due to missing values as concluded earlier above in this blog.

Analysis of 3-digit NAICS

Similar to 2 digit NAICS we merge all 3-digit NAICS RTRA files as follow


joined_list=[]
joined_files = os.path.join(m, "*3NAICS*.csv")
joined_list = glob.glob(joined_files)
df_3= pd.concat(map(pd.read_csv, joined_list), ignore_index=True)
df_3.head()

There are total of 103 3 digit NAICS.

len(df_3.NAICS.unique())
#Output
103

So we select only top 15 3 digit NAICS based on employee count and plot there employee count as follow.


df3_naic=df_3.groupby('NAICS'['_EMPLOYMENT_'].sum().nlargest(15)
df3_naic.plot(kind='bar',figsize=(10,6),title='Employee Count at diffrent 3 digit NAICS')
plt.ylabel("Employment")

i


Professional, scientific and technical services with code 541 has the highest employee count of about 44 milllion.So we further dig into it.


df_3_top_naic=df_3[df_3.NAICS=='Professional, scientific and technical services[541]']
df_3_top_naic.groupby('SYEAR')['_EMPLOYMENT_'].sum().plot(kind='line',figsize=(7,7))
plt.xlabel('year')
plt.ylabel('numbe of employees')

A steep rise can be observed and a fall in 2019. So we zoom into the year 2019.


df_3_top_naic[df_3_top_naic.SYEAR==2019]


Here also for the year 2019 on months 10 ,11 and 12 employee data is missing. We investigate for all other 3 digit NAICS as follow


df_3[(df_3.SYEAR==2019) & (df_3.SMTH>9)]._EMPLOYMENT_.sum()
#Output
0

So we can corroborate that like 2 digit NAICS, In 3 digit NACIS for the year 2019 and months 10 ,11 ,12 the employee data is missing.


Similarly the yearly and monthly employee count plots across all 3 digit NAICS are as follow



Analysis of 4-digit NAICS

We merge all 4 digit NAICS RTRA files similar to previous two as follow


joined_list=[]
joined_files = os.path.join(m, "*4NAICS*.csv")
joined_list = glob.glob(joined_files)
df_4= pd.concat(map(pd.read_csv, joined_list), ignore_index=True)
df_4.head()


Here we can observe that the NAICS only contains the NAICS code and no other further textual description.

There is total of 314 4 digit NAICS present in the data set.


len(df_4.NAICS.unique())
# Output 
314

We select only the top 15 4 digit NAICS based on employee count for further analysis.

df4_naic=df_4.groupby('NAICS'['_EMPLOYMENT_'].sum().nlargest(15)
df4_naic.plot(kind='bar',figsize=(8,10),title='Employee Count at diffrent 3 digit NAICS')
plt.ylabel("Employment")


Here sector 7225 has the highest employee count of about 33 million. So we further dive into it for analysis.


df_4_top_naic=df_3[df_4.NAICS==7225]
df_4_top_naic.groupby('SYEAR')['_EMPLOYMENT_'].sum().plot(kind='line',figsize=(7,7))
plt.xlabel('year')
plt.ylabel('number of employees')










The line plot is extremely fluctuating so we further dive into it.


 df_4_top_naic[df_4_top_naic.SYEAR==2019]

Here 7225 NAICS is available only for the months 2,5,8,11 for year 2019. So we again look for the year 2018 similarly.


df_4_top_naic[df_4_top_naic.SYEAR==2018]


Same is the case for year 2018. So we can conclude that there is alot of missing data for 4 digit NAICS


Yearly and monthly employee count plots across all 4 digit NAICS are as follow:




Combined Analysis

For combined analysis we firstly preprocess 2 and 3 digit NIACS by extracting there codes only from NIACS column as follows


 df_2['NAICS']=df_2.NAICS.str.extract('\[(.*?)\]')
 df_2.head()









df_3['NAICS']=df_3.NAICS.str.extract('\[(.*?)\]')
df_3.head()








For 4 digit NAICS no processing is required a direct code is provided in the NAICS column.

We merge all three of the dataframe as follows:


final=pd.concat([df_2,df_3,df_4],axis=0)
final.info()
#Output
<class 'pandas.core.frame.DataFrame'>
Int64Index: 119184 entries, 0 to 85571
Data columns (total 4 columns):
SYEAR           119184 non-null int64
SMTH            119184 non-null int64
NAICS           118632 non-null object
_EMPLOYMENT_    119184 non-null int64
dtypes: int64(3), object(1)
memory usage: 4.5+ MB

We can observe that many NAICS rows have nan value. So we remove these rows before further analysis.


final.dropna(subset = ["NAICS"],inplace=True)
final.info()
#Output
<class 'pandas.core.frame.DataFrame'>
Int64Index: 118632 entries, 0 to 85571
Data columns (total 4 columns):
SYEAR           118632 non-null int64
SMTH            118632 non-null int64
NAICS           118632 non-null object
_EMPLOYMENT_    118632 non-null int64
dtypes: int64(3), object(1)
memory usage: 4.5+ MB

We select the top 15 highest employee based NAICS and plot there count as follow


final_top_naic=final.groupby('NAICS'['_EMPLOYMENT_'].sum().nlargest(20)
final_top_naic.plot(kind='bar',figsize=(10,6),title='Employee Count of diffrent NAICS')
plt.ylabel("Employment")

We observe mostly 2 digit NAICS are at the top, few 3 digit NAICS and interestingly two 4 digit NAICS.


Conclusion

Hence we performed exploratory data analysis on NAICS dataset at different hierarchy level and combined as well and stated a few findings.

The github link to the code is as follow.










0 comments

Recent Posts

See All

Comments


bottom of page