/*

HS-SIC-NAICS-SITC Concordance Project

This program

	1. Reads in the hs-sic and hs-naics concordances from the monthly trade cd files and
         uses two mechanical matches to fill in naics matches prior to 2000 and sic 
         matches after 2001.
	2. Although step 1 generates many matches, some codes remain unmatched. 
         These were given to an RA (thanks kitjawat!) to match by hand. The hand matches are merged in 
         after the mechanical matches. (Note that one typo in the handmatched file is fixed in this .do file.)
	3. In the future, we intend to explore using our hs-over-time concordances to avoid the need for any
         hand-matching. 

copyright Peter K. Schott / Justin R. Pierce

This program, associated files and a working paper describing our overall HS-SIC-NAICS concordance effort 
can be found at http://www.som.yale.edu/faculty/pks4/

2008.08.21 first version
2009.10.16 current version 

*/


**0 Prelim
clear
set more off
set mem 500m


**1 SIC Mapping
foreach zzz in exp imp {

	**1.1 read in the hs-sic mappings provided by census in its monthly trade cd files
	cd "C:\Users\pks4\Documents\My Dropbox\research\concordances\production\for_schott\"

	*create list of mappings
	use `zzz'_concord_89_106, clear
	*keep latest year for which sic is available (year is coded as year-1900, so 101 = 2001)
	keep if year==101
	keep commodity sic
	drop if sic==""
	duplicates drop commodity, force 
	sort commodity
	save temp0, replace
	
	*read in the list of raw hs10 codes (exports or imports, depending on the loop pass)
	use `zzz'_concord_89_106, clear
	*only need to match years in which sic data are not provided
	keep if year>101
	keep commodity
	duplicates drop commodity, force
	sort commodity
	merge commodity using temp0, keep(sic) 
	tab _merge
	drop _merge
	destring commodity, force g(hs)
	egen sic87=group(sic)
	save `zzz'temp_01, replace

	*save group-sic mapping for below
	use `zzz'temp_01, clear
	collapse (mean) sic87, by(sic)
	rename sic87 sic87_new1
	rename sic sic_new1
	drop if sic_new1=="" | sic87_new1==.
	sort sic87_new1
	save temp1, replace
	
	use `zzz'temp_01, clear
	collapse (mean) sic87, by(sic)
	rename sic87 sic87_new2
	rename sic sic_new2
	drop if sic_new2=="" | sic87_new2==.
	sort sic87_new2
	save temp2, replace
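
	*Illustrative sketch (toy data, hypothetical names toysic and toyid; not part of the
	*concordance). It shows why the string sic codes are converted with egen group() above:
	*each distinct code gets a numeric id, blanks get a missing id, and the mean, sd and
	*count used by the mechanical matches can then be computed on that id.
	preserve
	clear
	set obs 4
	gen str4 toysic = "2011" in 1/2
	replace toysic  = "2621" in 3
	egen toyid      = group(toysic)
	*ids follow the sort order of the codes; the blank code in row 4 gets no id
	assert toyid==1 in 1/2
	assert toyid==2 in 3
	assert toyid==. in 4
	restore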


	**1.2 First Mechanical Match
	**Create new matches mechanically by looking to see what the already-matched sic look like. 
	**Look at all codes sharing an hs9 prefix to see what sic87 the already-matched ones have;
	**if unanimous, use that. If not, go up one level (hs8), and so on down to hs2.
	use `zzz'temp_01, clear
	gen sic87_new1 = sic87
	sum hs sic87*
	quietly {
	  foreach x in 9 8 7 6 5 4 3 2 {
		noisily display [`x']
		local y       = 10-`x'
		*double keeps 8- and 9-digit prefixes exact; the default float type rounds them
		gen double hs`x' = int(hs/(10^`y'))
		egen t1       = mean(sic87), by(hs`x')
		egen t2       = sd(sic87), by(hs`x')
		egen t3       = count(sic87), by(hs`x')
		gen sic87_`x' = t1 if t2==0 | t3==1
		replace sic87_new1 = sic87_`x' if sic87==. & sic87_new1==.
		drop t1 t2 t3
		drop hs`x' sic87_`x'
	  } 
	}
	sum hs sic87 sic87_new1
	sort hs
	save `zzz'temp_02, replace
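
	*Illustrative sketch of the unanimity rule in 1.2 for a single level (toy data with
	*made-up hs codes and group ids; the names toy87 and toy87_new are hypothetical).
	*Rows 1-3 share one 9-digit prefix and the two matched rows agree on group 5, so the
	*unmatched row 3 inherits 5. Rows 4-6 share another prefix but the matched rows
	*disagree (3 vs 4), so row 6 stays unmatched at this level.
	preserve
	clear
	set obs 6
	gen double hs   = 1011000010 + _n in 1/3
	replace hs      = 2022000020 + (_n-3) in 4/6
	gen toy87       = 5 in 1/2
	replace toy87   = 3 in 4
	replace toy87   = 4 in 5
	*assumption: the prefix is stored as double so that 9-digit values stay exact
	gen double hs9  = int(hs/10)
	egen t1         = mean(toy87), by(hs9)
	egen t2         = sd(toy87), by(hs9)
	egen t3         = count(toy87), by(hs9)
	gen toy87_new   = toy87
	replace toy87_new = t1 if (t2==0 | t3==1) & toy87==.
	assert toy87_new==5 in 3
	assert toy87_new==. in 6
	restore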
	
	**1.3 Second Mechanical Match
	**Look at gaps. If the last known and next known matches are the same, use them to fill in the gap.
	use `zzz'temp_02, clear
	gen sic87_new2 = sic87_new1
	gen begin      = 1 if sic87_new1==. & sic87_new1[_n-1]~=.
	gen end        = sic87_new1==. & sic87_new1[_n+1]~=.
	gen bsum       = sum(begin)
	gen gap        = sic87_new1==.
	replace bsum=. if gap==0
	gen sb         = sic87_new1[_n-1]*begin
	gen se         = sic87_new1[_n+1]*end
	egen tb        = mean(sb), by(bsum)
	egen te        = mean(se), by(bsum)
	gen match      = tb==te
	replace sic87_new2 = tb if match==1 & sic87_new1==.
	sum hs sic87*
	drop begin end bsum gap sb se tb te match
	sort hs
	save `zzz'temp_03, replace
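
	*Illustrative sketch of the gap rule in 1.3 (toy data; the names toy, toy_fill, gbeg and
	*gend are hypothetical). Rows 2-3 lie between two 5s and are filled with 5. Row 5 lies
	*between 5 and 7 and stays open. Row 7 has no following match and stays open.
	preserve
	clear
	set obs 7
	gen hs          = _n
	gen toy         = 5 in 1
	replace toy     = 5 in 4
	replace toy     = 7 in 6
	gen toy_fill    = toy
	*gbeg flags the first row of each gap, gend the last row; bsum numbers the gaps
	gen gbeg        = 1 if toy==. & toy[_n-1]~=.
	gen gend        = toy==. & toy[_n+1]~=.
	gen bsum        = sum(gbeg)
	replace bsum    = . if toy~=.
	*sb carries the match just before the gap, se the match just after it
	gen sb          = toy[_n-1]*gbeg
	gen se          = toy[_n+1]*gend
	egen tb         = mean(sb), by(bsum)
	egen te         = mean(se), by(bsum)
	replace toy_fill = tb if tb==te & toy==.
	assert toy_fill==5 in 2/3
	assert toy_fill==. in 5
	assert toy_fill==. in 7
	restore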
	
	**1.4 Recover groups from above
	use `zzz'temp_03, clear
	sort sic87_new1
	merge sic87_new1 using temp1, keep(sic_new1)
	tab _merge
	drop _merge
	sort sic87_new2
	merge sic87_new2 using temp2, keep(sic_new2)
	tab _merge
	drop _merge
	sort hs
	gen t=sic87_new1~=.
	tab t
	drop t 
	drop sic87*
	format hs %15.0g
	drop if hs<100
	save `zzz'_concord_89_106_sicfillin, replace

}



**2 NAICS Mapping
foreach zzz in exp imp {

	**2.1 read in the hs-naics mappings provided by census in its monthly trade cd files
	
	*create list of mappings
	use `zzz'_concord_89_106, clear
	*keep earliest year for which naics is available
	keep if year==100
	keep commodity naics
	drop if naics==""
	duplicates drop commodity, force 
	sort commodity
	save temp0, replace
	
	*read in the list of raw hs10 codes (exports or imports, depending on the loop pass)
	use `zzz'_concord_89_106, clear
	*Only need years for which there is no naics
	keep if year<100
	keep commodity
	duplicates drop commodity, force
	sort commodity
	merge commodity using temp0, keep(naics) 
	tab _merge
	drop _merge
	destring commodity, force g(hs)
	egen naics87=group(naics)
	save `zzz'temp_01, replace

	*save group-naics mapping for below
	use `zzz'temp_01, clear
	collapse (mean) naics87, by(naics)
	rename naics87 naics87_new1
	rename naics naics_new1
	drop if naics_new1=="" | naics87_new1==.
	sort naics87_new1
	save temp1, replace
	
	use `zzz'temp_01, clear
	collapse (mean) naics87, by(naics)
	rename naics87 naics87_new2
	rename naics naics_new2
	drop if naics_new2=="" | naics87_new2==.
	sort naics87_new2
	save temp2, replace


	**2.2 First Mechanical Match
	**Create new matches mechanically by looking to see what the already-matched naics look like. 
	**Look at all codes sharing an hs9 prefix to see what naics87 the already-matched ones have;
	**if unanimous, use that. If not, go up one level (hs8), and so on down to hs2.
	use `zzz'temp_01, clear
	gen naics87_new1 = naics87
	sum hs naics87*
	quietly {
	  foreach x in 9 8 7 6 5 4 3 2 {
		noisily display [`x']
		local y       = 10-`x'
		*double keeps 8- and 9-digit prefixes exact; the default float type rounds them
		gen double hs`x' = int(hs/(10^`y'))
		egen t1       = mean(naics87), by(hs`x')
		egen t2       = sd(naics87), by(hs`x')
		egen t3       = count(naics87), by(hs`x')
		gen naics87_`x' = t1 if t2==0 | t3==1
		replace naics87_new1 = naics87_`x' if naics87==. & naics87_new1==.
		drop t1 t2 t3
		drop hs`x' naics87_`x'
	  } 
	}
	sum hs naics87 naics87_new1
	sort hs
	save `zzz'temp_02, replace
	
	**2.3 Second Mechanical Match
	**Look at gaps. If the last known and next known matches are the same, use them to fill in the gap.
	use `zzz'temp_02, clear
	gen naics87_new2 = naics87_new1
	gen begin      = 1 if naics87_new1==. & naics87_new1[_n-1]~=.
	gen end        = naics87_new1==. & naics87_new1[_n+1]~=.
	gen bsum       = sum(begin)
	gen gap        = naics87_new1==.
	replace bsum=. if gap==0
	gen sb         = naics87_new1[_n-1]*begin
	gen se         = naics87_new1[_n+1]*end
	egen tb        = mean(sb), by(bsum)
	egen te        = mean(se), by(bsum)
	gen match      = tb==te
	replace naics87_new2 = tb if match==1 & naics87_new1==.
	sum hs naics87*
	drop begin end bsum gap sb se tb te match
	sort hs
	save `zzz'temp_03, replace
	
	**2.4 Recover groups from above
	use `zzz'temp_03, clear
	sort naics87_new1
	merge naics87_new1 using temp1, keep(naics_new1)
	tab _merge
	drop _merge
	sort naics87_new2
	merge naics87_new2 using temp2, keep(naics_new2)
	tab _merge
	drop _merge
	sort hs
	gen t=naics87_new1~=.
	tab t
	drop t 
	drop naics87*
	format hs %15.0g
	drop if hs<100
	save `zzz'_concord_89_106_naicsfillin, replace

}

**3 Add in hand matches to imports and exports, respectively, first for sic and then for naics
**  Any missing matches after the last section were matched by hand by kitjawat. Add these
**  hand matches into the data here and then also create a variable that identifies each 
**  mapping according to whether it is from Census, mechanical match 1, mechanical match 2 or
**  from kitjawat's hand matching. 
**
**  2009.10.16 change sic 2612 to 2621 in kitjawat_handmatch_imports_sic_20080821 per Justin's email
**             also add leading zero to sic's from handmatch and fix missing naics for 1605106000
**
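
*Illustrative sketch of the match-type labelling used below (toy data; the names c0-c3 and
*code are hypothetical stand-ins for the Census code, the two mechanical matches and the
*hand match). Each source is consulted only when every earlier source is empty, so the
*label records the first source that supplied a code.
preserve
clear
set obs 4
gen str4 c0  = ""
gen str4 c1  = ""
gen str4 c2  = ""
gen str4 c3  = ""
*as in the real data, each mechanical match starts from the previous result
replace c0   = "2011" in 1
replace c1   = "2011" in 1
replace c1   = "2013" in 2
replace c2   = c1
replace c2   = "2015" in 3
replace c3   = "2621" in 4
gen str23 id = "From Census"
gen str4 code = c0
replace id   = "From mechanical match 1" if c0==""
replace code = c1 if c0==""
replace id   = "From mechanical match 2" if code==""
replace code = c2 if code==""
replace id   = "From hand match" if code==""
replace code = c3 if code==""
assert id=="From Census" & code=="2011" in 1
assert id=="From mechanical match 1" & code=="2013" in 2
assert id=="From mechanical match 2" & code=="2015" in 3
assert id=="From hand match" & code=="2621" in 4
restore
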

use imp_concord_89_106_sicfillin, clear
sort hs
merge hs using kitjawat_handmatch_imports_sic_20080821
tab _merge
drop if _merge==2
replace kitjawat = 2621 if kitjawat==2612
drop _merge
gen sic_new3=sic_new2
tostring kitjawat, g(kitjawats)
replace kitjawats = "0"+kitjawats if kitjawat>=100 & kitjawat<=999 
replace sic_new3=kitjawats if sic_new3=="" & kitjawats!=""
replace sic_new3="" if sic_new3=="."
sort hs
merge hs using sic_imp_jrp
tab _merge
replace sic_new3=sic_new4 if sic_new3=="" & sic_new4!=""
codebook sic_new3
gen id         = "From Census"
gen newsic     = sic
replace id     = "From mechanical match 1" if sic==""
replace newsic = sic_new1 if sic==""
replace id     = "From mechanical match 2" if newsic==""
replace newsic = sic_new2 if newsic==""
replace id     = "From hand match" if newsic==""
replace newsic = kitjawats if newsic==""
label var id "SIC match type"
keep commodity hs newsic id
rename newsic sic 
rename id sic_matchtype
rename sic new_sic
keep commodity new_sic sic_matchtype
order commodity new_sic sic_matchtype
sort commodity
save sic_imp_final, replace

use imp_concord_89_106_naicsfillin, clear
sort hs
merge hs using kitjawat_handmatch_imports_naics_20081016
tab _merge
drop if _merge==2
drop _merge
gen naics_new3=naics_new2
tostring kitjawat, g(kitjawats)
replace kitjawats = "311711" if commodity=="1605106000"
replace naics_new3=kitjawats if naics_new3=="" & kitjawats!=""
replace naics_new3="" if naics_new3=="."
sort hs
merge hs using naics_imp_jrp
tab _merge
replace naics_new3=naics_new4 if naics_new3=="" & naics_new4!=""
codebook naics_new3
gen id           = "From Census"
gen newnaics     = naics
replace id       = "From mechanical match 1" if naics==""
replace newnaics = naics_new1 if naics==""
replace id       = "From mechanical match 2" if newnaics==""
replace newnaics = naics_new2 if newnaics==""
replace id       = "From hand match" if newnaics==""
replace newnaics = kitjawats if newnaics==""
label var id "NAICS match type"
drop naics
rename newnaics naics 
rename id naics_matchtype
rename naics new_naics
keep commodity new_naics naics_matchtype
order commodity new_naics naics_matchtype
sort commodity
save naics_imp_final, replace

use exp_concord_89_106_sicfillin, clear
sort hs
merge hs using kitjawat_handmatch_exports_sic_20080821
tab _merge
drop if _merge==2
drop _merge
gen sic_new3=sic_new2
tostring kitjawat, g(kitjawats)
replace kitjawats = "0"+kitjawats if kitjawat>=100 & kitjawat<=999 
replace sic_new3=kitjawats if sic_new3=="" & kitjawats!=""
replace sic_new3="" if sic_new3=="."
sort hs
merge hs using sic_exp_jrp
tab _merge
replace sic_new3=sic_new4 if sic_new3=="" & sic_new4!=""
codebook sic_new3
gen id         = "From Census"
gen newsic     = sic
replace id     = "From mechanical match 1" if sic==""
replace newsic = sic_new1 if sic==""
replace id     = "From mechanical match 2" if newsic==""
replace newsic = sic_new2 if newsic==""
replace id     = "From hand match" if newsic==""
replace newsic = kitjawats if newsic==""
label var id "SIC match type"
drop sic
rename newsic sic 
rename id sic_matchtype
rename sic new_sic
keep commodity new_sic sic_matchtype
order commodity new_sic sic_matchtype
sort commodity
save sic_exp_final, replace

use exp_concord_89_106_naicsfillin, clear
sort hs
merge hs using kitjawat_handmatch_exports_naics_20081016
tab _merge
drop if _merge==2
drop _merge
gen naics_new3=naics_new2
tostring kitjawat, g(kitjawats)
*replace kitjawats = "0"+kitjawats if kitjawat>=100 & kitjawat<=999 
replace naics_new3=kitjawats if naics_new3=="" & kitjawats!=""
replace naics_new3="" if naics_new3=="."
sort hs
merge hs using naics_exp_jrp
tab _merge
replace naics_new3=naics_new4 if naics_new3=="" & naics_new4!=""
codebook naics_new3
gen id           = "From Census"
gen newnaics     = naics
replace id       = "From mechanical match 1" if naics==""
replace newnaics = naics_new1 if naics==""
replace id       = "From mechanical match 2" if newnaics==""
replace newnaics = naics_new2 if newnaics==""
replace id       = "From hand match" if newnaics==""
replace newnaics = kitjawats if newnaics==""
label var id "NAICS match type"
drop naics
rename newnaics naics 
rename id naics_matchtype
rename naics new_naics
keep commodity new_naics naics_matchtype
order commodity new_naics naics_matchtype
sort commodity
save naics_exp_final, replace

**4 Reassemble HS-SIC-NAICS data for all years

*Imports
use imp_concord_89_106, clear
sort commodity
merge commodity using sic_imp_final
tab _merge
drop _merge
replace sic_matchtype="From Census" if sic!=""
replace sic=new_sic if sic=="" & new_sic!=""
sort commodity
merge commodity using naics_imp_final
tab _merge
drop _merge
replace naics_matchtype="From Census" if naics!=""
replace naics=new_naics if naics=="" & new_naics!=""
drop new* descrip*
destring commodity, g(hs) force
append using imp_107_concord
append using imp_108_concord
append using imp_109_concord
order commodity hs year sic sic_matchtype naics naics_matchtype
sort commodity year
save hs_sic_naics_imports_89_109_20111004, replace
outsheet using hs_sic_naics_imports_89_109_20111004.csv, replace

*Exports
use exp_concord_89_106, clear
sort commodity
merge commodity using sic_exp_final
tab _merge
drop _merge
replace sic_matchtype="From Census" if sic!=""
replace sic=new_sic if sic=="" & new_sic!=""
sort commodity
merge commodity using naics_exp_final
tab _merge
drop _merge
replace naics_matchtype="From Census" if naics!=""
replace naics=new_naics if naics=="" & new_naics!=""
drop new* descrip*
destring commodity, g(hs) force
*This drops several special classification codes for U.S. goods returned from Puerto Rico
drop if hs<10
append using exp_107_concord
append using exp_108_concord
append using exp_109_concord
order commodity hs year sic sic_matchtype naics naics_matchtype
sort commodity year
save hs_sic_naics_exports_89_109_20111004, replace
outsheet using hs_sic_naics_exports_89_109_20111004.csv, replace
