Getting started with the SAS extraction programs¶
Overview¶
The goal of this project is to create a centralized repository of Medicare extraction programs. Centralizing the code makes it easier for bugs to be found and fixed. It also has the added impact of flattening the learning curve for new researchers to the Medicare claim files by leveraging the work done by researchers who have come before them. Many of the programs are nearly ready to be run "out of the box". The following section briefly describes an overview of the SAS code, outlines important restrictions which the program places on the extracts and details the process of setting up the programs to run in SAS.
The extraction occurs in four general steps, with each of the folders being
labeled in the order in which they should be run. For example the 0PreProcess
folder programs should be run before the 1PrimaryExtractionPrgms. Furthermore
programs which are either labeled 1A, 1B, or 1C should be run in the order
designated by the associated letter.
The /tmp drives, previously referred to as space drive, referenced in this
section are NBER specific drives which are local to the server processing the
code and helps avoid I/O constraints on the system, in effect speeding up how
fast your code is processed. Note this has recently changed due to system
upgrade and now the network drives are actually faster - depending on system
traffic.
Also for a general overview to thinking about scientific computing, this article provides a nice overview.
The following steps outline how to get started with the programs:
- Choose a text editor to use on the Linux servers - lots of choices
- Get comfortable with running code on the Linux servers
- Download code or use the git clone feature if you're working on the NBER servers
- Check to make sure that a proper DUA and user name folder has been set up for you on the space, disk and bulk drives. The following step will use these folders to create a directory structure for you.
- Define directories in the set_up_directory.sasandset_up_directory.do. This file will create the directory structure for you
- Decide on the group of ICD-9-CM and/or HCPCS codes which will define your cohort and which claims files will be used to flag candidate index events. Put these codes in the icd9_chrt_definition.sasand/orhcpcs_chrt_defintion.sas. If you are not using the carrier or outpatient files to define your cohort or are not using the HCPCS codes, then you can ignore thehcpcs_chrt_defintion.sasfiles
- Fill out the parameters in the shell programrunAllprgms.shfound under theTODO#tags. The program has been modified (April 2014) to allow users to simply define variables in this files which are then passed to SAS programs. Saves the hassle of opening and closing SAS files. Note users will still be able to run a single sas program but you will need to update the parameters. As you fill out the different variables it will be important to make sure you understand the different restrictions it passes onto the creation of the extract.
Warning
Only run programs in runAllprgms.sh that your DUA covers. Comment out those files you wish not to run.
Getting comfortable with the NBER Unix systems and SAS¶
Choosing a text editor¶
Choosing a text editor is important and is worth taking some time to choose one that fits your needs the best. Some text editors currently available on the servers are EMACS, pico, and vi, and there are many others available for use on your local machine. It is recommended that you use a text editor on the server or at least a text editor which is able to read files over ssh, in order to avoid the hassle of having to copy programs from your local machine to the server. Remember that results are often sensitive and should not be held on your local machine.
Although Emacs has a steep learning curve, I chose it because it is extensible and you can trust that the large user community has implemented most any feature you may want. Beside general coding ease, I have found it quite useful for writing tex files to display results and for keeping to do lists via the org-mode extension.
Emacs is customized through an initialization file read from the home directory.
The .emacs file needs to be created and placed in your home directory. A
sample .emacs file is provided at
Auxiliary/EMACSDOC.
Note that an old copy of ESS is supplied in the same directory. In order for
Emacs to successfully read your initialization file, the location of the actual
lisp programs (*.el) needs to be included in the
.emacs
file. Yet another useful Emacs module is the M-X-cua-mode which binds the key
short cuts for copy, paste, and undo to those found in the Windows environment.
Working with the Unix terminal¶
The terminal is an extremely powerful tool. It is accessible within Stata or
SAS, but you can also place a series of commands into a bash script to automate
parts your program. At a minimum, new users should get comfortable with
aliases and checking the system load with the top or uptime command. This
website provides a nice
overview of learning to use the terminal/shell.
For example, the SAS pack contains a bash script created to run the SAS and
Stata programs in the correct sequence and email the user in the case of a
crash. The program called
runAllprgms.sh
is a simple but useful example of what you can do with the terminal. To run bash
programs simply type bash runAllprgms.sh into the terminal, and assuming you
have assigned the right permissions it should run.
Learning SAS and how to use SAS macros¶
The programs developed to pull and harmonize the raw Medicare claims rely
heavily on the SAS macro facility. SAS macros allow users to create code which
takes on a series of parameters which replace keywords (¯o_parameter.) in
the code prior to execution. In this way a SAS macro can be though of as a do
file where a series of local variables at the top of the program are used to
change the code in meaningful ways.
Included in the SAS pack are SAS code snippets I have used to test different
feature of SAS code. The programs are available at
Auxiliary/codeTest,
and may be useful to someone getting started.
Setting up SAS programs¶
Assuming an individual has entered into a DUA with CMS, users need to gain access to the code, setup the directories correctly and define the type of episode interested in being studied. Each of these is reviewed in the following sub-sections.
Download programs¶
Download the zipped file and decompress it preserving the directory structure in a target location of your choosing.
Define target and input directories¶
Once the programs have been downloaded, the directory structure the programs use
needs to be configured. All directories which the programs use can be found in
the following two files
set_up_directory.sas
and
set_up_directory.do,
which define the same locations for the SAS and Stata files. The two files need
to be populated with the correct directories prior to running the program - many
of the directories should be the same.
There are generally two types of directories found in these files. Input
directories, which correspond to places where the programs will read data in
from. In these cases, the location of these raw files simply needs to be
defined. The other case, are output directories which the programs will write
to. The set_up_directory.sas file will generate the directories for you as
long as you define a base directory correctly.
Once the
set_up_directory.sas
and
set_up_directory.do
files are populated, run the SAS and/or Stata programs on the 0001% extracts
just to make sure that SAS/Stata has correctly read the directories which you
have defined. At the very beginning of the log file, SAS will check to make sure
that the directories defined exist. Check these notes which SAS outputs to make
sure that you directories are properly defined. The following code snippet
provides an example of the SAS WARNING LOG that occurs with incorrectly defined
directories.
libname tempDir "&medtmp."; WARNING: Apparent symbolic reference MEDTMP not resolved. NOTE: Library TEMPDIR does not exist.
Code: Sample SAS log error code: Incorrectly set up directories
Define ICD-9-CM/HCPCS cohort of interest¶
After defining the directory structure, the next order of business involves
correctly specifying your cohort. Studies using administrative claims use
ICD-9-CM or HCPCS codes to define a group of individuals with a similar disease.
ICD-9-CM codes come in both diagnostic and procedure types, whereas the HCPCS
codes are used to define a specific procedure. The code currently has several
examples embedded directly into the code. Note that folder
1AFlagcandidateEvents
holds a series of different programs corresponding to each claim file of
interest. Each program calls in
icd9_chrt_definition.sas
and/or
hcpcs_chrt_defintion.sas
depending on what is reported in the claim. The programs use these two files to
look for a cohort definition.
Once a series of ICD-9-CM/HCPCS codes of interest have been selected, a short
bit of code needs to be placed in the relevant file. Note that in addition to
choosing the ICD-9-CM code, a name stub (ami in the example below) also needs
to be referenced which all of the following programs will use reference the
cohort. For example, the heart attack example uses ami. All programs will use
this stub as a parameter in the macro to correctly identify the index of
interest. Future programs (e.g. carrier_clm_window.sas) will also need to use
the same exact stub name to load in the correct data sets.
As the following code snippet demonstrates, the new
icd9_chrt_definition.sas
SAS code now allows for users to define an index event through the use of a
procedure code. Users for the most part should only need to change the if
condition with the diagnostic code and the is_snf condition when changing to
another cohort of interest.
************************************************************; *** DEFINES AMI ICD9-CM CODE SEQUENCE; ************************************************************; %if &cohrt. = ami %then %do; ***!!!!!! DEAL WITH DIAGNOSTIC CODE- ONLY IF NUMDGN CODES >0; %if &numDgnCodes.>0 %then %do; do i=1 to &numDgnCodes.; **LOOP THROUGH DIAGNOSTIC CODES; if substr(dgns_cd_arr(i),1,3) = "410" and ((length(dgns_cd_arr(i)) = 5 and substr(dgns_cd_arr(i),5,1) ne "2") or (length(dgns_cd_arr(i)) < 5) ) then do; dgns_cnt_arr(i)=1; dgnIndxNum=i; dICD9_indx=dgns_cd_arr(i); end; %end; ***!!!INSERT PROCEDURE CODE OG INTEREST IF ANY- ONLY IF PRCDGN CODES >0; %if &numProcCodes.>0 %then %do; **LOOP THROUGH PROCEDURE CODES; do i=1 to &numProcCodes.; ** LOOP THROUGH PROCEDURE CODE; /* EXAMPLE PROCEDURE CODE ONLY if (length(prcdr_cd_arr(i)) =4 ) and ( substr(prcdr_cd_arr(i),1,4) = "3606" or substr(prcdr_cd_arr(i),1,4) = "3607" or substr(prcdr_cd_arr(i),1,3) = "360" or substr(prcdr_cd_arr(i),1,3) = "361" or substr(prcdr_cd_arr(i),1,4) = "3722" or substr(prcdr_cd_arr(i),1,4) = "3723" or substr(prcdr_cd_arr(i),1,4) = "8855" or substr(prcdr_cd_arr(i),1,4) = "8856" or substr(prcdr_cd_arr(i),1,4) = "8857" ) then do; prcdr_cnt_arr(i)=1; pICD9_indx=prcdr_cd_arr(i); prcIndxNum=i; end; */ end; %end; **Other options specific to claims file here; %if &file. = medpar %then %do; **Drop if index event occurs at SNF; if is_snf=1 then delete; %end; %end;
Code: Sample icd9_chrt_definition.sas code: Defining the AMI cohort
Define SAS parameters in the runAllprgms.sh file¶
The next step is to go through and define all of the parameters which are passed
onto SAS programs found in the runAllprgms.sh file. In doing so it is
important to understand the different restrictions used in building the cohort.
Note that an index event is defined as the first episode care for a cohort, it
denotes the beginning of an episode of care. The following list provides and
overview of the steps in building the cohorts and the relevant restrictions.
- Flag potential (candidate) index event using the procedure/diagnostic codes defined in icd9_chrt_definition.sasand/orhcpcs_chrt_defintion.sas. The subset of patients are selected by the 1A programs- Parameter COHRT: Thecohrtvariable should match the code block found in those files.
- Parameters IP,OP,MEDPAR,CAR: These binary flags denote which files were used to define the index event. In the example AMI covered in this documentation only theMEDPARparameter would be set to 1 and the others to 0.
- Paramters IP_DXN,IP_PROC, ...: The parameters denote how many diagnostic or procedure codes to look through when defining an index event. For example, in our AMI cohort, as in with many cohorts, the cohort is built only of the principal diagnostic code in the MedPAR file which coincides with the first diagnostic code. In that example only theMED_DXNparameter is set to 1 with the rest of the parameters set to 0.
 
- Parameter 
- 
Merge in enrollment information and impose enrollment restrictions ( 1B_impose_restrictions.sasprogram)- Parameter DAYS_BETWEEN_INDEX_EVENT: Number of days which must pass before an individuals can have another index event assigned. The assumption is very important because in a sense it defines how much time must pass before we want to treat a new candidate episode as a new event. It also takes care of transfers since the first index event will be assigned.
- Parameters FFSmonthpostandFFSmonthpre: The individual is continuously enrolled in Parts A and B of traditional Medicare (rather than a Medicare private plan) during the twelve months before AMI and until twelve months after AMI, or until death. (months can be changed by the variablesFFSmonthpost/FFSmonthpre)
- Parameters HMOmonthpostandHMOmonthpre: The individual is continuously not in HMO during the twelve months before AMI and until twelve months after AMI, or until death. (months can be changed by the variablesHMOmonthpost/HMOmonthpre)
- 
The individual is at least 66 years of age at the time of the AMI Note This is hard coded and would need to be changed in the 1B program. 
- 
No bad date of death or birthday Note This is hard coded and would need to be changed in the 1B program. 
 
- Parameter 
- 
Pull claims around the index event dates, using the index date (first treatment date for disease/procedure of interest) which have been harmonized 
- Run secondary programs to define comorbidities and cost statistics- Raw cost statistics: DAY1-DAY7variable denote theXday cost you want to be calculated. For example settingDAY1=30,DAY2=90andDAY3=365would create 30-day, 90-day, and 365-day costs where the day references the number of days occurring after the index event date.
- Parameter YEARIN: Since cost statistics combine information across claims, you need to define the earliest year of the claim you are using, it's usually the MedPAR data.
 
- Raw cost statistics: 
In order to run a specific SAS program, the wrapper macro needs to be invoked
with the correct set of parameters. Most of the time the parameters are very
similar across the programs and is why the runAllprgm.sh bash file was created
to feed each SAS program its parameters. Note that the parameters are documented
at the end of each program, with some of the most common parameters defined in
the following table. The programs are generally saved with a sample wrapper
program call at the end. The parameters in many cases will also define the name
of the data set created or loaded up depending on the program. Therefore, it is
important to use the same cohort stub in the different programs, so that the
correct files are loaded.
| Parameter | Description | 
|---|---|
| cohort1 | These reference the ICD-9-CM codes in the code | 
| cohort2 | You can add as many of these as you would like, however if you start running several (4 plus) the efficiently of the program maybe impacted | 
| cohortN | Define the total number of cohorts you are creating | 
| yearin | Start year for the run | 
| yearout | End year for the run | 
| pct | Define the percent extract you are interested in using | 
| dflag | Set equal to one if you want to delete raw files as you run the program | 
| clmDTA | Set equal to one if you want to convert the all files into a Stata format (DTA), as long as stat transfer is on the server you are running | 
| crashStart | If program crashed, set equal to the year program crashed and it will restart from that year and aggregate the program once finished processing from the yearin to yearout dates. (Carrier and Outpatient extract files only) | 
| win(year) | Define the window around the index event the program is to extract (in years). Note when changing this you may also be interested in changing the window which determines how consecutive index events get dropped. For example currently the program drops AMIs from the index events which occur within the past year. Search variable name last_dgn_datein the1B_impose_restrictions.sasprogram for more info. | 
| subsquentDgnDayRest(days) | Defines the minimum amount of time allowed to pass before a beneficiaries with more then one candidate index event can be treated as a index event. The restriction drops all candidate index events which are within XXdays of each other and it keeps the first one. The restriction is important because it helps ensure that no two index events in the final data set are directly related, through a transfer or something similar. | 
| HMOmonth[pre/post](months) | Defines the number of months of continuous HMO enrollment need before and after the index event month, including the month of the index event, used to restrict the sample; | 
| FFSmonth[pre/post](months) | Defines the number of months of continuous FFS enrollment need before and after the index event month, including the month of the index event, used to restrict the sample; | 
Table: Parameters definitions in wrapper macros
In the case that either the carrier or outpatient extraction programs are run
and they crash, it is easy to restart the programs from the most recent year
which they were running. Refer to the crashStart parameters. For example if
you are running a carrier extract between 2000 and 2008 and the program crashes
on 2002. First you would create a copy of your log file and save it some where
else. Next simply run the following wrapper call from within the Carrier
extraction program found at the end of the program.
%RunCarExtraction(pct=05, cohrt1=ami, cohrtN=1, yearin=2000,yearout=2008,crashStart=2002, win=1, clmDTA=1 );
Code: Sample Carrier Extraction Wrapper call: Running the program after a crash
Program sequencing¶
As noted above, some programs are dependent on the data created by other
programs. In particular, the files in the
1cohort_extraction
folder should be run first. The exact sequencing is given in the following
outline. Any of the programs found in the folder 1A_flag_candidate_evnt
(bullet point 2a below) should be run before the programs found in
1C_pull_claims (bullet point 2b below). 1B_impose_restrictions.sas should be
run before 2b op_clm_window.sas and any of the programs in
2create_index_level_measures should be run before any of the programs found in
3merge_cohort_level_data.
All programs rely on the set_up_directory.sas and set_up_directory.do files
to define the directory structure used throughout the project. In addition the
sequencing of programs is important and follows the hierarchy found below.
Conversely, the runAllprgms.sh bash script can be used to call all programs in
the correct order after the macros in each program have been correctly been
populated.
- 0pre_process- Individuals working on the NBER servers do not need to worry about running these programs. They have been run and the preprocessed datasets are held on the servers.
 
- Harmonization of the raw data (three sub steps)(folder:1cohort_extraction)- 1A_flag_candidate_evnt: the programs found here loop through the raw data and pull beneficiaries which match the criteria defined in the- icd9_chrt_definition.sasand/or- hcpcs_chrt_defintion.sas.
- 1B_impose_restrictions.sas: The single file is run after all the candidates have been flagged and saved in a separate folder. Beside the initial flagging of beneficiaries, all of the restrictions used to define the final index events are imposed in this file. The program merges all candidates, imposes restriction and defines index events with relevant demographic variables generated.
- 1C_pull_claims: The programs found in this files pull a relevant window around the index claims. This final dataset contains all the harmonized claims for each of the different claims files.
 
- Using the harmonized raw files, programs in this folder build index event level measures like 30-day costs or comorbidity profiles. (folder:2create_index_level_measures) These programs use output from the previous step to create variables which aggregate measures to the index event level.
- The final program attempts to build a final analysis file for you which is a the index event level. It merges in datasets onto the index event level. (folder:3merge_cohort_level_data). This is a good folder-file to get comfortable with.
Running the SAS program on the Linux servers¶
Running the SAS extraction programs on the Linux servers are fairly straightforward, however there are some additional options which should be used due to the length of time it take to run the programs.
Many of the extraction programs can run for several days due to the size of the
extracts. Using the nohup command on the UNIX servers is important when
running these programs because it will continue running the program even if you
log off or are disconnected from the server. For example, running the following
from the terminal
nohup sas carrier_clm_window.sas &
carrier_clm_window.sas program in the background until it
crashes or finishes running, even if you log off the servers. The ampersand
informs the servers to run the program in the background. The program will only
stop if it finishes or reaches an error which forces it to crash.
In order to avoid running out of space in the work directory, SAS allows you to
set the work directory. Generally the SAS work directory is used when any data
set in the code is referred to without a directory or explicitly with the
directory work.data set. The work directory is written by default to the /tmp
directory of the current server. However users can easily overrun the /tmp
directory and cause all programs using the /tmp to crash on the current server
(not just their program).
This problem is easily avoided by by invoking SAS with the additional options
from the terminal as follows. As a rule of thumb, you only need to worry about
setting the work directory if you are encountering problems with space in the
/tmp drives. Note that most users need not do this unless their programs are
crashing due to /tmp space.
Note that all the programs have been sequenced for you already and you could
just run the runAllprgms.sh script after setting up by typing bash
runAllprgms.sh from the terminal.
nohup sas -work INSERT/DIRECTORY/HERE carrier_clm_window.sas &
Code: Invoke SAS from terminal with separate user defined work directory - Not Suggested
Factors impacting run times¶
There are two main factors a user has control over that directly impact the run times. A user can determine which server to run their program on; making sure a server is not overloaded will increase the speed of your (and everyone else's!) programs. Being aware of which disks you are writing to can have drastic improvements to speed. The programs optimize this for NBER users.
It is important to check the system load before starting a SAS program, as
overloading a machine will slow the processing of all the jobs to a crawl.
Therefore it is best to check back later and submit the jobs once the server
load is reduced. Checking the server load can be done from the terminal by using
the uptime command, which prints out several pieces of information. The last
three numbers give system load average for the last minute, 5 minutes and 15
minutes. A number over 3 represents a machine that is loaded. For example:
> uptime 16:28:45 up 75 days, 1:51, 2 users, load average: 0.15, 0.05, 0.14
Server load is something which most users will not directly have control over as it depends what other people are running. By contrast, deciding which disks to write to is something which a new user will have direct control over and which also will greatly impact the speed of the program. For example, when decompressing files it is best to write the decompressed file to a local directory (i.e. space drive). Note that writing to a local directory can cut processing times in half.
Minor data extraction inconsistencies¶
There is the possibility that some inconsistencies occur in the extracted data due to the restrictions placed on the code. The percent of claims impacted are very small, but in the event that an individual attempts to benchmark a specific measure, like number of unique claims, some inconsistencies may occur. For these reasons the following section describes when we may expect claims to be duplicated in the extraction process.
Possible claim duplication¶
The possibility for claims to be duplicated in the extracted data exists because of the way in which claims are pulled. The extraction program goes through and identifies an index event and then pulls a one-year window of claims around the index event. Due to the procedure, the possibility for a claim to be duplicated exists in the event that an individual has more than one potential index event which occurred more than a year apart but not greater than two years.
Index events within a year of each other for a specific individual are dropped from the sample, leaving the first index event occurrence in the final data set. This leaves individuals with an index event occurring more than 1 year apart in the sample. However since the procedure take a 1-year window, around each index event, some duplication of claims maybe introduced. The same is true for individuals pulling claims greater then a 1-year window.
Note that this occurs for an extremely small part of the sample, something on the order of 0.001%.
EHIC switchers enrollment information¶
In some cases a bene_id associated with more than one EHIC may have two
separate EHIC entries in the denominator files. A subset of these bene_ids may
report different enrollment information on the separate EHICs associated with a
single bene_id, although they should be the same. The most recent enrollment
observation from the file is kept since it is unclear how to treat these cases.
Estimating run times¶
In order to help aid researchers in planning for projects and optimizing their
run times, the following tables provide approximate run times for each program.
Note that the op_clm_window.sas and carrier_clm_window.sas program can be
run at the same time as long as they are started about 15 minutes or so apart.
The first macro they both call uses the indx dataset output from the 1B
program. If you are experiencing much different run times and the servers are
not over loaded, then there is likely an issue with how you have set up the
directories. A first good check is to make sure you are utilizing the local
agesasscratch# disks for as much of the writing as possible. This can speed
processing times by a factor of two easily. Note that the times given here cover
all Medicare data years for each file type.
| Program | 5% | 20% | 100% | 
|---|---|---|---|
| Programs found in the 1A_flag_candidate_evntfolder  (some are optional) | |||
| ip_candidates.sas | 14 | ||
| medpar_candidates.sas | 5 | 240 | |
| opcandidatest.sas | 38 | <1700 | |
| carrier_candidates.sas | 78 | 370 | |
| Program found in 1cohort_extraction | |||
| 1B_impose_restrictions.sas | 30 | 45 | 60 | 
| Programs found in the 1C_pull_claimsfolder | |||
| ip_clm_window.sas | 1 | ||
| medpar_clm_window.sas | 5 | <50 | <190 | 
| op_clm_window.sas | 28 | <1000 | <1700 | 
| carrier_clm_window.sas | 28 | <350 | 
Table: Program run times (estimates) in minutes for an AMI cohort running all available years
Adding new variables to the extract¶
The current version of the code uses a series of tab delimited text files which have crosswalks between the final harmonized variable of interest and the raw variable name in the medicare claims files for each year. In addition, the crosswalk notes the max length of the variable, whether the variable is defined as a number or character and whether or not the variable was generated. A series of SAS macros take the information held within these crosswalks and generate a harmonized data set.
Looking at the .lst file for any extraction program prints out the crosswalk
which has the convenience of providing the exact crosswalk used for the current
program.
Using text file crosswalk has the added convenience that adding a new file to
the extracts is extremely easy as long as the variable is not changing from
numeric to character or vice versa during the time period. Even in this latter
case the process is simple although a bit more involved. Assuming that the
variable stays the same type throughout the time period of interest simply add a
row to the crosswalk file and the program will generate the new file for you.
The cross walks can be found in the claim_xw folder.
The latest version no longer uses views and the files are harmonized and saved on the servers - this save considerable run times.
| Program | Crosswalk Files | 
|---|---|
| Medpar*.sas | medparxw.txt | 
| Ip*.sas | ipxw.txt | 
| Op*.sas | opxw.txt | 
| Carrier*.sas | carrierxw.txt | 
Table: Crosswalk text files used by the claims files
Using Git¶
For NBER users working on the NBER servers who have pulled the code using the
git clone command, there are several other useful features which Git affords
you. To begin with Git is a distributed revision control system which is similar
to SVN with the exception that it is distributed. This basically means that each
instance of Git pulls the entire code history which allows users to easily look
at the entire history of change made to a file by typing in git log -p
direc/filename.here. The power of Git is in the fact that it allows you to
easily share/compare code with other users and easily created branches to your
code to test certain features and avoid messing with working code. To begin with
you can always type git log to see the most recent commit and version of the
code you are using.