Constructing the Extracts in SAS¶
Overview¶
The goal of the project is to create a series of SAS programs which extract a Medicare cohort which are scalable to the 100 percent Medicare claims files. The initial cohorts are extracted from the files in SAS. Next a series of SAS programs process the original extracts until information has been sufficiently aggregated to the index level, resulting in substantially reduced file sizes. Finally the data is transferred to Stata using Stat/Transfer. Note that SAS is the recommended program for extracting the cohorts and manipulating the data. Using SAS has the benefit of allowing your code to be easily extensible to a larger extract because SAS does not have the same memory constraints as Stata.
The extraction programs use SAS scripts, corresponding to the MedPAR files
(based on the Inpatient/SNF claims) and the Carrier, Outpatient, and denominator
claims. First the onset of a disease is identified using the programs found in
the folder
1A_flag_candidate_evnt
.
Users have the option of using claims from Inpatient, Medpar, Outpatient
and/or carrier files to define candidate index events. These disease events are
indexed and used to pull claims from the MedPAR, Carrier and Outpatient files
after a series of enrollment restrictions are place on the beneficiaries using
the denominator file. Next a series of scripts provide information on
- comorbidities
- costs
- procedures
- cost utilization measures.
In order to preserve the integrity of the SAS data files, the NBER has split the claims data into smaller compressed file parts per year. The SAS programs decompress one part at a time, manipulates it, saves the relevant observations and then deletes the original. While this type of iterative procedure saves on system resources, it can make the code difficult to follow because the program consists of a series of iterative macros defined to process one part at a time and finally combine them in a final step. Note that pre-2006 MedPAR files have not been chunked because of their relative size, where as the rest of the claims have been. Beginning in 2006 the MedPAR files will also be chunked.
In the case of a program crash or a partial run, the program will need to be run again. Rerunning the program in turn will cause SAS to automatically overwrite the log file, which has important information about how the data set was constructed. Care should be taken when debugging the program to make sure log files of interest are not overwritten.
Even if you successfully run the program, and then decide to run a separate extraction program SAS will overwrite the log file in the directory you are working in. For this reason, the programs automatically create a new folder and copies the log and lst file to the folder at the end of each run after tagging them with the date, years run and cohort of interest. Note that if you are running several cohorts the log file is simply copied at the end of the cohort loop and in some cases the full log file information — including total time run — will only be present on the log file of the last cohort.
Flagging candidate index events¶
Folder: 1A_flag_candidate_evnt
Before running any programs a cohort needs to be defined. This means defining the combination of HCPCS and ICD9-CM codes used to define a specific variable. At the same time the choice of HCPCS or ICD9-CM codes will depend on which claims files are used to pull potential candidate index events from. In many cases individuals choose to uses the primary diagnostic code from an inpatient stay, which can be found either on the Inpatient claims or the MedPAR file.
For example, limiting the sample to AMIs hospitalized will ensure that the AMIs being studied demark a clean start to a new episode of care, since nearly all AMIs are hospitalized soon after onset (unless the victim dies).
The first step in the AMI example is to generate a sample of candidate index
events, after which denominator information can be merged in with to enforce
enrollment restriction of interest. Therefore for this example only the the
medpar_candidates.sas
will be run. Since this program only calls the
icd9_chrt_definition.sas
codes, code has been included there which flags a potential AMI admission if it
meets two criteria. The restriction to the principal diagnosis (first ICD9-CM
code) is done in the
medpar_candidates.sas
program.
- The first ICD-9-CM diagnosis code on the MedPAR record corresponds to an initial episode of care for an AMI. The first diagnosis code on the record reports the primary reason for hospitalization. Since an AMI will almost always be the primary reason for hospitalization when present, AMI admissions can be readily identified using this field. ICD-9-CM diagnosis codes for the initial episode of AMI care are those with the first three digits 410 and fifth digit (if present) equal to 0 or 1. However it is important to note that the code allows you to define an encounter on principal plus any number of secondary diagnoses for exploratory purposes. If secondary diagnosis codes are used to identify the index event, in addition to the principal diagnosis the parameter
&dgnPos.
in the wrapper macro call simply needs to be changed. - The admission occurred at a hospital, not a skilled nursing facility. This restriction is imposed because it is unlikely that an AMI would result in an initial admission to a skilled nursing facility, rather than an acute care hospital. These AMIs are flagged and dropped.
Note that the following figure describes the role of the dgnPos
and prcPos
parameters in the wrapper programs. Note that carrier candidate file only allows
users to define drgPos
because the HCPCS codes happen at the line items level
and there is no meaningful ranking to procedures performed on the same day.
/* dgnPos=1 — informs the programs to only flag diagnostic codes in the first diagnostic position (most important). A 2 would have the program look into the first 2 procedure diagnostic codes */ /* prcPos=0 — informs the programs that no procedure code are being used to define cohort. Turns off the use of procedure codes. */ %TagcandidateIndexEvents( cohrt1 =ami, cohrtN =1, dgnPos =1, prcPos =0, pct =05, yearin =1999, yearout=2009, dflag =1, clmDTA =0);
Code: medpar_candidates.sas
wrapper to only flag the principal diagnosis
Each candidate program generates a data set at the stay, claim or line item depending on which files you are working with.
ami05_d1p0_medparcandidate
: stay level candidates.ami05_d1p0_opcandidate
: claims level candidates.ami05_d1_carcandidate
: line item level candidates.ami05_d1p0_ipcandidate
: claims level candidates.
Defining the final set of candidate index events¶
Program: 1B_impose_restrictions.sas
The first section describes the basic steps involved in flagging candidate index
events for an AMI. However not all candidates are usable especially if our
interest is in tracking costs over time. The issue is that since, Medicare
Advantage costs are not included in the claims files, patients switching in and
out of FFS will have inaccurate costs. For that reason the following step
performed by the
1B_impose_restrictions.sas
programs merges in denominator information and restricts the sample the final
set of index events.
Three data files are used to construct this part of the extract:
- Denominator file: This is an annual file that includes a record for each individual enrolled in Medicare Part A or Medicare Part B during the year. It provides demographic information, including sex, race, date of birth, and date of death (if any) and monthly data on the individual's enrollment status, including whether the individuals' benefits were provided through a private plan.
- Candidate index event file: This file is generated from the previous step. This step merges the data together and enforces a priority ranking so that index events happening on the same day are assigned to the claims file of interest.
Before any restrictions are place on the data, the program first merges all of
the candidate index events, since it is now possible to use several files. The
wrapper program for defining this has many parameters. The AMA example is given
in the following figure. Note that first off since only the MedPAR file is used
the the med= parameter is set to 1
with the rest set to 0
. Also the
med_dgnPos=
and med_prcPos=
parameters need to match that found in the
medpar_candidates.sas
program.
%CombinecandidateIndxEvents( cohrt = ami, pct = 05, end = 2009, ip = 0, ip_dgnPos = 0, ip_prcPos = 0, med = 1, med_dgnPos = 1, med_prcPos = 0, car = 0, car_dgnPos = 0, op = 0, op_dgnPos = 0, op_prcPos = 0);
Code: medpar_candidates.sas
wrapper to only flag the principal diagnosis
The approach used to defining the sample roughly follows Udvarhelyi et al. (1992), McClellan (1993), McClellan et al. (1994), and McClellan and Newhouse (1997)1. A major restriction is based on enforcing continuous non-HMO enrollment which accounts for a large percent of candidate index events dropping form the sample. The second most important restriction involves allowing the same individual to have more then one index event in the final index set, as long as the candidate events happened at least one year apart. Correspondingly, the procedure includes an AMI in the final sample if the following criteria are met 2:
- There is no evidence of another AMI admission in the last year. The assumption very important because it defines the amount of time which needs pass between events in order for them to be classified as separate or related. In addition, it also deals with the issue of transfers to some effect. Control via the
subsquentDgnDayRest
inX
days. - The individual is continuously enrolled in Parts A and B of traditional Medicare (rather than a Medicare private plan) during the twelve months before AMI and until twelve months after AMI, or until death, whichever comes first. Requiring continuous enrollment before AMI ensures that any AMI occurring during the prior year would indeed appear in the MedPAR file. Requiring continuous enrollment after AMI ensures that the claims files cover all post-AMI service utilization. This continuous enrollment requirement also implies that the earliest AMI included in the data set will be plus 1 year from the first year of the analysis. The last point will correspond to minus 1 year from the last year of data used in the analysis. Finally, note that excluding individuals who were enrolled in a Medicare private plan before or after AMI may not be innocuous. As noted above, the share of Medicare enrollees enrolled in traditional Medicare changes meaningfully over time and selection into private plans is thought to be non-random. While there is probably not a good alternative to excluding these individuals, this feature of the extract should be kept in mind during analysis. Control via the
contHMOmonth
andcontFFSmonth
parameters. Note that it enforces aX
month window on each side of the index event. - The individual is at least 66 years of age at the time of the AMI. This restriction excludes individuals who are either: 1) currently eligible for Medicare on the basis of disability; or 2) meet the continuous Medicare enrollment requirement only because they were eligible for Medicare on the basis of disability before turning 65. This ensures that the resulting sample is basically representative of the old-age Medicare population (except for 65 year olds). Hard coded.
The following code shows that in addition to defining the cohort with some combination of procedure or diagnostic codes, years of interest and percent of sample, the code also allows for users to tweak the first continuous enrollment restrictions and the days which need to pass before index events before a new index event for a patient can be defined.
%ImposeSampleRest(cohrt=ami,pct=05,start=1999,end=2009, subsquentDgnDayRest=365, contHMOmonthpre=12,contHMOmonthpost=12, contFFSmonthpre=12,contFFSmonthpost=12);
Code: medpar_candidates.sas
wrapper to only flag the principal diagnosis
The only complication in obtaining this demographic and enrollment data is that
the names and formats of variables in the denominator file change somewhat over
time and differ from those in the enrollment file. The file
ConformDenom.sas
contains the macro used to convert the denominator files into a common format.
The differences between the format of the denominator files are handled directly
in
CreatePopDenominatorFile.sas
.
Running this program once for the entire population is recommended for users
constructing a Medicare repository.
The exact portion of the code which enforces these restrictions in the code can
be found in the following figure. Note that the data set with these flags
created but without dropping any variables is available in the data set created
by the 1B_ImposeDenominatorRestrictions.sas
program. The data set is stored in
the index extract folder and is labeled &cohrt.&pct._candidateflag
(i.e.
ami05_candidateflag
).
* Implement criteria for including diags like those in * McClellan and Newhouse and Udvarhelyi, et al.; * 1) Must be no other diag admission in the last year * 2) Must have traditional Medicare enrollment for a X-month (X+1+X) window around index event; * 3) Must be 66 at the time of the diag; * 4) Has a date of death on or after the diag date; * 5) Has a valid date of birth; if (not diagLastYear and contEnrol_post and contEnrol_pre and ageAtdiag >= 66 and not dobMissing and not badDOD) then do; output out.&diag.&pct._indx; end;
Code: Sample LongitudinalExtract
code: Defining the AMI cohort
The
1B_impose_restrictions.sas
program creates the following files, assuming a 5 percent AMI was run using
bene_id
s and the principal diagnosis only:
ami05_candidateflag
: Data set is unique at the index event level (diag_id
) and includes relevant information for the stay associated with the defining index event. The file is created by imposing all of the restrictions on the on the data (unlike thecoref
file which simply flags the restrictions).ami05_indx
:Data set is unique at the index event level (diag_id
). A file containing all of the potential index event candidates observations in addition to the relevant variables used to further restrict the data set. The data set is useful for better understanding how different restrictions impact the final sample. Note the observations in this data set are the same as if all MedPAR (non SNF) observations with a principal AMI ICD9-CM codes are pulled from the data set.ami05_demo
: Data set is unique at the individual level (bene_id
). Contains information on enrollment and demographic information. Note includes information on all individuals in the final index event data set plus all individuals qualifying as an index event (observations in thecoref
data set).ami05_indxdemo
: Data set is unique at the index event level (diag_id
). The data set merges relevant demographic information to the index even data set and formats that data further.
The macros have been arranged in a "wrapper" program, called %RunPrgram
, which
provides the correct sequencing and allows the macro to run several different
types cohorts within a single run.
After running the %RunPrgram
macro, a report writing program contained in a
separate SAS file, %CreateReports
can be run which provides several reports,
see table below. Just make sure to specify the location where the Excel reports
should be written to.
Filename | Description of Report |
---|---|
&cohrt._flagcount..xls |
Tabulates number of encounters by year for each restriction flag defined (i.e. number of individuals under 65) |
Table: Description of report created by %CreateReports
macro
Pulling harmonized claims around the index events 1C_pull_claims¶
The programs found in the
1C_pull_claims
folder each use the final set of index events defined previously to pull a
window of claims around the index event. The is basically the second pass over
the claims, the first being used to define the index events. The length of the
window can be set by the researcher.
Extracting harmonized MedPAR claims¶
After the AMI index events are defined, the code makes another pass over the MedPAR files in order to obtain records for all hospital and skilled-nursing facility visits in the period around each AMI. In general, the code selects all visits that start at most one year before AMI and have discharge dates at most one year after AMI3. However, note that the user can control this via the parameter found in the wrapper macro.
Some SNF claims indicate that the individual was still a patient at the time the claim was finalized and, correspondingly, have missing discharge dates. For these records, the record is included if the admission date and the length of stay field indicate that the services on the claim occurred within one year of AMI4.
Finally, the extract includes any record with an admission date falling on the
date of AMI. As part of this selection process, each claims record is tagged
with the diag_id
for the "AMI episode" of which it is a part.
Note that, as with files containing enrollment information, the MedPAR variable
names and formats change significantly from year to year. The file
ConformMedPAR.sas
contains the macro used to convert the MedPAR files to a common format.
After all of the raw claims files have been processed, the MedPAR records are put "in order" according to admission and discharge dates. When dates do not permit an unambiguous ordering of the records (e.g. when an individual is admitted to and discharged from multiple hospitals on the same day), the script uses other information on the record, notably admission and discharge status codes, to determine the order in which the various hospital admissions occurred.
The script uses Stat/Transfer to export the files to Stata, however the additional code is included in the Auxiliary folder which uses SAS XPT files. The preferred method is StatTransfer as it makes use of less disk space and avoids an extra Stata import step. See Step 6 in Appendix B.
ami05_medparclms
: Data set us unique at the MedPAR claim/stay level. The data set pulls claims associated with the index events in the fileami05_1_indx_mn_bene_id
. Claims one year after and one year before the admission date of the index event are found in this file.
Extracting harmonized Carrier/Physician line items¶
For many projects, including the current one, it is useful to also obtain claims records for services performed by physicians and other non-institutional providers like ambulance operators and laboratories.
Each year's Carrier file provides detailed information on each service provided, including the type of service (indicated by HCPCS codes), the date of provision, the amount of payment, and the identity of the provider. Different years' Carrier files structure this information in different ways.
For years 1999 and later, however, the Carrier file begins grouping individual services, or "line items," together into "claims." Typically, a claim consists of a group of services provided by the same physician (or other provider) during a single episode of care. How the files present claim and line item information differs by year.
For 1999-2001, each Carrier record continues to correspond to a single line item, but each record also contains all claim-level information for the claim of which it is a part. Most fields on the Carrier record appear only at the claim level (e.g. patient's zip code) or only at the line item level (e.g. procedure codes). But some fields appear at both the claim level and the line item level, which can make these years' files treacherous to work with. Most importantly, each record typically includes a field containing the amount paid for the current line item and another field containing the amount paid for all line items in the current claim. Extracting the claim-level payment field rather than the line item-level payment field would lead to dramatic errors.
Starting with the 2002 Carrier file, CMS splits the Carrier file into two sub files: a "claims" file, which contains one record for each claim and contains all claim-level information; and a "line items" file, which contains one record for each line item and contains the line-item level information as well as unique identifiers that facilitate matching to the claims file. While the split means that processing for these years typically requires a merge, it does make it much clearer which fields are claim-level fields and which are line item fields.
The script used to process the Carrier file is called
carrier_clm_window.sas
.
It first loads the index events ( individual identifiers and diagnosis dates)
created and saved by the
1B_impose_restrictions.sas
.
Next, it uses that information to extract all carrier line items (i.e.
individual services) with a start date at most one year before the date of AMI
and an end date less than one year after AMI5. The code excludes all line
items with a missing start or end date, which appears to affect only about one
in every million claims. In the process of extracting the line items, it tags
each one with the diag_id
for the "AMI episode" of which it is a part.
Finally, the script writes all records to a SAS and Stata file.
carrier_clm_window.sas
must handle significant differences in variable names and formats across Carrier
file years. The SAS macro that handles these differences is contained in the
file
ConformCarrier.sas
.
As a final note, keep in mind that, because the Carrier files are extremely
large, the code described above can take more than a day to run. There are
significant gains to running a program which extracts several diagnosis at once
because the file only needs to be unzipped once. The program allows individuals
to run several cohorts by adding in different parameters. Furthermore extending
the code to other diagnosis cohorts, is straightforward and needs to be
coordinated with the
1cohort_extraction
family of scripts. Note that if you do run a multi-disease cohort run, the
Carrier program needs a lot of space dedicated to it. Of course these files are
temporary and could be deleted afterwards, however one must be careful before
firing of the runs. Additionally, due to amount of time the script takes to run
and amount of system resources needed it is possible for the program to crash.
Hence the code allows the user to start from any date so that, if the code stops
unexpectedly, it need not be re-run from the beginning. See the crashStart
parameter.
The
carrier_clm_window.sas
program creates the following files, assuming a 5 percent AMI extract was run
using bene_id
s and the principal diagnosis only
ami05_carlnits
: Data set us unique at the carrier line item level - each index event can have several associated claims, a single claim can have several associated line items. The data set pulls line items associated with the index events in the fileami05_indx
. Line items one year after and one year before the admission date of the index event are found in this file. Note that collapsing to the claim level requires collapsing byfileyear
andclm_id
.
Extracting harmonized outpatient claims and revenue centers¶
Outpatient claim records provide summary information, including the
beneficiary's EHIC or beneficiary ID (bene_id
), the total amount of
reimbursement, relevant ICD-9-CM diagnosis codes, and the start and end dates of
the claim. Each claim also includes a series of "revenue center" records, which
provide detailed information on each service provided as part of the claim,
including the revenue center (i.e. division) of the institution that provided
the service, the amount reimbursed for the service.
For years prior to 2002, revenue center data resided directly on the main claim
record. In order that each claim could report dozens of services, each claims
record contains dozens of copies of each revenue center field, with the result
that some years' files contain upwards of a thousand fields. Since most
outpatient claims contain only a handful of individual services, the bulk of
these revenue center fields are typically empty. Starting in 2002, CMS began
splitting the outpatient file into a "claims" file, which contains the summary
information for each claim, and a "revenue center" file, which contains one
record for each revenue center. Records in the revenue center file contain a
unique identifier that facilitates matching the revenue center record to the
corresponding claims record. In many instances a revenue center code of 0000
is output which corresponds to an empty record. As the code currently exists,
these are kept for consistency, however at a later time we may simply choose to
get rid of these empty, observations.
The script
op_clm_window.sas
processes the outpatient files. The script first loads the list of individual
identifiers and AMI dates from the file saved by the 1B sas program. Like the
other files, it then selects all claims with start dates at most one year before
the AMI date and end dates at most one year after the AMI date. The code drops
all claims with a missing start or end date, but this restriction appears to
affect virtually no outpatient records. At this point, the script tags all
selected claims with the diag_id
for the "AMI episode" of which the claim is a
part.
Next, for those years where the revenue center information resides directly on
the claims records (i.e. years before 2002), the code takes the revenue center
information off of the claims record and instead saves it in a separate revenue
center file where each revenue center has its own record. That is, the code
reshapes all years' data so that they match the basic format of the post-2002
files. These two files can be merged using the fileyear
and claimindex
fields as a (compound) unique identifier. The amount the provider charged is
taken from the revenue centers and saved in the extract created. In cases where
a revenue center code for the total charge exists(0001
), the associated charge
is simply added to those found on the claims files. In the case where a 0001
code does not exist, the charge for each revenue center code are summed
together. On the other hand, the total amount which Medicare paid for each
service is located in a variable which includes the amount reimbursed for all
services on a claim (including any revenue center ). These total charge and
payment amounts are kept on the extract created.
The extracts do not create costs/reimbursement variables from the revenue center levels despite the fact that this more granular level costs are available. Again, the costs found in the claims files due however account for all costs found in the revenue center level.
There are several year-specific complications that arise in generating these
files. First, for 1999-2001, a single claim can stretch across multiple records
if the total number of revenue center records in the claim exceeds the number of
available revenue center fields in a single record. Thus, throughout, the code
makes special provision for handling these records6. Finally, as with the
other files, variable names and formats change from year to year. The SAS macro
contained in
ConformOp.sas
handles these differences.
As with the Carrier file, processing the outpatient files takes a significant
amount of time. For this reason, the outpatient code, like the Carrier code,
saves data at various intermediate stages. These intermediate save points often
allow the script to be started mid-way through after a crash. The code is able
to start at any midyear crash point, by simply adding the year to restart the
run to the parameter crashStart
. That will cause the program to restart the
year you indicate in crashStart
and then merge all of the results together
from the original start and end years ( identified by the yearin and yearout
parameters).
The op_clm_window.sas program creates the following files, assuming a 5 percent
AMI extract was run using bene_id
s and the principal diagnosis only:
ami05_opclms
: Data set is unique at the outpatient claim level. The data set pulls OP claims associated with the index events in the fileami05_indx
. Claims one year after and one year before the admission date of the index event are found in this file.ami05_oprevcntrs
: Data set us unique at revenue center level — each index event can have several claims, a single claim can have several different revenue centers related to it. The data set pulls revenue centers associated outpatient claims from the above data set (opcmls
).
-
The code also includes an approach based upon Skinner, Staiger, and Fisher (2005), which aims to construct a data set of individuals having first AMIs. ↩
-
Two additional data-quality requirements are also imposed: 1) The individual's death date (if listed) must be after the date of the AMI admission; and 2) The individual must have a valid date of birth. These restrictions affect a very small number of individuals. ↩
-
The date of discharge rather than date of admission is used to define the post-AMI period in order to ensure that AMIs that occur in the file year of the data are treated symmetrically to those in earlier years. In particular, suppose that the end of the post-AMI period was instead determined by the date of admission and consider an individual who experiences an AMI in late 2004 who is re-admitted to the hospital in late 2005. If this hospitalization were to last into 2006, it would reported in the 2006 claims file and, thus, fall outside of the set of hospitalizations observed here. ↩
-
In addition, for these records, the final extract sets the discharge date field equal to the admission date plus the length of stay (minus 1 day) since it is useful to be able to treat the discharge date as a "claim end" date rather than a true discharge date. ↩
-
The start and end dates of carrier line items are often equal or within just a couple of days of each other. Thus, it makes relatively little difference precisely which date is used. ↩
-
Keeping together claims that stretch across multiple records requires a unique claim identifier. The ResDAC website suggests that the fields
daily_dt
andlink_num
together form a unique claim identifier in the 1999-2001 outpatient records. Examining the data file suggests, however, that this is not actually the case. In place of these fields, the code uses the “claim control number” as a unique identifier. This field uniquely identifies claims, but also takes on a common value for all segments of the same claim. ↩