The 2006 Earnings Public-Use Microdata File: An Introduction
Social Security Bulletin, Vol. 71, No. 4, 2011
This article introduces the 2006 Earnings Public-Use File (EPUF) and provides important background information on the file's data fields. The EPUF contains selected demographic and earnings information for 4.3 million individuals drawn from a 1-percent sample of all Social Security numbers issued before January 2007. The data file provides aggregate earnings for 1937 to 1950 and annual earnings data for 1951 to 2006. The article focuses on four key items: (1) the Social Security Administration's experiences collecting earnings data over the years and their effect on the data fields included in EPUF; (2) the steps taken to "clean" the underlying administrative data and to minimize the risk of personal data disclosure; (3) the potential limitations of using EPUF data to estimate Social Security benefits for some individuals; and (4) frequency distributions and statistical tabulations of the data in the file, to provide a point of reference for EPUF users.
Michael Compson is with the Division of Policy Evaluation, Office of Research, Evaluation, and Statistics, Office of Retirement and Disability Policy, Social Security Administration.
Acknowledgments: The author gratefully acknowledges the assistance of many individuals in the process of creating the 2006 Earnings Public-Use File and this article: John Hennessey, for graciously sharing his programming and methodological expertise; Russell Hudson, for his programming expertise and sharing his vast knowledge of the earnings data; Paul Davies, for his guidance and support throughout the project; Scott Muller, Greg Diez, and Bill Piet, for sharing programmatic and earnings knowledge; Sirisha Anne, Brenda South, Stu Friedrich, and Randall Miles, for their assistance in providing the data extracts used in the process of creating EPUF; Bill Davis and Justin Ronca, for their statistical expertise; and Susan Grad, Howard Iams, Hilary Waldron, and Anya Olsen, for their comments on previous drafts of the article.
The findings and conclusions presented in the Bulletin are those of the authors and do not necessarily represent the views of the Social Security Administration.
|BEPUF||Benefits and Earnings Public-Use File|
|EPUF||Earnings Public-Use File|
|IRS||Internal Revenue Service|
|MEF||Master Earnings File|
|QC||quarter of coverage|
|SSA||Social Security Administration|
|SSN||Social Security number|
|YOB||year of birth|
This article introduces the 2006 Earnings Public-Use File (EPUF), a data file containing earnings records for individuals drawn from a 1-percent sample of all Social Security numbers (SSNs) issued before January 2007. EPUF is the latest public-use data file released by the Social Security Administration (SSA) to contain earnings data from its administrative files. EPUF comprises a much larger sample than previously released public-use files containing earnings histories, and significantly enhances the ability of researchers and policy analysts to analyze SSA programs.
EPUF consists of two linkable files. One contains selected demographic and aggregate earnings information for all 4,348,254 individuals in the file, and the second contains annual earnings records for the 3,131,424 individuals who had positive earnings in at least 1 year during 1951–2006. EPUF data reflect capped Social Security taxable earnings. As such, the earnings data contained in EPUF do not present complete measures of the number of workers or the amount of wage-and-salary and self-employment income in the US economy.
The data fields included in EPUF are nearly identical to those in SSA's most recent public-use file containing administrative earnings, the 2004 Benefits and Earnings Public-Use File (BEPUF). This was done (1) to address the critical need to meet data disclosure standards, (2) because of the complexity of the earnings data that SSA has collected over the life of the program, and (3) to maximize EPUF's timeliness. SSA plans to continue working on data disclosure standards for several key detailed earnings data fields from its administrative files. Combining this work with direct feedback from EPUF users, SSA hopes to include new data fields in future releases.
This article informs potential users about the EPUF and provides background information about the data contained in the file. Specifically, the article discusses SSA's experiences collecting earnings data over the years and the effect of those experiences on the data fields included in EPUF; the steps taken to "clean" the data and to minimize the risk of personal data disclosure; and the potential limitations of using the data to estimate benefits for some individuals. Finally, the article presents frequency distributions and statistical tabulations of the data to provide points of reference for EPUF users.
Developing the Earnings Public-Use File
In 2006, SSA released BEPUF, a data file based on a systematic random 1-percent sample of all individuals who were receiving Social Security benefits in December 2004. The file contains benefit and earnings information for the 473,366 individuals in the sample. SSA and Internal Revenue Service (IRS) Data Review Boards reviewed the file to assess the risk of personal data disclosure before approving its release to the public.
The critical question in the initial EPUF development phase involved which data fields to include in the file. Users would undoubtedly like SSA to include all of the data fields from its administrative files. However, SSA has a legal obligation to protect the confidentiality of the individuals included in the file. This creates a tradeoff between the user's need for complete and accurate data and the need to ensure that the file's data fields do not disclose individual identities. Because BEPUF met the disclosure standards set by SSA and the IRS, its data fields served as a starting point for selecting fields for EPUF.
A second critical issue was the need to balance the desire to add data fields with the time needed to prepare the underlying data and conduct the required data-disclosure analysis. SSA originally hoped to include earnings data fields beyond those included in BEPUF. However, choosing fields to add to the file was complicated by more than data-disclosure limitations. Reconciling the types of earnings data in SSA's administrative files with the different data-collection timelines over the life of the program made seemingly simple choices fairly complicated.
Including new data fields would be much more complex because the additional fields would come from the detailed segment of the Master Earnings File (MEF).1 For each individual, the detailed segment is likely to contain more than one earnings record in a given year. As a result, working with the detailed segment takes considerably more time and effort than working with data fields from the summary segment of the MEF, as was done for BEPUF.
In addition, the only earnings data field that is available for all years from 1951 through 2006 is taxable earnings. Other fields of interest, such as noncovered earnings, covered earnings above the taxable maximum, and contributions to 401(k) retirement plans, are only available for selected years.2 Consider self-employment income: From 1951 through 1977, self-employment income is included in the earnings data field only to the extent that it is covered under the Social Security program. If an individual had wage-and-salary earnings above the taxable maximum and also had self-employment income, none of the self-employment income would be included in the earnings record. This produces undercounts of both the number of individuals with self-employment income and the dollar amount of that income. From 1978 through 1993, the detailed segment of the MEF contains a separate value for covered self-employment income. However, the amount reported in this field is still limited to earnings covered under the program. The full amount of self-employment income does not appear in the MEF until 1994, when the cap for covered earnings subject to the Medicare Hospital Insurance payroll tax was eliminated. As a result, the administrative files do not contain a complete history of an individual's self-employment income.
After accounting for all of these considerations, SSA designed EPUF to contain nine data fields in two linkable data tables. The first linkable file contains a single record for each of the 4,348,254 individuals included in EPUF. Each record contains the following data fields:
- ID (a unique identification number)
- year of birth (YOB)
- aggregate capped Social Security taxable earnings from 1937 through 1950
- aggregate quarters of coverage (QCs) earned from 1937 through 1950
- aggregate QCs earned in 1951 and 1952
The second linkable file contains 60,326,474 earnings records with positive earnings values. There are 3,131,424 individuals in this file who had positive earnings for at least 1 year during 1951–2006. Each of the records in this file contains the following data fields:
- ID (a unique identification number)
- the year(s) when the individual had taxable Social Security earnings
- the amount of capped Social Security taxable earnings for each of those years
- the number of QCs earned for each year (except 1951 and 1952) based on the amount of capped Social Security taxable earnings
These data fields are identical to those included in the BEPUF with one minor exception. EPUF contains multiple data fields for the QCs: aggregate QCs earned 1937–1950 and aggregate QCs earned in 1951 and 1952 in the first linkable file; and annual QCs earned from 1953 through 2006 in the second linkable file. By contrast, the BEPUF contains a single aggregate value for QCs earned as of December 31, 2004. Because of this difference, an EPUF user can determine an individual's eligibility for retired-worker and disabled-worker benefits at any given time.
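The way the QC fields in the two linkable files combine can be sketched in a few lines of Python. This is only an illustration: the field names (`id`, `yob`, `qc_1937_1950`, `qc_1951_1952`, `year`, `earnings`, `qcs`) and the sample records are hypothetical and are not the actual EPUF variable names or data.

```python
def cumulative_qcs(person, annual_records, through_year):
    """Total QCs credited to one individual through the end of `through_year`."""
    if through_year < 1952:
        raise ValueError("only aggregate QC values exist before 1953")
    # Aggregate fields from the first linkable file.
    total = person["qc_1937_1950"] + person["qc_1951_1952"]
    # Annual fields from the second linkable file, matched on ID.
    total += sum(
        rec["qcs"]
        for rec in annual_records
        if rec["id"] == person["id"] and 1953 <= rec["year"] <= through_year
    )
    return total


# Hypothetical records for illustration only.
person = {"id": 17, "yob": 1935, "qc_1937_1950": 0, "qc_1951_1952": 2}
annual = [
    {"id": 17, "year": 1953, "earnings": 1200, "qcs": 4},
    {"id": 17, "year": 1954, "earnings": 300, "qcs": 2},
    {"id": 18, "year": 1954, "earnings": 900, "qcs": 4},
]
print(cumulative_qcs(person, annual, 1954))  # 8
```

A user would compare the resulting total with the QC requirement that applies to the benefit and date of interest; that requirement is not part of the file.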
Overview of Earnings Records
SSA's primary objective in collecting earnings data is to meet the operational needs of the program.3 As a result, the data contained in EPUF are limited in some respects from a researcher's perspective. However, the uniqueness of the data and the large sample size should outweigh these limitations in many cases.
To use EPUF appropriately, users must understand the nature of its earnings data. For example, analysts must be aware that the earnings data in EPUF do not reflect all workers in the US labor market, nor the aggregate earnings generated by those workers.4 Putting the EPUF earnings data in their proper context requires an understanding of three measures of earnings distinct to the Social Security program: covered earnings, Social Security taxable earnings, and capped Social Security taxable earnings.
The first measure refers to earnings "covered" for purposes of determining eligibility for the Social Security program. The Social Security Act defines the types of employment covered under the program, and coverage has expanded significantly over the years.5 Currently, nearly all types of employment are covered under Social Security. There are three primary exceptions: "state and local government employees whose employer has not elected to be covered under Social Security and who are participating in an employer-provided pension plan, current Federal civilian workers hired before 1984 who have not elected to be covered, and self-employed workers earning less than $400 in a calendar year" (Board of Trustees, 2010). "Covered earnings" has two components: wage-and-salary earnings from covered employment, and self-employment income covered under the program.
The second measure is called Social Security taxable earnings because it reflects all covered earnings that are subject to the payroll tax.6 The annual earnings data in the MEF summary segment are a running total of an individual's taxable earnings up to the taxable maximum for each job in a given year, plus any taxable self-employment income. For the self-employed, "taxable earnings consists of net self-employment income which, when combined with any taxable wages for that individual, is at or below any applicable annual maximum taxable amount" (SSA 2009, G.17). If an individual has more than one employer, the amount of earnings in this data field may be greater than the taxable maximum in a given year.
EPUF uses the third measure, capped Social Security taxable earnings, defined as the total amount of a worker's taxable earnings (including any taxable self-employment income) up to the taxable maximum in a given year. It does not include any earnings beyond the taxable maximum, as the previous measure can when a worker has multiple employers. This measure allows an observer to determine total amounts contributed to the program by workers and self-employed individuals.7 The primary reason EPUF uses this measure is that capped taxable earnings do not need to be top-coded for data disclosure purposes. Second, because the IRS and SSA approved BEPUF for release using capped taxable earnings, using the same measure in EPUF was deemed likely to expedite its approval.
Two adjustments were made in moving the taxable earnings data from the MEF summary segment to the capped taxable earnings information contained in EPUF. First, all earnings values were top-coded at the taxable maximum in a given year. Second, any records with negative covered earnings were set to zero (this occurred very infrequently).
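The two adjustments can be illustrated with a short Python sketch. The function name and the dollar amounts are hypothetical; the example also shows how a worker with two employers can have a summary-segment value above the taxable maximum before capping.

```python
def capped_taxable_earnings(summary_amount, taxable_max):
    """Set negative values to zero, then top-code at the taxable maximum."""
    return min(max(summary_amount, 0), taxable_max)


# Illustrative values: two jobs paying $30,000 each, taxable maximum of $40,000.
taxable_max = 40_000
summary_amount = min(30_000, taxable_max) + min(30_000, taxable_max)
print(summary_amount)                                        # 60000
print(capped_taxable_earnings(summary_amount, taxable_max))  # 40000
print(capped_taxable_earnings(-150, taxable_max))            # 0
```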
Through 2006, SSA used three distinct mechanisms to collect the earnings data required to administer its programs: (1) paper and microfilm records that yield an individual's total covered earnings from 1937 through 1950, (2) quarterly earnings data reported by the individual's employer from 1951 through 1977, and (3) annual earnings reported by the individual's employer on Form W-2 from 1978 through 2006 (Chart 1).
Chart 1. Types of earnings data available from Social Security administrative files, 1937–2006
|Time period||Type of earnings data available|
|1937–1950||Aggregate covered earnings|
|1951–1977||Data reported quarterly by employer|
|1978–2006||Data reported annually by employer on Form W-2|
|1990–2006||Aggregate deferred compensation|
|1991–2006||Taxable Medicare earnings above the Social Security taxable maximum|
|2004–2006||Detailed deferred compensation|
In the years since the adoption of Form W-2, three additional types of earnings data have been collected to reflect expanded data needs: (1) aggregate deferred compensation, used to calculate the national average wage index, beginning in 1990; (2) Medicare taxable wage-and-salary and self-employment income, beginning in 1991; and (3) detailed items for the deferred compensation field, beginning in 2004.8 These changes are also reflected in Chart 1.
1937–1950 Earnings Data
Before the arrival of electronic data storage, SSA stored earnings data on either paper or microfilm. Given the limited storage capacity of early computers and the prohibitive costs associated with converting these data to electronic format, the earnings data for 1937–1950 on the MEF summary segment are available only as an aggregate number. As a result, the data extract from which EPUF is drawn contains two data fields for aggregate taxable earnings—one for 1937–2006, and the other for 1951–2006. The EPUF data field for aggregate Social Security taxable earnings from 1937–1950 was generated by subtracting the 1951–2006 aggregate earnings from the 1937–2006 aggregate earnings.
Another data field of interest is the QCs earned during this period. An individual can earn up to four QCs in a year depending on his or her taxable earnings amount. QCs determine an individual's eligibility for retirement and disability benefits and a family's eligibility for survivor benefits. The MEF summary segment contains no annual values for QCs for 1937–1952. Instead, the extract contains data fields from the MEF that contain the "known" aggregate number of QCs earned during the following periods: 1947–2006, 1951–2006, 1947–1952, and 1953–2006. For EPUF, these data fields are manipulated to generate the aggregate number of QCs earned for the periods 1947–1950 and 1951–1952.
Because the MEF has no known values for QCs from 1937 through 1946, SSA devised a three-step method to estimate the aggregate number of QCs earned by individuals with covered earnings during these years.9 The first step assigns one QC for each $400 of aggregate taxable earnings from 1937 through 1950. The second step subtracts the known sum of QCs earned from 1947 through 1950. (The QCs from 1947 through 1950 are generated by subtracting the known number of QCs earned from 1951 through 2006 from the known number of QCs earned from 1947 through 2006.) If the resulting number is positive, this value is assigned to the number of QCs earned from 1937 to 1946. If this number is negative, a value of 0 is assigned for the number of QCs earned from 1937 to 1946. The final step adds the estimated QCs from 1937 to 1946 to the known QCs from 1947 to 1950 for the estimated number of QCs earned from 1937 to 1950.10
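The three steps can be written out as a small Python function. This is a sketch under stated assumptions: the function and argument names are hypothetical, the dollar amount per QC from the first step is passed in as `dollars_per_qc` rather than hard-coded, and partial QCs are assumed to be dropped (whole-number division).

```python
def estimate_qcs_1937_1950(earnings_1937_1950, qcs_1947_2006,
                           qcs_1951_2006, dollars_per_qc):
    """Estimated aggregate QCs for 1937-1950 from the known MEF aggregates."""
    # Known QCs for 1947-1950: difference of the two known aggregates.
    known_1947_1950 = qcs_1947_2006 - qcs_1951_2006
    # Step 1: one QC per `dollars_per_qc` of 1937-1950 aggregate earnings
    # (assumption: fractions of a QC are discarded).
    from_earnings = int(earnings_1937_1950 // dollars_per_qc)
    # Step 2: subtract the known 1947-1950 QCs; negative results become 0.
    estimated_1937_1946 = max(from_earnings - known_1947_1950, 0)
    # Step 3: add the estimate for 1937-1946 to the known 1947-1950 QCs.
    return estimated_1937_1946 + known_1947_1950
```

When the earnings-based count is smaller than the known 1947–1950 count, the function simply returns the known count, which mirrors the zero floor in the second step.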
1951–1977 Earnings Data
From 1951 through 1977, the earnings data used to administer Social Security came from two sources: the individual's employer and the IRS. SSA required employers to report covered wage-and-salary income quarterly. For the self-employed, the IRS processed the annual Social Security taxable self-employment income reported on the individual's Form 1040 on Schedule C and Schedule SE and transferred the data to SSA. Values in these data fields were added together to create a single entry for taxable Social Security earnings, which is stored on the Summary Earnings Record. As a result, it is not possible to determine whether covered earnings in a given year are from wages and salaries or from self-employment income. The MEF also contains separate indicators for the presence of self-employment income (Schedule C) or agriculture income (Schedule F) in a given year. However, if there are combinations among salary and wages, self-employment income, and income from agriculture, the amounts attributable to each source cannot be determined. As a result, these flags were not included in EPUF.11
As previously noted, the MEF has no annual values for the number of QCs earned in 1951 and 1952. This value is estimated by manipulating data used to calculate QCs from 1937 through 1950. Beginning in 1953, the MEF contains annual QC values based on quarterly earnings data.
1978–2006 Earnings Data
In 1978, SSA earnings data underwent major changes involving sources, processing, and types of data collected. Because requiring quarterly earnings reports had led to processing delays and administrative burdens, new legislation required employers to report their employees' earnings annually on Form W-2. The legislation also made SSA responsible for processing the W-2 earnings data. The source for self-employed taxable earnings, Form 1040 Schedule SE, remained unchanged.
The move to annual collection of earnings data resulted in three significant changes in the types of data collected:
- The W-2 included earnings from employment that was not covered under Social Security. Prior to 1978, SSA was only concerned with taxable earnings from covered employment.
- The ability to store data electronically and the need for more detailed earnings information to administer the program led SSA to establish separate data fields for taxable wage-and-salary income and taxable self-employment income. Prior to 1978, administrative data contained a single entry for all taxable earnings.
- The W-2 allowed SSA to capture covered wage-and-salary income above the taxable maximum. Earnings reported to SSA for all previous years were capped at the taxable maximum.
It is important to note that the inclusion of taxable self-employment income on an individual's earnings record reflects the reporting criteria used during two distinct periods. For 1978 through 1993, self-employment income appears on an individual's earnings record only when Social Security or Medicare taxes were due on that income. It was not until 1994, when the cap for taxable earnings subject to the Medicare payroll tax was eliminated, that SSA's earnings data began to include uncapped values for covered self-employment income.
Several examples illustrate how the amount of taxable self-employment income differs from the amount of self-employment income reported for federal income tax purposes across these two periods. Suppose an individual earned $25,000 in covered wages and $25,000 in self-employment income, and assume a taxable maximum of $40,000. Prior to 1994, the individual's earnings record for that year would contain $25,000 for wage-and-salary income and only $15,000 for self-employment income. Now consider an individual with self-employment income of $55,000 and no covered wages. In this example, the individual's earnings record would have $40,000 for taxable self-employment income. From 1994 onward, there is no cap on the amount of covered earnings subject to the Medicare payroll tax. As a result, the full amount of both wage-and-salary and self-employment income in the examples above would appear in the individual's earnings record on the MEF, although EPUF reports only the capped taxable amounts.
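The pre-1994 examples reduce to a simple rule: wages count toward the taxable maximum first, and self-employment income fills only the room that remains. The Python sketch below reproduces the two examples; the function name is hypothetical.

```python
def pre_1994_taxable_amounts(wages, self_employment, taxable_max):
    """Return (taxable wages, taxable self-employment income) for one year."""
    taxable_wages = min(wages, taxable_max)
    # Self-employment income is taxable only up to the remaining room.
    taxable_se = min(self_employment, taxable_max - taxable_wages)
    return taxable_wages, taxable_se


print(pre_1994_taxable_amounts(25_000, 25_000, 40_000))  # (25000, 15000)
print(pre_1994_taxable_amounts(0, 55_000, 40_000))       # (0, 40000)
```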
The Revenue Act of 1978 also affected the earnings data collected by SSA by allowing the elective deferral of wage earnings.12 Elective deferrals enabled individuals to postpone the receipt and the taxation of certain types of earnings. This led to the creation of 401(k) retirement plans, 403(b) plans for employees of nonprofit organizations, and 457 plans for state and local government employees. From 1978 through 1983, these elective deferrals were not covered under Social Security. As a result, the taxable earnings data in EPUF for these years do not include contributions to these plans.
Starting in 1984, elective deferrals are covered under the program and are reflected in the taxable earnings in EPUF (up to the taxable maximum). In 1990, SSA was required to include elective deferrals in the calculation of the average wage index, and created a separate data field in the MEF detailed segment to capture this information.
Data on annual QCs earned during 1978–2006 are based on taxable earnings in a given year. As noted earlier, the MEF contains annual QC values after 1952.
Sample Selection, Data Cleaning, and Disclosure Protection
EPUF consists of earnings records drawn from a 1-percent sample of the MEF (the "underlying EPUF sample"). A series of data cleaning and disclosure protection procedures produced the final EPUF. This section describes the process of selecting the underlying EPUF sample, the data cleaning steps, and the disclosure protections that were applied to the data to produce the EPUF.
The sample universe for the EPUF is all SSNs issued before January 2007. Thus, any individual who does not have an SSN cannot be included in the EPUF. The EPUF sample was created using a systematic sampling process that closely approximates a random sample. For each area-group combination, an algorithm selects 100 out of the possible 10,000 SSNs.13 SSA then determines if the SSNs have been issued. The sampling algorithm is systematic in that it avoids any overlap between the BEPUF, EPUF, and any potential future public-use samples generated using the algorithm.14 SSA has determined that the design effect for the systematic random sample is effectively equal to one.15
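The article does not describe the selection algorithm itself, so the following Python sketch is only a generic illustration of systematic selection, not SSA's method. It assumes a fixed offset and a step of 100 across the 10,000 possible serial values in one area-group combination; giving each file its own offset is one way such a scheme could avoid overlap between samples.

```python
def sampled_serials(offset, step=100, n_serials=10_000):
    """Pick every `step`-th serial value, starting at `offset` (hypothetical rule)."""
    if not 0 <= offset < step:
        raise ValueError("offset must be in [0, step)")
    return list(range(offset, n_serials, step))


first_sample = sampled_serials(offset=37)   # offsets are arbitrary examples
second_sample = sampled_serials(offset=82)
print(len(first_sample))                       # 100
print(set(first_sample) & set(second_sample))  # set()
```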
The SSNs generated using this algorithm were checked for inclusion in the Numident file to confirm their presence in the Social Security administrative files.16 A final check verified that none of the SSNs in the sample overlapped those in the BEPUF. The resulting underlying EPUF sample contained 4,377,024 individuals.17 Note that the sample is not strictly representative of the US population because the sampling universe (all SSNs issued) includes individuals in Puerto Rico and the US territories.
A number of analyses were undertaken to determine if there were any problems with the data and, if so, what to do about them. Three key issues were identified: (1) a coding error incorrectly assigned a YOB value equal to 1900 to many individuals, (2) some YOB values were missing, and (3) some extreme age values occurred for individuals who had taxable earnings (values ranged from -47 years to 179 years).18 Several other smaller issues were discovered in the process of generating the EPUF and a number of steps were taken to "clean" the data before releasing the file to the public.
The first check involved graphing the distribution of individuals in the underlying sample by their YOB. This graph produced an abnormally large spike in the number of individuals with a YOB value equal to 1900. For these 24,843 individuals, a check against the Numident file confirmed a YOB value of 1900 on 21,269 records. There were 3,464 individuals whose YOB value was missing on the Numident file; these were removed from EPUF. This left 110 individuals with an alternative (non-1900) YOB value on the Numident file. The Numident's alternative value was assigned for those individuals.
The next data-cleaning issue involved the 13,405 individuals in the underlying sample whose MEF records had a missing value for YOB. The overwhelming majority (12,142) also had a missing value for YOB on the Numident file; these individuals were removed from EPUF. Of the remaining 1,263 individuals, 1,234 had a single YOB value on the Numident file; for them, the Numident YOB was used. This left 29 individuals who had multiple YOB values on the Numident file; for these, we assigned a "best" YOB value.
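The decision rule for records with a missing YOB on the MEF can be sketched as follows. The function name and the action labels are hypothetical, and because the article does not state how the "best" value was chosen among multiple Numident entries, the sketch only flags those cases.

```python
def resolve_missing_yob(numident_yobs):
    """Return (action, yob) for a record whose MEF year of birth is missing."""
    values = sorted({v for v in numident_yobs if v is not None})
    if not values:
        return ("remove", None)           # missing on the Numident as well
    if len(values) == 1:
        return ("use_numident", values[0])
    return ("choose_best", None)          # selection rule not specified here


print(resolve_missing_yob([None]))        # ('remove', None)
print(resolve_missing_yob([1942]))        # ('use_numident', 1942)
print(resolve_missing_yob([1942, 1924]))  # ('choose_best', None)
```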
The analysis of the age at which an individual in the underlying sample recorded taxable earnings found 77,458 individuals who either had age values of less than 14 or greater than 79, or had earnings during 1937–1950 but a YOB value after 1950. Again, MEF records were validated against the Numident file. Records for 5,810 individuals were removed for one of the following reasons: there was no logical choice among multiple alternative YOB values on the Numident, age when recording taxable earnings was either negative or greater than 100, or the YOB value was after 1950 although earnings were recorded during 1937–1950.
The final adjustments included removing 5,935 individuals whose YOB value was before 1870, removing 1,096 individuals whose YOB value was equal to 2007, and removing 4 individuals who were assigned a missing YOB value. Individuals born before 1870 were removed because they were unlikely to have received Social Security benefits. The data for the underlying sample were extracted in 2007 and it is possible that a small number of individuals who were enumerated after December 31, 2006 were part of the sample.
Data "cleaning" procedures resulted in the removal of records for 28,451 individuals from the underlying sample. The effect of removing these individuals on the number of earnings records and on the amount of earnings by year is discussed later in conjunction with the effect of the data disclosure procedures.
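As a point of reference, the removal counts reported for each cleaning step can be tallied to confirm the total. The dictionary labels below are shorthand descriptions, not official category names.

```python
removed = {
    "YOB of 1900 on MEF, missing on Numident": 3_464,
    "YOB missing on both MEF and Numident": 12_142,
    "implausible age or YOB for recorded earnings": 5_810,
    "YOB before 1870": 5_935,
    "YOB equal to 2007": 1_096,
    "assigned a missing YOB": 4,
}
total_removed = sum(removed.values())
print(total_removed)  # 28451
```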
The most critical determinant of whether data fields can be included in the public-use file is disclosure risk. To protect confidentiality, SSA removes all identifying information, evaluates disclosure risk posed by administrative earnings data for individuals that overlap other public-use files,19 and modifies any distinguishing characteristics that could identify individuals in the file. The data disclosure procedures applied to the EPUF fall into three broad categories: (1) removing any identifiable information from the file and evaluating the disclosure risk of public-use file overlap, (2) adjusting the earnings amounts to create a range of uncertainty between the amount of earnings reported to SSA and the amount released in EPUF, and (3) zeroing out earnings records because of age considerations. These categories are described in detail below.
Removing identifiable information and evaluating disclosure risk from public-use file overlap. To minimize disclosure risk, the following steps were taken:
- All SSNs were removed from the file.
- The records in the final EPUF were randomly sequenced.
- Where possible, EPUF sample records were checked for overlap with other public-use files.
As previously noted, there is no overlap between individuals in BEPUF and EPUF. There were 319 individuals in the underlying EPUF sample who were included in the New Beneficiary Data System (NBDS). These individuals were removed from the sample.20
Although minimal overlap between individuals in EPUF and individuals in the Synthetic SIPP Beta files (SSB) is likely, SSA and the IRS have concluded that there is no disclosure risk because all of the earnings data in the SSB are synthetic.21
The number of individuals in EPUF who are potentially included in the public-use files created from the 1963 Pilot Link Study, the 1973 Exact Match Study, and the Retirement History Study is very small (see text box). SSA and the IRS have determined that disclosure resulting from overlap of these files is very unlikely.
SSA has released a number of public-use microdata files that contain earnings data from its administrative files. The first six items listed below are products of two interagency studies undertaken in the 1960s and 1970s: the 1963 Pilot Link Study and the 1973 Exact Match Study, conducted by SSA, the Census Bureau, and the IRS. SSA produced items 7 and 8 independently.
1. The 1964 Current Population Survey—Administrative Record Pilot Link File
2. The 1973 Current Population Survey—Summary Earnings Record Exact Match File
3. The 1973 Current Population Survey—Administrative Record Exact Match File
4. The Social Security Longitudinal Earnings Exact Match Public Use File, 1937–1975
5. The 1972 Augmented Individual Income Tax Model Exact Match File
6. The Retirement History Longitudinal Survey, 1969–1973, and Summary of Social Security Earnings: Merged Data
7. The New Beneficiary Data System
8. The 2004 Benefits and Earnings Public-Use File
The 1963 Pilot Link Study matched data from the Census Bureau's Current Population Survey with SSA and IRS administrative data files. The 1973 Exact Match Study refined the 1963 Pilot Link Study processes. The primary objective of both studies was to improve the quality of statistical output related to income distribution and redistribution.
The Retirement History Study matched survey data with Social Security administrative data to create public-use data files useful for researching retirement decisions and circumstances.
The New Beneficiary Data System consists of two separate surveys. The original survey was the New Beneficiary Survey, a nationally representative survey of beneficiaries who were in payment status during a 12-month period from mid-1980 to mid-1981. In 1992, SSA conducted the New Beneficiary Followup (NBF) survey and attached limited earnings data to all 18,599 individuals in the original survey.
The 2004 Benefits and Earnings Public-Use File, released in 2006, is a systematic random sample of individuals who were on the benefit rolls as of December 2004.
Adjusting earnings to create a range of uncertainty and limit potential disclosure. With a few exceptions, the earnings amounts in EPUF were random-rounded to a base of $25, $100, or $1,000, depending on the amount of earnings reported to SSA.22 Specifically,
- earnings greater than $100 and less than $1,000 were random-rounded to a base of $25;
- earnings greater than $1,000 and less than $50,000 were random-rounded to a base of $100; and
- earnings greater than $50,000 were random-rounded to a base of $1,000.
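A minimal Python sketch of random rounding with these bases follows. It rests on assumptions: the text here does not give the rounding probabilities, so the sketch uses the common unbiased rule in which a value moves up to the next multiple of the base with probability equal to its remainder divided by the base. Earnings are treated as whole dollars, and the function names are hypothetical. Values that already sit on a multiple of the base (including exactly $1,000 or $50,000) are unchanged, so the choice of base at those boundaries does not matter.

```python
import random


def rounding_base(amount):
    """Rounding base implied by the earnings ranges listed above."""
    if amount < 1_000:
        return 25
    if amount < 50_000:
        return 100
    return 1_000


def random_round(amount, rng=random):
    """Round to an adjacent multiple of the base (assumed unbiased rule)."""
    if amount < 100:
        raise ValueError("values under $100 are handled by a separate step")
    base = rounding_base(amount)
    remainder = amount % base
    if remainder == 0:
        return amount
    lower = amount - remainder
    # Round up with probability remainder / base, otherwise round down.
    return lower + base if rng.random() < remainder / base else lower


print(random_round(4_250))  # either 4200 or 4300
```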
Using this process, earnings near the taxable cap could be rounded up to the taxable maximum, and very low earnings could be rounded down to zero. SSA was concerned that this could affect two key research issues: (1) analyses of the differences between workers and nonworkers (as defined in terms of covered employment) and (2) analyses comparing individuals with earnings above and below the taxable maximum in a given year. To maintain the integrity of the data in these two areas, and to eliminate the possibility of rounding down to zero or rounding up to the taxable maximum in a given year, the following steps were taken:
- All annual earnings values less than $100 were replaced with the average amount of all earnings less than $100 in a given year.
- All annual earnings within the random rounding base of the taxable maximum ($100 or $1,000, depending on the taxable maximum in a given year) were replaced by the average of all values within the rounding base for that year.
- Any values for the aggregate amount of earnings from 1937 to 1950 greater than $37,000 were replaced with $41,500 (the average value of all aggregate earnings amounts greater than $37,000).
- Any values for the aggregate amount of earnings from 1937 to 1950 that were less than $100 were replaced with $39 (the average dollar amount for all values of aggregate earnings less than $100).
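A minimal sketch of the first two replacement steps, applied to one year of annual earnings, might look as follows. The function names are hypothetical. The sketch also assumes that the rounding base near the taxable maximum is $1,000 when the maximum exceeds $50,000 and $100 otherwise, which is one reading of "depending on the taxable maximum in a given year."

```python
from statistics import mean

def replace_with_group_mean(values, in_group):
    """Replace every value for which in_group(value) is True with the group's mean."""
    group = [v for v in values if in_group(v)]
    if not group:
        return list(values)
    group_mean = mean(group)
    return [group_mean if in_group(v) else v for v in values]

def protect_edges(annual_earnings, taxable_max):
    """Apply the two annual-earnings adjustments to one year's earnings values."""
    # Assumed mapping from the taxable maximum to its rounding base.
    base = 1_000 if taxable_max > 50_000 else 100
    # Step 1: positive earnings under $100 become their average.
    out = replace_with_group_mean(annual_earnings, lambda v: 0 < v < 100)
    # Step 2: earnings within one rounding base below the maximum become their average.
    return replace_with_group_mean(out, lambda v: taxable_max - base < v < taxable_max)

# Illustrative values only; $96,000 is a hypothetical taxable maximum.
print(protect_edges([20, 60, 500, 95_250, 95_750, 40_000], 96_000))
```

Zero earnings are deliberately left out of the under-$100 group so that nonworkers stay distinguishable from workers, which is the purpose the text gives for this step.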
These adjustments to the random-rounding process may reduce the amount of uncertainty between the earnings reported to SSA and those contained in EPUF for a select group of individuals. First, consider an individual with $100 in earnings in EPUF. Because all values below $100 were replaced with an average, we know that the actual value of earnings reported to SSA for this individual had to be between $100 and $124. This creates a one-sided range of uncertainty of only $25 instead of plus or minus $25. However, this limited range of uncertainty occurs only for the $100 value of earnings.
Second, consider an individual with earnings of $95,250 in a year when the taxable maximum was $96,000. This individual's earnings value was replaced with the average value for all individuals with earnings from $95,001 to $95,999. In this case, we know the actual value of earnings reported to SSA to within $1,000. This is a much smaller range of uncertainty than the difference of plus or minus $1,000 that applies to earnings greater than $50,000 and not within the random-rounding base of the taxable maximum.
Third, the random-rounding process may also affect the number of annual QCs included in EPUF for 1951–2006. On the MEF, QCs are calculated based on quarterly earnings (1951 to 1977) and on annual earnings (1978 to 2006) recorded for a given year. However, the random-rounding process can change the value of earnings by plus or minus $25, $100, or $1,000, depending on the amount of taxable earnings in a given year. Thus, QCs based on randomly rounded earnings values may differ from those based on the MEF.
This potential discrepancy raises questions about the effectiveness of the random-rounding process. Consider a case in which the amount of earnings on the MEF is $735 and the rounded earnings value is $750 for a year in which $250 are needed to earn a QC. The QCs based on MEF earnings would be two, and the rounded-earnings QC value would be three. By using the MEF QC value in EPUF we would know that the actual earnings reported to SSA would be between $725 and $750. In addition to reducing the range of uncertainty for the individual's earnings, this could affect analyses of eligibility for benefits.
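The example can be worked through in a few lines. This sketch covers only the annual-earnings rule (1978 onward), under which one QC is credited per fixed dollar amount of annual earnings up to four per year. The function name `quarters_of_coverage` is hypothetical. The candidate range of $726 to $774 is the set of values that random rounding to a base of $25 can map to $750.

```python
def quarters_of_coverage(annual_earnings, amount_per_qc, max_qcs=4):
    """QCs under the annual-earnings rule: one per amount_per_qc dollars, capped at max_qcs."""
    return min(max_qcs, annual_earnings // amount_per_qc)

mef_earnings, rounded_earnings, amount_per_qc = 735, 750, 250
mef_qcs = quarters_of_coverage(mef_earnings, amount_per_qc)
rounded_qcs = quarters_of_coverage(rounded_earnings, amount_per_qc)

# Whole-dollar values that random rounding (base $25) could turn into $750 ...
candidates = range(726, 775)
# ... narrowed to those consistent with the QC count recorded on the MEF.
consistent = [e for e in candidates if quarters_of_coverage(e, amount_per_qc) == mef_qcs]

print(mef_qcs, rounded_qcs, consistent[0], consistent[-1])  # prints: 2 3 726 749
```

Knowing the MEF QC count roughly halves the interval of uncertainty in this case, which is the disclosure concern the paragraph raises.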
In this light, the question arises: What is the appropriate value for QCs to include in EPUF? A comparison of the QC measure on the MEF with that based on randomly rounded earnings found the following four items:
- Of 60,326,474 records with positive earnings, QC values differed on only 175,609 (0.29 percent).
- When records differed, the maximum difference was plus or minus one QC.
- The aggregate number of QCs based on randomly rounded earnings (213,915,632) was 39,389 fewer than the aggregate number of quarters on the MEF, a difference of only 0.018 percent.
- The net impact of random rounding on total QCs earned at the individual level was very small. Among those whose records were affected, nearly 97 percent had a net difference of plus or minus one quarter over their work histories.
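The two percentages in the list can be reproduced from the counts given. The sketch below uses only figures from the text; the variable names are illustrative.

```python
records_with_earnings = 60_326_474
records_with_different_qcs = 175_609
qcs_from_rounded_earnings = 213_915_632
shortfall_relative_to_mef = 39_389

# Share of records whose QC value differs between the two measures.
share_differing = 100 * records_with_different_qcs / records_with_earnings

# The MEF total is the rounded-earnings total plus the shortfall.
qcs_on_mef = qcs_from_rounded_earnings + shortfall_relative_to_mef
share_shortfall = 100 * shortfall_relative_to_mef / qcs_on_mef

print(f"{share_differing:.2f} percent of records differ")
print(f"{share_shortfall:.3f} percent fewer QCs in aggregate")
```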
Given the very small differences between the two QC measures, SSA included the MEF measure in EPUF because it reflects an individual's actual number of QCs earned.
Zeroing out earnings for certain ages. When the BEPUF was created, the IRS requested that SSA zero out all earnings for individuals born after 1937 who had earnings at ages 14 or younger to prevent disclosure of potentially identifiable data.
SSA applied these same data disclosure procedures to EPUF. In addition to zeroing out any earnings for individuals who were very young, SSA assigned a value of zero to any earnings records that had a positive value when the individual was aged 86 or older.
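A minimal sketch of the age-based rule follows. It assumes age is approximated as the earnings year minus the year of birth, and it ignores any birth-year condition attached to the original BEPUF request. The function name and the dictionary layout are hypothetical, not the EPUF file format.

```python
def zero_out_by_age(year_of_birth, earnings_by_year, youngest_kept=15, oldest_kept=85):
    """Return a copy of earnings_by_year with earnings at ages 14 or younger,
    or 86 or older, set to 0."""
    cleaned = {}
    for year, amount in earnings_by_year.items():
        age = year - year_of_birth  # assumption: age approximated from calendar years
        cleaned[year] = amount if youngest_kept <= age <= oldest_kept else 0
    return cleaned

# Earnings at age 14 are zeroed; earnings at age 15 are kept.
print(zero_out_by_age(1940, {1954: 300, 1955: 450}))
```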
Table 1 shows the number of records that SSA either removed from the underlying EPUF sample because of data cleaning or assigned a value of $0 because of data disclosure procedures, along with the dollar value of earnings represented by these omitted records.23 Table 2 shows the number of records and the value of earnings represented in the entire underlying EPUF sample, in the omitted records, and in the resulting final EPUF, revealing that the omitted records are a very small share of the original underlying sample.
|Year||Removed for data cleaning: records||Removed for data cleaning: dollar amount||Earnings set to zero, aged 14 or younger: records||Earnings set to zero, aged 14 or younger: dollar amount||Earnings set to zero, aged 86 or older: records||Earnings set to zero, aged 86 or older: dollar amount||Total: records||Total: dollar amount|
|SOURCE: Author's calculations based on underlying EPUF sample.|
|Year||Underlying EPUF sample with positive earnings: records||Underlying EPUF sample with positive earnings: dollar amount||Affected by data cleaning or disclosure protection procedures a: records||Affected by data cleaning or disclosure protection procedures a: dollar amount||Final EPUF: records||Final EPUF: dollar amount||Final EPUF as a percentage of underlying EPUF sample: records||Final EPUF as a percentage of underlying EPUF sample: dollar amount|
|SOURCE: Author's calculations based on underlying EPUF sample.|
|a. Includes records removed because of data cleaning and records with earnings values set to zero for individuals with earnings at age 14 or younger or at age 86 or older.|
After all of the data cleaning and data disclosure procedures were applied, several steps were taken to evaluate the validity of the data contained in EPUF. A forthcoming Research and Statistics Note compares the data in the underlying sample and the final EPUF with the earnings estimates published by SSA in the Annual Statistical Supplement to the Social Security Bulletin.
Caveats on Using EPUF Data
Any user should be fully aware of three caveats on using the EPUF: (1) earnings data in EPUF are capped taxable Social Security earnings, (2) EPUF does not contain all of the information needed to calculate benefits accurately for everyone in the file, and (3) there may be some errors in the administrative data underlying EPUF.
Capped Taxable Social Security Earnings
As previously noted, earnings data in EPUF are limited to capped taxable Social Security earnings. The file excludes data for workers whose only earnings are from noncovered employment. Additionally, the file does not contain covered earnings above the taxable maximum.
Table 3 compares the number of workers covered under the Social Security program with all US workers. Although the percentage working in covered employment has increased dramatically over time—from 55 percent in 1939 to nearly 94 percent in 2006—6 percent of the US workforce in 2006 still worked in noncovered employment.
|Year||Paid civilian workers a||Workers in covered employment: number (millions)||Workers in covered employment: as a percentage of paid civilian workers|
|SOURCE: Unpublished data from SSA's Office of the Chief Actuary.|
|NOTE: Data for 1939, 1944, and 1949 are monthly averages; data for all other years are as of December.|
|a. Includes wage-and-salary earners and the self-employed.|
Chart 2 shows that the amount of covered earnings expressed as a percentage of all earnings in the economy increased from approximately 70 percent in 1950 to nearly 85 percent in 2006. This represents a large increase in the share of earnings covered under the program, but it also reveals that approximately 15 percent of earnings in 2006 were not in covered employment.
Chart 2. Social Security earnings (weighted) as a percentage of all earnings, by year. Series shown: Social Security covered earnings; Social Security taxable earnings; capped Social Security taxable earnings in EPUF.
However, noncovered earnings account for only part of the earnings "missing" from EPUF. Chart 2 also shows taxable Social Security earnings and the capped taxable Social Security earnings measure used in EPUF. As a percentage of total earnings in the economy, EPUF's capped taxable earnings range from around 55 percent in the early 1950s to 78 percent in 1986, then decline gradually to 70 percent by 2006.
The relatively large differences between covered and taxable earnings from 1951 through the mid-1970s stem from the low taxable maximum earnings amounts during those years. The jagged pattern of the differences results from ad hoc changes to the taxable maximum. Prior to the 1972 Social Security Amendments, the taxable maximum was set by statute. From 1937 to 1950, the taxable maximum was $3,000. The first increase in the taxable maximum, to $3,600, occurred in 1951, and it increased four more times through 1971. The 1972 amendments provided an automatic annual increase in the taxable maximum proportional to the increase in the national average wage. The key point for EPUF users is that using different methodologies for increasing the taxable maximum has affected the number (and proportion) of workers with earnings at or above the taxable maximum. For example, in 1951, nearly 25 percent of workers with covered earnings had earnings equal to or greater than the taxable maximum. In 1960 and 1970, the percentages of workers with earnings at or above the taxable maximum were 28 percent and 26 percent, respectively. In 1980, the percentage dropped to 9 percent and by 2006, it had dropped even further, to 6 percent (SSA 2009, Table 4.B4).24
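The effect of the cap can be shown with a short sketch. The two taxable maximum values come from the paragraph above ($3,000 for 1937 to 1950 and $3,600 for 1951). The function names and the sample earnings are hypothetical.

```python
# Taxable maximums cited in the text; other years are omitted from this sketch.
TAXABLE_MAXIMUM = {1950: 3_000, 1951: 3_600}

def capped_taxable_earnings(covered_earnings, year):
    """Covered earnings as they would appear after capping at the taxable maximum."""
    return min(covered_earnings, TAXABLE_MAXIMUM[year])

def share_at_or_above_maximum(covered_earnings_list, year):
    """Fraction of workers whose covered earnings reach the taxable maximum."""
    cap = TAXABLE_MAXIMUM[year]
    at_or_above = sum(1 for e in covered_earnings_list if e >= cap)
    return at_or_above / len(covered_earnings_list)

sample = [1_000, 3_600, 5_000, 2_000]  # made-up covered earnings for 1951
print([capped_taxable_earnings(e, 1951) for e in sample])
print(share_at_or_above_maximum(sample, 1951))
```

In the capped data, every worker at or above the maximum shows the same value, so the share at the cap can be counted but earnings above it cannot be recovered.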
Chart 2 reveals that the earnings in EPUF omit a significant portion of the total earnings in the economy from 1951 through 2006. Thus, using EPUF to analyze work patterns for individuals with a mix of covered and noncovered earnings may produce inaccurate results. Suppose an individual started working in a noncovered job in 1945 that was redefined as covered employment in 1955. This individual's work history in the EPUF would begin in 1955, with no indication that he or she really started working in 1945. Another example is an individual who worked in covered employment during high school and college and subsequently worked in a job that was not covered. This would result in a covered work history that starts in the individual's early work years and stops shortly thereafter.
Limitations on Estimating Benefits
One expected use of EPUF is to evaluate how programmatic changes affect benefit amounts. However, such analysis is limited to estimating an individual's primary benefits; that is, benefits based on one's own earnings record. For example, auxiliary benefits—those to which individuals would be entitled based on their spouses' or parents' earnings record—cannot be estimated because there is no way to identify a spousal or parental link among individuals in EPUF.25 This is problematic because many female beneficiaries receive part or all of their benefits based on a current or former spouse's higher earnings. Nevertheless, analysts can make reasoned assumptions about family size and estimate hypothetical family benefits based on an individual's own earnings records.
Analysts cannot use EPUF to estimate disability benefits because the file does not contain information about an individual's period(s) of disability. In addition, any calculation of retirement benefits for a disabled beneficiary would be inaccurate because it would exclude periods of disability. However, one can use EPUF to determine an individual's insured status in a given year and to estimate hypothetical disability benefits that could be awarded if an individual became disabled.
The EPUF does not contain a date of death for deceased individuals. As a result, one cannot determine if a string of years with zero earnings reflects that the individual has retired, become disabled, or died.
The accuracy of estimates for primary benefits may be affected by the lack of detailed information for some individuals in the file. When calculating an individual's benefit amount, SSA uses the certified earnings record, which includes any ancillary earnings information such as military credits, railroad employment income, or having multiple SSNs.26 Because EPUF omits this information, estimates of benefits for individuals who had these sources of income or had multiple SSNs are suspect. Although the number of individuals having multiple SSNs or railroad income is relatively small, accurate assessments of the effects of programmatic changes on these individuals would require such information. The number of individuals with military credits is likely to be much larger, but the impact on benefits is likely to be relatively small for those with limited military service.
Incomplete information in the EPUF also hinders accurate estimates of benefits for individuals with earnings during 1937–1950. Recall that SSA had to estimate the number of QCs associated with earnings from this period. Consider an individual who applies for benefits but is a couple of quarters short of being eligible. In such a case, SSA reviews the microfilm record to determine the individual's actual amount of covered earnings during the period. SSA posts this amount to the detailed segment of the MEF, then determines the QCs earned using the usual procedures. However, EPUF does not include the information from the microfilm. Therefore, analysts should exercise caution when using EPUF data on QCs for this period, and should note this fact in any analysis using that data field.
The user should also note that precise computation of monthly benefits paid is not possible with the EPUF because age at entitlement, on which monthly benefit amounts are based, cannot be observed in the file. With EPUF, it is also not possible to adjust benefits for workers subject to the Windfall Elimination Provision, which reduces benefits of "individuals who have only minimal Social Security coverage and will receive a pension based on years of work in noncovered employment" (SSA 2009).
Errors in Underlying Earnings Data
SSA has been collecting data on individual workers covered under the program since its inception. The agency uses administrative files to determine eligibility for benefits, to determine benefit amounts, to estimate future benefit payments, and for a variety of other purposes.
Each year, capturing the earnings data reported on Form W-2 and used for program purposes is a massive undertaking. For earnings reported in tax year 2006, SSA processed W-2s for nearly 155 million workers and generated approximately 250 million wage items. SSA processed nearly 80 percent of the wage items reported on the W-2s electronically, and the remaining 20 percent were scanned using character recognition software or keyed in manually. In addition, SSA received information on self-employment income from the IRS based on data reported on Schedule SE. This information accounted for approximately 20 million items posted to the MEF. In total, SSA posted nearly 270 million earnings-related items for tax year 2006 to its MEF.
With so many items posted every year, the MEF is clearly susceptible to missing or erroneous earnings data. Each step of the process introduces potential errors. The employer may enter an incorrect amount for a given individual, or may put the correct information in the wrong box on the W-2. In addition, the SSN may not be valid or the name on the W-2 may not match the one to which the SSN was enumerated.27 Errors can also arise as SSA posts the data in the MEF.
SSA has an elaborate set of checks to identify and correct improperly reported earnings information.28 The agency verifies that the information on all the W-2s submitted by an employer corresponds to the amounts reported by the employer on Form W-3. SSA continuously updates the MEF as corrected W-2s (W-2c's) and delinquent W-2s stream in throughout the year. Workers may also file amended tax returns to correct errors reported in previous filings.
If SSA detects errors in a worker's earnings record, it sends a letter to the employer seeking clarification. In response, the employer may file a W-2c. In some instances, an employer files a W-2c and the employee supplies information to correct the same error; the resulting double-correction also produces errors on the MEF.
Another opportunity to catch earnings-record errors arises when SSA mails out its annual Social Security statement to workers aged 25 or older. Errors detected by the worker can be resolved at any SSA field office.29 Finally, workers can catch errors in their earnings data when they apply for benefits. Applicants see their complete earnings histories and can direct SSA to correct any verifiable errors they spot. Nevertheless, despite extensive efforts to ensure accurate earnings records, the EPUF may contain erroneous information.
Highlights from the EPUF
This section presents statistical highlights of the earnings data for the 4,348,254 individuals whose records are included in EPUF. Figures cited are unweighted.
Individuals by YOB
There are five distinct trends in the distribution of individuals by birth year in EPUF (Chart 3). The first is a steep increase in the number of individuals in the file, starting with 1,813 born in 1870 and peaking at 31,877 born in 1921. The second is a steady decline from 31,104 born in 1922 to 26,568 in 1933. The third trend is a dramatic increase to nearly 53,000 who were born in 1962, nearly doubling the number of individuals born in 1933. The fourth is a steep decline from 52,138 individuals born in 1963 to 41,792 born in 1975. The final trend reflects relatively flat numbers of individuals born from 1976 through 2006, from 41,822 to 41,241, respectively.
Chart 3. Number of individuals in EPUF (in thousands), by year of birth
Chart 4 presents the distribution of individuals by YOB and sex.30 For birth years from 1870 to about 1925, men outnumber women in EPUF. With a few exceptions, the numbers of women and men in the file are nearly the same for birth years from 1926 to 1947. The number of men born from 1948 to 2006 is consistently higher than the number of women, although not by very much.
Chart 4. Number of individuals in EPUF (in thousands), by year of birth and sex
Workers and Nonworkers
There are four distinct categories of individuals in EPUF depending on whether they had any Social Security taxable earnings and, if so, the period in which they were earned. The four categories are nonworkers (individuals with no taxable earnings), workers with taxable earnings during 1937–1950 only, workers with taxable earnings during 1951–2006 only, and workers with taxable earnings in both periods. More than one-half of the individuals in EPUF had earnings during 1951–2006 only, about 4 percent had earnings only during 1937–1950, and 16 percent had earnings in both periods (Chart 5).
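The four categories can be derived from the two kinds of earnings fields the file provides: the aggregate amount for 1937–1950 and the annual amounts for 1951–2006. The sketch below uses hypothetical argument names rather than actual EPUF field names.

```python
def earner_category(aggregate_1937_1950, annual_1951_2006):
    """Classify one individual by the period(s) with positive taxable earnings.

    aggregate_1937_1950: single aggregate earnings amount for 1937-1950.
    annual_1951_2006: iterable of annual earnings amounts for 1951-2006.
    """
    early = aggregate_1937_1950 > 0
    late = any(amount > 0 for amount in annual_1951_2006)
    if early and late:
        return "both periods"
    if early:
        return "1937-1950 only"
    if late:
        return "1951-2006 only"
    return "nonworker"

print(earner_category(0, [0, 0, 1_200]))  # earnings in the later period only
```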
Chart 5. Percentage distribution of individuals in EPUF, by capped Social Security taxable earnings status
Initially, the 24.7-percent figure for individuals in EPUF who did not have any earnings seems very large. However, Chart 6 reveals that the bulk of these individuals (68 percent) were born after 1987. Thus, the main reason so many individuals in EPUF have no earnings is that most of them are not old enough to participate in the labor market.31
Chart 6. Cumulative percentage distribution of individuals in EPUF with no capped Social Security taxable earnings, by year of birth
Chart 7 presents the distribution by sex of individuals in EPUF in each earner status. Women outnumber men among those who do not have any earnings (52 percent versus 48 percent). Among individuals with earnings during 1937–1950 only, a large majority are men (57 percent versus 43 percent). This result was expected because women were much less active in the labor market during that period. Individuals in EPUF with earnings during both periods skew even more towards men, 61 percent versus 39 percent. Individuals with earnings during 1951–2006 only are more evenly distributed between men (51 percent) and women (49 percent), reflecting women's substantial increases in labor force participation during the period.
Chart 7. Percentage distribution of individuals in EPUF in each capped Social Security taxable earnings status, by sex
|Earnings status||Men||Women|
|Earnings during 1937–1950 only||56.6||43.2|
|Earnings during 1951–2006 only||50.8||49.2|
|Earnings during both periods||60.7||39.3|
Individuals in EPUF with any earnings during 1937–1950 number 874,287. Approximately 60 percent are men (523,465) and 40 percent are women (350,229). There are also records for 593 individuals whose sex is unknown and who had earnings during this period. Appendix Chart A1 presents the distribution of individuals with earnings during this period by YOB and sex. The average and median values for all earnings during this period are $9,106 and $4,600, respectively (not shown). The average earnings for men ($11,990) is much higher than that for women ($7,521). The median earnings for men and women diverge even more, at $7,900 and $1,800, respectively.
Earnings in EPUF
Chart 8 shows that the gap between the number of men and women with earnings in a given year has decreased significantly between 1951 and 2006. Chart 9 shows a slow but steady climb in aggregate earnings for men and women over the same period.32 The difference between the total amount of earnings for men and women has been increasing over time. However, women's taxable earnings as a percentage of all taxable earnings has increased from 22.1 percent in 1951 to 39.7 percent in 2006 (see Table A2). Table 4 presents the average and median earnings of men, women, and individuals with unknown sex in the EPUF.
Chart 8. Number of individuals with capped Social Security taxable earnings in EPUF, by sex, 1951–2006
Chart 9. Aggregate amount of capped Social Security taxable earnings in EPUF, by sex of earner, 1951–2006
|Year||All workers||Men||Women||Sex unknown|
|SOURCE: Author's calculations based on the 2006 EPUF.|
The 2006 EPUF contains earnings data for individuals drawn from a 1-percent sample of all SSNs issued before January 2007. The file contains limited demographic information and earnings data related to the Social Security program for 4,348,254 individuals. Although the file contains limited data fields, it is much larger than other public-use files with earnings histories. EPUF will provide policymakers and researchers with a unique tool to evaluate the Social Security programs and potential reforms.
|Year||All workers||Men: number||Men: percentage of workers||Women: number||Women: percentage of workers||Sex unknown: number||Sex unknown: percentage of workers|
|SOURCE: Author's calculations based on the 2006 EPUF.|
|NOTE: Rounded components of percentage distributions do not necessarily sum to 100.|
|a. Less than 0.05 percent|
|Year||Total Social Security taxable earnings ($)||Men: dollar amount||Men: percentage of earnings||Women: dollar amount||Women: percentage of earnings||Sex unknown: dollar amount||Sex unknown: percentage of earnings|
|SOURCE: Author's calculations based on the 2006 EPUF.|
|NOTE: Rounded components of percentage distributions do not necessarily sum to 100.|
|a. Less than 0.05 percent|
Chart A1. Number of individuals in EPUF (in thousands) with capped Social Security taxable earnings during 1937–1950, by year of birth and sex
1 The MEF contains all of the earnings data collected to administer the Social Security programs.
2 Noncovered earnings are wage and salary income not covered under the Social Security programs.
3 For a discussion of SSA earnings data, see Olsen and Hudson (2009).
4 This limitation is discussed later in the article.
5 For historical changes in coverage, see SSA (2009, Table 2.A1).
6 SSA's Office of Research, Evaluation, and Statistics uses this measure to generate its published estimates of earnings.
7 Technically, this is not always correct because some earnings are reported on the Earnings Suspense File and not posted on the MEF. For a detailed discussion, see GAO (2005).
8 The average wage index is calculated annually using wages subject to federal income taxes and contributions to deferred compensation plans. The index is used in determining an individual's retirement benefit amount as well as to determine several other key dollar amounts in the administration of the Social Security programs. For more detail, see SSA (2010).
9 This process is done because of the prohibitive costs associated with going back to the microfilm to determine the exact number of QCs earned by individuals with earnings during the 1937–1950 period.
10 For individuals with earnings during this period who did not meet program criteria for benefits or coverage (using this technique to estimate QCs), a detailed manual search of microfilm records determines if the individual was eligible for benefits and, if so, the benefit amount.
11 Including these flags would have created serious data disclosure problems because they provide much more individually identifiable information.
12 For a detailed discussion of deferred earnings in SSA data, see Pattison and Waldron (2008).
13 For a description of the three components of the SSN (area, group, and serial number), see Puckett (2009).
14 Nonoverlapping samples are important from a data disclosure perspective if SSA decides to release any additional public-use data files.
15 The design effect is equal to the ratio of the variance of the systematic random sample for EPUF to the variance assuming a simple random sample without replacement.
16 The Numident is a master file of all SSNs ever assigned. It contains the identifying information given when an individual applies for an SSN.
17 This includes 319 individuals who were ultimately removed from the underlying EPUF sample because they were also in the New Beneficiary Data System (discussed in the data disclosure section of the article).
18 The source for YOB data in EPUF is the MEF summary record, which may not contain the same value that appears in the Numident or Master Beneficiary Record files.
19 See the text box for a brief description of the other public-use data files that contain earnings data from Social Security administrative files. To evaluate the disclosure risk for individuals in EPUF who are included in other publicly available data files, SSA considers four key points: the potential magnitude of the overlap between files, the possibility of matching records across files with any certainty, the additional information that would be revealed in the unlikely event that records could be matched with any certainty, and the ability to reidentify someone in EPUF based on publicly available data.
20 Thus, the total number of individuals removed from the underlying EPUF sample because of data cleaning and data disclosure is 28,770.
21 The SSB is a set of files containing individual-level data synthesized from the Census Bureau's Survey of Income and Program Participation (SIPP) results linked to various Social Security administrative files. The Census Bureau produces the SSB, which is the result of an interagency project that also includes SSA and IRS.
22 Under random rounding, a multiple of the rounding base will not change, while a number that is not a multiple of the base will round to either of the two closest multiples of the base. For example, when random-rounding to a base of $25, the value $550 will not change. However, a value of $562 may round to either $550 or $575. The random-rounding process provides some uncertainty about the actual number reported on the individual's SSA earnings record. For example, if the earnings contained in EPUF are $550 we know the actual amount reported to SSA was between $526 and $574. The interval of uncertainty increases with the amount of earnings reported.
23 Unless otherwise noted, the numbers of records and the amounts of earnings shown in the charts and tables are unweighted.
24 Additionally, in many years, the percentage of individuals with earnings at or above the taxable maximum differs substantially by sex.
25 SSA cannot determine married-couple or parent-child relationships in the file based on the information derived from the MEF. SSA establishes such linkages after an individual applies for benefits. In any event, linking currently or previously married individuals or indicating a familial relationship in EPUF would create serious data disclosure risks.
26 An electronic folder (created when an individual applies for benefits) contains the certified earnings record, which summarizes all the earnings records from the MEF and provides the basis for computing an individual's benefits.
27 Enumeration is the process by which SSA assigns a unique SSN for every person in order to create a work and benefit record for the Social Security program. SSA verifies all of the information on the SSN application.
28 Earnings that cannot be properly assigned to an individual's earnings records on the MEF are placed on the Earnings Suspense File. The amount of earnings assigned to the Earnings Suspense File has grown dramatically over the past 20 years (GAO 2005).
29 In March 2011, budget constraints led the SSA to suspend the production and mailing of printed statements. The agency is working toward developing an online alternative.
30 This chart omits individuals whose sex is unknown. Appendix Table A2 shows distributions by sex, including individuals of unknown sex.
31 Recall that any earnings reported before the individual was 15 years old were assigned a value of zero for data disclosure reasons.
32 Appendix Tables A1–A2 present the data underlying Charts 8–9.
[Board of Trustees] Board of Trustees of the Old-Age, Survivors, and Disability Insurance Trust Funds. 2010. Annual Report of the Board of Trustees of the Old-Age, Survivors, and Disability Insurance Trust Funds, 2010. Washington, DC: SSA.
[GAO] Government Accountability Office. 2005. Better Coordination Among Federal Agencies Could Reduce Unidentified Earnings Reports. Report no. GAO-05-154. Washington, DC: Government Printing Office.
Olsen, Anya, and Russell Hudson. 2009. "Social Security Administration's Master Earnings File: Background Information." Social Security Bulletin 69(3): 29–45.
Pattison, David, and Hilary Waldron. 2008. "Trends in Elective Deferrals of Earnings from 1990–2001 in Social Security Administrative Data." Research and Statistics Note No. 2008-03. Washington, DC: SSA.
Puckett, Carolyn. 2009. "The Story of the Social Security Number." Social Security Bulletin 69(2): 55–74.
[SSA] Social Security Administration. 2009. Annual Statistical Supplement to the Social Security Bulletin, 2008. Washington, DC: SSA.
———. 2010. "Automatic Increases: National Average Wage Index." http://www.socialsecurity.gov/OACT/COLA/AWI.html.