1 change: 1 addition & 0 deletions README.md
@@ -7,3 +7,4 @@ The licences are included in the respective dataset folders as well.
2. [nslm](https://github.com/grf-labs/grf/tree/master/experiments/acic18): CC0 1.0 Universal. Last downloaded on 2026-03-04.
3. [Tuebingen-pair-wise-dataset](https://webdav.tuebingen.mpg.de/cause-effect/): Last downloaded on 2026-03-02.
4. [Twins-datasets](http://www.nber.org/data/linked-birth-infant-death-data-vital-statistics-data.html)
5. [angrist-krueger-cps](https://economics.mit.edu/people/faculty/josh-angrist/angrist-data-archive): CC0 1.0 Universal. Last downloaded on 2026-03-02.
121 changes: 121 additions & 0 deletions angrist-krueger-cps/LICENSE
@@ -0,0 +1,121 @@
Creative Commons Legal Code

CC0 1.0 Universal

CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN
ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS
INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES
REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS
PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM
THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED
HEREUNDER.

Statement of Purpose

The laws of most jurisdictions throughout the world automatically confer
exclusive Copyright and Related Rights (defined below) upon the creator
and subsequent owner(s) (each and all, an "owner") of an original work of
authorship and/or a database (each, a "Work").

Certain owners wish to permanently relinquish those rights to a Work for
the purpose of contributing to a commons of creative, cultural and
scientific works ("Commons") that the public can reliably and without fear
of later claims of infringement build upon, modify, incorporate in other
works, reuse and redistribute as freely as possible in any form whatsoever
and for any purposes, including without limitation commercial purposes.
These owners may contribute to the Commons to promote the ideal of a free
culture and the further production of creative, cultural and scientific
works, or to gain reputation or greater distribution for their Work in
part through the use and efforts of others.

For these and/or other purposes and motivations, and without any
expectation of additional consideration or compensation, the person
associating CC0 with a Work (the "Affirmer"), to the extent that he or she
is an owner of Copyright and Related Rights in the Work, voluntarily
elects to apply CC0 to the Work and publicly distribute the Work under its
terms, with knowledge of his or her Copyright and Related Rights in the
Work and the meaning and intended legal effect of CC0 on those rights.

1. Copyright and Related Rights. A Work made available under CC0 may be
protected by copyright and related or neighboring rights ("Copyright and
Related Rights"). Copyright and Related Rights include, but are not
limited to, the following:

i. the right to reproduce, adapt, distribute, perform, display,
communicate, and translate a Work;
ii. moral rights retained by the original author(s) and/or performer(s);
iii. publicity and privacy rights pertaining to a person's image or
likeness depicted in a Work;
iv. rights protecting against unfair competition in regards to a Work,
subject to the limitations in paragraph 4(a), below;
v. rights protecting the extraction, dissemination, use and reuse of data
in a Work;
vi. database rights (such as those arising under Directive 96/9/EC of the
European Parliament and of the Council of 11 March 1996 on the legal
protection of databases, and under any national implementation
thereof, including any amended or successor version of such
directive); and
vii. other similar, equivalent or corresponding rights throughout the
world based on applicable law or treaty, and any national
implementations thereof.

2. Waiver. To the greatest extent permitted by, but not in contravention
of, applicable law, Affirmer hereby overtly, fully, permanently,
irrevocably and unconditionally waives, abandons, and surrenders all of
Affirmer's Copyright and Related Rights and associated claims and causes
of action, whether now known or unknown (including existing as well as
future claims and causes of action), in the Work (i) in all territories
worldwide, (ii) for the maximum duration provided by applicable law or
treaty (including future time extensions), (iii) in any current or future
medium and for any number of copies, and (iv) for any purpose whatsoever,
including without limitation commercial, advertising or promotional
purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each
member of the public at large and to the detriment of Affirmer's heirs and
successors, fully intending that such Waiver shall not be subject to
revocation, rescission, cancellation, termination, or any other legal or
equitable action to disrupt the quiet enjoyment of the Work by the public
as contemplated by Affirmer's express Statement of Purpose.

3. Public License Fallback. Should any part of the Waiver for any reason
be judged legally invalid or ineffective under applicable law, then the
Waiver shall be preserved to the maximum extent permitted taking into
account Affirmer's express Statement of Purpose. In addition, to the
extent the Waiver is so judged Affirmer hereby grants to each affected
person a royalty-free, non transferable, non sublicensable, non exclusive,
irrevocable and unconditional license to exercise Affirmer's Copyright and
Related Rights in the Work (i) in all territories worldwide, (ii) for the
maximum duration provided by applicable law or treaty (including future
time extensions), (iii) in any current or future medium and for any number
of copies, and (iv) for any purpose whatsoever, including without
limitation commercial, advertising or promotional purposes (the
"License"). The License shall be deemed effective as of the date CC0 was
applied by Affirmer to the Work. Should any part of the License for any
reason be judged legally invalid or ineffective under applicable law, such
partial invalidity or ineffectiveness shall not invalidate the remainder
of the License, and in such case Affirmer hereby affirms that he or she
will not (i) exercise any of his or her remaining Copyright and Related
Rights in the Work or (ii) assert any associated claims and causes of
action with respect to the Work, in either case contrary to Affirmer's
express Statement of Purpose.

4. Limitations and Disclaimers.

a. No trademark or patent rights held by Affirmer are waived, abandoned,
surrendered, licensed or otherwise affected by this document.
b. Affirmer offers the Work as-is and makes no representations or
warranties of any kind concerning the Work, express, implied,
statutory or otherwise, including without limitation warranties of
title, merchantability, fitness for a particular purpose, non
infringement, or the absence of latent or other defects, accuracy, or
the present or absence of errors, whether or not discoverable, all to
the greatest extent permissible under applicable law.
c. Affirmer disclaims responsibility for clearing rights of other persons
that may apply to the Work or any use thereof, including without
limitation any person's Copyright and Related Rights in the Work.
Further, Affirmer disclaims responsibility for obtaining any necessary
consents, permissions or other rights required for any use of the
Work.
d. Affirmer understands and acknowledges that Creative Commons is not a
party to this document and has no duty or obligation with respect to
this CC0 or use of the Work.
96 changes: 96 additions & 0 deletions angrist-krueger-cps/README.md
Member

How exactly was this file generated? Was it taken from somewhere, or was it LLM-generated?

Author

@Rasesh2005 Rasesh2005 Mar 12, 2026

The descriptions were LLM-generated. Many of them made sense and matched the descriptions at https://cps.ipums.org/cps-action/variables/{tag} directly; for example, `educ` (https://cps.ipums.org/cps-action/variables/educ) had a similar description, so I verified most of them this way. Some tags could not be found under the same name, so for those I had to assume that what the LLM gave was correct. I have a list of verified and unverified tags and can send it if you want. In my view, these two tags were the only suspicious ones in the list.

@@ -0,0 +1,96 @@
# Angrist-Krueger-CPS Dataset

This dataset is an extract of the CPS data containing 30,967 observations on men born 1944-53 from the 1979 and 1981-85 March CPS, matched to lottery number dummies for groups of 25 lottery numbers. There are 72 variables including all covariates. The raw files (`extract.dta` and `samplcps.do`) were replicated and processed into a ready-to-use tabular `.mixed.txt` format suitable for `pgmpy` consumption.

> The raw files (`extract.dta` and `samplcps.do`) were replicated and processed into a ready-to-use tabular `.mixed.txt` format suitable for `pgmpy` consumption.

Can you explain how this processing was done?

Author

`extract.dta` is a Stata file. I read it with pandas (`pd.read_stata`) and then applied the same preprocessing steps as in the `samplcps.do` file.
The initial shape of the dataframe was (30967, 72); after preprocessing and column selection, the final shape is (13993, 58).

Note that `samplcps.do` is simply a sample Stata program that analyzes the CPS data set.

Also, the website says:

> Follow the sample selection rules in the notes to the tables to reproduce the 25,781 observation working sample.

Should I instead do the preprocessing described in the paper, which produces this working sample? That would give a final dataframe shape of (25782, 75).

Also, should I document all the preprocessing I did in the README as well?


## Column Descriptions

The dataset contains 72 variables derived from CPS extracts used to estimate the causal return to schooling using Vietnam draft lottery instruments.

### Core Economic Variables
- educ: Years of completed education.
- annwage: Annual wage income.
- weeks: Number of weeks worked during the previous year.
- hrsly: Hours worked during the previous year.
- hrslw: Hours worked during the last week.
- wageflag: Indicator that wage information is valid/observed.

### Demographic Variables
- age: Age of the respondent.
- agesq: Age squared, used to model nonlinear age effects.
- age2: Alternative squared age variable used in some regressions.
Author

Here, agesq and age2

- race: Race category from CPS.
- black: Dummy variable indicating Black respondents.
- other: Dummy variable for race other than White or Black.
- marital: Marital status indicator.
- spsepres: Indicator for spouse present in the household.

### Education Variables
- higratt: Highest grade attended.
- higrcomp: Highest grade completed.
- educ: Years of completed schooling.
- college: Indicator for college education.
- someco: Indicator for some college attendance.

### Labor Market Variables
- esr: Employment status recode.
- esrflag: Indicator for valid employment status data.
- class: Class of worker (private, government, self-employed, etc.).
- ind: Industry classification code.
- occ: Occupation classification code.
- vet: Veteran status indicator.
- veteran: Recoded veteran status variable.

### Geographic Variables
- state: State code.
- division: Census division classification.
- smsa: Indicator for residence in a Standard Metropolitan Statistical Area.
- metcode: Metropolitan area code.
- city: Indicator for residence in a central city.
- balsmsa: Balanced SMSA classification.

### Regional Indicator Variables
These variables represent U.S. census regions used as regression controls.

- neweng: New England region indicator.
- midatl: Mid-Atlantic region indicator.
- eastnth: East North Central region indicator.
- westnth: West North Central region indicator.
- sthatl: South Atlantic region indicator.
- eaststh: East South Central region indicator.
- weststh: West South Central region indicator.
- mount: Mountain region indicator.
- pacific: Pacific region indicator.

### Birth Year Variables
The respondent’s year of birth and the dummy variables derived from it.

- yob: Year of birth.
- yob44–yob53: Indicator variables for birth years 1944 through 1953.

### Survey Year Variables
The CPS survey year and the dummy variables derived from it.

- year: CPS survey year.
- yr81: Indicator for survey year 1981.
- yr82: Indicator for survey year 1982.
- yr83: Indicator for survey year 1983.
- yr84: Indicator for survey year 1984.
- yr85: Indicator for survey year 1985.

### Draft Lottery Instrument Variables
These variables represent grouped Vietnam draft lottery numbers used as instruments for education.

- lott1–lott13: Lottery number group indicator variables.
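
As a rough illustration of how grouped lottery dummies of this kind can be built, the sketch below maps a draft lottery number to a group of 25 consecutive numbers and one-hot encodes the group. The raw lottery number is not described in this README, so the function names, the assumption that group `k` covers numbers `25(k-1)+1` through `25k`, and the treatment of numbers above 325 as an all-zero reference category are illustrative guesses, not the coding used in the replication files.

```python
GROUP_SIZE = 25  # README: dummies cover groups of 25 lottery numbers
N_GROUPS = 13    # README lists lott1 through lott13


def lottery_group(number):
    """Return the 1-based group index for a lottery number (assumed 1..366)."""
    if not 1 <= number <= 366:
        raise ValueError(f"lottery number out of range: {number}")
    return (number - 1) // GROUP_SIZE + 1


def lottery_dummies(number):
    """One-hot encode the group as lott1..lott13.

    Assumption: numbers above 13 * 25 = 325 form the omitted reference
    category, so all thirteen dummies are zero for them.
    """
    group = lottery_group(number)
    return {f"lott{k}": int(group == k) for k in range(1, N_GROUPS + 1)}


print(lottery_dummies(30))
```

Under these assumptions, a respondent with lottery number 30 falls in the second group, so only `lott2` is set.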

### Sampling and Administrative Variables
- marchwt: CPS March supplement sampling weight.
- recode: Observation identifier used in the replication dataset.

## Dataset Purpose

This dataset is used to estimate the causal effect of education on wages using instrumental variables derived from Vietnam draft lottery numbers. The lottery provides exogenous variation in schooling decisions among men born between 1944 and 1953.
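
To make the instrumental-variables idea concrete, here is a minimal sketch on synthetic data, not on this dataset. It assumes a single binary instrument `z` (the real data has thirteen lottery group dummies), an unobserved confounder `u`, and made-up coefficients; all names and numbers are hypothetical. The IV estimate is the ratio `cov(z, y) / cov(z, x)`, which is compared with the simple OLS slope.

```python
import random


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)


def iv_estimate(z, x, y):
    """Single-instrument IV estimate: cov(z, y) / cov(z, x)."""
    return cov(z, y) / cov(z, x)


def ols_estimate(x, y):
    """Simple-regression OLS slope: cov(x, y) / var(x)."""
    return cov(x, y) / cov(x, x)


def simulate(n=20000, true_return=0.1, seed=0):
    """Synthetic data: z shifts schooling x; u confounds x and log wage y."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = rng.randint(0, 1)        # hypothetical binary instrument
        ui = rng.gauss(0, 1)          # unobserved confounder
        xi = 12 + 2 * zi + ui         # years of schooling
        yi = true_return * xi + 0.5 * ui + rng.gauss(0, 0.1)  # log wage
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


z, x, y = simulate()
iv = iv_estimate(z, x, y)
ols = ols_estimate(x, y)
print(f"IV: {iv:.3f}  OLS: {ols:.3f}")
```

Because `u` raises both schooling and wages in this simulation, the OLS slope overstates the return, while the IV ratio stays close to the simulated value of 0.1. With several instruments, as in this dataset, two-stage least squares would be the usual generalization.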

## References
**Source Citation:**
Angrist, J. D., & Krueger, A. B. (1995). Split-Sample Instrumental Variables Estimates of the Return to Schooling. Journal of Business & Economic Statistics, 13(2), 225-235.
Data extracted from the [Angrist Data Archive](https://economics.mit.edu/people/faculty/josh-angrist/angrist-data-archive).