The Hospital Chargemasters dataset includes downloaded and parsed tab-delimited files for over 100 U.S. hospitals. For a writeup, see here.
The datasets are provided via the GitHub repository:
git clone https://www.github.com/vsoch/hospital-chargemaster
wget https://www.github.com/vsoch/hospital-chargemaster/archive/0.0.1.zip
wget https://www.github.com/vsoch/hospital-chargemaster/archive/0.0.1.tar.gz
This is the hospital chargemaster Dinosaur Dataset.
As of January 1, 2019, hospitals are required to share their price lists. However, the released data is generally not formatted for human consumption. To make the data more readily available in a single place, and to update it annually (hopefully with community contribution!), I’ve created this repository.
We have compiled a list of hospitals and links in the hospitals.tsv file, generated via the 0.get_hospitals.py script. The file includes the following variables, separated by tabs:
-
This represents the original set of hospitals obtained from a compiled list, and it is kept for the record. Notably, some of the entries are hospital systems that may include more than one hospital.
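If you just want to work with this list, a minimal sketch for loading it could look like the following (this assumes pandas is installed and the repository has been cloned into the working directory; the path below is otherwise an assumption):

```python
# Minimal sketch: load the compiled hospital list with pandas.
# Assumes a local clone of the repository in the current working directory.
import pandas as pd

hospitals = pd.read_csv("hospital-chargemaster/hospitals.tsv", sep="\t")
print(hospitals.shape)              # number of hospitals and columns
print(hospitals.columns.tolist())   # the tab-separated variables
print(hospitals.head())
```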
Each hospital has records kept in a subfolder in the data folder. Specifically, each subfolder is named according to the hospital name (made all lowercase, with spaces replaced with -). Within each folder, you will find:

- scrape.py: a script to scrape the data
- browser.py: if we need to interact with a browser, we use selenium to do this
- latest: a folder with the last scraped (latest) data files
- data-latest* and data-<year>.tsv files that reflect a best effort parsing into a standard format
- YYYY-MM-DD folders, where each folder includes:
  - records.json: the complete list of records scraped for a particular date
  - *.csv, *.xlsx, or *.json: the scraped data files

The first iteration was run locally (to test the scraping). One significantly different scraper is the oshpd-ca folder, which includes over 795 hospitals! Way to go, California! However, this means there is more than one latest file, so that each file stays under 100MB and is allowed on GitHub. Additionally, avent-health provides (xml) charge lists for a ton of states.
The code in the scrape.py files (and browser.py) is intentionally redundant, so that each folder is a modular, self-contained solution to retrieve the data. If you are interested in just one hospital, you can use its folder on its own. The one exception is the browser (Chrome) driver, which is shared in the drivers folder at the root of the repository.
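For reference, the general selenium pattern looks roughly like the sketch below; this is not the repository's actual browser.py, and the headless option, driver setup, and URL are all assumptions:

```python
# Rough sketch of the selenium pattern used when a chargemaster page must be
# rendered in a browser before it can be scraped. Details are assumptions.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")            # run without a display
driver = webdriver.Chrome(options=options)    # assumes chromedriver is on PATH
driver.get("https://example-hospital.org/chargemaster")  # hypothetical URL
html = driver.page_source                     # rendered page, ready to parse
driver.quit()
```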
This is likely one of the hardest steps. I wanted to see the extent to which I could create a simple parser that would generate a single TSV (tab separated values) file per hospital, with minimally an identifier for each charge and a price in dollars, plus a description and code when provided.
Each of these parsers is also in the hospital subfolder, named parse.py. The parser outputs a data-latest.tsv file at the top level of the folder, along with a dated (by year) data-<year>.tsv. At some point I realized that there were different kinds of charges, including inpatient, outpatient, DRG (diagnostic related group), and others called “standard” or “average,” so I went back and added an additional column to the data to capture the charge type.
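As an illustration only (the standard column names below, such as charge_code and charge_type, and the raw column names are assumptions for demonstration, not the repository's documented schema), a parsing step might look like:

```python
# Illustrative sketch: convert a hypothetical scraped file with
# hospital-specific columns into a standard tab-delimited format.
import pandas as pd

# Hypothetical scraped data with hospital-specific column names
raw = pd.DataFrame({
    "CDM NUMBER": ["1001", "1002"],
    "DESCRIPTION": ["ROOM AND BOARD", "MRI BRAIN W/O CONTRAST"],
    "PRICE": ["$1,250.00", "$3,400.00"],
})

parsed = pd.DataFrame({
    "charge_code": raw["CDM NUMBER"],
    "description": raw["DESCRIPTION"],
    # strip currency formatting so the price is a plain number in dollars
    "price": raw["PRICE"].str.replace("[$,]", "", regex=True).astype(float),
    "charge_type": "standard",
})
parsed.to_csv("data-latest.tsv", sep="\t", index=False)
```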
This is publicly available data, provided with the good intention that transparency is important. The authors make no guarantees about the data and are not liable for how you might use it. If you find a problem, you are encouraged to help fix it by opening an issue. If you would like to open a pull request to add missing data or fix an issue, it would be greatly appreciated! My original work was optimized for efficiency, so I haven’t (yet) gone back to fix all the tiny details, knowing that the community could come in to contribute and help.
This would likely need to be done on a yearly basis; it is unlikely that the hospitals will go out of their way to update the documents any more frequently than they are required to. The automation for this is also under development.
The original dataset was obtained from an article that listed the top 115 US hospitals, but this isn’t to say that other hospitals aren’t important and deserving of a place here! If you want to add a hospital:
- Create a subfolder in the data folder named by the hospital_uri from the hospitals.tsv file.
- Add a scrape.py script in the folder. You can use others as templates, but the file should generate an output directory named with the present date and recursively copy the new folder to be latest (a minimal template sketch is shown below).
- Write a parse.py file to generate the latest-* data frames (you can use other folders as starting templates).

The data will be updated on an annual basis, or when a pull request is issued to update the repository. Upon merge, the generated latest data will be pushed back to the repository.
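For contributors, a minimal template sketch of a scrape.py following this convention might look like the following; the URL, filename, and use of requests are placeholders and assumptions, not the repository's actual code:

```python
# Minimal template sketch of a scrape.py: download into a dated folder,
# then recursively copy that folder to "latest". URL and filename are
# placeholders, not a real hospital.
import os
import shutil
from datetime import datetime

import requests  # assumed available

here = os.path.dirname(os.path.abspath(__file__))
today = datetime.now().strftime("%Y-%m-%d")
outdir = os.path.join(here, today)
os.makedirs(outdir, exist_ok=True)

url = "https://example-hospital.org/chargemaster.csv"  # placeholder URL
response = requests.get(url)
with open(os.path.join(outdir, "chargemaster.csv"), "wb") as fh:
    fh.write(response.content)

# Recursively copy the dated folder to be the new "latest"
latest = os.path.join(here, "latest")
if os.path.exists(latest):
    shutil.rmtree(latest)
shutil.copytree(outdir, latest)
```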
If you have other questions, or want help for your project, please don’t hesitate to open an issue. If you use any of the datasets in your work, please remember to include the DOI.