The Hospital Chargemasters dataset includes downloaded and parsed tab-delimited files for over 100 U.S. hospitals. For a writeup, see here.
The datasets are provided via the GitHub repository:
git clone https://www.github.com/vsoch/hospital-chargemaster
wget https://www.github.com/vsoch/hospital-chargemaster/archive/0.0.1.zip
wget https://www.github.com/vsoch/hospital-chargemaster/archive/0.0.1.tar.gz
This is the hospital chargemaster Dinosaur Dataset.
As of January 1, 2019, hospitals are required to share their price lists. However, the released data is generally not formatted for human consumption. To make the data more readily available in a single place, and to update it annually (hopefully with community contribution!), I’ve created this repository.
We have compiled a list of hospitals and links in the hospitals.tsv file, generated via the 0.get_hospitals.py script. The file includes the following variables, separated by tabs:
-
This represents the original set of hospitals obtained from a compiled list, and it is kept for the record. Notably, some of the entries are hospital systems that may include more than one hospital.
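If you just want to work with this list, a minimal sketch for loading it could look like the following (this assumes pandas is installed and the repository has been cloned into the working directory; the path below is otherwise an assumption):

```python
# Minimal sketch: load the compiled hospital list with pandas.
# Assumes a local clone of the repository in the current working directory.
import pandas as pd

hospitals = pd.read_csv("hospital-chargemaster/hospitals.tsv", sep="\t")
print(hospitals.shape)              # number of hospitals and columns
print(hospitals.columns.tolist())   # the tab-separated variables
print(hospitals.head())
```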
Each hospital has records kept in a subfolder in the data folder. Specifically, each subfolder is named according to the hospital name (made all lowercase, with spaces replaced with -). Within each folder, you will find:

- scrape.py: a script to scrape the data
- browser.py: if we need to interact with a browser, we use selenium to do this
- latest: a folder with the last scraped (latest) data files
- data-latest* and data-<year>.tsv files that reflect a best effort parsing into a standard format
- YYYY-MM-DD folders, where each folder includes:
  - records.json: the complete list of records scraped for a particular date
  - *.csv, *.xlsx, or *.json: the scraped data files

The first iteration was run locally (to test the scraping). One significantly different scraper is the oshpd-ca folder, which includes over 795 hospitals! Way to go, California! However, this means there is more than one latest file, so that each file stays under 100MB and is allowed on GitHub. Additionally, avent-health provides (xml) charge lists for a ton of states.
The code in the scrape.py files (and browser.py) is intentionally redundant, so that each folder is a modular, self-contained solution to retrieve the data. If you are interested in just one hospital, you can use its folder on its own. The one exception is the browser (Chrome) driver, which is shared in the drivers folder at the root of the repository.
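For reference, the general selenium pattern looks roughly like the sketch below; this is not the repository's actual browser.py, and the headless option, driver setup, and URL are all assumptions:

```python
# Rough sketch of the selenium pattern used when a chargemaster page must be
# rendered in a browser before it can be scraped. Details are assumptions.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")            # run without a display
driver = webdriver.Chrome(options=options)    # assumes chromedriver is on PATH
driver.get("https://example-hospital.org/chargemaster")  # hypothetical URL
html = driver.page_source                     # rendered page, ready to parse
driver.quit()
```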
This is likely one of the hardest steps. I wanted to see the extent to which I could create a simple parser that would generate a single TSV (tab separated values) file per hospital, with minimally an identifier for each charge and a price in dollars, plus a description and code when provided.
Each of these parsers is also in the hospital subfolder, named parse.py. The parser outputs a data-latest.tsv file at the top level of the folder, along with a dated (by year) data-<year>.tsv. At some point I realized that there were different kinds of charges, including inpatient, outpatient, DRG (diagnostic related group), and others called “standard” or “average,” so I went back and added an additional column to the data to capture the charge type.
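As an illustration only (the standard column names below, such as charge_code and charge_type, and the raw column names are assumptions for demonstration, not the repository's documented schema), a parsing step might look like:

```python
# Illustrative sketch: convert a hypothetical scraped file with
# hospital-specific columns into a standard tab-delimited format.
import pandas as pd

# Hypothetical scraped data with hospital-specific column names
raw = pd.DataFrame({
    "CDM NUMBER": ["1001", "1002"],
    "DESCRIPTION": ["ROOM AND BOARD", "MRI BRAIN W/O CONTRAST"],
    "PRICE": ["$1,250.00", "$3,400.00"],
})

parsed = pd.DataFrame({
    "charge_code": raw["CDM NUMBER"],
    "description": raw["DESCRIPTION"],
    # strip currency formatting so the price is a plain number in dollars
    "price": raw["PRICE"].str.replace("[$,]", "", regex=True).astype(float),
    "charge_type": "standard",
})
parsed.to_csv("data-latest.tsv", sep="\t", index=False)
```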
This is publicly available data, provided with the good intention that transparency is important. The authors make no guarantees about the data and are not liable for how you might use it. If you find a problem, you are encouraged to help fix it by opening an issue. If you would like to open a pull request to add missing data or fix an issue, it would be greatly appreciated! My original work was optimized for efficiency, so I haven’t (yet) gone back to fix all the tiny details, knowing that the community could come in to contribute and help.
This would likely need to be done on a yearly basis; it is unlikely that the hospitals will go out of their way to update the documents any more frequently than they are required to. The automation for this is also under development.
The original dataset was obtained from an article that listed the top 115 US hospitals, but this isn’t to say that other hospitals aren’t important and deserving of a place here! If you want to add a hospital:
- Create a subfolder in the data folder named by the hospital_uri from the hospitals.tsv file.
- Add a scrape.py script in the folder. You can use others as templates, but the file should generate an output directory named with the present date and recursively copy the new folder to be latest (a minimal template sketch is shown below).
- Write a parse.py file to generate the latest-* data frames (you can use other folders as starting templates).

The data will be updated on an annual basis, or when a pull request is issued to update the repository. Upon merge, the generated latest data will be pushed back to the repository.
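For contributors, a minimal template sketch of a scrape.py following this convention might look like the following; the URL, filename, and use of requests are placeholders and assumptions, not the repository's actual code:

```python
# Minimal template sketch of a scrape.py: download into a dated folder,
# then recursively copy that folder to "latest". URL and filename are
# placeholders, not a real hospital.
import os
import shutil
from datetime import datetime

import requests  # assumed available

here = os.path.dirname(os.path.abspath(__file__))
today = datetime.now().strftime("%Y-%m-%d")
outdir = os.path.join(here, today)
os.makedirs(outdir, exist_ok=True)

url = "https://example-hospital.org/chargemaster.csv"  # placeholder URL
response = requests.get(url)
with open(os.path.join(outdir, "chargemaster.csv"), "wb") as fh:
    fh.write(response.content)

# Recursively copy the dated folder to be the new "latest"
latest = os.path.join(here, "latest")
if os.path.exists(latest):
    shutil.rmtree(latest)
shutil.copytree(outdir, latest)
```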
If you have other questions, or want help for your project, please don’t hesitate to open an issue. If you use any of the datasets in your work, please remember to include the DOI.