Parsing of lottery websites

  • Parses scratchoff lottery ticket data from state websites.
  • Writes game and remaining prize info to stdout as json.
  • Writes errors to stderr.

$ python3 -m lottery_data_scraper.louisiana 2> /tmp/louisiana.log | jq | head -n 50
[
  {
    "state": "la",
    "game_id": "1450",
    "prizes": [
      {
        "prize": "$200,000",
        "available": 2,
        "value": 200000,
        "claimed": 3
      },
# ...
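
Because the output is plain JSON, it is easy to post-process. The following is a minimal sketch, not part of the package: summarize_remaining is a hypothetical helper that assumes only the fields shown in the sample output above (state, game_id, and prizes with available and value).

```python
def summarize_remaining(games):
    """Return one summary dict per game from the scraper's JSON output."""
    summaries = []
    for game in games:
        prizes = game.get("prizes", [])
        summaries.append(
            {
                "state": game.get("state"),
                "game_id": game.get("game_id"),
                # Number of unclaimed prizes across every prize tier.
                "prizes_available": sum(p.get("available", 0) for p in prizes),
                # Dollar value still unclaimed: value * available per tier.
                "value_available": sum(
                    p.get("value", 0) * p.get("available", 0) for p in prizes
                ),
            }
        )
    return summaries


# Sample record copied from the Louisiana output above (first prize tier only).
sample = [
    {
        "state": "la",
        "game_id": "1450",
        "prizes": [
            {"prize": "$200,000", "available": 2, "value": 200000, "claimed": 3}
        ],
    }
]
print(summarize_remaining(sample))
```

To use it on real output, you could load the scraper's stdout with json.load(sys.stdin) and pass the resulting list to the function.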

Demo

The following commands clone the repository, create a virtualenv, and install the package. The last command then makes a series of requests to the Pennsylvania lottery website, parses the tables of games and prizes, and prints a JSON structure of all of the games to your terminal.

git clone https://github.com/owogawc/lottery_data_scraper
cd lottery_data_scraper
python3 -m venv ~/.virtualenvs/lottery_data_scraper
. ~/.virtualenvs/lottery_data_scraper/bin/activate
pip3 install -e .

PY_LOG_LVL=DEBUG USE_CACHE=true python3 -m lottery_data_scraper.pennsylvania 

If you have jq installed, you can get some formatted output by piping it to jq (and redirecting STDERR to /dev/null).

PY_LOG_LVL=DEBUG USE_CACHE=true python3 -m lottery_data_scraper.pennsylvania 2> /dev/null | jq

Or, if you want to capture error logs:

python3 -m lottery_data_scraper.louisiana 2> /tmp/louisiana.log | jq
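
The same stdout/stderr split can be used from Python. This is a sketch under assumptions: run_scraper is a hypothetical helper, and the demo command is a stand-in child process that mimics a scraper by printing JSON to stdout and a log line to stderr.

```python
import json
import subprocess
import sys


def run_scraper(command):
    """Run a command and return (parsed stdout JSON, stderr text)."""
    completed = subprocess.run(command, capture_output=True, text=True, check=False)
    # An empty stdout (e.g. the scraper failed early) is treated as "no games".
    games = json.loads(completed.stdout) if completed.stdout.strip() else []
    return games, completed.stderr


# Stand-in child process for illustration. To run a real scraper, pass
# [sys.executable, "-m", "lottery_data_scraper.louisiana"] instead.
demo = [
    sys.executable,
    "-c",
    "import sys; print('[{\"state\": \"la\"}]'); print('warning', file=sys.stderr)",
]
games, logs = run_scraper(demo)
print(games)
print(logs.strip())
```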

Data models

We're using marshmallow to validate and serialize data.

I'm including the schemas here just so you can quickly get a general idea of what data fields we're able to scrape from most lottery websites. What you see in this README might not be up-to-date with what's in schemas.py.

As of 2023-04-07 the schemas are a work-in-progress. The remaining TODO is to determine and specify which fields are absolutely required and which are optional.

Game Schema

class GameSchema(Schema):
    class Meta:
        render_module = json

    id = fields.Integer()
    created_at = fields.DateTime(load_default=datetime.utcnow)
    game_id = fields.Str()
    name = fields.Str()
    description = fields.Str()
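    image_urls = fields.Function(
        # serialize receives the whole object; image_urls is stored as a JSON string.
        lambda x: json.loads(x.image_urls) if x.image_urls else [],
        # deserialize receives the raw field value (the list), not the object.
        deserialize=lambda urls: None if urls == [] else json.dumps(urls),
    )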
    how_to_play = fields.Str()
    num_tx_initial = fields.Integer()
    price = fields.Number()
    prizes = fields.Nested(PrizeSchema, many=True)
    state = fields.Str()
    updated_at = fields.DateTime()
    url = fields.Str()

Prize Schema

class PrizeSchema(Schema):
    class Meta:
        render_module = json

    id = fields.Integer()
    game_id = fields.Integer()
    available = fields.Integer()
    claimed = fields.Integer()
    created_at = fields.DateTime(load_default=datetime.utcnow)
    value = fields.Number()
    prize = fields.Str()

Tests

Testing is tricky because you can't rely on Python and its requests library alone. Some states use scrape protection that requires you to actually run JavaScript. Others use stronger protection that requires a real display: they check for a rendering context that doesn't exist when you run a headless browser in Selenium. To scrape those sites, you have to run an X virtual framebuffer (Xvfb). Testing in these cases isn't as simple as running python3 -m unittest discover.
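
One way to supply a virtual display is the xvfb-run wrapper that ships with Xvfb on many Linux distributions. The sketch below is an assumption about how that could be wired up, not how this project does it: with_virtual_display is a hypothetical helper that only builds the command line, and xvfb-run must be installed separately.

```python
import os
import sys


def with_virtual_display(argv, env=None):
    """Prefix argv with xvfb-run when no X display is configured."""
    env = os.environ if env is None else env
    if env.get("DISPLAY"):
        # A display already exists, so run the command unchanged.
        return list(argv)
    # -a asks xvfb-run to pick a free X server number automatically.
    return ["xvfb-run", "-a"] + list(argv)


test_command = [sys.executable, "-m", "unittest", "discover"]
print(with_virtual_display(test_command, env={}))
```

The resulting list can be passed to subprocess.run to execute the tests under the virtual framebuffer.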

Contributing

git clone https://github.com/owogawc/lottery_data_scraper
cd lottery_data_scraper
python3 -m venv ~/.virtualenvs/lottery_data_scraper
. ~/.virtualenvs/lottery_data_scraper/bin/activate
pip3 install -e .

Then you should be able to run make test and see the tests pass.