# Parsing of lottery websites - Parses scratchoff lottery ticket data from state websites. - Writes game and remaining prize info to stdout as json. - Writes errors to stderr. ``` text ➜ lottery_data_scraper git:(main) ✗ python3 -m lottery_data_scraper.louisiana 2> /tmp/louisiana.log | jq | head -n 50 [ { "state": "la", "game_id": "1450", "prizes": [ { "prize": "$200,000", "available": 2, "value": 200000, "claimed": 3 }, # ... ``` ## Demo The following script should put you in a state where the last line will make a bunch of requests to the Pennsylvania lottery website, parse the tables of games/prizes, and print to your terminal a JSON structure of all of the games. ``` sh git clone https://github.com/owogawc/lottery_data_scraper cd lottery_data_scraper python3 -m venv ~/.virtualenvs/lottery_data_scraper . ~/.virtualenvs/lottery_data_scraper pip3 install -e . PY_LOG_LVL=DEBUG USE_CACHE=true python3 -m lottery_data_scraper.pennsylvania ``` If you have [jq](https://stedolan.github.io/jq/) installed, you can get some formatted output by piping it to `jq` (and redirecting STDERR to /dev/null). ``` sh PY_LOG_LVL=DEBUG USE_CACHE=true python3 -m lottery_data_scraper.pennsylvania 2> /dev/null | jq ``` Or, if you want to capture error logs: ``` sh python3 -m lottery_data_scraper.louisiana 2> /tmp/louisiana.log | jq ``` ## Data models We're using [`marshmallow`](https://marshmallow.readthedocs.io/en/stable/index.html) to validate and serialize data. I'm including the schemas here just so you can quickly get a general idea of what data fields we're able to scrape from most lottery websites. What you see in this README might not be up-to-date with what's in [schemas.py](./lottery_data_scraper/schemas.py). As of 2023-04-07 the schemas are a work-in-progress. The remaining TODO is to determine and specify which fields are absolutely required and which are optional. ### Game Schema ``` python class GameSchema(Schema): class Meta: render_module = json id = fields.Integer() created_at = fields.DateTime(load_default=datetime.utcnow) game_id = fields.Str() name = fields.Str() description = fields.Str() image_urls = fields.Function( lambda x: json.loads(x.image_urls) if x.image_urls else [], deserialize=lambda x: None if x.image_urls == [] else json.dumps(x.image_urls), ) how_to_play = fields.Str() num_tx_initial = fields.Integer() price = fields.Number() prizes = fields.Nested(PrizeSchema, many=True) state = fields.Str() updated_at = fields.DateTime() url = fields.Str() ``` ### Prize Schema ``` python class PrizeSchema(Schema): class Meta: render_module = json id = fields.Integer() game_id = fields.Integer() available = fields.Integer() claimed = fields.Integer() created_at = fields.DateTime(load_default=datetime.utcnow) value = fields.Number() prize = fields.Str() ``` # Tests Testing is kind of tricky because you can't rely on _just_ python with its `requests` library. Some states have some scrape protections that require you actually run JavaScript. Some states have extreme scrape protection that require you to actually run a _display_. They check for some rendering context that doesn't exist when you run a headless browser in Selenium. To scrape those sites, you actually have to run a [X virtual framebuffer](https://en.wikipedia.org/wiki/Xvfb). Testing in these cases isn't as simple as running `python3 -m unittest discover`. # Contributing ``` sh git clone https://github.com/owogawc/lottery_data_scraper cd lottery_data_scraper python3 -m venv ~/.virtualenvs/lottery_data_scraper . ~/.virtualenvs/lottery_data_scraper pip3 install -e . ``` Then you should be able to run `make test` and see the tests pass.