# Parsing of lottery websites

- Parses scratch-off lottery ticket data from state lottery websites.
- Writes game and remaining-prize info to stdout as JSON.
- Writes errors to stderr.

``` text
➜ lottery_data_scraper git:(main) ✗ python3 -m lottery_data_scraper.louisiana 2> /tmp/louisiana.log | jq | head -n 50
[
  {
    "state": "la",
    "game_id": "1450",
    "prizes": [
      {
        "prize": "$200,000",
        "available": 2,
        "value": 200000,
        "claimed": 3
      },
      # ...
```
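As a rough illustration of what you can do with this output, the sketch below
sums the value of the prizes still available for each game. It relies only on
the fields visible in the sample above (`state`, `game_id`, and each prize's
`available` and `value`). The function name `remaining_prize_value` and the
`$50` row in the sample data are made up for the example.

```python
import json


def remaining_prize_value(games):
    """Map (state, game_id) to the total value of prizes still available."""
    totals = {}
    for game in games:
        key = (game["state"], game["game_id"])
        totals[key] = sum(
            prize["available"] * prize["value"] for prize in game.get("prizes", [])
        )
    return totals


# Shaped like the scraper output; the $50 row is invented for illustration.
SAMPLE = """
[
  {
    "state": "la",
    "game_id": "1450",
    "prizes": [
      {"prize": "$200,000", "available": 2, "value": 200000, "claimed": 3},
      {"prize": "$50", "available": 10, "value": 50, "claimed": 40}
    ]
  }
]
"""

if __name__ == "__main__":
    print(remaining_prize_value(json.loads(SAMPLE)))
```

In practice you would replace `SAMPLE` with the scraper's stdout, for example
by reading `sys.stdin` at the end of a shell pipeline.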

## Demo

The following script clones the repository, installs the package into a virtual
environment, and runs the Pennsylvania scraper. The last line makes a bunch of
requests to the Pennsylvania lottery website, parses the tables of
games/prizes, and prints a JSON structure of all of the games to your terminal.

``` sh
git clone https://github.com/owogawc/lottery_data_scraper
cd lottery_data_scraper
# Create and activate a virtual environment, then install in editable mode.
python3 -m venv ~/.virtualenvs/lottery_data_scraper
. ~/.virtualenvs/lottery_data_scraper/bin/activate
pip3 install -e .

# Run the Pennsylvania scraper with debug logging and USE_CACHE turned on.
PY_LOG_LVL=DEBUG USE_CACHE=true python3 -m lottery_data_scraper.pennsylvania
```

If you have [jq](https://stedolan.github.io/jq/) installed, you can get
formatted output by piping stdout to `jq` and redirecting stderr to
`/dev/null`.

``` sh
PY_LOG_LVL=DEBUG USE_CACHE=true python3 -m lottery_data_scraper.pennsylvania 2> /dev/null | jq
```

Or, if you want to capture error logs in a file:

``` sh
python3 -m lottery_data_scraper.louisiana 2> /tmp/louisiana.log | jq
```
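If you'd rather drive a scraper from Python than from the shell, the same
stdout/stderr split can be sketched with `subprocess`. `run_and_parse` is a
hypothetical helper, not part of this package. It assumes the command prints a
single JSON document to stdout and sends its logs to stderr, as described
above.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_parse(cmd, log_path):
    """Run cmd, send its stderr to log_path, and return its stdout parsed as JSON."""
    with open(log_path, "w") as log:
        result = subprocess.run(
            cmd, stdout=subprocess.PIPE, stderr=log, text=True, check=True
        )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # Stand-in command so the sketch runs offline. For a real run, use e.g.
    # [sys.executable, "-m", "lottery_data_scraper.louisiana"].
    stand_in = [
        sys.executable,
        "-c",
        "import sys; print('[]'); print('some log line', file=sys.stderr)",
    ]
    log_path = Path(tempfile.mkdtemp()) / "scraper.log"
    print(run_and_parse(stand_in, log_path))
    print(log_path.read_text())
```

Because `check=True` is set, a non-zero exit status raises
`subprocess.CalledProcessError` instead of returning partial output.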

## Data models

We're using [`marshmallow`](https://marshmallow.readthedocs.io/en/stable/index.html) to validate and serialize data.

I'm including the schemas here so you can quickly get a general idea of which
data fields we're able to scrape from most lottery websites. What you see in
this README might not be up to date with what's in
[schemas.py](./lottery_data_scraper/schemas.py).

As of 2023-04-07, the schemas are a work in progress. The remaining TODO is to
determine and specify which fields are required and which are optional.

### Game Schema

``` python
import json
from datetime import datetime

from marshmallow import Schema, fields


# PrizeSchema is shown in the next section.
class GameSchema(Schema):
    class Meta:
        render_module = json

    id = fields.Integer()
    created_at = fields.DateTime(load_default=datetime.utcnow)
    game_id = fields.Str()
    name = fields.Str()
    description = fields.Str()
    # Stored as a JSON-encoded string; serialized as a list of URLs.
    image_urls = fields.Function(
        lambda x: json.loads(x.image_urls) if x.image_urls else [],
        # marshmallow passes the raw field value (not the object) to `deserialize`.
        deserialize=lambda x: None if x == [] else json.dumps(x),
    )
    how_to_play = fields.Str()
    num_tx_initial = fields.Integer()
    price = fields.Number()
    prizes = fields.Nested(PrizeSchema, many=True)
    state = fields.Str()
    updated_at = fields.DateTime()
    url = fields.Str()
```
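`image_urls` is the only field with custom logic. Judging by its two lambdas,
the URLs are kept as a JSON-encoded string internally and exposed as a list in
the serialized output. The stdlib-only sketch below mirrors that round trip
outside of marshmallow; `dump_image_urls` and `load_image_urls` are
hypothetical names, not functions from this package.

```python
import json


def dump_image_urls(stored):
    """Stored JSON string (or None) -> list of URLs for serialized output."""
    return json.loads(stored) if stored else []


def load_image_urls(urls):
    """List of URLs from input data -> JSON string for storage (None when empty)."""
    return None if urls == [] else json.dumps(urls)


if __name__ == "__main__":
    stored = load_image_urls(["https://example.com/ticket.png"])
    print(stored)
    print(dump_image_urls(stored))
    print(dump_image_urls(load_image_urls([])))
```

An empty list is stored as `None` rather than as the string `"[]"`, and both
forms come back out as an empty list.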

### Prize Schema

``` python
import json
from datetime import datetime

from marshmallow import Schema, fields


class PrizeSchema(Schema):
    class Meta:
        render_module = json

    id = fields.Integer()
    game_id = fields.Integer()
    available = fields.Integer()
    claimed = fields.Integer()
    created_at = fields.DateTime(load_default=datetime.utcnow)
    value = fields.Number()
    prize = fields.Str()
```
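In the sample output at the top of this README, each prize carries both a
display string (`prize`, e.g. `"$200,000"`) and a numeric `value` (`200000`).
How each scraper derives one from the other isn't shown here. A minimal sketch,
assuming plain dollar amounts with optional thousands separators, might look
like this; `parse_prize_value` is a hypothetical helper.

```python
import re
from decimal import Decimal


def parse_prize_value(prize):
    """Return the dollar amount in a string like "$200,000", or None if absent."""
    match = re.search(r"\$\s*(\d[\d,]*(?:\.\d+)?)", prize)
    if match is None:
        return None  # e.g. a non-cash prize with no dollar amount
    amount = Decimal(match.group(1).replace(",", ""))
    # Whole-dollar amounts become ints; anything else stays a float.
    return int(amount) if amount == amount.to_integral_value() else float(amount)


if __name__ == "__main__":
    for text in ("$200,000", "$2.50", "Free Ticket"):
        print(text, "->", parse_prize_value(text))
```

Real prize strings vary by state (annuities, merchandise, free tickets), so
anything beyond simple dollar amounts would need handling per scraper.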

## Tests

Testing is kind of tricky because you can't rely on _just_ Python with its
`requests` library. Some states have scrape protection that requires you to
actually run JavaScript. Some states have extreme scrape protection that
requires you to actually run a _display_: they check for a rendering context
that doesn't exist when you run a headless browser in Selenium. To scrape those
sites, you have to run an [X virtual
framebuffer](https://en.wikipedia.org/wiki/Xvfb). Testing in these cases isn't
as simple as running `python3 -m unittest discover`.
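One way to keep at least part of a test suite independent of the network and of
a browser is to test the parsing step against saved HTML. The sketch below is
not taken from this repository: the table layout, `PrizeTableParser`, and
`parse_prize_rows` are invented for illustration, and it uses only the standard
library's `html.parser`.

```python
import unittest
from html.parser import HTMLParser

# Inlined stand-in for a page saved from a lottery website (layout is invented).
FIXTURE = """
<table>
  <tr><th>Prize</th><th>Remaining</th></tr>
  <tr><td>$200,000</td><td>2</td></tr>
  <tr><td>$50</td><td>10</td></tr>
</table>
"""


class PrizeTableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def parse_prize_rows(html):
    """Return [(prize, remaining), ...] for each data row in the table."""
    parser = PrizeTableParser()
    parser.feed(html)
    # Header rows contain only <th> cells, so they end up empty and are skipped.
    return [(row[0], int(row[1])) for row in parser.rows if len(row) == 2]


class ParsePrizeRowsTest(unittest.TestCase):
    def test_parses_saved_html(self):
        self.assertEqual(parse_prize_rows(FIXTURE), [("$200,000", 2), ("$50", 10)])
```

A test like this only covers parsing. Fetching pages from sites that require
JavaScript or a display still needs the heavier setup described above.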

## Contributing

``` sh
git clone https://github.com/owogawc/lottery_data_scraper
cd lottery_data_scraper
python3 -m venv ~/.virtualenvs/lottery_data_scraper
. ~/.virtualenvs/lottery_data_scraper/bin/activate
pip3 install -e .
```

Then you should be able to run `make test` and see the tests pass.