You cannot select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
102 lines
3.2 KiB
Markdown
102 lines
3.2 KiB
Markdown
# Parsing of lottery websites
|
|
|
|
## Demo
|
|
|
|
The following script should put you in a state where the last line will make a
|
|
bunch of requests to the Pennsylvania lottery website, parse the tables of
|
|
games/prizes, and print to your terminal a JSON structure of all of the games.
|
|
|
|
``` sh
|
|
git clone https://github.com/owogawc/lottery_data_scraper
|
|
cd lottery_data_scraper
|
|
python3 -m venv ~/.virtualenvs/lottery_data_scraper
|
|
. ~/.virtualenvs/lottery_data_scraper
|
|
pip3 install -e .
|
|
|
|
PY_LOG_LVL=DEBUG USE_CACHE=true python3 -m lottery_data_scraper.pennsylvania
|
|
```
|
|
|
|
If you have [jq](https://stedolan.github.io/jq/) installed, you can get some
|
|
formatted output by piping it to `jq` (and redirecting STDERR to /dev/null).
|
|
|
|
``` sh
|
|
PY_LOG_LVL=DEBUG USE_CACHE=true python3 -m lottery_data_scraper.pennsylvania 2> /dev/null | jq
|
|
```
|
|
|
|
## Data models
|
|
|
|
We're using [`marshmallow`](https://marshmallow.readthedocs.io/en/stable/index.html) to validate and serialize data.
|
|
|
|
I'm including the schemas here just so you can quickly get a general idea of
|
|
what data fields we're able to scrape from most lottery websites. What you see
|
|
in this README might not be up-to-date with what's in
|
|
[schemas.py](./lottery_data_scraper/schemas.py).
|
|
|
|
As of 2023-04-07 the schemas are a work-in-progress. The remaining TODO is to
|
|
determine and specify which fields are absolutely required and which are
|
|
optional.
|
|
|
|
### Game Schema
|
|
|
|
``` python
|
|
class GameSchema(Schema):
|
|
class Meta:
|
|
render_module = json
|
|
|
|
id = fields.Integer()
|
|
created_at = fields.DateTime(load_default=datetime.utcnow)
|
|
game_id = fields.Str()
|
|
name = fields.Str()
|
|
description = fields.Str()
|
|
image_urls = fields.Function(
|
|
lambda x: json.loads(x.image_urls) if x.image_urls else [],
|
|
deserialize=lambda x: None if x.image_urls == [] else json.dumps(x.image_urls),
|
|
)
|
|
how_to_play = fields.Str()
|
|
num_tx_initial = fields.Integer()
|
|
price = fields.Number()
|
|
prizes = fields.Nested(PrizeSchema, many=True)
|
|
state = fields.Str()
|
|
updated_at = fields.DateTime()
|
|
url = fields.Str()
|
|
```
|
|
|
|
### Prize Schema
|
|
|
|
``` python
|
|
class PrizeSchema(Schema):
|
|
class Meta:
|
|
render_module = json
|
|
|
|
id = fields.Integer()
|
|
game_id = fields.Integer()
|
|
available = fields.Integer()
|
|
claimed = fields.Integer()
|
|
created_at = fields.DateTime(load_default=datetime.utcnow)
|
|
value = fields.Number()
|
|
prize = fields.Str()
|
|
```
|
|
|
|
# Tests
|
|
|
|
Testing is kind of tricky because you can't rely on _just_ python with its
|
|
`requests` library. Some states have some scrape protections that require you
|
|
actually run JavaScript. Some states have extreme scrape protection that require
|
|
you to actually run a _display_. They check for some rendering context that
|
|
doesn't exist when you run a headless browser in Selenium. To scrape those
|
|
sites, you actually have to run a [X virtual
|
|
framebuffer](https://en.wikipedia.org/wiki/Xvfb). Testing in these cases isn't
|
|
as simple as running `python3 -m unittest discover`.
|
|
|
|
# Contributing
|
|
|
|
``` sh
|
|
git clone https://github.com/owogawc/lottery_data_scraper
|
|
cd lottery_data_scraper
|
|
python3 -m venv ~/.virtualenvs/lottery_data_scraper
|
|
. ~/.virtualenvs/lottery_data_scraper
|
|
pip3 install -e .
|
|
```
|
|
|
|
Then you should be able to run `make test` and see the tests pass.
|