Parsing of lottery websites
Demo
The following script sets up the project. Its last line makes a series of requests to the Pennsylvania lottery website, parses the tables of games and prizes, and prints a JSON structure of all of the games to your terminal.
git clone https://github.com/owogawc/lottery_data_scraper
cd lottery_data_scraper
python3 -m venv ~/.virtualenvs/lottery_data_scraper
. ~/.virtualenvs/lottery_data_scraper/bin/activate
pip3 install -e .
PY_LOG_LVL=DEBUG USE_CACHE=true python3 -m lottery_data_scraper.pennsylvania
If you have jq installed, you can get formatted output by piping it to jq (and redirecting STDERR to /dev/null).
PY_LOG_LVL=DEBUG USE_CACHE=true python3 -m lottery_data_scraper.pennsylvania 2> /dev/null | jq
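The same pipeline can also be driven from Python instead of jq. The following is a minimal sketch, not part of the project: `run_scraper` is a hypothetical helper that runs a command with extra environment variables, discards STDERR, and parses STDOUT as JSON. It assumes the command prints exactly one JSON document to STDOUT.

```python
import json
import os
import subprocess
import sys


def run_scraper(command, extra_env=None):
    """Run `command`, discard STDERR, and parse STDOUT as JSON.

    Hypothetical helper; assumes the command prints one JSON document.
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    result = subprocess.run(
        command,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # same effect as `2> /dev/null`
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout)


# Intended usage, mirroring the shell command above:
#
#   games = run_scraper(
#       [sys.executable, "-m", "lottery_data_scraper.pennsylvania"],
#       {"PY_LOG_LVL": "DEBUG", "USE_CACHE": "true"},
#   )
#   print(json.dumps(games, indent=2))
```

Using `check=True` means a failed scrape raises an exception rather than handing an empty or partial document to `json.loads`.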
Data models
We're using marshmallow to validate and serialize data.
I'm including the schemas here just so you can quickly get a general idea of what data fields we're able to scrape from most lottery websites. What you see in this README might not be up-to-date with what's in schemas.py.
As of 2023-04-07 the schemas are a work-in-progress. The remaining TODO is to determine and specify which fields are absolutely required and which are optional.
Game Schema
class GameSchema(Schema):
    class Meta:
        render_module = json

    id = fields.Integer()
    created_at = fields.DateTime(load_default=datetime.utcnow)
    game_id = fields.Str()
    name = fields.Str()
    description = fields.Str()
    image_urls = fields.Function(
        lambda x: json.loads(x.image_urls) if x.image_urls else [],
        deserialize=lambda x: None if x.image_urls == [] else json.dumps(x.image_urls),
    )
    how_to_play = fields.Str()
    num_tx_initial = fields.Integer()
    price = fields.Number()
    prizes = fields.Nested(PrizeSchema, many=True)
    state = fields.Str()
    updated_at = fields.DateTime()
    url = fields.Str()
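The image_urls field above appears to keep the list of URLs as a JSON-encoded string in storage and to expose it as a plain list when serialized. Here is a standalone sketch of that round trip using only the standard library. The function names are hypothetical and are not part of schemas.py, and treating `None` the same as an empty list is an assumption of this sketch.

```python
import json


def urls_from_storage(stored):
    """Decode a stored JSON string into a list; empty or None yields []."""
    return json.loads(stored) if stored else []


def urls_to_storage(urls):
    """Encode a list as a JSON string; an empty list (or None) is stored as None."""
    return None if not urls else json.dumps(urls)


# Round trip with a made-up URL:
stored = urls_to_storage(["https://example.com/ticket.png"])
restored = urls_from_storage(stored)
```

Storing `None` rather than the string `"[]"` keeps "no images" distinguishable from a malformed value when inspecting the stored data directly.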
Prize Schema
class PrizeSchema(Schema):
    class Meta:
        render_module = json

    id = fields.Integer()
    game_id = fields.Integer()
    available = fields.Integer()
    claimed = fields.Integer()
    created_at = fields.DateTime(load_default=datetime.utcnow)
    value = fields.Number()
    prize = fields.Str()
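Since which fields are required is still undecided, one way to inform that decision is to check which fields a given scraper actually fills. The sketch below is an assumption-laden illustration, not project code: `missing_fields` and `audit_game` are hypothetical helpers, the field names are copied from the two schemas above, and the sample record is invented.

```python
# Field names copied from GameSchema and PrizeSchema above.
GAME_FIELDS = (
    "id", "created_at", "game_id", "name", "description", "image_urls",
    "how_to_play", "num_tx_initial", "price", "prizes", "state",
    "updated_at", "url",
)
PRIZE_FIELDS = (
    "id", "game_id", "available", "claimed", "created_at", "value", "prize",
)


def missing_fields(record, expected):
    """Return the expected field names that are absent or None in `record`."""
    return [name for name in expected if record.get(name) is None]


def audit_game(game):
    """Report missing fields for a game and for each of its prizes."""
    return {
        "game": missing_fields(game, GAME_FIELDS),
        "prizes": [missing_fields(p, PRIZE_FIELDS) for p in game.get("prizes") or []],
    }


# Invented sample record, for illustration only.
sample_game = {
    "game_id": "1234",
    "name": "Example Game",
    "price": 5,
    "state": "pa",
    "prizes": [{"prize": "$100", "value": 100, "available": 10, "claimed": 2}],
}
report = audit_game(sample_game)
```

Running such an audit over the output of several state scrapers would show which fields are consistently present and are therefore candidates for being marked required.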
Tests
Testing is kind of tricky because you can't rely on just Python with its requests library. Some states have scrape protections that require you to actually run JavaScript. Some states have extreme scrape protection that requires you to actually run a display: they check for a rendering context that doesn't exist when you run a headless browser in Selenium. To scrape those sites, you have to run an X virtual framebuffer (Xvfb). Testing in these cases isn't as simple as running python3 -m unittest discover.
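One way to keep python3 -m unittest discover usable on machines without a display is to skip the display-dependent tests there. This is a minimal sketch of that idea using the standard library's `unittest.skipUnless`. It is not how this project's tests are organized: the class and test names are hypothetical, the test bodies are placeholders, and checking the `DISPLAY` environment variable is an assumption about how an X display (real or Xvfb) is detected.

```python
import os
import unittest

# Assumption: an X display, real or virtual (Xvfb), is advertised via DISPLAY.
HAS_DISPLAY = bool(os.environ.get("DISPLAY"))


class DisplayDependentScraperTest(unittest.TestCase):
    @unittest.skipUnless(HAS_DISPLAY, "needs an X display (real or Xvfb)")
    def test_site_that_checks_rendering_context(self):
        # Placeholder: a real test would drive a non-headless browser here.
        self.assertTrue(HAS_DISPLAY)


class PlainRequestsScraperTest(unittest.TestCase):
    def test_runs_everywhere(self):
        # Placeholder for a test that only needs plain HTTP requests.
        self.assertEqual(1 + 1, 2)
```

Without a display the first test is reported as skipped rather than failed. When the suite is started under Xvfb, `DISPLAY` is set and the test runs.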
Contributing
git clone https://github.com/owogawc/lottery_data_scraper
cd lottery_data_scraper
python3 -m venv ~/.virtualenvs/lottery_data_scraper
. ~/.virtualenvs/lottery_data_scraper/bin/activate
pip3 install -e .
Then you should be able to run make test and see the tests pass.