
Introducing the Pouët.net data dumps

category: general [glöplog]
Probably should open a thread that's harder to miss and has a chance of getting responses:

https://data.pouet.net/ - weekly API-formatted data dumps of the important parts of the prod db, so that datamining projects don't have to bludgeon the live site.

Additional tables (comments etc.) coming later.
added on the 2019-07-26 01:13:43 by Gargaj Gargaj
added on the 2019-07-26 01:41:13 by psenough psenough
data is sexy
added on the 2019-07-26 08:30:51 by nagz nagz
As requested :-) https://pastebin.com/E6nuHkat shows the missing parts of prods: cdc, credits, platforms, downloadLinks. (Well, platforms is replaced by an empty array, but I guess that counts as missing. Ignore the difference in rank, of course.)

In particular, once downloadLinks shows up, I can stop polling the database every night for new URLs added to old prods, and just use the weekly dumps instead.

It would also be really nice to have a way of finding the latest JSON dumps without having to parse HTML :-)
added on the 2019-07-26 10:32:22 by Sesse Sesse
> makes feature to avoid people hitting the API
> gets request for an API for the feature to avoid people hitting the API
added on the 2019-07-26 10:38:31 by Gargaj Gargaj
[image]
added on the 2019-07-26 12:58:05 by Emod Emod
Data dumps

[image]

[insert smutty toilet jokes]
added on the 2019-07-26 13:05:26 by d0DgE d0DgE
(We do make backups too.)
added on the 2019-07-26 13:05:39 by Gargaj Gargaj
Any private data? :P
added on the 2019-07-26 18:41:27 by AntDude AntDude
The good thing about a machine-readable way to get the latest file is that it can point to a static file that's refreshed along with the HTML, zero API calls needed :-P
added on the 2019-07-26 20:02:08 by Sesse Sesse
because it is like superdifficult to assess the current date vs downloaded filename date when the cron job apparently runs weekly anyway!
I'm talking about the "prods" zipfile:

- numbers (glops, voteups, etc.) should be serialized as integers, so they're easy to import
- the data may need some cleanup: there is prod 97, where releaseDate is 1994-00-15

added on the 2019-07-27 10:37:32 by friol friol
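Until the dump serializes numbers as integers, the first point above can be worked around on the import side. A minimal Python sketch, assuming the numeric fields arrive as strings; the field names in the sample record are hypothetical and not taken from the actual dump:

```python
def coerce_ints(record, keys):
    """Return a copy of record with the named fields converted to int where possible."""
    out = dict(record)
    for key in keys:
        value = out.get(key)
        if isinstance(value, str):
            try:
                out[key] = int(value)
            except ValueError:
                pass  # leave non-numeric strings untouched
    return out


# Hypothetical record: these field names are illustrative assumptions.
prod = {"id": "97", "name": "Example", "voteup": "12", "glops": "n/a"}
clean = coerce_ints(prod, ["id", "voteup", "glops"])
```

The original record is left unmodified, so the raw strings stay available if a field turns out not to be numeric after all.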
I never said "superdifficult"; but parsing unstructured data is brittle. But since it is apparently no problem, I welcome your API endpoint that gives me the latest dump. Let me know when you have a URL. 🤷
added on the 2019-07-27 10:55:31 by Sesse Sesse
@friol: I believe all dates are taken to be the 15th (Pouët doesn't support full release dates, just months and years), and 00 is used to specify an unknown month (the front end allows it, and there are 17k+ prods in the database with zero month).

Eventually MySQL will stop accepting these kinds of (invalid) zero dates; in newer versions, allowing them already requires a non-default SQL mode. I believe the only real recourse for Pouët is to store month and year as separate columns instead of using a date column with a fake day; that would allow month to be null.
added on the 2019-07-27 11:18:27 by Sesse Sesse
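For anyone importing the dump, the convention described above (day is a placeholder, month 00 means unknown) can be handled by splitting the string rather than feeding it to a date parser, which would reject a zero month. A rough Python sketch, assuming releaseDate is always either empty or a YYYY-MM-DD string; the function name is made up for illustration:

```python
import re

_DATE_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")


def parse_release_date(value):
    """Split 'YYYY-MM-DD' into (year, month); month is None when it is 00.

    The day component is ignored, since it is only a placeholder.
    Returns None for a missing or empty value.
    """
    if not value:
        return None
    match = _DATE_RE.match(value)
    if match is None:
        raise ValueError(f"unexpected date format: {value!r}")
    year = int(match.group(1))
    month = int(match.group(2))
    if month > 12:
        raise ValueError(f"month out of range: {value!r}")
    return year, (month if month != 0 else None)


known = parse_release_date("2019-07-15")
unknown_month = parse_release_date("1994-00-15")
```

Keeping year and month as separate values on the consumer side mirrors the separate-columns idea, with None standing in for a null month.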
Amended the data with downloadLinks, credits, etc.; also, https://data.pouet.net/json.php is now available.
added on the 2019-07-27 15:05:53 by Gargaj Gargaj
how about symlinking the latest version as *-latest.json.gz?

Or do it with some rewrite rule magic, whatever :)

Saves some jq calls in case you want to download the dumps in a shell script and don't want to guess the date in the filename.
added on the 2019-07-27 15:19:55 by mihi mihi
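Without a -latest symlink, a script that already has a list of dump filenames (however it obtained them) can pick the newest one by the date in the name. A small Python sketch; the filename pattern used here, ending in -YYYYMMDD.json.gz, is an assumption for illustration and should be checked against the real listing:

```python
import re
from datetime import date

# Assumed pattern: "<anything>-YYYYMMDD.json.gz" (not confirmed by the thread).
_STAMP_RE = re.compile(r"-(\d{4})(\d{2})(\d{2})\.json\.gz$")


def dump_date(filename):
    """Return the date embedded in a dump filename, or None if there is none."""
    match = _STAMP_RE.search(filename)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


def latest_dump(filenames):
    """Return the filename carrying the most recent embedded date, or None."""
    dated = [(dump_date(name), name) for name in filenames]
    dated = [(stamp, name) for stamp, name in dated if stamp is not None]
    if not dated:
        return None
    return max(dated)[1]


# Hypothetical filenames, made up for the example.
names = [
    "example-prods-20190720.json.gz",
    "example-prods-20190727.json.gz",
    "index.html",
]
newest = latest_dump(names)
```

Comparing parsed dates rather than raw strings keeps the selection correct even if a prefix in the filename changes between dumps.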
Nah it's good.
added on the 2019-07-27 15:42:48 by Gargaj Gargaj
Brilliant! Everything adjusted and in order, so now I can stop the job that refreshes old prods gradually every night.
added on the 2019-07-27 16:06:48 by Sesse Sesse

