Introducing the Pouët.net data dumps

category: general [glöplog]

Probably should open a thread that's harder to miss and has a chance of getting responses:

https://data.pouet.net/ - weekly API-formatted data dumps of the important parts of the prod db so that datamining projects don't have to bludgeon the site as it is live.

Additional tables (comments etc.) coming later.

added on the 2019-07-26 01:13:43 by Gargaj

nice!

added on the 2019-07-26 01:41:13 by psenough

data is sexy

added on the 2019-07-26 08:30:51 by nagz

As requested :-) https://pastebin.com/E6nuHkat shows the missing parts of prods: cdc, credits, platforms, downloadLinks. (Well, platforms is replaced by an empty array, but I guess that counts as missing. Ignore the difference in rank, of course.)

In particular, once downloadLinks shows up, I can stop polling the database every night for new URLs added to old prods, and just use the weekly dumps instead.

It would also be really nice to have a way of finding the latest JSON dumps without having to parse HTML :-)

added on the 2019-07-26 10:32:22 by Sesse

> makes feature to avoid people hitting the API
> gets request for an API for the feature to avoid people hitting the API

added on the 2019-07-26 10:38:31 by Gargaj

added on the 2019-07-26 12:58:05 by Emod

Quote:

Data dumps

[insert smutty toilet jokes]

added on the 2019-07-26 13:05:26 by d0DgE

(We do make backups too.)

added on the 2019-07-26 13:05:39 by Gargaj

Any private data? :P

added on the 2019-07-26 18:41:27 by AntDude

The good thing about a machine-readable way to get the latest file is that it can point to a static file that's refreshed along with the HTML, zero API calls needed :-P

added on the 2019-07-26 20:02:08 by Sesse

because it is like superdifficult to assess the current date vs downloaded filename date when the cron job apparently runs weekly anyway!

added on the 2019-07-27 02:28:00 by maalinstrippari

I'm talking of the "prods" zipfile:

-numbers (glops, voteups, etc.) should be serialized as integers, so it's easy to import them
-data needs some cleanup maybe? there is prod 97 where releaseDate is 1994-00-15
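
Until the dump changes, the string-to-integer conversion can be done on the importing side. A minimal sketch, assuming the counters arrive as JSON strings; the field list and the sample record are made up for illustration and are not taken from the actual dump:

```python
import json

# Assumed field names (only "glops" and "voteups" are mentioned in this thread).
NUMERIC_FIELDS = ("glops", "voteups")


def coerce_numeric(record, fields=NUMERIC_FIELDS):
    """Return a copy of `record` with the listed string fields converted to int.

    Values that are not strings, or that do not parse as integers, are left as-is.
    """
    out = dict(record)
    for name in fields:
        value = out.get(name)
        if isinstance(value, str):
            try:
                out[name] = int(value)
            except ValueError:
                pass  # keep the original text rather than guessing
    return out


# Hypothetical record, shaped only loosely like a prod entry.
sample = json.loads('{"name": "example", "glops": "12", "voteups": "3"}')
print(coerce_numeric(sample))
```

Leaving unparseable values untouched means a surprise in the data shows up later as a type mismatch instead of being silently turned into 0.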

Thanks

added on the 2019-07-27 10:37:32 by friol

I never said "superdifficult"; but parsing unstructured data is brittle. But since it is apparently no problem, I welcome your API point that gives me the latest dump. Let me know when you have a URL. 🤷

added on the 2019-07-27 10:55:31 by Sesse

@friol: I believe all dates are taken to be the 15th (Pouët doesn't support full release dates, just months and years), and 00 is used to specify an unknown month (the front end allows it, and there are 17k+ prods in the database with zero month).

Eventually MySQL will stop accepting these kinds of (invalid) zero dates; allowing them already requires a non-default SQL mode in newer versions. I believe the only real recourse for Pouët is to store month and year as separate columns instead of using a date column with a fake day; that would allow month to be null.
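
For anyone importing the dump in the meantime, a rough sketch of reading such a value as year plus optional month, assuming releaseDate is always a "YYYY-MM-DD" string; the function name is made up, and a missing or null releaseDate is not handled here:

```python
from typing import Optional, Tuple


def parse_release_date(value: str) -> Tuple[int, Optional[int]]:
    """Split 'YYYY-MM-DD' into (year, month).

    The day is ignored (it is a placeholder), and month 00 is mapped to None
    to mean "unknown month".
    """
    year_text, month_text, _day_text = value.split("-")
    year = int(year_text)
    month = int(month_text)
    if not 0 <= month <= 12:
        raise ValueError(f"unexpected month in release date: {value!r}")
    return year, (month or None)


print(parse_release_date("1994-00-15"))  # (1994, None)
print(parse_release_date("2019-07-15"))  # (2019, 7)
```

This avoids handing the string to a strict date parser, which would reject month 00 outright.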

added on the 2019-07-27 11:18:27 by Sesse

Amended the data with the downloadLinks/credits/etc, also https://data.pouet.net/json.php is now available.

added on the 2019-07-27 15:05:53 by Gargaj

how about symlinking the latest version as *-latest.json.gz?

Or do it with some rewrite rule magic, whatever :)

Saves some jq calls in case you want to download the dumps in a shell script and don't want to guess the date in the filename.
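
Without a symlink, picking the newest file out of a list of names could look roughly like this. The `<prefix>-YYYYMMDD.json.gz` naming scheme and the sample names are assumptions for illustration, not the confirmed format of the dump files:

```python
import re
from datetime import date
from typing import Iterable, Optional

# Assumed naming scheme: <prefix>-YYYYMMDD.json.gz (hypothetical).
DATED_NAME = re.compile(r"^(?P<prefix>.+)-(?P<stamp>\d{8})\.json\.gz$")


def latest_dump(filenames: Iterable[str], prefix: str) -> Optional[str]:
    """Return the filename with the newest date stamp for `prefix`, or None."""
    best = None
    for name in filenames:
        match = DATED_NAME.match(name)
        if match is None or match.group("prefix") != prefix:
            continue
        stamp = match.group("stamp")
        try:
            stamped = date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:]))
        except ValueError:
            continue  # eight digits that do not form a real calendar date
        if best is None or stamped > best[0]:
            best = (stamped, name)
    return best[1] if best else None


# Made-up file names following the assumed scheme.
names = [
    "prods-20190720.json.gz",
    "prods-20190727.json.gz",
    "groups-20190727.json.gz",
]
print(latest_dump(names, "prods"))  # prods-20190727.json.gz
```

The list of names still has to come from somewhere, for example the machine-readable listing mentioned above.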

added on the 2019-07-27 15:19:55 by mihi

Nah it's good.

added on the 2019-07-27 15:42:48 by Gargaj

Brilliant! Everything adjusted and in order, so now I can stop the job that refreshes old prods gradually every night.

added on the 2019-07-27 16:06:48 by Sesse
