News from pouet-mirror.sesse.net
category: general [glöplog]
The previous thread used for this was really about something else, so I'm starting a new one to share the occasional tidbit. Hopefully only good news. :-)
First, the JSON now has a “mirror” field for each URL; it lists, if known and relevant, a more well-known place (scene.org, amigascne.org, my static ftp.untergrund.net dump, etc.) where you can find the same file. The idea is that if it's lost on upstream, you don't have to pick out the file from my mirror, but can just link directly to the file. If there are multiple candidates, it has some heuristics to pick a “preferred” one (e.g., scene.org parties is preferred over compilations, scene.org is preferred above most other sites, etc.).
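To make the “preferred” idea concrete, here is a minimal sketch of such a ranking. The rule table, the scores and the function names are all made up for illustration; the real heuristics aren't spelled out here and surely have more cases.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical preference rules, checked in order: (host suffix, path fragment, score).
# Lower score wins. The values are illustrative, not the mirror's actual ones.
RULES = [
    ("scene.org", "/parties/", 0),
    ("scene.org", "/compilations/", 2),
    ("scene.org", "", 1),  # any other scene.org path
]
DEFAULT_SCORE = 3  # every other site


def score(url: str) -> int:
    """Rank one candidate URL; the first matching rule decides."""
    parts = urlparse(url)
    host = parts.hostname or ""
    for suffix, fragment, value in RULES:
        on_site = host == suffix or host.endswith("." + suffix)
        if on_site and fragment in parts.path:
            return value
    return DEFAULT_SCORE


def pick_preferred(candidates: list[str]) -> Optional[str]:
    """Return the best-ranked candidate, or None if there are none."""
    return min(candidates, key=score, default=None)
```

On a tie, `min` keeps whichever candidate came first, so the input order acts as a tie-breaker in this sketch.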
This is a content match, not a filename match; it can detect repacks (looks at hash of the bytes inside the archive, not the archive itself), filters out scene.org's ad files (both old and new format), and looks recursively inside archives. It supports most common and a few uncommon formats used across platforms. (If you're now firing up your unpacker exploits, don't bother; the unpacking runs in a sandbox :-) ) So if you're missing a .dms, don't be surprised if it tells you that the only available copy is an .adf inside a .zip inside an .iso inside a multi-part .rar.
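For the curious, repack detection along these lines can be sketched roughly as below. This is an assumption-laden toy, not the actual matcher: it handles Zip only, does not recurse into nested archives, and the `ignore` set (standing in for the ad-file filter) is a hypothetical parameter.

```python
import hashlib
import io
import zipfile


def content_fingerprint(zip_bytes: bytes, ignore: frozenset = frozenset()) -> str:
    """Hash what is inside a Zip, ignoring how it was packed.

    `ignore` holds SHA-256 hex digests of members to skip (e.g. known ad files).
    """
    member_digests = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            digest = hashlib.sha256(zf.read(info)).hexdigest()
            if digest in ignore:
                continue
            member_digests.append(digest)
    # Sort so that member order does not matter; filenames are deliberately unused.
    joined = "\n".join(sorted(member_digests)).encode("ascii")
    return hashlib.sha256(joined).hexdigest()
```

Two archives with the same fingerprint hold the same bytes, even if one was recompressed, reordered or had its members renamed.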
There is now a browser extension (works with anything supporting MV3, e.g. Chrome or Firefox) that consumes status.json; it shows the archive status and a mirror link (if existing) for each indexed URL on Cardboard and Pouët. It's really unpolished, so I haven't bothered publishing it, but contact me if you want a copy. :-)
As an offshoot from this, I made a makeshift search engine. It simply does case-insensitive substring searches in all the filenames I have available (1M+ archives, 20M+ files after unpacking). It's already recovered a fair number of prods that Pouët has been missing for a decade or more; perhaps good searchers can squeeze out yet more. Searches hit disk (rotating rust) and there's the occasional bug, but they should be pretty fast. If you want to do something more heavy-duty, the database it uses is downloadable (/files.db, in plocate format).
However… all of these can only find things where my machine can actually see the content. That means that I must either have crawled it, or I must have a mirror. I have a pretty good collection of those by now (especially since I mirror scene.org's mirror section, too), but there are definitely things I'm missing; e.g., there are 200k+ CSDB releases that I cannot easily get my hands on, and post-untergrund.net, I don't think I will be getting Amigamega or Fujiology updates either. So if you want your stuff searchable, please open up rsync (or whatever) and let me pull a mirror from you. :-)
Sorry, that got long. Go rescue a demo about it.
Oh, and another thing I forgot: I now crawl git archives. It's just a single one-off dump like everything else (so not a continuous mirror), but if you have a GitHub (non-single-file) or git:// URL, I will send git to download the repository and make a bundle file (which you can download and then clone from) instead of just the HTML front page. Will of course update with support for Codeberg etc. when/if Pouët allows that.
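In case it helps anyone doing the same at home, one way to turn a repository into a bundle looks roughly like this. It is a sketch under assumptions: the function names and the `repo.git` directory name are invented here, and the crawler's actual invocation may well differ. `git clone --mirror` and `git bundle create … --all` are standard git commands.

```python
import subprocess
from pathlib import Path


def bundle_commands(url: str, workdir: Path, bundle: Path) -> list[list[str]]:
    """Build the two git invocations; nothing is executed here."""
    mirror = workdir / "repo.git"  # hypothetical scratch location
    return [
        # Fetch every ref (branches, tags, ...) without a working tree.
        ["git", "clone", "--mirror", url, str(mirror)],
        # Pack all refs into one file; absolute path because -C changes directory.
        ["git", "-C", str(mirror), "bundle", "create", str(bundle.resolve()), "--all"],
    ]


def make_bundle(url: str, workdir: Path, bundle: Path) -> None:
    """Run the commands; raises CalledProcessError if git fails."""
    for cmd in bundle_commands(url, workdir, bundle):
        subprocess.run(cmd, check=True)
```

The resulting file can be cloned from like any other remote, e.g. `git clone demo.bundle demo`.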
The crawler now stores HTTP headers and redirect chains internally (only for newly crawled URLs at the moment). This means that it can expose two new fields in status.json: “redirect”, which shows the final redirect target (if any), and “filename”, which shows the download filename (from e.g. Content-Disposition), if different from just the last segment of the original URL.
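Deriving those two fields from stored headers and a redirect chain could look something like the sketch below. The function names and the shape of the inputs (a plain dict of headers, a list of URLs with the original one first) are assumptions for illustration, not the crawler's real data model.

```python
import posixpath
from email.message import Message
from typing import Optional
from urllib.parse import unquote, urlparse


def last_segment(url: str) -> str:
    """Last path segment of a URL, percent-decoded."""
    return unquote(posixpath.basename(urlparse(url).path))


def download_filename(url: str, headers: dict[str, str]) -> Optional[str]:
    """Filename from Content-Disposition, but only if it differs from the URL's."""
    msg = Message()
    for key, value in headers.items():
        msg[key] = value
    name = msg.get_filename()  # parses the filename parameter, incl. RFC 2231 forms
    if name and name != last_segment(url):
        return name
    return None


def final_redirect(chain: list[str]) -> Optional[str]:
    """chain[0] is the original URL; return the last hop if it redirected at all."""
    return chain[-1] if len(chain) > 1 else None
```

Note that `Message.get_filename()` also falls back to a `name` parameter on Content-Type when there is no Content-Disposition, which may or may not be what you want.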
At some point, perhaps I should start re-crawling things, but that has a lot of problems of its own. :-)
Nice!
cool! i'll have a look at this after ascension day weekend
I am now re-crawling.
It's a bit rough around the edges, but it's a start. The aim is to check every link (except scene.org and a few others) every day, but that is very much the happy case; due to per-domain cooldowns, slow servers, download counters, API quotas, 429 status codes, missing ETag headers and such, many can only be checked less often than that, and some are kept out entirely for now.
Generally, it tries to check, not re-download; the ideal case is that it sends off an HTTP request and gets a 304 Not Modified back, which should be kind to the origin servers and not require much bandwidth on either side. (However, since we don't have the HTTP headers for older downloads, there will be a fair bit of re-downloading to begin with.) Everything has two layers of randomized cooldowns to spread out the load over time as much as possible (with the exception of FTP, where a bunch of MDTM checks and such are clustered together so we don't keep reopening and closing FTP connections). There's a bunch of site-specific logic and timeouts because the scene is a diverse place; I'm hoping to be as unintrusive as possible in general.
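The check-don't-download idea boils down to a conditional GET. Here is a minimal sketch using only the standard library; the function names, the return strings and the single-layer jitter are assumptions for illustration (the real crawler has two cooldown layers and plenty of site-specific logic that this does not attempt to model).

```python
import random
import urllib.error
import urllib.request


def conditional_headers(stored: dict[str, str]) -> dict[str, str]:
    """Turn validators saved from the previous crawl into request headers."""
    lower = {k.lower(): v for k, v in stored.items()}
    out = {}
    if "etag" in lower:
        out["If-None-Match"] = lower["etag"]
    if "last-modified" in lower:
        out["If-Modified-Since"] = lower["last-modified"]
    return out


def check(url: str, stored: dict[str, str], timeout: float = 30.0) -> str:
    """Return "unchanged", "recheck" or "error:<code>" (hypothetical states)."""
    headers = conditional_headers(stored)
    if not headers:
        return "recheck"  # no validators stored: only a full download can tell
    req = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(req, timeout=timeout):
            return "recheck"  # 200: the server sent a body, so compare contents
    except urllib.error.HTTPError as e:
        # urllib reports 304 Not Modified as an HTTPError.
        return "unchanged" if e.code == 304 else f"error:{e.code}"


def next_check_delay(base: float, jitter: float = 0.5, rng=random) -> float:
    """Randomize a cooldown so that checks do not all land at the same moment."""
    return base * rng.uniform(1 - jitter, 1 + jitter)
```

Network failures (`urllib.error.URLError`, timeouts) are left to propagate here; a real crawler would want to catch them and back off.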
status.json has changed format a bit to reflect that we may now have multiple different contents for the same URL (people tend to e.g. repack Zip files but keep the same URL). I don't think anything except my own browser extension actually consumed this data regularly. There may or may not be a web interface later, similar to cardboard's “broken links” feature (eventually, I hope to all but obsolete it, but the road there is kind of long).
There are some data quality issues in the new crawls (e.g., prods being replaced by HTML files that still serve 200 OK) that will have to be dealt with over time.
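One cheap way to catch the “HTML page with 200 OK where a prod used to be” case is to sniff the response. This is only a sketch of the idea with invented function names; it will miss cleverer error pages, and it deliberately does not flag prods that were HTML all along.

```python
HTML_TYPES = ("text/html", "application/xhtml+xml")
HTML_STARTS = (b"<!doctype html", b"<html", b"<head", b"<body")


def looks_like_html(body: bytes, content_type: str = "") -> bool:
    """Does this response look like a web page rather than a binary prod?"""
    if content_type.split(";")[0].strip().lower() in HTML_TYPES:
        return True
    # Skip a UTF-8 BOM and leading whitespace, then look at the first tag.
    head = body[:512].lstrip(b"\xef\xbb\xbf \t\r\n").lower()
    return head.startswith(HTML_STARTS)


def suspicious_replacement(old_body: bytes, new_body: bytes,
                           new_content_type: str = "") -> bool:
    """Flag only the transition from non-HTML content to HTML content."""
    return not looks_like_html(old_body) and looks_like_html(new_body, new_content_type)
```

Comparing against the previous content matters because some prods legitimately are single HTML files.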
There is now a web interface: https://superglue.sesse.net/
There is _so_ _much_ I'd like to do before showing it to the world, but I figured I had to get it out eventually. Stuff might break at any time. Stuff might change at any time. There's a lot to be said, but I think the most important is: It tries to detect breakage quickly, and it tries to detect un-breakage even more quickly. (“Tries” being the operative word; it isn't perfect and it's also a bit hampered by not having an updated stream of Pouët changes.)
Feedback appreciated, although I can't guarantee I will do anything about any of your wishes. :-)
Nice work, Sesse! Thank you! Looks cardboardy enough and pretty serviceable to my lamer eye.
(and by lamer, I mean untrained)
Thanks for these lists! Found quite a few Ephidrena releases with a broken link or two.
