Danbooru

BigQuery dataset (queryable dump)

Posted under General

With some assistance from albert, I've set up a Google BigQuery data dump of the Danbooru posts tables so anyone who cares to do so can run queries. You can access the dump here. You may see Konachan and Yandere tables there too but those aren't complete so I don't recommend bothering with them.

It's updated nightly and should contain basically anything you can get through the API. If you really want to, you can download the whole thing as a dump using these instructions. Should save you from trying to scrape the API or something silly like that.

The query syntax is very similar to SQL, with a few differences due to fancier datatypes like repeated/nested fields.

An example query:

SELECT tags.name, COUNT(id) AS num, SUM(file_size)/1000000000 AS GB FROM [danbooru.posts] GROUP BY tags.name ORDER BY GB DESC LIMIT 20;

This returns the tags with the most file_size associated with them. For example this query will show you that 1girl has just over a terabyte of image data over 1.6 million images.

SELECT favs, COUNT(id) AS num, SUM(file_size)/1000000000 AS GB FROM [danbooru.posts] GROUP BY favs ORDER BY GB DESC LIMIT 20;

This groups by user favorites instead of tags. You can see that user 19831 has 414000 favourites, which are around 380GB.

Just found this last night and holy crap, this is so useful. @Allynay, don't know if you're still around, but thanks for making this.

Here are some quick overall stats on the site, total filesizes and average scores broken down by rating.

This is also useful for doing queries that timeout or don't fit into the 6 tag limit. Here's one to find unbanned artists, since banned_artist -status:banned times out:

evazion said:

Just found this last night and holy crap, this is so useful. @Allynay, don't know if you're still around, but thanks for making this.

Here are some quick overall stats on the site, total filesizes and average scores broken down by rating.

This is also useful for doing queries that timeout or don't fit into the 6 tag limit. Here's one to find unbanned artists, since banned_artist -status:banned times out:

No worries, glad someone found it useful. I'll be adding Yande.re and Gelbooru at some point as well. Konachan maybe.

How are you keeping this updated? Are you scraping the API or using the database replica (I think I saw albert say something about giving people access to that once?)?

Because if you have more tables, and it wouldn't be too much trouble to put them on BigQuery too, that'd be extremely helpful to me. It'd be nice to have the users table to resolve user IDs. And post_versions in particular would save me a lot of time in scraping it, if you happen to already have it.

evazion said:

How are you keeping this updated? Are you scraping the API or using the database replica (I think I saw albert say something about giving people access to that once?)?

Because if you have more tables, and it wouldn't be too much trouble to put them on BigQuery too, that'd be extremely helpful to me. It'd be nice to have the users table to resolve user IDs. And post_versions in particular would save me a lot of time in scraping it, if you happen to already have it.

I'm using the DB replica. Until very recently there was no easy way to keep an up to date copy of the DB (other boorus have a change_seq column you can order by, Danbooru only has updated). I'm working on the new way (PubSub) and if it seems to work without compromising integrity I'll probably stick to that.

I had a look and I don't have permissions for the post_versions or user tables (I do have tags and tag_aliases though). While there doesn't seem to be anything sensitive in the post_versions table and Albert might be okay with giving me access, the users table contains waaaay too much sensitive info. Postgres does have column level access though.

I might work on scraping it but it'll take a bit of work. What do you need the post_versions table and the ability to resolve users for? (if it's something interesting to me it's more likely I'll put in the effort :p)

The users table is no problem. I already have a scraper and most things, except the posts table and the various *_versions tables, can be scraped quickly. Main thing I need it for is matching uploader/approver IDs with usernames in the posts table.

As for post_versions, the big thing is for doing "Which posts did user X add tag Y to?" queries. That came up twice today in needing to mass revert certain edits in topic #13160 and in topic #13153.

The other thing I'm interested in is answering various tag usage questions. Namely:

For any given tag:

  • Who coined this tag?
  • How has this tag grown over time?
  • How many people have ever used this tag?
  • Who are the top taggers for this tag?
  • What posts has it been added to recently? What posts has it been removed from?

For any given user:

  • How many different posts have they tagged?
  • What are the top tags that they've added? That they've removed? And to which posts?

I had a crazy idea of writing a userscript that would query these things from BigQuery and add reports to user pages and to wiki pages. I think it would be doable, assuming the cost of these queries isn't prohibitive.

But I can scrape post_versions myself, don't worry about it. I was just hoping that you might have access in the replica already.

I'm not that familiar with SQL, so is there a way to search for positive and negative tags, or various tag combinations?

I originally tried...
SELECT id,tags FROM [danbooru-data:danbooru.posts] WHERE tags.name = 'tank' AND tags.name <> 'girls_und_panzer'
...as a test case, but that returned zero results.

If I can search with tag combinations, one thing I'd like to do is to search for all posts that have the tank tag but aren't tagged with a specific tank name.

Try this:

I'm not sure if that's the best way of doing it, but it seems to work.

Here are some more interesting queries:

Some of this deserves to be looked at in more detail than I have time for now, but tl;dr:

  • Contributor uploads on average score a couple points higher than member uploads, no matter the rating.
  • One approver accounts for ~33% of approvals. The top two account for ~50%. The top ~20% of approvers (<10 people of 50) account for ~80% of approvals. (This was for one month only though).
  • Pixiv has 1.6 million uploads, Twitter is second at only ~80,000 (twitter.com + twimg.com).
  • ...but interestingly, nijie.info and hentai-foundry.com uploads have the highest average scores out of any source.

That seems to have done the trick... thanks for that! ^_^

Now I get the fun task of constructing a query subtracting every named tank on Danbooru ... :p

Here's a list of the top 50 uploaders ranked by highest total scores. Some things to note:

  • Score is more a measure of popularity than of quality, and porn is more popular than non-porn. So take scores with a grain of salt.
  • Upvotes and downvotes don't add up to total score. I think that's because favorites used to give a free point that wasn't counted as an upvote.
  • rating:e attracts more downvotes than rating:s, so if a person has a lot of downvotes it could be because they upload a lot of rating:e.
  • This is only the top 50, but there's Google spreadsheet below that lists all 14265 people who have ever uploaded anything. Which is not as many uploaders as I'd thought we'd have tbh.

evazion said:

Ayyy, some hard evidence that is proof of my dedication to quality. It's a shame I can't code for shit, I'd like to do some querying myself but I can't see the benefit of taking hours to learn something I'll never use again.

CodeKyuubi said:

Ayyy, some hard evidence that is proof of my dedication to quality. It's a shame I can't code for shit, I'd like to do some querying myself but I can't see the benefit of taking hours to learn something I'll never use again.

Well to be fair, nearly all above rank 15 are some veteran users, in terms of join date :p.

Sacriven said:

Well to be fair, nearly all above rank 15 are some veteran users, in terms of join date :p.

He's talking about his avg score ration and not total scores. Which is more of an accomplishment

CodeKyuubi said:

Ayyy, some hard evidence that is proof of my dedication to quality.

Score =/= Quality
Score = Popularity
And with that, users like Zettamorose, oecchi or bad mongo will never rank high there, since they mostly upload male_focus :3.

Comics are another thing that rarely get high scores. Just 2 of the 390 images with 100+ score are comic rating:safe, despite this combination making up over 1 in 20 posts on the site. An average of 50 purely from uploading hentai is probably equivalent to about an average of 5 for uploading non-hentai comics.

Rastamepas said:

He's talking about his avg score ration and not total scores. Which is more of an accomplishment

You missed the point.

The above data cannot be taken wholly at face value. Should someone that uploads a million posts with an average score of 1 be the top uploader? Should someone with only 40 posts with an average score of 50 be the top uploader? Should someone with the highest score count but also the lowest ratio of positive to negative score be the top uploader? I say no to all of the above.

Total positive score, total negative score and total post count must all be taken into consideration individually otherwise you start losing context. Some kind of weighting system where all of those values are normalized would be a bit more useful, but the weighting would have to be tailored to the interests of that particular user.

Provence said:

Score =/= Quality
Score = Popularity
And with that, users like Zettamorose, oecchi or bad mongo will never rank high there, since they mostly upload male_focus :3.

To be fair, less than 10% of my posts are Kancolle and Touhou, which are the biggest culprits of score = popularity != quality. About a third of my posts are original or have no copytag due to being unknown, and separately, about 30% of my posts are also scans.

You don't have to believe me, but I've always held myself to a high standard for uploading high-quality art, and not min-tagging images just to get an upload.

Yay! This is fun to play with. I decided to do some analysis of tags. There's been a lot of argument that pictures with certain tags get more points than others, but how true is it?

It looks like the answer is that tags do make a difference. comic doesn't show up in the top 100 at all, which probably means a very low average score. High-scoring tags on safe images include ass, small_breasts, bikini, collarbone, and panties. Low-scoring tags on safe images include monochrome, 1boy, school_uniform, glasses and weapon. Only two copyrights showed up: kantai_collection is unusually high-scoring, but touhou posts have average scores.