Danbooru

BigQuery dataset (queryable dump)

Posted under General

There's has_children, has_active_children, and has_visible_children on the one side, and parent_id on the other.

parent_id == null

...is equivalent to...

has_parent == false

... and vice versa...

parent_id != null

...is equivalent to...

has_parent == true

BrokenEagle98 said:

Strange, then why don't the table details show an option for has_parent?

Edit: I see, you edited it, nevermind.

Edit2: Everytime I use parent_id as a requirement, query returns 0 results :(. This is wasting too much time when I can do it manually on DB without querying it.

Updated

SELECT 
  id,parent_id
FROM 
  [danbooru-data:danbooru.posts]
WHERE 
  parent_id != 0
LIMIT
  1000

Apparently when it was imported, it set the parent_id from null to 0. See the above example which worked for me.

lkjh098 said:

It looks like the answer is that tags do make a difference. comic doesn't show up in the top 100 at all, which probably means a very low average score. High-scoring tags on safe images include ass, small_breasts, bikini, collarbone, and panties. Low-scoring tags on safe images include monochrome, 1boy, school_uniform, glasses and weapon. Only two copyrights showed up: kantai_collection is unusually high-scoring, but touhou posts have average scores.

Great analysis. The tags with the highest average scores are mostly just the ones indicating a post's level of sexiness. It's interesting that for certain tags - breasts, cleavage, underwear - average scores barely differ across ratings; rating:s cleavage is almost as popular as rating:e cleavage. So it goes to show that score is driven by sex appeal, even for rating:s.

It's also interesting to compare average scores across classes of tags. Among hair colors, silver hair is the most popular and green hair is the least. Among breast sizes, small breasts is Danbooru's favorite.

Let's look specifically at the top copyrights. Do Touhou and Kantai Collection really have higher scores than average?

Answer: Yes and no. They're the most uploaded and they do have higher averages than many copyrights. But original actually isn't far behind. idolmaster_cinderella_girls, fate_(series), and love_live! beat them in the rating:e category. overwatch, re:zero_kara_hajimeru_isekai_seikatsu and kono_subarashii_sekai_ni_shukufuku_wo! beat them across the board. Although these things do have far fewer uploads, so perhaps their scores are less watered down in comparison to the tens of thousands of touhou / kantai collection posts.

And perhaps unsurprisingly: Male focus copyrights like jojo_no_kimyou_na_bouken and touken_ranbu are incredibly unpopular here in terms of scores, despite having respectable numbers of uploads.

Updated

evazion said:

Yeah, part of the problem with comparing kancolle and touhou tags to others is gonna also be that, due to their popularity, very mediocre images which might otherwise fall through may be approved on the strength of the fandom, and in much larger quantities, as well as the large number of comics attributed to those two copyrights, which naturally fall low on the score table.

Here's a table showing how the top 100 copyrights have grown over the years:

Basically: touhou peaked at 71000-72000 uploads per year in 2010-2012, until kantai_collection overtook it in 2014.

Updated

Alright, so comics: let's quantify exactly how much they affect scores.

The above table shows what percentage of a copyright was tagged comic, and what the average score is for those comic posts. tl;dr: kantai_collection was 18% comics this year versus 10.7% for touhou, and both had average scores of around 3. That does bring down their rating:s scores. But girls_und_panzer and love_live! were 11%-12% comics too, so touhou doesn't actually have an unusually high number of comics. Just kantai collection does.

What if we exclude comic posts and then compare scores?

When comics are excluded then kantai_collection and touhou scores do rise, but there are still other copyrights that are even more popular in terms of scores.

Just popping in to say how awesome and informative this is, you guys.

Not to mention how useful it can be: revealing trends, approval biases, user habits, user activity graphs when considering promotions, etc. So many things.

At times like this I wish I knew more than the most basic knowledge required to appreciate a good spreadsheet..

For the curious a flattened post version table has finished syncing now and is available for query. You can contact me for access (I don't want to make it public because it has ip addresses).

I posted this in topic #13112 but I guess I should post it here too. This is a dump of nearly everything publicly available from the API:

This should cover everything except bans and mod actions (not available in the API due to a bug), pool versions (skipped it out of laziness), and posts (OP already has that covered). I haven't automated this though so everything's about a week or two out of date at this point.

Crossposting forum #122158 here for reference:

Missing cosplay tags (*_(cosplay) -cosplay)

BigQuery: https://bigquery.cloud.google.com/savedquery/657582419813:dbddc4af3565484c9773d076744061be
Spreadsheet: https://docs.google.com/spreadsheets/d/1UXnVGTQ6DUkMh5elzWaEf7luan3MxYXLGIHVb-EzJWw/edit?usp=sharing.

SELECT
  CONCAT("post #", STRING(id)),
  CONCAT("http://danbooru.me/posts/", STRING(id))
FROM [danbooru-data:danbooru.posts]
WHERE
  id     IN (SELECT id FROM [danbooru-data:danbooru.posts] WHERE REGEXP_MATCH(tags.name, r'.*_\(cosplay\)')) AND
  id NOT IN (SELECT id FROM [danbooru-data:danbooru.posts] WHERE tags.name = 'cosplay') AND
  TRUE

@Allynay, the posts table is still updating every 24 hours, but it seems to be getting only a random subset of the posts each time. Today it has 1,419,809 rows, about half what it should, and over the last few days it's ranged from 400,000 to 1,600,000. Do you know what's going on? (Thank you for all your work, regardless.)

norainu said:

@Allynay, the posts table is still updating every 24 hours, but it seems to be getting only a random subset of the posts each time. Today it has 1,419,809 rows, about half what it should, and over the last few days it's ranged from 400,000 to 1,600,000. Do you know what's going on? (Thank you for all your work, regardless.)

Thanks for letting me know, I'll have a look tonight and see what's going on. I suspect the script I'm running is crapping out early for some reason.

Okay, figured it out. For some reason there's a post that had a tag that was in the posts table but not in the tags table. I assumed that would never happened so the script was written to shit itself if it did, since it needed the tags table for the tag ID and category.

If a post is found with a tag that was missing from the tags table, it will now have a tag with ID 0, its original name and category -42.

For reference, the bad posts and tags were: 2351218, 1858377 and 2540508 with anila_(granblue_fantasy)), idunn_(p&d) and mika respectively.

The table should be kept up to date now. Sorry it took so long to respond.

@Allynay is this resource still available? When I try to access the dump, I get a message

Unable to find table: danbooru-data:danbooru.posts

Thanks.

kevo said:

@Allynay is this resource still available? When I try to access the dump, I get a message

Thanks.

Yep, it's still there. Can you share the query you're trying to run? Maybe your query is the newer (not Legacy) SQL? If you're using the new SQL you need to use "`danbooru-data.danbooru.posts`" instead.

I'm new to BigQuery and had the same problem as kevo (red banner at the top of the page saying "Unable to find table: danbooru-data:danbooru.posts").

I resolved it by going to the Google APIs dashboard (https://console.developers.google.com/apis/dashboard) (you may need to create a dummy project to be able to access it) then clicking "Enable APIs and services" and enabling the BigQuery API.

I hope this helps anyone else who may have trouble :)