Danbooru

Tags vs score: Analysis


Inspired by discussion in the approval changes thread, I wondered if some tags really are worth more points than others. Fortunately that can be answered with math! I took all posts from the last 30 days, found the tags that appeared on at least 100 of them, and ran a linear regression against score (capped at 10). That gave me values for each tag which are roughly the number of points of score that you would expect a post to gain from having that tag.
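For readers who want to try this themselves, here is a minimal sketch of that kind of regression. It assumes ordinary least squares on 0/1 tag indicators with a single baseline term (the results below use one baseline per rating instead), and the posts, the helper names `solve` and `fit_tag_values`, and the tiny dataset are all made up for illustration. It is not the software actually used for the numbers in this post.

```python
def solve(a, b):
    """Solve the square linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_tag_values(posts, tags, cap=10):
    """Least-squares fit of capped score on 0/1 tag indicators plus a baseline."""
    rows = [[1.0] + [1.0 if t in post_tags else 0.0 for t in tags]
            for post_tags, _ in posts]
    y = [float(min(score, cap)) for _, score in posts]
    k = len(tags) + 1
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    return dict(zip(["(baseline)"] + list(tags), solve(xtx, xty)))


# Made-up posts: (set of tags, score)
posts = [
    (set(), 2),
    ({"mittens"}, 4),
    ({"comic"}, 1),
    ({"mittens", "comic"}, 3),
]
values = fit_tag_values(posts, ["mittens", "comic"])
for name, value in values.items():
    print(f"{name:12s} {value:+.3f}")
```

On this toy data the fit recovers a baseline of 2, with mittens worth +2 and comic worth -1. Real data would need one column per qualifying tag, which is where the memory cost comes from.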

It turns out that posts with some tags really do get much higher average scores than others...

Some selected results:

Copyrights
dungeon_ni_deai_wo_motomeru_no_wa_machigatteiru_darou_ka 2.547 but see hestia, below
mahou_shoujo_madoka_magica 2.519
love_live!_school_idol_project 2.118
touhou 1.645
league_of_legends 1.515
hibike!_euphonium 1.259
kantai_collection 0.774
original 0.451
street_fighter -1.068
jojo_no_kimyou_na_bouken -1.552

Characters
tatara_kogasa 1.543
cirno 1.338
tenryuu_(kantai_collection) 1.331
mutsu_(kantai_collection) 1.148
flandre_scarlet 1.144
murakumo_(kantai_collection) 1.115
shibuya_rin 1.037 most popular character not from touhou or kancolle
izayoi_sakuya -1.064
tifa_lockhart -1.122
akashi_(kantai_collection) -1.500
northern_ocean_hime -1.601
hestia_(danmachi) -2.441

General tags
motor_vehicle 1.660 but see vehicle, below
fellatio 1.398
mittens 1.362 what
yuri 1.243
animated 1.209
girl_on_top 1.124
hair_bun 1.101
jpeg_artifacts 0.901 maybe only high-scoring posts get tagged with it?
nude 0.884
bondage 0.535
piercing -0.860
vehicle -0.970
futanari -1.014
nose -1.028 I don't even
couple -1.041
bdsm -1.049
genderswap -1.200
tentacles -1.272
sharp_teeth -1.565
comic -1.705

Artists
matasabuyarou -2.305 the only artist worth more than a point in either direction

Baseline points by rating
rating:q 2.709
rating:e 2.254
rating:s 2.185

Conclusion: Tatara Kogasa wearing mittens on top of Cirno in a motor vehicle would score a lot of points.

Updated

Thanks for crunching numbers.

lkjh098 said:
Artists
matasabuyarou -2.305 the only artist worth more than a point in either direction

This was kind of unexpected. I'd have expected popular artists to have somewhat inflated scores.
Maybe this is an artifact of your 100posts/30days requirement (thereby omitting less popular artists anyway) or capping the score at 10?

e.g.: hammer_(sunset_beach) date:2015-06-13..2015-07-14 order:score <- over half of this has score >= 10.

Just make sure to remember the warning that correlation is not causation, and that there can be confounding variables.

NWF_Renim said:

Just make sure to remember the warning that correlation is not causation, and that there can be confounding variables.

Absolutely. You can see that pretty clearly from the score jpeg_artifacts got. Linear regression without regularization also isn't very good with strongly correlated variables, which you can see in the scores for hestia and danmachi.
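To illustrate the correlated-variables problem, here is a small sketch with two made-up tags that almost always appear together. Everything in it is an assumption for illustration: the tag names `copyright_x` and `character_y`, the scores, the helper names, and the choice of a ridge penalty as the regularizer. The real analysis did not use regularization.

```python
def solve(a, b):
    """Solve the square linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(posts, tags, ridge=0.0):
    """Least squares on 0/1 tag indicators; ridge > 0 shrinks tag values."""
    rows = [[1.0] + [1.0 if t in post_tags else 0.0 for t in tags]
            for post_tags, _ in posts]
    y = [float(score) for _, score in posts]
    k = len(tags) + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    for i in range(1, k):
        xtx[i][i] += ridge  # penalise the tag coefficients, not the baseline
    return dict(zip(["(baseline)"] + list(tags), solve(xtx, xty)))


tags = ["copyright_x", "character_y"]
# Hypothetical data: the character appears on all but one post of the copyright.
posts = ([(set(), 3)] * 10
         + [({"copyright_x"}, 8)]
         + [({"copyright_x", "character_y"}, 5)] * 10)

plain = fit(posts, tags)              # copyright_x +5, character_y -3
shrunk = fit(posts, tags, ridge=1.0)  # both values pulled toward zero
print(plain)
print(shrunk)
```

Without the penalty, the single post that has the copyright but not the character decides both values, so they come out large and opposite in sign even though posts with both tags score only 2 points above baseline. With the penalty the two values shrink to roughly +2.2 and -0.4.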

lkjh098 said:

Absolutely. You can see that pretty clearly from the score jpeg_artifacts got. Linear regression without regularization also isn't very good with strongly correlated variables, which you can see in the scores for hestia and danmachi.

Generally, jpeg artifact only gets thrown on an image after the parent image gets uploaded and the first image gets parented. I honestly don't use the tag unless the artifacts are really severe.

r0d3n7z said:

This was kind of unexpected. I'd have expected popular artists to have somewhat inflated scores.
Maybe this is an artifact of your 100posts/30days requirement (thereby omitting less popular artists anyway) or capping the score at 10?

It looks like matasabuyarou was the only artist that hit the 100 post requirement in the sample.

lkjh098 said:
It looks like matasabuyarou was the only artist that hit the 100 post requirement in the sample.

Hmm, yeah, and only because one user has been on a bit of a spree uploading his older works.
100 posts in the last 30 days isn't a realistic cutoff for artists, now that I've paused to think about it. Without older works, even a prolific artist at one upload a day (e.g. mizuki_hitoshi) would take three months to get to 100 posts. If you're re-running this for artists, I think 100 posts is still a reasonable cutoff for statistical purposes, but maybe set the time window to the past year?
Or you could drop the time requirement altogether and just use all the posts of the artists with the highest post counts; there are only about 240 artist tags with over 500 posts anyway (remember to exclude banned artists).

I can bump it up to 60 days, but not much higher. At 90 days my math software runs out of memory and crashes.

Welp, I forgot that it's not just crunching the artist tags, but you'd have to do the analysis vs. the general population of posts in order to derive the expected delta score gain/loss due to having a particular tag. So yeah, forget what I said if it's not practicable.

lkjh098 said:

Inspired by discussion in the approval changes thread, I wondered if some tags really are worth more points than others. Fortunately that can be answered with math! I took all posts from the last 30 days, found the tags that appeared on at least 100 of them, and ran a linear regression against score (capped at 10). That gave me values for each tag which are roughly the number of points of score that you would expect a post to gain from having that tag.

This would only work if the tags are independent, right? If two or more tags (among those meeting the 100-posts-in-30-days criterion) are present in the same image, and the presence of one tag depends on another related tag, wouldn't we need to run covariance tests like MANCOVA? Or at least some type of t-test between co-occurring tags?

Moving here from topic #11837

Now that DB dump is available, here's more data: http://puu.sh/j0fCc.zip

For all data, only posts from July 07, 2015 or older are counted. Deleted posts ARE included.

peer_groups.csv shows total post count per peer group.

score_histogram.csv and favcount_histogram.csv show the distribution of scores and favcounts per peer group. Only peer groups with 500+ posts are included. No graphs are included: they take up a lot of space and are easy to create in Excel/OpenOffice or pretty much any analytical or math software. Now, if someone could tell me how to calculate percentiles and harmonic means correctly, I'll be able to get the "percentile per peer group per approver for certain dates" report running on my machine, to see the results.

I have thought about this for a while now too.

Some copyrights or even obscure tag descriptions just seem to have a stronger association to higher quality art.

Ok, math failed me on this one. I've resorted to the most common meaning and calculated "percentile" as "percent of images that have a lower score than this one". However, since all histograms are skewed toward zero, and for some peer groups zero is the lowest score, it's entirely possible for a post to legitimately be in the 0th percentile: there are no posts with scores below zero, so 0/total = 0.

It's no big deal, but then harmonic mean kicks in. If I go with wiki, it's N/(1/x1+1/x2+...+1/xN). The thing is, it's meant for positive numbers only. I learned about this the hard way - after half-an-hour of processing I got "mean percentile" of 150 zeros and three 85s to be over 4000.

So, which one do I fix, and how exactly? :3 I can treat 0s as 1s for harmonic mean calculation, but that's not exactly correct.
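For reference, a small sketch of why zeros break this. The formula N/(1/x1+...+1/xN) divides by each value, so it is undefined as soon as any value is zero. The `buggy_harmonic_mean` below is only a guess at one way a result "over 4000" could come out (zeros skipped in the sum but still counted in N); it is a hypothetical reconstruction, not the actual code that was run.

```python
def harmonic_mean(values):
    """N / (1/x1 + ... + 1/xN); only defined when every value is positive."""
    if any(v <= 0 for v in values):
        raise ValueError("harmonic mean needs strictly positive values")
    return len(values) / sum(1.0 / v for v in values)


def buggy_harmonic_mean(values):
    """Hypothetical bug: zeros are skipped in the sum but still counted in N."""
    return len(values) / sum(1.0 / v for v in values if v != 0)


percentiles = [0] * 150 + [85] * 3
print(buggy_harmonic_mean(percentiles))        # 153 / (3/85), about 4335

# Treating the zeros as ones gives a defined result, dominated by the low values.
as_ones = [max(p, 1) for p in percentiles]
print(harmonic_mean(as_ones))                  # about 1.02
```

Either way, the harmonic mean is pulled hard toward the smallest values, so how the bottom of the percentile scale is defined matters a lot more than it does for the arithmetic mean.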

Updated

Looking further, I'll try the "nearest rank" method for percentile calculation. I will also ceil non-integer percentiles instead of flooring them. This way the percentile distribution will be 0 < P <= 100, integer, which is good for the harmonic mean. I will also compute the arithmetic mean at the same time, just to check how well that works.
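A minimal sketch of what that could look like, under one reading of the plan: the percentile of a post is the share of its peer group scoring at or below it, rounded up to an integer. The function names and the scores are made up for illustration, and the sketch assumes each post is itself a member of the peer group, so the result is never 0.

```python
def percentile_rank(group_scores, score):
    """Integer percentile in 1..100: share of the group scoring <= score, rounded up."""
    n = len(group_scores)
    at_or_below = sum(1 for s in group_scores if s <= score)
    return (100 * at_or_below + n - 1) // n  # integer ceiling division


def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)


def arithmetic_mean(values):
    return sum(values) / len(values)


peer_group = [0, 0, 0, 1, 2, 5, 9, 10]  # made-up scores of one peer group
approver_posts = [0, 5, 10]             # made-up scores of one approver's posts in it

ranks = [percentile_rank(peer_group, s) for s in approver_posts]
print(ranks)                   # [38, 75, 100]
print(arithmetic_mean(ranks))  # 71.0
print(harmonic_mean(ranks))    # about 60.4, lower because of the one weak post
```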

Also, regarding peer groups, shouldn't we also have groups with negated tags? Specifically for the comic tag, since no other general tags are currently counted. There would then be groups like rating:s -comic.

Ok, here's some interesting data: http://puu.sh/j1ugz.csv

It's the harmonic and arithmetic means of favcount and score percentiles, as well as post count, per peer group per approver, for posts between Jan 1, 2015 and Jul 7, 2015. Basically, it's the approver performance report we discussed in topic #11837, for the last half year.

Similar data is currently calculated for non-contributor and contributor users. Somebody should make pretty diagrams out of it :3

Updated

http://puu.sh/j1wGx.zip - here's the same data per uploader instead of approver, one file ("members") for posts approved by someone, one ("contributors") for auto-approved posts.

Now that I think about it, deleted posts from regular users will also appear in contributors.csv... but those should be easy to filter out based on user ID, anyway.

E: though, no. 2/3 of contributors.csv is not, in fact, contributors. I will redo that one.

Not entirely sure what the numbers mean. Does a harmonic mean for score percentile with a result of, say, 92 mean that the average post of x tag by y user places in the 92nd percentile of score?

CodeKyuubi said:

Not entirely sure what the numbers mean. Does a harmonic mean for score percentile with a result of, say, 92 mean that the average post of x tag by y user places in the 92nd percentile of score?

Yeah. It seems to mean that the average post of x tag by y user has a score greater than or equal to that of 92% of other posts with x tag.

Type-kun said:

Ok, math failed me on this one. I've resorted to the most common meaning and calculated "percentile" as "percent of images that have a lower score than this one". However, since all histograms are skewed toward zero, and for some peer groups zero is the lowest score, it's entirely possible for a post to legitimately be in the 0th percentile: there are no posts with scores below zero, so 0/total = 0.

It's no big deal, but then harmonic mean kicks in. If I go with wiki, it's N/(1/x1+1/x2+...+1/xN). The thing is, it's meant for positive numbers only. I learned about this the hard way - after half-an-hour of processing I got "mean percentile" of 150 zeros and three 85s to be over 4000.

So, which one do I fix, and how exactly? :3 I can treat 0s as 1s for harmonic mean calculation, but that's not exactly correct.

I'd suggest treating "percentile" as "percent of images that have an equal or lower score than this one". Basically, you give the post scores the benefit of the doubt by saying "this post is at least as good as x% of others" rather than requiring it to be "strictly better than". This way, even the lowest-scoring post in a peer group will have a non-zero percentile (even that one lone post with score -116 in * would be at a minuscule, but non-zero, 1/2029669 = 4.92691173e-7, i.e. the 0.0000492691173th percentile).
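A quick sketch of the difference between the two definitions, on made-up scores (the function names are just for illustration):

```python
def percentile_strict(scores, x):
    """Percent of posts with a score strictly lower than x."""
    return 100.0 * sum(1 for s in scores if s < x) / len(scores)


def percentile_inclusive(scores, x):
    """Percent of posts with a score equal to or lower than x."""
    return 100.0 * sum(1 for s in scores if s <= x) / len(scores)


scores = [0, 0, 0, 1, 2, 5, 9, 10]  # made-up peer group

print(percentile_strict(scores, 0))      # 0.0, which breaks the harmonic mean
print(percentile_inclusive(scores, 0))   # 37.5
print(percentile_inclusive(scores, 10))  # 100.0
```

With the inclusive version every post counts itself, so the lowest possible value is 100/N rather than 0.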

-

Type-kun said:
I will also ceil non-integer percentiles instead of flooring them. This way the percentile distribution will be 0 < P <= 100, integer, which is good for the harmonic mean. I will also compute the arithmetic mean at the same time, just to check how well that works.

Hang on... why would integers be any better for harmonic mean? It's all numbers anyway.

Percentages are rates, i.e. fractions. e.g. 86% should be treated as 0.86. Now, numerically it doesn't make a difference to the result whether you plug 86 or 0.86 into the harmonic mean computation, but when you use ceil, that's consistently rounding up nearly all of your numbers -- you're not likely to get perfect integers anyway.

... actually, never mind, I paused to think about it, and this probably just means the final output is at worst going to be inflated by up to 1%, which is tolerable, I think. We're not doing mission-critical science here. :p
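A quick numeric check of both points, on made-up percentiles. The scale factor does carry straight through the harmonic mean, and rounding up moves the arithmetic mean by less than one point. One caveat worth knowing: the harmonic mean is dominated by its smallest values, so when a group contains a very small percentile, rounding it up to 1 can move the harmonic mean by much more than one point.

```python
import math


def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)


def arithmetic_mean(values):
    return sum(values) / len(values)


# Scale: using 86 and 43 or 0.86 and 0.43 only changes the result by the same factor.
print(harmonic_mean([86, 43]), 100 * harmonic_mean([0.86, 0.43]))

# Made-up example: one post at the very bottom of a large peer group, 49 at the top.
raw = [0.01] + [100.0] * 49
rounded = [math.ceil(p) for p in raw]

print(arithmetic_mean(rounded) - arithmetic_mean(raw))  # about 0.02
print(harmonic_mean(raw), harmonic_mean(rounded))       # about 0.5 vs about 33.6
```

For groups without such extreme low values the effect of rounding up stays small.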

-

Type-kun said:
Also, regarding peer groups, shouldn't we also have groups with negated tags? Specifically for the comic tag, since no other general tags are currently counted. There would then be groups like rating:s -comic.

Sure, in cases where it makes sense to do it. (e.g. you probably wouldn't negate a copyright like -touhou)

-

Type-kun said:
DATA DATA DATA

cool, thanks!

also thanks to albert for providing the data dump!
