Danbooru

Tags vs score: Analysis


CodeKyuubi said:

Not entirely sure what the numbers mean. Does a harmonic mean for score percentile with a result of, say, 92 mean that the average post of x tag by y user places in the 92nd percentile of score?

Yes; to be precise, it places in the 92nd percentile of score compared to other posts with x tag(s).

Harmonic mean is just a different kind of "average" than the arithmetic mean, which is what is usually intended when we say "average" (colloquially speaking). However, the harmonic mean is more appropriate here because we are "averaging" percentages, which are rates.

Harmonic mean has the benefit of rewarding consistency. Compare the following simplified examples:

95%, 60%
Arithmetic mean: 77.5%
Harmonic mean: 73.548%

80%, 75%
Arithmetic mean: 77.5%
Harmonic mean: 77.419%

If you use the arithmetic mean, the two examples are indistinguishable. Using the harmonic mean, the more consistent performer does better.
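
(For anyone who wants to reproduce the numbers above, Python's statistics module has both means built in:)

[code]
# Reproduce the two examples above with Python's statistics module.
from statistics import harmonic_mean, mean

for sample in ([95, 60], [80, 75]):
    print(sample,
          f"arithmetic={mean(sample):.3f}%",
          f"harmonic={harmonic_mean(sample):.3f}%")
[/code]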

Type-kun said:
score_histogram.csv

Quickly plotted a few score histograms; they generally exhibit Zipf's law behavior (albeit in both positive and negative directions)
...I'm not really surprised to see this, considering it's essentially a kind of rank data.

Basically, post score is the result of three processes -- each user considers:

  • do I like the post enough to fave it?
  • do I like the post enough to vote it up?
  • do I dislike the post enough to vote it down?
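
As a sanity check on the Zipf-like shape, here's a toy simulation of those per-viewer decisions. All the parameters are made up; the point is only that independent up/down vote decisions over a skewed per-post appeal distribution produce the same kind of long tails in both directions:

[code]
# Toy simulation of the per-viewer vote decisions (hypothetical parameters,
# not fitted to real data). Faving, the third decision, would drive favcount
# the same way and is omitted here.
import numpy as np

rng = np.random.default_rng(0)
n_posts, n_viewers = 100_000, 50

appeal = rng.lognormal(mean=-3.0, sigma=1.2, size=n_posts)  # skewed per-post appeal
p_up = np.clip(appeal, 0, 1)            # P(a viewer likes it enough to vote up)
p_down = np.clip(0.03 - appeal, 0, 1)   # P(vote down): rare, worst posts only

score = rng.binomial(n_viewers, p_up) - rng.binomial(n_viewers, p_down)

values, counts = np.unique(score, return_counts=True)
for v, c in zip(values, counts):        # heavy-tailed on both sides
    print(v, c)
[/code]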

Type-kun said:
It's harmonic and arithmetic means for favcount and score percentiles, as well as post count, per peer group per approver, for posts between Jan 1, 2015 and Jul 7, 2015. Basically, it's that approver performance report we discussed in topic #11837, for the last half-year.

This is already very interesting and usable data.

Additional processing:

  • remove all rows with post count less than 10 (arbitrary threshold)
  • tweak the sort order so that, within each approver, rows are sorted by:
    • everything (*)
    • rating
    • comic?
    • comic? + rating
    • copyright
    • copyright + rating
    • copyright + comic? (@Type-kun, these are all missing for some reason?)
    • copyright + comic? + rating
  • make pretty colors

https://www.dropbox.com/s/mt16f31e56inai4/approver_condensed.xls?dl=0
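
For reference, the filtering and sort tweaks are easy to script; a pandas sketch, assuming the report was exported to CSV with approver/peer_group/group_type/count columns (the names are hypothetical):

[code]
# Sketch of the post-processing above, over hypothetical column names.
import pandas as pd

df = pd.read_csv("approver_report.csv")  # approver, peer_group, group_type, count, ...
df = df[df["count"] >= 10]               # drop rows with fewer than 10 posts

# Fixed ordering of peer group types within each approver:
order = ["*", "rating", "comic", "comic+rating", "copyright",
         "copyright+rating", "copyright+comic", "copyright+comic+rating"]
df["rank"] = pd.Categorical(df["group_type"], categories=order, ordered=True)
df = df.sort_values(["approver", "rank", "peer_group"])
[/code]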

-

Some interesting examples:

-

Approver | Peer group | Count | Score %ile (harmonic) | Score %ile (arithmetic) | Favcount %ile (harmonic) | Favcount %ile (arithmetic)
11672 | love_live!_school_idol_project | 148 | 9 | 23 | 8 | 21
11672 | love_live!_school_idol_project rating:s | 143 | 11 | 24 | 8 | 21
11672 | love_live!_school_idol_project comic rating:s | 95 | 61 | 65 | 53 | 57
13392 | k-on! | 83 | 12 | 30 | 6 | 17
13392 | k-on! rating:s | 82 | 12 | 31 | 6 | 18
13392 | k-on! comic rating:s | 56 | 35 | 36 | 21 | 24


Derived from score_histogram.csv, here are post-score-to-percentile lookup tables for all posts, posts grouped by rating, and posts grouped by copyright (see the different sheets).

https://www.dropbox.com/s/5gaf1x0uopg6t1o/score_percentile_lookup_condensed.xls?dl=0
(Edit: stupid mistakes in earlier version have been fixed)

The main parts of the data are:

  • peer group
  • cumulative count
  • cumulative percentage (percentile)
  • post score that the count/percentile corresponds to
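
These columns fall out of the raw histogram in a couple of lines of pandas; a sketch, assuming score_histogram.csv has one peer_group/score/count row per score (the column names are my guess):

[code]
# Cumulative counts and percentiles per peer group (hypothetical columns).
import pandas as pd

hist = pd.read_csv("score_histogram.csv").sort_values(["peer_group", "score"])
grp = hist.groupby("peer_group")["count"]
hist["cum_count"] = grp.cumsum()
hist["percentile"] = 100 * hist["cum_count"] / grp.transform("sum")
[/code]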

I did a little bit of work to merge together scores that had too few posts to be significant.

  • %diff is the difference in percentile from one row to the next
  • these are color coded at thresholds of >=0.5% (red), >=1% (yellow) and >=5% (green)
  • due to the long tails on either side of the histogram, there were originally many rows where %diff<0.5%
  • moving outward from the center, whenever I encountered a row that had %diff<0.5%, I would merge it with the next one, etc., until the threshold of 0.5% was reached (or until I ran out of rows to merge with; a code sketch follows this list)
  • the region of consecutive post scores colored green are rows that were kept intact; the other rows were created by merging.
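
Here's what that merge pass looks like in code, for one half of the histogram (a sketch of the procedure described above, not my actual spreadsheet work):

[code]
# Merge adjacent rows until each bucket covers at least `threshold` percent.
# `rows` is a list of (score, pct_diff) pairs ordered from the center outward.
def merge_rows(rows, threshold=0.5):
    merged, lo, hi, acc = [], None, None, 0.0
    for score, pct in rows:
        if lo is None:
            lo = score
        hi, acc = score, acc + pct
        if acc >= threshold:
            merged.append((lo, hi, acc))  # bucket spans scores lo..hi
            lo, acc = None, 0.0
    if lo is not None:                    # ran out of rows below threshold
        merged.append((lo, hi, acc))
    return merged
[/code]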

-

What this suggests

  • Again, we ought to be cautious when drawing conclusions from percentiles. Particularly in the region of around 50% or less, a difference in score of just 1 will often be reflected as a percentile difference of 10~15%, or even more in some cases.
    • Visualization of percentile data should reflect these huge jumps where possible, to prevent people from jumping to incorrect conclusions
  • Rather than fixed width intervals (0~5%, 5~10%, 10~15%, etc) we should probably consider flexible brackets based on what the data shows
  • Some tags are rather useless for analysis because they don't have enough variance in score, e.g.:


r0d3n7z said:

colored histogram

I don't think I'm understanding the graphs, unless you meant less than or equal to (<=) rather than greater than or equal to (>=)?

CodeKyuubi said:
I don't think I'm understanding the graphs, unless you meant less than or equal to (<=) rather than greater than or equal to (>=)?

Yup I caught that immediately after posting; it's been fixed - try refreshing.

I added uploaders to my analysis and ran it over 60 days of data. I've removed the names for anonymity. Tags and rating are still included, so these are the expected score differences after normalizing for those. These may not be very informative if there are strong enough correlations between user and tags (for example, the two users who uploaded over 80% of the jojo_no_kimyou_na_bouken posts got terrible score factors).
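
The exact model isn't described in the thread, but one simple way to get this kind of per-uploader factor is a ridge regression over one-hot tag/rating/uploader features; a toy sketch (not necessarily the method actually used):

[code]
# Per-uploader score factors after controlling for tags and rating, via
# ridge regression over one-hot features (a sketch, not the thread's method).
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import Ridge

posts = [  # toy rows
    {"tags": ["k-on!", "comic"], "rating": "s", "uploader": "A", "score": 4},
    {"tags": ["k-on!"], "rating": "q", "uploader": "B", "score": 1},
]

features = [{**{f"tag:{t}": 1 for t in p["tags"]},
             f"rating:{p['rating']}": 1, f"user:{p['uploader']}": 1}
            for p in posts]
vec = DictVectorizer()
X = vec.fit_transform(features)
y = [p["score"] for p in posts]

model = Ridge(alpha=1.0).fit(X, y)
for name, coef in zip(vec.get_feature_names_out(), model.coef_):
    if name.startswith("user:"):
        print(name, round(coef, 2))  # expected score difference per uploader
[/code]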

lkjh098 said:
added uploaders to my analysis
these are the expected score difference after normalizing for tags and rating

So, it's about as good a "content-agnostic, pure-artistic-quality-based" measure as we might be able to get with a fairly simple technical approach? As far as attempting to compensate for certain copyrights or tags that inflate/deflate post scores goes.

Knowing that the median post score across all posts on the site is 3, those score differences look pretty okay for the most part. +/-2 basically puts most uploads in the score:1..5 range, which is pretty much within the "normal/acceptable" range, with -2 maybe just barely borderline.
score | spans these percentiles
1 | 16.7 ~ 30.9
2 | 30.9 ~ 43.8
3 | 43.8 ~ 54.8
4 | 54.8 ~ 63.7
5 | 63.7 ~ 70.9
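
Those spans fall straight out of the cumulative table: a score's bracket runs from the previous score's cumulative percentile up to its own. A sketch using the numbers above:

[code]
# Percentile span for a score, from site-wide cumulative percentiles
# (score 0's cumulative 16.7 is the lower bound of score 1's span above).
cumulative = {0: 16.7, 1: 30.9, 2: 43.8, 3: 54.8, 4: 63.7, 5: 70.9}

def percentile_span(score):
    return cumulative.get(score - 1, 0.0), cumulative[score]

print(percentile_span(3))  # (43.8, 54.8) -- contains 50, hence the median score of 3
[/code]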

Here's a more visual look at the score distributions by rating and by copyright, using percent-stack bars:

https://www.dropbox.com/s/jtmw9gdmjopz7ak/score_percent_stack.png?dl=0

It's color-coded by score. I left the high/low extremes in shades of gray.
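
(For anyone who wants to regenerate that image: once the histogram is pivoted, a percent-stack is a one-liner in pandas/matplotlib. A sketch, again assuming peer_group/score/count columns:)

[code]
# Percent-stacked score distribution per peer group (hypothetical columns).
import pandas as pd
import matplotlib.pyplot as plt

hist = pd.read_csv("score_histogram.csv")  # peer_group, score, count
pivot = hist.pivot_table(index="peer_group", columns="score",
                         values="count", aggfunc="sum", fill_value=0)
pct = pivot.div(pivot.sum(axis=1), axis=0) * 100  # each row sums to 100%

pct.plot(kind="barh", stacked=True, colormap="RdYlGn", legend=False)
plt.xlabel("% of posts")
plt.tight_layout()
plt.show()
[/code]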

-

Actually, I take back what I said about some tags being useless for analysis because they don't have enough variance in score. They're still informative, insofar as they tell us that posts with those tags tend to have lower scores than posts in general, so an approver/uploader shouldn't be penalized for low scores if such tags were the cause.

As much as it'd be fantastic if everything were distributed like love_live!_school_idol_project -- huge range with a nice, almost even distribution in score -- the fact is that most tags, and all posts in general, are distributed such that it is really hard to use score as a basis to distinguish between a) posts that are just mediocre or slightly subpar versus b) outright bad ones. score:0 usually (with notable exceptions) places around the 15th to 20th percentile, so I don't think most folks would be comfortable concluding across the board that posts with score:0 are terribad and don't deserve to be on the site.

Posts with negative scores are generally the bottom 1% to 5%, but they're too rare to be truly useful. I don't think users care too much about going out of their way to vote down posts that they don't like, and besides, you can just blacklist stuff that you don't want to see anyway.

So the proposed percentile-vs-peers approach probably isn't going to help us do a much better job of detecting low quality, compared to what we have at the moment.

What the distribution data is truly useful for, I think, is refining how we set the bar for good- to high-quality posts. Right now, the janitor trial report uses a score:3+ threshold, presumably because 3 is the median post score. It's asking, "what fraction of the approved posts are kinda-sorta better than at least half of all other posts?" But it's clear that depending on rating/copyright, the threshold for "half of posts" could be at anything from score:0 (jojo) to score:6 (lovelive).
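
Mechanically, that per-peer-group threshold would just be the first score whose cumulative percentile reaches 50; a sketch against a lookup table like the one built earlier (column names are my assumption):

[code]
# First score at or above the 50th percentile for a peer group, given a
# cumulative lookup table with peer_group/score/percentile columns
# (hypothetical names), sorted ascending by score.
def median_score(lookup, peer_group):
    rows = lookup[lookup["peer_group"] == peer_group]
    return rows.loc[rows["percentile"] >= 50, "score"].iloc[0]

# Per the thread: median_score(lookup, "jojo_no_kimyou_na_bouken") would come
# out around 0, while median_score(lookup, "love_live!_school_idol_project")
# would come out around 6.
[/code]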

This is where percentile-vs-peers would really shine, because a percentile of over 50~60% after taking into account inflated/deflated post scores would be a lot more meaningful than a single flat cutoff across the board.

I have to rethink my idea for graphical visualization, though, because it would have displayed using fixed width intervals, but I'm now convinced that that would not be a good representation.

Okay, I finally got around to mocking up my ideas for visually displaying the percentile-vs-peers data. I'm sure there are a bunch of js/css improvements that could be done to improve UI usability, jazz it up further, etc., but this should demonstrate the most important things for now. Also, I just copied the same data three times because I'm lazy. This uses some of the preliminary data from an unspecified user, but I also made up some of the numbers in there (because they weren't in Type-kun's aggregated data), so please don't actually draw conclusions from this.

Here is the mockup/demo: http://jsfiddle.net/qcp8ht1p/embedded/result/

-

How to use/read/interpret

In initial view

  • list of users, summary statistics
  • number of approvals (for context)
  • "average" quality - harmonic mean of percentile-vs-* of all posts
  • the score that it corresponds to, and the percentile range of that score
  • summary view for by-rating / by-comic
    • narrow black bar is the overall average quality from before (62% in this case)
    • horizontal bars for each peer group:
    • pips (fixed @ 10px wide) positioned according to average quality within the peer group
      • color corresponds to value 0~100%
      • opacity corresponds to number of posts (e.g. number of rating:s as a fraction of *; see the sketch after this list)
      • because opacity may be very low, the pips have solid left/right borders for visibility
    • mouseover horizontal bars will display detailed information in tooltip
  • summary view for by-copyright
    • twenty colored boxes corresponding to 5% intervals
    • opacity increases with the number of posts falling into that interval, when evaluated against copyright peer groups
    • mouseover each box will display detailed information in tooltip
      • total number of posts in that interval and % as fraction of *
      • information about each copyright peer group contributing to that bucket
    • using fixed intervals is not ideal, but I couldn't think of anything better for a fast summary
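
To make the visual encoding concrete, here's one plausible reading of the pip styling in code (the actual mockup is js/css; the red-to-green hue scale is my assumption):

[code]
# One plausible mapping for a pip: hue tracks the 0~100% value, opacity
# tracks the peer group's share of the user's posts. (Assumed scheme; the
# actual mockup's css may differ.)
def pip_style(value_pct, post_fraction):
    hue = 120 * value_pct / 100  # 0 = red ... 120 = green
    return f"background: hsla({hue:.0f}, 80%, 45%, {post_fraction:.2f});"

print(pip_style(62, 0.40))  # pip for a 62% average over 40% of the user's posts
[/code]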

Clicking the "+" beside the username expands detail view

  • first item is a repeat of data for *, but now with a visual representation
    • the average quality is displayed as a pip along a horizontal bar; number is shown
    • percentile range is displayed as the colored section of the bar behind the pip (this will be more evident in later examples)
    • colors correspond to value
    • mouseover bar for details in tooltip
    • yes, I'll admit that this part can be incorporated into the user summary row instead of duplicating the same information, but this is just how it turned out as I hacked it together. (Also, easier to ensure vertical alignment of visuals this way...)
  • breakdown into further peer groups: expanded by clicking on the small headers
    • I'm not too happy that the small headers still take up so much space, but this will do to illustrate the point for now

Expanded peer group details

  • same idea; also, the bars line up vertically for ease of comparison
  • the number of posts of the user within the peer group is shown, along with the % as fraction of the parent peer group
  • breakdown is always:
    • rating: s/q/e
    • comic: comic, -comic; each broken down by rating
    • copyright: ordered by post count; each broken down by rating, comic (and comic/rating)
    • some of the data is missing in this example, but you get the idea re: how you can drill down for more detail

[edit]

This might be stating the obvious, but here's a note about evaluating an "average percentile" vs the "percentile range" that it falls into for that peer group.

  • suppose we have a hypothetical tag (foobar or whatever) that is composed entirely of:
    • 25% of posts have score:0 - 25th percentile
    • 50% of posts have score:1 - 75th percentile
    • 25% of posts have score:2 - 100th percentile
  • here's what would happen to someone's average percentile:
    • if they uploaded/approved only posts that got score:1, their average percentile would be exactly 75, and fall in the 25~75 range
    • if mostly score:1 and a bit of score:2, it would be in the lower part of the 75~100 range
    • if mostly score:1 and a bit of score:0, in the upper part of 25~75
    • an average percentile of 50 or less, falling in the mid or lower part of 25~75, actually means there are more score:0 posts than score:1 posts from them!
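
A quick numeric check of those cases, using the arithmetic mean for simplicity:

[code]
# Average percentile for a mix of score:0/1/2 posts in the foobar example.
def avg_percentile(p0, p1, p2):  # fractions of each score; must sum to 1
    return 25 * p0 + 75 * p1 + 100 * p2

print(avg_percentile(0.0, 1.0, 0.0))    # 75.0  -- all score:1
print(avg_percentile(0.0, 0.8, 0.2))    # 80.0  -- lower part of 75~100
print(avg_percentile(0.2, 0.8, 0.0))    # 65.0  -- upper part of 25~75
print(avg_percentile(0.55, 0.4, 0.05))  # 48.75 -- <=50 forces more score:0 than score:1
[/code]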

[/edit]

-

Other things that could be done

  • information that is already on the existing janitor trial report can obviously be included
  • the current report lists quartile and median scores; we could also have quartile and median percentile-vs-peers for each peer group
  • the horizontal bar visualization for each peer group can be turned into a vertical bar...
    • if you run the reports, say, monthly or fortnightly, you could show change over time in a peer group by displaying the vertical bars from left to right
      • remember: it's important to include the post count for context.

-

If you want to muck around with it
http://jsfiddle.net/qcp8ht1p/
(sorry for the horrible mess, I just sorta hacked it together)

