Tags vs score: Analysis

@r0d3n7z: Okay, I finally got around to mocking up my...

2015-07-20T16:22:59-04:00

Okay, I finally got around to mocking up my ideas for visually displaying the percentile-vs-peers data. I'm sure there's a bunch of js/css improvements that could be done to improve ui usability, jazz it up further etc but this should demonstrate the most important things for now. Also, I just copied the same data three times because I'm lazy. This uses some of the preliminary data from an unspecified user, but I also made up some of the numbers in there (because they weren't in Type-kun's aggregated data) so please don't actually draw conclusions from this.

Here is the mockup/demo: http://jsfiddle.net/qcp8ht1p/embedded/result/

How to use/read/interpret

In initial view

list of users, summary statistics
number of approvals (for context)
"average" quality - harmonic mean of percentile-vs-* of all posts
the score that it corresponds to, and the percentile range of that score

summary view for by-rating / by-comic

narrow black bar is the overall average quality from before (62% in this case)
horizontal bars for each peer group:

rating:s, rating:q, rating:e
comic, -comic

pips (fixed @ 10px wide) positioned according to average quality within the peer group

color corresponds to value 0~100%
opacity corresponds to number of posts (e.g. number of rating:s as a fraction of *)
because opacity may be very low, the pips have solid left/right borders for visibility

mouseover horizontal bars will display detailed information in tooltip

summary view for by-copyright

twenty colored boxes corresponding to 5% intervals
opacity increases for number of posts falling into that interval, when evaluated against copyright peer groups
mouseover each box will display detailed information in tooltip

total number of posts in that interval and % as fraction of *
information about each copyright peer group contributing to that bucket

using fixed intervals is not ideal, but I couldn't think of anything else better for a fast summary

Clicking the "+" beside the username expands detail view

first item is a repeat of data for *, but now with a visual representation

the average quality is displayed as a pip along a horizontal bar; number is shown
percentile range is displayed as the colored section of the bar behind the pip (this will be more evident in later examples)
colors correspond to value
mouseover bar for details in tooltip
yes, I'll admit that this part can be incorporated into the user summary row instead of duplicating the same information, but this is just how it turned out as I hacked it together. (Also, easier to ensure vertical alignment of visuals this way...)

breakdown into further peer groups: expanded by clicking on the small headers

I'm not too happy that the small headers still take up so much space, but this will do to illustrate the point for now

Expanded peer group details

same idea; also, the bars line up vertically for ease of comparison
the number of posts of the user within the peer group is shown, along with the % as fraction of the parent peer group
breakdown is always:

rating: s/q/e
comic: comic, -comic; each broken down by rating
copyright: ordered by post count; each broken down by rating, comic (and comic/rating)
some of the data is missing in this example, but you get the idea re: how you can drill down for more detail

[edit]

This might be stating the obvious, but here's a note about evaluating an "average percentile" vs the "percentile range" that it falls into for that peer group.

suppose we have a hypothetical tag (foobar or whatever) that is composed entirely of:

25% of posts have score:0 - 25th percentile
50% of posts have score:1 - 75th percentile
25% of posts have score:2 - 100th percentile

here's what would happen to someone's average percentile:

if they uploaded/approved only posts that got score:1, their average percentile would be exactly 75, and fall in the 25~75 range
if mostly score:1 and a bit of score:2, it would be in the lower part of the 75~100 range
if mostly score:1 and a bit of score:0, in the upper part of 25~75
an average percentile of 50 or less, falling in mid or lower part of 25~75, actually means there are more score:0 posts than score:1 posts from them!

[/edit]
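A minimal sketch of the hypothetical foobar example above, for anyone who wants to play with the numbers. It assumes a plain arithmetic mean for readability (the harmonic mean would pull the results lower), and the names `PERCENTILE` and `average_percentile` are made up for illustration.

```python
from statistics import fmean

# Percentile assigned to each score in the hypothetical "foobar" tag.
PERCENTILE = {0: 25, 1: 75, 2: 100}


def average_percentile(scores):
    """Arithmetic mean of the percentiles of a user's post scores."""
    return fmean([PERCENTILE[s] for s in scores])


only_ones = average_percentile([1] * 10)              # all score:1
ones_and_twos = average_percentile([1] * 8 + [2] * 2)  # mostly score:1, a bit of score:2
ones_and_zeros = average_percentile([1] * 8 + [0] * 2)  # mostly score:1, a bit of score:0
half_zeros = average_percentile([1] * 5 + [0] * 5)    # as many score:0 as score:1

print(only_ones, ones_and_twos, ones_and_zeros, half_zeros)
```

With this arithmetic version, an average of 50 is reached only when score:0 posts are as numerous as score:1 posts, which is the point made in the last bullet.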

Other things that could be done

information that is already on the existing janitor trial report can obviously be included
current report lists quartile and median scores, we could also have quartile and median percentile-vs-peers for each peer group
the horizontal bar visualization for each peer group can be turned into a vertical bar...

if you run the reports, say, monthly or fortnightly, you could show change over time in a peer group by displaying the vertical bars from left to right

remember: it's important to include the post count for context.

If you want to muck around with it
http://jsfiddle.net/qcp8ht1p/
(sorry for the horrible mess, I just sorta hacked it together)

@r0d3n7z: Here's a more visual look at the score...

2015-07-18T02:41:41-04:00

Here's a more visual look at the score distributions by rating and by copyright, using percent-stack bars:

https://www.dropbox.com/s/jtmw9gdmjopz7ak/score_percent_stack.png?dl=0

It's color-coded by score. I left the high/low extremes in shades of gray.

Actually I take back what I said about some tags being useless for analysis because they don't have enough variance in score. They're still informative, insofar as they tell us that posts with those tags tend to score lower than posts in general, so an approver/uploader shouldn't be penalized for low scores if such tags were the cause.

As much as it'd be fantastic if everything were distributed like love_live!_school_idol_project -- huge range with nice, almost even distribution in score -- the fact is that most tags, and all posts in general, are distributed such that it is really hard to use score as a basis to distinguish between a) posts that are just mediocre or slightly subpar versus b) posts that are outright bad. score:0 usually (with notable exceptions) places around the 15th to 20th percentile -- so I don't think most folks would be comfortable concluding across the board that posts with score:0 are terribad and shouldn't deserve to be on the site.

Posts with negative scores are generally the bottom 1% to 5%, but they're too rare to be truly useful. I don't think users care too much about going out of their way to vote down posts that they don't like, and besides, you can just blacklist stuff that you don't want to see anyway.

So the proposed percentile-vs-peers approach probably isn't going to help us do a much better job of detecting low quality, compared to what we have at the moment.

What the distribution data is truly useful for, I think, is refining how we set the bar for good- to high-quality posts. Right now, the janitor trial report uses a score:3+ threshold, presumably because 3 is the median post score. It's asking, "what fraction of the approved posts are kinda-sorta better than at least half of all other posts". But it's clear that depending on rating/copyright, the threshold for "half of posts" could be at anything from score:0 (jojo) to score:6 (lovelive).

This is where percentile-vs-peers would really shine, because a percentile of over 50~60% after taking into account inflated/deflated post scores would be a lot more meaningful than a single flat cutoff across the board.
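To illustrate how the "half of posts" threshold moves with the peer group, here is a rough sketch. The two histograms are invented for the example (they are not taken from score_histogram.csv), and `score_at_percentile` is a hypothetical helper, not something from the existing report.

```python
def score_at_percentile(histogram, target=50):
    """Return the lowest score whose inclusive percentile reaches `target`.

    `histogram` maps score -> number of posts with that score.
    """
    total = sum(histogram.values())
    running = 0
    for score in sorted(histogram):
        running += histogram[score]
        # Integer comparison avoids floating-point rounding at the boundary.
        if 100 * running >= target * total:
            return score
    raise ValueError("no score reaches the requested percentile")


# Made-up peer groups: one piled up at score:0, one spread over a wide range.
groups = {
    "low_scoring_group": {0: 76, 1: 14, 2: 6, 3: 4},
    "high_scoring_group": {0: 5, 2: 10, 4: 15, 6: 25, 8: 25, 10: 20},
}

thresholds = {name: score_at_percentile(h) for name, h in groups.items()}
print(thresholds)
```

A flat score:3+ cutoff would be far too strict for the first group and too lenient for the second, which is the argument for comparing against peers instead.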

I have to rethink my idea for graphical visualization, though, because it would have displayed using fixed width intervals, but I'm now convinced that that would not be a good representation.

@r0d3n7z: > lkjh098 said: > added uploaders to my...

2015-07-18T01:17:48-04:00

lkjh098 said:
added uploaders to my analysis
these are the expected score difference after normalizing for tags and rating

So, as far as compensating for copyrights or tags that inflate/deflate post scores goes, it's about as good a "content-agnostic, pure-artistic-quality-based" measure as we might be able to get with a fairly simple technical approach?

Knowing that the median post score across all posts on the site is 3, those score differences look pretty okay for the most part. +/-2 basically puts most uploads in the score:1..5 range which is pretty much within the "normal/acceptable" range, with -2 just barely maybe borderline.
[table][thead][tr][td]score[/td][td]spans these percentiles[/td][/tr][/thead]
[tr][td]1[/td][td]16.7 ~ 30.9[/td][/tr]
[tr][td]2[/td][td]30.9 ~ 43.8[/td][/tr]
[tr][td]3[/td][td]43.8 ~ 54.8[/td][/tr]
[tr][td]4[/td][td]54.8 ~ 63.7[/td][/tr]
[tr][td]5[/td][td]63.7 ~ 70.9[/td][/tr]
[/table]

@lkjh098: I added uploaders to my analysis and ran it...

2015-07-18T00:09:26-04:00

I added uploaders to my analysis and ran it over 60 days of data. I've removed the names for anonymity. Tags and rating are still included, so these are the expected score difference after normalizing for those. These may not be very informative if there are strong enough correlations between user and tags (for example, the two users who uploaded over 80% of the jojo_no_kimyou_na_bouken posts got terrible score factors).

Level	Score factor
Builder	0.651
Builder	0.381
Builder	0.349
Builder	0.272
Builder	-0.119
Builder	-0.124
Builder	-0.558
Builder	-0.629
Builder	-0.828
Builder	-0.959
Builder	-1.635
Contributor	1.915
Contributor	1.868
Contributor	1.856
Contributor	1.554
Contributor	1.47
Contributor	1.464
Contributor	1.346
Contributor	1.322
Contributor	1.305
Contributor	1.207
Contributor	1.191
Contributor	1.149
Contributor	1.005
Contributor	0.969
Contributor	0.929
Contributor	0.819
Contributor	0.769
Contributor	0.752
Contributor	0.675
Contributor	0.635
Contributor	0.604
Contributor	0.518
Contributor	0.487
Contributor	0.366
Contributor	0.362
Contributor	0.211
Contributor	0.085
Contributor	0.054
Contributor	0.051
Contributor	-0.019
Contributor	-0.074
Contributor	-0.08
Contributor	-0.282
Contributor	-0.291
Contributor	-0.328
Contributor	-0.35
Contributor	-0.391
Contributor	-0.612
Contributor	-0.644
Contributor	-0.801
Contributor	-0.882
Contributor	-0.902
Contributor	-0.959
Contributor	-0.96
Contributor	-1.074
Gold	0.236
Gold	-0.065
Gold	-0.792
Gold	-0.945
Gold	-1.014
Gold	-1.116
Gold	-2.399
Janitor	1.921
Janitor	1.631
Janitor	0.73
Member	0.694
Member	0.339
Member	-0.423
Member	-0.668
Member	-0.674
Member	-0.721
Member	-0.725
Member	-0.789
Member	-0.823
Member	-1.087
Member	-1.405
Member	-1.552
Member	-1.599
Moderator	2.039
Moderator	1.177
Moderator	1.156
Moderator	0.882
Platinum	-0.532

@r0d3n7z: > CodeKyuubi said: > I don't think I'm...

2015-07-17T00:26:56-04:00

CodeKyuubi said:
I don't think I'm understanding the graphs, unless you meant less than or equal to (<=) rather than greater than or equal to (>=)?

Yup I caught that immediately after posting; it's been fixed - try refreshing.

@CodeKyuubi: > r0d3n7z said: > > colored histogram I don't...

2015-07-17T00:13:20-04:00

r0d3n7z said:

colored histogram

I don't think I'm understanding the graphs, unless you meant less than or equal to (<=) rather than greater than or equal to (>=)?

@r0d3n7z: Derived from score_histogram.csv, here are...

2015-07-17T00:01:09-04:00

Derived from score_histogram.csv, here are postscore-to-percentile lookup tables for all posts, posts grouped by rating, and posts grouped by copyright. See the different sheets.

https://www.dropbox.com/s/5gaf1x0uopg6t1o/score_percentile_lookup_condensed.xls?dl=0
(Edit: stupid mistakes in earlier version have been fixed)

Main part of the data are:

peer group
cumulative count
cumulative percentage (percentile)
post score that the count/percentile corresponds to

I did a little bit of work to merge together scores that had too few posts to be significant.

%diff is the difference in percentile from one row to the next
these are color coded at thresholds of >=0.5% (red), >=1% (yellow) and >=5% (green)
due to the long tails on either side of the histogram, there were originally many rows where %diff<0.5%
moving outward from the center, whenever I encountered a row that had %diff<0.5%, I would merge it with the next one, and so on, until the threshold of 0.5% was reached (or until I ran out of rows to merge with)
the region of consecutive post scores colored green consists of rows that were kept intact; the other rows were created by merging.
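The merging procedure described above could be sketched roughly like this. It is an approximation under stated assumptions, not the spreadsheet logic itself: the rows are invented, the "center" is assumed to be the most populated score, and the 0.5% threshold is expressed as a minimum post count (5 out of 1000 posts). `merge_sparse_rows` is a hypothetical name.

```python
def merge_sparse_rows(rows, start, step, min_count):
    """Walk `rows` from index `start` in direction `step` (+1 or -1),
    merging each row below `min_count` with the rows that follow it.

    `rows` is a list of (score, post_count) sorted by score.
    Returns a list of (lowest_score, highest_score, post_count) buckets.
    """
    merged = []
    i = start
    while 0 <= i < len(rows):
        score, count = rows[i]
        low = high = score
        i += step
        # Keep absorbing rows until the bucket is big enough or rows run out.
        while count < min_count and 0 <= i < len(rows):
            next_score, next_count = rows[i]
            low, high = min(low, next_score), max(high, next_score)
            count += next_count
            i += step
        merged.append((low, high, count))
    return merged


# Invented histogram with 1000 posts in total, so 0.5% corresponds to 5 posts.
rows = [(-2, 1), (-1, 3), (0, 200), (1, 300), (2, 250),
        (3, 240), (4, 3), (5, 2), (6, 1)]
min_count = 5

# Assumption: the center is the score with the most posts.
center = max(range(len(rows)), key=lambda i: rows[i][1])

upper = merge_sparse_rows(rows, center, +1, min_count)
lower = merge_sparse_rows(rows, center - 1, -1, min_count)
buckets = list(reversed(lower)) + upper
print(buckets)
```

Buckets that still fall short of the threshold at either end (here the extreme tails) are the "ran out of rows" case.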

What this suggests

Again, we ought to be cautious when drawing conclusions from percentiles. Particularly in the region of around 50% or less, a difference in score of just 1 will often be reflected as a percentile difference of 10~15%, or even more in some cases.

Visualization of percentile data should reflect these huge jumps where possible, to prevent people from jumping to incorrect conclusions

Rather than fixed width intervals (0~5%, 5~10%, 10~15%, etc) we should probably consider flexible brackets based on what the data shows
Some tags are rather useless for analysis because they don't have enough variance in score. e.g.:

fate_(series) has almost a third of its posts in fate_(series) score:0 alone
fate/zero has 45% of its posts in fate/zero score:0
jojo_no_kimyou_na_bouken is completely hopeless, 76% of its posts are jojo_no_kimyou_na_bouken score:0

@r0d3n7z: > Type-kun said: > It's harmonic and arithmetic...

2015-07-16T20:17:24-04:00

Type-kun said:
It's harmonic and arithmetic means for favcount and score percentiles, as well as post count, per peer group per approver, for posts between Jan 1, 2015 and Jul 7, 2015. Basically, it's that approver performance report we discussed in topic #11837, for last half-a-year.

This is already very interesting and usable data.

Additional processing:

remove all rows with post count less than 10 (arbitrary threshold)
tweak sort order so that within each approver, sort by:

everything (*)
rating
comic?
comic? + rating
copyright
copyright + rating
copyright + comic? ← @Type-kun, these are all missing for some reason?
copyright + comic? + rating

make pretty colors
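For reference, the filtering and sort order listed above could be sketched like this. The row layout (dicts with `approver`, `peer_group`, `count` keys) and the example rows are assumptions for illustration, not the actual CSV schema; peer groups are assumed to be space-separated tokens like "some_copyright comic rating:s".

```python
COMIC_TOKENS = ("comic", "-comic")


def group_rank(peer_group):
    """Sort key giving the order: *, rating, comic, comic + rating,
    copyright, copyright + rating, copyright + comic, copyright + comic + rating."""
    tokens = peer_group.split()
    has_rating = any(t.startswith("rating:") for t in tokens)
    has_comic = any(t in COMIC_TOKENS for t in tokens)
    # Anything that is not "*", a rating, or a comic token is treated as a copyright.
    has_copyright = any(
        t != "*" and not t.startswith("rating:") and t not in COMIC_TOKENS
        for t in tokens
    )
    return (has_copyright, has_comic, has_rating)


def condense(rows, min_count=10):
    """Drop rows with fewer than `min_count` posts, then sort per approver."""
    kept = [r for r in rows if r["count"] >= min_count]
    return sorted(kept, key=lambda r: (r["approver"], group_rank(r["peer_group"])))


# Invented rows for a single approver.
rows = [
    {"approver": 1, "peer_group": "example_copyright rating:s", "count": 40},
    {"approver": 1, "peer_group": "*", "count": 500},
    {"approver": 1, "peer_group": "comic", "count": 60},
    {"approver": 1, "peer_group": "rating:s", "count": 300},
    {"approver": 1, "peer_group": "example_copyright", "count": 45},
    {"approver": 1, "peer_group": "example_copyright comic rating:s", "count": 5},
    {"approver": 1, "peer_group": "comic rating:s", "count": 50},
]

ordered = [r["peer_group"] for r in condense(rows)]
print(ordered)
```

The threshold of 10 is the same arbitrary cutoff as in the list; it is just a parameter here.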

https://www.dropbox.com/s/mt16f31e56inai4/approver_condensed.xls?dl=0

Some interesting examples:


Approver	Peer group	Count	Score percentile harmonic mean	Score percentile arithmetic mean	Favcount percentile harmonic mean	Favcount percentile arithmetic mean
11672	love_live!_school_idol_project	148	9	23	8	21
11672	love_live!_school_idol_project rating:s	143	11	24	8	21
11672	love_live!_school_idol_project comic rating:s	95	61	65	53	57
Approver	Peer group	Count	Score percentile harmonic mean	Score percentile arithmetic mean	Favcount percentile harmonic mean	Favcount percentile arithmetic mean
13392	k-on!	83	12	30	6	17
13392	k-on! rating:s	82	12	31	6	18
13392	k-on! comic rating:s	56	35	36	21	24

@r0d3n7z: > Type-kun said: > score_histogram.csv Quickly...

2015-07-16T18:05:16-04:00

Type-kun said:
score_histogram.csv

Quickly plotted a few score histograms; they generally exhibit Zipf's law behavior (albeit in both positive and negative directions)
...I'm not really surprised to see this, considering it's essentially a kind of rank data.

Basically, post score is the result of three processes -- each user considers:

do I like the post enough to fave it?
do I like the post enough to vote it up?
do I dislike the post enough to vote it down?

@r0d3n7z: > CodeKyuubi said: > > Not entirely sure what...

2015-07-16T17:22:12-04:00

CodeKyuubi said:

Not entirely sure what the numbers mean. Does a harmonic mean for score percentile with a result, of say, 92, mean that the average post of x tag by y user places in the 92nd percentile of score?

Yes, to be precise, it places in the 92nd percentile of score compared to other posts with x tag(s)

Harmonic mean is just a different kind of "average" than the arithmetic mean, which is what is usually intended when we say "average" (colloquially speaking). However, harmonic mean is more appropriate because we are "averaging" percentages, which are rates.

Harmonic mean has the benefit of rewarding consistency. Compare the following simplified examples:

95%, 60%
Arithmetic mean: 77.5%
Harmonic mean: 73.548%

80%, 75%
Arithmetic mean: 77.5%
Harmonic mean: 77.419%

If you use the arithmetic mean, the two examples are indistinguishable. Using the harmonic mean, the more consistent performer does better.
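The two examples can be reproduced with Python's standard library. This is only a quick check of the arithmetic, using made-up labels for the two performers:

```python
from statistics import fmean, harmonic_mean

examples = {
    "uneven": [95, 60],
    "consistent": [80, 75],
}

results = {}
for label, percentiles in examples.items():
    am = fmean(percentiles)          # arithmetic mean
    hm = harmonic_mean(percentiles)  # N / (1/x1 + ... + 1/xN)
    results[label] = (round(am, 3), round(hm, 3))
    print(f"{label}: arithmetic={am:.3f} harmonic={hm:.3f}")
```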

@r0d3n7z: > Type-kun said: > > Ok, math failed me on...

2015-07-16T17:10:17-04:00

Type-kun said:

Ok, math failed me on this one. I've resorted to most common meanings and calculated "percentile" as "percent of images that have lower score than this one". However, since all histograms are skewed at zero values, and for some peer groups zero is the lowest score, it's highly possible for a post to legitimately be in 0th percentile: there are no posts with scores below zero, 0/total = 0.

It's no big deal, but then harmonic mean kicks in. If I go with wiki, it's N/(1/x1+1/x2+...+1/xN). The thing is, it's meant for positive numbers only. I learned about this the hard way - after half-an-hour of processing I got "mean percentile" of 150 zeros and three 85s to be over 4000.

So, which one do I fix, and how exactly? :3 I can treat 0s as 1s for harmonic mean calculation, but that's not exactly correct.

I'd suggest treating "percentile" as "percent of images that have equal or lower score than this one". Basically, you give the post scores the benefit of the doubt by saying "this post is at least as good as x% of others" rather than having to be "strictly better than". This way, even the lowest scoring post in a peer group will have a non-zero percentile (even that one lone post with score -116 in * would be at a minuscule, but non-zero, 1/2029669 = 4.92691173e-7 = 0.0000492691173th percentile)
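A small sketch of the difference between the two definitions, computed from a score histogram. The histogram is invented, and `strict_percentiles` / `inclusive_percentiles` are hypothetical helper names, not anything from the actual report code.

```python
def strict_percentiles(histogram):
    """Percent of posts with a strictly lower score, per score."""
    total = sum(histogram.values())
    running = 0
    result = {}
    for score in sorted(histogram):
        result[score] = 100 * running / total
        running += histogram[score]
    return result


def inclusive_percentiles(histogram):
    """Percent of posts with an equal or lower score, per score."""
    total = sum(histogram.values())
    running = 0
    result = {}
    for score in sorted(histogram):
        running += histogram[score]
        result[score] = 100 * running / total
    return result


# Invented peer group where score:0 is the lowest score and dominates.
histogram = {0: 150, 1: 30, 2: 20}

strict = strict_percentiles(histogram)
inclusive = inclusive_percentiles(histogram)
print(strict)
print(inclusive)
```

With the strict definition the score:0 posts land on 0, which the harmonic mean cannot handle sensibly; with the inclusive definition every value is positive and the top score sits at 100.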

Type-kun said:
Will also ceil non-integer percentiles instead of floor'ing them. This way, percentile distribution will be 0 < P <= 100, integer, which is good for harmonic mean. I will also compute arithmetic mean at the same time, just to check out how well that works.

Hang on... why would integers be any better for harmonic mean? It's all numbers anyway.

Percentages are rates, i.e. fractions. e.g. 86% should be treated as 0.86. Now, numerically it doesn't make a difference to the result whether you plug 86 or 0.86 into the harmonic mean computation, but when you use ceil, that's consistently rounding up nearly all of your numbers -- you're not likely to get perfect integers anyway.

... actually, never mind, I paused to think about it, and this probably just means the final output is at worst going to be inflated by up to 1%, which is tolerable, I think. We're not doing mission-critical science here. :p

Type-kun said:
Also, regarding peer groups, shouldn't we also have groups with negated tags? Specifically, comic tag, no other general tags are counted in currently. There would be groups like rating:s -comic then.

Sure, in cases where it makes sense to do it. (e.g. you probably wouldn't negate a copyright like -touhou)

Type-kun said:
DATA DATA DATA

cool, thanks!

also thanks to albert for providing the data dump!

@Type-kun: > CodeKyuubi said: > > Not entirely sure what...

2015-07-16T15:27:55-04:00

CodeKyuubi said:

Not entirely sure what the numbers mean. Does a harmonic mean for score percentile with a result, of say, 92, mean that the average post of x tag by y user places in the 92nd percentile of score?

Yeah. It seems to mean that the average post of x tag by y user has a score greater than or equal to that of 92% of other posts with x tag.

@CodeKyuubi: Not entirely sure what the numbers mean. Does a...

2015-07-16T15:10:29-04:00

Not entirely sure what the numbers mean. Does a harmonic mean for score percentile with a result, of say, 92, mean that the average post of x tag by y user places in the 92nd percentile of score?

@Type-kun: http://puu.sh/j1yno.zip - fixed data for...

2015-07-16T13:50:48-04:00

http://puu.sh/j1yno.zip - fixed data for contributors.

Done for now. If there's something else you want to gauge, write here.

@Type-kun: http://puu.sh/j1wGx.zip - here's the same data...

2015-07-16T13:25:17-04:00

http://puu.sh/j1wGx.zip - here's the same data per uploader instead of approver, one file ("members") for posts approved by someone, one ("contributors") for auto-approved posts.

Now that I think about it, deleted posts from regular users will also appear in contributors.csv... but those should be easy to filter out based on user ID, anyway.

E: though, no. 2/3 of contributors.csv is not, in fact, contributors. I will redo that one.

@Type-kun: Ok, here's some interesting data:...

2015-07-16T12:43:30-04:00

Ok, here's some interesting data: http://puu.sh/j1ugz.csv

It's harmonic and arithmetic means for favcount and score percentiles, as well as post count, per peer group per approver, for posts between Jan 1, 2015 and Jul 7, 2015. Basically, it's that approver performance report we discussed in topic #11837, for last half-a-year.

Similar data is currently calculated for non-contributor and contributor users. Somebody should make pretty diagrams out of it :3

@Type-kun: Looking further, I'll try "Nearest rank" method...

2015-07-16T03:56:44-04:00

Looking further, I'll try "Nearest rank" method for percentile calculation. Will also ceil non-integer percentiles instead of floor'ing them. This way, percentile distribution will be 0 < P <= 100, integer, which is good for harmonic mean. I will also compute arithmetic mean at the same time, just to check out how well that works.

Also, regarding peer groups, shouldn't we also have groups with negated tags? Specifically, comic tag, no other general tags are counted in currently. There would be groups like rating:s -comic then.

@Type-kun: Ok, math failed me on this one. I've resorted...

2015-07-15T16:43:04-04:00

Ok, math failed me on this one. I've resorted to most common meanings and calculated "percentile" as "percent of images that have lower score than this one". However, since all histograms are skewed at zero values, and for some peer groups zero is the lowest score, it's highly possible for a post to legitimately be in 0th percentile: there are no posts with scores below zero, 0/total = 0.

It's no big deal, but then harmonic mean kicks in. If I go with wiki, it's N/(1/x1+1/x2+...+1/xN). The thing is, it's meant for positive numbers only. I learned about this the hard way - after half-an-hour of processing I got "mean percentile" of 150 zeros and three 85s to be over 4000.

So, which one do I fix, and how exactly? :3 I can treat 0s as 1s for harmonic mean calculation, but that's not exactly correct.

@Bibs: I have thought about this for awhile now too. ...

2015-07-15T15:28:04-04:00

I have thought about this for awhile now too.

Some copyrights or even obscure tag descriptions just seem to have a stronger association to higher quality art.

@Type-kun: Moving here from topic #11837 Now that DB dump...

2015-07-15T13:05:38-04:00

Moving here from topic #11837

Now that DB dump is available, here's more data: http://puu.sh/j0fCc.zip

For all data, only posts from July 07, 2015 or older are counted. Deleted posts ARE included.

peer_groups.csv shows total post count per peer group.

score_histogram.csv and favcount_histogram.csv show distribution of score and favcounts per peer group. Only peer groups with 500+ post count are included. No graphical data, it weighs a lot, easy to create in excel/openoffice/pretty much any analytical or math software. Now, if someone could tell me how to calculate percentiles and harmonic means correctly, I'll be able to get "percentile per peer group per approver for certain dates" report running on my machine, to see the results.