Blog

Managing user-generated content

2008-07-10 11:37:20 by Martynas Jusevičius

Back from some holiday, I was thinking about how to (automatically) manage user-generated content. On a website where users can add and change data, how does one make sure that it is entered correctly and that it will not be vandalized afterwards? Maybe I am paranoid, but I think if there is a possibility to ruin something online, it will be ruined.

Let us take a website where users can add events and their details as an example. What if an event is fake? Also, what about duplicates (different names for the same actual event)? Last.fm takes an interesting approach letting users vote for the right names of the artists. After some number and/or percent of positive votes, it would be possible to assume that event details are correct and lock it to prevent further editing.

But what if the event details actually need to be changed? Should only the creator be allowed to make changes, or everybody? In that case, how to make sure that a description carefully crafted by several users will not be wiped out by somebody? As far as I can see, it needs at least some version tracking to be able to roll back to the right version, as in Last.fm's description wiki and Wikipedia. But the implementation on a relational database does not seem trivial for me. Can anyone provide an example? I think it would be possible to build version tracking into existing ORMs such as Propel.

Digg Digg this! del.icio.us del.icio.us!

Comments (8)

Version Tracking

2008-07-10 13:22:07 by John Behrens

Well it´s an interesting question i never took a look on applications like that.
One possibility is having 2 tables one with all topics and and one with all edits
for the display you always show the newest edit of a topic which contains a timestamp, you can also add a flag to roll back or deactivate an edit to change it back.

More difficult would be if you have serveral edits
edit 1 is okay
edit 2 is NOT okay adds unacceptable content
edit 3 is okay and adds important info but has unacceptable content of edit 2

you don´t wanna loose edit 3 but have to rollback edit 2 this might cause a problem so what you need is a mechanism which is only saving the changes made or differ between edit 1 and edit 2 so changes of edit 2 are removed from edit 3.

2008-07-10 15:05:28 by Dave

I think the mechanics are pretty straightforward and using rest-type naming conventions can help you think about what entities actually should be tracked in what hierarchies. The real challenge is jsut deciding what algorithms you want to use for allowing edits, permissions etc.

To address John's comment the problem here is that you may be committing something whihc contradicts earlier informations. Location context is not the same as semantic context. In most source control systems you cant do this you can only rollback but not jump "bad edits" for that very reason. You need manual intervention to ensure that edit1 + edit3 stands as a valid whole.

history

2008-07-10 15:19:23 by Richard Harrison

A technique I've used in the past was to add a status field to the "content" table (eg "events"), indicating whether the record was live or historical, and use that same table to store the historical data.

Then create another table that records all the meta data regarding historical records:-

history_meta
----------
content_table // the table our content is in
content_id // the record this is historical data for, in the content table
historical_id // the id of the historical data, in the content table
date // date of edit
some_other_meta_data // etc etc

So if you want to select all historical records for "event id = 123":-

SELECT *
FROM history_meta
LEFT JOIN events
ON(events.id = history_meta.content_id AND history_meta.content_table = 'events' )
WHERE history_meta.content_id=123

(something like that anyway).

The major benefit to this method is you can record the history of arbitrary tables very easily, without having to change their structure or having to create lots of new tables / code.

It worked like an absolute charm in our CMS system. My gut says there might be problems at a massive scale (your events table grows at the rate of the number of edits!), but I'm sure they would be resolvable.

Crowdsource?

2008-07-10 17:15:53 by PaulG

I haven't come across anything that fits that bill yet. "Crowdsourcing" seems to be one label you could stick on what you want to do, and might turn up some algorithms or ideas.

I have just made a start at something similar and am going to take the view that each user carries around with them a certain level of "cred" (or karma I have seen it sometimes referred to). That cred level will depend on 1) how often they visit and the pattern of visits, 2) if they voted on something as being true (or false) which others agree with. That's been pretty easy to do.

How I imagine it working is that if the truth level of a Known Fact (KF) stands at say, 4, (so 4 qualified individuals have said that KF is is true) if someone else comes and says its false, then their cred level goes down by one**. If someone else turns up and agrees that the KF is true, they get +1 and all the other voters get +1.

[** should I keeping a record of that? This is where it could get expensive - but I rather fancy doing --$cred; and letting the logic of the crowdsource sort things out naturally ]

I can't write the sql for the query yet as I am not finished, but that's how I imagine such a thing working, maybe I will become unstuck.

I certainly feel there are such algos out there, mainly connected to the semantic web ppl I get a hunch, but they don't seem to be putting them in the public domain - or not in a form I can understand (PHP).

HTH

@Richard - that sounds like an audit trail.

2008-07-10 18:33:55 by Mark Kimsal

You have two choices:

1. Deal with versioning and recording edits.

2. Don't deal with versioning and let the crowd fix things.

If you want to record edits, SQL might not be the best way to go. You would be duplicating CVS or SVN into your app with a SQL backend. If it were easy, CVS or SVN would already have SQL backends, but they don't.

One option to keep your edits would be to simply save the current version in SQL, and queue and propagate each changed document into subversion; either by execing out, or building some direct link at the protocol level.

This would allow for diff analysis and recovery during an emergency. The only problem is that an operator would have to intervene in an emergency, you probably don't want the general public having rights to claim that someone else's changes are bad and that an old version should be pulled up from the repo.

I guess it depends on the type of site too. Something like wikipedia keeps a log of all edits, but its only job is to be a record keeping site. Tracking the edits is fundamental to the transparency required to act as an authority.

2008-07-12 12:27:07 by Pras Sarkar

As Mark Kimsal pointed out, your two choices are whether to use versioning or let a subset of users have enough authority to decide the correct version.

For option 1, depending on scalability and your cleanup process, you can use 1 or 2 tables (for vertical database scalability) to store revisions:

content
-------
cid int
rid int // revision id
content_data_html text
content_data_plain text

Your SQL to get the content by cid = 123 would be something like

SELECT * FROM content WHERE cid = 123 ORDER BY rid DESC LIMIT 1;

This way you're always getting the latest version unless you require an older one to revert to. This also allows you to provide diffs (like Wikipedia) as you're always storing the complete version and not incremental changes/updates.

For option 2, the algorithm would have to resemble a karma system where each user starts with an entropy level of 1 and deviates positively or negatively away from it. The advantage of this is that you can vectorize this value easily and find subsets of related users with high karma and only allows edits to go through if a certain threshold of these high-karma users have recommended an edit.

re

2008-07-18 17:58:22 by System

@Pras Sarkar: your idea is very good

Good & Interesting Questions

2008-12-08 01:47:58 by Cheryl Beckham

I enjoyed your post and am interested in reading all the comments that it will lead to.
Cheryl Beckham

New comment






No HTML allowed.