A recent story from Wired, "Science Puts Enron E-Mail to Use," looks at the Enron e-mail data dump that the Federal Energy Regulatory Commission made public. While some have been browsing the e-mails for off-color jokes, recipes, directions and mentions of politicians, others are finding value in analyzing the data. According to Wired, scientists, students and at least two businesses are working with the unstructured messages, trying to find ways to mine this kind of dataset (called the Enron corpus) for usable information.
In 2004, professor Marti Hearst at the University of California at Berkeley School of Information Management & Systems tasked students in her natural-language-processing course with cleaning up the database to make it searchable.
"It is a way for students to see -- when they run text-classification algorithms on e-mail messages versus newsgroups -- how well those would do," Hearst said. "E-mail is one of the more difficult kinds of information to process."
While Hearst says the jury is still out on the usefulness of the Enron corpus for researchers, she argues that shared corpora of this kind are key to advancing computer science research rapidly, because they allow different algorithms to be compared on the same data.
What the article doesn't say is that as technology becomes more adept at teasing meaningful information out of unstructured data, the opportunity to create better privacy policies also increases.
Would you agree?