Friday, May 27, 2005

Mining Social Networks from Email

I recently acquired a couple new toys--an IBM Thinkpad last month and a Canon Pixma multifunction printer/copier/fax/scanner just today. I go a while between upgrades so when the new stuff comes in it really blows me away. Today's revelation is optical character recognition, or OCR. How OCR works I have no idea but here's what it can do:

My regular readers may have already detected that I am a New Yorker magazine junkie. My friends can hardly fail to notice this, since I am always saying, "Yes, and that reminds me of an article I just read in the New Yorker," at which point I take over the conversation for a few minutes. In the olden times (before today) that was more than enough for my friends. But as of today it is just the beginning. Now I can go home to my personal NYer archives (dating from 9-11), grab the issue in question, put it through my scanner, and sit back while my computer receives the entire article in the form of a Word document (with columns, pages, and cartoons all properly configured) or a PDF (with text searching). I leave the rest of the story to your imagination, since this is a copyright-friendly blog.

If any of you just happen to be thinking about email right now, let me say--that reminds me of a great article I just read in the New York Times: "Enron Offers an Unlikely Boost to E-Mail Surveillance." I am a bit embarrassed to be mentioning this article now. It was published very prominently on Sunday. But I have been so preoccupied with my new ThinkPad that real life is apparently passing me by. So thanks to Jim Murphy for clipping the article and handing it to me, in a quaint nod to life before scanners. Jim's gift prompted me to check Patti Anklam's blog, where I found the review she wrote the day after the article's publication.

The gist of the story is that a huge pile of Enron email is now publicly available. The email provides a detailed look at communication from before the California energy crisis right up to the final bankruptcy scandal. This is an unprecedented resource for sociologists and computer scientists, who have proceeded to demonstrate not only the power of textual analysis (how often do people say "Dynegy" or "bankruptcy" week by week) but also the power of network analysis (who sends email to whom and when, regardless of the content).
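For the programmers in the audience, here is a minimal Python sketch of those two ideas: counting a word week by week, and counting who emails whom. The three toy messages and their field names (`sender`, `recipients`, `sent`, `body`) are my own inventions for illustration. This is not the format of the Enron data, and it is not the researchers' actual method.

```python
import re
from collections import Counter
from datetime import date

# Toy messages; the field names are hypothetical, chosen for this sketch only.
MESSAGES = [
    {"sender": "ann", "recipients": ["bob", "cal"],
     "sent": date(2001, 11, 5), "body": "Dynegy talks continue"},
    {"sender": "bob", "recipients": ["ann"],
     "sent": date(2001, 11, 7), "body": "Any news on Dynegy? dynegy deal"},
    {"sender": "ann", "recipients": ["bob"],
     "sent": date(2001, 12, 3), "body": "Bankruptcy filing is out"},
]

def weekly_term_counts(messages, term):
    """Textual analysis: occurrences of `term` per ISO (year, week)."""
    pattern = re.compile(r"\b" + re.escape(term.lower()) + r"\b")
    counts = Counter()
    for msg in messages:
        iso = msg["sent"].isocalendar()
        week = (iso[0], iso[1])  # (ISO year, ISO week number)
        counts[week] += len(pattern.findall(msg["body"].lower()))
    return counts

def edge_counts(messages):
    """Network analysis: messages per (sender, recipient) pair, ignoring content."""
    edges = Counter()
    for msg in messages:
        for recipient in msg["recipients"]:
            edges[(msg["sender"], recipient)] += 1
    return edges

if __name__ == "__main__":
    print(weekly_term_counts(MESSAGES, "Dynegy"))
    print(edge_counts(MESSAGES))
```

Notice that `edge_counts` never looks at the message body. That is the whole point of the network approach: the pattern of who talks to whom is informative on its own.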

The article features a beautiful network diagram:

Note the use of a hierarchical circular layout that places people in three categories: (1) periphery, (2) mid-level, and (3) core. That's a great way not to distract people with unnecessary detail.
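I don't know what software produced the diagram in the Times, but the basic idea of a layered circular layout is easy to sketch. The Python below assumes you have already sorted people into the three categories, and it simply spaces each group evenly around its own circle. The function name, the names in the groups, and the radii are all hypothetical.

```python
import math

def concentric_layout(layers, radii):
    """Place each layer's nodes evenly around a circle of that layer's radius.

    `layers` maps a layer name to a list of node names; `radii` maps the
    same layer names to circle radii. Returns {node: (x, y)}.
    """
    positions = {}
    for layer_name, nodes in layers.items():
        r = radii[layer_name]
        for i, node in enumerate(nodes):
            angle = 2 * math.pi * i / len(nodes)
            positions[node] = (r * math.cos(angle), r * math.sin(angle))
    return positions

# Hypothetical grouping: core in the middle, periphery on the outside.
layers = {
    "core": ["ann"],
    "mid": ["bob", "cal"],
    "periphery": ["dee", "eli", "fay"],
}
radii = {"core": 1.0, "mid": 2.0, "periphery": 3.0}
positions = concentric_layout(layers, radii)
```

Deciding who belongs in which ring is the hard part, and this sketch does not attempt it.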

The Enron analysis is being led by David Skillicorn, Kathleen Carley, and Michael Berry.

Want to try this at home? You can! Investigate your own email communication network by downloading Peter Gloor's TeCFlow.
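If you would rather poke at your own mail by hand first, here is a rough sketch using only Python's standard library. To be clear, this is not how TeCFlow works; it is just one way to count sender-to-recipient pairs from message headers. It assumes your mail is exported in mbox format, and the function names and the sample message are made up for the example.

```python
import mailbox
from collections import Counter
from email import message_from_string
from email.utils import getaddresses, parseaddr

def edges_from_messages(messages):
    """Count (sender, recipient) address pairs from From/To/Cc headers."""
    edges = Counter()
    for msg in messages:
        sender = parseaddr(msg.get("From", ""))[1].lower()
        recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
        for _name, addr in recipients:
            if sender and addr:
                edges[(sender, addr.lower())] += 1
    return edges

def edges_from_mbox(path):
    """Same count, read from an mbox file (assumed export format)."""
    # create=False so a mistyped path raises an error instead of making a new file.
    return edges_from_messages(mailbox.mbox(path, create=False))

# A made-up message to show the idea without needing a real mailbox.
demo = message_from_string(
    "From: Ann <ann@example.com>\n"
    "To: Bob <bob@example.com>, cal@example.com\n"
    "Cc: Bob <bob@example.com>\n"
    "Subject: hello\n"
    "\n"
    "body text\n"
)
demo_edges = edges_from_messages([demo])
```

Here Bob is counted twice for the one message because he appears in both To and Cc. Whether that is what you want is a design choice; a more careful version might count each recipient once per message.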

Comments:

Anonymous said...

The New York Times
May 22, 2005
Enron Offers an Unlikely Boost to E-Mail Surveillance
By GINA KOLATA

AS an object of modern surveillance, e-mail is both reassuring and troubling. It is a potential treasure trove for investigators monitoring suspected terrorists and other criminals, but it also creates the potential for abuse, by giving businesses and government agencies an efficient means of monitoring the attitudes and activities of employees and citizens.

Now the science of e-mail tracking and analysis has been given an unlikely boost by a bitter chapter in the history of corporate malfeasance - the Enron scandal.

In 2003, the Federal Energy Regulatory Commission posted the company's e-mail on its Web site, about 1.5 million messages. After duplicates were weeded out, a half-million e-mails were left from about 150 accounts, including those of the company's top executives. Most were sent from 1999 to 2001, a period when Enron executives were manipulating financial data, making false public statements, engaging in insider trading, and the company was coming under scrutiny by regulators.

Because of privacy concerns, large e-mail collections had not previously been made publicly available, so this marked the first time scientists had a sizable e-mail network to experiment with.

"While it's sad for the people at Enron that this happened, it's a gold mine for researchers," said Dr. David Skillicorn, a computer scientist at Queen's University in Canada.

Scientists had long theorized that tracking the e-mailing and word usage patterns within a group over time - without ever actually reading a single e-mail - could reveal a lot about what that group was up to. The Enron material gave Dr. Skillicorn's group and a handful of others a chance to test that theory, by seeing, first of all, if they could spot sudden changes.

For example, would they be able to find the moment when someone's memos, which were routinely read by a long list of people who never responded, suddenly began generating private responses from some recipients? Could they spot when a new person entered a communications chain, or if old ones were suddenly shut out, and correlate it with something significant?

There may be commercial uses for the same techniques. For example, they may enable advertisers to do word searches on individual e-mail accounts and direct pitches based on word frequency.

"Will you let your e-mail be mined so some car dealer can send information to you on car deals because you are talking to your friends about cars?" asks Dr. Michael Berry, a computer scientist at the University of Tennessee who has been analyzing the data.

Working with the Enron e-mail messages, about a half-dozen research groups can report that after just a few months of study they have already learned that they can glean telling information and are refining their ability to sort and analyze it.

Dr. Kathleen Carley, a professor of computer science at Carnegie Mellon University, has been trying to figure out who were the important people at Enron by the patterns of who e-mailed whom, and when and whether these people began changing their e-mail communications when the company was being investigated.

Companies have organizational charts, but they reveal little about how things really work, Dr. Carley said. Companies actually operate through informal networks, which can be revealed by analyzing "who spends time talking to whom, who are the power brokers, who are the hidden individuals who have to know what's going on," she said.

With the Enron data, Dr. Carley continued, "what you see is that prior to the investigation there is this surge in activity among the people at the top of the corporate ladder." But she adds, "as soon as the investigation starts, they stop communicating with each other and start communicating with lawyers." It showed, she says, "that they were becoming very nervous."

The analyses also found someone so junior she did not show up on organization charts but who, whichever way the e-mail data was mined, "shows up as a person of interest," Dr. Skillicorn said, in the language of intelligence analysts. In the investigation of a terror network, pinpointing such a person could be of enormous significance.

Dr. Berry said the e-mail traffic patterns tracked major events, like the manipulation of California energy prices. "We could see how things built up right before the bankruptcy," he said.

There were e-mail surges with each crisis, pointing to a problem that was consuming Enron employees. And in each crisis, there were features of certain e-mail messages - word choices, routing patterns - that allowed the computer scientists to isolate them from the morass of irrelevant personal or business messages.

One thing that didn't show up when the researchers screened for changes in word use was guardedness, said Dr. Skillicorn, a failure that was revealing in itself. Ordinarily, he said, when people are being deceptive they are more self-conscious, and their word use becomes simpler, as though they are trying too hard to sound natural.

But that apparently never occurred at Enron because its employees remained unconcerned while they engaged in illegal activity. "It wasn't a case of keeping a low profile," Dr. Skillicorn said. "They didn't worry about the story they were telling."

The scientists who are studying the Enron data said they assumed intelligence agencies are doing similar classified analyses on international e-mail traffic. For example, since World War II, a five-nation consortium of the United States, Canada, Britain, Australia and New Zealand has cooperated in a vast communications collection and analysis program called Echelon, one that has assumed increasing importance since the terror attacks of Sept. 11, 2001.

No one in the unclassified world knows precisely what is being done with the Echelon data. But, Dr. Berry said, surveillance in the civilian world could one day have troubling consequences. It could allow companies, without ever actually infringing on e-mail conversations, to track employee attitudes and activities closely and easily.

"They can monitor discussions without actually isolating individuals," Dr. Berry said. "They can assess morale. If they make a cut in salaries, how long does the unhappiness go on? You could track topics and get a sense of how people are responding to policies and flag potential hot spots." Or, he said, managers might be able to learn which people have too much time on their hands.

And, as Dr. Skillicorn notes, if you try to write bland e-mail messages with hidden communications, chances are the programs will pick those out, too.

"It's clearly Orwellian," Dr. Berry said. "And I know that freaks people out."
