World's largest events database could predict conflict
* 13 May 2013 by Douglas Heaven
A database of over 200 million global events could help understand
and forecast how conflicts will play out
[Editorial "Predictive power won't take away the big decisions" added.]
WHEN will the civil war in Syria subside? Will there be fighting on
the Korean peninsula? The answers defy the best human minds, but
there may be a tool to help them: the world's largest database of
geopolitical events has been released. And as it is refined, it
could make forecasting events - and conflicts, in particular - as
common as predicting the weather.
The Global Data on Events, Location and Tone data set (GDELT)
contains nearly a quarter of a billion events going back to 1979 and
hoovers up 100,000 new events every day. Its software scours media
sources, such as the Associated Press, Agence France Presse and
Xinhua, the main news agency in China. Collectively, the sources
monitored cover every country in the world.
The software automatically extracts information from these news
reports then uses natural-language processing to turn them into data
points. For example, if a report contains the line "Sudanese
students and police fought in the Egyptian capital" it codes the
event as "SUDEDU fought COP". Next, the system finds the nearest
mention of a city or locality in the text - in this case Cairo - and
adds its latitude and longitude to the event data. The system can
recognise different phrasings of who did what to whom and where.
This helps it avoid duplicating events when they are mentioned in
several news reports.
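The coding steps described above can be sketched in a few lines of Python. This is a toy illustration only, not GDELT's actual software: the lookup tables, the function names (`find_mentions`, `code_event`, `deduplicate`) and the rule for spotting duplicates are all assumptions made for the example, and the coordinates for Cairo are approximate.

```python
import re

# Hypothetical lookup tables, invented for this example. A real coding
# system would rely on far larger dictionaries of actors, actions and places.
ACTOR_CODES = {"sudanese students": "SUDEDU", "police": "COP"}
ACTION_CODES = {"fought": "fought", "clashed with": "fought"}
GAZETTEER = {
    "egyptian capital": ("Cairo", 30.04, 31.24),
    "cairo": ("Cairo", 30.04, 31.24),
}


def find_mentions(text, phrases):
    """Return sorted (position, phrase) pairs for known phrases in the text."""
    lowered = text.lower()
    hits = []
    for phrase in phrases:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        hits.extend((m.start(), phrase) for m in re.finditer(pattern, lowered))
    return sorted(hits)


def code_event(text):
    """Turn one sentence into a who-did-what-to-whom-and-where record."""
    actors = find_mentions(text, ACTOR_CODES)
    actions = find_mentions(text, ACTION_CODES)
    places = find_mentions(text, GAZETTEER)
    if len(actors) < 2 or not actions or not places:
        return None  # not enough information to code an event
    action_pos, action = actions[0]
    # Pick the place mentioned nearest to the action in the text.
    _, place = min(places, key=lambda hit: abs(hit[0] - action_pos))
    name, lat, lon = GAZETTEER[place]
    return {
        "source": ACTOR_CODES[actors[0][1]],
        "action": ACTION_CODES[action],
        "target": ACTOR_CODES[actors[1][1]],
        "place": name,
        "lat": lat,
        "lon": lon,
    }


def deduplicate(events):
    """Keep one event per (actor pair, action, place), ignoring actor order."""
    seen, unique = set(), []
    for event in events:
        key = (
            frozenset((event["source"], event["target"])),
            event["action"],
            event["place"],
        )
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique
```

Fed the sentence quoted above, `code_event` yields SUDEDU as the source, COP as the target and Cairo as the place. A differently phrased report, such as "Police clashed with Sudanese students in Cairo", produces the same deduplication key, so `deduplicate` keeps only one of the two records.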
"The size and scope make this data set unique," says Kalev Leetaru
at the University of Illinois at Urbana-Champaign. "Nobody had ever
constructed a global event database over a long time frame." Leetaru
and co-developer Paul Schrodt at Pennsylvania State University have
plans to extend the data set back to 1800.
Jay Yonamine, who worked on an analysis of GDELT data at Penn State,
calls it a "breakthrough data set". As part of his doctoral thesis,
Yonamine used a version of a machine learning algorithm more
commonly used for financial projections to forecast the course of
the conflict in Afghanistan, which has raged since 2001.
Yonamine fed data on the conflict up until 2008 into the algorithm.
It works by applying a statistical model to a series of data points
over a period of time and extrapolating that pattern. He found that
it accurately tracked the spread of violence across the country's
317 districts month by month between 2008 and 2012. His system
correctly predicted which districts would see violent events in 47
of the 48 months.
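To give a flavour of what extrapolating a pattern from a series of data points can look like, here is a minimal Python sketch. It is not Yonamine's model, which the article does not specify. It swaps in simple exponential smoothing, a basic time-series technique, and the function names, the smoothing weight `alpha` and the `threshold` are assumptions chosen for illustration.

```python
def smooth_forecast(counts, alpha=0.5):
    """One-step-ahead forecast by simple exponential smoothing.

    counts: monthly event counts for one district, oldest first.
    alpha:  weight given to the most recent month (assumed value).
    """
    level = counts[0]
    for count in counts[1:]:
        # Blend the latest observation with the running level.
        level = alpha * count + (1 - alpha) * level
    return level


def predict_violent_districts(history, threshold=0.5, alpha=0.5):
    """Return the districts whose forecast for next month meets the threshold.

    history: dict mapping district name to its list of monthly counts.
    """
    return {
        district
        for district, counts in history.items()
        if smooth_forecast(counts, alpha) >= threshold
    }
```

Called with a dictionary of monthly counts per district, `predict_violent_districts` flags those where recent violence keeps the smoothed level high. A district that has been quiet for several months drops out as its level decays.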
Better statistical models should improve the results further.
Yonamine says that a predictive model updated daily could be used by
Afghan businesses to choose the safest route to transport goods, for
example.
Extracting information automatically is essential, says Leetaru. "A
protest is a very human thing," he says. "But if you want to look at
a pattern, you have to quantify that." The system could also let you
capture trends, such as the mood darkening in a region before the
situation boils over. "There is simply no way for a human to take in
everything that happened in Egypt and make sense of it," he says,
referring to the escalating protests that ultimately led to the
ousting of President Hosni Mubarak in 2011.
"I'm very optimistic about big data," says Nils Weidmann at the
University of Konstanz in Germany. "But the strongest predictor of
violence is previous violence. The real challenge is predicting new
conflicts."
Weidmann thinks we will not have truly useful event forecasters
until it is possible to mine data from social networks and other
informal sources. Mainstream media tends to cover events only after
they have happened. "Big data needs to go deeper," he says.
That's not likely to be easy. On-the-ground information is often
sparse in unstable situations. However, in the case of the Syrian
civil war, GDELT's combination of geographical data and a diverse
collection of news sources allowed New Scientist to give a broad
look at how the conflict has swept through the country since its
inception in 2011 (see maps).
As GDELT is refined and the time period it covers expands, its value
is likely to go beyond questions of international policy. The
financial world, for example, is increasingly relying on analyses of
huge tranches of information.
That information can come from seemingly unlikely sources, such as
the throwaway terms users put into search engines. Google recently
opened up its records on what people are searching for and how this
changes over time. For example, between 2004 and 2011 there was an
increase in finance-related terms such as "debt", "dow jones", and
"unemployment". Tobias Preis at the University of Warwick, UK, and
colleagues analysed this data and identified patterns that they
believe could be used as early warning signs of a future financial
crisis (Nature Scientific Reports, DOI: 10.1038/srep01684).
"Everyone is scrambling to extract value from big data," says
Yonamine, who is now a data scientist with Allstate Insurance in
Chicago. "We have underestimated how much explanatory power there is
in social dynamics."
Social networks are another obvious source, but applying predictive
techniques to the rivers of posts and tweets that flow from them
could be misleading - false information is commonplace. When a
report of an explosion in the White House recently appeared on the
hacked Twitter account of the Associated Press, for example, stock
markets briefly plummeted, even though there was no corroboration.
This illustrates the need to refine how the information is
processed. "To some extent it doesn't matter if a reported event
happened or not," says Yonamine. "Information does not have to be
real to drive future events."
People are now deliberately exploiting this side of big data, says
Yonamine. For example, product reviews strategically planted on
retail websites like Amazon often skew the results of data mining
software, which collects reviews to help refine the recommender
systems that predict what consumers might like. Tracking down fake
reviews
is already a game of cat and mouse, he says. "It's machine learning
algorithm versus machine learning algorithm."
This article appeared in print under the headline "Trouble on the
Tweeting from the heart
How happy is the world today? Now you can find out - at least
according to Twitter. Researchers from the University of Vermont in
Burlington and the non-profit MITRE Corporation in McLean, Virginia,
have developed a global happiness sensor. Dubbed the hedonometer, it
analyses daily tweets from tens of millions of people over the past
five years. It crunches stats based on the "emotional valence" of
words, and spits out a score. Tuesday 30 April, the day the
hedonometer launched, scored 5.97 on the happiness scale, slightly
up on the previous Tuesday's 5.96 but way below the five-year high
of 6.37 on 25 December 2008. In fact, it seems we have been on a
gradual downer for the last five years. The saddest day? That was 15
April 2013 - the day of the Boston marathon bombings.
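One simple way to turn the "emotional valence" of words into a single score is to average the valence of every scored word that appears in a day's tweets. The Python sketch below assumes that approach. The word list and its values are invented for illustration and are not the hedonometer's own, and `happiness_score` is a hypothetical name.

```python
import re
from collections import Counter

# Hypothetical valence scores, invented for this example.
VALENCE = {"happy": 8.0, "love": 8.5, "rain": 5.0, "sad": 2.0, "bomb": 1.5}


def happiness_score(tweets, valence=VALENCE):
    """Frequency-weighted average valence of the scored words in the tweets."""
    counts = Counter(
        word
        for tweet in tweets
        for word in re.findall(r"[a-z']+", tweet.lower())
        if word in valence
    )
    total = sum(counts.values())
    if total == 0:
        return None  # no scored words, so no score
    return sum(valence[word] * n for word, n in counts.items()) / total
```

Words missing from the list are ignored, so the score reflects only the vocabulary that has been rated.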
Editorial: Predictive power won't take away the big decisions
* 14 May 2013
"VIOLENCE is the last refuge of the incompetent." So says Salvor
Hardin in Isaac Asimov's celebrated 1951 novel Foundation, which
imagines an organisation dedicated to predicting and reshaping the
course of human history - in this case, the fate of the Galactic
Empire - through the use of a statistical discipline dubbed
psychohistory.
For more than half a century, psychohistory has been no more than a
fantasy. But the power of "big data" makes it conceivable that some
modest version of these predictive powers is achievable. For
example, researchers hope that a new database of geopolitical events
will eventually help them to project how conflicts will play out
(see "World's largest events database could predict conflict").
Forecasting is one thing. Intervention, or prevention, is quite
another. Choosing the appropriate action will be tricky. In Asimov's
book, the cabal of psychohistorians trying to direct events is
reined in by hard-headed politico Hardin. It will be interesting to
find out if his real life counterparts share his view on the use of