Automated Analytics

Posted on August 15, 2017
As social media use continues to expand globally,1 the range of questions that SOCMINT (social media intelligence) can answer grows commensurately. SOCMINT gives consumers a variety of tools for forecasting future events and maintaining situational awareness at unprecedented scale and speed within this growing domain. Generally speaking, SOCMINT techniques attempt either to isolate data points (Tweets on Twitter, Facebook posts, videos posted to YouTube, etc.) from a corpus (a large set of data points) or to analyze sub-groups of data points relevant to particular groups or movements. Based on the number and geographic concentration of data points, analysts can quickly answer specific questions without conducting costly and cumbersome public polls, scheduling interviews of representative individuals, or developing human sources. In the private sector, investors have used SOCMINT to inform hedge fund trading strategies.2 Researchers have categorized indicators of homicide on Facebook3 and identified the start of pandemics faster than previous methods.4
Social media platforms produce sizable corpora that analysts can use to assess public reactions to unfolding events, but there are significant technological and methodological challenges to making sense of vast amounts of social media data in a multitude of formats. One of the defining challenges for SOCMINT is the ability to derive meaningful analysis from an extremely large corpus, which includes misspellings, sarcasm, regional patterns of linguistic expression, and constantly evolving idioms. Adding to this challenge are the complexities of human behavior in angry mobs, understanding public perceptions of foreign political leaders, and identifying complex adversary deception efforts.5 Conducting close examination of a manageable sample of social media data points as representative of an unmanageable whole is one way to glean analytic value from these corpora. Of course, this approach suffers from the same methodological problems as sampling in voluntary polling. Most significantly, the analytic value of any conclusions drawn from a sample depends entirely on how representative that sample is of the corpus in question.
Ideally, analysts could eliminate the need to closely analyze a corpus sample by devising an automatic process that selects and analyzes every data point within the corpus relevant to the analyst’s question. This automatic process must be a fusion of quantitative data science (the how) and the characterization of human behavior provided by humanistic academic disciplines (the why). Generally speaking, there have been few attempts to apply the insights of academic disciplines, such as political science, psephology (the study of elections and voting trends), anthropology, and social psychology, to data aggregation techniques.6 In order to apply meaningful analysis to amounts of data far too large for human analysts, there must be a way to set the terms of algorithmic filtering and compilation of data so that it produces meaningful, predictive insights.
Using NLP (natural language processing) in SOCMINT is one such solution. NLP is an emerging technique in data science and computational linguistics that has the potential to provide the necessary fusion of automation and predictive analysis of human behavior to effectively analyze an entire corpus of social media data. It is a machine learning process that uses continually refined algorithms to select social media data points based on the language they contain. The “markup” process refines these algorithms: analysts provide the algorithm with examples of the language to select by manually assessing the meaning of the language used within data points. After being given a number of examples, the NLP algorithm makes decisions at high speed about the statistical correlation of data points unseen by the analyst to the baseline examples provided. The analyst continually “trains” the algorithm by manually associating phrases, hashtags, emoticons, etc., with the semantics relevant to the question he is trying to answer.7
Because this process is ultimately based on an individual’s association of data points with the meaning expressed, NLP can be conducted in any language and within any culture, limited only by the analyst’s skill in interpreting individual data points. Provided the analyst is well-versed in how to interpret various data points (i.e., he understands the humanistic disciplines’ ability to answer the question at hand), NLP used in SOCMINT provides a method to apply semantic understanding to the syntactic process of filtering massive amounts of data.
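The markup-and-train loop described above can be sketched with a minimal naive Bayes text classifier. This is an illustrative toy, not any particular SOCMINT tool; the posts and labels are hypothetical, and a real system would train on thousands of analyst-marked examples.

```python
from collections import Counter
import math

# Hypothetical "markup" step: an analyst manually labels a handful of
# posts as relevant or irrelevant to the question at hand.
labeled_posts = [
    ("protest downtown turn out and march", "relevant"),
    ("rally tonight bring signs march", "relevant"),
    ("great pizza at the new place", "irrelevant"),
    ("watching the game tonight", "irrelevant"),
]

def train(examples):
    """Count word frequencies per label -- the 'training' step."""
    word_counts = {"relevant": Counter(), "irrelevant": Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Score an unseen post against the analyst-provided baseline."""
    vocab = len({w for c in word_counts.values() for w in c})
    scores = {}
    for label in label_counts:
        total = sum(word_counts[label].values())
        # Log-probability with add-one smoothing for unseen words.
        score = math.log(label_counts[label] / sum(label_counts.values()))
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) / (total + vocab))
        scores[label] = score
    return max(scores, key=scores.get)

word_counts, label_counts = train(labeled_posts)
print(classify("big march planned downtown", word_counts, label_counts))  # relevant
```

In practice the analyst keeps feeding corrections back into the training set, which is the continual refinement the markup process describes.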
Some examples of NLP techniques to automatically filter content include:
• Word sense disambiguation: using related words within a data point to derive context and subsequent semantic analysis for a word that may not immediately be recognizable to the algorithm;
• Named entity recognition: the ability of the algorithm to identify proper nouns based on capitalization (at least in English);
• Entity classification: using specific terms to categorize nouns as individuals, organizations, etc.;
• Part of speech tagging: recognizing parts of speech that can be used to cross-reference rules about sentence syntax;
• Relationship analysis: determining related terms from across sentences;
• Event analysis: automatically classifying activity in posts based on recognizing the relationship between select predicates and verbs;
• Co-reference resolution: identifying disparate terms that refer to the same entity.8
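As a concrete illustration of the capitalization heuristic behind named entity recognition, a crude rule-based tagger might look like the following. The sentence is hypothetical, and real systems combine many more features than capitalization alone.

```python
import re

def naive_entity_recognition(sentence):
    """Flag capitalized tokens as candidate named entities (English only).

    Skips the sentence-initial word, since it is capitalized regardless
    of whether it is a proper noun.
    """
    tokens = re.findall(r"[A-Za-z']+", sentence)
    return [t for i, t in enumerate(tokens) if i > 0 and t[0].isupper()]

print(naive_entity_recognition("The spokesman said Maria left Singapore on Tuesday."))
# ['Maria', 'Singapore', 'Tuesday']
```

A fuller system would merge consecutive capitalized tokens into multi-word entities and hand the candidates to entity classification to sort individuals from organizations and places.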
Any number of behavioral insights from humanistic academic disciplines can inform the selections made by an analyst during the markup process. The most effective contribution a human analyst can make to the markup process is completely contingent on the question the analyst is trying to answer. Current commercial uses of NLP in SOCMINT are broad. Researchers from HP (Hewlett-Packard) used NLP in social media to gauge public opinion of political candidates in Singapore9 and to predict box office success for films.10 Many companies have used a variant of NLP called sentiment analysis to gauge customers’ feelings toward their products based on their posts on social media platforms. This technique uses NLP both to filter relevant data points and to ascribe a category to them (positive, somewhat positive, negative, etc.) in order to inform marketing strategies or to allocate resources for improving product quality.11 One company applied NLP to social media content to identify worldwide media piracy trends.12
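A minimal sketch of lexicon-based sentiment analysis follows, assuming a tiny hypothetical word-score lexicon; production systems use lexicons of thousands of scored terms or trained classifiers.

```python
# Hypothetical sentiment lexicon: word -> score. Real lexicons also
# handle negation ("not good") and intensifiers ("very bad").
LEXICON = {"love": 2, "great": 2, "good": 1, "bad": -1, "awful": -2, "hate": -2}

def sentiment(post):
    """Sum per-word scores and bucket the total into coarse categories."""
    score = sum(LEXICON.get(word, 0) for word in post.lower().split())
    if score >= 2:
        return "positive"
    if score == 1:
        return "somewhat positive"
    if score <= -1:
        return "negative"
    return "neutral"

print(sentiment("I love this phone, great battery"))  # positive
print(sentiment("awful service"))                     # negative
```

Aggregating these per-post labels over a whole corpus is what lets a company track customer attitude toward a product over time.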
This is not to say there are not significant methodological challenges to using NLP within SOCMINT. One of the most pressing is the automatic filtering of data points within the corpus by privacy settings within the social media platform itself. The API (application programming interface) of each social media platform determines which data points are accessible to external computers, and most often the API restricts access based on users’ privacy settings.13 Facebook, for instance, only gives researchers or intelligence analysts access to geo-location, textual content, and association with other profiles if that information is made public by the user’s privacy settings. Clearly, the automatic filtering of Facebook’s API introduces a serious bias to the corpus before NLP can even begin its analysis. Twitter’s API, on the other hand, provides the text of the tweet, the tweet’s creation date, the tweet’s geo-location, the user’s number of followers, and the number of tweets sent by the account.14 Even setting aside the significant legal and diplomatic ramifications, bypassing the API on a social media platform could demand significant resources before NLP is even employed on the corpus in question.
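The biasing effect of API-level filtering can be illustrated with a toy simulation. The posts, fields, and privacy flags below are hypothetical and do not reflect any platform’s actual API schema; the point is only that the corpus is pruned before NLP ever sees it.

```python
# Hypothetical corpus: what the platform stores internally.
raw_corpus = [
    {"text": "roads flooded near the bridge", "geo": (14.6, 121.0), "public": True},
    {"text": "power is out again", "geo": (14.7, 121.1), "public": False},
    {"text": "shelter open at the school", "geo": None, "public": True},
]

def api_view(corpus):
    """Simulate an API that drops private posts entirely -- the filtering
    happens upstream of any analysis, so the analyst never sees what
    was removed."""
    return [{"text": p["text"], "geo": p["geo"]} for p in corpus if p["public"]]

accessible = api_view(raw_corpus)
coverage = len(accessible) / len(raw_corpus)
print(f"{coverage:.0%} of the corpus reaches the analyst")  # 67% of the corpus reaches the analyst
```

If privacy-conscious users differ systematically from public posters, the missing third is not random noise but a structured blind spot in every downstream conclusion.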
Another potential difficulty in effectively applying NLP to SOCMINT is the requirement that intelligence analysts be vigilant against the introduction of bias or logical fallacies during the markup process. Failure to do so may set conditions for the NLP algorithm to return an inaccurate analysis of the corpus relative to the relevant question. Take, for example, an analyst faced with the question of whether the population of a particular community is generally sympathetic to the cause of a political dissident group. Suppose the analyst, by his own assessment of previous reporting, already thinks the population in question is very sympathetic to the dissident group. It is conceivable that he may fall subject to confirmation bias and, during the markup process, select only data points within the corpus that indicate sympathy with the dissident group. Consequently, the self-refinement of the algorithm would increasingly rule out data points within the corpus indicating the population was not, in fact, sympathetic to the dissident group. Though these types of errors and biases can be corrected by rigorous analytic discipline, human error remains a hindrance to the ability of NLP to accurately return an unbiased assessment of the corpus in question.
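This confirmation-bias failure mode can be demonstrated with a toy keyword scorer; the posts and markup choices are hypothetical.

```python
from collections import Counter

posts = [
    "we support the movement",
    "the movement speaks for us",
    "the movement does not represent me",
    "tired of constant disruptions",
]

def build_lexicon(marked_up):
    """The model's vocabulary is just whatever the analyst chose to mark up."""
    lexicon = Counter()
    for text in marked_up:
        lexicon.update(text.split())
    return lexicon

# Biased markup: the analyst, expecting sympathy, marks only the first
# two (sympathetic) posts as examples of 'sympathy'.
biased = build_lexicon(posts[:2])

def score(post, lexicon):
    """Higher score = more 'sympathetic' according to the marked-up examples."""
    return sum(lexicon[word] for word in post.split())

for post in posts:
    print(post, "->", score(post, biased))
```

Because the critical post shares the words "the movement" with the sympathetic examples, it still scores as sympathetic; the bias in the markup is baked into every subsequent automated judgment.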
Though there are methodological and analytic difficulties associated with using NLP in SOCMINT, it also holds great promise for adding a measure of clarity to the amorphous domain of social media. Setting aside a significant discussion about international political sensitivities to SOCMINT techniques operating on foreign social media platforms, potential military applications of NLP in SOCMINT are almost boundless. As a means to an end, applications of NLP to social media corpora are limited only by access to those corpora and the skill of analysts in refining algorithms to return data points that answer relevant questions. Using NLP to analyze social media traffic in relation to geographic data within posts can greatly reduce emergency response time.15 Using NLP to filter geo-tagged posts in real time about failing infrastructure, crowds of displaced persons, or social unrest can greatly speed commanders’ decisions about where to most efficiently allocate resources during humanitarian assistance/disaster relief efforts. NLP can also be used to gauge the propensity for violence in demonstrations in the vicinity of U.S. Embassies around the world and may be a key indicator in forcing a decision about whether to evacuate or reinforce those facilities.
Within identity intelligence, using NLP to identify VEO (violent extremist organization) cells, websites, and locations can greatly assist link analysis, finding previously unidentified nodes within VEO networks based on social media content. When data points are geo-tagged, NLP can also be used to identify the areas where VEOs might have fertile recruitment communities by recognizing concentrations of sympathetic posts. This can be an invaluable tool for directing longer term collection efforts and the development of human intelligence source networks.
Using NLP to locate pockets of dissent within an adversary’s population can be useful in directing resource allocation for psychological operations. Using NLP to gauge the geographic concentration of that dissent and the reasoning behind it can inform lines of effort on the most effective, legal way to break the population’s will to continue supporting its government’s war effort. NLP can also be useful in corroborating other intelligence disciplines in determining which lines of effort in military deception operations will be the most effective in achieving the commander’s desired end state.
Opening the aperture wider to the strategic level, using NLP to gauge the reaction of a foreign state’s population to their government’s actions can provide a new source of intelligence to inform foreign policy on particularly contentious areas. Using NLP to analyze a population’s reaction to their government’s actions on social media can also be used to forecast that government’s response (or complications arising from their lack thereof). For example, using NLP to effectively measure the percentage of a population that disapproves of a particular parliamentary referendum can inform a decision about which cooperative strategy to pursue during diplomatic discourse. In an adversarial sense, the same type of analysis can inform the analyst which choice of rhetoric to exploit during highly publicized international diplomatic forums to achieve the desired international and domestic pressure on an adversary’s government.
It is also worth noting that there is nothing stopping a competent and determined adversary from using NLP for SOCMINT in a similar manner. Adversary governments can use NLP as a means to more efficiently develop their own human intelligence networks. There is also nothing stopping adversary governments from using NLP in SOCMINT to assist humanitarian aid missions meant to challenge U.S. regional influence and commitment. They could also certainly use it to guide their own military deception or psychological operations. VEOs could bolster recruitment efforts by locating both individuals and geographic areas where their ideology will likely resonate. It is also not outside the realm of possibility for VEOs to develop the ability to use NLP in SOCMINT to gauge where domestic terror attacks will be most effective or have the greatest psychological impact. They could then use this information to maximize the effect of their limited budgets for media production.
In all these examples, it is important to note that using NLP for SOCMINT should not be counted on to provide definitive evidence or analytic insight as a single source. Varying accuracy rates for algorithms,16 disparate social media corpus structures that interfere with the algorithm,17 and simple analyst error during the markup process can all lead to results that misdirect situational awareness or analysis of cause and effect. Even with theoretically perfect accuracy, NLP in SOCMINT is still analyzing the attitudes and possibly intentions of imperfectly predictable human beings. Given the unpredictable nature of SOCMINT and the current state of NLP, it would be unwise to use NLP in SOCMINT without corroboration from other intelligence disciplines. That being said, NLP has the potential to be extremely effective in directing deeper analytic efforts or deciding where to allocate resources in time-sensitive contingency operations. There are few indications this technology will cease to improve in both accuracy and scope, and the ability of NLP in SOCMINT to economize intelligence efforts could provide a crucial advantage in achieving information dominance over an adversary.
Though there are certainly limitations and difficulties in using NLP for SOCMINT, it has the potential to be a game changer for commanders by rapidly providing situational awareness and serving as a basis to more efficiently allocate corroborative intelligence collection efforts in almost innumerable contexts. The good news is that creative programming innovation and an acute understanding of human behavior, not daunting overhead costs, drive the effectiveness of NLP in SOCMINT. The bad news is that creative programming innovation and an acute understanding of human behavior, not daunting overhead costs, drive the effectiveness of NLP in SOCMINT. The question is whether the Marine Corps or its adversaries will be the first to establish an advantage in both training analysts and developing NLP algorithms that will tremendously expand the capability of SOCMINT to support future operations.
1. Shannon Greenwood, Andrew Perrin, and Maeve Duggan, “Social Media Update 2016,” Pew Research Center, (Online: November 2016), available at http://www.pewinternet.org.
2. Tim O. Sprenger and Isabell M. Welpe, “Tweets and Trades: The Information Content of Stock Microblogs,” Social Science Research Network, (Online: November 2010), available at https://www.papers.ssrn.com.
3. Jan Hoffman, “Trying to Find a Cry of Desperation Amid the Facebook Drama,” The New York Times, (Online: February 2012), available at http://www.nytimes.com.
4. Alessio Signorini, Alberto Maria Segre, and Philip M. Polgreen, “The Use of Twitter to Track Levels of Disease Activity and Public Concern in the U.S. during the Influenza A H1N1 Pandemic,” PLoS ONE, (Online: May 2011), available at http://journals.plos.org.
5. Eyal Pascovich, “Intelligence Assessment Regarding Social Developments: The Israeli Experience,” International Journal of Intelligence and CounterIntelligence, (Online: 2013), available at http://lecturers.haifa.ac.il.
6. David Omand, Jamie Bartlett, and Carl Miller, “Introducing Social Media Intelligence (SOCMINT),” Intelligence and National Security, (Online: September 2012), available at https://www.researchgate.net.
7. Jamie Bartlett and Louis Reynolds, “State of the Art 2015: A Literature Review of Social Media Intelligence Capabilities for Counter-terrorism,” Demos, (Online: September 2015), available at http://apo.org.au.
8. Alan Morrison and Steve Hamby, “Natural Language Processing and Social Media Intelligence,” PricewaterhouseCoopers, (Online: 2013), available at http://www.pwc.com.
9. Jianshu Weng, Yuxia Yao, Erwin Leonardi, and Francis Lee, “Event Detection in Twitter,” HP Laboratories, (Online: July 2011), available at http://www.hpl.hp.com.
10. Sitaram Asur and Bernardo Huberman, “Predicting the Future With Social Media,” HP Laboratories, (Online: March 2010), available at http://www.hpl.hp.com.
11. Rohini Srihari, “How can NLP Technology be used for Marketing?,” SmartFocus, (Online: Undated), available at http://www.smartfocus.com.
12. Morrison and Hamby.
13. Bartlett and Reynolds.
15. Omand, Bartlett, and Miller.
16. Bartlett and Reynolds.
17. Morrison and Hamby.