Oraclum blog (https://oraclum.eu/) | Election Data Analysis

Post-election uncertainty: exactly as foretold
https://oraclum.eu/post-election-uncertainty-exactly-as-foretold/
Wed, 04 Nov 2020
What comes next?

As many of our regular readers and followers know, this year we did not go public with our election predictions because we had a few high-paying clients who wanted the results before anyone else. This is why we opened our predictions to anyone who wanted to buy them for as little as $50 or $100 (many people did, which made us very happy that there is an actual market for this).

Those who bought our predictions knew this was coming. This is a quote from our election report available on November 3rd on our website:

If the results stay this way — a closer than expected Biden victory — we are looking at a high probability of post-electoral uncertainty and a contested election scenario. In fact, 64% of our users anticipate a contested election. The reason is simple: if Trump holds on to wins in FL and AZ (with NC also borderline), this will be known already on Election Night which means waiting for vote counts in PA and WI to confirm the winner of the race. In both of these states Biden is in front (and has been continuously since the start of our polling), however it will take time before this is confirmed which could result in continued market uncertainties next week.

And this is where we are right now. Waiting for PA and WI (and even MI) to clear the win for Biden. But it was always going to be a close, nail-biting outcome, nowhere near the Blue Wave victory that most were predicting.

Our predictions are so far on the money. We correctly anticipated Florida (which, again, no mainstream pollster, model, or aggregator got right), Ohio, Texas, Iowa, and most likely Georgia going for Trump. So far we have missed only Arizona (we called it for Trump but it is most likely going to Biden), and probably North Carolina (where Trump is in front).

We also predicted a contested election with a high likelihood of the final outcome hinging on vote counts in Pennsylvania and Wisconsin (and also Michigan). The lead that Trump currently has there is misleading, as many mail-in votes have not been counted (especially in PA and MI). So we have to wait, probably until Friday, before the outcome becomes more obvious.

Having said that, we are still confident in our prediction that Biden will take at least two of those three states. The Electoral College distribution will look very similar to the one we predicted in our final report:

Read our full Premium report (sent to our high-paying clients) and see what you missed. We explain different possible scenarios, the probability numbers, our method, likely impact on markets, and our comparison to others.

As for the polling error, we explained this two months ago as well. It's not the Shy Trump voter that's the problem; it's non-response bias. And you cannot fix it by adjusting models (e.g. towards lower-educated voters) that are wrong to begin with.

Which swing states should we focus on in the 2020 election?
https://oraclum.eu/which-swing-states-should-we-focus-on-in-the-2020-election/
Sat, 24 Oct 2020
Oraclum’s US election survey is up and running. If you’re from the US you can access it here and give a prediction on who wins in your state. Try it, it’s fun!

If you want to know who is leading the race order one of our election prediction packages.

Swing states — focus on three in particular

Every US election is decided by a handful of swing states. Typically, states like Florida, Ohio, and Pennsylvania carry significant importance because of the number of electoral college votes they hold. This election is no different.

In this election cycle we estimate the crucial states will be Florida (FL), Pennsylvania (PA), Ohio (OH), North Carolina (NC), Michigan (MI), and Arizona (AZ). In addition to these six key swing states we are also looking closely at Iowa (IA), Wisconsin (WI), New Hampshire (NH), Nevada (NV), Colorado (CO), New Mexico (NM), Georgia (GA), and Texas (TX).

There is a reason for focusing on each of these. The typically important swing states in each election are PA, FL, OH, and NC, but this time MI and WI are also in play due to Trump winning there back in 2016. These two states are particularly interesting as they used to be Democratic strongholds, but in 2016 they delivered one of the biggest surprises on Election Night. This cycle a reverse scenario might happen in GA or even TX, typically Republican strongholds, where polling suggests a much tighter race between the two candidates than usual.

However, the crucial swing states which we will pay particularly close attention to will be the following three: Florida, North Carolina, and Arizona.

The reason: mail-in voting due to COVID

81 million absentee ballots were requested by voters (38% of the electorate), and thus far 13 million ballots have been cast (6% of the electorate). Most of these were cast in Florida (5.7m), followed by Michigan (2.8m), Pennsylvania (2.6m), etc. The problem is the partisan distribution of mail-in ballots during this election. Many more absentee ballots were requested by Democrats than by Republicans, which could bias the results on Election Night in favor of Trump in those states that count absentee ballots on or after Election Day.

But not the three aforementioned states. These states will be processing and counting their mail-in ballots long before Election Day, which means there will be no delay on Election Night and the winner in these states will already be known on November 3rd. The same holds for Georgia and Nevada, although there is less uncertainty over who carries those states. Ohio, a state which every winning presidential candidate typically wins, is also starting early, but it could experience a delay in results and is more likely to have full results by Wednesday rather than Tuesday. For other states, like Pennsylvania or Wisconsin, we might wait for weeks before we know the actual results.

Therefore, by focusing on the aforementioned three states (FL, NC, and AZ) in addition to OH, we will be able to anticipate whether or not the election will cause huge post-electoral uncertainty over the outcome.

For example, if Biden secures a sweeping victory in each of these, it is highly unlikely that there will be uncertainty over the final outcome after Election Night. If, however, these states are split between the candidates, or the margins of victory are very low, then the electoral uncertainty might drag on for weeks, or even months.

If Trump wins in all three on Election Night, then the outcome of the election will depend on the results in PA and WI, states that will take a long time to count their mail-in ballots. In that case uncertainty could drag on for months.

To stay ahead of the curve and figure out the impact of the elections on markets and your business, order our election prediction reports.

Comparing the 2020 US election polls & predictions
https://oraclum.eu/comparing-the-2020-us-election-polls-predictions/
Thu, 22 Oct 2020
What the others are saying

UPDATE: October 22nd 2020

In the 2016 election the accuracy of our prediction was all the more impressive given the failure of every single benchmark we compared ourselves to. This election we will once again follow the same benchmarks and compare our predictions to theirs.

Polling aggregators

The benchmarks are separated into several categories. The first includes sites that use a particular polling aggregation mechanism. Namely, Nate Silver’s FiveThirtyEight, the Princeton Election Consortium, the Real Clear Politics average of polls, PollyVote, the Upshot, and The Economist. For each site we track the probability of winning for each candidate (if given), their final electoral vote projection, and their projected vote share. The specific methodology for each of these can be found on their respective websites, with each of them making a commendable effort in the election prediction game (except RCP, which is just a simple average of polls).

Source: The Upshot

Models

There are two kinds of election prediction models we look at. The first group are political-analyst based models done by reputable non-partisan websites analyzing US elections: the Cook Political Report and Sabato’s Crystal Ball. Each is based on a coherent and sensible political analysis of elections. Here we only report the electoral college predictions with the tossup seats as given in their report. These models do not give out probabilities or vote share predictions.

Prediction markets & betting odds

Next are prediction markets. Prediction markets have historically been shown to be even better than regular polls at predicting the outcome (except in the previous election, where they gave Clinton on average a 75% probability of winning). Their success is often attributed to the fact that they use real money, so that people actually “put their money where their mouth is”, meaning they are more likely to make better predictions.

Superforecasters

Finally, we compare our method against the superforecaster crowd of the Good Judgement Project. “Superforecasters” is a colloquial term for participants in Philip Tetlock’s Good Judgement Project (GJP). The GJP was part of a wider forecasting tournament organized by the US government agency IARPA following the intelligence community fiasco regarding the WMDs in Iraq. The government wanted to find out whether there exists a more reliable way of making predictions that would improve decision-making, particularly in foreign policy. The GJP crowd (all volunteers, regular people, seldom experts) significantly outperformed everyone else several years in a row. Hence the title: superforecasters (there are a number of other interesting facts about them; read more here, or buy the book). However, superforecasters are only a subset of the more than 5,000 forecasters who participate in the GJP. Given that we cannot really calculate and average out the performance of the top predictors within that crowd, we have to take the collective consensus forecast of all the forecasters in the GJP.

Who Wins in 2020?
https://oraclum.eu/who-wins-in-2020/
Tue, 08 Sep 2020

Back in 2016 we were a team of scientists that used a novel methodology to successfully predict Brexit and Trump, both within a single percentage point margin of error. It was a combination of our wisdom of crowds survey and a network analysis of friendship links on Facebook & Twitter (methodology described below).

Four years later, we are once again making our prediction, this time for the 2020 US election.

Results from 2016

We will start with our survey soon, focused primarily on the key US swing states (PA, FL, OH, NC, MI, WI — all states where Trump won in 2016 and which delivered him the electoral college victory, but also GA, VA, IA, CO, AZ, NV, TX).

In the meantime, just like in 2016, we will track “the competition” both here and on our election blog: the polling aggregators (like FiveThirtyEight, the Upshot, RCP, PollyVote, etc.), the prediction models (Cook Political, Sabato’s Crystal Ball), the betting markets (Iowa EM, PredictIt), and, my personal favourite benchmark, the superforecasters, all of which were way off in their predictions in 2016. Clinton was given an average 89% chance of winning, while not a single polling aggregator or prediction model gave PA, FL or NC to Trump, all of which we correctly called in his favour (see map below).

Our prediction in 2016: we called all the key swing states within a 1% margin!
(including PA, FL, NC, OH, VA, IA, and CO)

In addition, we also correctly called all the other major states, including OH, VA, IA, AZ, NV, CO, and NM. We were only wrong about WI and MI, the sole reason being that we did not place enough emphasis on these states in the survey.

Implications for 2020

Back in January the narrative was clear: the Trump team was running a data-savvy campaign, emulating its 2016 approach, except this time, instead of social media, it was all about utilizing text messages on WhatsApp and geofencing. The Democrats, on the other hand, were said to be losing their digital edge, particularly on social media, were targeting the wrong voters, and were increasingly criticized for becoming detached from the average American. Trump’s approval ratings had started to increase, and he was performing well in all the key swing states that he won back in 2016. All early signals were pointing to his victory.

Six months later, the COVID-19 outbreak and all of its consequences have seriously undermined that narrative. Now Biden is firmly in the lead across the majority of nation-wide polls, but with knowledge of past polling errors (particularly the ones in 2016), the uncertainty surrounding this election is even greater than it was in 2016. How come?

Can you trust the polls?

For one, trust in pollsters has been seriously eroded ever since Brexit and Trump. Polling errors are usually magnified during election times, and pollsters worldwide are still struggling to obtain representative samples. Furthermore, there is a prominent hypothesis about the so-called Shy Trump voters (or the silent majority), i.e. people who conceal their true preference, either by saying they are undecided, by misrepresenting who they support (one study found Republicans and Independents to be twice as likely not to give their true opinion to a pollster), or simply by choosing not to respond to the poll at all.

However according to this paper there is little evidence of “shy voters” causing any substantial polling errors:

Generally, there is little evidence that voters lying about their vote intention (so-called ‘shy’ voters) is a substantial cause of polling error. Instead, polling errors have most commonly resulted from problems with representative samples and weighting, undecided voters breaking in one direction, and to a lesser extent late swings and turnout models.

Non-response bias

This is where the main problem lies: non-response bias in standard polls. Or in simple terms: fewer and fewer people responding to polls. The response rate is the number of people who agree to give information in a survey divided by the total number of people called.

According to Pew Research Center, a prominent pollster, and the Harvard Business Review, response rates have declined from 36% in 1997 to as little as 9% in 2016. This means that in 1997, in order to get, say, 900 people in a survey, you had to call about 2,500 people. In 2016, in order to get the same sample size, you needed to call 10,000 people.
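The arithmetic behind those figures is straightforward; a minimal sketch of the call-count calculation above (the `calls_needed` helper is just for illustration):

```python
# How many calls a pollster must make to reach a target sample size,
# given the share of people who actually respond.

def calls_needed(target_sample: int, response_rate: float) -> int:
    """Number of people to call to expect `target_sample` respondents."""
    return round(target_sample / response_rate)

# 1997: a 36% response rate
print(calls_needed(900, 0.36))  # -> 2500
# 2016: a 9% response rate
print(calls_needed(900, 0.09))  # -> 10000
```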

Random selection is crucial here (because the sample mean in random samples is very close to the population mean), and pollsters spend a lot of money and effort to achieve randomness even among the 9% who do respond. But whether this can be truly random is an entirely different question. Such low response rates almost certainly make the polls subject to non-response bias. This type of bias significantly reduces the accuracy of any telephone poll, making it more likely to favour one particular candidate, because the poll captures the opinion of particular groups only, not the entire population. Online polls, on the other hand, suffer from self-selection problems and are by definition non-random, and hence biased towards particular voter groups (younger, urban populations, usually also better educated).

Following the above example, assume that after calling about 10,000 people in 2016 and only getting 900 (correctly stratified and supposedly randomized) respondents, the results were the following: 450 for Clinton, 400 for Trump, and 50 undecided (assuming, for simplicity, no other candidates). The poll would then say that Clinton is at 50%, Trump at 44.4%, and that 5.6% are undecided, and it would conclude that, because the sampling was random (or because the model did a good job of reweighting the sample), the average of responses for each candidate in the sample is likely to be very close to the average in the population.
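The hypothetical poll's shares are simple proportions; a small sketch of that computation (the `shares` helper is illustrative):

```python
# Convert raw respondent counts into percentage vote shares. The pollster's
# conclusion rests on the assumption that these sample shares track the
# population shares -- which is exactly what non-response bias undermines.

def shares(counts: dict[str, int]) -> dict[str, float]:
    """Percentage share for each option, rounded to one decimal."""
    total = sum(counts.values())
    return {k: round(100 * v / total, 1) for k, v in counts.items()}

sample = {"Clinton": 450, "Trump": 400, "Undecided": 50}
print(shares(sample))  # -> {'Clinton': 50.0, 'Trump': 44.4, 'Undecided': 5.6}
```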

But it’s not. The low response rate suggests that some of those who do intend to vote simply did not want to express their preferences. Among the 9,100 non-respondents, the majority are surely people who dislike politics and hence will not even bother to vote (turnout in the US is usually between 50 and 60%, meaning that almost half of eligible voters simply don’t care about politics). However, among the rest there are certainly people who will in fact vote, but are unwilling to say so to the interviewer directly, for a number of reasons (lack of trust being the main one). This was one of the reasons why, as we found, pollsters in 2016 systematically underestimated Trump by 4.3% on average across all 50 states.

This poses a serious problem for the polling industry, as pollsters can no longer rely on standard statistical methods to deliver accurate predictions of trends (as they used to).

How can we fix this? Use an alternative method that does not depend on having a representative sample to predict voter (or consumer) preferences.

We just so happen to have one.

A new polling methodology

The logic of our approach is simple. Our survey asks respondents not only who they intend to vote for, but also who they think will win, by what margin, as well as who they think other people expect to win. It is essentially a wisdom-of-crowds concept adjusted for groupthink.

The wisdom of crowds is not a new thing; it has been tried before. But even pure wisdom of crowds is still not enough to deliver a correct prediction. The reason is that people can fall victim to group bias if their only sources of information are polls and like-minded friends. We used social network analysis to overcome this effect. By using Facebook and Twitter to see how you and your friends predict the election outcome (only if your friends also complete our survey, all of it completely anonymously), we were able to recognize whether you belong to groups where groupthink bias was strong. People living in bubbles (homogeneous, like-minded groups) tend to see only one version of the truth: their own. This means they are likely to be bad forecasters. On the other hand, people living in more diverse, heterogeneous groups are exposed to both sides of the argument. This means they are more likely to be better forecasters, so we value their opinions more. By performing this network analysis of voter preferences we are able to eliminate groupthink bias from our forecasts and therefore eliminate the bias from polling.

Our solution to the aforementioned traditional issues with online polls is precisely this combination of the wisdom of crowds with network analysis to remove selection and non-response bias. Asking respondents how the people around them think means that we are including a group of people instead of an individual. So all we have to do is correct for each group’s bias, which is easier than correcting for individual bias.

To summarize, when we do a survey on social media this is what it looks like:

  • We poll people on social media to find the best “observers”, who tell us who their friends and other people think will win.
  • Our user-observers then invite their friends to the survey, which enables us to see their preference patterns and measure their group bias (only if the friends complete the survey).
  • We then place a weight on each individual’s predictions based on their group’s bias and draw patterns of behaviour.
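The weighting logic in the steps above can be illustrated with a toy sketch. Our actual weighting scheme is more involved and not shown here; the function names and the simple diversity formula below are illustrative assumptions only:

```python
# Toy sketch: down-weight forecasters whose friend groups are politically
# homogeneous (bubbles), then take a weighted average of predicted vote shares.

def group_diversity(friend_predictions: list[str]) -> float:
    """1.0 for an evenly split friend group, 0.0 for a perfect bubble."""
    n = len(friend_predictions)
    if n == 0:
        return 0.0
    p = friend_predictions.count("Biden") / n
    return 4 * p * (1 - p)  # peaks at 1.0 when p = 0.5

def weighted_forecast(forecasters: list[dict]) -> float:
    """Diversity-weighted average of predicted Biden vote shares."""
    weights = [group_diversity(f["friends"]) for f in forecasters]
    total = sum(weights)
    return sum(w * f["prediction"] for w, f in zip(weights, forecasters)) / total

forecasters = [
    {"prediction": 60.0, "friends": ["Biden"] * 10},                  # bubble: weight 0
    {"prediction": 51.0, "friends": ["Biden"] * 5 + ["Trump"] * 5},   # diverse: weight 1
]
print(weighted_forecast(forecasters))  # -> 51.0
```

The bubble forecaster's optimistic 60% is discarded entirely, while the forecaster with a politically mixed friend group dominates the consensus.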

This methodology has enabled us to accurately predict not only election outcomes (like Brexit, Trump or Macron), but also consumer sentiment and demand, market trends, optimal pricing, and even the economic consequences of the COVID-19 pandemic.

How can you benefit?

So, how do you stay ahead of the curve? How can you reduce the uncertainty over the election outcome and better plan your investing or business strategy?

Follow our blogs, write to us on Twitter, and subscribe to our predictions.

Stay tuned for more!

The post Who Wins in 2020? appeared first on Oraclum blog.

]]>
The summer of lost hopes
https://oraclum.eu/the-summer-of-lost-hopes/
Tue, 21 Jul 2020
At one moment there was hope that summertime will naturally suppress the spread of COVID-19. This hope is now gone and the uncertainty sets in.

In the beginning of June, the COVID-19 trends in the US looked quite promising. The incidence rate (the number of cases per 100,000 people) was about 10 or less, with the pandemic hotspots in secure downward trends. The incidence doubling time was over one month in almost every state, mostly two or more months, the slow trend that was the goal of “curve flattening”. The summer was coming just at the right time, when viral respiratory illnesses typically subside. The situation was perfect for setting up long-term public health policies to combat COVID-19 when fall arrives and for preparing the health system for the inevitable increase in COVID-19 cases by the end of the year. Unfortunately, the situation quickly escalated thanks to political struggles and ignored warnings that the virus was on the loose over most of the country.

The economic and political pressures have been pushing toward faster exits from state lockdowns and restrictions on mobility. At the same time, a lack of consensus on public health measures, such as wearing masks or the severity of the disease, eroded the efforts to use this precious time to promote a “new normal”, i.e. public health measures that would prevent a rapid spread of the virus. The states started to reopen as fast as possible. We can see the exact moment when a state reopened in the data on restaurant reservations registered by the OpenTable network.


Figure 1: The COVID-19 incidence rate by US state as a function of state reopening, visible through the reopening of restaurants. The left panel is for two weeks in April, the right panel for July. Red points are states led by Republican governors, blue points are for Democrats (data from: OpenTable and the New York Times).

Figure 1 shows the incidence rate by state as a function of restaurant reopening. Three remarkable features stand out. First is how in July the incidence rate jumped dramatically in states that reopened early, with Florida and Arizona leading the pack. At the same time, states that opened later maintained a low incidence rate. The second feature is the political divide: the states that opened early are almost all led by Republican governors. The two exceptions, Maryland and Massachusetts, have Republican governors who openly criticize the president and do not follow his political lead. The third feature is the opposite situation in April, when the Republican “early opening” states were in good shape with COVID-19, while the “later opening” states were struggling to suppress the spread of the virus. The animation in Figure 2 illustrates these trends over time.

Figure 2: A time animation (click on the image) of the change in COVID-19 incidence rate by US state as a function of state reopening, visible through the reopening of restaurants. Red points are states led by Republican governors, blue points are for Democrats (data from: OpenTable and the New York Times).

In mid-June the situation started to deteriorate quickly as the number of cases started to rise. The key problem here was that the main guideline for reopening, a downward trend in cases, was largely ignored. If we look at the timeline of the number of infected people in the US (Figure 3) we can see that, when New York and New Jersey are excluded, the US had a constant number of cases. This means the virus was in free circulation, spreading like wildfire.


Figure 3: COVID-19 deaths, infections and various related data: credit/debit card spending, stock market trends, mobility data and restaurant reservations (data sources: NY Times, Opportunity Insights Economic Tracker, OpenTable, Google Mobility).

When these pandemic trends are shown on a map, some misleading impressions can be inferred. An ordinary map does not illustrate how many people are affected, as lots of people live in densely populated urban areas. Therefore, we map the trends on a cartogram, an illustration where the size of a US county on the map is proportional to the fraction of its population in the total US population. This way we get a correct impression of how severe the pandemic is. Figure 4 shows such a map with colors depicting the doubling time of the COVID-19 incidence rate calculated on the data for the two weeks prior to July 14, 2020. It is obvious from such a map how devastating it is to have community spread of the disease in large urban areas. Notice also how the initial hotspot in the larger New York metropolitan area is now the region with the smallest incidence rate. An animated version in Figure 5 shows the evolution of this situation over time, from a scary rise of COVID-19 cases in April to intermediate calmness by the end of May and start of June, and finally the most recent rise in cases (for an interactive version see Figure 6). The animation also helps to visually connect the shapes in an ordinary map with the cartogram.
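The doubling time shown in these cartograms can be computed from two case counts a fixed window apart, assuming exponential growth over the window; a minimal sketch (the `doubling_time` helper is illustrative):

```python
import math

def doubling_time(cases_start: float, cases_end: float, days: float = 14) -> float:
    """Days for the case count to double, assuming exponential growth over a
    `days`-long window (requires cases_end > cases_start)."""
    return days * math.log(2) / math.log(cases_end / cases_start)

# Cases doubling exactly once over a two-week window:
print(doubling_time(10_000, 20_000))  # -> 14.0
# Faster growth (tripling in two weeks) gives a shorter doubling time:
print(round(doubling_time(10_000, 30_000), 1))  # -> 8.8
```

A long doubling time (months) corresponds to the "flattened" curve; a short one (days) marks the hotspots on the map.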


Figure 4: A cartogram of the doubling time of COVID-19 incidence rate calculated on the data for the two weeks prior to July 14, 2020. (data source: Johns Hopkins University; cartogram by F4Carto)

Figure 5: An animated cartogram (click on the image) of the doubling time of COVID-19 incidence rate calculated on two-week timespans. An interactive version is shown in Figure 6 and a mosaic in Figure 7  (data source: Johns Hopkins University; cartogram by F4Carto)

The increase in death rate comes in the wake of rising COVID-19 cases. Figure 3 shows that August will start with over 150,000 deaths in total. The death rate is rising somewhat more slowly than the incidence rate, but this is due to more of the cases being among the younger population and (maybe) the summer weather. The problem is that hospitals are starting to feel the approaching tsunami of cases in need of hospitalization. The hope, now ruined, was that such a scenario could be postponed to the fall, when respiratory illnesses typically start to rise again. Instead, the US is already under a full-blown virus attack while the options to stop it are not only limited, but also deeply politicized.

Under such conditions only total lockdowns can stop the virus, but this is not an option that anyone is willing to take any more. Instead, limited measures will be implemented, such as limits on gatherings or on business activities that cannot avoid close human contact. This means that the economy will suffer dramatically because of such measures and because of suppressed personal spending driven by the fear of long-term economic disruption. Until now, epidemiologists dictated the measures and economists had to adjust. Now economists will dictate the rules of the game and epidemiologists must figure out how to fight the pandemic under the given limitations. The outcome could be even more deaths and maybe political instability. Not only in the US, but also globally, as the virus takes its death toll almost everywhere.


Figure 6: An interactive map (click on the image) showing the doubling time of COVID-19 incidence rate calculated on two-week timespans.


Figure 7: A mosaic of cartograms (click on the image) showing the doubling time of COVID-19 incidence rate calculated on two-week timespans (data source: Johns Hopkins University; cartogram by F4Carto)

Slowdown from lockdown
https://oraclum.eu/slowdown-from-lockdown/
Sat, 09 May 2020
COVID-19’s path of death and analysis of Google mobility data

As the economic and social fallout from the “lockdown” measures takes its toll across the world, the public debate has been increasingly focusing on the rationality of such drastic measures. This debate is both highly relevant and highly emotional. The problem is that everybody is prone to a personal bias in how they combine their risk perception with the consequences of restrictive public health measures. In the following blog we look at data from different countries to quantify some aspects of this debate.

First, we look at the current trends in the number of deaths. In our previous blog post we explained how we approach this modelling, and we gave some predictions in the early days of the rise in the number of deaths. The key dilemma was when the “curve bending” (slowdown in the death trend) would start and how fast it would “bend” (i.e. how fast the death rate would drop). Now that we can follow these curve bendings (sources of data: Johns Hopkins University, the New York Times, data.gov.uk), we see that our predictions were accurate. Keep in mind, though, that we work only with the officially recorded COVID-19 deaths, which we now know is not the total death toll (see the examples for the US here and here), but it represents a self-consistent dataset in the sense that it tracks individuals who were identified by the health system as COVID-19 patients.

Figure 1: Cumulative number of deaths for various countries, starting from the day of 10 recorded deaths. Thick lines are the data, thin lines are models (see our previous blog for the methodology). Interactive version available HERE.

Italy is now headed toward about 33,000 deaths (we initially said it would be close to 40,000; the overestimation here reflects a stronger-than-expected effect of the measures put in place), France to 32,000 (we said it would be very close to Italy, without giving an exact number), Spain to 28,000 (we said it would be slightly lower than 30,000), while the UK unfortunately leads the European countries with a projected final toll of 37,000 deaths (we gave it a big chance of being worse than Italy). Belgium counts probable COVID-19 deaths more carefully than other countries, which leads it to about 11,000 deaths. Germany is projected to reach about 9,000 deaths (we said it would not go over 10,000), the Netherlands about 6,000, and Sweden just over 4,000. However, all these projections, like the ones before, come with a caveat: the model assumes the social distancing measures remain as effective as they have been thus far. Under this assumption, those numbers will be reached during June. The next figure shows these projections in time.

Figure 2: Daily deaths for countries with the highest number of deaths. Thick lines are the data, thin lines are models (see our previous blog for the methodology). Interactive version available HERE.

The United States is above other countries in the number of deaths and the model converges to about 90,000 deaths by early July, which is consistent with the predictions coming from the White House. (EDIT: Since May 7th the US data include both confirmed and probable COVID-19 cases and deaths. This increases our projected death toll to 97,500 by July 2020.) Canada's daily deaths will also nearly stop by early July, but at about 9,000 in total. Interestingly, Canada shows the longest curve bending, which means it is spreading the effects of the pandemic over a longer time period than is typical for other countries. Also, the decreasing trend in daily deaths is similar for several countries (the Netherlands, Germany, Belgium, France, the UK), and this trend was almost the same in the Hubei province of China (where the pandemic started). The US, Italy and Canada have slower trends, but surprisingly Spain has a faster decline than Hubei. It is hard to decipher the exact reasons for the differences in these trends (enforcement of strict measures, the demographics of deaths, the spread of the virus in retirement homes, etc.) just by looking at the numbers.

Google Mobility data analysis

However, what we can do is look at the Google Community Mobility data and compare it with the COVID-19 trends. This mobility data is derived from the aggregated data of Google users who opted in to Location History. It tells us how many people changed their behavior and reduced travel beyond their homes. Comparing this mobility with the COVID-19 trends for one country or one US state does not tell us much. But if we do that exercise for many countries or states, a pattern should emerge.

The debate over the lockdown can be broadly divided into three questions:

  1. Did we need a lockdown in the first place?
  2. What kind of a lockdown should we (have) implement(ed)?
  3. When and how to exit the lockdown?

The patterns that we look for cannot answer the second and third questions, because answering them requires a detailed analysis of social, cultural, demographic and economic differences between countries. But they can give us at least a partial answer to the first question.

Figure 3 shows a comparison of mobility data between countries. We selected mobility at transit stations as the most representative for comparing different countries. We can see that all countries exhibit a drop (notice that Italy has two drops: the first when Northern Italy went under lockdown, the second when this was extended to the whole country). Sweden was the most "liberal" in this regard, while New Zealand was the most radical in its social distancing implementation. Spain and Italy, two countries that were fighting a large health crisis, also show a drastic reduction in the mobility of their citizens.

Figure 3: The Google Community Mobility data for different countries. The data are shifted vertically to start on average at zero. We find the middle of the declining trend for each country and set this as day zero. Interactive version available HERE.
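The day-zero extraction described in the caption can be sketched in a few lines of numpy, assuming a daily mobility series in percent change from baseline. Here `find_day_zero` is our hypothetical helper, which takes the first day the smoothed series crosses the midpoint between its pre-drop and post-drop levels; the published methodology may differ in detail.

```python
import numpy as np

def find_day_zero(mobility, window=7):
    """Estimate "day zero": the middle of the declining trend in a daily
    mobility series (percent change from baseline). Hypothetical sketch."""
    m = np.asarray(mobility, dtype=float)
    # Smooth out day-to-day noise with a rolling mean.
    smooth = np.convolve(m, np.ones(window) / window, mode="valid")
    # Midpoint between the pre-drop and post-drop plateaus.
    midpoint = (smooth[0] + smooth[-1]) / 2.0
    crossing = int(np.argmax(smooth <= midpoint))  # first day below midpoint
    return crossing + window // 2                  # re-center the window

# Synthetic series: flat at baseline, then a 10-day slide to -60%.
days = np.arange(60)
series = np.where(days < 25, 0.0,
                  np.where(days > 35, -60.0, -60.0 * (days - 25) / 10))
print(find_day_zero(series))  # around day 30, the middle of the slide
```

On real Google mobility data one would run this per country (or per US state), e.g. on the transit-stations column.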

From the mobility data we find day zero for each country: the day when the mobility drop happened. Now we can take the COVID-19 data and calculate how quickly the number of infected and deaths increased. In Figure 4 below we plot the doubling time of confirmed COVID-19 cases. It shows a consistent pattern: cases were doubling within days before the mobility drop (day zero), and the spread then started to slow down dramatically after the introduction of the lockdown, no matter the country. Two weeks after day zero all countries had lowered the doubling time to between 5 days and 2 weeks. Three weeks after day zero this time was between 1 and 3 weeks. Nowadays the doubling is happening every 3 weeks or more.
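The doubling times plotted below can be estimated from cumulative counts with a log-linear fit over a trailing window: if counts grow as exp(r*t), the doubling time is ln(2)/r. This is a sketch of the general technique (`doubling_time` is our hypothetical helper, not necessarily the exact procedure behind the figures):

```python
import numpy as np

def doubling_time(cumulative, window=7):
    """Estimate the current doubling time (in days) from cumulative counts,
    using a log-linear fit over the last `window` days."""
    log_c = np.log(np.asarray(cumulative, dtype=float))
    growth_rate = np.polyfit(np.arange(window), log_c[-window:], 1)[0]
    return np.log(2) / growth_rate

# Synthetic outbreak doubling every 3 days.
cases = [100 * 2 ** (t / 3) for t in range(14)]
print(round(doubling_time(cases), 1))  # prints 3.0
```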

Figure 4: The doubling time of confirmed COVID-19 cases on days relative to the drop in mobility (day zero) for each country. Interactive version available HERE.

Death trends lag behind the trends in confirmed cases since the disease takes its course. It is thus not surprising that a similar analysis for the doubling time of the number of deaths tells the same story (Figure 5 below). Notice that even with the deaths that happened after the mobility drop (day zero), the doubling time remains within the overall pattern. This has a sad consequence, illustrated by one detail that we added to the plots: the line thickness corresponds to the total number of deaths. Countries that did not introduce the lockdown early enough ended up with a lot of deaths. This is why lines before and around day zero are so thick. Those countries had the virus spreading like wildfire, meaning that the pool of infected was already enormous by the time they introduced radical public health measures.

Figure 5: The doubling time of COVID-19 deaths on days relative to the drop in mobility (day zero) for each country. Line thickness illustrates the total number of deaths so far. Interactive version available HERE.

Mobility analysis of US states

We did the same analysis for US states, with the exception that we took the Google mobility data on retail and recreation instead of transit stations. Retail and recreation shows more consistent behavior in the US than public transport trends, which is a consequence of the different utilization of public transport across US states. Figure 6 illustrates this: mobility dropped almost identically across all 50 US states. This is a bit surprising given the heated debate in the US about lockdown policies and the different response strategies applied by different states. The data shows that the US population is far more unified on this issue than the daily news would imply.

Figure 6: The Google Community Mobility data for different US states. The data are shifted vertically to start on average at zero. We find the middle of the declining trend for each state and set this as day zero. Interactive version available HERE.

With day zero extracted for each state, we can now show how the number of confirmed COVID-19 cases changes its trend. Figure 7 shows a story identical to the one we got from the cross-country comparison. The US initially had various issues with testing for COVID-19, which introduced a larger variation in doubling times before the mobility drop at day zero. Nonetheless, the overall pattern is very clear: the "lockdown" measures have consistently reduced the spread of COVID-19 across the US.

Figure 7: The doubling time of confirmed COVID-19 cases on days relative to the drop in mobility (day zero) for each US state. Interactive version available HERE.

The doubling time of COVID-19 deaths paints the same picture as before. It is a sobering visualization of how New York went from an almost daily doubling of deaths at the time of lockdown to a four-week doubling period today, one of the slowest in the US.

Figure 8: The doubling time of COVID-19 deaths on days relative to the drop in mobility (day zero) for each US state. Line thickness illustrates the total number of deaths so far. Interactive version available HERE.

What comes next?

The big question now is what comes next. Our analysis illustrates that the public health measures focused on social distancing worked, though the way this was achieved varies between countries. Now the main topic is the relaxation of those measures. This does not necessarily mean that social interactions will bounce back to their pre-COVID-19 levels. If people change their behavior and maintain distancing in their daily contacts, the virus will not be able to return to circulation with its full strength. But if the relaxation of the social distancing measures goes too far, the number of infected and dead will start to rise again.

]]>
COVID-19 Data Myopia https://oraclum.eu/covid-19-data-myopia/ Thu, 09 Apr 2020 19:48:33 +0000 https://oraclum.eu/?p=881 People around the world are hooked on following daily updates of confirmed COVID-19 cases and their related deaths. Extracting various conclusions and projections is omnipresent these days. However, not enough attention is given to understanding what these numbers really mean and what we can learn from them, while at the same time avoiding over-interpretation. Let’s […]

The post COVID-19 Data Myopia appeared first on Oraclum blog.

]]>
People around the world are hooked on following daily updates of confirmed COVID-19 cases and their related deaths. Extracting various conclusions and projections is omnipresent these days. However, not enough attention is given to understanding what these numbers really mean and what we can learn from them, while at the same time avoiding over-interpretation.

Let’s briefly touch upon this issue. The number of confirmed cases is not an easy number to work with. Different countries apply different strategies for how many tests to run and whom to test (e.g. only people with symptoms, or anyone who was in contact with an infected person), and they differ in their capacity to secure enough tests. Not surprisingly, this number can be inconsistent over time and between countries. It does not necessarily represent a consistent fraction of the total number of infected and, therefore, it is hard to draw any conclusion from it except to look at its trends within each country (or US state) separately, while staying informed about possible changes in the testing policy (e.g. the case of faulty tests in the US that was followed by a new round of increased testing efforts).

The number of deaths is a far more informative dataset, but it too must be considered with caution. Even though there are some unified rules on how to attribute deaths to a disease, inconsistencies are possible. More importantly, not all COVID-19 deaths are registered as such. When the virus infects a large fraction of a community, the local medical services collapse under the flood of people in need of hospitalization. The lock-down is then particularly harsh on vulnerable social groups, such as elderly people living at home alone or people with risky medical preconditions. Many people die in their homes without ever being tested for the SARS-CoV-2 virus. Some people also died simply because they could not access their regular medical services. The extent of this effect is a matter of debate, but for illustration, the mayor of Bergamo, a city in Italy's hardest-hit region, claims that the number of COVID-19 victims in his town is 4 times higher than the official numbers. Similar stories come from France, Spain, the UK, Germany, China and the US. On top of that, many undemocratic governments have decided to suppress the true extent of the pandemic in their countries, as they fear political unrest that could topple their regimes.

Regardless of this grim warning about the validity of the official death numbers, the published numbers are intrinsically consistent in the sense that deaths inspected by medical personnel are checked for signs of COVID-19. Thanks to that, we can treat the number of deaths as a proxy for how much the virus has spread through the community. Even if you do not know how deadly the virus is, or whether the number of confirmed cases is trustworthy, you can observe the pressure on hospitals and the number of deaths to see if the epidemic is slowing down or not. What we want to show here is how to apply a typical epidemiological fit to the death growth rate and make estimates of the final death toll and of how long the state measures need to remain in place.

Factors affecting death rates

Before we embark on this exercise, note that there are still a lot of factors affecting the death trends that one should be aware of when comparing countries, or even regions within the same country. For example:

  • There is a delay between the peak of infection and the daily rate of deaths. A disease has its pace of creating symptoms and escalating to life-threatening levels. Not only that — the growth of confirmed cases is also lagging behind the actual growth of infected, which is a stark warning to countries that are still rising fast in their number of confirmed cases.
  • The death growth will strongly depend on the imposed state measures and how strictly the population follows them. Cultural differences can make a dramatic difference between countries, as people react differently to imposed restrictions on freedom of movement or privacy.
  • Different countries (or even regions within a country or a province) have different capacities of their health systems to cope with the potential tsunami of infected people in need of hospitalization. At some point the hospitals will have to start using a medical triage approach and then the death rate will increase simply because many people will die while waiting to be treated. The stories from Italy epitomizing this issue are heartbreaking and should be a strong warning to anyone ignoring the severity of this disease.
  • When the first deaths occur they are essentially random events, typically among people with very risky preconditions, such as oncological patients. Epidemiological trends fitted to such small random samples are not reliable and should be avoided. Hence, the growth of deaths should show a visible trend over several days, preferably a week, before it becomes a useful guide in modelling. Health officials obviously have all the details, including the reconstructed networks of individual contacts, which allow them to build very detailed models of where and how fast the virus is spreading.
  • The public often pays a lot of attention to certain short-term trends. One must be aware that the total number of deaths is a sum over various clusters of infected communities, which results in random fluctuations. Deaths can go up or down for several days before they return to the overall trend.
  • The demographic and socioeconomic conditions within each country or province affect the death rate.
  • Finally, local climate can also be an important factor. The virus spreads with a different rate under different climate conditions.

These and many other disease-related factors make modelling of COVID-19 extremely difficult, quickly escalating into some very complicated math.

The curve fitting

This does not mean we cannot learn something from the daily counts of the infected and deceased. The public is becoming aware of the concept of exponential growth. The spread of disease is a typical example of complex dynamics on a network of social contacts that gives rise to an exponential function. It describes how an epidemic starts if state measures are not imposed immediately: with exponential growth in the number of infected. Thus, we are now painfully aware that doubling the number of cases every two days is a much more severe crisis than doubling it every week. However, this function is not how the disease will eventually evolve.

A virus needs hosts to multiply and if the number of potential hosts to infect is dropping then the disease will not be able to expand exponentially anymore. If nothing is done the virus will eventually infect as many people as it can reach, and this will be the end of the epidemic. Of course, this means a huge number of deaths — a rough prediction is 2.2 million deaths in the US alone if the government did nothing.

We therefore know that governments will do something to reduce the virus’ access to new hosts. The non-medical measures of social distancing and suppression of human gatherings are known to work as they have been used before in other situations (e.g. the responses to the Spanish flu in 1918 were following these same methods, which helped scientists later to explore their efficacy).

A simple way to describe this is to set up a model with three groups of people: those susceptible to the infection, the infectious, and the recovered (thus immune). The result from such a simple epidemic model gives a good sense of what kind of curves we should expect from the counts of the infected, dead and recovered. In simple terms, as the virus loses new hosts to invade, the exponential growth starts to slow down and eventually stops. This is what we now often call the "bending of the curve". The problem is, of course, how to adjust the model to include all the aforementioned factors.
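This susceptible/infectious/recovered setup is the classic SIR model. A minimal sketch with daily Euler steps and purely illustrative parameters (chosen so that R0 = beta/gamma = 3, not calibrated to COVID-19) reproduces the expected shape: exponential growth at first, then the curve "bends" as the pool of susceptibles shrinks, peaks, and declines.

```python
def sir(beta, gamma, n_days, i0=1e-4):
    """Minimal SIR model with daily Euler steps.
    beta: infections per infectious person per day; gamma: daily recovery rate.
    Returns the infectious fraction of the population over time."""
    s, i, r = 1.0 - i0, i0, 0.0
    infectious = []
    for _ in range(n_days):
        new_inf = beta * s * i   # susceptible -> infectious
        new_rec = gamma * i      # infectious -> recovered (immune)
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        infectious.append(i)
    return infectious

curve = sir(beta=0.3, gamma=0.1, n_days=300)
peak_day = curve.index(max(curve))  # the epidemic rises, peaks, then declines
print(peak_day)
```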

A quick-and-dirty trick that we can use is based on the fact that we now have two parts of the world where COVID-19 has been present long enough to see what this curve-bending looks like in real life: the Hubei province in China and Italy. Think of it this way. Imagine you are following the scores of your favorite sports team during a season, but you are kept in the dark about whom they play against or any other statistics of their games. What you have are the scores of teams from previous seasons. What you would do is compare the performance of your team to the trends that various teams showed in previous years. A few games at the start will not reveal much, but as the season progresses you will notice that the range of possible scenarios for the final success of your team becomes increasingly limited.

This is what we wish to show. Instead of a simple exponential curve, we fit an asymptotic regression curve to the number of deaths as a function of time: f(t) = exp(a - (a - b)*exp(-c*t)). This curve is known to provide a good fit to the epidemic spread of diseases. We then use it to make predictions about the alternative futures that lie ahead for various countries.

Then we use the cases of Hubei and Italy to get two sets of (a, b, c) parameters. These represent two possible future scenarios for the US, which give us a range for the final death toll and a time scale for the fight against COVID-19.
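As an illustration of how such a fit can be performed, the sketch below recovers the (a, b, c) parameters of f(t) = exp(a - (a - b)*exp(-c*t)) from a synthetic death curve. For a fixed bending rate c the model is linear in a and b on a log scale, so a grid search over c plus linear least squares suffices. This numpy-only approach and the `fit_gompertz` helper are our own assumptions, not necessarily the authors' exact fitting procedure.

```python
import numpy as np

def fit_gompertz(t, deaths, c_grid=None):
    """Fit f(t) = exp(a - (a - b)*exp(-c*t)) to cumulative deaths.
    exp(a) is the projected final toll, exp(b) the level at t = 0,
    and c the speed of the "curve bending"."""
    t = np.asarray(t, dtype=float)
    y = np.log(np.asarray(deaths, dtype=float))
    if c_grid is None:
        c_grid = np.linspace(0.01, 0.3, 300)
    best = None
    for c in c_grid:
        x = np.exp(-c * t)
        A = np.column_stack([1.0 - x, x])   # log f(t) = a*(1 - x) + b*x
        sol, *_ = np.linalg.lstsq(A, y, rcond=None)
        err = float(np.sum((A @ sol - y) ** 2))
        if best is None or err < best[0]:
            best = (err, sol[0], sol[1], c)
    return best[1], best[2], best[3]        # a, b, c

# Synthetic curve: final toll 40,000, starting near 10 deaths (as in Figure 1).
t = np.arange(60.0)
a_true, b_true = np.log(40_000), np.log(10.0)
deaths = np.exp(a_true - (a_true - b_true) * np.exp(-0.07 * t))
a, b, c = fit_gompertz(t, deaths)
print(round(np.exp(a)))  # close to 40,000, the toll the curve converges to
```

On real data one would fit the observed death counts per country and read off exp(a) as the projected final toll.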

The Hubei province — a radical approach to COVID-19

The story of COVID-19 started in the city of Wuhan in the Hubei province of China. We know that the lock-down of Wuhan started on January 23rd, 2020. Soon, other cities and regions within Hubei imposed the same measure and quickly the entire province was under quarantine. At the same time, a health crisis of epic proportions was happening in the Wuhan hospitals.

What we see in the official data is that the death toll follows our curve (see graph below) all the way to the end of the lock-down two months later. Hence, the case of Hubei is an extreme approach where the economy and regular daily life stopped for two months. Their concern now is how to prevent the re-emergence of COVID-19 due to “imported” cases — the problem that forced even Singapore, one of the most successful countries in the fight against COVID-19, to impose a lock-down a few days ago.

The case of Italy — a warning that many ignored

Even though the situation in China was already dramatic in January 2020, politicians around the world were not willing to risk their political popularity by spreading fear of the new disease. The virus started to spread in Italy in the first half of February, or maybe even earlier. It is not clear how exactly the epidemic started in Northern Italy, but back-tracing of cases led some experts to suspect that the UEFA Champions League game between Atalanta and Valencia may have been the reason why Bergamo became one of the epicenters of the pandemic.

Italy also had bad luck, given that the virus entered its hospitals almost from the beginning. Even though the situation escalated quickly in mid-February, the state measures were not imposed quickly enough to suppress the pandemic. Politicians were faced with the vision of terrible economic losses and they hesitated with the lock-down. In the meantime, the virus had been spreading exponentially. Our fit shows (the black solid line in the graph below) that the measures are now bending the curve toward a final official death toll close to 40,000 people. Hopefully the curve bending will get stronger and somewhat reduce this projection, but currently this is the number that the curve will reach when it completely flattens.

The case of Italy is now interesting for several reasons. First, it is a case of a country with drastic measures whose severity depends on the region. This is a more realistic scenario for other countries than Hubei. It also shows how a lock-down indeed starts to work almost immediately, but it takes weeks for the daily death toll to flatten. It will also take about two months since the start of the lock-down, as it did in Hubei, to establish conditions for lifting the restrictions.

The big European economies in race to avoid the Italian scenario

About ten days after COVID-19 deaths started to climb in Italy, an even worse situation occurred in Spain. When the daily death toll reached 100, the government introduced a lock-down. Aggressive measures to contain the disease were taken, and more radical state measures were introduced as the total death toll approached 10,000 people. The curve is now flattening and our projection puts Spain just below 30,000 deaths in total.

France started its rise in deaths at about the same time as Spain, but at a slower rate. Unfortunately, France is still seeing a disturbing trend, reaching record-high rates of more than 1,000 deaths per day, currently comparable to the US. This latest increase in deaths is due to an unfortunate spread of COVID-19 within retirement homes. The overall trend is still so strong that we cannot make a convincing prediction. This means that the coming days are crucial for bending the curve in France. If the trend does not show convincing slowing very soon, France will end up with numbers worse than Italy's.

Germany saw its rise in deaths slightly after Spain and France, but it took an aggressive approach to testing in order to trace all the infected. Currently, about a million people in Germany have been tested for COVID-19, which makes Germany the global leader in tests per capita. Their efforts are paying off, as our fit shows that their curve has the potential to stop at about 10,000 deaths in total. However, the testing policy is about to get stricter, as shortages of basic testing equipment and reagents are reported.

The UK experienced a sudden rise in deaths at about the same time as Germany, but the situation there is dramatically different. While Germany managed to bend the curve and performs a lot of testing, the UK's initial response was somewhat chaotic, as the government tried to delay a lock-down. Unfortunately, even though the exponential growth of deaths is slower than at the beginning, it does not yet show a convincing bending of the curve. This means that our fit cannot predict where the final death toll will end up. The last few days are encouraging, but looking at Hubei and Italy, it is hard to expect the final death toll to be smaller than Italy's.

The dramatic events in the US

The situation in the US is surprisingly chaotic. The lack of coherent federal policy and a deep political divide created a situation where each state is devising its own COVID-19 policy and competes for medical resources with other states. This makes projections of the final death toll extremely difficult and uncertain. For example, some states have taken urgent and drastic measures that enabled them to slow down the pandemic and avoid escalation. Washington and California are such examples.

After the pandemic was slowed down on the West Coast, the situation escalated on the East Coast. The New York metropolitan area is now hit hardest by the pandemic. Their imposed social measures have started to bend the curve, but our models at this point are too uncertain to make a convincing prediction. However, if more effort is not taken to speed up the curve bending, this region will end up with a large number of deaths (50,000 or more).

We can, however, look at the cases of Hubei and Italy to obtain lower-bound estimates of the total number of deaths that will accumulate before this summer. We project the theoretical curve fits obtained from the Hubei and Italy data and observe how this would play out using US data. The graph below shows that the total number of deaths before the summer could reach anywhere in the range of 80,000 to 180,000. This approach assumes that the curve bending starts immediately for the entire US. The plausibility of this assumption will be revealed in the coming days, but one should be aware that every single day of delay in curve bending increases the final death toll, probably by tens of thousands of people.

These numbers are in agreement with the projections presented by the White House, as well as some more detailed epidemiological models.

The biggest challenge for this best-case scenario is that social distancing and stay-at-home rules have not been implemented in a strategic manner across the entire country. Many state governors did not introduce measures on time, which resulted in a large fraction of the US population travelling extensively just a couple of weeks ago. It remains to be seen how much this helped the virus to spread.

Also, Americans underestimate how long the crisis and the restrictions will last. The difficult period will probably last until at least June, when everyone hopes the warm summer weather will help slow down the virus in addition to the state-imposed measures. It is not clear how people will react once they realize the situation is dragging on for many weeks. This is where good political leadership and social cohesion are crucial. Unfortunately, Americans are still deeply politically divided, as much as they were before the start of this crisis.

The shocking discovery of asymptomatic spread

Today, when countries are struggling to avoid the Wuhan and Italian scenarios, the focus is on "bending the curve". Nonetheless, many ask themselves what comes after that. Obviously, lifting the restrictions is highly desirable from an economic point of view, but disturbing new insights into COVID-19 are worrisome. Some studies have shown that a large fraction of people go through the disease asymptomatically: from 50% to maybe even about 80%! This means that the virus can easily be re-introduced into the population once the restrictions are lifted.

One popular view of this problem is that the current measures are simply useless, as the virus will return while the restrictions cannot stay in place for long. But the dilemma of whether to lock down or not is a false one: there is simply no choice, as the country's health system will otherwise collapse. Only a few countries, like Singapore, Taiwan, Japan or South Korea, had the appropriate procedures in place, thanks to their previous experience with the SARS epidemic. Even they are now under threat from imported COVID-19 cases, which have stirred up local transmission again (Singapore introduced a lockdown because of this).

Thus, the problem that countries face now is what to do once they have managed to keep the number of hospitalized cases low enough to avoid a healthcare system meltdown. It is like bleeding heavily from a big wound: the very first thing you have to do is stop the bleeding, but then you need medical help or you will, most likely, die. The problem is that we are faced with the possibility of bleeding in the middle of a forest, and help (in the form of a COVID-19 vaccine) will not come anytime soon. Under this scenario, economic and political problems start to overtake medical concerns. And that is when things really start to get tough.

Figuring out what to do next

All this can be mitigated. We need to understand what drives people's behavioral patterns and response strategies to the restriction measures before we can figure out the health and economic impact of the measures being implemented. We need to figure out what drives panic and fear in order to prevent it during the worst weeks of the quarantine. In Italy, people have stopped singing from their balconies. In Wuhan, after two and a half months under quarantine, the city is "profoundly damaged", its spirits shattered. We need to start figuring out what people are thinking about during isolation, help them cope with the situation, and help both governments and businesses around the world figure out what to do next.

We will update our predictions of total death tolls on a weekly basis and add survey sentiment and social network data to try to understand how people are feeling right now and what can be done to alleviate their pain. We will build an index that measures the condition people are in, which will help us figure out what the recovery will look like: a quick and joyful one, or a long and depressing one.

]]>
Visualization of the global spread of COVID-19 https://oraclum.eu/visualization-of-the-global-spread-of-covid-19/ Mon, 24 Feb 2020 14:49:25 +0000 https://oraclum.eu/?p=866 Interactive 3D visualization of geotagged time dependent data is among the tools that Oraclum Intelligence Systems develops for its visual analytics tasks. In this demo we use the data on the spread of coronavirus disease COVID-19, which contains both the geo location and the time stamp. The interactive graph shows the increase in the number […]

The post Visualization of the global spread of COVID-19 appeared first on Oraclum blog.

]]>
Interactive 3D visualization of geotagged, time-dependent data is among the tools that Oraclum Intelligence Systems develops for its visual analytics tasks. In this demo we use data on the spread of the coronavirus disease COVID-19, which contains both a geo-location and a time stamp.

The interactive graph shows the increase in the number of confirmed COVID-19 cases around the globe, starting from January 22nd. Updated on a daily basis, it facilitates tracking of the spread. The infection started in China and is now affecting 33 countries and territories. Besides China, at the moment the most critical situations are in South Korea, Italy, Japan and Iran.

The data we use is aggregated and made public by Johns Hopkins University. The visualization uses the THREE.js and D3.js libraries to draw an interactive globe and racing bar charts. This app was created for educational, non-commercial purposes.

The visualization is available at: https://oraclum.eu/Coronavirus/

UPDATE March 12, 2020: An updated version of the visualization is now online. It includes several datasets, with either total (cumulative) or daily numbers of confirmed cases, recovered cases and deaths.

Interactive visualization of COVID-19 spread


]]>
There is only one Democratic primary candidate that outperforms Trump on Twitter https://oraclum.eu/there-is-only-one-democratic-primary-candidate-that-outperforms-trump-on-twitter/ Fri, 08 Nov 2019 15:49:30 +0000 https://oraclum.eu/?p=831 Analysis of Twitter hashtags of candidates for US elections done by Prospectus Research in collaboration with Oraclum Intelligence Systems uncovered a host of interesting results. Judging solely by online activity over the past two weeks, only a single candidate in the Democratic primary race outperformed Trump on Twitter – Tulsi Gabbard. Prospectus Research performed an […]

The post There is only one Democratic primary candidate that outperforms Trump on Twitter appeared first on Oraclum blog.

]]>
Analysis of the Twitter hashtags of candidates for the US elections, done by Prospectus Research in collaboration with Oraclum Intelligence Systems, uncovered a host of interesting results.

Judging solely by online activity over the past two weeks, only a single candidate in the Democratic primary race outperformed Trump on Twitter – Tulsi Gabbard.

Prospectus Research performed an in-depth text analysis of all tweets and retweets of official candidate hashtags from the 21st to the 30th of October. This included recognizing the behavioral pattern of each tweet: what people talk about when referring to each candidate, the key emotional cues, the moral concerns, and the personalities of each candidate's supporter base.

The analysis only included official campaign hashtags in order to gauge actual support for a candidate, thereby eliminating potential negative comments, while a separate method was used to eliminate the impact of bots and fake accounts. Official hashtags used were, for example, #tulsi2020, #kamalaharris2020, or #Trump2020, rather than just #Gabbard, #KamalaHarris, or #Trump. The official campaign tags are much more likely to be used by actual supporters or people who have something constructive to say about the campaign, rather than by smear campaigns or hate messages.

Gabbard's official campaign hashtag, #Tulsi2020, had more than 12m retweets over the 9 observed days; Trump's campaign hashtag, #Trump2020, had around 8m, while #JoeBiden (the official Biden campaign hashtag) came third, slightly short of 6m. No other primary candidate came close to these numbers. It is also interesting to note that, comparing their relative positions in the polls to their Twitter performance last week, only Gabbard and Yang were overperforming their relative positions. This means they were more active on social media than their polling numbers would suggest. Joe Biden was about even, whereas all the others, including Warren and Sanders, were severely underperforming.
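The first, counting step of this methodology can be illustrated with a toy sketch. This is hypothetical Python, not Prospectus' actual pipeline: the hashtags are the official ones named in the text, but the tweets are invented and the bot-filtering step is omitted.

```python
from collections import Counter

# Official campaign hashtags only (lowercased), per the methodology above.
OFFICIAL = {"#tulsi2020", "#trump2020", "#joebiden"}

tweets = [  # invented examples
    "Vote #Tulsi2020!",
    "RT: great rally #Trump2020",
    "#Tulsi2020 all the way",
    "#Gabbard is wrong",            # non-official tag, ignored
    "#JoeBiden for president",
]

def count_official_tags(tweets):
    """Count official campaign hashtags, case-insensitively."""
    counts = Counter()
    for text in tweets:
        for word in text.split():
            tag = word.strip(".,!?:").lower()
            if tag in OFFICIAL:
                counts[tag] += 1
    return counts

counts = count_official_tags(tweets)
print(counts["#tulsi2020"], counts["#trump2020"], counts["#joebiden"])
# 2 1 1
```

Restricting the lookup to the official-tag set is what drops #Gabbard-style mentions, which are as likely to carry attacks as support.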

Nevertheless, even with Tulsi outperforming Trump this week, compared to the online activity of Trump supporters the Democrats are collectively taking a heavy beating on Twitter.

Emotional cues, moral concerns, and personalities of voters

Observing total retweets was just the first step. We then dived deeper into the data and uncovered some of the key moral concerns addressed by voters when discussing their candidates. First we looked at the average markers for the five moral domains*. The dominant moral domain was ingroup loyalty.

The ingroup moral concern got a huge burst on the 27th, the day Gabbard's retweet storm started, triggered by the allegations from Clinton. Purity followed a day later, presumably in response to the ongoing discussion, though it also coincided with Yang's retweet bump.

Both ingroup Loyalty and Purity are what are called binding moral foundations, which create group cohesion and are usually more prominent in conservative voters. Loyalty especially translates into strong feelings of attachment and obligation to the group we identify with. The overarching group to identify with is usually the nation, though loyalty can also attach to the party or the candidate. People high on this moral foundation tend to approve of actions that bring cohesion, advantage, benefits, and well-being to the group, even when those actions are costly. In the context of partisanship this is arguably the most important moral foundation, and research has indicated that Loyalty not only predicts partisan strength but also predicts voting intentions.

Loyalty, along with Authority and Sanctity (or Purity), is more prevalent among conservative voters, so this finding can signal either that the conservative voting body was engaged in this tweet bump or that Gabbard's supporters hold a distinct moral profile within the Democratic party. More precisely, as a recent study has shown, people whose identities are fused with a group, who feel a deep, visceral oneness with it, be that their nation or their party, support the binding foundations (Loyalty, Authority, Purity) just as much as conservatives do. It is therefore possible that Gabbard, through her own actions and character but also through recent accusations that posed a threat to the ingroup, has gained a very powerful and dedicated base of followers who will be engaged to support and defend her online and motivated to vote in the upcoming elections.

Next came the social concerns that were key for the campaign. Tulsi Gabbard's supporters spoke a lot about friends and family, most of all the candidates, which is in line with the ingroup loyalty domain. At the opposite end, Kamala Harris' and Elizabeth Warren's campaigns and supporters were not at all interested in friends. Given that family and friends are often the key transmitters of trustworthy information via word of mouth, a successful campaign should start placing greater emphasis on these dimensions.

Moral concerns by hashtag show that Trump and Gabbard are the most concerned with ingroup morality; no candidate besides Tulsi Gabbard is significantly pro-ingroup. Identity is a well-established psychological predictor of voting behavior, and this interaction works for all parties and for independent voters. Given Trump's begrudging acceptance by Republicans, who are often offended by his disregard for the military (a key identity marker for American moderates and conservatives), and Gabbard's status in the armed forces, her ability to engage him in a general election looks overwhelmingly positive. She is the only one of the primary contenders who comes close to him in this category.

When it comes to the personalities behind the analyzed postings, markers for agreeableness and conscientiousness are both extremely low across all candidates' followers.

In terms of Neuroticism, Kamala Harris' and Pete Buttigieg's campaign followers are the most neurotic, while Biden's and Warren's are the least.

As for Openness, a typical characteristic of political liberals, it is interesting to note that Trump's campaign is low in openness (thus possibly attracting many conservatives), but no lower than Tulsi Gabbard's. However, Biden and Buttigieg are the lowest in openness, meaning that they attract the biggest pool of conservatives among the Democrats. Sanders' supporters are, by this measure, the most progressive and liberal: persons high in Openness, apart from holding more liberal and progressive attitudes, have broad interests and prefer novelty over convention.

Most of the supporters of the different campaigns are indistinguishable when you look at their personalities; Trump supporters look statistically nearly identical to Sanders' supporters. The only clear difference is that Tulsi Gabbard's supporters appear to be far more conscientious than any other group. Conscientiousness, again, is related to orderliness, thoroughness, and a preference for structure. It has also been related to conservative attitudes and traditional religiosity, for instance.

What does it all mean for the campaign?

These results are particularly worrying for the Democratic party candidates who want to take on Trump in 2020 – none, except for Gabbard, has the social media prowess to seriously compete with him online. A recent article in the New York Times confirms this intuition, suggesting that the Democrats are seriously lagging behind Trump on social media and online in general. While the Trump campaign is spending massively to rally supporters online, the Democratic candidates are struggling to adapt to the new political landscape.

Gabbard's Twitter overperformance shouldn't be surprising given that on the 27th of October she was called out, albeit indirectly, by Hillary Clinton, who accused her of being groomed by the Russians (later corrected: she was referring to the Republicans, not the Russians) to run as a third-party candidate and thus undermine the party's nominee to benefit Trump in 2020. Her row with Clinton garnered much attention, both in the media and especially on social media, as exemplified by what we see in our data.

One explanation for such a dramatic overperformance by Gabbard on Twitter is that this is indeed a covert operation by online trolls, or even foreign entities, which target the social media accounts of candidates polling at lower numbers in order to infiltrate and radicalize their supporter base and potentially disrupt the frontrunners. Even if this is not true in Gabbard's case, it still presents a potential threat to the party, which seems to have no effective digital strategy to combat Trump or his supporters.

In general it is very hard to explain where exactly this is all coming from, as also noted by FactBase:

“Tulsi Gabbard has a social media presence that doesn’t correlate with any data outside social media. She’s seen a huge increase, statistically, in followers and subscribers, but those increases don’t match up with any similar-size news event or in her polling visibility.”

The vast online activity from last week has yet to translate into polling numbers for Gabbard. She is still polling at around 2% nationally, with a very low probability of upsetting the race. However, if online activity is any indication of actual voting behavior, as the hypotheses thrown around after the 2016 election would have us believe, Gabbard could be the only candidate with wide enough online support to challenge Trump on social media.

This type of analysis could be done on a weekly basis tracking the performance of each candidate and figuring out which messages resonate and which do not among supporters. Clearly Gabbard’s messages in the wake of Clinton’s accusations touched a positive nerve among supporters online. More importantly they resonated among moderates and swing voters (and perhaps even leaning Republicans), the key demographics that need to be courted in order to win an election. 

——————–

The analysis was conducted in collaboration between Prospectus Research and Oraclum Intelligence Systems.
Prospectus Solutions AS is a Norway-based AI and simulation design company that has developed new platforms around its Multi-Agent AI technology. One of its key current projects is the VOSA system (Virtual Online Society Analytics), which creates digital twins of real-world social networks to allow market and message testing using its Multi-Agent AI architectures at scale. More at www.prospectussolutions.com. Earlier work by Prospectus has been used to predict religious extremism, Trump, the Catalonian referendum, and global social stability.

Oraclum is a data science and market research company that uses the power of social networks and machine learning to predict election outcomes, market movements, product demand, and consumer behaviour. In 2016 we successfully predicted both Brexit and Trump. Our work includes survey experiments, data science modelling, and complex network and social network surveys.

Authors: Justin Lane (PhD Cognitive Anthropology, University of Oxford), Igor Mikloušić (PhD Personality Psychology, University of Zagreb), Vuk Vuković (PhD Political Economy, University of Oxford)

______________________________________________________________________

* Moral foundations theory is a theoretical framework created by Jonathan Haidt, Ravi Iyer, Sena Koleva, Craig Joseph, and Jesse Graham. They drew from the rich history of morality research within both psychology and anthropology in order to identify the full scope of the human moral landscape and the reasons behind cross-cultural differences and similarities in morality. Their theory proposes the existence of five universal, innate, and evolved psychological modules that make us value certain traits as virtues and view certain behaviors as morally commendable or reprehensible.
– The care module, which expanded from our kin attachment system, makes us concerned with the well-being of others and responsive to signs of distress and harm. It is represented through the virtues of nurturance, kindness, empathy, and compassion.
– The fairness module, which evolved both as a means to avoid exploitation and as a way to enable reciprocally altruistic relationships, makes us sensitive to inequality, non-proportional compensation, and cheating. It is manifested through virtues such as justice and righteousness.
– The loyalty module, which evolved through forming and maintaining strong coalitions, binds us to a group. It is best represented through virtues such as patriotism and self-sacrifice for the group, a heightened feeling of group loyalty, and sensitivity to betrayal.
– The authority module, which evolved through a long history of hierarchical social structures, manifests itself through respect and desire for social structure and authority, valuing leadership or followership, hierarchy, and tradition, and through contempt for illegitimate authority or for disrespect of authority, societal rules, and roles.
– The purity module, which evolved on top of our disgust and pathogen-avoidance modules, promotes behaviors that suppress our biological desires and preserve our minds and bodies from harmful ideas or pathogens. It is represented through values such as chastity, purity, and temperance.
And although the foundations are thought to be universal, how much emphasis a person puts on a certain foundation, or which foundation is central to understanding a given moral issue, depends on the environment and culture as well as on the personality and temperament of the person. The theory gained prominence through findings which demonstrated that liberals (or, in their case, Democrats) and conservatives (or Republicans) differ in the weight they put on each of these foundations. Individual-oriented moral foundations (Care/Fairness) tend to be more represented among liberals, while group-oriented moral concerns are more salient among conservatives, although conservatives seem to value the individual-oriented moral concerns as well. For more information visit yourmorals.org


]]>
Bias in approval ratings https://oraclum.eu/bias-in-approval-ratings/ Mon, 09 Apr 2018 08:43:19 +0000 https://oraclum.eu/?p=823 This post is part of the Oraclum White Paper 09/2018, published on our website. Oraclum White Papers are analytical reports on Oraclum’s predictions and prediction methods. They are designed to be informative, provide an in-depth statistical analysis of a given issue, call for a proposal to action, or introduce a unique solution based on one of […]

The post Bias in approval ratings appeared first on Oraclum blog.

]]>
This post is part of the Oraclum White Paper 09/2018, published on our website. Oraclum White Papers are analytical reports on Oraclum’s predictions and prediction methods. They are designed to be informative, provide an in-depth statistical analysis of a given issue, call for a proposal to action, or introduce a unique solution based on one of Oraclum’s products.

Trump’s approval ratings have the same problem as his pre-election polls – they are biased!

Since the beginning of his presidency Donald Trump has been experiencing the lowest recorded presidential approval ratings in US history. According to FiveThirtyEight, the aggregate numbers for March, after a year and two months in office, stand at around 41-42%, lower than for any US president since WWII. Presidential approval ratings are usually correlated, to some extent, with the probability of re-election to a second term, but even more importantly, less popular presidents tend to drag their parties down in midterm elections (see Figure 1). Notice, however, that no matter how popular they were, only two presidents, Bush in his first term and Clinton in his second, helped their parties gain House seats in the midterm elections. All the others saw their parties lose seats, but the size of the loss was inversely proportional to the president's popularity immediately prior to the midterm election. In other words, a more popular president helped his party lose fewer House seats.

Figure 1: Presidential approval ratings and their parties’ House midterm results

Although at first glance this might sound concerning or reassuring (depending on whether you're a Republican or a Democrat), bear in mind that Trump has a strong record of defying both polls and historical political trends. He remains a very divisive president, just as much as he was a divisive presidential candidate; yet he still managed to carry the national victory, while exercising a very strong coattail effect, with only 46.1% of the final vote share. His approval ratings therefore need to be taken with a pinch of salt, and should certainly not be examined at face value. The reason is similar to why his polling numbers were wrong in 2016: an increasing number of non-respondents.

Non-response bias in polls

Pollsters in many countries have been subject to a lot of bad press over the past few years. One of the main reasons was their failure to accurately grasp voter preferences at election time. The most prominent examples were the big misses in three consecutive UK votes, the 2015 and 2017 general elections and the 2016 Brexit referendum, and of course the 2016 Trump victory in the US.

One reason for this is the rapidly declining response rate for traditional telephone polls. The response rate is the number of people who agree to give information in a survey divided by the total number of people called. According to Pew Research Center, a prominent pollster, and the Harvard Business Review, response rates have declined from 36% in 1997 to as little as 9% in 2016. This means that in 1997, in order to get, say, 900 people in a survey you had to call about 2,500 people; in 2016, to get the same sample size you needed to call 10,000 people. Random selection is crucial here (because the sample mean in random samples is very close to the population mean), and pollsters spend a lot of money and effort to achieve randomness even among the 9% who do respond. But whether this can be truly random is an entirely different question. Such low response rates almost certainly make the polls subject to non-response bias. This type of bias significantly reduces the accuracy of any telephone poll, making it more likely to favor one particular candidate, because the polls capture the opinion of only particular groups, not the entire population. Online polls, on the other hand, suffer from self-selection problems and are by definition non-random and hence biased towards particular voter groups (younger, urban, and usually better-educated populations).
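The response-rate arithmetic above works out as follows (a quick sketch using the 36% and 9% figures from the text):

```python
import math

def calls_needed(sample_size, response_rate):
    """Telephone calls required to reach a target sample size."""
    return math.ceil(sample_size / response_rate)

print(calls_needed(900, 0.36))  # 1997-era response rate -> 2500 calls
print(calls_needed(900, 0.09))  # 2016-era response rate -> 10000 calls
```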

Following the above example, assume that after calling about 10,000 people and getting only 900 (correctly stratified and supposedly randomized) respondents, the results were the following: 450 for Clinton, 400 for Trump, and 50 undecided (assuming, for simplicity, no other candidates). The poll would put Clinton at 50%, Trump at 44.4%, and the undecided at 5.6%, and it would conclude that, because the sampling was random, the average of responses for each candidate in the sample is likely to be very close to the average in the population.

But it's not. The low response rate suggests that some of those who do intend to vote simply did not want to express their preferences. Among the 9,100 non-respondents the majority are surely people who dislike politics and hence will not even bother to vote (turnout in the US is usually between 50 and 60%, meaning that almost half of eligible voters simply don't care about politics). However, among the rest there are certainly people who will in fact vote, some of whom probably support Trump but are unwilling to say so to the interviewer directly. Why people do this is still unknown. There are two plausible explanations for why a potential Trump supporter would refuse to answer a poll: 1) they are embarrassed or afraid to tell a live phone interviewer that they support Trump, or 2) they distrust the pollsters and view them in the same context as the "fake news" media. There could be a number of other reasons, but one thing is sure: voters have started to avoid expressing their opinions in surveys. And this poses a serious problem to the industry, and hence to anyone who depends on information from survey research.
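A toy expected-value calculation shows how differential non-response alone can skew a poll; the electorate split and response rates below are invented for illustration, not measured values.

```python
def observed_shares(true_shares, response_rates):
    """Expected poll shares when groups respond at different rates."""
    weights = {g: true_shares[g] * response_rates[g] for g in true_shares}
    total = sum(weights.values())
    return {g: round(100 * w / total, 1) for g, w in weights.items()}

# Invented numbers: a dead-heat electorate in which Trump supporters are
# half as likely to answer a live phone interviewer.
result = observed_shares(
    {"Clinton": 0.50, "Trump": 0.50},
    {"Clinton": 0.12, "Trump": 0.06},
)
print(result)  # {'Clinton': 66.7, 'Trump': 33.3}
```

Even a perfectly random dialing procedure would report a two-to-one Clinton lead here, purely because one group answers the phone less often.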

Before offering potential remedies, how can we be so sure that the non-respondents in polls are Shy Trump voters? Why shouldn’t they potentially be Shy Hillary voters?

Shy Trump voters

There are several reasons suggesting that non-respondents in polls are in fact Trump rather than Hillary voters.

The first is the recent finding that Trump's approval ratings tend to be higher in Interactive Voice Response (IVR) or online polls than in telephone polls, i.e. when people are talking to machines or filling out an online poll rather than speaking to live human interviewers. The difference is as large as 10 percentage points (48.7% Trump support in IVR surveys versus 38.2% in Internet surveys), which is a huge difference, much larger than the usual margin of error.

Furthermore, some pollsters survey the entire adult US population, while others focus only on those who are likely to vote (more on this below). For election polls this can make a big difference. Some people might dislike a candidate so much that they will give him a negative rating even though they have no intention of voting at all (perhaps they are fed up with politics). On the other hand, anyone rating a political candidate highly will surely vote for them. This implies that responses from the entire adult population will be less accurate than responses from likely voters. It won't make much of a difference in general market research surveys for products or services, but it will in political polling.

Finally and most importantly, our own polling during the 2016 election uncovered a systematic anti-Trump bias within the 30 states for which we ran our BASON survey.

Figure 2 compares the success of our method (x-axis) with the success of the polling average (y-axis) for the difference between the predicted and actual vote share for Donald Trump. For the polling average, any dots above the horizontal line overestimate Trump, while any dots below it underestimate him. For our model, overestimation lies to the right of the vertical line and underestimation to the left of it.

It is clear that our model under- and overestimates Trump to a relatively equal extent across all states, being most precise in the most important swing states (PA, FL, NC, VA, CO, etc.). The polls, on the other hand, consistently underestimate Trump in almost every state. The only outlier, where they overestimated Trump by almost 6%, was DC. This implies that the polls systematically and significantly underestimated Donald Trump.

Figure 2: Oraclum’s BASON Survey vs. polling average for Trump

Looking at the same numbers for Hillary Clinton, we can see that the polls were relatively good at estimating her chances. For most states they fall within a 2% margin of error, and for about 10 states the polling average was spot on. Our method once again over- and underestimated Clinton to an equal extent, being most precise where it mattered most.

Figure 3: Oraclum’s BASON Survey vs. polling average for Clinton

Taking all this into account, the key to understanding the pollsters' underestimation of Trump lies with the undecided voters. The hypothesis of a 'shy Trump' voter could therefore be true: many Trump voters simply did not want to identify themselves as such in the polls. Or they really were undecided until the very last minute, making the final decision in the polling booth itself.

Finally, let's examine this systematic bias a bit further by comparing the calibration of the BASON Survey to that of the polling average (calibration being the relationship between predictions and actual results). The following graph shows the difference between predictions (y-axis) and actual results (x-axis) for our method (blue dots) and the polling average (orange dots). A good prediction should have a slope close to 1, which is exactly what our method achieved (a slope of 1.1). The polling averages, on the other hand, had a flatter slope of 0.77, confirming a systematic underestimation of Trump even in states which Clinton easily won.
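The calibration slope described above is simply an ordinary-least-squares slope of predictions against results. A sketch with invented state-level shares (not the actual 2016 data) shows how systematic compression of a candidate's numbers produces a slope below 1:

```python
def calibration_slope(actual, predicted):
    """OLS slope of predicted vote shares regressed on actual results."""
    n = len(actual)
    mean_x = sum(actual) / n
    mean_y = sum(predicted) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(actual, predicted))
    var = sum((x - mean_x) ** 2 for x in actual)
    return cov / var

# Invented state-level shares: a poll that compresses results toward 49%
# (shaving the most off the candidate's strongest states) flattens the slope.
actual    = [40, 45, 50, 55, 60]
flattened = [42, 45.5, 49, 52.5, 56]
print(round(calibration_slope(actual, flattened), 2))  # 0.7
```

A well-calibrated predictor would track the 45-degree line and return a slope near 1, as the BASON Survey's 1.1 did.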

Figure 4: Calibration of the BASON Survey vs. polling average

So how is Trump doing right now?

If we look at things on a state-by-state level, Gallup has data for the entire past year. It shows that Trump is unpopular countrywide and that he is still underperforming his electoral result in almost all states. In the red states his net approval is still positive, but in the swing states, including all those he won in 2016, his current approval is worse than his electoral result (see Table 1, column "Diff from 2016").

However, there are a few important caveats here. First, the data reports averages for the entire past year, from January 20th to December 30th, 2017, so it does not account for the recent upward change in trend. Second, for the whole country on average, using a year-long sample of over 170,000 people, the approval rating was 38% and disapproval 56% (accounting for different state sizes). This is a bit lower than the current polling average for March, which puts him between 41 and 42%. If we apply this difference evenly across all states, it suggests that Trump is doing slightly better in the swing states that he won, though he is still underperforming in almost all of them, by at least 3 to 4 percentage points (instead of 6 to 8 p.p.).
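The even adjustment described above amounts to a uniform swing: shift every state's approval by the change in the national average. The state figures below are invented for illustration; only the 38% and ~41.5% national numbers come from the text.

```python
def uniform_swing(state_approval, old_national, new_national):
    """Shift every state's approval by the change in the national average."""
    delta = new_national - old_national
    return {state: round(a + delta, 1) for state, a in state_approval.items()}

# Invented 2017 state averages, shifted from the 38% year-long national
# average to the ~41.5% March polling average (a +3.5 pp uniform swing).
adjusted = uniform_swing({"FL": 44.0, "PA": 45.0, "WI": 42.0}, 38.0, 41.5)
print(adjusted)  # {'FL': 47.5, 'PA': 48.5, 'WI': 45.5}
```

This is of course a crude correction: it assumes the national trend moved every state equally, which the state-level polling may not bear out.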

The third caveat concerns Gallup's methodology. Gallup is one of those pollsters that use telephone interviews, calling a representative sample of all over-18 Americans to see whether they approve or disapprove of the President. The first issue here, as emphasized previously, is that not all of these people eventually vote. Looking at Rasmussen polls, which take into account only likely voters, Trump's approval ratings are much higher: around 46-47% in March. The second issue is that the Gallup polls are done using live telephone interviews, which makes them more subject to anti-Trump bias. Rasmussen uses an automated polling methodology (IVR) in which respondents give their opinions to a machine, making them more likely to be truthful.

Table 1: Trump 2017 approval ratings, 2016 pre-election polls, and 2016 results

Finally, the fair comparison in this case is not Trump's election result versus his approval ratings, but rather his 2016 pre-election polls versus his 2017 approval ratings (the final column in Table 1). By this measure his performance in 2017 was not too far off from his pre-election polling. In fact, in a few key swing states, like Ohio or Pennsylvania, or the surprises he pulled off in Michigan and Wisconsin, he is very much in the same position he was in before the 2016 election. He is underperforming in Florida, North Carolina, Georgia, and Iowa (among the states that he won); however, taking into account that his nationwide trend has improved and now stands at 41-42% instead of the 38% Gallup reported for 2017, Trump is very likely not doing any worse than he was in 2016.

Bearing in mind that the current polls are still underestimating Trump, the face values of his approval ratings are not too informative about the actual state of his popular support. His approval rating is certainly low, but so was his general election vote share, yet he still managed to scrape a victory in almost every key swing state.

What does this suggest about the midterm House and Senate races? The recent election results do offer a glimpse of hope to the Democrats, as they imply that Trump's coattail effect has waned for his fellow Republicans down the ballot. Given that he himself is not running and that opinion of politicians is at an all-time low in the US, there shouldn't be any coattail effect this time, and the House races will probably repeat the historical trend in which a president's low approval ratings foreshadow a House net loss for his party. However, when designing pre-election prediction models for specific House and Senate races, the Democrats would be wise not to lean too heavily on Trump's current approval ratings. The suggestion would be either to avoid placing a high emphasis on the approval ratings, as they tend to overestimate the chances of the Democratic party's candidates, or to use an alternative method that can estimate the Trump approval rating far more correctly and precisely.

Oraclum’s BASON Survey – the only poll to successfully solve the sampling bias problem

Oraclum's BASON Survey is just that type of method. It has proven to yield much more accurate estimates of election outcomes by taking into account that people distrust polls and have a tendency to be less truthful. The BASON Survey asks people who they think will win, and how they feel about who other people think will win.

It is based on a wisdom-of-crowds (WoC) approach to polling, accompanied by a network analysis of survey participants and their friendship networks in order to eliminate groupthink bias (see the Box for further explanation). By doing so it is able to generate much more accurate results than regular polls, which struggle to find the right sampling methodology at a time when response rates are at historical lows.

By asking people to express their opinions on what others in their neighborhoods or states would do, we avoid the problem of respondents not truthfully reporting their own preferences. After all, the information we seek is a prediction of who will win, not of who you will vote for; it is about thinking through what other people would do. The BASON Survey leaves people well within their comfort zones and gives them a chance to think and, without pressure, express their opinion on a subject.

The way we ask the questions also gives people a further incentive to think about them and self-correct their own answers. Such delayed judgment has been shown by behavioral scientists to improve the accuracy of people's forecasts. By asking our questions this way we deliberately sacrifice large samples for accuracy. It is important to stress that the BASON Survey does not use any private information about its respondents and has no way of knowing who they are. We base our predictions purely on what people tell us.
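As an illustration of the wisdom-of-crowds idea only (this is emphatically not Oraclum's actual BASON model, which also weighs respondents' friendship networks to counter groupthink), a confidence-weighted aggregation of "who will win" answers might look like this:

```python
from collections import defaultdict

def crowd_forecast(responses):
    """Confidence-weighted share of 'who will win' answers per candidate."""
    weight = defaultdict(float)
    for predicted_winner, confidence in responses:
        weight[predicted_winner] += confidence
    total = sum(weight.values())
    return {c: round(100 * w / total, 1) for c, w in weight.items()}

# Invented respondents: (predicted winner, confidence on a 0-100 scale).
responses = [("Trump", 80), ("Trump", 60), ("Clinton", 90), ("Clinton", 70)]
print(crowd_forecast(responses))  # {'Trump': 46.7, 'Clinton': 53.3}
```

The point of the approach is visible even in this toy: the output is a forecast of the outcome, aggregated from expectations, rather than a tally of the respondents' own voting intentions.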

The BASON Survey has been tested on a number of elections and market research problems, and has yielded incredible accuracy every time. It is the single best prediction tool available on the market, guaranteed to correctly identify what voters (or customers) want and why, without invading anyone’s privacy.

Read more about the BASON Survey here, or in the White Paper.


]]>