The Influence of Media on Tesla’s Stock Price

Introduction
Exploratory Data Analysis
- Stock Data
- Twitter Data
Regression
- Daily Number Regression
- Multivariate Regression
Conclusion
Future Analysis
Shiny App

Introduction

Motivation: Anyone who has followed the financial news over the past year has seen plenty of discussion about what happened at Tesla, what its CEO Elon Musk said and did, and how the stock price moved in response. So we wonder: does what happens in the media (Elon Musk’s Twitter account, the attitude of the press and of investors) have a significant influence on Tesla’s stock price? This question is the motivation for our project.

Data: Here is a brief summary of the datasets we used. We downloaded Tesla’s daily historical prices for the past two years. We also obtained minutely prices from Bloomberg in case we needed to look at the changes in more detail. We scraped the news and tweet data from Twitter. The two datasets “CNBC_TSLA” and “Elon_Musk” contain the tweets about Tesla from CNBC’s Twitter account and the past tweets of Elon Musk himself. All data are stored in the “data” folder.

Methodologically, we extract data from popular news sources and social media, such as Twitter, Facebook, CNN, and WSJ, from 11/10/2016 to 11/10/2018. After conducting sentiment analysis (through R libraries), we use the sentiment, the number of related news items, and the speed of transmission for the quantitative analysis. We analyze Tesla’s daily stock prices and volumes from 11/10/2016 to 11/10/2018 and minutely stock prices and volumes from 04/11/2018 to 11/10/2018.

Hypothesis: We hypothesize that news from major media outlets such as Twitter and the Wall Street Journal is strongly correlated with the stock price of Tesla, Inc. One might expect stock prices to be particularly susceptible to breaking news on social media, since such news carries new market information. For instance, when Elon Musk (Tesla’s CEO) himself releases unplanned news, the market seems to experience extreme momentary fluctuations.

The purpose of the project is to explore the correlation between social media and news coverage on the one hand and the stock price on the other. If our research finds a correlation between stock movements and news, we might be able to build trading strategies on it. Otherwise, the absence of correlation should teach us not to trade stocks purely based on breaking news.

Other: We mainly use R (in RStudio). We might also use Python where specific libraries are needed.

Exploratory Data Analysis

Stock Data

We get the historical data for Tesla stock from Yahoo! Finance. We take the open and close prices and the volume of Tesla stock from 2016-11-10 to 2018-11-10.

Daily Price Data

We import the daily prices and plot the log of the close price over time to see how Tesla’s stock price has changed. We also compute basic summary statistics (mean, range, variance, etc.) of the stock price.

library(tidyverse)

daily_price<-read.csv('data/Stock_Data/TSLA.csv')%>%
  transmute(Date=as.character(Date),Open,Close,Volume)
problems(daily_price)

# tibble [0 x 4]
# ... with 4 variables: row <int>, col <int>, expected <chr>, actual <chr>

daily_price%>%ggplot()+
  geom_point(aes(Date,log(Close)), alpha=.5)

# Basic summary statistics of the daily close price (tibble() replaces the deprecated as.tibble(list(...)))
tibble(mean=mean(daily_price$Close),min=min(daily_price$Close),
       max=max(daily_price$Close),variance=var(daily_price$Close), standard_deviation=sd(daily_price$Close))

# A tibble: 1 x 5
   mean   min   max variance standard_deviation
  <dbl> <dbl> <dbl>    <dbl>              <dbl>
1  306.  181.   385    2084.               45.7

As we can see from the graph, Tesla’s stock price experienced relatively steady growth at first but became more volatile over the past year or so. This is the period (07/10/2017 to 11/10/2018) in which we are mainly interested. The mean price is about $306. This relatively high price raises the barrier to investment and should theoretically decrease volatility. The highest price is $385, while the lowest is only $181. The standard deviation is about $45.7.

We also made a Shiny app so that you can explore the changes in the stock price in more detail. [Shiny App: https://zacklight.com/shiny/news_stocks_sentiment_analysis/ ]

Minutely data

Considering that we may need to look into the change in stock price in more detail to see how it responded to news on social media, we also obtained the minutely price and volume data for the past 7 months from Bloomberg and imported it.

library(readxl)
minutely_price<- read_excel("data/Stock_Data/bloomberg_tsla_minutely_price_04252018_11072018.xlsx", 
                            sheet = "Sheet1")%>%transmute(Date=Dates,Open,Close,Volume)
problems(minutely_price)

# tibble [0 x 4]
# ... with 4 variables: row <int>, col <int>, expected <chr>, actual <chr>

minutely_price%>%ggplot()+geom_line(aes(Date,log(Close)))

ggplot(minutely_price)+geom_line(aes(x=Date,y=Volume))

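# Basic summary statistics of the minutely close price (tibble() replaces the deprecated as.tibble(list(...)))
tibble(mean=mean(minutely_price$Close),min=min(minutely_price$Close),
       max=max(minutely_price$Close),variance=var(minutely_price$Close), standard_deviation=sd(minutely_price$Close))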

# A tibble: 1 x 5
   mean   min   max variance standard_deviation
  <dbl> <dbl> <dbl>    <dbl>              <dbl>
1  308.  248.  386.     806.               28.4

The minutely data (04/11/2018 to 11/10/2018) have a similar mean ($308 versus $306) but a lower standard deviation (28.4 versus 45.7).

Comparing to S&P500

To see how changes in Tesla’s stock price relate to changes in the overall stock market, we compared it with the S&P 500 index.

SP500<-read.csv('data/GSPC.csv')%>%
  transmute(Date=as.character(Date),Open_SP500=Open,Close_SP500=Close,Volume_SP500=Volume)

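# The S&P 500 series is shifted down by 2 log units to share the panel; the secondary axis undoes the shift
daily_price%>%left_join(SP500,by="Date")%>%
  ggplot(aes(Date))+geom_point(aes(y=log(Close)))+
  geom_point(aes(y=log(Close_SP500)-2),color="red")+  # a fixed colour belongs outside aes()
  scale_y_continuous(sec.axis = sec_axis(~.+2,name="log(Close_SP500)"))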

As shown in the graph, the correlation between the S&P 500 and Tesla is not very strong. While the S&P 500 generally grows over time, Tesla’s stock price fluctuates much more. So other factors are likely driving the changes, and we believe that news and posts on social media are a possible explanation.
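
Judging correlation from overlaid price levels can be misleading, because two trending series look related even when their day-to-day moves are not. A common alternative is to correlate daily log returns. The sketch below uses synthetic prices only (the series, seed, and parameters are our assumptions, not the TSLA or S&P 500 data) and relies on base R alone.

```r
# Synthetic prices, for illustration only (not the TSLA or S&P 500 data).
set.seed(1)
n <- 250
market_ret <- rnorm(n, mean = 0.0004, sd = 0.008)     # hypothetical index log returns
stock_ret  <- 0.5 * market_ret + rnorm(n, sd = 0.03)  # hypothetical stock log returns
index_close <- 2500 * exp(cumsum(market_ret))
stock_close <- 300 * exp(cumsum(stock_ret))

# Daily log return: difference of consecutive log close prices.
log_return <- function(close) diff(log(close))

r_stock <- log_return(stock_close)
r_index <- log_return(index_close)

# Compare the correlation of returns with the correlation of price levels.
return_cor <- cor(r_stock, r_index)
level_cor  <- cor(log(stock_close), log(index_close))
print(c(returns = return_cor, levels = level_cor))
```

With the real data, the same two lines of `cor()` could be applied to the joined `Close` and `Close_SP500` columns after removing missing rows.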

Twitter Data

Next we scraped Elon Musk’s own Twitter account and the Tesla-related reports from the Twitter account of a news source (CNBC). The source code is in scrape_twitter_data.py.

elon_musk contains all the tweets from Elon Musk’s Twitter account. (Note that there is a high chance he deletes undesired tweets.)
CNBC_TSLA_News contains all tweets related to Tesla from the CNBC Twitter account.
tesla_elon contains all the public tweets with hashtags related to Tesla.

After performing the analysis on all three datasets, we found that, while all three have similar characteristics, the CNBC dataset correlates best with the stock movements. Thus, we use it for most of the analysis below.

library(tidyr)
library(stringr)
library(dplyr)
library(gtable)

First, we cleaned the tweets of CNBC’s twitter account and did some analysis on it.

cnbc_tsla <- read_csv("data/Twitter_Data/CNBC_TSLA_News.csv") 
cnbc_tsla <- cnbc_tsla %>% filter(!duplicated(text) == TRUE)
problems(cnbc_tsla)

# tibble [0 x 4]
# ... with 4 variables: row <int>, col <int>, expected <chr>, actual <chr>

(cnbc_by_date <- cnbc_tsla %>%  separate(time, into = c("year", "month", "day"), sep = "-" ) )

# A tibble: 2,244 x 7
   year  month day   text                           replies retweets likes
   <chr> <chr> <chr> <chr>                            <int>    <int> <int>
 1 2016  11    10    Tesla shares downshift into u~       4       10     8
 2 2016  11    10    Cramer explains why investors~       3        8    12
 3 2016  11    12    Elon Musk: Robots will take y~      30       55    56
 4 2016  11    16    Tesla's ludicrously fast car ~       1        9    26
 5 2016  11    16    How billionaire tech mogul El~       1       16    17
 6 2016  11    17    BREAKING: Tesla's acquisition~       1       58    45
 7 2016  11    17    Tesla and SolarCity sharehold~       0       10     7
 8 2016  11    19    How billionaire tech mogul El~       3       12    19
 9 2016  11    21    JUST IN: Tesla's acquisition ~       4       33    29
10 2016  11    21    Musk got what he wanted in Te~       0       10     6
# ... with 2,234 more rows

(cnbc_daily_res_table <- cnbc_by_date %>% group_by(year, month, day) %>% summarise(
  c_dailyLikes = sum(likes),
  c_dailyRep = sum(replies),
  c_dailyRet = sum(retweets)
  )  )

# A tibble: 541 x 6
# Groups:   year, month [?]
   year  month day   c_dailyLikes c_dailyRep c_dailyRet
   <chr> <chr> <chr>        <int>      <int>      <int>
 1 2016  11    10              20          7         18
 2 2016  11    12              56         30         55
 3 2016  11    16              43          2         25
 4 2016  11    17              52          1         68
 5 2016  11    19              19          3         12
 6 2016  11    21              35          4         43
 7 2016  11    22              82          8         62
 8 2016  11    28              13          1          6
 9 2016  11    29              15          2         15
10 2016  12    01              17          5         17
# ... with 531 more rows

(cnbc_ret_likes <- cnbc_daily_res_table %>%  ggplot(aes( log(c_dailyRet), log(c_dailyLikes) ) ) + geom_point(aes(color = year), position = "jitter") + geom_smooth() )

(cnbc_rep_likes <-  cnbc_daily_res_table %>%  ggplot(aes( log(c_dailyRep), log(c_dailyLikes) ) ) + geom_point(aes(color = year), position = "jitter") + geom_smooth()  )

cnbc_tsla<-cnbc_tsla%>%filter(!(duplicated(text)==TRUE))
daily_number_cnbc<-cnbc_tsla%>%mutate(Date=as.character.Date(time))%>%group_by(Date)%>%count()
ggplot(daily_number_cnbc)+geom_histogram(aes(n))

elon_musk <- read_csv("data/Twitter_Data/Elon_Musk_(@elonmusk)_Twitter.csv") %>%  distinct()
problems(elon_musk)

# tibble [0 x 4]
# ... with 4 variables: row <int>, col <int>, expected <chr>, actual <chr>

by_date <- elon_musk %>%  separate(time, into = c("year", "month", "day"), sep = "-" ) 
daily_res <- by_date %>% group_by(year, month, day) %>% summarise(
  dailyLikes = sum(likes),
  dailyReplies = sum(replies),
  dailyRetweets = sum(retweets)
  )

tesla_elon <- read_csv("data/Twitter_Data/(_) @elonmusk @Tesla - Twitter Search_with_scores.csv")
problems(tesla_elon)

# tibble [0 x 4]
# ... with 4 variables: row <int>, col <int>, expected <chr>, actual <chr>

by_date <- tesla_elon %>%  separate(time, into = c("year", "month", "day"), sep = "-" ) 
daily_res <- by_date %>% group_by(year, month, day) %>% summarise(
  dailyLikes = sum(likes),
  dailyReplies = sum(replies),
  dailyRetweets = sum(retweets)
  )

As we expected, the log values of the numbers of likes, retweets, and replies are correlated. The number of tweets per day is right-skewed, so we apply log transformations frequently in the later analysis.
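
One practical detail of log-transforming counts is that a day with zero replies or retweets gives `log(0) = -Inf`, which plotting and regression functions then drop or reject. A hedged sketch of one common workaround, `log1p()`, on made-up counts (the numbers are hypothetical, not taken from our scraped data):

```r
# Hypothetical daily engagement counts (not taken from the scraped data).
daily_replies <- c(0, 1, 2, 3, 5, 8, 40, 250)

log_counts   <- log(daily_replies)    # log(0) is -Inf
log1p_counts <- log1p(daily_replies)  # log(1 + x), finite at zero

print(log_counts)
print(log1p_counts)

# Rough skewness check: the mean sits far above the median on the raw
# scale and much closer to it after the transformation.
raw_gap   <- mean(daily_replies) - median(daily_replies)
log1p_gap <- mean(log1p_counts) - median(log1p_counts)
print(c(raw = raw_gap, log1p = log1p_gap))
```

Whether `log1p()` or simply excluding zero-count days is preferable depends on the analysis; the plots in this section use plain `log()`.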

We also found some outliers in the daily number of CNBC tweets about Tesla. So we filtered these outliers to see what happened on those dates. They turn out to be dates with major news about Tesla, for example, an earnings call or Elon Musk’s statement that he would take Tesla private. This tells us that tweets do carry critical business information.

daily_number_cnbc%>%arrange(desc(n))%>%filter(n>20)

# A tibble: 4 x 2
# Groups:   Date [4]
  Date           n
  <chr>      <int>
1 2018-05-03    31
2 2018-08-08    29
3 2018-09-28    27
4 2018-08-07    22

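# Caution: `==` recycles the four dates position by position, so the output below is only a subset of the
# matching rows; using `%in%` instead of `==` would return every tweet posted on these dates.
cnbc_tsla%>%filter(time==c("2018-05-03","2018-08-07","2018-08-08","2018-09-28"))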

# A tibble: 27 x 5
   time       text                                  replies retweets likes
   <date>     <chr>                                   <int>    <int> <int>
 1 2018-05-03 Sacconaghi on Musk earnings call: 'T~       6       26    55
 2 2018-05-03 Tesla bull sounds off after conferen~       1        2     5
 3 2018-05-03 Tesla opens 7 percent down the day a~       5       10     8
 4 2018-05-03 Musk's bizarre earnings call was 'th~       6       18    20
 5 2018-05-03 Elon Musk is acting like he 'plans t~       1        8    17
 6 2018-05-03 Short-sellers have been looking for ~       1        9    15
 7 2018-05-03 Tesla saw its worst day in more than~       3        3     2
 8 2018-05-03 .@JimCramer thanks Tesla CEO Elon Mu~       6       11    21
 9 2018-08-07 A Tesla leveraged buyout would be 'b~       9       24    34
10 2018-08-07 Securities lawyers shocked by Elon M~       2       13    19
# ... with 17 more rows
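
A note on the filter above: comparing a column with a vector through `==` recycles the shorter vector element by element, whereas `%in%` tests membership in the whole set. A minimal base-R sketch with made-up dates (the `tweet_dates` vector is hypothetical) shows the difference:

```r
# Hypothetical tweet dates, for illustration only.
tweet_dates <- as.Date(c("2018-05-03", "2018-05-03", "2018-08-07",
                         "2018-08-08", "2018-09-28", "2018-09-28",
                         "2018-10-01", "2018-05-03"))
event_dates <- as.Date(c("2018-05-03", "2018-08-07", "2018-08-08", "2018-09-28"))

# `==` recycles event_dates (length 4) along tweet_dates (length 8):
# a row matches only if it equals the date at the same recycled position.
recycled_match <- tweet_dates == event_dates

# `%in%` checks each tweet date against the whole set of event dates.
membership_match <- tweet_dates %in% event_dates

print(sum(recycled_match))    # 1 row matched
print(sum(membership_match))  # 7 rows matched
```

This is why the table above shows 27 rows even though the four dates together account for more than 100 tweets in the daily counts.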

So we filtered for the days that have more than 20 tweets. The four dates are May 3, August 7 and 8, and September 28.

The May 3 tweets cover the earnings call on which Elon Musk cut off Wall Street analysts and called their questions boring. In August he said on his personal Twitter account that he would take Tesla private. Then in September he was sued by the SEC for fraud. All of these events were followed by a significant drop in Tesla’s stock price. You can find the related tweets in our Shiny app.

Regression

Daily Number Regression

So a natural question is: can the change in Tesla’s stock price be explained by the number of tweets on that day? Does more discussion on social media lead to larger changes in the stock price? We ran the same set of analyses on all three tweet datasets, and they all yielded similar results, so we present only the @elonmusk and CNBC results here.

daily_number<-elon_musk%>%mutate(Date=as.character.Date(time))%>%group_by(Date)%>%count()
daily_number_price<-daily_number%>%left_join(daily_price,by=c("Date"))

mod_num1<-lm(log(Close)~n,data=daily_number_price)
coef1<-coef(mod_num1)
summary(mod_num1)


Call:
lm(formula = log(Close) ~ n, data = daily_number_price)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.22290 -0.06323  0.00715  0.07811  0.19183 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.7514908  0.0101514 566.571   <2e-16 ***
n           -0.0006115  0.0026716  -0.229    0.819    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09093 on 199 degrees of freedom
  (79 observations deleted due to missingness)
Multiple R-squared:  0.0002632, Adjusted R-squared:  -0.004761 
F-statistic: 0.05238 on 1 and 199 DF,  p-value: 0.8192

# mod_num1 is fitted on n (not log(n)), so the fitted line is drawn against n
ggplot(daily_number_price,aes(x=n,y=log(Close)))+geom_point()+
  geom_abline(intercept = coef1[1],slope = coef1[2],color="red")

daily_number_price_cnbc<-daily_number_cnbc%>%left_join(daily_price,by=c("Date"))

mod_num2<-lm(log(Close)~n,data=daily_number_price_cnbc)
coef2<-coef(mod_num2)
summary(mod_num2)


Call:
lm(formula = log(Close) ~ n, data = daily_number_price_cnbc)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.51254 -0.06600  0.02346  0.10664  0.23487 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.713405   0.010263 556.712   <2e-16 ***
n           0.002485   0.001664   1.493    0.136    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1448 on 436 degrees of freedom
  (103 observations deleted due to missingness)
Multiple R-squared:  0.00509,   Adjusted R-squared:  0.002808 
F-statistic:  2.23 on 1 and 436 DF,  p-value: 0.136

# mod_num2 is fitted on n (not log(n)), so the fitted line is drawn against n
ggplot(daily_number_price_cnbc,aes(x=n,y=log(Close)))+geom_point()+
  geom_abline(intercept = coef2[1],slope = coef2[2],color="red")

price_lag<-daily_price%>%mutate(change=(log(Close)-lag(log(Close),1)))
lag<-daily_number_price_cnbc%>%left_join(price_lag,by=c("Date"))
mod_num3<-lm(abs(change)~log(n),data=lag)
coef3<-coef(mod_num3)

# The response of mod_num3 is abs(change), so plot abs(change) rather than its log
ggplot(lag,aes(x=log(n),y=abs(change)))+geom_point()+
  geom_abline(intercept = coef3[1],slope = coef3[2],color="red")

summary(mod_num3)


Call:
lm(formula = abs(change) ~ log(n), data = lag)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.032147 -0.013294 -0.004232  0.007267  0.131593 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.011123   0.001789   6.216 1.19e-09 ***
log(n)      0.008865   0.001246   7.116 4.61e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.02062 on 435 degrees of freedom
  (104 observations deleted due to missingness)
Multiple R-squared:  0.1043,    Adjusted R-squared:  0.1022 
F-statistic: 50.64 on 1 and 435 DF,  p-value: 4.607e-12
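
To read the `log(n)` coefficient in the `summary(mod_num3)` output above: because the predictor is logged, multiplying the number of tweets by a constant adds a fixed amount to the predicted absolute log change. A short worked calculation, with the two coefficients copied from the output (the helper `predict_abs_change` is our own illustrative name):

```r
# Coefficients copied from the summary(mod_num3) output above.
intercept <- 0.011123
slope     <- 0.008865

# Predicted absolute daily log change for a given number of tweets.
predict_abs_change <- function(n_tweets) intercept + slope * log(n_tweets)

# Doubling n adds slope * log(2), whatever the starting value of n.
effect_of_doubling <- slope * log(2)

print(predict_abs_change(c(1, 5, 30)))
print(effect_of_doubling)
```

So a day with twice as many CNBC tweets is associated with an absolute log change that is roughly 0.6 percentage points larger, which is an association and not evidence of causation.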

Then we also regressed the log close price on the daily number of likes of Elon Musk’s tweets.

elon_musk<-read_csv("data/Twitter_Data/Elon_Musk_(@elonmusk)_Twitter.csv") %>% distinct()%>%
  mutate(Date=as.character(time))
problems(elon_musk)

# tibble [0 x 4]
# ... with 4 variables: row <int>, col <int>, expected <chr>, actual <chr>

daily_likes <- aggregate(elon_musk$likes, by=list(elon_musk$Date), sum) %>%
  rename( Dates = Group.1, likes = x)  %>% mutate( Date = as.character(Dates)) %>%
  select(Date, likes) 
  
daily_price<-read_csv('data/Stock_Data/TSLA.csv')%>%
    transmute(Date=as.character.Date(Date),Open,Close,Volume) 

t_elon_S_price <- inner_join( daily_price, daily_likes, by =  "Date" ) 
beta <- coef(lm( log(Close) ~ likes , data =  t_elon_S_price)) 

# The model is fitted on likes (not log(likes)), so the fitted line is drawn against likes
ggplot( t_elon_S_price ) + geom_point( aes(likes,log(Close)  )) +
  geom_abline(intercept = beta[1], slope = beta[2], color = 'red',
      alpha = 0.3, size = 1 )

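We can see that there is no clear correlation between the number of tweets or likes and the level of the stock price. The one exception is the size of the daily move: the absolute log change rises with the log number of CNBC tweets (p < 0.001), although the R-squared is only about 0.10.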

Multivariate Regression

Since the change in stock price cannot simply be explained by the number of tweets, we thought it might be better explained by the actual content of each tweet.

We assigned sentiment scores to the text of each tweet and regressed the stock price on them.

Sentiment Scores

library(syuzhet)
library(SnowballC)
library(NLP)
library(tm)
library(wordcloud)

So we decided to do some natural language processing: we stripped hashtags and links and used an NLP library (syuzhet) to assign each tweet scores for 8 different emotions, plus overall positive and negative scores.

# Choose the dataset to score by leaving exactly one of these lines uncommented
# name <- "Elon_Musk_(@elonmusk)_Twitter"
name <- "CNBC_TSLA_News"
# name <- "@elonmusk_@Tesla"
import_path <- paste("data/Twitter_Data/", name, ".csv", sep='')
tweets <- read_csv(import_path) %>% 
  mutate(text = str_to_lower(text)) %>% 
  # Caution: ".*" is greedy, so the next three patterns also delete any text that follows the link, hashtag, or mention on the same line
  mutate(text = str_replace_all(text, "https?.*\\s?","")) %>% 
  mutate(text = str_replace_all(text, "[#@].*\\s?","")) %>% 
  mutate(text = str_replace_all(text, "pic.twitter.*\\s?","")) %>% 
  mutate(text = str_replace_all(text, "rt","")) %>% # also strips "rt" inside words such as "start"
  mutate(text = str_replace_all(text, "[[:punct:]]","")) %>% 
  mutate(text = str_replace_all(text, "[ |\t]{2,}","")) %>% # deletes runs of spaces entirely, joining the neighbouring words
  mutate(text = str_replace_all(text, "^\\s","")) %>% # Remove blank spaces at the beginning
  mutate(text = str_replace_all(text, "\\s$", "")) %>% #Remove blank spaces at the end
  mutate(text = removeWords(text,stopwords())) # drop English stop words (tm package)
 
sentiment_scores <- get_nrc_sentiment(tweets$text) 

tweets_with_scores <- merge(tweets, sentiment_scores, by="row.names", all.x=TRUE) %>% 
  select(-1)

export_path <- paste("data/Twitter_Data/", name, "_with_scores.csv", sep='')
write.table(tweets_with_scores, file = export_path, sep=",", row.names=FALSE)
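
The cleaning patterns above rely on greedy `.*`, which removes everything from a link or hashtag to the end of the line. A possible tighter variant works token by token. The sketch below is only an illustration on a made-up tweet, written with base R `gsub()` instead of stringr so that it runs without extra packages; it is not the code that produced our scores.

```r
# A made-up tweet, for illustration only.
tweet <- "rt tesla shares start smart rally https://example.com/abc after #earnings call with @elonmusk today"

# Greedy pattern: deletes the link and everything after it on the line.
greedy <- gsub("https?.*\\s?", "", tweet, perl = TRUE)

# Replacing the bare string "rt" also damages ordinary words.
naive_rt <- gsub("rt", "", "start smart", fixed = TRUE)

# Token-level alternatives: \\S matches non-whitespace only, \\b marks word boundaries.
clean <- gsub("https?\\S*", "", tweet, perl = TRUE)  # links
clean <- gsub("[#@]\\S+", "", clean, perl = TRUE)    # hashtags and mentions
clean <- gsub("\\brt\\b", "", clean, perl = TRUE)    # the retweet marker as a whole word
clean <- gsub("\\s+", " ", clean, perl = TRUE)       # collapse runs of whitespace to one space
clean <- trimws(clean)

print(greedy)
print(naive_rt)
print(clean)
```

Switching to patterns like these would change the cleaned text and therefore the sentiment scores, so the results reported below would need to be recomputed.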

cnbc_score<-read_csv("data/Twitter_Data/CNBC_TSLA_News_with_scores.csv")

sentiment_cnbc<-list(x=c("anger","anticipation","disgust","fear","joy","sadness","surprise","trust","negative","positive"),y=cnbc_score%>%
  select(anger,anticipation,disgust,fear,joy,sadness,surprise,trust,negative,positive)%>%
  colSums())
sentiment_cnbc<-as_tibble(sentiment_cnbc)
colnames(sentiment_cnbc)<-c("sentiment","value")

ggplot(sentiment_cnbc)+geom_col(aes(sentiment,value,fill=sentiment))

As you can see, all eight emotions increase in intensity over time. This is explained by the fact that the number of CNBC tweets related to Tesla gradually increases.

Multivariate Regression

Then we can plot the emotions against the stock price fluctuations.

price_lag <- daily_price %>% 
  mutate(log_close = log(Close)) %>% 
  mutate(change=(Close/lag(Close,1)-1))

tweets_with_scores <- tweets_with_scores %>%
  mutate(Date=as.Date(time, format = "%d.%m.%Y")) %>% 
  select(Date, everything()) %>% 
  select(-time)

emotions <- colnames(tweets_with_scores)[6:15] %>% 
  paste(shQuote(., type="sh"), collapse=", ")

summarized_tweets_with_scores <- tweets_with_scores %>% 
  group_by(Date) %>% 
  summarise(
    anger = sum(anger),     
    anticipation = sum(anticipation),
    disgust = sum(disgust),   
    fear = sum(fear),     
    joy = sum(joy),
    sadness = sum(sadness),
    surprise = sum(surprise), 
    trust = sum(trust),
    negative = sum(negative),
    positive = sum(positive)
  )

daily_price_with_scores <- price_lag %>% 
  mutate(Date=as.Date(Date)) %>% 
  left_join(summarized_tweets_with_scores, by="Date") %>% 
  .[complete.cases(.), ]

daily_price_with_scores %>% gather("id", "value", 7:16) %>% 
  ggplot(., aes(Date, value))+
  geom_point(position = "jitter")+
  geom_smooth(method = "lm", se=FALSE, color="blue")+
  facet_wrap(~id)

Note that the emotion scores do not add up to a fixed total such as 100%; they are discrete counts obtained by summing the emotion scores of all tweets on a given day. This also makes our research harder, since the emotion variables, which serve as independent variables, may be highly correlated with each other.
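
One way to quantify how strongly the emotion variables overlap is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R²). The sketch below uses synthetic scores in which `negative` is deliberately built from `anger` and `fear`; the data, the seed, and the helper `vif_one` are our assumptions for illustration, not our actual daily scores.

```r
# Synthetic daily scores, for illustration only: "negative" is constructed
# to overlap heavily with "anger" and "fear".
set.seed(42)
n_days <- 200
anger    <- rpois(n_days, 2)
fear     <- rpois(n_days, 2)
joy      <- rpois(n_days, 1)
negative <- anger + fear + rpois(n_days, 1)
scores <- data.frame(anger, fear, joy, negative)

# Variance inflation factor for one predictor: regress it on the others,
# then VIF = 1 / (1 - R^2). Values well above 1 signal collinearity.
vif_one <- function(var, data) {
  others <- setdiff(names(data), var)
  fit <- lm(reformulate(others, response = var), data = data)
  1 / (1 - summary(fit)$r.squared)
}

vifs <- sapply(names(scores), vif_one, data = scores)
print(round(cor(scores), 2))
print(round(vifs, 2))
```

Applied to the real daily sums, large VIFs would suggest dropping or combining some emotion columns before interpreting individual regression coefficients.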

# Negative score (blue) against the daily return scaled by 200 (red); the secondary axis maps back to returns
(daily_price_with_scores %>%  
    filter(negative>2.5) %>% 
    filter(Date>as.Date("2018-03-20")) %>% 
    ggplot(aes(x=Date)) + 
    geom_line(aes(y=negative),color="blue")+
    geom_line(aes(y=change*200),color="red")+
    scale_y_continuous(sec.axis = sec_axis(~./200,name="daily return")))

After filtering out the days with low negative scores, we get a roughly symmetric chart that suggests how negativity relates to stock price changes.

#We cleaned the data and ran a multivariate regression on multiple sentiment scores. 
mod_multi<-lm(log_close~anger+anticipation+disgust+fear+joy+sadness+surprise+trust+negative+positive,data=daily_price_with_scores)
summary(mod_multi)


Call:
lm(formula = log_close ~ anger + anticipation + disgust + fear + 
    joy + sadness + surprise + trust + negative + positive, data = daily_price_with_scores)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.51553 -0.06532  0.02118  0.10198  0.23274 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.7188756  0.0091740 623.379   <2e-16 ***
anger        -0.0099632  0.0069658  -1.430    0.153    
anticipation  0.0035029  0.0057313   0.611    0.541    
disgust       0.0140225  0.0117404   1.194    0.233    
fear         -0.0016951  0.0077855  -0.218    0.828    
joy           0.0085384  0.0081186   1.052    0.294    
sadness      -0.0062242  0.0084448  -0.737    0.462    
surprise     -0.0083060  0.0075390  -1.102    0.271    
trust         0.0048249  0.0051459   0.938    0.349    
negative      0.0006514  0.0056095   0.116    0.908    
positive      0.0003931  0.0033989   0.116    0.908    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1428 on 426 degrees of freedom
Multiple R-squared:  0.0282,    Adjusted R-squared:  0.005383 
F-statistic: 1.236 on 10 and 426 DF,  p-value: 0.2656

The coefficients of the sentiment variables are all small and statistically insignificant (the R-squared is about 0.03), so we believe they are not good indicators of the price variable.

(Binary) Logistic Regression

Finally, we also simplified the stock movement to a binary up-or-down variable and ran a logistic regression. We used the negative and positive scores to see how these attitudes influence the probability that the stock price goes up.

# Logistic regression of the up/down indicator on the daily positive and negative scores
binary<-daily_price_with_scores%>%mutate(binary=as.integer(change>0))
mod_binomial<-glm(binary~positive+negative,data=binary,family=binomial)
coef(mod_binomial)

(Intercept)    positive    negative 
 0.24266726  0.01417427 -0.14016648

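The signs of the coefficients are consistent with our hypothesis: positive sentiment is positively associated with the probability that the price rises, and negative sentiment is negatively associated with it. Since only the coefficients are reported here, their statistical significance remains untested.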
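
Logistic regression coefficients are on the log-odds scale, so they are easier to read after conversion to odds ratios or predicted probabilities. A short worked calculation with the coefficients copied from the `coef(mod_binomial)` output above (the helper `p_up` is an illustrative name of ours):

```r
# Coefficients copied from coef(mod_binomial) above.
coefs <- c(intercept = 0.24266726, positive = 0.01417427, negative = -0.14016648)

# exp(coefficient) is the multiplicative change in the odds of an up day
# for a one-unit increase in the score, holding the other score fixed.
odds_ratios <- exp(coefs[c("positive", "negative")])

# Predicted probability of an up day for given daily scores.
p_up <- function(positive, negative) {
  plogis(coefs[["intercept"]] + coefs[["positive"]] * positive +
           coefs[["negative"]] * negative)
}

print(odds_ratios)
print(p_up(positive = 0, negative = 0))
print(p_up(positive = 0, negative = 3))
```

Each additional unit of negative score multiplies the odds of an up day by about 0.87, while a unit of positive score multiplies them by only about 1.01, so the positive effect is very small in practical terms.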

Correlation Matrix

Lastly, we use a correlation matrix to examine the correlation between every pair of variables. The plot shows positive correlations in blue and negative ones in red, with the color depth and size of each bubble indicating the magnitude of the correlation.

library(Hmisc)
library(corrplot)

mydata_daily_price_with_scores <- daily_price_with_scores[, c(2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)]

(res <- cor(mydata_daily_price_with_scores) %>% round( 2))

              Open Close Volume log_close change anger anticipation
Open          1.00  0.98   0.07      0.98  -0.07 -0.04         0.06
Close         0.98  1.00   0.08      0.99   0.08 -0.06         0.05
Volume        0.07  0.08   1.00      0.09   0.05  0.37         0.42
log_close     0.98  0.99   0.09      1.00   0.08 -0.04         0.06
change       -0.07  0.08   0.05      0.08   1.00 -0.24        -0.13
             disgust  fear   joy sadness surprise trust negative positive
Open            0.00 -0.04  0.10   -0.03     0.00  0.07    -0.03     0.07
Close           0.00 -0.06  0.08   -0.06    -0.01  0.06    -0.05     0.05
Volume          0.38  0.40  0.19    0.39     0.27  0.40     0.46     0.41
log_close       0.01 -0.04  0.10   -0.04     0.00  0.07    -0.03     0.06
change         -0.07 -0.20 -0.09   -0.21    -0.11 -0.13    -0.23    -0.12
 [ reached getOption("max.print") -- omitted 10 rows ]

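res2 <- rcorr(as.matrix(mydata_daily_price_with_scores))  # p-values must come from the data, not from the rounded correlation matrix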

corrplot(res, type="upper", order="hclust", p.mat = res2$P, sig.level = 0.01, insig = "blank")
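
The p-values used to blank out insignificant cells should be computed from the daily observations themselves. As a hedged base-R sketch of that idea, the loop below builds a p-value matrix with `cor.test()` on a synthetic stand-in for the daily data (the columns, seed, and effect sizes are assumptions for illustration):

```r
# Synthetic stand-in for the daily data, for illustration only.
set.seed(7)
n_days <- 150
negative <- rpois(n_days, 3)
volume   <- 5e6 + 4e5 * negative + rnorm(n_days, sd = 1e6)
noise    <- rnorm(n_days)
daily <- data.frame(negative, volume, noise)

# p-values come from the observations (n_days rows), not from the
# correlation matrix, which has only one row per variable.
vars <- names(daily)
p_mat <- matrix(NA_real_, length(vars), length(vars), dimnames = list(vars, vars))
for (i in seq_along(vars)) {
  for (j in seq_along(vars)) {
    if (i != j) {
      p_mat[i, j] <- cor.test(daily[[i]], daily[[j]])$p.value
    }
  }
}

print(round(cor(daily), 2))
print(signif(p_mat, 2))
```

A matrix like `p_mat` has the same shape as the correlation matrix and can be passed to a plotting function that accepts a p-value matrix.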

We found that the daily price change is weakly negatively correlated with the emotion intensities (between -0.07 and -0.24), while the price level is essentially uncorrelated with them. We hypothesize that people tend to get more emotional about price drops than about gains.

However, we also find that the emotion intensities have a mostly moderate positive correlation with the trading volume, which is the number of shares traded in a day. This seems intuitive, since when people are emotional, they are more likely to trade. Anticipation (0.42) and trust (0.40) are among the emotions most strongly correlated with trading volume, but the pattern is not uniform: joy (0.19) has the weakest correlation, and the overall negative score (0.46) has the strongest.

Conclusion

Now we conclude with the following points:

The number of tweets or likes is not a good predictor of the level of the stock price, although days with more CNBC tweets tend to see larger absolute price changes.

The attitude of the tweets appears to influence the probability that the stock price increases: positive tweets raise the probability of a rise in the stock price, while negative tweets lower it.

Using minutely (instead of daily) stock prices does not significantly change the results.

The numbers of optimistic and pessimistic tweets tend to move together, reflecting people’s split views on Tesla.

Optimistic emotions such as anticipation and trust in tweets are moderately positively correlated with the trading volume of Tesla stock, and so are most pessimistic emotions; joy is the least correlated with volume.

The explanatory power of our model may be limited by the accuracy of the sentiment scores we assigned and by the way we measured the changes. The correlation among the sentiment scores can also influence the outcome of our regression. So there is still room to improve our model.

Future Analysis

Of course, different time-series analysis methods might bring new findings on the relationship between tweet emotions and stock prices.
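
As one hedged example of such a method, a lagged regression asks whether yesterday’s tweet activity relates to today’s price move, instead of comparing same-day values only. The sketch below runs on synthetic data in which that lagged effect is built in by construction (all numbers, names, and the seed are assumptions, not results from our data):

```r
# Synthetic data, for illustration only: tweet counts and absolute returns.
set.seed(3)
n_days <- 300
tweets <- rpois(n_days, 4)

# Hypothetical mechanism: today's volatility depends on yesterday's tweets.
lagged_tweets <- c(NA, tweets[-n_days])
abs_return <- 0.01 + 0.003 * log1p(lagged_tweets) + abs(rnorm(n_days, sd = 0.01))

lag_data <- data.frame(abs_return, tweets, lagged_tweets)

# Regress today's absolute return on today's and yesterday's tweet counts;
# lm() drops the first row, where the lag is undefined.
fit <- lm(abs_return ~ log1p(tweets) + log1p(lagged_tweets), data = lag_data)
print(round(coef(summary(fit)), 4))
```

On the real data, the lag would have to respect trading days, since tweets posted on weekends and holidays have no same-day price observation.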

Since TSLA stock surged in popularity both on Wall Street and on social media in recent years, our analysis window is likely limited and biased.

Shiny App

We also made a Shiny app [Shiny App: zackLight.com ]. It displays the news for the date range you select in the stock price chart. Please check it out.