Robertson Wang
Washington D.C., United States
Office: 1850 K Street
Email: robertsonwang [at] gmail.com
Phone: +1.(908).745.8075



Example single-document summarization using AIG's 2009 Earnings Announcement

Back-end code for creating single doc summaries using TextRank, Naive BOW methods, and LSI

      In this project, I use a corpus of earnings announcements sourced from the websites of large financial institutions. For each institution, I attempt to automatically generate a short summary of its quarterly earnings announcements. The academic finance literature has shown that the textual information embedded in firm earnings announcements significantly explains firm stock returns, volatility, and earnings. I try three approaches to generating single-document summaries:

  • Naive bag-of-words methods - In general, this method weights the sentences within a document using some underlying bag-of-words measure and returns the highest-ranking sentences in the order in which they appear in the document. The bag-of-words measures I used include keywords based on tf-idf ranking, summation-based selection, and density-based selection.
  • TextRank - This method follows the algorithm described in Mihalcea and Tarau, 2004, a modified version of Google's PageRank in which nodes are sentences and edges are weighted by sentence similarity.
  • Latent Semantic Indexing - This method follows the algorithm described in Gong and Liu, 2001. It is essentially a singular value decomposition, a close relative of Principal Components Analysis, applied to a term-by-sentence matrix.
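
As an illustration of the TextRank approach listed above, here is a minimal sketch in plain Python. It is not the project's back-end code: the function names (`tokenize`, `similarity`, `textrank`, `summarize`), the regex tokenizer, and the fixed iteration count are assumptions made for this example. The similarity measure is the word-overlap measure from Mihalcea and Tarau, 2004, and the selected sentences are returned in document order.

```python
import math
import re


def tokenize(sentence):
    # Assumed tokenizer: lowercase alphanumeric words only.
    return re.findall(r"[a-z0-9']+", sentence.lower())


def similarity(tokens_a, tokens_b):
    # Word overlap normalized by the log sentence lengths.
    denom = math.log(len(tokens_a)) + math.log(len(tokens_b)) if tokens_a and tokens_b else 0.0
    if denom <= 0:
        return 0.0
    return len(set(tokens_a) & set(tokens_b)) / denom


def textrank(sentences, damping=0.85, iterations=50):
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Symmetric weighted graph: nodes are sentences, edges are similarities.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            weights[i][j] = weights[j][i] = similarity(tokens[i], tokens[j])
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Weighted PageRank update applied to every sentence at once.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if weights[j][i] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(sentences, k=2):
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    # Return the k highest-scoring sentences in their original order.
    return [sentences[i] for i in sorted(top)]
```

Calling `summarize(sentences, k=3)` on a list of sentence strings returns the three highest-scoring sentences. A production version would likely add stop-word removal and a convergence check instead of a fixed number of iterations.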

Recommendation model overview and a short explanation of the features/ML models used

An example recommendation using a user from the 2017 Yelp Academic Dataset

Back-end code for model testing, feature engineering, and making recommendations

      I used the 2017 Yelp Academic Dataset to build a recommendation system using only NLP methods. In the past, Yelp recommendation systems have been built using matrix completion methods and restaurant attributes. However, I wanted to show that it is possible to achieve similar recommendation results using measures of linguistic tone and topic models. Specifically, I use sentiment analysis based on the Hu & Liu 2004 word dictionary, Latent Dirichlet Allocation, Latent Semantic Analysis, and a bigram (2-word n-gram) TF-IDF feature matrix. The machine learning models that I test on each user are Random Forest, Linear Support Vector Machine, and Naive Bayes.
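
To make the bigram TF-IDF feature concrete, the sketch below computes it in plain Python. It is an illustration under stated assumptions, not the project's feature-engineering code: the names `bigrams` and `tfidf_bigrams` are hypothetical, the tokenizer is a simple regex, and the weighting is the textbook form tf × log(N / df), which differs from the smoothed variants some libraries use by default.

```python
import math
import re
from collections import Counter


def bigrams(text):
    # Assumed tokenizer: lowercase words, adjacent pairs joined by a space.
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(pair) for pair in zip(words, words[1:])]


def tfidf_bigrams(documents):
    counts = [Counter(bigrams(doc)) for doc in documents]
    n_docs = len(documents)
    # Document frequency: number of documents containing each bigram.
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    features = []
    for c in counts:
        total = sum(c.values())
        # Relative term frequency times inverse document frequency.
        features.append({
            gram: (tf / total) * math.log(n_docs / doc_freq[gram])
            for gram, tf in c.items()
        })
    return features
```

Each review becomes a sparse dictionary mapping bigrams to weights. Bigrams that appear in every document receive a weight of zero, so only distinctive phrases contribute to the feature matrix.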

Corporate Earnings Report Crawler for NASDAQ Listed Companies

      A quick little script that pulls the entire history of corporate earnings for any user-specified companies listed on the NASDAQ stock exchange. The script takes stock tickers as user input. Note that this script relies on the selenium, BeautifulSoup, and pdfkit modules. These can be installed using the line 'pip install [module_name]'. The usual open-source/noncommercial license legalese applies.

UK Inflation and Unemployment: A Time Series Analysis

      A brief exercise in which we explore the time series properties of UK inflation in the period from 1986 to 2015. We plot the annual and quarterly breakdowns of the series, look at the autocorrelation, and estimate the power spectral density. We use the Phillips-Perron and augmented Dickey-Fuller tests and find that we cannot reject the null hypothesis of a unit root in the series at the 5% significance level. We use VAR and OLS techniques to develop an understanding of the relationship between UK unemployment and inflation. We also run Granger causality tests for the effect of inflation on unemployment and vice versa. We find that we can reject the null of each series not Granger-causing the other at the 5% significance level. In addition, we plot the impulse response functions of each series responding to a shock in the other series and decompose the forecast error variance for both series. We conclude by developing a state space time-varying parameter model, setting inflation as the state and unemployment as the latent variable.
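
The autocorrelation step mentioned above can be sketched in a few lines of plain Python. This is an illustrative implementation of the standard sample autocorrelation function, not the code used in the analysis, and the function name `autocorrelation` is an assumption. An autocorrelation that decays very slowly with the lag is an informal sign of the kind of persistence that the unit root tests then examine formally.

```python
def autocorrelation(series, max_lag):
    """Sample autocorrelation of `series` for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    # Lag-k autocovariance divided by the lag-0 autocovariance.
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]
```

The first element of the returned list is always 1.0 (lag zero); the remaining elements can be plotted as a correlogram.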

UK Unemployment: A Time Series Analysis

      We test basic time series models against UK unemployment data from 1986 to 2015. We first note that the null hypothesis, the presence of a unit root in the time series, cannot be rejected. This indicates that the time series is not stationary, which precludes the use of ARMA models without first differencing the data. Nonetheless, as an exercise, we demonstrate the results using various AR and ARMA processes. We use Monte Carlo methods to verify our estimation procedures. Next, we perform a model selection exercise using the ARMA(p,q) model. We test combinations of lag orders, p and q, from 0 to 5. We find that the AIC and AICc criteria indicate the use of an ARMA(5,5) model, with ARMA(4,3) having very close values. The BIC criterion indicates very large lag structures, but the ARMA(2,1) model is not too far off from the minimum value. Finally, we conclude by performing a forecasting exercise using the AR(2) and ARMA(1,1) processes.
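
For readers unfamiliar with the selection criteria named above, the following sketch shows how AIC, AICc, and BIC are computed from a fitted model's maximized log-likelihood and how an order (p, q) could be picked from a grid. It is a hedged illustration, not the code behind the reported results: the helper names `information_criteria` and `select_order` are hypothetical, and the parameter count k = p + q + 1 (AR terms, MA terms, and the innovation variance, with no constant) is an assumption that may differ from the convention used in the original analysis.

```python
import math


def information_criteria(log_likelihood, k, n):
    """Return AIC, AICc, and BIC for k parameters and n observations."""
    aic = 2 * k - 2 * log_likelihood
    # Small-sample correction; requires n > k + 1.
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)
    bic = k * math.log(n) - 2 * log_likelihood
    return {"AIC": aic, "AICc": aicc, "BIC": bic}


def select_order(log_likelihoods, n, criterion="AIC"):
    """Pick the (p, q) with the smallest criterion value.

    `log_likelihoods` maps (p, q) tuples to maximized log-likelihoods
    obtained from whatever ARMA estimation routine is in use.
    """
    def score(order):
        p, q = order
        k = p + q + 1  # assumed parameter count, see the note above
        return information_criteria(log_likelihoods[order], k, n)[criterion]

    return min(log_likelihoods, key=score)
```

Because the BIC penalty k·ln(n) exceeds the AIC penalty 2k once n is larger than about 7, the two criteria can disagree on the preferred lag order, which is why it is useful to report both.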