Technologies used: PostgreSQL, Python, Scrapy, SQLAlchemy, WEKA.
Use machine learning to predict whether a penny stock will close up or down the next day.
This section will go over the tools used, and the tasks that were completed to see this project through.
At a high level, penny stock data needed to be collected. Then it was cleaned to remove inconsistencies and anomalies. After that, features were chosen that were thought to produce more accurate predictions. Then prediction models were tested on the data, and finally the results were evaluated.
To perform this project, it was decided to use the following technologies
First it was necessary to get a set of tickers to collect data on. The following set of tickers was produced using Google’s Stock Screener with criteria of $5M minimum market cap, $0 to $1 price range, and average volume of $500,000+
XXII, NDN, AMOVF, ANAD, AXAS, ACOL, BIRT, AEA, AATI, ARO, ERBB, HCT, ATRS, AREX, ACIIQ, MWIP, ASTI, ARP, ATOS, AONEQ, BCCI, BLTA, BAA, BIOD, BIEL, BLSP, BDBD, BBEP, IVIXF, CEC, CVM, CTIC, CRC, CANV, CBIS, CVEO, CLOW, CSA, FITX, CMLS, CYTX, DFG, DEWM, DRYS, EPGL, XCO, ECR, EBIO, ELTP, EXXI, FFMGF, GTATQ, GALE, GST, GNKOQ, GEQU, GSAT, GOL, GSS, GDPM, GBSN, GPL, PHOT, ZINCQ, HK, HEB, HEMP, HYSR, IDTA, IDCN, INSV, ICLD, KEG, LVSDF, LLBO, LF, LINE, LNCO, LQMT, MDMN, MONIF, MNKD, MCOA, MSTX, MJNA, MCPIQ, MYEC, NXTD, NBGGY, NAVB, NM, NNUP, ORIG, PMCS, PACD, PGNPF, PGH, PWE, PMCB, PSID, RXMD, PPCH, PVHO, PSWS, QTM, RMHB, RRR, REXX, RBYCF, SDOC, SSE, SGLB, SXE, SUG, SFSF, STPFQ, SNTA, BNHLF, TTCM, TEA, TNE, TRTC, TNB, TCPTF, THLD, TGD, TPAC, TPLM, UPL, UNIS, UEC, VLDI, VSPC, VMEM, VPRO, VTCQ, WRES, WHEN, XOMA
During the first iteration of the project, Scrapy was used to scrape the NASDAQ site for historical price data on the above tickers. However, after some more searching, it was decided to use the yahoo_finance python package instead of scraping the NASDAQ site. The main reasons for that change were that it simplified the codebase, and it made the system more reliable. The data was collected for one year prior, ranging from 2015-02-24 to 2016-02-23.
After collecting the data, storing it in the database, and looking at it with WEKA, it was time to clean the data.
30707 records were left after cleanups.
The records collected had the following attributes date, ticker, day_open, day_close, day_high, day_low, day_volume, and avg_daily_volume.
This data was used to derive more features (attributes) such as two_day_volume, three_day_volume, and more. The reason for this was to see if a certain set of features would better predict whether a stock would close a winner or a loser. Therefore, in total, the following attributes were derived
id, date, ticker, day_open, day_close, day_high, day_low, prev_day_high, prev_day_low, day_volume, two_day_volume, three_day_volume, four_day_volume, five_day_volume, six_day_volume, seven_day_volume, eight_day_volume, nine_day_volume, ten_day_volume, three_week_volume, one_month_volume, avg_daily_volume, day_winner
Next, it was time to choose a subset of the above features that would help predict whether a stock would close a winner or loser. WEKA’s attribute selection tool includes several algorithms to help choose a subset of features. A few notable ones were GainRatioAttributeEval, InfoGainAttributeEval, and ClassifierSubsetEval.
ClassifierSubsetEval evaluates attribute subsets on training data or a seperate hold out testing set. Since it was thought that a classification tree may be best for predicting results in this project, choosing attributes by testing them out on a classification tree model sounded like a good idea - and that is what ClassifierSubsetEval does. The evaluator was told to test attributes on a J48 Tree exhaustively, using training data. By using ExhaustiveSearch as the search method, the evaluator performed an exhaustive search through the space of attribute subsets starting from the empty set of attributes. This meant that it would be slow but exhaustive. Based on the results, the following attributes were chosen
Now it was time to choose a classifier. The training data set was initially run through the ZeroR rule classifier, in order to come up with baseline results that can be used to compare other classifiers to. ZeroR is good for this, because it takes into consideration 0 features, and just returns the value of the class. So if a classifier/feature set combination can beat the results of ZeroR, then it is a worthy model.
Next, several classifiers were trained on the same training data, and tested on the same testing data as ZeroR above. The one with the best results was a J48 Tree with WEKA’s default parameters, including confidenceFactor of 0.25 and minNumObj of 2. The following confusion matrix briefly summarizes the results
=== Confusion Matrix === a b <-- classified as 194 772 | a = t 263 1837 | b = f
Looking at the confusion matrix, one can see that the model produced 194 true positives, 772 false negatives, 263 false positives, and 1837 true negatives. With regard to our application of the model, we could not afford false positives (model says buy, and stock loses), because we would end up spending money on losing stocks. It was OK to have false negatives (model says do not buy, and stock wins), although we wish they were predicted correctly to not miss out on those gains. It was also good to predict true negatives because those were potential losses averted.
Comparing this model’s results with the results of ZeroR previously, the model performed worse than ZeroR. This meant that it was better to just not play penny stocks with our model, since ZeroR showed us that we would end up with a losing stock around 68.5% of the time.
The results showed that historical price data alone is not enough to predict stock performance. Next steps are to collect promotional data from online forums and newsletters and combine it with the previously collected price data in an attempt to produce better results.
Thank you for reading along! I hope you enjoyed. Feel free to email me feedback and see more projects.