My new project is to analyze EDGAR documents. EDGAR is where companies in the US are required by law to file their annual reports, quarterly reports, and so on. Many investors consider this data a valuable resource for investment decisions.
Challenge
- Volume: Volume is the first challenge. EDGAR receives about 3,000 announcements per day, and these can contain up to 40k actual files for that day. Analyzing 40k files per day is hard work if you need to cover the last 3 months of data. In particular, if you need to repeat the analysis with a modified parameter, 40k files per day becomes quite a performance burden.
- Cracking documents: The documents come in various file types, including HTML, PDF, Excel, JPG, PNG, GIF, JS, etc. OCR (Optical Character Recognition) is required for PDF, JPG, and other image-based files.
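To put the volume in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes that "3 months" means 90 days and that every day hits the 40k upper bound, so it overstates the real load. The function name is my own, made up for illustration.

```python
# Back-of-the-envelope estimate of the workload described above.
FILES_PER_DAY = 40_000  # upper bound quoted above
DAYS = 90               # assumption: "3 months" taken as 90 days, every day at the upper bound


def total_files(files_per_day: int, days: int, passes: int = 1) -> int:
    """Number of files processed when the whole window is analyzed `passes` times."""
    return files_per_day * days * passes


if __name__ == "__main__":
    print(total_files(FILES_PER_DAY, DAYS))      # a single pass over the window
    print(total_files(FILES_PER_DAY, DAYS, 3))   # three passes, e.g. after changing a parameter twice
```

Under these assumptions a single pass touches 3.6 million files, and every re-run with a modified parameter adds the same amount again.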
My first choice of tool for this project was Azure Cognitive Search. After looking at the JFK sample, it was very clear to me that Azure Cognitive Search was the right solution. The most appealing feature was that Azure Cognitive Search can crack any type of document, including scanned handwriting, into text.
A typical flow is: 1) upload the files to Blob storage, 2) Azure Cognitive Search grabs each file, cracks it (including OCR), and indexes it, 3) serve the indexed data.
We can achieve this just by configuring Azure Cognitive Search, without any coding. The configuration includes 1) the index fields, 2) a skillset, and 3) an indexer, which manages the data flow from the data source through the skillset and on to any destination, e.g. another blob, Cosmos DB, you name it.
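To give a feel for what these three pieces look like, here is a minimal sketch of the definitions written as Python dictionaries. This is not my actual configuration: every name (edgar-index, edgar-skillset, edgar-indexer, edgar-blob-datasource, ocr_text) is hypothetical, and the property names follow the Azure Cognitive Search REST API as I understand it, so check them against the current documentation before using them.

```python
import json

# 1) Index fields: what ends up searchable. All names are hypothetical.
index = {
    "name": "edgar-index",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "content", "type": "Edm.String", "searchable": True},
        {"name": "ocr_text", "type": "Collection(Edm.String)", "searchable": True},
    ],
}

# 2) Skillset: the enrichment steps. Here only the built-in OCR skill,
#    applied to every image extracted from a document.
skillset = {
    "name": "edgar-skillset",
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
            "context": "/document/normalized_images/*",
            "inputs": [{"name": "image", "source": "/document/normalized_images/*"}],
            "outputs": [{"name": "text", "targetName": "text"}],
        }
    ],
}

# 3) Indexer: wires the blob data source to the skillset and the index.
indexer = {
    "name": "edgar-indexer",
    "dataSourceName": "edgar-blob-datasource",  # assumed to exist already
    "targetIndexName": index["name"],
    "skillsetName": skillset["name"],
    "parameters": {
        "configuration": {
            "dataToExtract": "contentAndMetadata",
            "imageAction": "generateNormalizedImages",  # required for OCR on embedded images
        }
    },
    "outputFieldMappings": [
        {
            "sourceFieldName": "/document/normalized_images/*/text",
            "targetFieldName": "ocr_text",
        }
    ],
}

if __name__ == "__main__":
    # Each dictionary would be sent as the JSON body of the matching REST call.
    for definition in (index, skillset, indexer):
        print(json.dumps(definition, indent=2))
```

The point is that the whole pipeline is declarative: the indexer only refers to the other pieces by name.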
To apply some custom analysis features, I wrote Azure Functions and plugged them into the skillset as custom skills. This is the only coding I’ve done.
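A custom skill is essentially a web endpoint that receives a batch of records and returns one result per record, matched by recordId. The sketch below shows only that request/response shape as a plain Python function, without the Azure Functions wrapper. It is not the code I deployed: the function name, the "text" input, and the "wordCount" output are placeholders I made up for illustration.

```python
from typing import Any, Dict, List


def run_custom_skill(body: Dict[str, Any]) -> Dict[str, Any]:
    """Handle one custom-skill request: a batch of records in, one result per record out."""
    results: List[Dict[str, Any]] = []
    for record in body.get("values", []):
        # "text" is a hypothetical input name; it must match the skill's inputs in the skillset.
        text = record.get("data", {}).get("text", "")
        results.append(
            {
                "recordId": record["recordId"],  # must be echoed back unchanged
                "data": {"wordCount": len(text.split())},  # placeholder analysis
                "errors": [],
                "warnings": [],
            }
        )
    return {"values": results}


if __name__ == "__main__":
    request = {
        "values": [
            {"recordId": "1", "data": {"text": "annual report 2019"}},
            {"recordId": "2", "data": {"text": ""}},
        ]
    }
    print(run_custom_skill(request))
```

In an Azure Function, the HTTP trigger would parse the request JSON, call a function like this, and return the result as the JSON response.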
Everything went well and worked well. I could upload any type of document to Azure Cognitive Search and have it cracked. It is simply capable out of the box.
Another great feature of the search service is that it is easy to scale out. By raising the number of Units, up to a maximum of 36, I could easily scale it out.
However, these great features come at a cost.
- Storage: With the standard S1 tier, the index provides 25GB of storage. This only lasts a few months for my project. The standard S2 tier gives us 100GB. Still, as I need to continuously add new documents to the index, the 100GB will fill up eventually.
- Cost: The standard S1 tier costs US$245.28 per month, per Unit. Note that this is a per-Unit price. If you use 4 Units to increase the performance, the cost will be US$981.12 per month. The standard S2 tier with 4 Units costs US$3,924.48 per month.
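The cost arithmetic above is simple enough to script. The sketch below uses only the prices quoted in this post; the S2 per-Unit price is not quoted directly, so I derive it from the 4-Unit figure (US$3,924.48 / 4). The function name is made up, and prices change over time, so treat the numbers as a snapshot.

```python
from decimal import Decimal

# Monthly price per Unit in US$, taken from the figures quoted above.
PRICE_PER_UNIT = {
    "S1": Decimal("245.28"),
    "S2": Decimal("3924.48") / 4,  # derived from the 4-Unit price, i.e. 981.12
}

MAX_UNITS = 36  # upper limit on Units mentioned above


def monthly_cost(tier: str, units: int) -> Decimal:
    """Monthly cost in US$ for a tier running the given number of Units."""
    if not 1 <= units <= MAX_UNITS:
        raise ValueError(f"units must be between 1 and {MAX_UNITS}")
    return PRICE_PER_UNIT[tier] * units


if __name__ == "__main__":
    print(monthly_cost("S1", 4))  # 981.12
    print(monthly_cost("S2", 4))  # 3924.48
```

Because the price is multiplied by the number of Units, scaling out for performance scales the bill by the same factor.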
The cloud is known for being productive and economical. What I’ve learned from this experience is that the cloud can be productive, but it is not cheap.