I spent the day at a data science event held by the Minneanalytics community: Data Tech 2019! I summoned my inner sponge and absorbed as much information as possible. I listened to a few talks on machine learning, creating data lakes, and natural language processing. Here are the themes I took away from the event:
Just as in technical writing, knowing your audience is very important. Think about who will use the data. How quickly will they need it, and what do they need it for? Is this data meant for machine learning models, or is it data a business analyst needs in a report every morning? These are the kinds of questions you should be asking to avoid turning your data lakes into data swamps, and to create a positive data culture from which to structure and plan your data governance.
Data governance is key to successful data lakes and big data storage. Things like cataloging and securing data are of the utmost importance. The data should also be structured in ways that make searching it simple and auditable. If the data cannot be easily audited, problems like duplication can creep in and human error might not get caught quickly.
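To make the cataloging idea a little more concrete, here is a minimal sketch of what a single catalog entry might record; the fields, dataset name, and values are my own hypothetical illustration, not anything presented at the event.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class CatalogEntry:
    """One hypothetical record in a data catalog: enough metadata to
    find a dataset and to audit where it came from and when."""
    name: str
    owner: str
    source_system: str
    schema: dict[str, str]           # column name -> type, for searchability
    last_loaded: datetime
    row_count: int
    checksum: str                    # helps spot accidental duplicate loads
    tags: list[str] = field(default_factory=list)


entry = CatalogEntry(
    name="sales_daily",
    owner="analytics-team",
    source_system="erp",
    schema={"sale_date": "date", "store_id": "int", "revenue": "decimal"},
    last_loaded=datetime(2019, 6, 5),
    row_count=125_000,
    checksum="sha256:placeholder",
    tags=["finance", "daily"],
)
print(entry.name, entry.tags)
```

Even a lightweight record like this gives you something to search against and a trail to audit when numbers look off.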
Automation is essential to scale big data. A data lake with hundreds or thousands of pipelines cannot be managed by hand. The cleaning and loading steps need to be automated so that incorporating future data sources is easier.
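As a rough sketch of what config-driven ingestion could look like, the snippet below loops over a list of source definitions and applies the same cleaning to each; the source names, paths, and cleaning rules are assumptions for illustration, not anything demoed at the event.

```python
import pandas as pd

# Hypothetical source configs; in practice these might live in a
# catalog or YAML file rather than inline like this.
SOURCES = [
    {"name": "orders",    "path": "raw/orders.csv",    "date_cols": ["order_date"]},
    {"name": "customers", "path": "raw/customers.csv", "date_cols": []},
]


def clean(df: pd.DataFrame, date_cols: list[str]) -> pd.DataFrame:
    """Apply the same basic cleaning to every source: normalize column
    names, parse dates, and drop exact duplicate rows."""
    df.columns = [c.strip().lower() for c in df.columns]
    for col in date_cols:
        df[col] = pd.to_datetime(df[col], errors="coerce")
    return df.drop_duplicates()


def run_pipelines(sources: list[dict]) -> None:
    # Adding a new source means adding a config entry, not a new pipeline.
    for src in sources:
        df = pd.read_csv(src["path"])
        df = clean(df, src["date_cols"])
        df.to_parquet(f"lake/{src['name']}.parquet", index=False)


if __name__ == "__main__":
    run_pipelines(SOURCES)
```

The point is that new data lands in the lake by editing configuration, not by writing another one-off script.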
Ensembles are increasingly common. Whether it is for evaluating features or comparing outcomes, it seems like more and more tools are incorporating the ability to generate different kinds/sets of models and let you select the one with the best fit. For example, one of the sessions discussed using a set of 90+ features to find machine-learned models that could predict future S&P 500 prices. Unfortunately, the speaker concluded that beating the market with models trained on features derived from technical indicators might still be beyond AI.
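Here is a minimal sketch of that "fit several model families, keep the best" idea using scikit-learn; the synthetic data and the particular models are my own assumptions and stand in for the talk's 90+ technical-indicator features, which are not reproduced here.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a wide feature set (90 columns), not real market data.
X, y = make_regression(n_samples=500, n_features=90, noise=10.0, random_state=0)

candidates = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Score each model family with cross-validation and keep the best fit.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)
print(scores)
print(f"best fit: {best}")
```

For an actual price series you would want a time-aware split such as scikit-learn's TimeSeriesSplit, since shuffled cross-validation leaks future information into the training folds.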