For 8 months I have been working as a Data Scientist / Data Analyst at Scalarr. This is my first full-time job in IT and this was an awesome experience!
As a Data Scientist in a tech startup, I participated in various stages of DS&ML development process:
- Analysis
- data cleaning
- visualization
- Research
- feature engineering
- data preprocessing
- models training
- parameters tuning
- validation and evaluation
- Deployment
More details described here.
Research
Our main job is to prevent fraud in advertisement market. We constantly update the models, look for new features and metrics. So far I experimented with unsupervised learning:
- clusterization
- outliers detection
- isolation forests
classical supervised learning:
- decisions trees
- random forests
- Gaussian Mixture Models
- Support Vector Machines
- Naive Bayes models
- Gradient Boosting Machines:
- xgboost
- adaboost
deep learning approaches:
- Feed-forward Neural Nets
- Recurrent Neural Nets
Each experiment involved data preparation, models training, parameters tuning and evaluation.
Models evaluation
As fraudsters invent new techniques to imitate real users (that is one example of their activities) our models should react correctly. That is somewhat similar to time series prediction validation problems.
ETL processes
Advertisement activities generate huge amount of data - thousands of advertisement clicks every day. Arguably, plain Python does not fit for manipulating large databases.
I worked with Dask and ClickHouse alongside with standard SQL to perform basic ETL tasks.
Analysis optimization
Validating machine learning models often requires additional visualization that go beyond, let’s say, F1-score or confusion matrix numbers. I worked on an interactive tool for Jupyter Notebook to simplify this stage.
Stay tuned for more!