Introduction to Data Science
Data science is the process of extracting value from data in all its forms. Under this process, data from many sources is compared and refined, and useful data is extracted for further action. The term refers to the collective processes, scientific methods, technologies, analyses, knowledge bases and tools involved.
Using the data science approach, scientists apply machine learning algorithms to numbers, text, audio, video, images and more to build artificial intelligence systems.
The overall data science process begins with engineering the raw data: it is manipulated and cleansed to make it valuable, and the processed data is then used to train and deploy a validated model. Refer to the image of the data science pipeline.
At a high level, raw data usually comes in three forms: structured, semi-structured and unstructured.
Structured data: Highly organized data that lives in a repository such as a database or a CSV file. It is easily accessible, and its format lends itself to queries and computation.
Semi-structured data: Data that carries metadata or markup (for example JSON or XML) and can therefore be processed more easily than unstructured data.
Unstructured data: Data without a predefined model, such as natural-language text or an audio stream, which machines cannot interpret directly. Refer to the image below of the forms of data.
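To make the contrast concrete, here is a minimal sketch using only the Python standard library. The data is invented for illustration: the structured sample can be queried directly through its schema, while the unstructured sample is just characters until we impose structure ourselves.

```python
import csv
import io

# Structured: rows and columns with a fixed schema, ready to query.
structured = "name,age\nAda,36\nAlan,41\n"
rows = list(csv.DictReader(io.StringIO(structured)))
adults = [r["name"] for r in rows if int(r["age"]) > 40]

# Unstructured: free text; the machine sees only characters until we
# impose structure ourselves (here, a crude keyword count).
unstructured = "The shipment left the depot late but arrived on time."
mentions_late = unstructured.lower().count("late")
```

The structured sample answers the query in one line because the schema is already there; the unstructured sample needs custom parsing before any question can be asked of it.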
Data engineering is the most important part of the data science process. According to a survey, data scientists spend around 80 percent of their time collecting, manipulating, cleaning and preparing data; the rest is spent on data modeling with algorithms.
Data manipulation: In this step, raw data is transformed to make it useful for data analytics or for training a machine learning model.
Data cleansing: Manipulated data is typically still messy, with common issues such as missing values, inconsistent records or insufficient parameters. In this step, the data is made syntactically and semantically correct.
Data preparation: Also called preprocessing, this is where the cleansed data is normalized. Normalization transforms an input feature so that its values are distributed within a range acceptable to the machine learning algorithm.
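The cleansing and preparation steps above can be sketched in a few lines of plain Python. This is a toy example with invented sensor readings (None marks a missing value); real pipelines typically use libraries such as pandas, but the idea is the same.

```python
# Invented raw feature values; None marks a missing reading.
raw = [12.0, None, 15.0, 14.0, None, 20.0]

# Cleansing: impute missing values with the mean of the observed ones,
# so every record becomes usable.
observed = [x for x in raw if x is not None]
mean = sum(observed) / len(observed)
cleaned = [x if x is not None else mean for x in raw]

# Preparation: min-max normalization rescales the feature into [0, 1]
# so no single input dominates the learning algorithm.
lo, hi = min(cleaned), max(cleaned)
normalized = [(x - lo) / (hi - lo) for x in cleaned]
```

Mean imputation and min-max scaling are only one choice each; median imputation or z-score standardization may suit other datasets better.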
Machine learning is the next stage after data engineering. In data science terminology, machine learning is an artificial intelligence technique that automates learning from the processed data: advanced algorithms learn on their own from examples and work through massive volumes of data. The basic classifications of machine learning are:
Supervised learning: The algorithm learns from example data together with the associated target responses, which may be categories or numeric values. It is like human learning by example: a senior gives you a good example to teach you something, and you memorize it to derive general rules.
Unsupervised learning: The algorithm learns from example data without the associated responses. It resembles the way humans figure out that certain objects or events belong to the same class, by grouping similar things together.
Reinforcement learning: The algorithm learns from positive and negative feedback rather than from labelled examples. It makes decisions on its own, and those decisions bear consequences that feed back into the learning; in general terms, it is trial and error.
Semi-supervised learning: The training data is incomplete: some target outputs are missing, so the algorithm learns from a mix of labelled and unlabelled examples.
Models built with all of the above learning styles are compared through evaluation, sensitivity and cost analysis, and the best-performing results are selected for further action.
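The contrast between the first two learning styles can be shown with a toy sketch, standard library only; the points, labels and thresholds are invented for illustration, not a production method.

```python
# Supervised: labelled examples (value, target response).
train = [(1.0, "small"), (2.0, "small"), (8.0, "large"), (9.0, "large")]

def predict(x):
    # 1-nearest-neighbour: classify a new point by the label of the
    # closest labelled example, i.e. learning from target responses.
    return min(train, key=lambda p: abs(p[0] - x))[1]

def cluster(points, split):
    # Unsupervised: no labels at all; we only group points that look
    # alike (here, a crude split at a threshold).
    return [[x for x in points if x < split],
            [x for x in points if x >= split]]

label = predict(7.5)                       # nearest example is 8.0
groups = cluster([1.0, 2.0, 8.0, 9.0], 5.0)
```

The supervised function needs the target responses in `train`; the unsupervised one sees only the raw values and discovers the grouping itself.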
The last stage, after machine learning, translates the target responses produced by the model into action items, and these results are applied to research pipelines. The automated output is turned into results so that the best solution is available for the next step in the process.
Data science plays a very important role in making a business more data-driven. Shipment companies such as FedEx, DHL and UPS use a data science approach to find the best routes, times and modes of transport for delivering their shipments. Projects built with data science give a multiplicative return on investment, both from the refined data and from the development of the product. Organizations that want to stay competitive in the age of big data need to develop and implement data science capabilities effectively.
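The routing problem mentioned above can be illustrated with a classic shortest-path search. This is a toy sketch, not any carrier's actual method: the delivery network below is invented, edge weights stand for travel time in hours, and Dijkstra's algorithm finds the fastest route.

```python
import heapq

# Invented delivery network: node -> {neighbour: travel time in hours}.
network = {
    "Depot": {"HubA": 4, "HubB": 2},
    "HubA":  {"City": 5},
    "HubB":  {"HubA": 1, "City": 8},
    "City":  {},
}

def shortest_time(graph, start, goal):
    # Dijkstra's algorithm: repeatedly expand the cheapest frontier node.
    queue, seen = [(0, start)], set()
    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            return cost
        if node in seen:
            continue
        seen.add(node)
        for nxt, weight in graph[node].items():
            heapq.heappush(queue, (cost + weight, nxt))
    return None  # goal unreachable

best = shortest_time(network, "Depot", "City")  # Depot -> HubB -> HubA -> City
```

Real routing systems add traffic, deadlines and vehicle constraints on top, but the core idea of searching a weighted network is the same.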
Author – Zafar Ali