Ansiya Nasar

STEPS IN DATA SCIENCE PROCESS








1. SETTING THE RESEARCH GOAL


  • Spend time on setting the research goal. Start by thinking about the what, how, and why of the project.

  • A research goal states the purpose of your assignment in a clear and focused manner.

  • The outcome should be a clear research goal, a good understanding of the context, well-defined deliverables, and a plan of action with a timetable.

  • This information is then best placed in a project charter. The length and formality may differ between projects and companies.



2. RETRIEVING DATA


  • Retrieving the required data is the second phase of the project. Many companies will have data stored in repositories like databases, data marts, data warehouses, etc. What they don't have can often be bought from third parties.

  • Much high-quality data is available for public and commercial use, in different formats such as text files or tables.

  • Data can be internal or external. Internal data is generated and used by the company and is usually private, so it can't be accessed by third parties without permission. External data is collected from outside the company and is available to the public.

  • So here, the first thing to do is verify the internal data. Most companies have programs for maintaining key data, so much of the cleaning process may already be done. This data can be stored in data repositories.

  • If the required data is not available internally, we can turn to other companies that provide this kind of data.

  • Then spend some time on data cleaning at this stage: most errors are easy to spot during the data gathering phase, and carelessness here will cost data scientists many more hours resolving data issues later.
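
As a rough illustration of checking internal data first and spotting errors while gathering, the sketch below reads rows from a database and counts values that are missing or implausible. The in-memory SQLite database, the `sales` table, and its columns are hypothetical stand-ins for a company repository, not something taken from a real project.

```python
import sqlite3


def load_internal_data(conn):
    """Fetch all rows from the (hypothetical) sales table as dictionaries."""
    conn.row_factory = sqlite3.Row
    rows = conn.execute("SELECT id, region, amount FROM sales").fetchall()
    return [dict(row) for row in rows]


def quick_quality_report(rows):
    """Count problems that are cheap to spot during data gathering."""
    report = {
        "rows": len(rows),
        "missing_region": 0,
        "missing_amount": 0,
        "negative_amount": 0,
    }
    for row in rows:
        if row["region"] is None or row["region"].strip() == "":
            report["missing_region"] += 1
        if row["amount"] is None:
            report["missing_amount"] += 1
        elif row["amount"] < 0:
            report["negative_amount"] += 1
    return report


# Build a tiny in-memory database standing in for an internal repository.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (id, region, amount) VALUES (?, ?, ?)",
    [(1, "North", 120.0), (2, None, 80.5), (3, "South", None), (4, "East", -15.0)],
)
conn.commit()

rows = load_internal_data(conn)
report = quick_quality_report(rows)
print(report)
```

A report like this does not fix anything by itself, but it shows early which columns will need attention in the preprocessing step.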


3. DATA PREPROCESSING


  • Real-world datasets contain lots of entry errors, missing values, inconsistent values, etc. Data preprocessing is the process of converting raw data into an understandable and usable form.

  • The main objective of this step is to ensure the quality of the data before applying any machine learning or data mining method.


    A few key steps involved are:

  • Data cleaning: Handling incorrect, incomplete, inconsistent, or missing values.

  • Data integration: Combining data from multiple sources.

  • Data reduction: Reducing the volume and size of the input data.

  • Data transformation: Converting data into a format that helps in building an ML model.
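
The four steps above can be sketched on a toy dataset. This is only one possible way to do each step: the records, the mean-fill strategy for missing ages, and the min-max scaling are assumptions chosen for illustration.

```python
from statistics import mean

# Hypothetical raw records from two sources that share an "id" key.
customers = [
    {"id": 1, "age": 34, "city": " london "},
    {"id": 2, "age": None, "city": "London"},
    {"id": 3, "age": 51, "city": "LEEDS"},
]
orders = [
    {"id": 1, "total": 250.0, "note": "gift"},
    {"id": 2, "total": 100.0, "note": ""},
    {"id": 3, "total": 400.0, "note": "rush"},
]

# 1. Data cleaning: fill missing ages with the mean and normalize city names.
mean_age = mean(c["age"] for c in customers if c["age"] is not None)
cleaned = [
    {
        "id": c["id"],
        "age": c["age"] if c["age"] is not None else mean_age,
        "city": c["city"].strip().title(),
    }
    for c in customers
]

# 2. Data integration: join the two sources on the shared id.
orders_by_id = {o["id"]: o for o in orders}
merged = [{**c, **orders_by_id[c["id"]]} for c in cleaned if c["id"] in orders_by_id]

# 3. Data reduction: keep only the columns needed for modeling.
reduced = [{key: row[key] for key in ("age", "city", "total")} for row in merged]


# 4. Data transformation: min-max scale numeric columns to the range [0, 1].
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]


scaled_age = min_max([row["age"] for row in reduced])
scaled_total = min_max([row["total"] for row in reduced])
prepared = [
    {"age": age, "city": row["city"], "total": total}
    for row, age, total in zip(reduced, scaled_age, scaled_total)
]
print(prepared)
```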



4. DATA EXPLORATION


  • The information becomes much easier to understand when it is shown in pictures; therefore, we mainly use graphical techniques to gain an understanding of our data and the interaction between the variables.

  • It helps to identify errors, understand the patterns within the data, and detect outliers.

  • The exploratory data analysis (EDA) phase is about uncovering hidden structure and potential issues in the data, so we should be very attentive here to avoid further issues during analysis.

  • The visualization techniques can range from simple line graphs or histograms to more complex diagrams.

  • Commonly used graphs include bar charts, line charts, histograms, and box plots, along with interactive techniques such as brushing and linking.
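
To show what a histogram and a box plot actually summarize, here is a small sketch that computes the numbers behind both without a plotting library. The sample values are made up, and the 1.5 × IQR rule for flagging outliers is the common box-plot convention rather than the only possible choice.

```python
from collections import Counter
from statistics import quantiles

values = [12, 15, 14, 13, 16, 15, 14, 13, 15, 14, 48]  # hypothetical measurements

# Histogram: count how many values fall into each bin of width 10.
bin_width = 10
histogram = Counter((v // bin_width) * bin_width for v in values)
for start in sorted(histogram):
    print(f"{start:>3}-{start + bin_width - 1:<3} {'#' * histogram[start]}")

# Box plot statistics: quartiles plus the usual 1.5 * IQR whisker limits.
q1, median, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < lower or v > upper]

print("Q1, median, Q3:", q1, median, q3)
print("outliers:", outliers)
```

In a real project the same numbers would normally be drawn with a plotting library, but computing them by hand makes it clear why a single extreme value stands out in both views.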


5. DATA MODELING


  • With clean data and good understanding, we are now ready to build models using Machine Learning.

  • First, split the dataset into a training set and a testing set. Then select a model based on the type of problem and the nature of the dataset. Three common types of models are regression, classification, and clustering models.

  • Train the selected model by fitting it to the training set.

  • Then evaluate the model on the testing set. The main goal of evaluation is to check how well the model performs on unseen data.

  • If the model performs well and meets the desired performance metrics, it's ready for deployment.
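
The split, train, and evaluate steps above can be sketched with a very small regression example. Everything here is an assumption made for illustration: the synthetic data, the 80/20 split, the hand-written least-squares fit, and the `MAX_ACCEPTABLE_MSE` threshold standing in for "the desired performance metrics".

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(model, xs, ys):
    """Average squared difference between predictions and true values."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data: y is roughly 3x + 2 with a little noise (seeded for repeatability).
rng = random.Random(0)
data = [(x, 3 * x + 2 + rng.uniform(-1, 1)) for x in range(50)]

# 1. Split into a training set (80%) and a testing set (20%).
rng.shuffle(data)
cut = int(0.8 * len(data))
train, test = data[:cut], data[cut:]

# 2. Train the selected model by fitting it to the training set only.
model = fit_line([x for x, _ in train], [y for _, y in train])

# 3. Evaluate on the testing set, which the model has not seen.
test_mse = mean_squared_error(model, [x for x, _ in test], [y for _, y in test])

# 4. Compare against the desired metric (hypothetical threshold).
MAX_ACCEPTABLE_MSE = 1.0
ready_for_deployment = test_mse <= MAX_ACCEPTABLE_MSE
print(f"slope={model[0]:.2f} intercept={model[1]:.2f} "
      f"test MSE={test_mse:.3f} ready={ready_for_deployment}")
```

The important design point is that the testing set is never used for fitting, so the error measured on it is an honest estimate of performance on unseen data.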


6. PRESENTATION AND AUTOMATION

  • We have now analyzed the data and built a well-performing model, so we are ready to present our findings to the world.
