Data science is a concept that continues to gain popularity in mainstream media. It often comes up in discussions of AI, machine learning, data analytics, predictive analytics, and other related terms. Whether through the recommended shows on your Netflix account, the creation of digital faces that are indistinguishable from those of real human beings, or even the candidacy of a data scientist in a recent U.S. election, data science is continually revolutionizing our world.
Data science is a combination of mathematics, programming, and the scientific process. Specialized blocks of code are developed to run large amounts of data through mathematical processes to find notable trends, answer complex questions, or develop solutions to a wide range of problems. Applications for data science may vary widely, but any business, governmental agency, or other institution can use data science to find quantitatively determined opportunities for growth and efficiency.
How Data Science Answers Tough Questions
Data science begins with a question. Regardless of whether the question is curious (e.g., “Can you tell the difference between a goldendoodle puppy and a piece of fried chicken?”) or complicated (e.g., “Can I use AI to determine if cancer exists in an image from a patient?”), the goal is to create a solution that is accurate, repeatable, and timely.
Once the question has been determined, a data scientist begins a multistep process to create the necessary solution. The first step in this process is to gather a large amount of data. For some questions, data has already been collected for others to use. However, other questions require data scientists to collect data through surveys or experiments or to “scrape” data from websites when allowed.
The collected data must be made usable before any solutions can be created. A significant portion of the world’s data is unstructured. Unstructured data, such as video and audio files, is data that is not stored in a traditional database format and requires much more manipulation to become usable. Even in structured data, duplicate and other erroneous information needs to be removed.
Cleaning data often requires specific scripts to remove unnecessary values. Common programming languages used in data science to write these scripts include Python and R. Code in these languages is often run one cell at a time in environments such as Jupyter Notebook, which lets data scientists work incrementally and quickly view the data as cleaning occurs.
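As a rough illustration of what such a cleaning script might look like, the sketch below removes duplicate and erroneous rows from a small set of records. The records, field names, and the clean_rows function are all invented for this example, and it uses only Python's standard library; a real project would likely rely on a dedicated data library instead.

```python
# Hypothetical survey records; the field names and values are invented.
raw_rows = [
    {"respondent": "A01", "visits_per_month": "4"},
    {"respondent": "A02", "visits_per_month": "12"},
    {"respondent": "A01", "visits_per_month": "4"},      # exact duplicate
    {"respondent": "A03", "visits_per_month": ""},       # missing value
    {"respondent": "A04", "visits_per_month": "-3"},     # impossible value
    {"respondent": "A05", "visits_per_month": "seven"},  # not a number
]


def clean_rows(rows):
    """Drop duplicate rows and rows whose visit count is missing or invalid."""
    seen = set()
    cleaned = []
    for row in rows:
        key = (row["respondent"], row["visits_per_month"])
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        try:
            visits = int(row["visits_per_month"])
        except ValueError:
            continue  # skip blank or non-numeric entries
        if visits < 0:
            continue  # a negative count cannot be right
        cleaned.append(
            {"respondent": row["respondent"], "visits_per_month": visits}
        )
    return cleaned


cleaned_rows = clean_rows(raw_rows)
for row in cleaned_rows:
    print(row)
```

Running this in a notebook cell and printing the result after each change is one way to check the data as the cleaning proceeds.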
I Have Data—What’s Next?
After the data has been collected and cleaned, data scientists begin exploring it for any noticeable trends through visualization. Data visualizations such as graphs can be created directly within the data scientist’s programming environment. These visualizations give data scientists the initial leads on how to build a solution for the original question. For example, if a data scientist at an ice cream company were asked in which month the most ice cream was sold, a line chart of ice cream sales over the last few years might show that July had the highest sales volume. Data scientists may also develop their data visualizations in dedicated software such as Tableau or Microsoft Power BI because these applications allow users to interact with data dynamically in a much more user-friendly way.
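The ice cream example can be sketched in a few lines of Python. The sales figures below are invented, and the sketch prints a simple text chart using only the standard library; in practice a plotting library would draw the line chart described above.

```python
from collections import defaultdict

# Hypothetical (year, month, pints sold) records, invented for illustration.
sales = [
    (2022, "May", 310), (2022, "Jun", 450), (2022, "Jul", 520), (2022, "Aug", 480),
    (2023, "May", 330), (2023, "Jun", 470), (2023, "Jul", 560), (2023, "Aug", 500),
]

# Add up the pints sold in each month across all years.
monthly_totals = defaultdict(int)
for _year, month, pints in sales:
    monthly_totals[month] += pints

# Pick the month with the largest total.
best_month = max(monthly_totals, key=monthly_totals.get)

# Text chart: one "#" for every 100 pints sold.
for month, total in monthly_totals.items():
    print(f"{month:<4}{'#' * (total // 100)} {total}")
print("Highest sales:", best_month)
```

Even a crude chart like this makes the July peak in the made-up data easy to spot, which is the kind of initial lead a visualization is meant to provide.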
Depending on the question, the data scientist may discover the necessary solution once the data visualizations have been made. However, complex questions often require more thorough analyses. If the ice cream company had instead asked, “Why does mint chocolate chip sell more than vanilla?” there could be several factors involved in why this would occur. An even more complicated question, such as “Can we predict which flavor will sell the most next year?” is often the starting point for many data science projects.
To answer these questions, data scientists can use Python and R to create new data, find how different factors interact with each other, or apply specific mathematical procedures (or algorithms) to the data. By using these algorithms, the data scientist can build scripts that allow the computer to “learn” from the data in a way that reveals useful insights (a process known as machine learning). Ultimately, data scientists can answer these complex questions by producing accurate forecasts, building AI systems, or developing other solutions built on the machine learning process.
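To make the idea of “learning” from data a little more concrete, here is a minimal sketch of one of the simplest such algorithms: fitting a straight line to past sales by least squares and extending it one year ahead. The yearly figures and the fit_line function are invented for this example, and real forecasting work would use richer data and more careful methods.

```python
# Hypothetical yearly sales (thousands of pints) for one flavor; invented.
years = [2019, 2020, 2021, 2022, 2023]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]


def fit_line(xs, ys):
    """Return the slope and intercept of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (x with y).
    cross_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sum of squared deviations of x from its mean.
    square_dev = sum((x - mean_x) ** 2 for x in xs)
    slope = cross_dev / square_dev
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(years, sales)

# Extend the fitted line one year past the data.
forecast_2024 = slope * 2024 + intercept
print(f"Forecast for 2024: {forecast_2024:.1f} thousand pints")
```

The line's slope and intercept are the values the computer works out from the data rather than being told; more sophisticated machine learning methods follow the same pattern with many more such values.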
Data Science for Any Information Professional
While some information services have started offering more robust analytics, any information professional can freely harness the power of data science for their specific use case. Questions an information professional might ask include, “What resources are used across our institution the most?” or “Can I create an AI-driven chatbot to help users navigate our website?” The solutions that information professionals create are limited only by the imagination (and, of course, the data).
If you would like to begin your data science journey, consider using both paid courses and free online media. Sites such as Mendeley Data and the Registry of Open Data on AWS contain free curated datasets for practicing data science concepts. Python, R, and Jupyter Notebook are also open source, meaning these coding languages and environments are free both to use and to change in whatever way you need. Data scientists often share their project code on GitHub or troubleshoot with other data scientists across a range of online forums.
Once you have completed a few projects, you can volunteer your newfound skills through social impact communities, such as Data Science for Social Good. You can even check out competitions on Kaggle, a site that allows data scientists to compete, sometimes even for cash prizes.
Here are some books, YouTube channels, and websites for those looking to learn more about data science.
Books
Automate the Boring Stuff With Python: Practical Programming for Total Beginners by Al Sweigart
An Introduction to Statistical Learning With Applications in R, Second Edition by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
Storytelling With Data by Cole Nussbaumer Knaflic
The Signal and the Noise: Why So Many Predictions Fail—But Some Don’t by Nate Silver
Algorithms of Oppression: How Search Engines Reinforce Racism by Safiya Umoja Noble
Race After Technology: Abolitionist Tools for the New Jim Code by Ruha Benjamin
Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor by Virginia Eubanks
Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy by Cathy O’Neil
YouTube Channels
3Blue1Brown (complex mathematics)
Guy in a Cube (Microsoft Power BI software)
Websites
Towards Data Science (a Medium publication featuring concepts, ideas, and codes)
Data Science Central (a community for data science practitioners)