Data Science
r/datascience
I'm new to this, so I've been wanting to know what other people use to make their work feel as smooth as butter. I've been learning a lot, and not just the industry-standard stuff, so I wanted to share the few things I've found valuable that others may want to try. The main goal of this post is to share, critique, and suggest so that we can all find the setup we like most. I'm also looking for new, up-and-coming tech, and I'm definitely not afraid to try new things!
IDE: VS Code with the Jupyter extension. What I like about it is that I can view data structures like Series/DataFrames in a table format by clicking the variable in the Jupyter: Variables pane at the bottom. I started with plain vanilla Jupyter notebooks from Anaconda, so this was pretty nice. I have seen demos suggesting JupyterLab has something like this, so if anyone has used both VS Code's notebooks and JupyterLab, your input would be appreciated. I hear good things about PyCharm and Spyder. Some people also use Google Colab, DataSpell, and Deepnote, but I don't know enough about them. I did play around with Deepnote, and it was very cool, but I didn't feel compelled to switch (and you have to pay for it!).
Tools:
A code helper: A few months back I was googling everything and I would've listed Stack Overflow. I still use it occasionally, but these days I mostly use ChatGPT and Bing AI. For current or news-based info I'll use Bing AI, since it uses live search results, and for general knowledge questions I'll use ChatGPT. ChatGPT saves conversations, so it's great for exploring a topic in depth and coming back to that conversation later. For those who have used both, maybe you know what I'm talking about and can give a better explanation of which is better for what purpose.
Software: Excel is an obvious one. For instance, if I have a huge dataset and I just want to delete columns I don't need, Ctrl+clicking to select them is easier and quicker than copy-pasting or typing out each column name as a string to pass to "df.drop()". Excel is great for quick and simple stuff. I have also been learning about what I guess I would call no- or low-code data analytics platforms, such as Alteryx, KNIME, and Orange. These tools let you run practically an entire ETL pipeline. I believe Alteryx and KNIME are the gold standard in this category, and Orange is a "lite" version of the two that is available in Anaconda. I think these are pretty cool. I personally haven't found a huge use case for them, since I've been chugging away in my notebooks with Python, but I can see the value. Would love for someone to chime in on these tools and how they compare to doing things manually in code, especially for large datasets.
Version Control: This is where I'm primarily lacking, but I know that Git with GitHub is the go-to. I don't use it myself, but I know that a ton of people do. I don't even know where to start, to be honest. I usually just create a new .ipynb file for each analysis or phase of an ETL pipeline haha. I'm also not too aware of what other innovative tools for version control exist.
Python Libraries: Besides the obvious stuff like Pandas/NumPy, Matplotlib/Seaborn, and your popular ML libraries, I've recently found out about a library called Polars. It's basically a Pandas-style DataFrame library written in Rust, and it's super powerful. Some operations that would've taken hours with Pandas took me minutes. But I've been hearing that Pandas 2.0, which will be released some time this month, is adding support for PyArrow dtypes (if I recall correctly) and that the speed is comparable to Polars. I mean these two are FAST. Another contender is DuckDB, but I think the new Pandas and Polars are still faster. I mostly use Pandas, but if there is some heavy lifting, I'll convert the dataframe to a Polars one with a quick function, run the operation in Polars, then convert back to Pandas.
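For the column-deleting case specifically, here is a rough sketch of a few Pandas ways to avoid typing every name into "df.drop()". The dataframe and its column names are made up for illustration, so treat this as one possible approach rather than the best one:

```python
import pandas as pd

# Hypothetical example data; the column names are invented for illustration.
df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer": ["a", "b", "c"],
    "notes": ["x", "y", "z"],
    "internal_flag": [0, 1, 0],
    "internal_code": [7, 8, 9],
})

# Option 1: name the columns to keep instead of the ones to drop.
kept = df[["order_id", "customer"]]

# Option 2: drop every column whose name matches a pattern.
no_internal = df.drop(columns=df.filter(regex=r"^internal_").columns)

# Option 3: drop by position, which is close to Ctrl+clicking columns in Excel.
by_position = df.drop(columns=df.columns[[2, 3]])

print(list(kept.columns))
print(list(no_internal.columns))
print(list(by_position.columns))
```

Selecting what to keep tends to be shorter when most columns are unwanted, while the pattern and position variants help when the unwanted columns share a prefix or sit next to each other.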
Anyway, those are just some things I can immediately think of. Looking forward to your suggestions! Bonus points for anything new and innovative. Cheers.
https://preview.redd.it/qj2cywt1r4oa1.png?width=1920&format=png&auto=webp&v=enabled&s=01626413e867c03a18d40309ea3a7fdd16c4064a