r/datascience
[N] Mozilla launched a responsible AI challenge and I'm stoked about it

Who's applying, and what are you planning to build? https://www.axios.com/2023/03/15/mozilla-responsible-ai-challenge


I'm new to this, so I've been wanting to know what other people use to make their work feel as smooth as butter. Since I've been learning a lot, and not just the industry-standard stuff, I wanted to share what little I've found to be valuable, which others may want to try. The main goal of this post is to share, critique, and trade suggestions so that we can all find the setup we like most. I'm also on the lookout for new, up-and-coming tech, and definitely not afraid to try new things!

IDE: VSCode with the Jupyter Notebook extension. What I like about it is that I can view data structures like Series/DataFrames in a table format by clicking the variable in the Jupyter: Variables pane at the bottom. I started with plain-vanilla Jupyter notebooks from Anaconda, so this was pretty nice. I've seen demos suggesting JupyterLab has something similar, so if anyone has used both VSCode's notebooks and JupyterLab, your input would be appreciated. I hear good things about PyCharm and Spyder. Some people also use Google Colab, DataSpell, and DeepNote, but I don't know enough about them. I did play around with DeepNote, and it was very cool, but I didn't feel compelled to switch (and you have to pay for it!).

Tools:

  • A code helper: A few months back I was googling everything, and I would've listed Stack Overflow here. I still use it occasionally, but these days I use ChatGPT and Bing AI. For current info or news-based questions I'll use Bing AI, since it uses live search results, and for knowledge-based questions I'll use ChatGPT. ChatGPT saves conversations, so it's great for exploring topics in depth and referencing that conversation later. For those who have used both, maybe you know what I'm talking about and can provide a better explanation of which is better for what purpose.

  • Software: Excel is an obvious one. For instance, if I have a huge dataset and I just want to delete columns I don't need, Ctrl+click to select is easier and quicker than copy-pasting or typing out each of the column names I want to "df.drop()". Excel is great for quick and simple stuff. Some software I've been learning about falls into what I'd call no- or low-code data analytics platforms, such as Alteryx, KNIME, and Orange. These tools let you practically run an entire ETL pipeline. I believe Alteryx and KNIME are the gold standard in this category, and Orange is a "lite" version of the two that's available in Anaconda. I think these are pretty cool, and while I personally haven't found a huge use case for them since I've been chugging away in my notebooks with Python, I can see the value. Would love for someone to chime in on these tools and how they compare to doing things manually in code, especially for large datasets.
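For the code side of that comparison, dropping columns by name in Pandas looks like this (the column names here are just toy examples of mine):

```python
import pandas as pd

# Toy dataframe; the column names are purely for illustration.
df = pd.DataFrame({
    "keep_me": [1, 2, 3],
    "unused_a": [4, 5, 6],
    "unused_b": [7, 8, 9],
})

# Drop the columns you don't need by listing their names...
trimmed = df.drop(columns=["unused_a", "unused_b"])

# ...or select a short whitelist instead, which often means less typing.
trimmed_too = df[["keep_me"]]

print(list(trimmed.columns))  # ['keep_me']
```

Whether this or Ctrl+click in Excel is faster mostly comes down to how many columns you're keeping versus dropping.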

  • Version control: This is where I'm primarily lacking, but I know that GitHub is the go-to. I don't use it, but I know that a ton of people do. I don't even know where to start, to be honest. I usually just create a new .ipynb file for each analysis or phase of an ETL pipeline, haha. I'm also not aware of what other innovative tools for version control exist.

  • Python libraries: Besides the obvious stuff like Pandas/NumPy, Matplotlib/Seaborn, and your popular ML libraries, I recently found out about a library called Polars. It's basically a Pandas alternative written in Rust, and it's super powerful. Some operations I've run that would've taken hours with Pandas took me minutes. But I've been hearing that Pandas 2.0, due out sometime this month, adds optional PyArrow-backed dtypes (if I recall correctly), and the speed is comparable to Polars. I mean, these two are FAST. Another contender is DuckDB, but I think the new Pandas and Polars are still faster. I mostly use Pandas, but if there's some heavy lifting, I'll swap the dataframe to a Polars one with a quick function, run it with Polars, then convert back to Pandas.
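A minimal sketch of that swap pattern, assuming Polars is installed for the fast path (the group-by here is just a stand-in for whatever heavy operation you need; it falls back to plain Pandas if Polars is missing):

```python
import pandas as pd

try:
    import polars as pl  # optional dependency; assumed installed for the fast path
except ImportError:
    pl = None

def heavy_groupby(pdf: pd.DataFrame) -> pd.DataFrame:
    """Sum `val` per `key`, using Polars for the heavy lifting when available."""
    if pl is not None:
        ldf = pl.from_pandas(pdf)
        # Older Polars releases call this `groupby`; newer ones use `group_by`.
        group = getattr(ldf, "group_by", None) or getattr(ldf, "groupby")
        out = group("key").agg(pl.col("val").sum())
        return out.to_pandas().sort_values("key", ignore_index=True)
    # Plain-Pandas fallback with identical semantics.
    return pdf.groupby("key", as_index=False)["val"].sum()

df = pd.DataFrame({"key": ["a", "b", "a"], "val": [1, 2, 3]})
print(heavy_groupby(df))
```

The conversion round-trip has a cost, so this only pays off when the operation itself dominates the runtime.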

Anyway, that's just some things I can immediately think of. Looking forward to your suggestions! Bonus points for anything new and innovative. Cheers.


About Community

A place for data science practitioners and professionals to discuss and debate data science career questions.
Created Aug 6, 2011
