Machine learning and other gibberish
See also: https://sharing.leima.is
Archives: https://datumorphism.leima.is/amneumarkt/
#data

Last year a colleague introduced Marimo to me. It was surprisingly good for the first few days, but then I got annoyed by the automatic rerun of expensive computations.

Then I bailed out.

This morning I read a blog on this topic and realized there's a switch...

Time to jump in again.



https://docs.marimo.io/guides/configuration/runtime_configuration/#disable-autorun-on-cell-change-lazy-execution
#data

Do Not Use Azure. To be honest, I can't even name one good thing from Microsoft.
- By a person who is frustrated by Microsoft products, including the sh*tty Windows OS.

https://www.reddit.com/r/dataengineering/s/G9uTQmVxWC From the dataengineering community on Reddit: MS Fabric destroyed 3 months of work
#data

In physics, people claim that more is different. In the data world, more is very different. I'm no expert in big data, but I only learned about the scaling problem when I started working for large corporations.

I like the following line from the author:

> data sizes increase much faster than compute sizes.

In deep learning, many models follow a scaling law relating performance to dataset size. More data does bring better performance, but the gains flatten out quickly. Businesses don't need a perfect model, and computation costs money. At some point, we simply have to cap the dataset, even if we have all the data in the world.
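
To make the diminishing returns concrete, here is a toy sketch (not from the post): assume performance follows a power law in dataset size, as the neural scaling law papers suggest, with constants I made up.

```python
# Toy power-law scaling curve: loss(N) = floor + N ** (-alpha).
# Both constants are made up; only the shape of the curve matters here.
ALPHA = 0.095            # hypothetical scaling exponent
IRREDUCIBLE_LOSS = 1.7   # hypothetical floor no amount of data can beat

def loss(n_examples: float) -> float:
    """Toy scaling law: loss decays as a power law in dataset size."""
    return IRREDUCIBLE_LOSS + n_examples ** (-ALPHA)

for n in [1e6, 1e7, 1e8, 1e9]:
    gain = loss(n) - loss(10 * n)
    print(f"N = {n:.0e}: loss = {loss(n):.3f}, the next 10x of data only buys {gain:.3f}")
```

Every extra 10x of data buys less than the previous one, which is the whole argument for capping the dataset at some point.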

So... data hoarding is probably fine, but our models might not need that much data.

https://motherduck.com/blog/big-data-is-dead/ Big Data is Dead - MotherDuck Blog
#data

Just got my ticket.

I have been reviewing proposals for PyData this year. I saw some really cool proposals, so I finally decided to attend the conference.

https://2023.pycon.de/blog/pyconde-pydata-berlin-tickets/ PyConDE & PyData Berlin 2023 Tickets
#data

https://evidence.dev/

I like the idea. The last dashboarding tool I used for work was Streamlit. Streamlit is lightweight and fast, but it requires Python code and a running Python server.

Evidence is mostly markdown and SQL. For many lightweight dashboarding tasks, this is just sweet.

Evidence is built on Node. I could run a server to provide live updates, but I can also just build a static website by running npm run build.

Played with it a bit. Nothing to complain about at this point.
#data

If you are building a simple dashboard using Python, Streamlit is a great tool to get started with. One of the problems in the past was creating multipage apps.

To solve this problem, I created a template for multipage apps a year ago.
https://github.com/emptymalei/streamlit-multipage-template

But today, Streamlit officially introduced multipage support, and it looks great. I haven't built any dashboards for a while, but to me, Streamlit is still the go-to solution for dashboards.
https://blog.streamlit.io/introducing-multipage-apps/
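
For reference, the official mechanism is simply a pages/ directory next to the entrypoint script. A minimal sketch, with file names I made up:

```python
# Minimal multipage layout (file names are my own examples):
#
#   app.py              <- entrypoint: run with `streamlit run app.py`
#   pages/
#       1_Overview.py   <- picked up automatically and listed in the sidebar
#       2_Details.py    <- the numeric prefix only controls the ordering
#
# --- app.py ---
import streamlit as st

st.title("Home")
st.write("Every script in the pages/ folder becomes its own page in the sidebar.")

# --- pages/1_Overview.py ---
# import streamlit as st
# st.title("Overview")
```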
#data #ds

Disclaimer: I'm no expert in state diagrams or statecharts.

It might be something trivial, but I find this useful: combined with some techniques from statecharts (something frontend people like a lot), state diagrams are a great way to document what our data goes through during (pre)processing.

For complicated data transformations, we can draw the corresponding state diagram and follow our code to make sure it works as expected. The only difference is that we focus on the states of the data, not of any other system.

We can use some techniques from statecharts, such as hierarchies and parallel states.

A state diagram is better than a flowchart in this scenario because we are more interested in the different states of the data. A state diagram naturally highlights those states, so we can easily spot the relevant part of the diagram without having to trace it from the beginning.

I have already documented some data transformations using state diagrams. I haven't tried it yet, but it might also help us document our ML models.
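
As a sketch of how the diagram can pair with code (all names below are hypothetical): write the allowed transitions down once, mirroring the edges of the diagram, and assert them inside the pipeline so the code and the diagram cannot silently drift apart.

```python
# Hypothetical data states and transitions, mirroring a hand-drawn state diagram.
from enum import Enum, auto

class DataState(Enum):
    RAW = auto()
    DEDUPLICATED = auto()
    IMPUTED = auto()
    FEATURES_BUILT = auto()

# One entry per edge in the diagram.
ALLOWED_TRANSITIONS = {
    DataState.RAW: {DataState.DEDUPLICATED},
    DataState.DEDUPLICATED: {DataState.IMPUTED},
    DataState.IMPUTED: {DataState.FEATURES_BUILT},
}

def transition(current: DataState, new: DataState) -> DataState:
    """Move the data to a new state, failing loudly on any undocumented edge."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"Undocumented transition: {current.name} -> {new.name}")
    return new

state = DataState.RAW
state = transition(state, DataState.DEDUPLICATED)  # ok, matches the diagram
state = transition(state, DataState.IMPUTED)       # ok
# transition(state, DataState.RAW)                 # would raise: not in the diagram
```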


References:
1. https://statecharts.dev
2. https://en.wikipedia.org/wiki/State_diagram
 
 