Machine learning and other gibberish
See also: https://sharing.leima.is
Archives: https://datumorphism.leima.is/amneumarkt/
#data

Last year a colleague introduced Marimo to me. It was surprisingly good for the first few days, but then I got annoyed by the automatic rerun of expensive computations.

Then I bailed out.

This morning I read a blog on this topic and realized there's a switch...

Time to jump in again.



https://docs.marimo.io/guides/configuration/runtime_configuration/#disable-autorun-on-cell-change-lazy-execution
#data

Do Not Use Azure. To be honest, I can't even name one good thing from Microsoft.
- By a person who is frustrated by Microsoft products, including the sh*tty Windows OS.

https://www.reddit.com/r/dataengineering/s/G9uTQmVxWC From the dataengineering community on Reddit: MS Fabric destroyed 3 months of work
#data

In physics, people claim that more is different. In the data world, more is very different. I'm no expert in big data, but I only learned about the scaling problem when I started working for large corporations.

I like the following line from the author:

> data sizes increase much faster than compute sizes.

In deep learning, many models follow a scaling law relating performance to dataset size. More data does bring better performance, but the gains flatten out quickly. Businesses don't need a perfect model, and computation costs money. At some point, we simply have to cap the dataset, even if we have all the data in the world.
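
To make the diminishing returns concrete, here is a toy sketch (not from the post): assume performance follows a power law in dataset size, as the neural scaling law papers suggest, with constants I made up.

```python
# Toy power-law scaling curve: loss(N) = floor + N ** (-alpha).
# Both constants are made up; only the shape of the curve matters here.
ALPHA = 0.095            # hypothetical scaling exponent
IRREDUCIBLE_LOSS = 1.7   # hypothetical floor no amount of data can beat

def loss(n_examples: float) -> float:
    """Toy scaling law: loss decays as a power law in dataset size."""
    return IRREDUCIBLE_LOSS + n_examples ** (-ALPHA)

for n in [1e6, 1e7, 1e8, 1e9]:
    gain = loss(n) - loss(10 * n)
    print(f"N = {n:.0e}: loss = {loss(n):.3f}, the next 10x of data only buys {gain:.3f}")
```

Every extra 10x of data buys less than the previous one, which is the whole argument for capping the dataset at some point.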

So... data hoarding is probably fine, but our models might not need that much data.

https://motherduck.com/blog/big-data-is-dead/ Big Data is Dead - MotherDuck Blog
#data

Just got my ticket.

I have been reviewing proposals for PyData this year. I saw some really cool proposals, so I finally decided to attend the conference.

https://2023.pycon.de/blog/pyconde-pydata-berlin-tickets/ PyConDE & PyData Berlin 2023 Tickets
#data

https://evidence.dev/

I like the idea. The last dashboarding tool I used for work was Streamlit. Streamlit is lightweight and fast, but it requires Python code and a running Python server.

Evidence is mostly markdown and SQL. For many lightweight dashboarding tasks, this is just sweet.

Evidence is built on Node. I could run a server to provide live updates, but I can also just build a static website by running npm run build.

Played with it a bit. Nothing to complain about at this point.
#data

If you are building a simple dashboard using Python, Streamlit is a great tool to get started with. One of the problems in the past was creating multipage apps.

To solve this problem, I created a template for multipage apps a year ago.
https://github.com/emptymalei/streamlit-multipage-template

But today, Streamlit officially introduced multipage support, and it looks great. I haven't built any dashboards for a while, but to me, Streamlit is still the go-to solution for dashboards.
https://blog.streamlit.io/introducing-multipage-apps/
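
For reference, the official mechanism is simply a pages/ directory next to the entrypoint script. A minimal sketch, with file names I made up:

```python
# Minimal multipage layout (file names are my own examples):
#
#   app.py              <- entrypoint: run with `streamlit run app.py`
#   pages/
#       1_Overview.py   <- picked up automatically and listed in the sidebar
#       2_Details.py    <- the numeric prefix only controls the ordering
#
# --- app.py ---
import streamlit as st

st.title("Home")
st.write("Every script in the pages/ folder becomes its own page in the sidebar.")

# --- pages/1_Overview.py ---
# import streamlit as st
# st.title("Overview")
```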
#data #ds

Disclaimer: I'm no expert in state diagrams or statecharts.

It might be something trivial, but I find this useful: combined with some techniques from statecharts (something frontend people like a lot), state diagrams are a great way to document what our data goes through during (pre)processing.

For complicated data transformations, we can draw the corresponding state diagram and follow our code to make sure it works as expected. The only difference is that we focus on the states of the data, not of any other system.

We can use some techniques from statecharts, such as hierarchies and parallel states.

A state diagram is better than a flowchart in this scenario because we are more interested in the different states of the data. A state diagram naturally highlights those states, so we can easily spot the relevant part of the diagram without having to trace it from the beginning.

I have already documented some data transformations using state diagrams. I haven't tried it yet, but it might also help us document our ML models.
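
As a sketch of how the diagram can pair with code (all names below are hypothetical): write the allowed transitions down once, mirroring the edges of the diagram, and assert them inside the pipeline so the code and the diagram cannot silently drift apart.

```python
# Hypothetical data states and transitions, mirroring a hand-drawn state diagram.
from enum import Enum, auto

class DataState(Enum):
    RAW = auto()
    DEDUPLICATED = auto()
    IMPUTED = auto()
    FEATURES_BUILT = auto()

# One entry per edge in the diagram.
ALLOWED_TRANSITIONS = {
    DataState.RAW: {DataState.DEDUPLICATED},
    DataState.DEDUPLICATED: {DataState.IMPUTED},
    DataState.IMPUTED: {DataState.FEATURES_BUILT},
}

def transition(current: DataState, new: DataState) -> DataState:
    """Move the data to a new state, failing loudly on any undocumented edge."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"Undocumented transition: {current.name} -> {new.name}")
    return new

state = DataState.RAW
state = transition(state, DataState.DEDUPLICATED)  # ok, matches the diagram
state = transition(state, DataState.IMPUTED)       # ok
# transition(state, DataState.RAW)                 # would raise: not in the diagram
```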


References:
1. https://statecharts.dev
2. https://en.wikipedia.org/wiki/State_diagram
 
 