Machine learning and other gibberish
See also: https://sharing.leima.is
Archives: https://datumorphism.leima.is/amneumarkt/
Deepnote supports Great Expectations (GE) now.
I ran their template notebook:
https://deepnote.com/project/Reduce-Pipeline-Debt-With-Great-Expectations-mLT9DFCQSpW4kUBAzzdhBw/%2Fnotebook.ipynb/#00000-e170fae0-7e06-4a7a-85f3-343584ec4b94
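For context, here is roughly what a GE expectation looks like on a pandas DataFrame. This is just a minimal sketch with made-up data, and the exact API depends on the GE version:

```python
import great_expectations as ge
import pandas as pd

df = pd.DataFrame({"age": [25, 31, 42], "email": ["a@x.com", "b@y.com", None]})

# Wrap the DataFrame so it gains expect_* validation methods (classic GE pandas API).
gdf = ge.from_pandas(df)
result = gdf.expect_column_values_to_not_be_null("email")
print(result.success)  # False: one email is missing
```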
#data #ds
Disclaimer: I'm no expert in state diagrams or statecharts.
It might be trivial, but I find this useful: combined with some techniques from statecharts (something frontend people like a lot), state diagrams are a great way to document what our data goes through during (pre)processing.
For complicated data transformations, we can draw the corresponding state diagram and follow the code to make sure it works as expected. The only difference is that we focus on the state of the data, not of any other system.
We can borrow techniques from statecharts, such as hierarchical (nested) states and parallel states.
A state diagram is better than a flowchart in this scenario because we are more interested in the different states of the data. A state diagram naturally highlights those states, so we can easily spot the relevant part of the diagram without having to trace it from the beginning.
I have already documented some data transformations using state diagrams. I haven't tried it yet, but it might also help us document our ML models.
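As a minimal sketch of what I mean (the state names here are hypothetical, not from a real pipeline), we can mirror the documented state diagram with an explicit state machine in code and fail loudly on any transition the diagram does not allow:

```python
# Documented states of the dataset and the transitions the diagram allows.
ALLOWED_TRANSITIONS = {
    "raw": {"validated"},
    "validated": {"deduplicated"},
    "deduplicated": {"features"},
    "features": set(),  # terminal state
}

def transition(current: str, target: str) -> str:
    """Move the data to the next documented state, or fail loudly."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Undocumented transition: {current} -> {target}")
    return target

state = "raw"
state = transition(state, "validated")     # e.g. after schema checks
state = transition(state, "deduplicated")  # e.g. after dropping duplicate rows
state = transition(state, "features")      # e.g. after feature engineering
```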
References:
1. https://statecharts.dev
2. https://en.wikipedia.org/wiki/State_diagram
#DS #visualization
https://percival.ink/
A new lightweight language for data analysis and visualization. It looks promising.
I hate Jupyter notebooks and don't use them in most of my projects. One of the reasons is low reproducibility due to their non-reactive nature: if you change an old cell and forget to rerun a cell below it, you may read stale results.
This new language is reactive: when an upstream cell changes, the results that depend on it are updated as well.
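To illustrate what reactivity means here (a toy Python sketch of the idea, not how Percival is actually implemented), think of cells as nodes in a dependency graph that are re-evaluated whenever an upstream cell changes:

```python
# Each "cell" is a function of the results of earlier cells;
# changing a cell re-runs everything downstream of it.
cells = {
    "raw":   {"fn": lambda env: [1, 2, 3, 4]},
    "total": {"fn": lambda env: sum(env["raw"])},
    "mean":  {"fn": lambda env: env["total"] / len(env["raw"])},
}

def run_all(cells):
    env = {}
    for name, cell in cells.items():  # cells are listed in dependency order
        env[name] = cell["fn"](env)
    return env

print(run_all(cells)["mean"])  # 2.5

# A reactive runtime re-evaluates the dependents automatically after this change:
cells["raw"]["fn"] = lambda env: [10, 20, 30]
print(run_all(cells)["mean"])  # 20.0
```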
#DS
Just in case you are also struggling with Python packages on Apple M1 Macs:
I am using the third option: Anaconda + miniforge.
https://www.anaconda.com/blog/apple-silicon-transition
#DS #Visualization
Okay, I'll tell you the reason I wrote this post. It is because xkcd made [this](https://xkcd.com/2537/).
---
Choosing proper colormaps for our visualizations is important. It's almost like shooting a photo with your phone: some phones capture details in every corner, while others give overexposed photos with no detail left in the bright regions.
A proper colormap should make sure we see the details we need to see. To show why colormaps matter, we use the two examples shown on the colorcet website[^colorcet]. The two colormaps, "hot" and "fire", come from matplotlib and colorcet, respectively.
I cannot post multiple images in one message, so please see the full post for the comparison of the two colormaps. Really, it is striking. Find the link below:
https://github.com/kausalflow/community/discussions/20
It is clear that "hot" introduces some overexposure. The other colormap, "fire", is a so-called perceptually uniform colormap. More experiments can be found in the colorcet documentation[^colorcet-github]. Glasbey et al. showed some examples of inspecting different properties using different colormaps[^Glasbey2007].
One way to make sure a colormap shows enough detail is to use perceptually uniform colormaps[^Kovesi2015]. Kovesi also provides a method to validate whether a colormap has uniform perceptual contrast[^Kovesi2015].
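If you want to try the comparison yourself, a minimal matplotlib sketch looks roughly like this (assuming colorcet is installed; I use a simple synthetic gradient instead of the test image from the colorcet docs):

```python
import numpy as np
import matplotlib.pyplot as plt
import colorcet as cc  # provides the perceptually uniform "fire" colormap

# A smooth ramp: any detail that disappears is lost by the colormap, not the data.
x, y = np.meshgrid(np.linspace(0, 1, 256), np.linspace(0, 1, 256))
img = x * y

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].imshow(img, cmap="hot")          # matplotlib's "hot"
axes[0].set_title("matplotlib hot")
axes[1].imshow(img, cmap=cc.cm["fire"])  # colorcet's "fire"
axes[1].set_title("colorcet fire")
plt.show()
```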
---
References and links mentioned in this post:
[^colorcet]: Anaconda. colorcet 1.0.0 documentation. [cited 12 Nov 2021]. Available: https://colorcet.holoviz.org/
[^colorcet-github]: holoviz. colorcet/index.ipynb at master · holoviz/colorcet. In: GitHub [Internet]. [cited 12 Nov 2021]. Available: https://github.com/holoviz/colorcet/blob/master/examples/index.ipynb
[^Kovesi2015]: Kovesi P. Good Colour Maps: How to Design Them. arXiv [cs.GR]. 2015. Available: http://arxiv.org/abs/1509.03700
[^Glasbey2007]: Glasbey C, van der Heijden G, Toh VFK, Gray A. Colour displays for categorical images. Color Research & Application. 2007. pp. 304–309. doi:10.1002/col.20327
[^matplotlib-colormaps]: Choosing Colormaps in Matplotlib — Matplotlib 3.4.3 documentation. [cited 12 Nov 2021]. Available: https://matplotlib.org/stable/tutorials/colors/colormaps.html
#DS #news
This is a post about Zillow's Zestimate model.
Zillow (https://zillow.com/) is an online real-estate marketplace and a big player. But last week, Zillow withdrew from the house-flipping market and planned to lay off a handful of employees.
There are rumors indicating that this action is related to their machine-learning-based price estimation tool, Zestimate (https://www.zillow.com/z/zestimate/).
At first glance, Zestimate seems fine. Though the metrics shown on the website may not be that convincing, I am sure they've benchmarked more metrics than the ones published there.
There are some discussions on Reddit.
Anyway, this is not the best story for data scientists.
1. News: https://www.reddit.com/r/MachineLearning/comments/qlilnf/n_zillows_nnbased_zestimate_leads_to_massive/
2. This is Zestimate: https://www.zillow.com/z/zestimate/
3. https://www.wired.com/story/zillow-ibuyer-real-estate/
Microsoft created two repositories for machine learning and data science beginners. They include many sketches; I love this style.
https://github.com/microsoft/Data-Science-For-Beginners
https://github.com/microsoft/ML-For-Beginners
#DS
Cute comics on interactive data visualization
https://hdsr.mitpress.mit.edu/pub/49opxv6v/release/1
#DS
Hullman J, Gelman A. Designing for interactive exploratory data analysis requires theories of graphical inference. Harvard Data Science Review. 2021. doi:10.1162/99608f92.3ab8a587
https://hdsr.mitpress.mit.edu/pub/w075glo6/release/2
Creating visualizations seems to be a creative task. At least for entry-level visualization tasks, we follow our hearts and build whatever is needed. However, visualizations are made for different purposes. Some visualizations are purely exploratory, there for us to get a feeling for the data. Others are built to validate hypotheses. These are very different things.
Confirmation of an idea using charts is usually hard. In most cases, we need statistical tests to (dis)prove a hypothesis instead of just looking at the charts. Thus, visualizations become a tool to help us formulate a good question.
However, not everyone uses charts as hints only; many use charts to draw conclusions. As a result, even experienced analysts end up with spurious conclusions, and these so-called insights are not very solid.
Visual analysis seems to be an adversarial game between humans and visualizations. There are many different models of this process. A crude, probably naive model can be illustrated with the example of analyzing the histogram of a variable.
The histogram looks like a bell. It is symmetric, centered at 10, with an FWHM of 2.6 (for a Gaussian, FWHM ≈ 2.355 σ, so σ ≈ 1.1). I guess this is a Gaussian distribution with mean 10 and sigma of roughly 1. This is the posterior p(model | chart).
Imagine overlaying the curve we just guessed on top of the original histogram. Would my guess and the actual curve overlap?
If not, what do we have to adjust? Do we need to introduce another parameter?
Guess the parameters of the new distribution model and compare it with the actual curve again.
The above process is very similar to repeated Bayesian inference, though the actual analysis may be much more complicated, as analysts carry a lot of prior knowledge about the data-generating process.
Through this example, we see that integrating exploration with preliminary model building, as in confirmatory data analysis, may give us more confidence when drawing insights from charts.
On the other hand, including complicated statistical models can lead to misinterpretation, since not everyone is familiar with statistical hypothesis testing. So the complexity has to be balanced.
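To make the histogram example concrete, here is a rough Python sketch of one round of this guess-and-overlay loop (the data and the guessed parameters are made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=1.1, size=5000)  # stand-in for the observed data

# Read the guess off the chart: center 10, FWHM 2.6 => sigma = FWHM / (2*sqrt(2*ln 2))
mu_guess = 10.0
sigma_guess = 2.6 / (2 * np.sqrt(2 * np.log(2)))  # about 1.1

x = np.linspace(6, 14, 400)
pdf_guess = np.exp(-((x - mu_guess) ** 2) / (2 * sigma_guess**2)) / (
    sigma_guess * np.sqrt(2 * np.pi)
)

plt.hist(sample, bins=50, density=True, alpha=0.5, label="observed histogram")
plt.plot(x, pdf_guess, label=f"guessed Gaussian (mu=10, sigma~{sigma_guess:.2f})")
plt.legend()
plt.show()  # if the curve and the histogram disagree, adjust the guess and repeat
```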
#DS
This is an interesting report by Anaconda. From it we can more or less confirm that Python is still the king of data science languages, with SQL right behind it.
Quote from the report:
> Between March 2020 to February 2021, the pandemic economic period, we saw 4.6 billion package downloads, a 48% increase from the previous year.
We have no data for other languages, so no comparisons can be made, but it is interesting to see Python growing so fast.
The roadblocks different data professionals face are quite different. Cloud engineers and MLOps engineers do not mention a skills gap in their organization as often, but data scientists/analysts mention skills gaps (e.g., data engineering, Docker, k8s) a lot. This might be related to cases where the organization doesn't even have cloud engineers/ops or MLOps.
See the next message for the PDF file.
https://www.anaconda.com/state-of-data-science-2021
#DS
A library for interactive visualization directly from pandas.
https://github.com/santosjorge/cufflinks
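A minimal usage sketch, assuming cufflinks and plotly are installed and you are in a Jupyter notebook (the data is random, just for illustration):

```python
import cufflinks as cf
import numpy as np
import pandas as pd

cf.go_offline()  # render plotly figures locally, no Chart Studio account needed

df = pd.DataFrame(np.random.randn(100, 3), columns=["a", "b", "c"]).cumsum()
df.iplot(kind="line", title="Random walks")  # interactive plotly chart straight from the DataFrame
```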
#DS
This paper serves as a good introduction to declarative data analytics tools.
Declarative analytics performs data analysis using a declarative syntax instead of functions for specific algorithms. Using declarative syntax, one can “describe what you want the program to achieve rather than how to achieve it”.
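To make that distinction concrete with a familiar tool (my own toy example, not one from the paper), compare computing a per-group mean by hand with letting pandas decide how to compute it:

```python
import pandas as pd

df = pd.DataFrame({"city": ["A", "A", "B", "B"], "price": [10, 14, 7, 9]})

# Imperative: spell out *how* to compute the per-city mean.
sums, counts = {}, {}
for _, row in df.iterrows():
    sums[row["city"]] = sums.get(row["city"], 0) + row["price"]
    counts[row["city"]] = counts.get(row["city"], 0) + 1
imperative_means = {city: sums[city] / counts[city] for city in sums}

# Declarative: state *what* we want; the engine decides how.
declarative_means = df.groupby("city")["price"].mean()

print(imperative_means)             # {'A': 12.0, 'B': 8.0}
print(declarative_means.to_dict())  # {'A': 12.0, 'B': 8.0}
```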
To be declarative, the language has to be specific to the tasks, which means we can only turn the knobs of some predefined model. To me, this is a deal-breaker.
Anyway, this paper is still a good read.
Makrynioti N, Vassalos V. Declarative Data Analytics: A Survey. IEEE Trans Knowl Data Eng. 2021;33: 2392–2411. doi:10.1109/TKDE.2019.2958084
http://dx.doi.org/10.1109/TKDE.2019.2958084
#DS
https://octo.github.com/projects/flat-data
Hmmm, so they gave it a name.
I've built many projects using this approach. I started building such data repos with CI/CD services well before GitHub Actions was born; of course, GitHub Actions made it much easier.
One of them is the EU COVID data tracking project (https://github.com/covid19-eu-zh/covid19-eu-data). It's been running for more than a year with very little maintenance, and some COVID projects even copied our setup.
I actually built a system (https://dataherb.github.io) to pull such GitHub-Actions-based data scraping repos together.
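For anyone curious, a scheduled job in these repos boils down to something like the sketch below (the URL and paths are placeholders, not my actual setup): fetch a public file, stamp it with the run date, and let the CI workflow commit the result.

```python
import datetime
import pathlib
import urllib.request

SOURCE_URL = "https://example.org/daily-cases.csv"  # placeholder data source
OUT_DIR = pathlib.Path("dataset")

def fetch_snapshot() -> pathlib.Path:
    """Download today's copy of the source file into the repo."""
    OUT_DIR.mkdir(exist_ok=True)
    out_path = OUT_DIR / f"cases_{datetime.date.today().isoformat()}.csv"
    with urllib.request.urlopen(SOURCE_URL) as response:
        out_path.write_bytes(response.read())
    return out_path

if __name__ == "__main__":
    print(f"Saved {fetch_snapshot()}")  # a scheduled GitHub Actions workflow commits this file
```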
#career #DS
I believe this article is relevant.
Most data scientists have very good academic records. These experiences of excellence compete with another quality required in industry: the ability to survive in a less-than-ideal yet competitive environment.
We could be stubborn and look for an environment we fit well in, or adapt and follow the business playbook. Either way is good for us as long as we find a path that we love.
(I have a joke about this article: to reason productively, we do not need references for our claims at all.)
https://hbr.org/1991/05/teaching-smart-people-how-to-learn#
#DS #EDA #Visualization
If you are keen on data visualization, the new Observable Plot is something exciting for you.
Observable Plot is based on D3 but is easier to use in Observable notebooks. It also follows the layered grammar of graphics (e.g., marks, scales, transforms, facets).
https://observablehq.com/@observablehq/plot