Astronomy & Music: January 2025

Thursday, January 30, 2025

Data For Foundation Models

It turns out that while I do retain ownership of the transcripts of my YouTube livestreams, there's nothing I can do with them without causing trouble. This is from YT's TOS:

So -- onward. I didn't expect that to go very far. I'll keep the listings up on Etsy for a while, but I'll soon be able to curate labeled datasets. Text is still an excellent data source, but harvesting and labeling it is challenging.

The next kind of data that can be fairly easily harvested and curated is labeled images. This is what I've turned my attention to. The task at hand is not only to do the harvesting, but also the post-processing that will package the data up with excellent labeling.

The labeling will initially consist of the usual statistics on the data (min, max, mean, median, mode, standard deviation), and also an 'information content' parameter (entropy). There will also be text labels that describe the image data in English.
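A minimal sketch of how those numeric labels could be computed, assuming the image has already been flattened into a plain list of integer pixel values (one channel). The function names and the toy data are hypothetical and only for illustration; this is not the actual pipeline. Entropy here is the Shannon entropy of the pixel-value histogram, in bits.

```python
import math
import statistics
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def label_pixels(pixels):
    """Summary statistics plus entropy for a flat list of pixel values."""
    return {
        "min": min(pixels),
        "max": max(pixels),
        "mean": statistics.fmean(pixels),
        "median": statistics.median(pixels),
        "mode": statistics.mode(pixels),
        "std": statistics.pstdev(pixels),  # population standard deviation
        "entropy_bits": shannon_entropy(pixels),
    }


# Toy 8-pixel "image" (hypothetical data, not taken from any real frame)
toy = [0, 0, 0, 0, 255, 255, 128, 128]
labels = label_pixels(toy)
print(labels)
```

A perfectly uniform image scores 0 bits, and an 8-bit channel tops out at 8 bits when all 256 values are equally common.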

Initially I'll package up labeled data into datasets which will be available for purchase on my website (yes, I'm off of Etsy). I can see a time pretty soon when there will be 'made-to-order' datasets based on various parameters set by the client. For example, a dataset that contains high-quality images of various red balls.

----------------

This plot is the 'information content' (the y-axis in units of bits) of the individual frames of a 66.31-second (x-axis) video of a partly cloudy sky. The color of the lines indicates the most common color in the image. That's a very nice sky blue, and the frames contain more information when there's a cloud (like near the beginning). An interesting rabbit hole, but this is one of the ways I'll 'label' this data -- by information content:

Here's a single image about mid-way through the video.

(gotta *love* that Arizona sky!!!)

The labels for this set of images look pretty good:

cumulus_clouds, azure_blue_sky, daytime_outdoor, white_cloud_formations, scattered_cloud_pattern, high_contrast_sky, natural_lighting, wispy_cloud_texture, atmospheric_scene, clear_weather_conditions, cloud_movement_visible, bright_sunlit_clouds, cirrus_cloud_wisps, deep_blue_atmosphere, horizontal_cloud_composition, mid_day_lighting, weather_photography, dynamic_cloud_shapes, stratified_cloud_layers, negative_space_composition

Information content in bits (y-axis) for H (top plot), S (middle), and V (bottom) of these color images. Note the scale and values of the y-axes are different for each plot. x-axis is still time in seconds. Wow what a rabbit hole!
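One way the per-channel numbers could be produced, sketched under assumptions: pixels arrive as (r, g, b) tuples of 0-255 integers, each HSV channel is quantized into 256 bins, and entropy is measured on the resulting histogram. The bin count and the function names are my own choices for illustration, not necessarily what generated the plots.

```python
import colorsys
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy (bits) of the empirical distribution of values."""
    counts = Counter(values)
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def hsv_entropies(rgb_pixels, bins=256):
    """Entropy of the H, S, and V channels of a list of (r, g, b) pixels."""
    channels = {"H": [], "S": [], "V": []}
    for r, g, b in rgb_pixels:
        hsv = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        for name, value in zip("HSV", hsv):
            # Quantize the [0, 1] channel value into an integer bin.
            channels[name].append(min(int(value * bins), bins - 1))
    return {name: entropy_bits(vals) for name, vals in channels.items()}


# Toy frame: half pure blue, half white (hypothetical data)
frame = [(0, 0, 255), (0, 0, 255), (255, 255, 255), (255, 255, 255)]
result = hsv_entropies(frame)
print(result)
```

In this toy frame, hue and saturation each take two values (1 bit apiece), while value is identical for blue and white (0 bits), which is the kind of difference between channels the three plots show.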

In this plot, there are the same number of images for each color (1/3 of the total). Blue is the top third of information content, red is the middle third, and black is the bottom third. This keeps the size of each section (in this case, the number of images) the same, which might be useful to someone carefully training or testing a foundation model.
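The equal-sized split described above can be sketched as a sort followed by slicing into thirds. This assumes one entropy value per frame, in frame order; the function name and the sample numbers are hypothetical.

```python
def split_into_terciles(entropies):
    """Group frame indices into top, middle, and bottom thirds by entropy."""
    # Frame indices ordered from highest to lowest information content.
    order = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    third = len(order) // 3
    return {
        "top": sorted(order[:third]),
        "middle": sorted(order[third:2 * third]),
        # If the frame count is not divisible by 3, the remainder lands here.
        "bottom": sorted(order[2 * third:]),
    }


# Hypothetical per-frame entropies in bits
entropies = [5.0, 1.0, 3.0, 6.0, 2.0, 4.0]
groups = split_into_terciles(entropies)
print(groups)
```

Splitting by rank rather than by fixed entropy thresholds is what keeps the three groups the same size.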

There are many ways to organize and extract this data for training and testing foundation models.

Monday, January 27, 2025

First Foundation Model Dataset Listed on Etsy

I wanted to get something out quickly, and mainly for the record.

I'd hoped to be able to list my curated Foundation Model Datasets (FMDs) on my own website, but that'll take longer than I want to wait to get this thing going. So instead of fiddling and fighting to get some kind of payment thing going on my own site, I've begun posting FMDs on my Etsy store. I can, in all honesty, call this artwork -- so it fits right in on a site like Etsy:

AstroArtPrints

My first set of offerings are going to be the transcripts of my ADL livestreams. I'm still the owner of this content (including video and transcripts) and can do with it as I please (ahem, so can YouTube). I'm still working on the price, but the initial price will be $7.77 for one dataset that usually contains about 10,000 total words and 10-20% unique words. LOL gotta find buyers who'll say 'No' so I can make the necessary price adjustments.

So in the case of the FMDs, I've copied the transcript from YT and edited it down to just the text without any other kind of markup. I then make a few measurements of the text and present that as a JSON formatted file that I use in the Description of the item for sale. The format of the JSON file is:

{
  "dataset_metadata": {
    "type": "text_transcript_analysis",
    "version": "1.0",
    "purpose": "Language Model Training Features"
  },
  "word_level_features": {
    "total_word_count": "integer",
    "unique_word_count": "integer",
    "average_word_length": "float",
    "vocabulary_diversity": {
      "type_token_ratio": "float"
    },
    "word_frequency_distribution": {
      "top_n_words": ["string"],
      "word_frequency_curve": "array_of_floats"
    }
  },
  "letter_level_features": {
    "total_letter_count": "integer",
    "case_distribution": {
      "uppercase_percentage": "float",
      "lowercase_percentage": "float"
    },
    "letter_frequency": {
      "absolute_counts": {"a-z": "integer"},
      "relative_frequencies": {"a-z": "float"}
    }
  },
  "complexity_metrics": {
    "word_length_distribution": "array_of_integers",
    "entropy_measures": {
      "word_entropy": "float",
      "letter_entropy": "float"
    }
  }
}
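For illustration, here is a rough sketch of how a few of the fields in that format could be measured from a plain-text transcript. It covers only a subset of the schema, and the tokenization rule (a word is a run of letters or apostrophes, compared case-insensitively) is an assumption of mine, as are the function names and the sample sentence.

```python
import json
import math
import re
from collections import Counter


def entropy_bits(counts):
    """Shannon entropy (bits) of a Counter of occurrences."""
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def text_features(text, top_n=3):
    """Measure a transcript; returns a dict shaped like part of the schema."""
    # Assumed tokenization: runs of letters/apostrophes, lowercased.
    words = re.findall(r"[a-z']+", text.lower())
    letters = [ch for ch in text if ch.isalpha()]
    if not words or not letters:
        raise ValueError("text contains no words")
    word_counts = Counter(words)
    letter_counts = Counter(ch.lower() for ch in letters)
    upper = sum(1 for ch in letters if ch.isupper())
    lower = sum(1 for ch in letters if ch.islower())
    return {
        "word_level_features": {
            "total_word_count": len(words),
            "unique_word_count": len(word_counts),
            "average_word_length": sum(len(w) for w in words) / len(words),
            "vocabulary_diversity": {
                "type_token_ratio": len(word_counts) / len(words),
            },
            "word_frequency_distribution": {
                "top_n_words": [w for w, _ in word_counts.most_common(top_n)],
            },
        },
        "letter_level_features": {
            "total_letter_count": len(letters),
            "case_distribution": {
                "uppercase_percentage": 100 * upper / len(letters),
                "lowercase_percentage": 100 * lower / len(letters),
            },
        },
        "complexity_metrics": {
            "entropy_measures": {
                "word_entropy": entropy_bits(word_counts),
                "letter_entropy": entropy_bits(letter_counts),
            },
        },
    }


# Hypothetical sample sentence, not from an actual transcript
features = text_features("The sky is blue the sky is wide")
print(json.dumps(features, indent=2))
```

The type-token ratio is simply unique words divided by total words, so the "10-20% unique words" figure quoted for a typical dataset corresponds to a ratio of 0.10 to 0.20.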

At first, these datasets will be best used in the supervised (using any of the above metadata as labels) and unsupervised training of models such as Large Language Models.

Saturday, January 25, 2025

Labeled Data Is All We Need

(I realize that this has very little to directly do with astronomy or music, but I wanted to post this so there's a record. I should also mention that the text below was generated by claude.ai, with substantial prompting from myself. It reflects my sentiments very well.)

Labeled Data Is All We Need

Preface: Looking Forward – Short-Term

In the coming days, I will establish a website where Foundation Models (FMs) can purchase and utilize data I am personally recording and curating. The labeled datasets will match the quality and format of those available from established sources like Hugging Face and AWS. Upon purchase, buyers will receive non-exclusive rights to use the data for training and testing purposes. New data will be continuously added to support ongoing Foundation Model development. Pricing details are currently to be determined.

While selling datasets to Foundation Models is not a novel concept, I observe a lack of individual contributors and a scarcity of unique, high-quality data. Though I often find myself ahead of technological trends, this particular initiative seems somewhat more aligned with current market movements.

I've long recognized my tendency to conceptualize ideas well before their widespread adoption. Typically, my innovative thinking outpaces my resources to fully develop these concepts. In this instance, however, I feel more synchronized with emerging industry trends. Comments from thought leaders like Ilya Sutskever and the recent Stargate Project announcement have validated my approach.

The Stargate Project's proposed $500 billion budget (approximately $342 million daily when spread over the announced four years) intrigued me. I contemplated their potential expenditures: power infrastructure (potentially nuclear), computational resources for model training and prediction generation, and data storage. The critical question emerged: Why massive storage if, as Ilya suggests, no new data exists?

The answer became clear: in the same talk in which Ilya said that we were out of data, he also said that "data is the fossil fuel of AI". I would put it another way, but that's the basic idea. Foundation Models require substantial new labeled datasets from high-quality sources. My work directly addresses this emerging market need.

This whole idea has been compelling me for about a year. It crystallized after hearing the Stargate Project announcement, making me realize the potential market for carefully curated, high-quality data.

My professional focus will now be collecting and curating labeled data for Foundation Models.

Other major work that I'm doing -- astronomical research, and advanced robotics -- will continue. No stopping either of these from happening.