On the brink of infinite data

How institutions manage digital asset challenges

Amanda Moore and James Cole presenting at DAS AMIA 2019.

Perhaps the most intriguing finding I took away from this year’s Digital Asset Symposium (DAS) is the concept of DNA storage for “infinite” data archives, discussed in the show’s first session.

DAS 2019 was held June 5th at the Museum of Modern Art in New York. This annual conference is sponsored by AMIA (Association of Moving Image Archivists), of which LAC Group is a Gold Sponsor.

The dawn of “infinite data archives”

Hyunjun Park, co-founder and CEO of CatalogDNA, began the day’s events with a fascinating presentation, DNA-Based Storage and Computation of Digital Data, about the latest developments in the company’s work using synthetic DNA as a data storage medium. The company is a venture-funded startup that is advancing the technology, but it is not yet ready for prime time. They have proven the concept; now they are working on scaling the technology into a commercially viable platform.

It seems crazy, but the potential is enormous (indeed, effectively infinite), and that is the kind of storage capacity needed in this “zettabyte era” of digital data. CatalogDNA states on its website that the world will generate 160 zettabytes of data by 2025. For those wondering how big a zettabyte is: 1,000,000,000,000,000,000,000 bytes.

With synthetic DNA, an exabyte of storage—a step down from a zettabyte at 1,000,000,000,000,000,000 bytes—could fit on a medium the size of a sugar cube!

The idea of storing information using DNA is promising, but it has been constrained by the high cost. According to The Economist, “Encoding a single gigabyte in DNA would run up a bill of several million dollars.”

CatalogDNA hopes to bring it down to under $10 per gigabyte, which is still cost-prohibitive for many circumstances, but at least in the realm of possibility.
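
To put those figures in perspective, here is a rough back-of-the-envelope sketch. It assumes decimal (SI) units, where a gigabyte is 1,000,000,000 bytes, and it uses only the $10-per-gigabyte target and the 160-zettabyte forecast quoted above. The function name is my own, purely for illustration.

```python
# Decimal (SI) byte units, matching the figures quoted in the text.
GIGABYTE = 10**9
EXABYTE = 10**18
ZETTABYTE = 10**21


def storage_cost_usd(num_bytes: int, usd_per_gigabyte: int) -> int:
    """Cost in whole dollars of storing num_bytes at a flat per-gigabyte price."""
    return (num_bytes // GIGABYTE) * usd_per_gigabyte


# One exabyte (the "sugar cube") at the hoped-for $10 per gigabyte.
exabyte_cost = storage_cost_usd(EXABYTE, 10)

# The 160 zettabytes forecast for 2025 at the same price.
forecast_cost = storage_cost_usd(160 * ZETTABYTE, 10)

print(f"1 EB at $10/GB:   ${exabyte_cost:,}")
print(f"160 ZB at $10/GB: ${forecast_cost:,}")
```

Even at the target price, a single exabyte works out to about $10 billion, which helps explain why $10 per gigabyte is still out of reach for many uses.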

Ultimately, I think the vision is a deep storage archive in the Arctic filled with tube-like containers that provide high-density storage of digitized content for posterity. This is a fascinating approach to the digital storage problem, and it could provide an answer to storage on a massive scale. But it also has me dwelling on the notion of storage en masse. In parallel, shouldn’t some consideration be given to the underlying question of what is archived and why? Should we save everything, and if so, why? Is there a place for data curation somewhere in this process as well?

Managing oceans of media

The second session of the day covered Amazon Studios’ approach to managing its “oceans” of media and media data. Titled “Fishing in your Data Lake: You Never Know What You’re Going to Catch,” it was presented by Dave Ginsberg and Callum Hughes from Amazon Studios, who opened with a funny anecdote about a vendor partner who said years ago, “This cloud thing, it’s just a fad.”

Ginsberg and Hughes went on to describe the big challenges they face as a global company managing enormous volumes of metadata, in many languages, coming from their internal and external production partners. In many cases, English is not a given, either in the production deliverables or on the distribution side. They have to support many different languages with dubbing, subtitles, or both.

They described their reliance on AI tools such as Amazon Rekognition, which can be trained for tagging and other metadata requirements, like their deliverable requirement for metadata headers within files. Deliverables from production partners come back to them on Snowball devices, Amazon’s transport solution for securely moving large amounts of data into and out of the AWS Cloud at a lower cost than transferring it over high-speed internet.

It was refreshing to hear about Dave and Callum’s approach and their pragmatic methods. They focused on the core problem and the core solution, always coming at their problems from the user perspective and working backward. They also admitted that they are never really finished, always refining and being iterative to address evolving needs.

Dave Ginsberg presenting at DAS AMIA 2019.

Linked and open data approach to historical archives

One unfortunate side effect of infinite data is the amount of misinformation that is created and shared with lightning speed and reach. The session Tyranny in Triples was about the importance of reliable data and sources, and of making them accessible to keep memory alive so that destructive aspects of history, like wars, are not repeated. Session presenters were Tom De Smet and Lizzy Jongma from the Dutch Network for War Collections WWII.

Tom De Smet presenting at DAS AMIA 2019.

Jongma is a data evangelist, deep in the weeds on the work and the processes. She spoke at length about different challenges and successes and showed lots of war photos and footage, much of it chilling imagery, like Nazi forces gathering Jews onto trains. While used primarily by historians and researchers, this deep well of information is important to all societies and civilizations.

Their goal is to make all of their data linked and open, which supports sharing and increases the rate at which new and refined information flows back, adding to the overall dataset.

The idea of open and linked data is based on:

  • Agile development that is smart, connected and open to be more flexible.
  • LOCKSS, meaning lots of copies keep stuff safe.

GLAMs (galleries, libraries, archives and museums) have turned to open linked data because it offers a way to share and pool data in order to grow and connect records and datasets in new ways.
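
The “triples” in the session title refer to the subject, predicate, object statements that linked data is built from. As a minimal sketch of why that helps with pooling, two institutions can describe the same item independently and their statements combine with a plain set union. This is not the network’s actual data model, and every identifier and value below is hypothetical.

```python
# Each statement is a (subject, predicate, object) triple.
# All identifiers and values are made up for illustration only.
museum_triples = {
    ("ex:photo/123", "ex:takenIn", "ex:place/Amsterdam"),
    ("ex:photo/123", "ex:heldBy", "ex:org/MuseumA"),
}
archive_triples = {
    ("ex:photo/123", "ex:takenIn", "ex:place/Amsterdam"),  # same fact, stated twice
    ("ex:photo/123", "ex:dateCreated", "1943"),
    ("ex:place/Amsterdam", "ex:locatedIn", "ex:country/Netherlands"),
}

# Because both sources share identifiers, pooling is a set union;
# the duplicate statement collapses automatically.
pooled = museum_triples | archive_triples


def describe(subject, triples):
    """Return every (predicate, object) pair known about a subject, sorted."""
    return sorted((p, o) for s, p, o in triples if s == subject)


for predicate, obj in describe("ex:photo/123", pooled):
    print(predicate, obj)
```

The pooled set knows more about the photo than either source did alone, which is the “grow and connect” effect in miniature.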

Demystifying digital asset acronyms and concepts

Two of the sessions after the lunch break were about shedding light on technical terms and concepts. One was Acronym Bingo, presented by John Footen from Deloitte’s media and entertainment group. He led us in a game of bingo using special cards printed with tech acronyms; when he called one, he asked someone whose card included it to offer a definition. As proof that even people within the industry aren’t fully familiar with the acronyms they use every day, some opted for alternate questions, which were usually silly and funny. It was a light and entertaining way to learn acronyms and definitions.

In a different session, Demystifying Language Metadata, Yonah Levenson from HBO presented her efforts across the industry to establish LMT (Language Metadata Table) as a standard for supporting language identifiers in media. Speaking of acronyms, she discussed the creation of IETF BCP 47, which is an industry standard to identify human languages by standardizing terminology, naming, rules and other details to support exchange and distribution.

Yonah Levenson presenting at DAS AMIA 2019.

This effort is driven by the need to have an industry-wide standard for languages.

Organizations like the Linguistic Society of America state that there is no definitive count of how many languages there are in the world, but managing the volume and variance of spoken and written languages is a metadata challenge.

Specific to the media world, standardized language terminology improves workflow and supports efforts to easily exchange, manage and distribute localized international versions from major content creators. This can get especially tricky with languages such as Chinese (e.g., Mandarin vs. Cantonese, audio vs. written text).
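
To make the Chinese example concrete: BCP 47 tags are built from hyphen-separated subtags, typically a language, an optional script and an optional region. The sketch below is a deliberately simplified reader of such tags. It ignores extended-language, variant, extension and private-use subtags, and it does not check subtags against the IANA registry. The function name and the sample tags are my own illustration, not taken from the LMT.

```python
import re

# Simplified BCP 47 shape: language[-Script][-REGION].
# Real tags allow more subtags; this pattern is an illustrative subset.
TAG_PATTERN = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def split_language_tag(tag: str) -> dict:
    """Split a simple language tag into language, script and region subtags."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a simple language-script-region tag: {tag!r}")
    parts = match.groupdict()
    # Conventional casing: language lower, Script title case, REGION upper.
    return {
        "language": parts["language"].lower(),
        "script": parts["script"].title() if parts["script"] else None,
        "region": parts["region"].upper() if parts["region"] else None,
    }


# Spoken Mandarin vs. spoken Cantonese, and written Chinese in two scripts.
for tag in ["cmn", "yue-HK", "zh-Hans", "zh-Hant-TW"]:
    print(tag, split_language_tag(tag))
```

Keeping the spoken language (cmn, yue) separate from the written script (Hans, Hant) is the sort of distinction that lets an audio track and a subtitle file carry different, unambiguous tags.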

I know, it’s only rock and roll

Fans of Austin City Limits (ACL), shown on many PBS stations, would have loved the session presented by Amanda Moore and James Cole of KLRU TV in Austin, Texas. Their session, Miles and Miles of Texas: Preserving Millions of Feet of Tape from Austin City Limits, was a great presentation on their project to digitize nearly 45 years of ACL videotape in the variety of formats used over the decades.

The duo played lots of video clips and also gave a deep-dive demo of their quality control process, with examples of the video artifacts they encountered (beyond the “usual suspects”) and how they worked to mitigate issues as much as possible.

Registering with EIDR, the Entertainment Identifier Registry Association that provides an industry standard universal identifier registry for effective distribution and monetization of assets, revealed gaps in their dataset that needed filling and fixing.

Austin City Limits operates with a mix of cloud and on-premises storage. Their goal is to create the best quality masters for preserving all assets, working with an offsite digitization vendor. Though unsaid, the project was probably a part of the American Archive collaborative initiative involving Public Broadcasting (PBS) and the Library of Congress (LoC).

Preservation copies are made for LoC and also for themselves, along with an intermediate access file for day-to-day needs. Materials are also provided to the Rock and Roll Hall of Fame. In addition to media, the Austin City Limits collection also includes physical objects like equipment, wardrobes, awards, contracts, tickets, programs and promotional materials.

Thank you AMIA

I personally found DAS 2019 to be another outstanding event sponsored by AMIA. All of us at LAC Group are pleased to support this organization’s work in promoting the need and the mechanisms for preserving high-value content and archives. Please contact me if you have any questions about the sessions I attended, or about media asset management, media storage and preservation.

More information on the DAS 2019 program

Phil Spiegel


Phil Spiegel is Vice President of Corporate Client Engagement at LAC Group. Phil delivers insights and advice based on more than 20 years of media archive and asset management experience gained from companies like National Geographic Television, Corbis Motion, Image Bank and Getty Images.
